Zindi and AI4D build language datasets for African NLP
When AI researcher and language lover Dr Amelia Taylor first moved to Malawi, she soon began learning Chichewa, Malawi’s indigenous official language. She found that there were few resources for teaching Chichewa that covered pronunciation and grammar, or made use of sound and video.
“I’m not a Chichewa speaker,” Taylor explains. “So one of the first things I did when I moved to Malawi was to learn about the language and the culture. As my husband and I were learning the language, we started to develop a course based on our learning experience.”
As part of tNyasa Ltd, she and her husband developed a Chichewa course together with Paul, their Chichewa teacher.
Starting with lessons familiarising the student with the pronunciation of the sounds of the language and basic…
So when the AI4D Africa Language Challenge was launched on Zindi at the end of 2019, Taylor and her colleagues were excited to make a submission.
Looking for African NLP data
The challenge called on natural language processing (NLP) researchers and data scientists across Africa to develop and submit datasets for underserved or underrepresented African languages, for use in future NLP applications like automated translation or speech recognition.
“We wanted to see what African NLP practitioners could do if we empowered them with the right resources,” says Kathleen Siminyu, Regional Network Coordinator for AI for Development (AI4D) Africa. “We planned to host a traditional NLP challenge on Zindi, but we realised that access to good African language datasets was a hurdle we could not ignore.”
Funding to create great NLP datasets
Between November 2019 and March 2020, Zindi received more than 40 submissions to the challenge, representing more than 30 African languages. Ten winners were selected (two per month), and AI4D has now chosen five of these to be funded as AI4D Dataset Creation Fellows. The fellows are:
- Amelia Taylor (Chichewa — spoken in Malawi and Zambia)
- Kevin Degila (Fongbe — spoken in Benin, Nigeria and Togo)
- David Adelani (Yoruba — spoken in Nigeria)
- Thierno Diop (Wolof — spoken in Senegal, Gambia and Mauritania)
- Hatem Haddad (Tunisian Arabizi — Arabic dialect spoken in Tunisia)
“The heart of this project is about creating a more inclusive AI sector. We are thrilled to host this challenge on Zindi because it means offering everyone, irrespective of their background, the chance to learn about and contribute to this growing field. This challenge allowed us to showcase the work of people across the continent that are capable of making amazing contributions, now or in the future, but may not have seen themselves as ‘qualified’ NLP researchers,” says Celina Lee, CEO of Zindi.
Next steps on Zindi and beyond
Over the next three months, the fellows will get financial and other support to develop their datasets further. They will focus on expanding data sources, ensuring that the datasets are developed with their final purpose in mind, and making sure that the data is sourced in a legal, ethical, and unbiased way. The fellows will have support from partners Knowledge 4 All and CIPIT at Strathmore University.
At the same time, Zindi is pleased to be hosting a second African Language Challenge in partnership with AI4D and GIZ, where NLP practitioners from South Africa, Ghana and Uganda can submit their datasets for a chance to be included in this fellowship programme. The competition closes on 3 August.
Meanwhile, the AI4D fellows and their teams will build out their datasets and work towards publishing them in October 2020, along with an academic article detailing their work. Once published, these eight datasets will be put to work as NLP challenges on Zindi, and eventually help ensure that African languages are adequately represented in digital spaces. This work will also inform UNESCO AI policy and be included in UNESCO’s World Atlas of Languages platform.
“Working in AI on the African continent, there is a lot of high-level optimism, which is great,” says Siminyu. “But it’s the groundwork that matters. Getting this right will mean wide exploitation, which will drive more funding and more support. This kind of work makes that optimism actually mean something.”