List of all the resources we developed in collaboration with LSV and Masakhane during my doctoral studies and beyond:
- African News corpus: Please cite our MAFT paper if you use it.
- AfroMAFT Corpus: Language adaptation corpus for 17 African languages, English, French and Arabic. Please cite the MAFAND paper if you use it. We used this corpus to train all the multilingual PLMs listed below.
The models below were created by applying multilingual adaptive fine-tuning (MAFT) to a distilled XLM-R model, XLM-R, mT5, ByT5 and mBART. For each model we list its name, its size (in millions of parameters), and its architecture. The models cover the following 20 languages: afr, amh, ara, eng, fra, hau, ibo, mlg, nya, orm, pcm, kin, run, sna, som, sot, swa, xho, yor, zul.
Model | Size (parameters) | Architecture |
---|---|---|
AfroXLMR-mini | 117M | Masked LM |
AfroXLMR-small | 140M | Masked LM |
AfroXLMR-base | 270M | Masked LM |
AfroXLMR-large | 550M | Masked LM |
AfriMT5 | 580M | Seq-to-Seq |
AfriByT5 | 580M | Seq-to-Seq |
AfriMBART | 610M | Seq-to-Seq |
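The architecture column determines which model class is needed to load a checkpoint. The following is a minimal sketch, assuming the Hugging Face `transformers` library is used; the helper names `auto_class_name` and `load` are hypothetical, and the checkpoint identifier is left as a parameter because this list does not state where each model is hosted.

```python
# Architecture column of the table above, keyed by model name.
ARCHITECTURE = {
    "AfroXLMR-mini": "Masked LM",
    "AfroXLMR-small": "Masked LM",
    "AfroXLMR-base": "Masked LM",
    "AfroXLMR-large": "Masked LM",
    "AfriMT5": "Seq-to-Seq",
    "AfriByT5": "Seq-to-Seq",
    "AfriMBART": "Seq-to-Seq",
}

# transformers Auto class that matches each architecture label.
AUTO_CLASS = {
    "Masked LM": "AutoModelForMaskedLM",
    "Seq-to-Seq": "AutoModelForSeq2SeqLM",
}


def auto_class_name(model_name):
    """Return the name of the Auto class suited to a model in the table."""
    return AUTO_CLASS[ARCHITECTURE[model_name]]


def load(model_name, checkpoint):
    """Load tokenizer and model.

    `checkpoint` is a Hub identifier or local directory that you supply;
    it is an assumption of this sketch, not something given in the table.
    """
    import transformers  # imported lazily so the lookup above works without it

    tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)
    model_cls = getattr(transformers, auto_class_name(model_name))
    return tokenizer, model_cls.from_pretrained(checkpoint)
```

Keeping the lookup separate from the loading step means the same table can drive either masked-token prediction or text generation without hard-coding a class per model.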
The following PLMs were each created by adapting a model to a single language using a monolingual corpus in that language. The monolingual corpora used to create them are described in the MasakhaNER paper and the MAFT paper.
We provide word embeddings of better quality than the pre-trained FastText embeddings trained on Common Crawl and Wikipedia. Although we did not evaluate quality on every language, our evaluation on Yoruba and Twi shows that they perform better on word similarity tasks. Our FastText embeddings are trained on curated data from JW300, the Bible, VOA, BBC, and other news websites. Details of the data sources are in my PhD dissertation.
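Word similarity tasks are commonly scored by correlating the cosine similarity of two word vectors with human similarity ratings. The sketch below illustrates that general procedure with Spearman rank correlation on toy data; it is not necessarily the exact protocol used for the Yoruba and Twi evaluation, and the vectors, word labels, and ratings are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, pairs):
    """Score embeddings on (word1, word2, human_rating) triples.

    Pairs containing a word without a vector are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores)


# Toy data (invented): four 2-d vectors and three rated pairs.
toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
toy_pairs = [("a", "b", 9.0), ("a", "c", 4.0), ("a", "d", 1.0)]
score = evaluate(toy_vectors, toy_pairs)  # close to 1.0: rankings agree
```

A higher correlation means the embedding space orders word pairs more like human annotators do, which is the sense in which one set of embeddings can be said to outperform another on this task.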
We trained the FastText embeddings using Gensim 3.8.1. All embedding models can be downloaded from Zenodo; the links are listed below.
Language | Link to Model |
---|---|
amh | Amharic FastText |
bam | Bambara FastText |
bbj | Ghomala FastText |
ewe | Ewe FastText |
fon | Fon FastText |
hau | Hausa FastText |
ibo | Igbo FastText |
kin | Kinyarwanda FastText |
lug | Luganda FastText |
luo | Luo FastText |
mos | Mossi FastText |
nya | Chichewa FastText |
pcm | Nigerian-Pidgin FastText |
sna | Shona FastText |
swa | Swahili FastText |
tsn | Setswana FastText |
twi | Twi FastText |
wol | Wolof FastText |
xho | Xhosa FastText |
yor | Yoruba FastText |
zul | Zulu FastText |
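Since the embeddings were trained with Gensim, a downloaded model can probably be opened with Gensim as well. The following is a minimal sketch, assuming the Zenodo archive contains a model written with Gensim's native `save()`; the file format is not stated here, so check the archive first. The helper names and the file name in the usage comment are hypothetical.

```python
def load_fasttext(path):
    """Load a FastText model that was saved with Gensim's native save()."""
    from gensim.models import FastText  # imported lazily; requires Gensim

    return FastText.load(path)


def neighbours(model, word, topn=5):
    """Return the `topn` nearest neighbours of `word` as (word, similarity) pairs."""
    return model.wv.most_similar(word, topn=topn)


# Usage (hypothetical file name, after downloading and unpacking from Zenodo):
# model = load_fasttext("yoruba_fasttext.model")
# print(neighbours(model, "ilé"))
```

If the archive holds a model in Facebook's binary format instead, `gensim.models.fasttext.load_facebook_model` is the loader to try. Loading with the same Gensim version used for training (3.8.1) avoids possible incompatibilities between releases.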