SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
This repository contains the annotated English dataset, a script to extend the annotations to other languages, and code to run baseline text classification models.
- python
- transformers: state-of-the-art natural language processing for TensorFlow 2.0 and PyTorch
- sklearn
- evaluate
- datasets
- pandas
pip install -r code/requirements.txt
sh get_flores_and_annotate.sh
Alternatively, download the dataset from the Hugging Face Hub: Davlan/sib200
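For the Hugging Face route, loading can be sketched in Python with the `datasets` package from the requirements list. This is a minimal sketch, not code from this repository: the configuration name `eng_Latn` (a FLORES-200 style language code), the `train` split, and the `category` column are assumptions about how Davlan/sib200 is laid out on the Hub, and the helper names `load_sib200` and `label_distribution` are hypothetical.

```python
from collections import Counter


def load_sib200(language="eng_Latn"):
    """Load one language configuration of Davlan/sib200 from the Hugging Face Hub.

    The configuration name is an assumption (FLORES-200 style code).
    """
    # Imported lazily so label_distribution works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("Davlan/sib200", language)


def label_distribution(rows, label_column="category"):
    """Count examples per topic label in an iterable of dict-like rows.

    `label_column` is an assumed column name; adjust it if the dataset differs.
    """
    return Counter(row[label_column] for row in rows)


# Usage (needs network access and the `datasets` package):
#   dataset = load_sib200("eng_Latn")
#   print(dataset)                                  # shows the available splits
#   print(label_distribution(dataset["train"]))     # examples per topic
```

Checking the label distribution first is a cheap way to confirm that the split and column names match what the sketch assumes.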
cd code/
sh xlmr_all.sh
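The contents of `xlmr_all.sh` are not reproduced here. As a rough illustration of what a topic-classification baseline involves, the sketch below swaps in a much simpler technique, character n-gram TF-IDF features with logistic regression from scikit-learn, rather than the XLM-R fine-tuning the script name suggests. The sentences and label names are toy placeholders made up for this example, not SIB-200 data, and `train_topic_baseline` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline


def train_topic_baseline(train_texts, train_labels):
    """Fit a character n-gram TF-IDF + logistic regression classifier."""
    model = make_pipeline(
        # Character n-grams avoid relying on a language-specific tokenizer.
        TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train_texts, train_labels)
    return model


# Toy placeholder data (NOT taken from SIB-200), only to make the sketch runnable.
train_texts = [
    "The team won the championship match last night.",
    "The striker scored two goals in the final.",
    "Doctors recommend regular exercise for a healthy heart.",
    "The new vaccine reduces the risk of infection.",
]
train_labels = ["sports", "sports", "health", "health"]

test_texts = [
    "The goalkeeper saved a penalty in the match.",
    "The clinic offers treatment for heart disease.",
]
test_labels = ["sports", "health"]

model = train_topic_baseline(train_texts, train_labels)
predictions = model.predict(test_texts)
accuracy = accuracy_score(test_labels, predictions)
print("predictions:", list(predictions))
print("accuracy:", accuracy)
```

With real data, the toy lists would be replaced by the text and label columns of one language's train and test splits.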
@misc{adelani2023sib200,
title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects},
author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
year={2023},
eprint={2309.07445},
archivePrefix={arXiv},
primaryClass={cs.CL}
}