GitHub - dadelani/sib-200: SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

This repository contains the annotated English dataset, the script that extends the annotation to other languages, and the code to run baseline text classification models.
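Because the sentences are parallel across languages, an English topic label can be carried over to the sentence with the same index in another language. The sketch below illustrates that idea only. It is not the code in `create_sib_data.py`, and the function name `project_labels`, the input shapes, and the output field names `index_id`, `category`, and `text` are assumptions made for illustration.

```python
def project_labels(english_labels, target_sentences):
    """Attach English topic labels to index-aligned target-language sentences.

    english_labels: dict mapping sentence index -> topic category (assumed shape)
    target_sentences: list where position i holds the translation of sentence i
    """
    rows = []
    for index, category in sorted(english_labels.items()):
        # Keep only annotated sentences that exist in the target language.
        if 0 <= index < len(target_sentences):
            rows.append(
                {
                    "index_id": index,
                    "category": category,
                    "text": target_sentences[index],
                }
            )
    return rows
```

Sentences without an English label are skipped, so the output for each language contains exactly the annotated subset.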

Required dependencies

- python
  - transformers: state-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch.
  - sklearn
  - evaluate
  - datasets
  - pandas

pip install -r code/requirements.txt

Create SIB dataset

sh get_flores_and_annotate.sh

or

Download it from the Hugging Face Hub dataset: Davlan/sib200
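As a minimal sketch of the download route, the snippet below loads one language of `Davlan/sib200` with the `datasets` library and counts examples per topic. The configuration name `eng_Latn` (a FLORES-200 language code) and the label column `category` are assumptions here; check the dataset card if they differ. The helper names `load_sib200` and `label_counts` are hypothetical.

```python
from collections import Counter


def load_sib200(language="eng_Latn"):
    """Load one language configuration of SIB-200 from the Hugging Face Hub.

    Assumes the configuration is named after its FLORES-200 code.
    Needs network access and `pip install datasets`.
    """
    from datasets import load_dataset  # imported lazily so the helper below works offline

    return load_dataset("Davlan/sib200", language)


def label_counts(rows, label_column="category"):
    """Count how many examples carry each topic label.

    `rows` is any iterable of dict-like examples, such as one dataset split.
    The column name "category" is an assumption.
    """
    return Counter(row[label_column] for row in rows)
```

For example, `label_counts(load_sib200("eng_Latn")["train"])` would return the label distribution of the English training split, assuming a split named `train` exists.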

Run our baseline model using XLM-R

cd code/
sh xlmr_all.sh

BibTeX entry and citation info

@misc{adelani2023sib200,
      title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects}, 
      author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
      year={2023},
      eprint={2309.07445},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
