---
layout: default
title: Datasets
nav_order: 4
---

# Dataset importers

Dataset importers can be used in the `datasets` sections of the training config.

Example:

```yaml
  train:
    - opus_ada83/v1
    - mtdata_newstest2014_ruen
```
| Data source | Prefix | Name examples | Type | Comments |
|---|---|---|---|---|
| MTData | `mtdata` | `newstest2017_ruen` | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair. |
| OPUS | `opus` | `ParaCrawl/v7.1` | corpus | Many open-source datasets. Go to the website, choose a language pair, and check the links under the Moses column to see which name and version a link uses. |
| SacreBLEU | `sacrebleu` | `wmt20` | corpus | Official evaluation datasets available in the SacreBLEU tool. Recommended for the `datasets: test` config section. Look up supported datasets and language pairs in the `sacrebleu.dataset` Python module. |
| Flores | `flores` | `dev`, `devtest` | corpus | Evaluation dataset from Facebook that supports 100 languages. |
| Custom parallel | `url` | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst` | corpus | A custom zst-compressed parallel dataset, for instance uploaded to GCS. The language pair should be split into two files; `[LANG]` will be replaced with the source and target language codes. |
| Paracrawl | `paracrawl-mono` | `paracrawl8` | mono | Datasets crawled from the web. Only mono datasets are used in this importer; the parallel corpus is available through the `opus` importer. |
| News crawl | `news-crawl` | `news.2019` | mono | Monolingual news datasets from WMT21. |
| Common crawl | `commoncrawl` | `wmt16` | mono | Huge web-crawl datasets. The links are posted on WMT21. |
| Custom mono | `url` | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst` | mono | A custom zst-compressed monolingual dataset, for instance uploaded to GCS. |
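As an illustration, several importers from the table can be combined in one config. This sketch assumes the `<prefix>_<name>` convention shown in the example above; the specific datasets are taken from the table's examples:

```yaml
datasets:
  train:
    - opus_ParaCrawl/v7.1
    - mtdata_newstest2017_ruen
  test:
    - sacrebleu_wmt20
    - flores_devtest
```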

You can also use the `find-corpus` tool to find all datasets for an importer, formatted for use in the config.

Set up a local Poetry environment, then run:

```shell
task find-corpus -- en ru
```

Make sure to check the licenses of the datasets before using them.

## Adding a new importer

Just add a shell script to `corpus` or `mono` named `<prefix>.sh` that accepts the same parameters as the other scripts in the same folder.
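As a minimal sketch, a new importer could look like the following. The `myhub` prefix, the download URL, and the exact parameter order are all assumptions for illustration; check the existing scripts in the folder for the real interface. The sketch only prints the downloads it would perform instead of fetching anything:

```shell
#!/bin/bash
# Hypothetical importer corpus/myhub.sh (name and URL are illustrative).
set -euo pipefail

# Assumed positional parameters, with defaults so the sketch runs standalone:
# dataset name, source language, target language, output prefix.
dataset=${1:-mydata}
src=${2:-en}
trg=${3:-ru}
output_prefix=${4:-artifacts/mydata}

for lang in "$src" "$trg"; do
  # A real importer would fetch each side and store it zst-compressed,
  # matching the <output_prefix>.<lang>.zst convention used above.
  url="https://example.com/${dataset}.${lang}.zst"
  echo "would fetch ${url} -> ${output_prefix}.${lang}.zst"
done
```

A real script would replace the `echo` with the actual download and recompression steps for its data source.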