---
layout: default
title: Datasets
nav_order: 4
---
Dataset importers can be used in the `datasets` section of the training config.

Example:

```
train:
  - opus_ada83/v1
  - mtdata_newstest2014_ruen
```
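Each entry follows a `<prefix>_<name>` pattern, where the prefix selects an importer from the table below and the name identifies the dataset. As a hedged sketch (assuming, as the table recommends for SacreBLEU, that evaluation sets go into a `test` section alongside `train`), the same pattern applies there:

```
train:
  - opus_ParaCrawl/v7.1       # OPUS importer, dataset ParaCrawl/v7.1
  - mtdata_newstest2017_ruen  # MTData importer
test:
  - sacrebleu_wmt20           # SacreBLEU evaluation set
  - flores_devtest            # Flores evaluation set
```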
| Data source | Prefix | Name examples | Type | Comments |
|---|---|---|---|---|
| MTData | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair. |
| OPUS | opus | ParaCrawl/v7.1 | corpus | Many open-source datasets. Go to the website, choose a language pair, and check the links in the Moses column to see which name and version are used in a link. |
| SacreBLEU | sacrebleu | wmt20 | corpus | Official evaluation datasets available in the SacreBLEU tool. Recommended for the `datasets: test` config section. Look up supported datasets and language pairs in the `sacrebleu.dataset` Python module. |
| Flores | flores | dev, devtest | corpus | Evaluation dataset from Facebook that supports 100 languages. |
| Custom parallel | url | https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst | corpus | A custom zst-compressed parallel dataset, for example uploaded to GCS. Each language should be in a separate file; `[LANG]` is replaced with the source and target language codes. |
| Paracrawl | paracrawl-mono | paracrawl8 | mono | Datasets crawled from the web. Only mono datasets are used in this importer; the parallel corpus is available through the `opus` importer. |
| News crawl | news-crawl | news.2019 | mono | Monolingual news datasets from WMT21. |
| Common crawl | commoncrawl | wmt16 | mono | Huge web-crawl datasets. The links are posted on WMT21. |
| Custom mono | url | https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst | mono | A custom zst-compressed monolingual dataset, for example uploaded to GCS. |
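For the custom `url` importers, the dataset name is the download link itself, so an entry presumably looks like the sketch below, reusing the example URLs from the table. The `mono-src` section name is an assumption here; check your config schema for the actual monolingual section names.

```
train:
  # parallel: [LANG] is replaced with each language code at download time
  - url_https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst
mono-src:
  # monolingual: a single-language file, no [LANG] placeholder
  - url_https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst
```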
You can also use the `find-corpus` tool to find all datasets for an importer and get them formatted for use in the config.

Set up a local poetry environment, then run:

```
task find-corpus -- en ru
```
Make sure to check licenses of the datasets before using them.
To add a new importer, just add a shell script named `<prefix>.sh` to the `corpus` or `mono` folder that accepts the same parameters as the other scripts in that folder (see the sketch below).
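For illustration only, here is a minimal sketch of what such a script could look like. The argument order and names (source/target languages, dataset name, output prefix) and the output layout are assumptions, so copy the interface from an existing script such as `opus.sh` rather than from this example.

```
#!/bin/bash
# <prefix>.sh - hypothetical corpus importer sketch; the real argument list
# and output layout must match the other scripts in the same folder.
set -euo pipefail

src=$1            # source language code, e.g. en (assumed)
trg=$2            # target language code, e.g. ru (assumed)
dataset=$3        # dataset name, i.e. the part after "<prefix>_" in the config
output_prefix=$4  # assumed output location for <output_prefix>.<lang>.zst files

# Download both sides of the corpus from a hypothetical source and
# compress them with zstd, one file per language.
for lang in "$src" "$trg"; do
  wget -O - "https://example.com/${dataset}.${lang}.txt" \
    | zstd -c > "${output_prefix}.${lang}.zst"
done
```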