The corpus is available as a Hugging Face dataset: https://huggingface.co/datasets/olpa/jbo-corpus.
Usage (requires the datasets library):
>>> import datasets
>>> ds = datasets.load_dataset(path='olpa/jbo-corpus')
>>> print(ds)
DatasetDict({
    train: Dataset({
        features: ['id', 'jb', 'jb_tok', 'en', 'en_tok', 'source'],
        num_rows: 8844
    })
    test: Dataset({
        features: ['id', 'jb', 'jb_tok', 'en', 'en_tok', 'source'],
        num_rows: 2688
    })
    validation: Dataset({
        features: ['id', 'jb', 'jb_tok', 'en', 'en_tok', 'source'],
        num_rows: 2681
    })
})
>>> ds['train'][124]
{
    'id': 'Conlang:72',
    'jb': "la batman jo'u la robin se steci lo ka balpre bu'u la gotam",
    'jb_tok': "la batci## manku jo'u la ro bi n se steci lo ka banli## prenu bu'u la go ta m",
    'en': 'Batman and Robin are the only heros in Gotham.',
    'en_tok': 'batman and robin are the only hero ##s in gotham .',
    'source': 'conlang'
}
The fields jb_tok and en_tok are the tokenized versions of jb and en, respectively. The Lojban text is tokenized with a Lojban tokenizer; the English text with the "bert-base-uncased" tokenizer from Hugging Face Transformers.
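For reference, the English tokenization can be reproduced with the transformers library. The snippet below is a sketch assuming the pretrained tokenizer's default settings; the expected token list is taken from the en_tok value of the example record above:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> tokenizer.tokenize('Batman and Robin are the only heros in Gotham.')
['batman', 'and', 'robin', 'are', 'the', 'only', 'hero', '##s', 'in', 'gotham', '.']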
To learn how to work with the datasets library, see the Hugging Face course:
- Course -> 3. Fine-tuning a pretrained model -> Processing the data
- Course -> 5. The datasets library -> Time to slice and dice
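As a quick illustration of slicing and filtering (a sketch only; the field names and the 'conlang' value come from the example record above):
>>> # keep only the rows whose source field is 'conlang'
>>> conlang_rows = ds['train'].filter(lambda row: row['source'] == 'conlang')
>>> # first three English sentences of the train split
>>> ds['train'][:3]['en']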
At the moment, the corpus is taken from zmifanva; more sources will eventually be added. To contribute a new source:
- Understand the output and read the sources of the zmifanva process:
  make zmifanva_get
  make zmifanva_convert
  Note the use of the seed parameter.
- Implement the same process for another data source (a rough sketch follows below) and create a pull request.
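The sketch below only illustrates the shape of such a conversion step: it builds rows with the same fields as the existing splits and uses a seed for a reproducible shuffle. The output format (JSON Lines here), the file name, the source name 'mynewsource', and the helper tokenize_lojban are assumptions for illustration; the real conventions should be taken from the zmifanva_convert sources.

import json
import random

from transformers import AutoTokenizer

# English tokenizer used for the existing en_tok field
en_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_lojban(text):
    # placeholder: reuse the same Lojban tokenizer as the zmifanva pipeline
    return text

def make_row(idx, jb_sentence, en_sentence):
    # one corpus row with the same fields as the existing splits
    return {
        'id': 'Mynewsource:%d' % idx,
        'jb': jb_sentence,
        'jb_tok': tokenize_lojban(jb_sentence),
        'en': en_sentence,
        'en_tok': ' '.join(en_tokenizer.tokenize(en_sentence)),
        'source': 'mynewsource',
    }

pairs = [('coi munje', 'Hello, world.')]  # replace with the real sentence pairs
rows = [make_row(i, jb, en) for i, (jb, en) in enumerate(pairs)]

# reproducible shuffling before splitting into train/test/validation;
# this is where a seed parameter matters
random.seed(42)
random.shuffle(rows)

with open('mynewsource.jsonl', 'w', encoding='utf-8') as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + '\n')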