Parallel corpus of Lojban sentences

The corpus is published as a Hugging Face dataset: https://huggingface.co/datasets/olpa/jbo-corpus.

Usage (requires the datasets library):

>>> import datasets

>>> ds = datasets.load_dataset(path='olpa/jbo-corpus')
>>> print(ds)
DatasetDict({
    train: Dataset({
        features: ['id', 'jb', 'jb_tok', 'en', 'en_tok', 'source'],
        num_rows: 8844
    })
    test: Dataset({
        features: ['id', 'jb', 'jb_tok', 'en', 'en_tok', 'source'],
        num_rows: 2688
    })
    validation: Dataset({
        features: ['id', 'jb', 'jb_tok', 'en', 'en_tok', 'source'],
        num_rows: 2681
    })
})
>>> ds['train'][124]
{
  'id': 'Conlang:72',
  'jb': "la batman jo'u la robin se steci lo ka balpre bu'u la gotam",
  'jb_tok': "la batci## manku jo'u la ro bi n se steci lo ka banli## prenu bu'u la go ta m",
  'en': 'Batman and Robin are the only heros in Gotham.',
  'en_tok': 'batman and robin are the only hero ##s in gotham .',
  'source': 'conlang'
}

The fields jb_tok and en_tok are the tokenized versions of jb and en, respectively. The Lojban side uses a dedicated Lojban tokenizer; the English side uses the "bert-base-uncased" tokenizer from Hugging Face transformers.
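As a quick check, the English tokenization can be reproduced with the transformers library. A minimal sketch, assuming transformers is installed; the output should match en_tok up to tokenizer version differences:

from transformers import AutoTokenizer

# Load the same tokenizer that was used to produce the en_tok field.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

en = 'Batman and Robin are the only heros in Gotham.'
print(' '.join(tokenizer.tokenize(en)))
# expected: batman and robin are the only hero ##s in gotham .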

For more ways to work with the loaded dataset, see the datasets documentation: https://huggingface.co/docs/datasets
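For example, the standard datasets API can slice and filter the corpus. A small sketch, assuming only the field names shown in the record above:

import datasets

ds = datasets.load_dataset(path='olpa/jbo-corpus')

# Print the first few sentence pairs from the training split.
for row in ds['train'].select(range(3)):
    print(row['jb'], '->', row['en'])

# Keep only the rows that came from the conlang source.
conlang_rows = ds['train'].filter(lambda row: row['source'] == 'conlang')
print(len(conlang_rows))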

Sources

At the moment, the corpus is taken from zmifanva. More sources will eventually be added.

Contributing

Understand the output and read the sources of the zmifanva process:

  • make zmifanva_get
  • make zmifanva_convert
  • note the use of the seed parameter (a sketch of the idea follows below)
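The seed makes the train/test/validation split reproducible: shuffling with a fixed seed yields the same split every time the data is regenerated. A minimal sketch of that idea; the function name, ratios, and structure here are illustrative, not the actual conversion script:

import random

def split_corpus(pairs, seed=42, test_frac=0.19, valid_frac=0.19):
    # Shuffle deterministically so the same seed always yields
    # the same train/test/validation split.
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_valid = int(len(shuffled) * valid_frac)
    return {
        'test': shuffled[:n_test],
        'validation': shuffled[n_test:n_test + n_valid],
        'train': shuffled[n_test + n_valid:],
    }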

Implement the same process for another data source and create a pull request.