The corpus is available as a Hugging Face dataset: https://huggingface.co/datasets/olpa/jbo-corpus.
Usage (requires the datasets library):
>>> import datasets
>>> ds = datasets.load_dataset(path='olpa/jbo-corpus')
>>> print(ds)
DatasetDict({
    train: Dataset({
        features: ['id', 'jb', 'jb_tok', 'en', 'en_tok', 'source'],
        num_rows: 8844
    })
    test: Dataset({
        features: ['id', 'jb', 'jb_tok', 'en', 'en_tok', 'source'],
        num_rows: 2688
    })
    validation: Dataset({
        features: ['id', 'jb', 'jb_tok', 'en', 'en_tok', 'source'],
        num_rows: 2681
    })
})
>>> ds['train'][124]
{
    'id': 'Conlang:72',
    'jb': "la batman jo'u la robin se steci lo ka balpre bu'u la gotam",
    'jb_tok': "la batci## manku jo'u la ro bi n se steci lo ka banli## prenu bu'u la go ta m",
    'en': 'Batman and Robin are the only heros in Gotham.',
    'en_tok': 'batman and robin are the only hero ##s in gotham .',
    'source': 'conlang'
}
The fields jb_tok and en_tok are the tokenized versions of jb and en, respectively. The Lojban text is tokenized with a Lojban tokenizer; the English text with the "bert-base-uncased" tokenizer from Hugging Face Transformers.
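For reference, the English tokenization can be reproduced with the transformers library. The snippet below is a sketch assuming the pretrained tokenizer's default settings; the expected token list is taken from the en_tok value of the example record above:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> tokenizer.tokenize('Batman and Robin are the only heros in Gotham.')
['batman', 'and', 'robin', 'are', 'the', 'only', 'hero', '##s', 'in', 'gotham', '.']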
To learn how to work with the datasets library, see the Hugging Face course:
- Course -> 3. Fine-tuning a pretrained model -> Processing the data
- Course -> 5. The datasets library -> Time to slice and dice
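As a quick illustration of slicing and filtering (a sketch only; the field names and the 'conlang' value come from the example record above):
>>> # keep only the rows whose source field is 'conlang'
>>> conlang_rows = ds['train'].filter(lambda row: row['source'] == 'conlang')
>>> # first three English sentences of the train split
>>> ds['train'][:3]['en']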
At the moment, the corpus is taken from zmifanva; more sources will eventually be added. To contribute a new source:
- Understand the output and read the sources of the zmifanva process:
  make zmifanva_get
  make zmifanva_convert
  Note the use of the seed parameter.
- Implement the same process for another data source (a rough sketch follows below) and create a pull request.
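The sketch below only illustrates the shape of such a conversion step: it builds rows with the same fields as the existing splits and uses a seed for a reproducible shuffle. The output format (JSON Lines here), the file name, the source name 'mynewsource', and the helper tokenize_lojban are assumptions for illustration; the real conventions should be taken from the zmifanva_convert sources.

import json
import random

from transformers import AutoTokenizer

# English tokenizer used for the existing en_tok field
en_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_lojban(text):
    # placeholder: reuse the same Lojban tokenizer as the zmifanva pipeline
    return text

def make_row(idx, jb_sentence, en_sentence):
    # one corpus row with the same fields as the existing splits
    return {
        'id': 'Mynewsource:%d' % idx,
        'jb': jb_sentence,
        'jb_tok': tokenize_lojban(jb_sentence),
        'en': en_sentence,
        'en_tok': ' '.join(en_tokenizer.tokenize(en_sentence)),
        'source': 'mynewsource',
    }

pairs = [('coi munje', 'Hello, world.')]  # replace with the real sentence pairs
rows = [make_row(i, jb, en) for i, (jb, en) in enumerate(pairs)]

# reproducible shuffling before splitting into train/test/validation;
# this is where a seed parameter matters
random.seed(42)
random.shuffle(rows)

with open('mynewsource.jsonl', 'w', encoding='utf-8') as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + '\n')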