The objective of this kata is to label wikipedia articles with their languages.

ldnpydojo/dojo-lang-detector

dojo_lang_detector

The objective of this dojo is to label Wikipedia articles with the language they are written in.

To achieve this, training and test data are provided in this repository; their schemas are documented below.

To get a score on the dojo leaderboard, your script must read a test dataset and generate an answer file. In the example below, the test dataset is supplied on standard input and the answers are written to standard output.

For example:

    python label_articles.py < test_200.json > team_n_answers.json

The grading.py script is then used to compute the official dojo score.

The score is computed by adding 1 for each correct guess, -1 for each incorrect guess, and 0 for each example left without a guess.
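As a worked illustration of this rule, the sketch below scores a set of guesses against known labels. It is not the repository's grading.py; the function name `score_answers` and the dict-based inputs are assumptions made for this example.

```python
def score_answers(truth, guesses):
    """Return the total score: +1 per correct guess, -1 per wrong guess, 0 for no guess.

    truth   -- dict mapping example identifier -> correct language code
    guesses -- dict mapping example identifier -> guessed code, or None for no guess
    """
    score = 0
    for example, correct_lang in truth.items():
        # Assumption: an example missing from `guesses` counts as no guess.
        guess = guesses.get(example)
        if guess is None:
            continue  # no guess leaves the score unchanged
        score += 1 if guess == correct_lang else -1
    return score


truth = {0: "it", 1: "be", 2: "fr"}
guesses = {0: "it", 1: "ru", 2: None}
print(score_answers(truth, guesses))  # one correct, one incorrect, one no guess -> 0
```

Because a wrong guess costs as much as a correct guess earns, answering null can be the better choice when a solution is unsure.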

Files schema

lang_train.json

Labelled training dataset as a jsonl (JSON Lines) file containing objects with the following schema:

  1. text: UTF-8 extract from a Wikipedia article, cleared of HTML tags.
  2. lang: ISO code of the language of the extract.
  3. subject: the Wikipedia subject of the article the extract comes from (for example, Atlantic_Ocean).

train_*.json

Another labelled training dataset as a jsonl file containing objects with the same schema as lang_train.json, but in which the text is only 100 or 200 characters taken from the middle of the article.
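To make "from the middle of the article" concrete, here is one way such an extract could be cut from a longer text. How the files in this repository were actually produced is not documented here; the helper name `middle_extract` and the centring rule are assumptions.

```python
def middle_extract(text, length=100):
    """Return `length` characters centred on the middle of `text` (hypothetical helper)."""
    if len(text) <= length:
        return text  # texts that are already short enough are returned whole
    start = (len(text) - length) // 2
    return text[start:start + length]


# A 400-character stand-in for an article: the middle 100 characters are all "b".
article = "a" * 150 + "b" * 100 + "c" * 150
extract = middle_extract(article, 100)
print(len(extract))            # 100
print(extract == "b" * 100)    # True
```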

test_100.json

Unlabelled test dataset as a jsonl file containing objects with the following schema:

text
100-character UTF-8 extract from a Wikipedia article, cleared of HTML tags.
example
Example identifier.

random_solution.py

Example solution, to demonstrate the expected output.

random_solution_answers.json

An example answer file as generated by random_solution.py, with the following schema:

example
Example identifier.
lang
Code of the language guessed by the solution. null denotes no guess.
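Taken together, the test and answer schemas suggest the overall shape of a solution. The sketch below is written in the spirit of random_solution.py but is not copied from it. It assumes the answer file is written as JSON Lines, one object per example, which should be checked against random_solution_answers.json. The names `guess_language` and `label_examples` are hypothetical, and the placeholder guesser always declines to guess.

```python
import io
import json


def guess_language(text):
    """Placeholder guesser (hypothetical): always returns None, meaning no guess."""
    return None


def label_examples(lines, out):
    """Read jsonl test examples from `lines` and write one jsonl answer per example to `out`."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        item = json.loads(line)
        answer = {"example": item["example"], "lang": guess_language(item["text"])}
        out.write(json.dumps(answer) + "\n")


# Demonstration on in-memory data; a real script would pass sys.stdin and sys.stdout.
# The identifier value 0 is illustrative only.
test_data = io.StringIO('{"example": 0, "text": "ico del Nord..."}\n')
answers = io.StringIO()
label_examples(test_data, answers)
print(answers.getvalue(), end="")  # {"example": 0, "lang": null}
```

Python's None is serialised as JSON null, which matches the "no guess" convention described above.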

languages.json

A JSON object mapping language ISO codes to their human-readable names.
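A short sketch of how such a mapping might be used to print a readable name for a guessed code. The two entries shown are illustrative and not copied from languages.json, and `language_name` is a hypothetical helper.

```python
import json

# Illustrative stand-in for the contents of languages.json (assumed shape: code -> name).
sample = '{"it": "Italian", "be": "Belarusian"}'
languages = json.loads(sample)


def language_name(code):
    """Return the human-readable name for `code`, or the code itself if it is unknown."""
    return languages.get(code, code)


print(language_name("it"))  # Italian
print(language_name("xx"))  # xx
```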

jsonl

The jsonl format consists of newline-delimited JSON objects, one per line. For example:

    {"lang": "it", "text": "ico del Nord...", "subject": "Atlantic_Ocean"}
    {"lang": "be", "text": "га да гораду ...", "subject": "New_York_City"}

A typical way to decode these files in Python is a generator expression such as (json.loads(line) for line in open(json_l_filename)).

More details at http://jsonlines.org/.
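The one-line approach above never closes the file explicitly. A slightly more defensive variant, sketched here with a hypothetical `read_jsonl` helper, closes the file deterministically and skips blank lines. The temporary file only exists to make the example self-contained.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one decoded object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Demonstration with a temporary file holding the two sample records shown earlier.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"lang": "it", "text": "ico del Nord...", "subject": "Atlantic_Ocean"}\n')
    tmp.write('{"lang": "be", "text": "га да гораду ...", "subject": "New_York_City"}\n')

records = list(read_jsonl(tmp.name))
os.remove(tmp.name)
print([record["lang"] for record in records])  # ['it', 'be']
```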
