Benchmark corpus for spelling correction in user-generated content for Brazilian Portuguese [Mendonca et al. 2016]

gustavoauma/propor_2016_speller

NILC Benchmark Corpus to Evaluate Spelling Correction of UGC in Brazilian Portuguese

Description

This corpus consists of 1,699 sentences in Brazilian Portuguese which were manually annotated by two linguists for spelling correction purposes. The sentences come from product reviews written by users on the web (user-generated content), crawled from the Buscapé website. The misspellings are classified into 5 classes:

  1. Typo (1,027 tokens): misspellings related to typographical problems, usually caused by key adjacency or fast typing.
  2. Phono (732 tokens: 683 not contextual and 49 contextual): cognitive misspellings, which are produced by a lack of understanding of letter-to-sound correspondences in written language. Contextual phonological errors are misspellings which generate a character sequence that corresponds to another existing word in the dictionary, such as: "eu vou compra" / "eu vou comprar".
  3. Diac (2,037 tokens: 1,625 not contextual and 412 contextual): misspellings which involve inserting, removing or replacing diacritics in a given word, e.g. "organizacao" / "organização".
  4. Int_slang (201 tokens): use of internet slang.
  5. Other (86 tokens): other types of errors or spurious orthography that do not belong to any of the above classes, such as abbreviations, loanwords, proper nouns, technical jargon, etc.
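
As an illustration of the Diac class, the following minimal Python sketch checks whether a misspelling and its correction differ only in their diacritics. It is not part of the corpus or its annotation tooling; the function names `strip_diacritics` and `differs_only_in_diacritics` are hypothetical, and the check assumes that removing Unicode combining marks after NFD normalization is an adequate approximation of "removing diacritics" for Portuguese.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Return the word with combining marks (accents, cedilla, tilde) removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def differs_only_in_diacritics(misspelled: str, corrected: str) -> bool:
    """True if the two forms are different but identical once diacritics are stripped."""
    return (
        misspelled != corrected
        and strip_diacritics(misspelled) == strip_diacritics(corrected)
    )


# Example pairs taken from the class descriptions above.
print(differs_only_in_diacritics("organizacao", "organização"))  # True
print(differs_only_in_diacritics("compra", "comprar"))           # False
```

A check like this only separates diacritic errors from the rest; telling contextual from non-contextual cases would additionally require a dictionary lookup, which is not shown here.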

Number of words in the corpus: 38,128
Number of misspellings: 4,083 (10.7%)
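
The totals above can be cross-checked against the per-class counts. The short Python sketch below uses only the figures stated in this README; the variable names are illustrative and do not correspond to any file or field in the corpus.

```python
# Per-class misspelling counts, as listed in the Description section.
CLASS_COUNTS = {
    "Typo": 1027,
    "Phono": 683 + 49,      # not contextual + contextual
    "Diac": 1625 + 412,     # not contextual + contextual
    "Int_slang": 201,
    "Other": 86,
}
TOTAL_WORDS = 38128

total_misspellings = sum(CLASS_COUNTS.values())
misspelling_rate = total_misspellings / TOTAL_WORDS

print(f"{total_misspellings} misspellings, {misspelling_rate:.1%} of {TOTAL_WORDS} words")
# prints "4083 misspellings, 10.7% of 38128 words"
```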

Last update: December 1st, 2014

Contributors

Error Analysis: Magali Duran, Gustavo Mendonça
Annotators: Erick Fonseca, Graça Volpe-Nunes, Gustavo Mendonça, Lucas Avanço, Magali Duran, Sandra Aluísio
