This corpus consists of 1,699 sentences in Brazilian Portuguese that were manually annotated by two linguists for spelling correction purposes. The sentences come from product reviews written by users on the web (user-generated content), crawled from the Buscapé website. The misspellings are classified into 5 classes:
- Typo (1,027 tokens): misspellings caused by typographical problems, usually related to key adjacency or fast typing.
- Phono (732 tokens: 683 non-contextual and 49 contextual): cognitive misspellings, which are produced by a lack of understanding of letter-to-sound correspondences in written language. Contextual phonological errors are misspellings that yield a character sequence corresponding to another existing word in the dictionary, e.g. "eu vou compra" / "eu vou comprar".
- Diac (2,037 tokens: 1,625 non-contextual and 412 contextual): misspellings caused by inserting, removing, or replacing diacritics in a given word, e.g. "organizacao" / "organização".
- Int_slang (201 tokens): use of internet slang.
- Other (86 tokens): other types of errors or spurious orthography that do not belong to any of the above classes, such as abbreviations, loanwords, proper nouns, technical jargon, etc.
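The Diac class and the contextual / non-contextual distinction can be illustrated with a minimal sketch. This is not the procedure used by the annotators, who labelled the corpus manually; the function names and the toy lexicon below are assumptions made only for illustration. A pair is treated as a diacritic error when the written and intended forms differ only in their diacritics, and as contextual when the written form is itself a word in the lexicon.

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks (acute, tilde, cedilla, ...) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_diacritic_error(written, intended):
    """True when the two forms differ only in their diacritics."""
    # Normalize to NFC first so that equivalent encodings compare equal.
    written = unicodedata.normalize("NFC", written)
    intended = unicodedata.normalize("NFC", intended)
    return written != intended and strip_diacritics(written) == strip_diacritics(intended)


def is_contextual(written, lexicon):
    """True when the misspelled form is itself an existing word."""
    return written.lower() in lexicon


# Hypothetical toy lexicon; a real check would need a full Portuguese dictionary.
TOY_LEXICON = {"compra", "comprar", "organização"}

print(is_diacritic_error("organizacao", "organização"))
print(is_diacritic_error("compra", "comprar"))
print(is_contextual("compra", TOY_LEXICON))
print(is_contextual("organizacao", TOY_LEXICON))
```

With the examples from the list above, "organizacao" / "organização" is flagged as a non-contextual diacritic error, while "compra" / "comprar" is not a diacritic error but is contextual, since "compra" is a valid word.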
Number of words in the corpus: 38,128
Number of misspellings: 4,083 (10.7%)
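As a quick consistency check, the per-class counts listed above can be summed and compared against the reported totals. The dictionary layout and variable names below are assumptions for illustration; the numbers are taken directly from this description.

```python
# Token counts per class, as reported in this description.
CLASS_COUNTS = {
    "Typo": 1027,
    "Phono": 732,
    "Diac": 2037,
    "Int_slang": 201,
    "Other": 86,
}

# (non-contextual, contextual) breakdown for the classes that report one.
SUBCOUNTS = {
    "Phono": (683, 49),
    "Diac": (1625, 412),
}

TOTAL_WORDS = 38128

total_misspellings = sum(CLASS_COUNTS.values())
rate = 100 * total_misspellings / TOTAL_WORDS

# Each breakdown should add up to its class total.
for name, (non_contextual, contextual) in SUBCOUNTS.items():
    assert non_contextual + contextual == CLASS_COUNTS[name]

print(total_misspellings)
print(round(rate, 1))
```

The class counts add up to 4,083, which is about 10.7% of the 38,128 words in the corpus.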
Last update: December 1st, 2014
Error Analysis: Magali Duran, Gustavo Mendonça
Annotators: Erick Fonseca, Graça Volpe-Nunes, Gustavo Mendonça, Lucas Avanço, Magali Duran, Sandra Aluísio