List of resources and tools developed with focus on Portuguese.

christianbaptista/Portuguese-NLP

Portuguese-NLP

List of resources and tools developed with focus on Portuguese.

Datasets

Multilingual datasets

  • A Multilingual Dataset for Investigating Stereotypes and Negative Attitudes Towards Migrant Groups in Large Language Models
  • askD - ELI5 dataset adapted to the Medical Questions (AskDocs) subreddit.
  • English-Portuguese Sentences - English-Portuguese Sentences from the Tatoeba Project.
  • EUR-Lex - multilingual corpus in all the official languages of the European Union.
  • Europarl - European Parliament Proceedings Parallel Corpus 1996-2011.
  • Europarl-ST - Multilingual Speech Translation Corpus containing paired audio-text samples, built from the debates held in the European Parliament between 2008 and 2012.
  • mc4 - multilingual, colossal, cleaned version of Common Crawl's web crawl corpus.
  • mfaq - multilingual corpus of Frequently Asked Questions parsed from the Common Crawl.
  • MKQA - Multilingual Knowledge Questions & Answers (github).
  • MQA - multilingual corpus of Questions and Answers (MQA) parsed from the Common Crawl.
  • MMARCO - Multilingual version of the MS MARCO passage ranking dataset.
  • mRobust - Multilingual version of the TREC 2004 Robust passage ranking dataset
  • MultiCoNER - a large multilingual dataset for Named Entity Recognition.
  • MuST-C - multilingual speech translation corpus.
  • OpenSubtitles - collection of translated movie subtitles.
  • OSCAR - Open Super-large Crawled Aggregated coRpus.
  • Tatoeba - a large database of sentences and translations.
  • TED2020 - contains a crawl of nearly 4000 TED and TED-X transcripts from July 2020.
  • TSAR-2022-Shared-Task - TSAR2022 Shared Task on Lexical Simplification.
  • WikiANN - multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.
  • WikiLingua - Multilingual abstractive summarization dataset extracted from WikiHow.
  • WikiMatrix - Parallel Sentences in 1620 Language Pairs from Wikipedia.
  • Wikiner - Learning multilingual named entity recognition from Wikipedia.
  • WikiNEuRal - Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).
  • Wikipedia - Wikipedia dataset containing cleaned articles of all languages.
  • XFORMAL - A Benchmark for Multilingual Formality Style Transfer.
  • XLSUM - 1.35 million professionally annotated article-summary pairs from BBC.

Lexicon

  • BATS-PT - manual translation of the lexicographic portion of the Bigger Analogy Test Set (BATS) to Portuguese
  • br.ispell - Ispell dictionary for Brazilian Portuguese (github).
  • Conceptnet - an open, multilingual knowledge graph.
  • DicSin - Dictionary of synonyms and antonyms.
  • lexiconPT - R package that provides lexicons for Portuguese Text Analysis.
  • lexicons - Dictionaries of names, surnames, acronyms and their extensions, stop-words, etc.
  • LIWC - Linguistic Inquiry and Word Count (dictionary)
  • Onto.PT - a lexical ontology for Portuguese (Ontologia Lexical para o Português).
  • OpenWordnet-PT - an open access wordnet for Portuguese (site).
  • OpLexicon - a sentiment lexicon for the Portuguese language.
  • palavras - Word list of Brazilian Portuguese.
  • PAPEL.
  • pt-br - Wordlist, verbs, conjugations, term frequencies.
  • PT-LKB - Large Portuguese Lexical-Semantic Knowledge Base
  • PULO - Portuguese Unified Lexical Ontology.
  • SentiLex-PT - a sentiment lexicon for Portuguese.
  • Stopwords - Portuguese stopwords collection.
  • Tep2.
  • Unitex-PB - lexical resources.
  • VaLexPB - a lexicon of Brazilian Portuguese verb valences.
  • VerbNet.Br 1.0 - verbal lexicon of Brazilian Portuguese.
  • wikidict-dsl-pt - Wikidata Bilingual DSL Dictionaries.
  • Wordnetaffectbr - vocabulary of emotion words.
  • Wordnet.Br - Portuguese WordNet.

Models

  • Albertina PT-BR - an encoder of the BERT family for Portuguese, in the American variant spoken in Brazil.
  • Albertina PT-PT - an encoder of the BERT family for Portuguese, in the European variant spoken in Portugal.
  • Alpaca-LoRA-PTBR - Low-Rank LLaMA Instruct-Tuning.
  • BART - BART pre-trained in Portuguese.
  • BERTimbau - BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performance on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment (Github).
  • BioBERTpt - fine-tuned BERT models trained on the clinical domain for Portuguese language (Github).
  • Cabrita - A Portuguese instruction-finetuned LLaMA (Github).
  • DeBERTinha - A DeBERTa V3 XSmall adapted to the Brazilian Portuguese language (Github).
  • Electra - Electra model trained on BRWAC.
  • Gervasio-PT-BR - a decoder of the GPT family for Portuguese, in the American variant spoken in Brazil.
  • Gervasio-PT-PT - a decoder of the GPT family for Portuguese, in the European variant spoken in Portugal.
  • GlórIA 1.3B - A Large Language Model focused on European Portuguese (HuggingFace).
  • GPT2 small - GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small model.
  • GPT-Neo small - a version of EleutherAI's GPT-Neo 125M finetuned for Portuguese.
  • GPT2-Bio-PT - a biomedical finetuned version of GPorTuguese-2 (Github).
  • NERDE-base - BERTimbau finetuned for NER on judicial documents.
  • roberta-pt-br
  • RoBERTaCrawlPT-base - a generic Portuguese Masked Language Model pretrained from scratch on the CrawlPT corpora.
  • RoBERTaLexPT-base - Portuguese Masked Language Model pretrained from scratch on the LegalPT and CrawlPT corpora.
  • Sabiá - Sabiá-7B is a Portuguese language model developed by Maritaca AI.
  • Sabiá 2 - Language model trained on Portuguese text, especially in the Brazilian domain.
  • T5 - T5 model on Brazilian Portuguese data.
  • tgf-xlm-roberta-base-pt-br (Github)
  • Wav2vec - Fine-tuned facebook/wav2vec2-large-xlsr-53 on Portuguese using the train and validation splits of Common Voice 6.1.

Multilingual Models

  • Bloom - BigScience Large Open-science Open-access Multilingual Language Model.
  • mBert - Model pretrained on the 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective.
  • mDeBERTa
  • mGPT - Multilingual GPT model. An autoregressive GPT-like model.
  • mMiniLM - mMiniLM-L6-v2 Reranker finetuned on mMARCO
  • mT5 - Multilingual T5. A massively multilingual pre-trained text-to-text transformer.
  • XLM-RoBERTa - XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
  • LaBSE - Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages.

Word Embeddings

  • fastText - Multi-lingual word vectors.
  • LASER - Language-Agnostic SEntence Representations.
  • NILC-Embeddings - Word embeddings trained in Portuguese by USP.
  • MUSE - Multilingual Unsupervised and Supervised Embeddings.
  • word vectors - Pre-trained word vectors of 30+ languages.

Metrics

  • Coh-Metrix-Port - an adaptation of the Coh-Metrix text analysis tool to the Brazilian Portuguese language.
  • NILC-Metrix - gathers the metrics developed over more than a decade at the NILC lab.

Leaderboards

  • Open PT LLM Leaderboard - aims to provide a benchmark for evaluating Large Language Models (LLMs) in Portuguese across a variety of tasks and datasets.

Frameworks

Institutions

Tools

  • Apertium-por - Apertium linguistic data for Portuguese.
  • Autocorrect - Spelling corrector in Python.
  • BrGram - Computational grammar fragment of Brazilian Portuguese in the LFG formalism implemented in XLE.
  • Dicio API - Portuguese dictionary API.
  • dict-pt-br - dictionary for Brazilian Portuguese.
  • Languagetool - Style and Grammar Checker for 25+ Languages.
  • LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language.
  • LexML Parser - parser for legal documents.
  • LX parser - statistical constituency parser for Portuguese.
  • metaphone-ptbr - Metaphone algorithm for the Portuguese language.
  • mlconjug3 - a Python library to conjugate verbs in Portuguese and other languages.
  • MorphoBr - Resources for morphological analysis of Portuguese.
  • OpCluster - Automatic extraction and clustering of fine-grained opinions.
  • Phonemizer - Simple text to phones converter for multiple languages.
  • PorGram - Open source computational grammar for Portuguese in the HPSG formalism.
  • pymetaphone-br - Metaphone algorithm package for the Portuguese language.
  • pysentimiento - Multilingual toolkit for Sentiment Analysis and Social NLP tasks.
  • pyspellchecker - Multilingual Spell Checking.
  • RBAMR - A Rule-Based AMR Parser for Portuguese.
  • Verbecc - Complete Conjugation of any Verb using Machine Learning for French, Spanish, Portuguese, Italian and Romanian.

Other lists

Other links
