**This is a work in progress**

# Portuguese BERT

This repository contains pre-trained BERT models for the Portuguese language. The BERT-Base and BERT-Large Cased variants were trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps using whole-word masking. Model artifacts for TensorFlow and PyTorch can be found below.
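With whole-word masking, when a word is split into several WordPiece tokens, either all of its pieces are masked or none are. The sketch below is illustrative only (the tokens, probability, and function are made up for this example; BERT's actual pre-training data generation is more involved):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Illustrative whole-word masking over WordPiece tokens.

    Pieces prefixed with '##' continue the preceding word, so each
    word is masked entirely or not at all.
    """
    rng = random.Random(seed)
    # Group token indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"
    return masked

# "cachorro" and "parque" are each split into two pieces here.
tokens = ["o", "cach", "##orro", "correu", "no", "par", "##que"]
print(whole_word_mask(tokens, mask_prob=0.5, seed=3))
```

Note that `cach` and `##orro` are always masked (or kept) together, whereas token-level masking could hide only one of the two pieces.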

The models are the result of an ongoing Master's program. The Qualifying Exam text is also included in the repository in PDF format; it contains more details about the pre-training procedure, vocabulary generation, and downstream usage in the task of Named Entity Recognition.

## Download

| Model | TensorFlow checkpoint | PyTorch checkpoint | Vocabulary |
|---|---|---|---|
| bert-base-portuguese-cased | Download | Download | Download |
| bert-large-portuguese-cased | Download | Download | Download |
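The vocabulary file is a plain-text list of WordPiece tokens, one per line. As a rough illustration of how such a vocabulary is consumed, here is a sketch of greedy longest-match-first WordPiece tokenization (the tiny vocabulary below is made up; the real tokenizer also handles casing rules, punctuation splitting, and a max-characters-per-word cutoff):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation of one word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a match.
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the '##' prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: emit the unknown token
        pieces.append(piece)
        start = end
    return pieces

# Tiny hypothetical vocabulary, for illustration only.
vocab = {"cach", "##orro", "corr", "##eu", "no", "parque"}
print(wordpiece_tokenize("cachorro", vocab))  # ['cach', '##orro']
```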

## NER Benchmarks

The models were benchmarked on the Named Entity Recognition task and compared to previously published results and to Multilingual BERT. Our reported results use BERT or BERT-CRF architectures, while the previous state-of-the-art results use distinct methods.

| Test Dataset | BERT-Large Portuguese | BERT-Base Portuguese | BERT-Base Multilingual | Previous SOTA |
|---|---|---|---|---|
| MiniHAREM (5 classes) | 83.30 | 83.03 | 79.44 | 82.26 [1], 76.27 [2] |
| MiniHAREM (10 classes) | 78.67 | 77.98 | 74.15 | 74.64 [1], 70.33 [2] |
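NER benchmarks such as these are typically scored with entity-level F1, where a predicted entity counts only if both its type and its exact span match the gold annotation. A minimal sketch of that computation over BIO tags (illustrative; the repository's actual evaluation script may differ):

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a BIO tag sequence."""
    entities = []
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the final span
        ends_span = tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype)
        if ends_span and start is not None:
            entities.append((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1: type and span must both agree."""
    gold = set(extract_entities(gold_tags))
    pred = set(extract_entities(pred_tags))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
print(entity_f1(gold, pred))  # one of two gold entities found exactly
```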

## PyTorch usage example

Our PyTorch artifacts are compatible with the 🤗 Hugging Face Transformers library and are also available as Community models:

```python
from transformers import AutoModel, AutoTokenizer

# Using the community model
# BERT Base
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-base-portuguese-cased')

# BERT Large
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-large-portuguese-cased')

# or, using BertModel and BertTokenizer directly
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt')
model = BertModel.from_pretrained('path/to/bert_dir')  # or another BERT model class
```

## Acknowledgement

We would like to thank Google for the Cloud credits, provided under a research grant, that allowed us to train these models.

## References

[1] Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition

[2] Portuguese Named Entity Recognition using LSTM-CRF

## How to cite this work

```bibtex
@article{souza2019portuguese,
    title={Portuguese Named Entity Recognition using BERT-CRF},
    author={Souza, Fabio and Nogueira, Rodrigo and Lotufo, Roberto},
    journal={arXiv preprint arXiv:1909.10649},
    url={http://arxiv.org/abs/1909.10649},
    year={2019}
}
```
