**This is a work in progress**
This repository contains pre-trained BERT models for the Portuguese language. BERT-Base and BERT-Large Cased variants were trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps using whole-word masking. Model artifacts for both TensorFlow and PyTorch can be found below.
The models are the result of an ongoing Master's program. The text submitted for the Qualifying Exam is also included in the repository in PDF format; it contains more details about the pre-training procedure, vocabulary generation, and downstream usage in the task of Named Entity Recognition.
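Whole-word masking differs from the original BERT per-token masking: when any WordPiece of a word is selected for masking, all pieces of that word are masked together. A simplified sketch of the idea (real pretraining also sometimes keeps the token or substitutes a random one instead of always writing `[MASK]`):

```python
import random

def whole_word_mask(wordpieces, mask_prob=0.15, seed=0):
    """Toy whole-word masking: a '##' piece continues the previous word,
    and a word is masked as a unit, all its pieces at once."""
    rng = random.Random(seed)
    # Group piece indices into whole words.
    words = []
    for i, tok in enumerate(wordpieces):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    out = list(wordpieces)
    for word in words:
        # One masking decision per word, not per piece.
        if rng.random() < mask_prob:
            for i in word:
                out[i] = "[MASK]"
    return out
```

With per-token masking, `"##tu"` could be masked while `"por"` and `"##guês"` stay visible, making the prediction trivially easy; masking the whole word forces the model to use surrounding context.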
Model | TensorFlow checkpoint | PyTorch checkpoint | Vocabulary |
---|---|---|---|
`bert-base-portuguese-cased` | Download | Download | Download |
`bert-large-portuguese-cased` | Download | Download | Download |
The models were benchmarked on the Named Entity Recognition task and compared to previously published results and to Multilingual BERT. Reported results are for BERT or BERT-CRF architectures, while the other results come from distinct methods.
Test Dataset | BERT-Large Portuguese | BERT-Base Portuguese | BERT-Base Multilingual | Previous SOTA |
---|---|---|---|---|
MiniHAREM (5 classes) | 83.30 | 83.03 | 79.44 | 82.26 [1], 76.27[2] |
MiniHAREM (10 classes) | 78.67 | 77.98 | 74.15 | 74.64 [1], 70.33[2] |
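The scores above are entity-level F1: a predicted entity counts as correct only if both its span and its type exactly match a gold entity. A minimal sketch of the metric over hypothetical toy spans (not the official CoNLL evaluation script):

```python
def entity_f1(gold, pred):
    """Entity-level F1 over (type, start, end) tuples:
    a prediction is a true positive only on an exact match."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one exact match, one boundary error, one miss.
gold = [("PER", 0, 2), ("LOC", 5, 6), ("ORG", 8, 10)]
pred = [("PER", 0, 2), ("LOC", 5, 7)]
# precision = 1/2, recall = 1/3, F1 = 0.4
```

Note how the near-miss `("LOC", 5, 7)` earns no partial credit; this strictness is why entity-level F1 is a demanding metric for NER.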
Our PyTorch artifacts are compatible with the 🤗 Hugging Face Transformers library and are also available as Community models:
```python
from transformers import AutoModel, AutoTokenizer

# Using the community models
# BERT Base
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-base-portuguese-cased')

# BERT Large
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-large-portuguese-cased')
```

Or, using `BertModel` and `BertTokenizer` directly with local files:

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt')
model = BertModel.from_pretrained('path/to/bert_dir')  # or another BERT model class
```
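Under the hood, `BertTokenizer` segments each word by greedy longest-match-first lookup against `vocab.txt`, marking continuation pieces with a `##` prefix. A self-contained sketch of that WordPiece algorithm, using a toy vocabulary rather than the real 30k-entry file:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation:
    repeatedly take the longest vocab entry that prefixes the
    remainder of the word; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation piece
            if sub in vocab:
                cur = sub
                break
            end -= 1  # shrink the candidate and retry
        if cur is None:
            return [unk]  # no piece matches: whole word is unknown
        pieces.append(cur)
        start = end
    return pieces

vocab = {"tokeniza", "##ção", "to"}
wordpiece_tokenize("tokenização", vocab)  # ["tokeniza", "##ção"]
```

A larger vocabulary trained on the target language yields fewer pieces per word, which is one motivation for a Portuguese-specific vocabulary over the Multilingual BERT one.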
We would like to thank Google for the Cloud credits, provided under a research grant, that allowed us to train these models.
[1] Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition
[2] Portuguese Named Entity Recognition using LSTM-CRF
```bibtex
@article{souza2019portuguese,
  title={Portuguese Named Entity Recognition using BERT-CRF},
  author={Souza, Fabio and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:1909.10649},
  url={http://arxiv.org/abs/1909.10649},
  year={2019}
}
```