Code Monkey home page Code Monkey logo

portuguese-bert's Introduction

BERTimbau - Portuguese BERT

This repository contains pre-trained BERT models trained on the Portuguese language. BERT-Base and BERT-Large Cased variants were trained on the BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word mask. Model artifacts for TensorFlow and PyTorch can be found below.

The models are a result of an ongoing Master's Program. The text submission for Qualifying Exam is also included in the repository in PDF format, which contains more details about the pre-training procedure, vocabulary generation and downstream usage in the task of Named Entity Recognition.

Download

The base and large models are available at Hugging Face

Evaluation benchmarks

The models were benchmarked on three tasks (Sentence Textual Similarity, Recognizing Textual Entailment and Named Entity Recognition) and compared to previous published results and Multilingual BERT. Metrics are: Pearson's correlation for STS and F1-score for RTE and NER.

Task Test Dataset BERTimbau-Large BERTimbau-Base mBERT Previous SOTA
STS ASSIN2 0.852 0.836 0.809 0.83 [1]
RTE ASSIN2 90.0 89.2 86.8 88.3 [1]
NER MiniHAREM (5 classes) 83.7 83.1 79.2 82.3 [2]
NER MiniHAREM (10 classes) 78.5 77.6 73.1 74.6 [2]

NER experiments code

Code and instructions to reproduce the Named Entity Recognition experiments are in ner_evaluation/ directory.

PyTorch usage example

Our PyTorch artifacts are compatible with the 🤗Huggingface Transformers library and are also available on the Community models:

from transformers import AutoModel, AutoTokenizer

# Using the community model
# BERT Base
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-base-portuguese-cased')

# BERT Large
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-large-portuguese-cased')

# or, using BertModel and BertTokenizer directly
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt', do_lower_case=False)
model = BertModel.from_pretrained('path/to/bert_dir')  # Or other BERT model class

Acknowledgement

We would like to thank Google for Cloud credits under a research grant that allowed us to train these models.

References

[1] Multilingual Transformer Ensembles for Portuguese Natural Language Task

[2] Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition

How to cite this work

@inproceedings{souza2020bertimbau,
    author    = {Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
    title     = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
    booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
    year      = {2020}
}

@article{souza2019portuguese,
    title={Portuguese Named Entity Recognition using BERT-CRF},
    author={Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
    journal={arXiv preprint arXiv:1909.10649},
    url={http://arxiv.org/abs/1909.10649},
    year={2019}
}

portuguese-bert's People

Contributors

fabiocapsouza avatar robertoalotufo avatar rodrigonogueira4 avatar alcidesmig avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.