ner-pt's Introduction

Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition

Modern approaches to Named Entity Recognition (NER) use neural networks (NN) to automatically extract features from text and seamlessly integrate them with sequence taggers in an end-to-end fashion. Word embeddings, which are a side product of pretrained neural language models (LMs), are key ingredients to boost the performance of NER systems. More recently, contextual word embeddings, which adapt according to the context where the word appears, have proved to be an invaluable resource to improve NER systems. In this work, we assess how different combinations of (shallow) word embeddings and contextual embeddings impact NER for the Portuguese language. We present a comparative study of 16 different combinations of shallow and contextual embeddings and explore how the textual diversity and the size of the training corpora used in LMs impact our NER results. We evaluate NER performance using the HAREM corpus. Our best NER system outperforms the state of the art in Portuguese NER by 5.99 absolute percentage points. All results, including the state-of-the-art baselines, were evaluated with the CoNLL-2002 script.

Results for the Total Scenario (HAREM)

Approach | Precision | Recall | F1
BiLSTM-CRF+FlairBBP | 74.91% | 74.37% | 74.64%
BiLSTM-CRF (Castro, et al.) | 72.28% | 68.03% | 70.33%
CharWNN (dos Santos, et al.) | 67.16% | 63.74% | 65.41%

Results for the Selective Scenario (HAREM)

Approach | Precision | Recall | F1
BiLSTM-CRF+FlairBBP | 83.38% | 81.17% | 82.26%
BiLSTM-CRF (Castro, et al.) | 78.26% | 74.39% | 76.27%
CharWNN (dos Santos, et al.) | 73.98% | 68.68% | 71.23%
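
The F1 column in the tables above is the harmonic mean of precision and recall, which is how the CoNLL evaluation script combines the two. The following is a minimal sketch that recomputes F1 for the BiLSTM-CRF+FlairBBP rows; the function name `f1_score` and the `rows` dictionary are illustrative and not part of this repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of BiLSTM-CRF+FlairBBP, copied from the tables above.
rows = {
    "Total": (74.91, 74.37),
    "Selective": (83.38, 81.17),
}

for scenario, (precision, recall) in rows.items():
    print(f"{scenario}: F1 = {f1_score(precision, recall):.2f}%")
# prints:
# Total: F1 = 74.64%
# Selective: F1 = 82.26%
```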

Reproduce our tests for NER

Before you begin, install the Flair library. Flair is an NLP library developed by Zalando Research that achieves state-of-the-art results. See the Flair GitHub repository for details.

  • Paper: Contextual String Embeddings for Sequence Labeling (Akbik, et al.)

STEP 1: Download our language model FlairBBP (backward and forward);

STEP 2: Clone this repository;

STEP 3: Install Flair 0.4.1 (see the Flair installation instructions);

STEP 4: Download the NILC word embeddings. You must download the Word2Vec Skip-Gram model with 300 dimensions and put the file inside the cloned folder;

STEP 5: Run our script: python3.6 ner_flair.py

Tagging your Portuguese text with our NER model

Tag your text using our best NER model, which combines FlairBBP + NILC-Word2Vec-Skpg-300d. It recognizes the following categories: PERSON, LOCATION, ORGANIZATION, TIME and VALUE. You need to install Flair 0.4.1.

STEP 1: Download our NER model;

STEP 2: Clone this repository;

STEP 3: Run our script: python3.6 tagging_ner.py [input_file_name.txt] [output_file_name.txt] [mode]

The available modes are:

  • conll - input text in CoNLL format
  • plain - input text in plain format
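
To illustrate the difference between the two modes, here is a minimal sketch that converts plain text into a CoNLL-style layout. It assumes that "CoNLL format" means one token per line with a blank line after each sentence, and that each input line holds one sentence; the exact layout expected by tagging_ner.py is not documented here, and the function `plain_to_conll` and its simple regex tokenizer are hypothetical.

```python
import re


def plain_to_conll(text):
    """Turn plain text (one sentence per line) into one token per line.

    Assumption: sentences are separated by a blank line in the output,
    as in the CoNLL shared-task files. The tokenizer is a simple regex
    that splits words from punctuation; it is not the repository's own.
    """
    lines = []
    for sentence in text.splitlines():
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        if not tokens:
            continue  # skip empty input lines
        lines.extend(tokens)
        lines.append("")  # blank line marks the end of the sentence
    return "\n".join(lines)


print(plain_to_conll("Maria mora em Lisboa."))
```

With this layout, the sentence above becomes five lines (Maria, mora, em, Lisboa and the final period) followed by a blank line.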

Language Models

Flair Embeddings - FlairBBP

You can download our Flair Embeddings models (FlairBBP) from the following links:

Word Embeddings

You can download our word embedding models from the links below. All models were trained with 300 dimensions:

Algorithm | Architecture | Download
Word2Vec | Skip-Gram | Word2Vec_skpg_300d
Word2Vec | CBOW | Word2Vec_cbow_300d
FastText | Skip-Gram | Fasttext_skpg_300d
FastText | CBOW | Fasttext_cbow_300d

NILC Word Embeddings

You can download the word embeddings provided by NILC at the following link: http://nilc.icmc.usp.br/embeddings

  • Paper: Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks (Hartmann, et al.)
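
Before starting a long training run, it can help to check that a downloaded embedding file really has 300 dimensions. The sketch below assumes the file is in the word2vec text format (a header line with the vocabulary size and dimension, then one word and its vector per line); that format and the helper name `read_word2vec_text` are assumptions, not something this repository specifies. The demo uses a tiny in-memory file with 3 dimensions instead of a real download.

```python
import io


def read_word2vec_text(handle, expected_dim=300):
    """Read embeddings from a word2vec-style text file (assumed format)."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    if dim != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, found {dim}")
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Tiny stand-in for a real embedding file: 2 words, 3 dimensions.
sample = io.StringIO("2 3\ncasa 0.1 0.2 0.3\nrio 1.0 2.0 3.0\n")
vectors = read_word2vec_text(sample, expected_dim=3)
print(len(vectors), "vectors loaded")
```

For a real file, the handle would come from `open(path, encoding="utf-8")` and `expected_dim` would stay at its default of 300.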

Language Models Corpora

BlogSet-BR

BlogSet-BR is a large corpus built from millions of sentences taken from Brazilian Portuguese web blogs.

brWaC

brWaC is another large Portuguese corpus.

ptwiki-20190301

ptwiki-20190301 is a corpus formed by texts from the Portuguese Wikipedia.

Language Model Corpora Size Details (after pre-processing):

Corpus | Sentences | Tokens
brWaC | 127,272,109 | 2,930,573,938
BlogSet-BR | 58,494,090 | 1,807,669,068
ptwiki-20190301 | 7,053,954 | 162,109,057
All Corpora | 192,820,153 | 4,900,352,063
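
The "All Corpora" row is the sum of the three individual corpora. As a small sanity check, the sketch below recomputes the totals from the table and prints each corpus's average sentence length and share of the tokens; the variable names are illustrative only.

```python
# (sentences, tokens) per corpus, copied from the table above.
corpora = {
    "brWaC": (127_272_109, 2_930_573_938),
    "BlogSet-BR": (58_494_090, 1_807_669_068),
    "ptwiki-20190301": (7_053_954, 162_109_057),
}

total_sentences = sum(sentences for sentences, _ in corpora.values())
total_tokens = sum(tokens for _, tokens in corpora.values())

# Both totals match the "All Corpora" row of the table.
assert total_sentences == 192_820_153
assert total_tokens == 4_900_352_063

for name, (sentences, tokens) in corpora.items():
    print(
        f"{name}: {tokens / sentences:.1f} tokens per sentence, "
        f"{tokens / total_tokens:.1%} of all tokens"
    )
```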

Citing our Paper

@inproceedings{santos2019assessing,
  author    = {Joaquim Santos and
               Bernardo Consoli and
               Cicero dos Santos and
               Juliano Terra and
               Sandra Collovini and
               Renata Vieira},
  title     = {Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition},
  booktitle = {8th Brazilian Conference on Intelligent Systems, {BRACIS}, Bahia, Brazil, October 15-18},
  pages     = {437--442},
  year      = {2019}
}
