Portuguese-NLP

List of resources and tools developed with focus on Portuguese.

Datasets

  • Aspect-based annotated - a corpus annotated with implicit and explicit aspects and groups of (hierarchically organized) opinion aspects, intended for aspect-based sentiment analysis and text summarization.
  • ASSIN - a dataset with semantic similarity score and entailment annotations.
  • ASSIN 2 - successor to ASSIN.
  • BlogSet-BR - a collection of posts gathered from the Blogspot platform, written by Brazilian users.
  • br-quad-2.0 - Stanford Question Answering Dataset (SQuAD) 2.0 translated to Brazilian Portuguese (PT-BR) language.
  • Brazilian E-Commerce - Brazilian E-Commerce Public Dataset by Olist store.
  • Brazilian Headlines Sentiments - dataset containing sentiment analysis of headlines from Brazilian news agencies.
  • Brazilian Portuguese Literature Corpus - 3.7 million word corpus of Brazilian literature published between 1840-1908.
  • Brazilian Portuguese Sentiment Analysis Datasets.
  • Brazilian TCU's judgments - Judgments of Federal Court of Accounts - Brazil (TCU).
  • BrWaC - Brazilian Portuguese Web as Corpus.
  • BrWac2Wiki - a dataset for multi-document summarization in Portuguese.
  • B2W-Reviews01 - product reviews.
  • Carolina - Corpus Geral do Português Brasileiro Contemporâneo.
  • Capes - parallel corpus of thesis and dissertation abstracts in English and Portuguese.
  • CC100-Portuguese - created by Conneau & Wenzek et al. in 2020. This dataset is one of the 100 monolingual corpora processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository.
  • CETENFolha - news from the newspaper Folha de S. Paulo.
  • CHAVE - collection for Information Retrieval and Question Answering.
  • CINTIL Corpus - a linguistically interpreted corpus of Portuguese.
  • Complexidade Textual para Estágios Escolares do Sistema Educacional Brasileiro
  • CORAA - dataset for Automatic Speech Recognition.
  • CORAA SER - Emotion Recognition from Brazilian Portuguese Informal Spontaneous Speech.
  • CSTNews - a corpus with 50 clusters of news texts with their multi-document summaries, as well as several discourse and semantic annotations.
  • C-ORAL-BRASIL - This project is dedicated to the study of Brazilian Portuguese spontaneous speech and, more broadly, to the compilation of spoken corpora.
  • DEEPAGÉ - Answering Questions in Portuguese about the Brazilian Environment.
  • DNLT-BP - Datasets of Neuropsychological Language Tests in Brazilian Portuguese.
  • Essay-BR - a corpus of essays for the Brazilian Portuguese language.
  • Extended Essay-BR - Extended version of the Essay-BR corpus.
  • FACTCK.BR - A dataset to study Fake News in Portuguese.
  • Fake.Br - aligned true and fake news written in Brazilian Portuguese.
  • Fakepedia-Corpus.
  • FakeRecogna - dataset comprised of real and fake news.
  • FakeWhatsApp.Br - An annotated Corpus of WhatsApp messages in PT-BR for automatic detection of textual misinformation.
  • FCN
  • Floresta Sintá(c)tica - treebank for Portuguese.
  • HAREM first - evaluation contest for named entity recognizers in Portuguese.
  • HAREM second - evaluation contest for named entity recognizers in Portuguese.
  • HateBR - large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media.
  • Historical Portuguese Corpora - tools and resources for manipulation of historical corpora and management of historical dictionaries.
  • Iudicium Textum Dataset - contains legal documents created by Brazilian Federal Supreme Court in its integral composition (paper).
  • LeNER-Br - a Dataset for Named Entity Recognition in Brazilian Legal Text.
  • Lex2Kids - lexicon in Portuguese most heard by children.
  • Mac-Morpho - Brazilian Portuguese texts annotated with part-of-speech tags.
  • MilkQA - a dataset of dense questions for the task of answer selection.
  • Minutes of Central Bank of Brazil - Minutes of the Monetary Policy Committee of the Central Bank of Brazil.
  • NER in Brazilian Portuguese tweets - Twitter messages in pt-br annotated for the entities PER, LOC and ORG.
  • News-Crawl-PT - Monolingual News Crawl used for WMT.
  • News of the site Folha de São Paulo
  • News published in Brazil
  • Parallel Corpora from Revista Pesquisa FAPESP - Portuguese-English and Portuguese-Spanish bilingual collections of the online issues of the Brazilian scientific news magazine Revista Pesquisa FAPESP.
  • Pirá - A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean.
  • PLUE - Portuguese translation of the GLUE benchmark and Scitail dataset.
  • POeTiSA - POrtuguese processing - Towards Syntactic Analysis and parsing.
  • PorSimplesSent - a dataset of aligned sentence pairs for investigating sentence readability assessment.
  • PortiLexicon-UD - a lexicon for Brazilian Portuguese according to Universal Dependencies.
  • Portuguese Legal Sentences - Collection of Legal Sentences from the Portuguese Supreme Court of Justice.
  • Portuguese Presidential Elections - This dataset contains tweets and users mostly from the Portuguese Twittersphere.
  • PraCegoVer - multi-modal dataset containing images associated with Portuguese captions, based on posts from Instagram.
  • Priberam Fine-Grained Opinion Corpus - a Portuguese fine-grained dependency opinion mining corpus.
  • Propbank - Contains instances annotated with semantic role labels (SRL).
  • Projeto ACDC - Internet Access to Corpora.
  • QA-Portuguese - Adaptation from MQA dataset Portuguese split (QA entailment pairs).
  • REBEL-Portuguese - relation datasets built from Wikipedia.
  • ReLi - REsenha de LIvros (book reviews).
  • Rhetalho - corpus annotated with Daniel Marcu's RSTTool.
  • SemClinBr - multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks.
  • SESAME - corpus for NER in Portuguese.
  • SIGARRA News Corpus - SIGARRA information system at the University of Porto.
  • SIMPLEX-PB - A Lexical Simplification Database and Benchmark for Portuguese.
  • SIMPLEX-PB-2.0 - improved version of SIMPLEX-PB.
  • SIMPLEX-PB-3.0 - new version of SIMPLEX-PB.
  • SQUAD-PT v1.1 - Portuguese translation of the SQuAD dataset.
  • SQUAD-PT v2.0 - Portuguese translation of SQuAD 2.0 dataset.
  • TeMário - news texts and the corresponding human summaries for summarization purposes.
  • Textual Complexity Corpus - Textual Complexity Corpus for School Stages in the Brazilian Educational System.
  • ToLD-Br - Toxic Language Detection in Social Media for Brazilian Portuguese (github).
  • TTS-Portuguese Corpus - Text To Speech Portuguese.
  • TweetSentBR - Tweets in Brazilian Portuguese.
  • Tweets for Sentiment Analysis
  • UD_Portuguese-Bosque - Universal Dependencies (UD) Portuguese treebank.
  • UD_Portuguese-CINTIL - Universal Dependencies (UD) Portuguese treebank.
  • UD_Portuguese-GSD - Universal Dependencies (UD) Portuguese treebank.
  • UD_Portuguese-PetroGold - Universal Dependencies (UD) Portuguese treebank.
  • UD_Portuguese-PUD - Universal Dependencies (UD) Portuguese treebank.
  • UTLCorpus - a corpus of online reviews in Brazilian Portuguese annotated with helpfulness classification.
  • Winograd Schema Challenge - Solver for the Portuguese-based Winograd Schema Challenge.

Multilingual datasets

  • askD - ELI5 dataset adapted to the Medical Questions (AskDocs) subreddit.
  • English-Portuguese Sentences - English-Portuguese Sentences from the Tatoeba Project.
  • EUR-Lex - multilingual corpus in all the official languages of the European Union.
  • Europarl - European Parliament Proceedings Parallel Corpus 1996-2011.
  • Europarl-ST - Multilingual Speech Translation Corpus containing paired audio-text samples for Speech Translation, constructed from the debates held in the European Parliament between 2008 and 2012.
  • mc4 - a multilingual, colossal, cleaned version of Common Crawl's web crawl corpus.
  • mfaq - multilingual corpus of Frequently Asked Questions parsed from the Common Crawl.
  • MKQA - Multilingual Knowledge Questions & Answers (github).
  • MQA - multilingual corpus of Questions and Answers (MQA) parsed from the Common Crawl.
  • MMARCO - Multilingual version of the MS MARCO passage ranking dataset.
  • MultiCoNER - a large multilingual dataset for Named Entity Recognition.
  • MuST-C - multilingual speech translation corpus.
  • OSCAR - Open Super-large Crawled Aggregated coRpus.
  • OpenSubtitles - collection of translated movie subtitles.
  • Tatoeba - a large database of sentences and translations.
  • TED2020 - contains a crawl of nearly 4000 TED and TED-X transcripts from July 2020.
  • TSAR-2022-Shared-Task - TSAR2022 Shared Task on Lexical Simplification.
  • WikiANN - multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.
  • WikiLingua - Multilingual abstractive summarization dataset extracted from WikiHow.
  • WikiMatrix - Parallel Sentences in 1620 Language Pairs from Wikipedia.
  • Wikiner - Learning multilingual named entity recognition from Wikipedia.
  • WikiNEuRal - Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).
  • Wikipedia - Wikipedia dataset containing cleaned articles of all languages.
  • XFORMAL - A Benchmark for Multilingual Formality Style Transfer.
  • XLSUM - 1.35 million professionally annotated article-summary pairs from BBC
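
Several of the NER datasets listed above (WikiANN, for instance) label tokens in the IOB2 format, where `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. The following is a minimal sketch of how such tags can be grouped back into entity spans. The function name `iob2_to_spans` and the example sentence are hypothetical, chosen for illustration only, and are not taken from any of the datasets.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag with no matching open entity, which this
            # sketch simply ignores) closes whatever entity is open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical example sentence, not drawn from any listed dataset.
tokens = ["Fernando", "Pessoa", "nasceu", "em", "Lisboa", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(iob2_to_spans(tokens, tags))
```

Actual datasets may store tags as integer class ids rather than strings, in which case the ids would need to be mapped to label names first.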

Lexicon

Models

  • Albertina PT-BR
  • BERTimbau - BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performance on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment (Github).
  • BioBERTpt - fine-tuned BERT models trained on the clinical domain for the Portuguese language.
  • Cabrita - a Portuguese instruction-finetuned LLaMA (Github).
  • Electra - Electra model trained on BRWAC.
  • GPT2 small - GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small model.
  • GPT-Neo small - a version of EleutherAI's GPT-Neo 125M finetuned for the Portuguese language.
  • mMiniLM - mMiniLM-L6-v2 Reranker finetuned on mMARCO
  • roberta-pt-br
  • T5
  • tgf-xlm-roberta-base-pt-br (Github)
  • Wav2vec

Multilingual Models

Word Embeddings

Metrics

  • Coh-Metrix-Port - an adaptation of the Coh-Metrix text analysis tool to the Brazilian Portuguese language.
  • NILC-Metrix - gathers the metrics developed over more than a decade at the NILC Lab.

Frameworks

Institutions

Tools

  • Autocorrect - Spelling corrector in Python.
  • BrGram - Computational grammar fragment of Brazilian Portuguese in the LFG formalism implemented in XLE.
  • Dicio API - Portuguese dictionary API.
  • dict-pt-br - dictionary for Brazilian Portuguese.
  • Languagetool - Style and Grammar Checker for 25+ Languages.
  • LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language.
  • LexML Parser - parser for legal documents.
  • LX parser - statistical constituency parser for Portuguese.
  • metaphone-ptbr - Metaphone algorithm for the Portuguese language.
  • mlconjug3 - a Python library to conjugate verbs in Portuguese and other languages.
  • MorphoBr - Resources for morphological analysis of Portuguese.
  • OpCluster - Automatic extraction and clustering of fine-grained opinions.
  • Phonemizer - Simple text to phones converter for multiple languages.
  • PorGram - Open source computational grammar for Portuguese in the HPSG formalism.
  • pymetaphone-br - Metaphone algorithm package for the Portuguese language.
  • RBAMR - A Rule-Based AMR Parser for Portuguese.
  • Verbecc - Complete Conjugation of any Verb using Machine Learning for French, Spanish, Portuguese, Italian and Romanian.

Other lists

Other links
