
unitex-pt-br's Introduction


Note: dictionary data in this repo is a read-only mirror (translated to open formats for data interchange) of the official Unitex repository, where active development is ongoing.

unitex-pt-br

Unitex primary sources for the Brazilian Portuguese (pt-BR) vocabulary and its morphological definitions, in an open data (FrictionlessData) interchange format.

Controlled primary sources:

  • pt-BR Alphabet: Alphabet.csv and Alphabet_sort.csv

  • pt-BR DELAS: DELA for simple words, "Dicionário de Palavras Simples para o Português Brasileiro" (Dictionary of Simple Words for Brazilian Portuguese). ~67500 canonical words and their inflection rules. DELAS.csv.

  • pt-BR DELACF: DELA for compound forms, "Dicionário de Palavras Compostas Flexionadas para o Português Brasileiro" (Dictionary of Inflected Compound Words for Brazilian Portuguese). ~4000 compound words and their morphological classification. DELACF.csv.

  • pt-BR Inflections: all *.fst2 (finite state transducer v2) files, the compiled format for inflection graphs (see chapter 14.3 of the Unitex Manual). Each file contains only the basic representation of the graph's transitions, so it is unaffected by graph-layout editing and changes only when the topology or classification is modified. A JSON format is under construction; see the dumps folder.
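As an illustration of how DELAS entries tie a canonical word to an inflection rule, the sketch below groups lemmas by inflection code. It assumes each row has the shape `lemma,CODE` with optional `+Marker` suffixes (as in `abafar,V005` or `abel,N004+Pr`), no header row, and no escaped commas inside the lemma. The function names `parse_delas_line` and `group_by_code` are hypothetical helpers, not part of this repository or of Unitex.

```python
from collections import defaultdict


def parse_delas_line(line):
    """Split 'lemma,CODE+Marker' into (lemma, code, markers).

    Assumption: the lemma contains no escaped comma.
    """
    lemma, _, rest = line.strip().partition(",")
    code, *markers = rest.split("+")
    return lemma, code, markers


def group_by_code(lines):
    """Map each inflection code (e.g. 'V005') to the lemmas that use it."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        lemma, code, _ = parse_delas_line(line)
        groups[code].append(lemma)
    return dict(groups)


# Sample rows copied from the DELAS excerpts quoted in the issues.
sample = [
    "abafar,N004",
    "abafar,V005",
    "cortar,V005",
    "recortar,V005",
    "abel,N004+Pr",
]
groups = group_by_code(sample)
print(groups["V005"])  # ['abafar', 'cortar', 'recortar']
```

To run it on the real file, the list `sample` could be replaced by the lines of `DELAS.csv`, after checking whether the file has a header row.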

References

Updating sources

The spreadsheets can be downloaded here as data/*.csv.

Any other file must be validated by software (see SQL back-end).

License

  • Unitex sources: LGPLLR - Lesser General Public License For Linguistic Resources.

  • Other texts and sources: CC-BY-4.0 - Attribution 4.0 International.

unitex-pt-br's People

Contributors

ppkrauss

unitex-pt-br's Issues

Check generated forms

See the V005 graph, using "cortar" as a sample.

... where did the 600 go

grep abafar, DELAS.csv 
	abafar,N004
	abafar,V005
grep ,abafar Delaf2015v04.dic
	abafar,abafar.V:U1s
	abafar,abafar.V:U3s
	abafar,abafar.V:W
	abafar,abafar.V:W1s
	abafar,abafar.V:W3s


grep cortar, DELAS.csv 
	cortar,V005
	entrecortar,V005
	intercortar,V005
	recortar,V005
grep cortar, Delaf2015v04.dic 
	cortar,cortar.V:U1s
	cortar,cortar.V:U3s
	cortar,cortar.V:W
	cortar,cortar.V:W1s
	cortar,cortar.V:W3s


grep ,beber Delaf2015v04.dic   
  ~2522 lines! 
	beba-a,beber.V+PRO:Y3s
	beba-as,beber.V+PRO:Y3s
	beba,beber.V:S1s
	beba,beber.V:S3s
	beba,beber.V:Y3s
	bebais,beber.V:S2p
	beba-lhe,beber.V+PRO:Y3s
	beba-lhes,beber.V+PRO:Y3s
	bebam,beber.V:S3p
	bebam,beber.V:Y3p
	beba-me,beber.V+PRO:Y3s
	bebam-lhe,beber.V+PRO:Y3p
	bebam-lhes,beber.V+PRO:Y3p
	bebam-me,beber.V+PRO:Y3p
	bebam-na,beber.V+PRO:Y3p
	...
	bebera,beber.V:Q1s
	bebera,beber.V:Q3s
	bebendo-te,beber.V+PRO:G

	beberada,beberar.V:Kfs
	beberadas,beberar.V:Kfp
	beberado,beberar.V:Kms
	beberados,beberar.V:Kmp
	beberagem,beberagem.N:fs
	beberagens,beberagem.N:fp
	beberai,beberar.V:Y2p
	beberai-la,beberar.V+PRO:P2p
	beberai-las,beberar.V+PRO:P2p
	beberai-lo,beberar.V+PRO:P2p
	beberai-los,beberar.V+PRO:P2p
	beberai-nos,beberar.V+PRO:P2p
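The comparison above can be made less manual by counting inflected forms per lemma. The following is a minimal sketch, assuming every DELAF line has the shape `form,lemma.CLASS:features` seen in the listings (for example `beba-a,beber.V+PRO:Y3s`) and that neither the form nor the lemma contains an escaped comma or dot. The names `parse_delaf_line` and `forms_per_lemma` are hypothetical.

```python
from collections import Counter


def parse_delaf_line(line):
    """'beba-a,beber.V+PRO:Y3s' -> ('beba-a', 'beber', 'V+PRO', ['Y3s'])."""
    form, _, rest = line.strip().partition(",")
    lemma, _, info = rest.partition(".")
    gram, *features = info.split(":")
    return form, lemma, gram, features


def forms_per_lemma(lines):
    """Count how many inflected-form lines each lemma has."""
    counts = Counter()
    for line in lines:
        if line.strip():
            _, lemma, _, _ = parse_delaf_line(line)
            counts[lemma] += 1
    return counts


# Sample lines copied from the grep output above.
sample = [
    "cortar,cortar.V:U1s",
    "cortar,cortar.V:W",
    "beba,beber.V:S1s",
    "beba-a,beber.V+PRO:Y3s",
    "beberada,beberar.V:Kfs",
]
counts = forms_per_lemma(sample)
print(counts["cortar"], counts["beber"], counts["beberar"])  # 2 2 1
```

Grouping by exact lemma matters here: as the listing shows, a plain `grep ,beber` also matches lines whose lemma is `beberar` or `beberagem`, so its line count is not the number of forms of `beber` alone.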

Split DELAS into DELAS and DELAS-Pr

There are many "pure named entities" (proper nouns) that are not real "dictionary words".

Examples: abel,N004+Pr, abelson,N004+Pr, abélson,N004+Pr, abigail,N104+Pr, abília,N104+Pr, abílio,N004+Pr, abraão,N004+Pr, abraham,N004+Pr, abrantes,N306+Pr, abrão,N004+Pr, zico,N004+Pr, zilda,N104+Pr, zimbábue,N304+Pr, zingarelli,N306+Pr, zoroastro,N004+Pr, zucolotto,N306+Pr, zurique,N104+Pr

Many are common human given names (modern, such as zico and zilda, or classical, such as zoroastro) or surnames (zingarelli, zucolotto). Others are common toponyms, such as country names and city names (abrantes, zurique).

So DELAS-Pr must include a column indicating the type of entity the name usually denotes (e.g. Italy is a country name, but in Brazil it is also a female given name).

There are other sources of names and their usage statistics: see datasets-br/prenomes or datasets-br/city-codes for confirmed Brazilian names, and world-cities, etc. for international ones.
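A minimal sketch of the proposed split, assuming proper names are exactly the rows that carry a `+Pr` marker after the inflection code, as in the examples above. The function name `split_proper_names` is hypothetical, and the entity-type column is deliberately left out because it would have to come from external sources such as those listed.

```python
def split_proper_names(lines):
    """Separate DELAS rows into (common words, proper names).

    Assumption: a row is a proper name if 'Pr' appears among the
    '+'-separated markers that follow the inflection code.
    """
    common, proper = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _, _, rest = line.partition(",")
        markers = rest.split("+")[1:]  # drop the inflection code itself
        (proper if "Pr" in markers else common).append(line)
    return common, proper


# Sample rows copied from the examples quoted in the issues.
sample = ["abafar,V005", "abel,N004+Pr", "zurique,N104+Pr", "cortar,V005"]
common, proper = split_proper_names(sample)
print(common)  # ['abafar,V005', 'cortar,V005']
print(proper)  # ['abel,N004+Pr', 'zurique,N104+Pr']
```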

Get only transducers from Graph

The aim of this repository (unitex-pt-br) is to control versions, to offer everything in open formats, and to compare dictionaries. It is not meant to be used as a source or to produce alternative dictionaries.

This repository does not need the layout of the Unitex graphs, only the transducers compiled from them.
