The file was generated by combining the following data:
- Princeton WordNet 3.0 was used to obtain English glosses and English terms for synset IDs.
- The unreleased 2010-12 version of UWN and MENTA provided candidate terms in Portuguese, candidate glosses in Portuguese (from Wikipedia), and candidate terms in Spanish.
- The EuroWordNet base concept list (5000_bc.xml) provides the base concept numbers. The original file was mapped from WordNet 2.0 to 3.0 using the mappings from WN-Map. When a WordNet 2.0 synset had multiple mappings, all possible WordNet 3.0 synsets were kept. Hence, there may be multiple entries with the same base concept number.
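To illustrate why one base concept number can appear on several entries, here is a minimal Python sketch of a one-to-many remapping. It is not the script that produced the file; the function name, the data layout, and the synset identifiers are hypothetical placeholders.

```python
def remap_base_concepts(base_concepts, wn20_to_wn30):
    """Expand (base concept number, WordNet 2.0 synset) pairs to WordNet 3.0.

    base_concepts: list of (bc_number, wn20_synset) pairs.
    wn20_to_wn30:  dict from a WordNet 2.0 synset to a list of 3.0 synsets.
    Every mapped 3.0 synset is kept, so a base concept number may repeat.
    """
    rows = []
    for bc_number, old_synset in base_concepts:
        for new_synset in wn20_to_wn30.get(old_synset, []):
            rows.append((bc_number, new_synset))
    return rows


# Hypothetical identifiers, for illustration only.
base_concepts = [(4, "old-1"), (5, "old-2")]
mapping = {"old-1": ["new-1"], "old-2": ["new-2a", "new-2b"]}
rows = remap_base_concepts(base_concepts, mapping)
```

In this toy input, base concept 5 ends up on two rows because its WordNet 2.0 synset maps to two WordNet 3.0 synsets.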
- Read the English gloss and the English words.
- Come up with Portuguese words that express the same meaning as the English gloss and have the part of speech indicated by the first letter of the WordNet synset identifier (n: noun, v: verb, a: adjective, r: adverb), and write them into "PT-Words-Man".
- Optionally: Write a Portuguese gloss into the "PT-Gloss" field. This may be shorter than the English gloss. If the gloss contains English example sentences, then only translate them if their translations sound natural in Portuguese and if the translation actually contains the Portuguese words added to the synset.
Example:
<row>
<BC>4</BC>
<WN-3.0-Synset>n6269</WN-3.0-Synset>
<PT-Words-Man>vida</PT-Words-Man>
<PT-Words-Candidates>vida</PT-Words-Candidates>
<EN-Gloss>living things collectively; "the oceans are teeming with life"</EN-Gloss>
<EN-Words>life</EN-Words>
<PT-Gloss>coisas vivas, tomadas coletivamente; "os oceanos estão repletos de vida"</PT-Gloss>
<PT-Gloss-Prop />
<Spa-Words-Prop>vida</Spa-Words-Prop>
<Comments />
</row>
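As a rough illustration of how such a row can be read programmatically, the following Python sketch parses a shortened copy of the example row and derives the part of speech from the first letter of the synset identifier. The helper names `parse_row` and `synset_pos` are hypothetical and not part of the project.

```python
import xml.etree.ElementTree as ET

# Part-of-speech letters as listed in the annotation steps.
POS_BY_PREFIX = {"n": "noun", "v": "verb", "a": "adjective", "r": "adverb"}

# Shortened copy of the example row.
ROW = """<row>
<BC>4</BC>
<WN-3.0-Synset>n6269</WN-3.0-Synset>
<PT-Words-Man>vida</PT-Words-Man>
<EN-Words>life</EN-Words>
</row>"""


def parse_row(xml_text):
    """Return a dict from field name to text for one <row> element."""
    row = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in row}


def synset_pos(synset_id):
    """Map the first letter of a synset identifier to a part of speech."""
    prefix = synset_id[:1]
    if prefix not in POS_BY_PREFIX:
        raise ValueError(f"unknown part-of-speech prefix in {synset_id!r}")
    return POS_BY_PREFIX[prefix]


fields = parse_row(ROW)
pos = synset_pos(fields["WN-3.0-Synset"])
```

If a row repeats a field, this sketch keeps only the last value.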
Additional considerations for Step 2:
- Be careful not to be misled by English words with multiple meanings. You can use the Portuguese and Spanish candidates as a guide, but keep in mind that they were automatically generated and may be entirely wrong. The main criterion is whether the Portuguese word corresponds to the English gloss.
- The PT-Words-Man field can contain multiple words, separated by commas, or you can use more than one PT-Words-Man element. In either case, the words should ideally be sorted by relevance (the most commonly used ones first).
- If an entry has been checked and it seems that there are no relevant Portuguese words to express a concept, then use
- If there are expressions that could be used to express the concept, but these expressions are not real words or lexicalized expressions that would appear in a dictionary, then use the following syntax: mover reflexivamente
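Because manual words may arrive either comma-separated or spread over several PT-Words-Man elements, a consumer of the file has to handle both forms. A minimal Python sketch, assuming a row shaped like the example above, is given below. The function name `manual_words` and the sample words are illustrative only.

```python
import xml.etree.ElementTree as ET


def manual_words(row_xml):
    """Collect PT-Words-Man words from one row, preserving their order.

    Handles both comma-separated values and repeated elements, and drops
    empty entries and duplicates.
    """
    row = ET.fromstring(row_xml)
    words = []
    for child in row:
        if child.tag != "PT-Words-Man" or not child.text:
            continue
        for word in child.text.split(","):
            word = word.strip()
            if word and word not in words:
                words.append(word)
    return words


# Hypothetical row mixing both forms, for illustration only.
sample = (
    "<row>"
    "<PT-Words-Man>vida, seres vivos</PT-Words-Man>"
    "<PT-Words-Man>vida</PT-Words-Man>"
    "</row>"
)
words = manual_words(sample)
```

Order is preserved on purpose, since the words are meant to be sorted by relevance.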
It might be a good idea to have a WordNet browser open as well when doing the annotation, so that you can check hyponyms/hypernyms (or subclasses/parent classes):
http://www.lexvo.org/uwn/entity/s/n2084071
It also helps to keep an online English-Portuguese translation dictionary open.
- Alexandre Rademaker
- Gerard de Melo
- Valeria de Paiva
- Rafael Haeusler
OpenWN-PT by EMAp, Getulio Vargas Foundation is licensed under a Creative Commons Attribution-ShareAlike 3.0 Brazil License.
Based on a work at github.com.
See the LICENSE file for details.
The first step is to run the to-francis.lisp code to generate wn-data-por.tab. The following steps then fix some problems with this file.
grep -v \" wn-data-por.tab > wn-data-por-1.tab
sed 's/_/ /g' wn-data-por-1.tab > wn-data-por-2.tab
mv wn-data-por-2.tab wn-data-por.tab
rm wn-data-por-?.tab
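The commands above drop every line that contains a double quote and then replace underscores with spaces. For environments without grep and sed, the same two transformations can be sketched in Python as follows. The function names are hypothetical, and this is an approximation of the shell steps, not a script from the repository.

```python
def clean_tab_lines(lines):
    """Yield cleaned lines: skip lines with a double quote, turn '_' into ' '."""
    for line in lines:
        if '"' in line:
            continue  # same effect as: grep -v \"
        yield line.replace("_", " ")  # same effect as: sed 's/_/ /g'


def clean_tab_file(src_path, dst_path):
    """Write a cleaned copy of src_path to dst_path (UTF-8 assumed)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(clean_tab_lines(src))


# Hypothetical sample lines, for illustration only.
sample = ['a\tb_c\n', 'x\t"y"\n', 'plain\n']
cleaned = list(clean_tab_lines(sample))
```

Unlike the shell version, `clean_tab_file` writes to a separate output path and leaves the input file in place.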
To validate a file against the DTD, run at the prompt:
$ xmllint --noout --dtdvalid wordnet.dtd uwn-pt-sorted-aa.xml
After resolving the current problems with this version of the DTD, modify the DTD according to each item below and fix the XML files based on the new validation errors that will appear.
- Remove the variants of PT-Words-Cand and PT-Word-Cand
- Remove the variants of Spa-Words-Sug and SPA-Words-Sug
- Remove the variants in the attributes of PT-Words-Man
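Before tightening the DTD, it can help to see which element-name variants actually occur in a file. The Python sketch below counts element names and groups those that differ only in letter case. The function names and the `rows` root element in the sample are assumptions made for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_counts(xml_text):
    """Count how often each element name occurs in an XML document."""
    root = ET.fromstring(xml_text)
    return Counter(elem.tag for elem in root.iter())


def case_variants(counts):
    """Group element names that differ only in letter case."""
    groups = {}
    for tag in counts:
        groups.setdefault(tag.lower(), []).append(tag)
    return {key: sorted(tags) for key, tags in groups.items() if len(tags) > 1}


# Hypothetical document, for illustration only.
sample = (
    "<rows>"
    "<row><Spa-Words-Sug>vida</Spa-Words-Sug></row>"
    "<row><SPA-Words-Sug>vida</SPA-Words-Sug></row>"
    "</rows>"
)
counts = tag_counts(sample)
variants = case_variants(counts)
```

Variants that differ in spelling rather than case, such as PT-Words-Cand and PT-Word-Cand, will not be grouped by `case_variants`, but they show up side by side in the output of `tag_counts`.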