cltl / opendutchwordnet

This repo provides a python module to work with Open Dutch WordNet. It was created using python 3.4.

License: Other

opendutchwordnet's Introduction

# Global WordNet Grid LMF parser

This repo provides a Python module for working with Open Dutch WordNet. It was created using Python 3.4. Please check the Issues first to see whether your question has already been answered. The most recent version (1.3) of the resource can be found here. Three PDF files in this repository document the resource:

odwn_documentation.pdf: technical report of the creation of version 1.0
gwc2016_odwn13.pdf: paper accepted at the Global WordNet Conference 2016
slides_gwc2016_odwn13.pdf: slides from presenting ODWN at the Global WordNet Conference 2016

If you make use of the resource and/or this repository, please cite the following reference:

@InProceedings{Postma:Miltenburg:Segers:Schoen:Vossen:2016,
  author    = "Marten Postma and Emiel van Miltenburg and Roxane Segers and Anneleen Schoen and Piek Vossen",
  title     = "Open {Dutch} {WordNet}",
  booktitle = "Proceedings of the Eight Global Wordnet Conference",
  year      = 2016,
  address   = "Bucharest, Romania",
}

## USAGE AND INSTALL

git clone this repository.

The Python module 'lxml' is required; 'pip install lxml' should install it. If you prefer a virtual environment, calling 'bash install.sh' in the module directory should install everything. Remember to source your virtual environment each time you use the module.

Epydoc was used to document the code (http://epydoc.sourceforge.net/). The documentation can be found here. The general idea of the module is that it consists of a number of classes, all of which are inherited by the main class 'Wn_grid_parser'.

>>> from OpenDutchWordnet import Wn_grid_parser

# please check the attribute LICENSE before using this module
>>> print(Wn_grid_parser.LICENSE)

# the attribute 'odwn' stores the path to the most recent version
>>> print(Wn_grid_parser.odwn)

# example of how to use the module
>>> instance = Wn_grid_parser(Wn_grid_parser.odwn)

>>> le_el = instance.les_find_le("havenplaats-n-1")
>>> le_el.get_id()
'havenplaats-n-1'
>>> le_el.get_lemma()
'havenplaats'
>>> le_el.get_pos()
'noun'
>>> le_el.get_sense_id()
'o_n-109910434'
>>> le_el.get_provenance()
'cdb2.2_Auto'
>>> le_el.get_synset_id()
'eng-30-08633957-n'

>>> synset_el = instance.synsets_find_synset('eng-30-00324560-v')
>>> synset_el.get_id()
'eng-30-00324560-v'
>>> synset_el.get_ili()
'i23355'
>>> relation_el = synset_el.get_relations("has_hyperonym")[0]
>>> relation_el.get_provenance()
'pwn'
>>> relation_el.get_reltype()
'has_hyperonym'
>>> relation_el.get_target()
'eng-30-00322847-v'

## Contact

Piek Vossen ([email protected])

opendutchwordnet's Issues

Check if a word is a noun or not

Using the Wordnet for English you can look for a word, e.g., car and it will tell you that it is a noun. Is there a similar function in this module?
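One possible approach, sketched with only the calls shown in the README (`les_find_le` and `get_pos`). It assumes you already know the lexical entry identifier (for example "havenplaats-n-1") and that `les_find_le` returns `None` for an unknown identifier; both are assumptions. `is_noun` is a hypothetical helper, not part of the module.

```python
def is_noun(parser, le_id):
    """Return True if the lexical entry with this identifier is a noun.

    `parser` is expected to be a Wn_grid_parser instance.
    """
    le_el = parser.les_find_le(le_id)
    if le_el is None:  # assumption: unknown identifiers yield None
        return False
    # The README example shows get_pos() returning 'noun' for a noun entry.
    return le_el.get_pos() == "noun"
```

With a parser created as in the README, `is_noun(instance, "havenplaats-n-1")` would be expected to return `True`.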

Problem installing/using

I tried installing OpenDutchWordnet, using the install.sh (or actually create_virtual_env.sh). The script succeeds, but if I run the example code, >>> from OpenDutchWordnet import Wn_grid_parser I get ImportError: No module named 'OpenDutchWordnet'.

If I look at the script, it seems no module is installed, it just creates a virtualenv and installs lxml from requirements.txt. Also, there is no setup.py.

What am I missing?
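A possible workaround, sketched under the assumption that the clone directory is named `OpenDutchWordnet` and that Python can import it as a package once its parent directory is on `sys.path`. `make_importable` is a hypothetical helper, not something the repository provides.

```python
import sys
from pathlib import Path


def make_importable(clone_dir):
    """Put the parent directory of the clone on sys.path.

    Returns the directory that was added (or was already present).
    """
    parent = str(Path(clone_dir).resolve().parent)
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return parent
```

After calling `make_importable("/path/to/OpenDutchWordnet")` (a placeholder path), `from OpenDutchWordnet import Wn_grid_parser` should work from any working directory, provided the assumption above holds.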

Can only find hyperonym relations

s = set()
for synset in instance.synsets_get_generator():
    for relation in synset.get_all_relations():
        s.add(relation.get_reltype())

print(s)
{'has_hyperonym'}
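A sketch of the same survey as a function that counts each relation type, with the loop variable used consistently so that a stale name from an earlier interactive session cannot affect the result. It relies only on `synsets_get_generator`, `get_all_relations` and `get_reltype` as they appear in this thread; `count_relation_types` is a hypothetical helper.

```python
from collections import Counter


def count_relation_types(parser):
    """Count how often each relation type occurs across all synsets."""
    counts = Counter()
    for synset in parser.synsets_get_generator():
        for relation in synset.get_all_relations():
            counts[relation.get_reltype()] += 1
    return counts
```

Printing `count_relation_types(instance)` shows how many relations of each type were found, which makes it easier to tell whether other relation types are absent or just rare.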

Get hypernyms

Hi,

I am not able to find hypernyms of a certain lemma.
With Wordnet, it's possible to do
dog = wn.synset('dog.n.01')
dog.hypernyms()
Resulting in:
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]

Is there a similar method for the Dutch Wordnet?

Thank you in advance!
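A minimal sketch of one way to do this with the calls shown in the README: look up the synset, then follow its "has_hyperonym" relations. It assumes the lookup methods return `None` when nothing is found, which is not confirmed by the README. Both helpers are hypothetical and return synset identifiers rather than synset objects.

```python
def hypernym_ids(parser, synset_id):
    """Return the target synset ids of the 'has_hyperonym' relations."""
    synset_el = parser.synsets_find_synset(synset_id)
    if synset_el is None:  # assumption: unknown ids yield None
        return []
    return [rel.get_target() for rel in synset_el.get_relations("has_hyperonym")]


def hypernym_ids_for_le(parser, le_id):
    """Return hypernym synset ids for a lexical entry such as 'havenplaats-n-1'."""
    le_el = parser.les_find_le(le_id)
    if le_el is None:  # assumption: unknown ids yield None
        return []
    return hypernym_ids(parser, le_el.get_synset_id())
```

Going by the README example, `hypernym_ids(instance, 'eng-30-00324560-v')` would be expected to include 'eng-30-00322847-v'.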

Change directory structure

Currently, the directory structure of the project is set up in such a way that I have to explicitly do cd .. to use the examples in the README.MD file. This could be solved by moving the sources to another directory (typically the name of the project, i.e. opendutchwordnet for this project). This would also solve the docs and the sources being jumbled together.

How do I obtain synonyms?

Add proper install

I would like to see a proper install script (i.e. a setup.py) so that I can use the wordnet in my own programs without having to explicitly copy the sources somewhere. Is there a reason this wasn't done for this project?

Can you also use ODWN within a Java program?

Inconsistent parts of speech

Compare:

>>> x = instance.synsets_get_generator()
>>> s = next(x)
>>> s.get_pos()
'n'

With:

>>> le_el = instance.les_find_le("havenplaats-n-1")
>>> le_el.get_pos()
'noun'
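Until the two calls agree, a small normalising helper can bridge the gap. This is a sketch: only the pair 'n' / 'noun' is shown in this thread, so the other entries in the mapping are assumptions, and `normalise_pos` is a hypothetical helper rather than part of the module.

```python
# Only 'n' -> 'noun' is confirmed by the examples above; 'v' and 'a' are assumed.
POS_LONG_NAMES = {"n": "noun", "v": "verb", "a": "adjective"}


def normalise_pos(pos):
    """Map a short part-of-speech tag to its long form; pass other values through."""
    return POS_LONG_NAMES.get(pos, pos)
```

With this in place, `normalise_pos(s.get_pos()) == normalise_pos(le_el.get_pos())` compares the two values on equal terms.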

Compatibility with Princeton wordnet

It would be handy if the database were provided in the same format as other wordnet databases. If this were the case, existing interfaces, such as the wordnet-cli/wn interfaces provided by the Princeton WordNet project, could be used together with this Dutch database. This would be really beneficial for the scripting community at large; e.g. to my knowledge, there is no good offline traditional Dutch dictionary database (except maybe this one), but it's quite hard to properly integrate the Dutch database as it exists now into existing dictionary lookup programs. The same goes for other purposes you might think of. As it stands now, the current database, at a glance at least, seems kind of scattered and made up of different kinds of files in different formats.

For example, the Princeton wordnet program contains a directory structure like this (and theoretically already allows you to use different "dictionaries"):

...
/usr/share/wordnet/dict/adj.exc
/usr/share/wordnet/dict/adv.exc
/usr/share/wordnet/dict/cntlist
/usr/share/wordnet/dict/data.adj
/usr/share/wordnet/dict/data.adv
/usr/share/wordnet/dict/data.noun
/usr/share/wordnet/dict/data.verb
/usr/share/wordnet/dict/noun.exc
...

'grondtoon' wrongly marked as hyponym of 'kleur'

8000+ synonyms for some words

Some words yield more than 8000 synonyms. Try:

from OpenDutchWordnet import Wn_grid_parser
instance = Wn_grid_parser(Wn_grid_parser.odwn)
syns = instance.les_lemma_synonyms('generiek')
print(len(syns), list(syns)[:20], '...')
8812 ['wildkamperen', 'indiscretie', 'inbegrip', 'Noordzee', 'borstwijdte', 'Turkmeense', 'Moskou', 'onvruchtbaarheid', 'augurk', 'expresweg', 'saloondeuren', 'oud-minister', 'raclette', 'samenpakken', 'Armenië', 'werkloosheidsprobleem', 'Zevengebergte', 'luchtledige', 'spellingwijziging', 'treinstaking'] ...
syns = instance.les_lemma_synonyms('kalender')
print(len(syns), list(syns)[:20], '...')
8817 ['wildkamperen', 'indiscretie', 'inbegrip', 'Noordzee', 'borstwijdte', 'Turkmeense', 'Moskou', 'onvruchtbaarheid', 'augurk', 'expresweg', 'saloondeuren', 'oud-minister', 'raclette', 'samenpakken', 'Armenië', 'werkloosheidsprobleem', 'Zevengebergte', 'luchtledige', 'spellingwijziging', 'treinstaking'] ...
syns = instance.les_lemma_synonyms('post')
print(len(syns), list(syns)[:20], '...')
8834 ['wildkamperen', 'indiscretie', 'inbegrip', 'Noordzee', 'borstwijdte', 'Turkmeense', 'Moskou', 'onvruchtbaarheid', 'augurk', 'expresweg', 'saloondeuren', 'oud-minister', 'raclette', 'samenpakken', 'Armenië', 'werkloosheidsprobleem', 'Zevengebergte', 'luchtledige', 'spellingwijziging', 'treinstaking'] ...

Most words work fine (in a list of 90 words, 10 yield the 8000+ synonyms and the others do not, without any apparent logic). The XML entries in odwn_orbn_gwg-LMF_1.3.xml.gz look fine (to me) too. Is this a bug in les_lemma_synonyms?

Regards,

Marc
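To narrow down which lemmas are affected, a sketch like the following could flag suspiciously large results. It uses only `les_lemma_synonyms` as called above; the default threshold of 1000 is an arbitrary assumption, and `oversized_synonym_sets` is a hypothetical helper.

```python
def oversized_synonym_sets(parser, lemmas, threshold=1000):
    """Map each lemma whose synonym count exceeds `threshold` to that count."""
    flagged = {}
    for lemma in lemmas:
        count = len(parser.les_lemma_synonyms(lemma))
        if count > threshold:
            flagged[lemma] = count
    return flagged
```

Calling `oversized_synonym_sets(instance, ['generiek', 'kalender', 'post'])` returns a dictionary holding only the lemmas whose synonym count exceeds the threshold, which can then be compared against their XML entries.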