
opendutchwordnet's Introduction

# Global WordNet Grid LMF parser

This repo provides a Python module for working with Open Dutch WordNet. Please check the Issues first to see whether your question has already been answered. The module was created using Python 3.4. The most recent version (1.3) of the resource can be found here. Three PDF files in this repository document the resource:

If you make use of the resource and/or this repository, please cite the following reference:

    @InProceedings{Postma:Miltenburg:Segers:Schoen:Vossen:2016,
      author    = "Marten Postma and Emiel van Miltenburg and Roxane Segers and Anneleen Schoen and Piek Vossen",
      title     = "Open {Dutch} {WordNet}",
      booktitle = "Proceedings of the Eight Global Wordnet Conference",
      year      = 2016,
      address   = "Bucharest, Romania",
    }

## USAGE AND INSTALL

git clone this repository.

The Python module 'lxml' is required; 'pip install lxml' should install it. If you prefer using a virtual environment, calling 'bash install.sh' in the module directory should install everything. Don't forget to source your virtual environment each time you use the module.

Epydoc was used to document the code (http://epydoc.sourceforge.net/). The documentation can be found here. The general idea of the module is that it consists of a lot of classes which are inherited by the main class 'Wn_grid_parser'.

python

>>> from OpenDutchWordnet import Wn_grid_parser

# please check the attribute LICENSE before using this module
>>> print(Wn_grid_parser.LICENSE)

# the attribute 'odwn' stores the path to the most recent version
>>> print(Wn_grid_parser.odwn)

# example of how to use the module
>>> instance = Wn_grid_parser(Wn_grid_parser.odwn)

>>> le_el = instance.les_find_le("havenplaats-n-1")
>>> le_el.get_id()
'havenplaats-n-1'
>>> le_el.get_lemma()
'havenplaats'
>>> le_el.get_pos()
'noun'
>>> le_el.get_sense_id()
'o_n-109910434'
>>> le_el.get_provenance()
'cdb2.2_Auto'
>>> le_el.get_synset_id()
'eng-30-08633957-n'

>>> synset_el = instance.synsets_find_synset('eng-30-00324560-v')
>>> synset_el.get_id()
'eng-30-00324560-v'
>>> synset_el.get_ili()
'i23355'
>>> relation_el = synset_el.get_relations("has_hyperonym")[0]
>>> relation_el.get_provenance()
'pwn'
>>> relation_el.get_reltype()
'has_hyperonym'
>>> relation_el.get_target()
'eng-30-00322847-v'

## Contact

opendutchwordnet's People

Contributors

martenpostma


opendutchwordnet's Issues

Problem installing/using

I tried installing OpenDutchWordnet using install.sh (or actually create_virtual_env.sh). The script succeeds, but when I run the example code, >>> from OpenDutchWordnet import Wn_grid_parser, I get ImportError: No module named 'OpenDutchWordnet'.

Looking at the script, it seems no module is installed; it just creates a virtualenv and installs lxml from requirements.txt. Also, there is no setup.py.

What am I missing?
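
One possible workaround is to put the clone's parent directory on sys.path before importing. This is only a sketch: it assumes the clone directory is itself named 'OpenDutchWordnet' and acts as the package, which nothing here confirms beyond the import statement in the README. The helper name add_clone_parent_to_path is made up for this illustration.

```python
import os
import sys


def add_clone_parent_to_path(clone_dir):
    """Make the parent directory of the clone importable.

    Assumption: the clone directory is named 'OpenDutchWordnet' and is
    itself the package, so Python needs its parent on sys.path.
    """
    parent = os.path.dirname(os.path.abspath(clone_dir))
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return parent
```

If the assumption holds, calling the helper with the location of the clone before 'from OpenDutchWordnet import Wn_grid_parser' should let the import resolve without changing the working directory.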

Can only find hyperonym relations

s = set()
for synset in instance.synsets_get_generator():
    for relation in synset.get_all_relations():
        s.add(relation.get_reltype())

print(s)
{'has_hyperonym'}
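
A slightly more informative variant of this check counts how often each relation type occurs instead of only collecting the distinct names. The sketch below relies only on the 'get_all_relations' and 'get_reltype' methods used in the snippet; the function name count_relation_types is illustrative and not part of the module.

```python
from collections import Counter


def count_relation_types(synsets):
    """Tally relation types over an iterable of synset elements.

    Each element is expected to offer get_all_relations(), and each
    relation get_reltype(), as in the snippet above.
    """
    counts = Counter()
    for synset in synsets:
        for relation in synset.get_all_relations():
            counts[relation.get_reltype()] += 1
    return counts
```

Passing instance.synsets_get_generator() to this function would show whether 'has_hyperonym' really is the only type present, and how many such relations there are.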

Get hypernyms

Hi,

I am not able to find hypernyms of a certain lemma.
With Wordnet, it's possible to do
dog = wn.synset('dog.n.01')
dog.hypernyms()
Resulting in:
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]

Is there a similar method for the Dutch Wordnet?

Thank you in advance!
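
The calls shown in the README example can be chained to get something close to this. The following is a minimal sketch, not a confirmed feature of the module: it assumes the lexical entry id (for example "havenplaats-n-1") is already known, that failed lookups return None, and the helper name hypernym_synset_ids is hypothetical. It returns the target synset ids of the 'has_hyperonym' relations rather than synset objects.

```python
def hypernym_synset_ids(instance, le_id):
    """Return ids of the hypernym synsets for one lexical entry.

    Uses only methods shown in the README example: les_find_le,
    get_synset_id, synsets_find_synset, get_relations and get_target.
    """
    le_el = instance.les_find_le(le_id)
    if le_el is None:  # assumption: a failed lookup yields None
        return []
    synset_el = instance.synsets_find_synset(le_el.get_synset_id())
    if synset_el is None:  # assumption: a failed lookup yields None
        return []
    return [rel.get_target()
            for rel in synset_el.get_relations("has_hyperonym")]
```

Each returned id can be passed to instance.synsets_find_synset again to walk further up the hierarchy.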

Change directory structure

Currently, the directory structure of the project is set up in such a way that I have to explicitly do cd .. to use the examples in the README.MD file. This could be solved by moving the sources to another directory (typically the name of the project, i.e. opendutchwordnet for this project). This would also solve the docs and the sources being jumbled together.

Add proper install

I would like to see a proper install script (i.e. a setup.py) so that I can use the wordnet in my own programs without having to explicitly copy the sources somewhere. Is there a reason this wasn't done for this project?

Inconsistent parts of speech

Compare:

>>> x = instance.synsets_get_generator()
>>> s = next(x)
>>> s.get_pos()
'n'

With:

>>> le_el = instance.les_find_le("havenplaats-n-1")
>>> le_el.get_pos()
'noun'
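
Until the two methods agree, callers can normalize the values themselves. A small sketch follows; only the pair 'n'/'noun' is taken from the output above, the remaining entries of the mapping are assumptions about what the resource might contain, and normalize_pos is a made-up helper name.

```python
# Only 'n' <-> 'noun' is confirmed by the output above; the other
# entries are assumptions and may need adjusting.
_LONG_POS = {
    "n": "noun",
    "v": "verb",       # assumption
    "a": "adjective",  # assumption
}


def normalize_pos(pos):
    """Map a short tag such as 'n' to its long form; pass others through."""
    return _LONG_POS.get(pos, pos)
```

With this, normalize_pos(s.get_pos()) and normalize_pos(le_el.get_pos()) can be compared directly.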

Compatibility with Princeton wordnet

It would be handy if the database were provided in the same format as other wordnet databases. Existing interfaces, such as the wordnet-cli/wn interfaces provided by the Princeton WordNet project, could then be used together with this Dutch database. This would be really beneficial for the scripting community at large; e.g. to my knowledge, there is no good offline traditional Dutch dictionary database (except maybe this one), but it is quite hard to properly integrate the Dutch database as it exists now into existing dictionary lookup programs. The same goes for other purposes you might think of. As it stands, the current database, at a glance at least, seems kinda scattered and made up of different kinds of files in different formats.

For example, the Princeton wordnet program contains a directory structure like this (and theoretically already allows you to use different "dictionaries"):

...
/usr/share/wordnet/dict/adj.exc
/usr/share/wordnet/dict/adv.exc
/usr/share/wordnet/dict/cntlist
/usr/share/wordnet/dict/data.adj
/usr/share/wordnet/dict/data.adv
/usr/share/wordnet/dict/data.noun
/usr/share/wordnet/dict/data.verb
/usr/share/wordnet/dict/noun.exc
...

8000+ synonyms for some words

Some words yield more than 8000 synonyms. Try:

from OpenDutchWordnet import Wn_grid_parser
instance = Wn_grid_parser(Wn_grid_parser.odwn)
syns = instance.les_lemma_synonyms('generiek')
print(len(syns), list(syns)[:20], '...')
8812 ['wildkamperen', 'indiscretie', 'inbegrip', 'Noordzee', 'borstwijdte', 'Turkmeense', 'Moskou', 'onvruchtbaarheid', 'augurk', 'expresweg', 'saloondeuren', 'oud-minister', 'raclette', 'samenpakken', 'Armenië', 'werkloosheidsprobleem', 'Zevengebergte', 'luchtledige', 'spellingwijziging', 'treinstaking'] ...
syns = instance.les_lemma_synonyms('kalender')
print(len(syns), list(syns)[:20], '...')
8817 ['wildkamperen', 'indiscretie', 'inbegrip', 'Noordzee', 'borstwijdte', 'Turkmeense', 'Moskou', 'onvruchtbaarheid', 'augurk', 'expresweg', 'saloondeuren', 'oud-minister', 'raclette', 'samenpakken', 'Armenië', 'werkloosheidsprobleem', 'Zevengebergte', 'luchtledige', 'spellingwijziging', 'treinstaking'] ...
syns = instance.les_lemma_synonyms('post')
print(len(syns), list(syns)[:20], '...')
8834 ['wildkamperen', 'indiscretie', 'inbegrip', 'Noordzee', 'borstwijdte', 'Turkmeense', 'Moskou', 'onvruchtbaarheid', 'augurk', 'expresweg', 'saloondeuren', 'oud-minister', 'raclette', 'samenpakken', 'Armenië', 'werkloosheidsprobleem', 'Zevengebergte', 'luchtledige', 'spellingwijziging', 'treinstaking'] ...

Most words work fine (in a list of 90 words, 10 yield the 8000+ synonyms and the others do not, without any apparent logic). The xml entries in odwn_orbn_gwg-LMF_1.3.xml.gz look fine (to me) too. Is this a bug in les_lemma_synonyms?

Regards,

Marc
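
To narrow this down, one could scan a whole word list and report which lemmas return an implausibly large set. The sketch below takes the lookup as a plain callable so it can be tried with instance.les_lemma_synonyms; the helper name find_oversized and the default threshold of 1000 are arbitrary choices for illustration.

```python
def find_oversized(lookup, lemmas, threshold=1000):
    """Return {lemma: size} for lemmas whose result exceeds threshold.

    lookup is any callable mapping a lemma to a collection of synonyms,
    for example instance.les_lemma_synonyms.
    """
    oversized = {}
    for lemma in lemmas:
        size = len(lookup(lemma))
        if size > threshold:
            oversized[lemma] = size
    return oversized
```

Comparing the affected lemmas with the unaffected ones may reveal what they have in common.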
