
pywsd

Python Implementations of Word Sense Disambiguation (WSD) technologies:

  • Lesk algorithms

    • Original Lesk (Lesk, 1986)
    • Adapted/Extended Lesk (Banerjee and Pedersen, 2002/2003)
    • Simple Lesk (with definition, example(s) and hyper+hyponyms)
    • Cosine Lesk (uses cosine similarity between context and signature instead of raw overlap counts; see the sketch after this list)
  • Maximizing Similarity (see also, Pedersen et al. (2003))

    • Path similarity (Wu-Palmer, 1994; Leacock and Chodorow, 1998)
    • Information Content (Resnik, 1995; Jiang and Conrath, 1997; Lin, 1998)
  • Baselines
    • Random sense
    • First NLTK sense
    • Highest lemma counts
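
A minimal sketch of the cosine-Lesk idea above (an illustration, not pywsd's actual code): represent the context and each candidate sense's gloss as bag-of-words vectors and score senses by cosine similarity rather than by a raw overlap count.

import math
from collections import Counter

from nltk.corpus import wordnet as wn

def cosine(a, b):
    """Cosine similarity between two token lists, as bags of words."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def cosine_lesk_sketch(context_tokens, synsets):
    """Pick the synset whose gloss is most cosine-similar to the context."""
    return max(synsets,
               key=lambda ss: cosine(context_tokens, ss.definition().split()))

# e.g. cosine_lesk_sketch('I went to the bank to deposit my money'.split(),
#                         wn.synsets('bank', pos='n'))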

NOTE: PyWSD only supports Python 3 now (pywsd>=1.2.0). If you're using Python 2, the last possible version is pywsd==1.1.7.

Install

pip install -U nltk
python -m nltk.downloader 'popular'
pip install -U pywsd

Usage

$ python
>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> answer = simple_lesk(sent, ambiguous, pos='n')
>>> print(answer)
Synset('depository_financial_institution.n.01')
>>> print(answer.definition())
a financial institution that accepts deposits and channels the money into lending activities

For all-words WSD, try:

>>> from pywsd import disambiguate
>>> from pywsd.similarity import max_similarity as maxsim
>>> disambiguate('I went to the bank to deposit my money')
[('I', None), ('went', Synset('run_low.v.01')), ('to', None), ('the', None), ('bank', Synset('depository_financial_institution.n.01')), ('to', None), ('deposit', Synset('deposit.v.02')), ('my', None), ('money', Synset('money.n.03'))]
>>> disambiguate('I went to the bank to deposit my money', algorithm=maxsim, similarity_option='wup', keepLemmas=True)
[('I', 'i', None), ('went', 'go', Synset('sound.v.02')), ('to', 'to', None), ('the', 'the', None), ('bank', 'bank', Synset('bank.n.06')), ('to', 'to', None), ('deposit', 'deposit', Synset('deposit.v.02')), ('my', 'my', None), ('money', 'money', Synset('money.n.01'))]

To read pre-computed signatures per synset:

>>> from pywsd.lesk import cached_signatures
>>> cached_signatures['dog.n.01']['simple']
{'canid', 'belgian_griffon', 'breed', 'barker', ..., 'genus', 'newfoundland'}
>>> cached_signatures['dog.n.01']['adapted']
{'canid', 'belgian_griffon', 'breed', 'leonberg', ..., 'newfoundland', 'pack'}

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')[0]
Synset('dog.n.01')
>>> dog = wn.synsets('dog')[0]
>>> dog.name()
'dog.n.01'
>>> cached_signatures[dog.name()]['simple']
{'canid', 'belgian_griffon', 'breed', 'barker', ..., 'genus', 'newfoundland'}

Cite

To cite pywsd:

Liling Tan. 2014. Pywsd: Python Implementations of Word Sense Disambiguation (WSD) Technologies [software]. Retrieved from https://github.com/alvations/pywsd

In bibtex:

@misc{pywsd14,
author =   {Liling Tan},
title =    {Pywsd: Python Implementations of Word Sense Disambiguation (WSD) Technologies [software]},
howpublished = {https://github.com/alvations/pywsd},
year = {2014}
}

References

  • Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation (SIGDOC '86), Virginia DeBuys (Ed.). ACM, New York, NY, USA, 24-26. DOI=10.1145/318723.318728 http://doi.acm.org/10.1145/318723.318728

  • Satanjeev Banerjee and Ted Pedersen. 2002. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing '02), Alexander F. Gelbukh (Ed.). Springer-Verlag, London, UK, 136-145.

  • Satanjeev Banerjee and Ted Pedersen. 2003. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 805–810, Acapulco.

  • Jay J. Jiang and David W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of International Conference on Research in Computational Linguistics, Taiwan.

  • Claudia Leacock and Martin Chodorow. 1998. Combining local context and WordNet similarity for word sense identification. In Fellbaum 1998, pp. 265–283.

  • Yoong Keok Lee, Hwee Tou Ng, and Tee Kiah Chia. 2004. Supervised word sense disambiguation with support vector machines and multiple knowledge sources. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text.

  • Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI.

  • Linlin Li, Benjamin Roth and Caroline Sporleder. 2010. Topic Models for Word Sense Disambiguation and Token-based Idiom Detection. The 48th Annual Meeting of the Association for Computational Linguistics (ACL). Uppsala, Sweden.

  • Andrea Moro, Roberto Navigli, Francesco Maria Tucci and Rebecca J. Passonneau. 2014. Annotating the MASC Corpus with BabelNet. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland.

  • Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: a wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations (ACLDemos '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 78-83.

  • Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python (1st ed.). O'Reilly Media, Inc.

  • Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for Word Sense Disambiguation. Proceedings of the 12th conference of the European chapter of the Association for Computational Linguistics (EACL-2009). Athens, Greece.

pywsd's People

Contributors

alvations, chrisji, gauravjuvekar, goodmami, kmouratidis

pywsd's Issues

Not compatible with Python 3?

Hi,

It seems pywsd is not compatible with Python 3.

If I run from pywsd.lesk import simple_lesk in Python 3, the following error is given:

Traceback (most recent call last):
  File "wsd.py", line 28, in <module>
    from pywsd.lesk import simple_lesk
  File "/home/coiby/nlp/pywsd/pywsd/__init__.py", line 9, in <module>
    import lesk
ImportError: No module named 'lesk'

If I manually add pywsd to the path (sys.path.append(os.path.join(os.path.dirname(__file__), 'pywsd'))), another issue occurs:

Traceback (most recent call last):
  File "wsd.py", line 50, in <module>
    answer = simple_lesk(sent, ambiguous)
  File "/home/coiby/nlp/pywsd/pywsd/lesk.py", line 151, in simple_lesk
    context_sentence = lemmatize_sentence(context_sentence)
  File "pywsd/utils.py", line 104, in lemmatize_sentence
    for word, pos in postagger(tokenizer(sentence)):
  File "/usr/local/lib/python3.4/dist-packages/nltk/tokenize/__init__.py", line 104, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/usr/local/lib/python3.4/dist-packages/nltk/tokenize/__init__.py", line 89, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python3.4/dist-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python3.4/dist-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python3.4/dist-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python3.4/dist-packages/nltk/tokenize/punkt.py", line 1265, in <listcomp>
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python3.4/dist-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python3.4/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python3.4/dist-packages/nltk/tokenize/punkt.py", line 1278, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: can't use a string pattern on a bytes-like object

Using pywsd in other languages (French, or others)

Good afternoon,

I was wondering if it would be possible to adapt this tool to other languages such as French or Spanish.
If it is feasible, could you give me some indications on how to do these modifications?

Thank you very much!

simple_lesk bug

When running from pywsd.lesk import cached_signatures, simple_lesk

I get the following error

Warming up PyWSD (takes ~10 secs)...
Traceback (most recent call last):
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'bar.n.04'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pywsd/__init__.py", line 33, in <module>
    pywsd.lesk.simple_lesk('This is a foo bar sentence', 'bar')
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pywsd/lesk.py", line 251, in simple_lesk
    from_cache=from_cache)
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pywsd/lesk.py", line 226, in simple_signatures
    from_cache=from_cache)
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pywsd/lesk.py", line 123, in signatures
    from_cache=from_cache)
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pywsd/lesk.py", line 48, in synset_signatures
    return synset_signatures_from_cache(ss, hyperhypo, adapted, original_lesk)
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pywsd/lesk.py", line 35, in synset_signatures_from_cache
    return cached_signatures[ss.name()][signature_type]
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pandas/core/frame.py", line 2980, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'bar.n.04'

Question about the sample presented in Usage

I am just looking into pywsd. It looks very interesting.

I tried the sample with Lesk as per Usage.

I get:

Python 2.7.6 (default, Sep 9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> answer = simple_lesk(sent, ambiguous, nbest=True, keepscore=True)
>>> print answer
[(2, Synset('deposit.v.02')), (2, Synset('bank.n.09')), .......
>>> print answer[0][1].definition()
put into a bank account
>>> print answer[1][1].definition()
a building in which the business of banking transacted

instead of

>>> print answer.definition()
a financial institution that accepts deposits and channels the money into lending activities

I just downloaded and installed NLTK this morning:

>>> import nltk
>>> print nltk.__version__
3.0.5

Any idea what could be going on?

How to cite pywsd

I have written a thesis paper about a computer system in which I used pywsd. I would like to cite the usage of it in my paper, and as of now I am citing it like this:

Alvations (2014) Pywsd. GitHub Repository. Retrieved
20 April, 2014, from https://github.com/alvations/pywsd

Now this is no good because I am citing your user name. Instead I would like to cite your actual name. It might be a good idea to include a small note in the readme with a sample citation for anyone who wants to cite it. If this is possible, please let me know asap.

lesk ImportError

I have Python 2.7 and I work on the Windows 10 operating system. I installed the library as per the documentation. When I try to import the module using

from pywsd.lesk import simple_lesk

I end up getting this error:

Warming up PyWSD (takes ~10 secs)... 
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
c:\python35\lib\site-packages\pandas\io\pickle.py in try_read(path, encoding)
     51     1    1    6
---> 52     2    2    7
     53     3    3    8

c:\python35\lib\site-packages\pandas\core\indexes\base.py in <module>()
     16 from pandas.core.accessor import CachedAccessor
---> 17 from pandas.core.arrays import ExtensionArray
     18 from pandas.core.dtypes.generic import (

c:\python35\lib\site-packages\pandas\core\arrays\__init__.py in <module>()
      1 from .base import ExtensionArray  # noqa
----> 2 from .categorical import Categorical  # noqa

c:\python35\lib\site-packages\pandas\core\arrays\categorical.py in <module>()
     13     ABCSeries, ABCIndexClass, ABCCategoricalIndex)
---> 14 from pandas.core.dtypes.missing import isna, notna
     15 from pandas.core.dtypes.inference import is_hashable

c:\python35\lib\site-packages\pandas\core\dtypes\missing.py in <module>()
      9                       ABCExtensionArray)
---> 10 from .common import (is_string_dtype, is_datetimelike,
     11                      is_datetimelike_v_numeric, is_float_dtype,

c:\python35\lib\site-packages\pandas\core\dtypes\common.py in <module>()
     16                       ABCIndexClass, ABCDateOffset)
---> 17 from .inference import is_string_like, is_list_like
     18 from .inference import *  # noqa

c:\python35\lib\site-packages\pandas\core\dtypes\inference.py in <module>()
      7 from numbers import Number
----> 8 from pandas.compat import (PY2, string_types, text_type,
      9                            string_and_binary_types, re_type)

ImportError: cannot import name 're_type'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
c:\python35\lib\site-packages\pandas\io\pickle.py in try_read(path, encoding)
     56 
---> 57     >>> unpickled_df = pd.read_pickle("./dummy.pkl")
     58     >>> unpickled_df

c:\python35\lib\site-packages\pandas\compat\pickle_compat.py in load(fh, encoding, compat, is_verbose)
    116 
--> 117     # 19939, add timedeltaindex, float64index compat from 15998 move
    118     ('pandas.tseries.tdi', 'TimedeltaIndex'):

c:\python35\lib\pickle.py in load(self)
   1038                 assert isinstance(key, bytes_types)
-> 1039                 dispatch[key[0]](self)
   1040         except _Stop as stopinst:

c:\python35\lib\pickle.py in load_global(self)
   1333         name = self.readline()[:-1].decode("utf-8")
-> 1334         klass = self.find_class(module, name)
   1335         self.append(klass)

c:\python35\lib\pickle.py in find_class(self, module, name)
   1383                 module = _compat_pickle.IMPORT_MAPPING[module]
-> 1384         __import__(module, level=0)
   1385         if self.proto >= 4:

c:\python35\lib\site-packages\pandas\core\indexes\base.py in <module>()
     16 from pandas.core.accessor import CachedAccessor
---> 17 from pandas.core.arrays import ExtensionArray
     18 from pandas.core.dtypes.generic import (

c:\python35\lib\site-packages\pandas\core\arrays\__init__.py in <module>()
      1 from .base import ExtensionArray  # noqa
----> 2 from .categorical import Categorical  # noqa

c:\python35\lib\site-packages\pandas\core\arrays\categorical.py in <module>()
     13     ABCSeries, ABCIndexClass, ABCCategoricalIndex)
---> 14 from pandas.core.dtypes.missing import isna, notna
     15 from pandas.core.dtypes.inference import is_hashable

c:\python35\lib\site-packages\pandas\core\dtypes\missing.py in <module>()
      9                       ABCExtensionArray)
---> 10 from .common import (is_string_dtype, is_datetimelike,
     11                      is_datetimelike_v_numeric, is_float_dtype,

c:\python35\lib\site-packages\pandas\core\dtypes\common.py in <module>()
     16                       ABCIndexClass, ABCDateOffset)
---> 17 from .inference import is_string_like, is_list_like
     18 from .inference import *  # noqa

c:\python35\lib\site-packages\pandas\core\dtypes\inference.py in <module>()
      7 from numbers import Number
----> 8 from pandas.compat import (PY2, string_types, text_type,
      9                            string_and_binary_types, re_type)

ImportError: cannot import name 're_type'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
c:\python35\lib\site-packages\pandas\io\pickle.py in read_pickle(path)
     64     4    4    9
---> 65 
     66     >>> import os

c:\python35\lib\site-packages\pandas\io\pickle.py in try_read(path, encoding)
     61     1    1    6
---> 62     2    2    7
     63     3    3    8

c:\python35\lib\site-packages\pandas\compat\pickle_compat.py in load(fh, encoding, compat, is_verbose)
    116 
--> 117     # 19939, add timedeltaindex, float64index compat from 15998 move
    118     ('pandas.tseries.tdi', 'TimedeltaIndex'):

c:\python35\lib\pickle.py in load(self)
   1038                 assert isinstance(key, bytes_types)
-> 1039                 dispatch[key[0]](self)
   1040         except _Stop as stopinst:

c:\python35\lib\pickle.py in load_global(self)
   1333         name = self.readline()[:-1].decode("utf-8")
-> 1334         klass = self.find_class(module, name)
   1335         self.append(klass)

c:\python35\lib\pickle.py in find_class(self, module, name)
   1383                 module = _compat_pickle.IMPORT_MAPPING[module]
-> 1384         __import__(module, level=0)
   1385         if self.proto >= 4:

c:\python35\lib\site-packages\pandas\core\indexes\base.py in <module>()
     16 from pandas.core.accessor import CachedAccessor
---> 17 from pandas.core.arrays import ExtensionArray
     18 from pandas.core.dtypes.generic import (

c:\python35\lib\site-packages\pandas\core\arrays\__init__.py in <module>()
      1 from .base import ExtensionArray  # noqa
----> 2 from .categorical import Categorical  # noqa

c:\python35\lib\site-packages\pandas\core\arrays\categorical.py in <module>()
     13     ABCSeries, ABCIndexClass, ABCCategoricalIndex)
---> 14 from pandas.core.dtypes.missing import isna, notna
     15 from pandas.core.dtypes.inference import is_hashable

c:\python35\lib\site-packages\pandas\core\dtypes\missing.py in <module>()
      9                       ABCExtensionArray)
---> 10 from .common import (is_string_dtype, is_datetimelike,
     11                      is_datetimelike_v_numeric, is_float_dtype,

c:\python35\lib\site-packages\pandas\core\dtypes\common.py in <module>()
     16                       ABCIndexClass, ABCDateOffset)
---> 17 from .inference import is_string_like, is_list_like
     18 from .inference import *  # noqa

c:\python35\lib\site-packages\pandas\core\dtypes\inference.py in <module>()
      7 from numbers import Number
----> 8 from pandas.compat import (PY2, string_types, text_type,
      9                            string_and_binary_types, re_type)

ImportError: cannot import name 're_type'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
c:\python35\lib\site-packages\pandas\io\pickle.py in try_read(path, encoding)
     51     1    1    6
---> 52     2    2    7
     53     3    3    8

c:\python35\lib\site-packages\pandas\core\indexes\base.py in <module>()
     16 from pandas.core.accessor import CachedAccessor
---> 17 from pandas.core.arrays import ExtensionArray
     18 from pandas.core.dtypes.generic import (

c:\python35\lib\site-packages\pandas\core\arrays\__init__.py in <module>()
      1 from .base import ExtensionArray  # noqa
----> 2 from .categorical import Categorical  # noqa

c:\python35\lib\site-packages\pandas\core\arrays\categorical.py in <module>()
     13     ABCSeries, ABCIndexClass, ABCCategoricalIndex)
---> 14 from pandas.core.dtypes.missing import isna, notna
     15 from pandas.core.dtypes.inference import is_hashable

c:\python35\lib\site-packages\pandas\core\dtypes\missing.py in <module>()
      9                       ABCExtensionArray)
---> 10 from .common import (is_string_dtype, is_datetimelike,
     11                      is_datetimelike_v_numeric, is_float_dtype,

c:\python35\lib\site-packages\pandas\core\dtypes\common.py in <module>()
     16                       ABCIndexClass, ABCDateOffset)
---> 17 from .inference import is_string_like, is_list_like
     18 from .inference import *  # noqa

c:\python35\lib\site-packages\pandas\core\dtypes\inference.py in <module>()
      7 from numbers import Number
----> 8 from pandas.compat import (PY2, string_types, text_type,
      9                            string_and_binary_types, re_type)

ImportError: cannot import name 're_type'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
c:\python35\lib\site-packages\pandas\io\pickle.py in try_read(path, encoding)
     56 
---> 57     >>> unpickled_df = pd.read_pickle("./dummy.pkl")
     58     >>> unpickled_df

c:\python35\lib\site-packages\pandas\compat\pickle_compat.py in load(fh, encoding, compat, is_verbose)
    116 
--> 117     # 19939, add timedeltaindex, float64index compat from 15998 move
    118     ('pandas.tseries.tdi', 'TimedeltaIndex'):

c:\python35\lib\pickle.py in load(self)
   1038                 assert isinstance(key, bytes_types)
-> 1039                 dispatch[key[0]](self)
   1040         except _Stop as stopinst:

c:\python35\lib\pickle.py in load_global(self)
   1333         name = self.readline()[:-1].decode("utf-8")
-> 1334         klass = self.find_class(module, name)
   1335         self.append(klass)

c:\python35\lib\pickle.py in find_class(self, module, name)
   1383                 module = _compat_pickle.IMPORT_MAPPING[module]
-> 1384         __import__(module, level=0)
   1385         if self.proto >= 4:

c:\python35\lib\site-packages\pandas\core\indexes\base.py in <module>()
     16 from pandas.core.accessor import CachedAccessor
---> 17 from pandas.core.arrays import ExtensionArray
     18 from pandas.core.dtypes.generic import (

c:\python35\lib\site-packages\pandas\core\arrays\__init__.py in <module>()
      1 from .base import ExtensionArray  # noqa
----> 2 from .categorical import Categorical  # noqa

c:\python35\lib\site-packages\pandas\core\arrays\categorical.py in <module>()
     13     ABCSeries, ABCIndexClass, ABCCategoricalIndex)
---> 14 from pandas.core.dtypes.missing import isna, notna
     15 from pandas.core.dtypes.inference import is_hashable

c:\python35\lib\site-packages\pandas\core\dtypes\missing.py in <module>()
      9                       ABCExtensionArray)
---> 10 from .common import (is_string_dtype, is_datetimelike,
     11                      is_datetimelike_v_numeric, is_float_dtype,

c:\python35\lib\site-packages\pandas\core\dtypes\common.py in <module>()
     16                       ABCIndexClass, ABCDateOffset)
---> 17 from .inference import is_string_like, is_list_like
     18 from .inference import *  # noqa

c:\python35\lib\site-packages\pandas\core\dtypes\inference.py in <module>()
      7 from numbers import Number
----> 8 from pandas.compat import (PY2, string_types, text_type,
      9                            string_and_binary_types, re_type)

ImportError: cannot import name 're_type'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-4-a91d9624c173> in <module>()
      1 #from pywsd.lesk import simple_lesk
----> 2 import pywsd.lesk

c:\python35\lib\site-packages\pywsd\__init__.py in <module>()
     17 start = time.time()
     18 
---> 19 import pywsd.lesk
     20 import pywsd.baseline
     21 import pywsd.similarity

c:\python35\lib\site-packages\pywsd\lesk.py in <module>()
     24 EN_STOPWORDS = set(stopwords.words('english') + list(string.punctuation) + pywsd_stopwords)
     25 signatures_picklefile = os.path.dirname(os.path.abspath(__file__)) + '/data/signatures/signatures.pkl'
---> 26 cached_signatures = pd.read_pickle(signatures_picklefile)
     27 
     28 def synset_signatures_from_cache(ss, hyperhypo=True, adapted=False, original_lesk=False):

c:\python35\lib\site-packages\pandas\io\pickle.py in read_pickle(path)
     66     >>> import os
     67     >>> os.remove("./dummy.pkl")
---> 68     """
     69     path = _stringify_path(path)
     70     inferred_compression = _infer_compression(path, compression)

c:\python35\lib\site-packages\pandas\io\pickle.py in try_read(path, encoding)
     60     0    0    5
     61     1    1    6
---> 62     2    2    7
     63     3    3    8
     64     4    4    9

c:\python35\lib\site-packages\pandas\compat\pickle_compat.py in load(fh, encoding, compat, is_verbose)
    115         ('pandas.core.arrays', 'Categorical'),
    116 
--> 117     # 19939, add timedeltaindex, float64index compat from 15998 move
    118     ('pandas.tseries.tdi', 'TimedeltaIndex'):
    119         ('pandas.core.indexes.timedeltas', 'TimedeltaIndex'),

c:\python35\lib\pickle.py in load(self)
   1037                     raise EOFError
   1038                 assert isinstance(key, bytes_types)
-> 1039                 dispatch[key[0]](self)
   1040         except _Stop as stopinst:
   1041             return stopinst.value

c:\python35\lib\pickle.py in load_global(self)
   1332         module = self.readline()[:-1].decode("utf-8")
   1333         name = self.readline()[:-1].decode("utf-8")
-> 1334         klass = self.find_class(module, name)
   1335         self.append(klass)
   1336     dispatch[GLOBAL[0]] = load_global

c:\python35\lib\pickle.py in find_class(self, module, name)
   1382             elif module in _compat_pickle.IMPORT_MAPPING:
   1383                 module = _compat_pickle.IMPORT_MAPPING[module]
-> 1384         __import__(module, level=0)
   1385         if self.proto >= 4:
   1386             return _getattribute(sys.modules[module], name)[0]

c:\python35\lib\site-packages\pandas\core\indexes\base.py in <module>()
     15 
     16 from pandas.core.accessor import CachedAccessor
---> 17 from pandas.core.arrays import ExtensionArray
     18 from pandas.core.dtypes.generic import (
     19     ABCSeries, ABCDataFrame,

c:\python35\lib\site-packages\pandas\core\arrays\__init__.py in <module>()
      1 from .base import ExtensionArray  # noqa
----> 2 from .categorical import Categorical  # noqa

c:\python35\lib\site-packages\pandas\core\arrays\categorical.py in <module>()
     12 from pandas.core.dtypes.generic import (
     13     ABCSeries, ABCIndexClass, ABCCategoricalIndex)
---> 14 from pandas.core.dtypes.missing import isna, notna
     15 from pandas.core.dtypes.inference import is_hashable
     16 from pandas.core.dtypes.cast import (

c:\python35\lib\site-packages\pandas\core\dtypes\missing.py in <module>()
      8                       ABCIndexClass, ABCGeneric,
      9                       ABCExtensionArray)
---> 10 from .common import (is_string_dtype, is_datetimelike,
     11                      is_datetimelike_v_numeric, is_float_dtype,
     12                      is_datetime64_dtype, is_datetime64tz_dtype,

c:\python35\lib\site-packages\pandas\core\dtypes\common.py in <module>()
     15                       ABCSparseArray, ABCSparseSeries, ABCCategoricalIndex,
     16                       ABCIndexClass, ABCDateOffset)
---> 17 from .inference import is_string_like, is_list_like
     18 from .inference import *  # noqa
     19 

c:\python35\lib\site-packages\pandas\core\dtypes\inference.py in <module>()
      6 from collections import Iterable
      7 from numbers import Number
----> 8 from pandas.compat import (PY2, string_types, text_type,
      9                            string_and_binary_types, re_type)
     10 from pandas._libs import lib

ImportError: cannot import name 're_type'

Return Synset Ranking

This is an amazing library, not only because it brings together many disparate methods, but because it's easy to read. For someone without any NLP experience it's a great way to learn more about WSD algorithms. In any case, I like having all these different methods in a single library. However, my application needs ranked synsets, so it would be great if pywsd returned the ranking, leaving the burden of selecting the most appropriate sense to the user (for all algorithms that return rankings). Something like:

>>> answer = simple_lesk(sent, ambiguous)
>>> print answer
{Synset('...'): 0.1, Synset('...'): 0.5, Synset('...'): 0.3, Synset('...'): 0.7}

similarity_by_path arguments flipped and None returned

Hi,

I was wondering if here https://github.com/alvations/pywsd/blob/master/pywsd/similarity.py#L22-L23

the senses should be flipped in the second argument of the max function. I.e.:

return max(wn.path_similarity(sense1,sense2), wn.path_similarity(sense2,sense1))

Also, when using Python 3 this call tends to crash if no path exists: wn.path_similarity then returns None, and max(None, 1) throws an exception in Python 3.

This is e.g. the case for Synset('bank.n.01') and Synset('one.s.01') which is checked when running

max_similarity('I went to the bank to deposit my money', 'bank', 'path', pos='n')
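
A None-safe, symmetric variant along the lines suggested in this issue (a sketch, not the pywsd source): since wn.path_similarity can return None when no path exists, the Nones are filtered out before taking the max, with 0 as the fallback score.

from nltk.corpus import wordnet as wn

def similarity_by_path_safe(sense1, sense2):
    """Max of path similarity in both directions, treating None as no path."""
    scores = [wn.path_similarity(sense1, sense2),
              wn.path_similarity(sense2, sense1)]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else 0.0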

typo in bibtex citation

Even though it compiles, I think there is a superfluous curly bracket in your pywsd citation.
The version below works only with \usepackage[square,sort,comma,numbers]{natbib} (but it turns all your citations into [digit] format, which looks really bad):

@misc{pywsd14,
author =   {Liling Tan},
title =    {Pywsd: Python Implementations of Word Sense Disambiguation (WSD) Technologies [software]},
howpublished = {https://github.com/alvations/pywsd}},
year = {2014}
} 

Also, here is an alternative version that does not break the compilation in case \usepackage{natbib} is used:

@misc{pywsd14,
    author = {Liling Tan},
    title = {{Pywsd: Python Implementations of Word Sense Disambiguation (WSD) Technologies [software]}},
    howpublished = "\url{https://github.com/alvations/pywsd}",
    year = {2014},
    note = "[Online; accessed 17-July-2016]"
}

disambiguate bug

>>> from pywsd.allwords_wsd import disambiguate
>>> disambiguate('I have five lights')

Traceback (most recent call last):
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'light.n.04'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pywsd/allwords_wsd.py", line 51, in disambiguate
    from_cache=from_cache)
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pywsd/lesk.py", line 251, in simple_lesk
    from_cache=from_cache)
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pywsd/lesk.py", line 226, in simple_signatures
    from_cache=from_cache)
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pywsd/lesk.py", line 123, in signatures
    from_cache=from_cache)
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pywsd/lesk.py", line 48, in synset_signatures
    return synset_signatures_from_cache(ss, hyperhypo, adapted, original_lesk)
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pywsd/lesk.py", line 35, in synset_signatures_from_cache
    return cached_signatures[ss.name()][signature_type]
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pandas/core/frame.py", line 2980, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/rreilly/anaconda3/envs/-/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'light.n.04'

installation of pywsd via pip

Hi,
how do I install pywsd for Python 3 via pip? I am getting an error when I run from pywsd.lesk import simple_lesk:
ImportError: No module named 'lesk'

Support for other languages

In NLTK there exist classes that support loading WordNet-like objects for other languages. Where does your library depend on WordNet?

If the dependency is explicit, it might be possible to easily extend your work for other languages.

Setup.py

Could you please add a setup.py so this can be easily installed using pip?

Error running test_wsd.py after installation

I'm trying to start using pywsd, but when running test_wsd.py or any simple example I get errors like:

File "lesk.py", line 116, in simple_signature
    signature += list(chain(*[i.lemma_names() for i in ss_hypohypernyms]))
TypeError: 'list' object is not callable

I don't know whether I missed something or it is a bug.

Lesk module giving errors

  1. The previous lesk version was fine before NLTK changed Synset.definition from an attribute to a method (i.e. Synset.definition()).
  2. Also the ranked synsets were giving index errors due to returning None.
>>> from pywsd.lesk import simple_lesk, original_lesk
>>> sent = "people should be able to marry a person of their choice"
>>> original_lesk(sent, 'able')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pywsd/lesk.py", line 118, in original_lesk
    in wn.synsets(ambiguous_word)}
  File "pywsd/lesk.py", line 117, in <dictcomp>
    dictionary = {ss:ss.definition.split() for ss \
AttributeError: 'function' object has no attribute 'split'
>>> simple_lesk(sent, 'able')
Synset('able.s.03')
>>> simple_lesk(sent, 'able', pos='s')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pywsd/lesk.py", line 181, in simple_lesk
    normalizescore=normalizescore)  
  File "pywsd/lesk.py", line 107, in compare_overlaps
    return ranked_synsets[0]
IndexError: list index out of range
>>> simple_lesk(sent, 'able', pos='a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pywsd/lesk.py", line 181, in simple_lesk
    normalizescore=normalizescore)  
  File "pywsd/lesk.py", line 107, in compare_overlaps
    return ranked_synsets[0]
IndexError: list index out of range

similarity_by_infocontent is not working

Since you explicitly use nltk.wordnet calls I have no idea why your code does not work, but here you go:

>>> from pywsd.similarity import similarity_by_infocontent as sim
>>> sim(syn, syn1, 'res')
0
>>> from nltk.corpus import wordnet as wn
>>> from nltk.corpus import wordnet_ic as wic
>>> resnik = wic.ic('ic-bnc-resnik.dat')
>>> wn.res_similarity(syn, syn1, resnik)
1.5972986298343528

Unfortunately that's true not only for "Resnik" but for every other sim method as well.

Is it possible to use dictionaries like Longman, Oxford as sense inventory

Hi,

I've noticed there are some problems with WordNet; I'll give two examples.

Word: reflex
Context: "Virtual assistants also require a conscious decision to stop doing the current task and actively seek out the virtual assistant, which is a reflex many users haven't developed."
Definition by WordNet: an automatic instinctive unlearned reaction to a stimulus
Correct sense (Longman): something that you do without thinking, as a reaction to a situation (there's conditioned reflex and unconditioned unlearned reflex)

Word: impetus
Context: "Companies with the resources to invest in AI are already creating an impetus for others to follow suit or risk not having a competitive seat at the table."
Definition by WordNet: the act of applying force suddenly
Correct sense (Longman): an influence that makes something happen or makes it happen more quickly

So I wonder if I can use dictionaries like Longman to replace Wordnet.

Thank you!

random.seed issue

Right now in baseline.py, the seed is set for the global Random() instance. Therefore the seed will be shared between all other imports of random and will cause issues if the person using pywsd is not aware of it.

random.seed(0)

Python random Documentation - https://docs.python.org/2/library/random.html

The functions supplied by this module are actually bound methods of a hidden instance of the random.Random class. You can instantiate your own instances of Random to get generators that don’t share state.

import random
custom_random = random.Random(0)
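
A quick demonstration of the difference (a sketch): seeding a private Random instance leaves the module-level generator's stream untouched, so a library can be reproducible without surprising its callers.

import random

custom_random = random.Random(0)  # seeded, isolated instance
print(custom_random.random())     # reproducible across runs
print(random.random())            # global stream is unaffected by the seed above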

Errors in similarity.py

def similarity_by_path(sense1, sense2, option="path"):
    """ Returns maximum path similarity between two senses. """
    if option.lower() in ["path", "path_similarity"]: # Path similaritys
        return max(wn.path_similarity(sense1,sense2),
                   wn.path_similarity(sense1,sense2))

The error is in max(wn.path_similarity(sense1,sense2), wn.path_similarity(sense1,sense2)): it takes the max between two identical calls, when the second argument should be wn.path_similarity(sense2,sense1).

[Question] Lesk vs Max Similarity

Have there been any studies quantifying the most accurate WSD algorithm over generalized content (any genre)? I've ruled out any information-content approaches since they most likely only work well on input similar to the corpus on which they were trained. Therefore I'd be interested in a comparison between any of the lesk algorithms and any of the max-path-similarity algorithms.

For a general thesaurus plugin which algorithm do you think I should use?

Improving Lesk Overlaps

This is more of a performance issue than a theoretical one. In theory, the algorithms are implemented as presented in their respective papers, with simple overlaps.

Going after the state of the art would mean that the implementation no longer matches the papers. The supervised learning route is a long shot, since feature extraction is another headache.

For now, improving the overlaps is probably the better move for the current code.

  1. Look at the normalization https://github.com/alvations/pywsd/blob/master/pywsd/lesk.py#L41 (see the sketch after this list)
  2. Think about the effects of lemmatized vs unlemmatized overlaps. (Currently, it's lemmatized overlap by default.)
  3. Handle tie-breaking when the number of overlaps is the same.
  4. Fall back on the MFS (that involves extracting the MFS from an annotated corpus).
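
A minimal sketch of items 1-2 (illustrative only, not pywsd's actual scoring; the lemmatize helper here is local to the sketch, not pywsd's own):

from nltk.stem import WordNetLemmatizer

_wnl = WordNetLemmatizer()

def lemmatize(tokens):
    return [_wnl.lemmatize(t) for t in tokens]

def overlap_score(context, signature, lemmatized=True, normalized=True):
    """Count context/signature overlaps, optionally lemmatized and normalized."""
    if lemmatized:
        context, signature = lemmatize(context), lemmatize(signature)
    overlap = len(set(context) & set(signature))
    if normalized and signature:
        return overlap / len(set(signature))  # normalize by signature size
    return overlap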

Using wup_similarity on simple_lesk output

First of all, thank you for this library. I've been using your simple_lesk implementation for a project. But now, after installing the latest version of your lib on a different machine, I can no longer call wup_similarity on the object returned by the simple_lesk function.

>>> example = simple_lesk("This is an example", "example")
>>> example.wup_similarity(example)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'Synset' object has no attribute 'wup_similarity'

In a previous implementation I have been able to do that. I just wanted to understand whether this is my system causing the issue or if there is a fix to that.

Thank you for your time.

Synset Pre-Processing

Since a contextual sentence is provided, it might be a good idea to run a POS tagger and filter only senses with matching POS before running any WSD algorithms.
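
One possible version of this pre-processing step (a sketch, assuming NLTK's pos_tag and the WordNet corpus reader): tag the context sentence, map the target word's Penn Treebank tag to a WordNet POS, and keep only the senses with that POS, falling back to all senses when the filter leaves nothing (which also sidesteps the "pos mismatch" issue reported below).

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn

def penn_to_wn(tag):
    """Map a Penn Treebank tag to a WordNet POS constant."""
    return {'N': wn.NOUN, 'V': wn.VERB, 'J': wn.ADJ, 'R': wn.ADV}.get(tag[0])

def pos_filtered_synsets(sentence, target):
    """Return the target word's synsets, filtered by its tagged POS."""
    tagged = dict(pos_tag(word_tokenize(sentence)))
    wn_pos = penn_to_wn(tagged.get(target, 'NN'))
    return wn.synsets(target, pos=wn_pos) or wn.synsets(target)  # fallback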

Just comparing words by themselves

Hi, I'm super new to GitHub and NLTK, and your project pywsd seemed like the only one that could return the Lesk measure between two words, but it looks like it's only meant for comparing words to sentences? Is there a way to compare just two words and get their similarity score, or does your program just not do that? If not, is there anything for Python that can? Sorry, I just didn't know of any other way to contact you.

Please help: installing 'averaged_perceptron_tagger' on Windows

As you recommend, I try to install 'averaged_perceptron_tagger', but my Windows 10 computer gives an error:

Microsoft Windows [Version 10.0.14393]
(c) 2016 Microsoft Corporation. All rights reserved.

e:\nltk_data>python -m nltk.downloader
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml

e:\nltk_data>python -m nltk.downloader 'averaged_perceptron_tagger'
[nltk_data] Error loading 'averaged_perceptron_tagger': Package
[nltk_data] "'averaged_perceptron_tagger'" not found in index
Error installing package. Retry? [n/y/e]
y
Traceback (most recent call last):
  File "C:\Users\cde3\Anaconda3\lib\runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\cde3\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\cde3\Anaconda3\lib\site-packages\nltk\downloader.py", line 2268, in <module>
    halt_on_error=options.halt_on_error)
  File "C:\Users\cde3\Anaconda3\lib\site-packages\nltk\downloader.py", line 677, in download
    if not self.download(msg.package.id, download_dir,
AttributeError: 'NoneType' object has no attribute 'id'

e:\nltk_data>

Error with adapted_lesk

Hi,
I got this error when calling adapted_lesk

 File "concept_extraction/wordnet_extractor.py", line 78, in annotate_wordnet_concept_lesk
    synset = adapted_lesk(text_split, pos[0], 'n')
  File "/users/iris/gnguyen/miniconda3/lib/python3.5/site-packages/pywsd/lesk.py", line 197, in adapted_lesk
    signature = [lemmatize(i) for i in signature]
UnboundLocalError: local variable 'signature' referenced before assignment

Can you check what the possible reason is?
Thank you,

IndexError: list index out of range

I get the following error, possibly because no synset exists.

File "lesk.py", line 67, in compare_overlaps return ranked_synsets[0]
IndexError: list index out of range

pyWSD documentations

The test_*.py files are nice examples of how to use pywsd, but it's time to document the toolkit.

disambiguate bug

Hi,
the function disambiguate seems to throw an exception when used like this:
disambiguate(' letters oed much co')

The exception is:

Traceback (most recent call last):
  File "C:\Program Files\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in <module>
    disambiguate(' letters oed much co')
  File "C:\Program Files\Anaconda3\lib\site-packages\pywsd\allwords_wsd.py", line 35, in disambiguate
    surface_words, lemmas, morphy_poss = lemmatize_sentence(sentence, keepWordPOS=True)
  File "C:\Program Files\Anaconda3\lib\site-packages\pywsd\utils.py", line 107, in lemmatize_sentence
    lemmatizer, stemmer))
  File "C:\Program Files\Anaconda3\lib\site-packages\pywsd\utils.py", line 79, in lemmatize
    stem = stemmer.stem(ambiguous_word)
  File "C:\Program Files\Anaconda3\lib\site-packages\nltk\stem\porter.py", line 665, in stem
    stem = self._step1b(stem)
  File "C:\Program Files\Anaconda3\lib\site-packages\nltk\stem\porter.py", line 376, in _step1b
    lambda stem: (self._measure(stem) == 1 and
  File "C:\Program Files\Anaconda3\lib\site-packages\nltk\stem\porter.py", line 258, in _apply_rule_list
    if suffix == '*d' and self._ends_double_consonant(word):
  File "C:\Program Files\Anaconda3\lib\site-packages\nltk\stem\porter.py", line 214, in _ends_double_consonant
    word[-1] == word[-2] and
IndexError: string index out of range

AttributeError: 'TreebankWordTokenizer' object has no attribute 'STARTING_QUOTES'

After installing pywsd with the command pip3 install --user pywsd I get the following error when importing the module in python3.

>>> from pywsd.similarity import max_similarity
Warming up PyWSD (takes ~10 secs)... Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anon/.local/lib/python3.6/site-packages/pywsd/__init__.py", line 19, in <module>
    import pywsd.lesk
  File "/home/anon/.local/lib/python3.6/site-packages/pywsd/lesk.py", line 19, in <module>
    from pywsd.utils import word_tokenize
  File "/home/anon/.local/lib/python3.6/site-packages/pywsd/utils.py", line 20, in <module>
    _treebank_word_tokenizer.STARTING_QUOTES.insert(0, (improved_open_quote_regex, r' \1 '))
AttributeError: 'TreebankWordTokenizer' object has no attribute 'STARTING_QUOTES'

Using signatures computed using wordNet 3.0

Hi,

I installed pywsd using pip in an Anaconda environment. Now I am trying to use the lesk algorithm with signatures computed from WordNet 3.0. To do this I used Precompute Signatures.ipynb, where I specified the wordnet_30_dir parameter to generate the signatures. I then copied the generated signatures.pkl file into the lib directory of the installed pywsd, replacing the default file that came with the installation.

My code then fails at this line: from pywsd import disambiguate
With the following error: KeyError: 'simple'

Re-running the same line goes through the previous error and produces the following error:
AttributeError: module 'pywsd' has no attribute 'lesk'

Automate all-words WSD

Now pywsd is doing WSD for each word given a context sentence. To do all-words WSD, one has to do something like test_allwords_wsd.py.

Is it possible to automate this process such that users can do:

>>> from nltk.corpus import brown
>>> from pywsd.allwords_wsd import wsd
>>> for sent in brown.sents():
...      print wsd(sent)
...      break
[(u'the', '#STOPWORD/PUNCTUATION#'), (u'fulton', Synset('fulton.n.01')), (u'county', Synset('county.n.02')), (u'grand', Synset('thousand.n.01')), (u'jury', Synset('jury.n.01')), (u'said', Synset('state.v.01')), (u'friday', Synset('friday.n.01')), (u'an', '#STOPWORD/PUNCTUATION#'), (u'investigation', Synset('probe.n.01')), (u'of', '#STOPWORD/PUNCTUATION#'), (u'atlanta', Synset('atlanta.n.02')), (u"'s", '#NOT_IN_WN#'), (u'recent', Synset('recent.s.01')), (u'primary', Synset('primary.n.01')), (u'election', Synset('election.n.01')), (u'produced', Synset('produce.v.04')), (u'``', '#NOT_IN_WN#'), (u'no', '#STOPWORD/PUNCTUATION#'), (u'evidence', Synset('testify.v.02')), (u"''", '#NOT_IN_WN#'), (u'that', '#STOPWORD/PUNCTUATION#'), (u'any', '#STOPWORD/PUNCTUATION#'), (u'irregularity', Synset('irregularity.n.03')), (u'took', Synset('take.v.41')), (u'place', Synset('stead.n.01')), (u'.', '#STOPWORD/PUNCTUATION#')]

pos mismatch breaks similarity

Love the tool! Super helpful. However, it bugs out if you try to run the maxsim disambiguation on a sentence where the wn.synset pos doesn't match the NLTK-tagged pos.

Try running

sen = 'these potato chips are great'
disambiguate(sen, algorithm=maxsim)

and you get an index-out-of-range error, because result in max_similarity in similarity.py is []: wn.synsets(ambiguous_word, pos=pos) returns nothing, as NLTK has (incorrectly) decided the part of speech of 'potato' is an adjective, and there's no synset for that.

A very simple fix: change line 114 from

for i in wn.synsets(ambiguous_word, pos=pos):

to

for i in wn.synsets(ambiguous_word, pos=pos) or wn.synsets(ambiguous_word):

to provide a fallback option

Getting rid of Python2 =)

The End Of Life of Python 2.7 is 2020. https://pythonclock.org/

We're going to go ahead and peel the band-aid off fast... So we're going to fast-forward and never support Python 2.7 from March onwards, and we'll see our CI tests passing again =)

From 1.2 onwards, all Python 2.7 compatible code will be wiped out! ETA: 15 Mar 2019

Anyone still depending on Python 2.7 will be stuck with the last stable version, 1.1.7.

Morphy vs PorterStemmer

I propose using Morphy instead of PorterStemmer

  • More accurate (on the words I tested). I have no evidence to back this up, especially since I can't find any specific details of its implementation
  • Designed to work specifically with wordnet

It receives an optional POS, which can easily be derived from nltk's pos_tag tagger on the context sentence. I'm not sure how it works without the POS, especially since off the top of my head I'd be hard-pressed to find a word with different lemmas depending on the POS.
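
A quick side-by-side (a sketch; exact outputs depend on your NLTK/WordNet versions): wn.morphy maps surface forms to real WordNet lemmas when given a POS, while the Porter stemmer produces stems that need not be words at all.

from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word, pos in [('fried', wn.VERB), ('geese', wn.NOUN), ('better', wn.ADJ)]:
    # e.g. morphy: 'fried' -> 'fry', while the stemmer gives 'fri'
    print(word, '->', wn.morphy(word, pos), 'vs', stemmer.stem(word))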

Problem with adapted_lesk

Hi, when I run adapted_lesk on a given sentence and word, I get an IndexError. This happens very often in my corpus and this is a disaster for my project. Could you please tell me what to do to avoid this error? Thanks in advance!
Here are two examples:

answer = adapted_lesk("Because of the attacks, in which at least 11 vehicles were gutted by flames, transportation companies suspended all cargo_shipments along the highway, police said. The raids were carried out by national_liberation_army (ELN) rebels who have ordered the suspension of all economic activity in the eastern part of Antioquia province this week, military officials said.","gutted",'v')

answer = adapted_lesk("In Kiev, foreign ministry official Stanislav Lazebnyk also said agreements were being readied on the black_sea fleet issue.","readied",'v')

And the error:

Traceback (most recent call last):
  File "test-lesk.py", line 7, in <module>
    answer = adapted_lesk("In Kiev, foreign ministry official Stanislav Lazebnyk also said agreements were being readied on the black_sea fleet issue.","readied",'v')
  File "/data2/REUTERS/pywsd-master/lesk.py", line 180, in adapted_lesk
    normalizescore=normalizescore)
  File "/data2/REUTERS/pywsd-master/lesk.py", line 74, in compare_overlaps
    return ranked_synsets[0]
IndexError: list index out of range

Max similarity algorithm

The max similarity algorithm is not giving correct results.
Results obtained:

>>> disambiguate('I went to the bank to deposit my money', algorithm=maxsim, similarity_option='wup', keepLemmas=True)
[('I', 'i', None), ('went', 'go', Synset('travel.v.01')), ('to', 'to', None), ('the', 'the', None), ('bank', 'bank', None), ('to', 'to', None), ('deposit', 'deposit', Synset('deposit.v.02')), ('my', 'my', None), ('money', 'money', None)]

Expected Result:

[('I', 'i', None), ('went', 'go', Synset('sound.v.02')), ('to', 'to', None), ('the', 'the', None), ('bank', 'bank', Synset('bank.n.06')), ('to', 'to', None), ('deposit', 'deposit', Synset('deposit.v.02')), ('my', 'my', None), ('money', 'money', Synset('money.n.01'))]

It is not disambiguating nouns.

pywsd docstrings

The implementation is just great! I would love it if you also provided detailed documentation. If that is too much to ask, then one line per parameter in the docstrings would be great. I'm now fumbling around with all the combinations and working out what they mean from the outputs.

IndexError when using disambiguate() with maxsim algorithm

I'm using Google Colab

s = "would sentiment"
disambiguate(s, algorithm=maxsim, similarity_option='path', keepLemmas=True)

the same with "may sentiment", "might sentiment", "must sentiment", ...


IndexError                                Traceback (most recent call last)
in <module>()
      1 s = "would sentiment"
----> 2 disambiguate(s, algorithm=maxsim, similarity_option='path', keepLemmas=True)

1 frames
/usr/local/lib/python3.6/dist-packages/pywsd/allwords_wsd.py in disambiguate(sentence, algorithm, context_is_lemmatized, similarity_option, keepLemmas, prefersNone, from_cache, tokenizer)
     43             synset = algorithm(lemma_sentence, lemma, from_cache=from_cache)
     44         elif algorithm == max_similarity:
---> 45             synset = algorithm(lemma_sentence, lemma, pos=pos, option=similarity_option)
     46         else:
     47             synset = algorithm(lemma_sentence, lemma, pos=pos, context_is_lemmatized=True,

/usr/local/lib/python3.6/dist-packages/pywsd/similarity.py in max_similarity(context_sentence, ambiguous_word, option, lemma, context_is_lemmatized, pos, best)
    125     result = sorted([(v,k) for k,v in result.items()],reverse=True)
    126 
--> 127     return result[0][1] if best else result

IndexError: list index out of range

max_similarity fails when results are excluded due to "pos" argument

max_similarity(context_sentence="art entertainment hobby creative art", 
                         ambiguous_word='creative', option="path", lemma=True, 
                         context_is_lemmatized=True, pos='n', best=True)

The above function call throws an exception

IndexError                                Traceback (most recent call last)
<ipython-input-63-3333bb3d5eca> in <module>()
----> 1 max_similarity(context_sentence="art entertainment hobby creative art", ambiguous_word='creative', option="path", lemma=True, context_is_lemmatized=True, pos='n', best=True)

/root/anaconda2/lib/python2.7/site-packages/pywsd/similarity.pyc in max_similarity(context_sentence, ambiguous_word, option, lemma, context_is_lemmatized, pos, best)
    106         result = sorted([(v,k) for k,v in result.items()],reverse=True)
    107     ##print result
--> 108     if best: return result[0][1];
    109     return result
    110 

IndexError: list index out of range

It works when I remove the pos argument.

pywsd correctly installed but get error when import (python 3)

Hi, according to my terminal I have successfully installed pywsd on Python 3 (see the install log below); however, when I import pywsd from Python I get the following error. Can you help me fix it? Thanks a lot!

ERROR LOG

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in <module>
    import pywsd
  File "/usr/share/java/pycharm-community/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pywsd/__init__.py", line 14, in <module>
    from wn import WordNet
  File "/usr/share/java/pycharm-community/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/wn/__init__.py", line 10, in <module>
    from wn.constants import *
  File "/usr/share/java/pycharm-community/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/wn/constants.py", line 196, in <module>
    exception_map = load_exception_map()
  File "/usr/local/lib/python3.7/site-packages/wn/constants.py", line 126, in load_exception_map
    with open(wordnet_dir+'%s.exc' % suffix) as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/site-packages/wn/data/wordnet-3.3/adj.exc'

########## INSTALL LOG ###########

pip3 install pywsd

Collecting pywsd
Downloading https://files.pythonhosted.org/packages/8c/79/39597ff5510a63f44c9d4ce2f6a8200bbb1ae9c7b50ef90fe1f851f2c10d/pywsd-1.2.1.tar.gz (23.7MB)
100% |████████████████████████████████| 23.7MB 1.3MB/s
Requirement already satisfied: nltk in /usr/local/lib/python3.7/site-packages (from pywsd) (3.4.4)
Requirement already satisfied: numpy in /usr/local/lib64/python3.7/site-packages (from pywsd) (1.16.4)
Collecting pandas (from pywsd)
Downloading https://files.pythonhosted.org/packages/7e/ab/ea76361f9d3e732e114adcd801d2820d5319c23d0ac5482fa3b412db217e/pandas-0.25.1-cp37-cp37m-manylinux1_x86_64.whl (10.4MB)
100% |████████████████████████████████| 10.4MB 1.2MB/s
Collecting wn (from pywsd)
Downloading https://files.pythonhosted.org/packages/c4/ee/171109f853370256cce3fc10e2574bc4b4165503332e1c327217f855bf92/wn-0.0.20.tar.gz (12.0MB)
100% |████████████████████████████████| 12.1MB 2.9MB/s
Requirement already satisfied: six in /usr/lib/python3.7/site-packages (from pywsd) (1.11.0)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/lib/python3.7/site-packages (from pandas->pywsd) (2.7.5)
Requirement already satisfied: pytz>=2017.2 in /usr/lib/python3.7/site-packages (from pandas->pywsd) (2018.5)
Building wheels for collected packages: pywsd, wn
Running setup.py bdist_wheel for pywsd ... done
Stored in directory: /root/.cache/pip/wheels/0f/44/85/3829bb6c6188f30e13ba8981e8038c61db494a9788ea3bed01
Running setup.py bdist_wheel for wn ... done
Stored in directory: /root/.cache/pip/wheels/80/68/3b/f1101703d1b65ef59fb45b1e4d2623d8329349785304db5fa2
Successfully built pywsd wn
Installing collected packages: pandas, wn, pywsd
Successfully installed pandas-0.25.1 pywsd-1.2.1 wn-0.0.20

Lemma vs Surface word vs Stems in all-words

Buggy POS reliance.

When extracting signatures, pywsd lemmatizes with POS knowledge, but when counting overlaps, the POS was not considered. This happened once we started to rely on POS for lemmatization, because of http://stackoverflow.com/questions/27659179/porter-stemming-of-fried/27660340#27660340

For individual WSD it's fine, since we can specify the POS from the start and that will resolve the issue. The main issue comes when you do all-words WSD: the POS is recognized when lemmatizing, but it goes wrong when disambiguating.

Code-sprint for supervised methods and semeval data

Code-sprint from 28th Sept - 2nd Oct 2014
Scheduled release with semeval data + supervised methods: 3rd Oct 2014

TODO:

  • reformat SemEval 2007 coarse-grained all-words data
  • function for wn2.1 to wn3.x mappings
  • supervised methods with and without sklearn
  • emulate IMS system + SVM wsd
