
Comments (4)

alvations commented on June 10, 2024

I would suggest that you first read through this chapter of the NLTK book, http://www.nltk.org/howto/wordnet.html, and then take a look at the Lesk algorithm, http://en.wikipedia.org/wiki/Lesk_algorithm. But I'll try to answer some questions about WSD and the pywsd tool.

WSD (Word Sense Disambiguation) is the task of identifying the right sense of a word given a context sentence. Hence the input to WSD is usually (i) an ambiguous word and (ii) a sentence in which the ambiguous word occurs, and the output is the sense of the word; in our case we use WordNet as the inventory of senses to disambiguate from.
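For example, a quick disambiguation call with pywsd's simple_lesk (the input is the context sentence plus the ambiguous word, the output a WordNet sense) looks something like:

>>> from pywsd.lesk import simple_lesk
>>> simple_lesk('I went to the bank to deposit my money', 'bank')
Synset('depository_financial_institution.n.01')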

There is other WSD software available, but I'm not sure whether it is Python-friendly; see http://stackoverflow.com/questions/4613773/anyone-know-of-some-good-word-sense-disambiguation-software

Next, on comparing similarity: it is pretty hard to compare similarity based on the surface form of a word; it is often more natural to compare the similarity of two senses, because there are many senses for a given surface word. For example:

>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synsets('dog')
>>> cat = wn.synsets('cat')
>>> dog
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> cat
[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset("cat-o'-nine-tails.n.01"), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]

As for comparing similarity between senses, you can easily get it through the WordNet API from NLTK, e.g.

>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synsets('dog')
>>> cat = wn.synsets('cat')
>>> dog[0]
Synset('dog.n.01')
>>> dog[0].definition()
'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'
>>> cat[0]
Synset('cat.n.01')
>>> cat[0].definition()
'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'
>>> wn.path_similarity(dog[0], cat[0])
0.2

Alternatively, if you only want to work with surface words, you can "trick" pywsd into giving you a score without first finding out the sense of the word, by maximizing the similarity over all possible sense combinations from the two words' sense sets, e.g.:

>>> from pywsd.similarity import max_similarity as max_sim
>>> dog = 'dog'
>>> cat = 'cat'
>>>
# The following treats 'dog' as the context sentence, and
# `pywsd` will try to disambiguate and assign a sense to 'cat'.
>>> max_sim(dog, cat, best=False)
[(0.2, Synset('big_cat.n.01')), (0.2, Synset('guy.n.01')), (0.2, Synset('cat.n.01')), (0.16666666666666666, Synset("cat-o'-nine-tails.n.01")), (0.14285714285714285, Synset('cat.v.01')), (0.14285714285714285, Synset('vomit.v.01')), (0.14285714285714285, Synset('cat.n.03')), (0.1111111111111111, Synset('caterpillar.n.02')), (0.1111111111111111, Synset('kat.n.01')), (0.058823529411764705, Synset('computerized_tomography.n.01'))]
>>>
# but what you need is to extract the highest similarity score
# among the disambiguated output
>>> max(score for score,sense in max_sim(cat, dog, best=False))
0.2

Not surprisingly, the maximization over synset similarities reaches the same global maximum as the WordNet path_similarity between the best sense pair. More interestingly, we see that the maximum similarity between the two surface words can be achieved by several different sense pairs =)
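For the curious, here is a minimal sketch of what that maximization is doing under the hood, using plain NLTK (path_similarity returns None for cross-POS pairs, hence the `or 0`):

>>> from nltk.corpus import wordnet as wn
>>> dogs, cats = wn.synsets('dog'), wn.synsets('cat')
>>> # try every sense pair and keep the best path similarity
>>> max(s1.path_similarity(s2) or 0 for s1 in dogs for s2 in cats)
0.2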


alvations commented on June 10, 2024

In short, if you know the senses, use

>>> wn.path_similarity(sense1, sense2)
0.2

If you don't know the senses, then you can try:

>>> max_sim('dog', 'cat', best=False)[0][0]
0.2

However, do note that the max_sim()[0][0] score is obtained by maximizing over all possible sense combinations of the two words' senses, as described in the previous comment.

Also do note that different similarity measures will give different scores; see the README for the relevant papers describing the different similarity measures. For example, the Wu-Palmer similarity:

>>> from nltk.corpus import wordnet as wn
>>> from pywsd.similarity import max_similarity as max_sim
>>> cat = wn.synsets('cat')
>>> dog = wn.synsets('dog')
>>> wn.wup_similarity(dog[0], cat[0])
0.8571428571428571
>>> max(score for score,sense in max_sim('cat', 'dog', best=False, option='wup'))
0.8571428571428571

And more:

>>> max(score for score,sense in max_sim('cat', 'dog', best=False, option='path'))
0.2
>>> max(score for score,sense in max_sim('cat', 'dog', best=False, option='wup'))
0.8571428571428571
>>> max(score for score,sense in max_sim('cat', 'dog', best=False, option='lin'))
0.8710969790925273
>>> max(score for score,sense in max_sim('cat', 'dog', best=False, option='res'))
7.66654127408746
>>> max(score for score,sense in max_sim('cat', 'dog', best=False, option='jcn'))
0.4305029636162658
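Note that the information-content measures (res, lin, jcn) additionally depend on an information-content (IC) corpus, so their scores will vary with the IC file used. Here is a minimal sketch of reproducing a Lin score directly in NLTK, assuming the wordnet_ic data is downloaded (the exact value depends on the IC corpus, which is why it may not match the max_sim number above):

>>> from nltk.corpus import wordnet as wn
>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')  # IC counts from the Brown corpus
>>> wn.synset('dog.n.01').lin_similarity(wn.synset('cat.n.01'), brown_ic)
0.8768...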


andrieski commented on June 10, 2024

Hi, thanks so much for replying. You're right, I think I was confusing terms. It makes sense that the Lesk measure applies mainly to sentence disambiguation. I guess I was thinking of the "lesk" measure as applied here: http://ws4jdemo.appspot.com/?mode=w&s1=&w1=calculate&s2=&w2=solve, but it looks like that's a different comparison.

The reason I asked was that the Lesk score from the WS4J demo appeared to have the most accurate results, and my goal (since I know the sense of the word, or in my case, the POS) was to get a list of similar words, and the tools provided by the APIs and programs I've tried (WS4J with Java, NLTK/WordNet) don't quite have that capability.

For example, when I type in "solve", I want to find all the verbs that match to some degree, like "calculate", "estimate", "figure_out", "assess", "clear_up" (I'd use a thesaurus like Moby, but that still doesn't have what I'd like).

My original idea was to compare the search word against every other word (based on what I thought was the best measure of relatedness at the time, the Lesk measure), and then just list the top 50 results, for example. But since then, I realized that those measures are very sensitive and prone to give me matches that I don't want. So I've shifted focus to iterating through the results.

What I mean is, I've been using WS4J with Java to get the hypernyms and synonyms (and sometimes the hyponyms too), then iterating the search again over all the results, and maybe once more over those results, and so on, but only over the synonyms (to keep the results from branching away too much).

And finally, I was going to sort the results with a similarity measure; a sketch of the whole idea is below. Do you think this is a recommended solution, or is there a better/faster way? I've noticed that how my iterations are set up depends highly on the POS. I've looked at VerbNet and VerbOcean, and those seem promising, so I may just use this tool (http://verbs.colorado.edu/verb-index/inspector/) to find similar verbs, but I think I'll have to come up with my own methods to find similar nouns/adjectives. Sorry for the length, and for seemingly jumping off topic so quickly; I guess I'm still in the brainstorming phase, but if you have any suggestions or directions I could go in, that would be helpful, thanks!
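For concreteness, here's a rough sketch of that expand-then-rank idea in plain NLTK (the helper names, the depth, and the choice of Wu-Palmer for ranking are just illustrative):

>>> from nltk.corpus import wordnet as wn
>>> def expand_verb(word, depth=2):
...     """Collect lemma names from a verb's synsets, expanding via hypernyms."""
...     found, frontier = set(), wn.synsets(word, pos=wn.VERB)
...     for _ in range(depth):
...         found.update(lem.name() for syn in frontier for lem in syn.lemmas())
...         frontier = [hyp for syn in frontier for hyp in syn.hypernyms()]
...     return found - {word}
...
>>> candidates = expand_verb('solve')
>>> 'figure_out' in candidates
True
>>> # rank candidates by their best Wu-Palmer similarity to the query verb
>>> def best_wup(w1, w2):
...     return max((s1.wup_similarity(s2) or 0)
...                for s1 in wn.synsets(w1, pos=wn.VERB)
...                for s2 in wn.synsets(w2, pos=wn.VERB))
...
>>> ranked = sorted(candidates, key=lambda w: best_wup('solve', w), reverse=True)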


alvations commented on June 10, 2024

I think you are looking for text similarity software (http://en.wikipedia.org/wiki/Semantic_similarity) instead of WSD software (http://en.wikipedia.org/wiki/Word-sense_disambiguation). While related, they are different.

This issue is going off-topic, so I'll close it, because there are other watchers in this repo. Unless users or other WSD developers are interested in building text similarity measures into pywsd, we should only use the issue tracker to report bugs/enhancements/fixes/suggestions.

Please contact me personally through email for further discussion (http://alvations.bitbucket.org/about.html), but I would also suggest that you ask other devs/researchers for advice through one or some of these mailing lists: http://www.clres.com/distlist.html

