Comments (4)
I would suggest that you read through this chapter of the NLTK book, http://www.nltk.org/howto/wordnet.html, and then take a look at the Lesk algorithm, http://en.wikipedia.org/wiki/Lesk_algorithm, first. But I'll try to answer some questions about WSD and the pywsd tool.
WSD (Word Sense Disambiguation) is the task of identifying the right sense of a word given a context sentence. Hence the input for WSD is usually (i) an ambiguous word and (ii) a sentence that the ambiguous word occurs in. The output is the sense of the word; in our case we use WordNet as the inventory of senses to disambiguate from.
There is other WSD software available, but I'm not sure whether it is Python-friendly; see http://stackoverflow.com/questions/4613773/anyone-know-of-some-good-word-sense-disambiguation-software
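To make the input/output concrete, here is a minimal sketch using pywsd's simple_lesk (the function lives in pywsd.lesk; the exact sense returned can vary with the pywsd and WordNet versions installed):
>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> simple_lesk(sent, 'bank', pos='n')  # disambiguate 'bank' given this context
Synset('depository_financial_institution.n.01')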
Next, on comparing similarity: it is pretty hard to compare similarity based on the surface form of a word; it is often more natural to compare the similarity of two senses, because a surface word can have many senses. For example:
>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synsets('dog')
>>> cat = wn.synsets('cat')
>>> dog
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> cat
[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset("cat-o'-nine-tails.n.01"), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]
As for comparing similarity between senses, you can easily get it through the WordNet API from NLTK, e.g.
>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synsets('dog')
>>> cat = wn.synsets('cat')
>>> dog[0]
Synset('dog.n.01')
>>> dog[0].definition()
'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'
>>> cat[0]
Synset('cat.n.01')
>>> cat[0].definition()
'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'
>>> wn.path_similarity(dog[0], cat[0])
0.2
Alternatively, if you want to use only surface words, you can "trick" pywsd into giving you a score without finding out the sense of the word, by maximizing the similarity over all possible sense combinations from the two sets of senses of both words, e.g.:
>>> from pywsd.similarity import max_similarity as max_sim
>>> dog = 'dog'
>>> cat = 'cat'
>>>
# The following will treat 'dog' as the context sentence and
# pywsd will try to disambiguate and assign a sense to 'cat'.
>>> max_sim(dog, cat, best=False)
[(0.2, Synset('big_cat.n.01')), (0.2, Synset('guy.n.01')), (0.2, Synset('cat.n.01')), (0.16666666666666666, Synset("cat-o'-nine-tails.n.01")), (0.14285714285714285, Synset('cat.v.01')), (0.14285714285714285, Synset('vomit.v.01')), (0.14285714285714285, Synset('cat.n.03')), (0.1111111111111111, Synset('caterpillar.n.02')), (0.1111111111111111, Synset('kat.n.01')), (0.058823529411764705, Synset('computerized_tomography.n.01'))]
>>>
# but what you need is to extract the highest similarity score
# among the disambiguated output
>>> max(score for score,sense in max_sim(cat, dog, best=False))
0.2
Not surprisingly, the maximization over synset similarities reaches the same global maximum as the WordNet path_similarity. More interestingly, we see that the maximum similarity between the two surface words can be achieved by several different sense pairs =)
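For reference, the same maximization can be reproduced directly with NLTK, without pywsd; a minimal sketch (assuming only the WordNet API) that brute-forces all sense pairs:
>>> from itertools import product
>>> from nltk.corpus import wordnet as wn
>>> max(wn.path_similarity(s1, s2) or 0  # path_similarity is None across POS
...     for s1, s2 in product(wn.synsets('dog'), wn.synsets('cat')))
0.2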
In short, if you know the senses, use
>>> wn.path_similarity(sense1, sense2)
0.2
If you don't know the senses, then you can try:
>>> max_sim('dog', 'cat', best=False)[0][0]
0.2
However, do note that the max_sim()[0][0] score is obtained by maximizing over all possible sense combinations of the senses of the two words, as described in the previous comment.
Also, do note that different similarity measures will give different scores; see the README for the relevant papers describing the different similarity measures, e.g. the Wu-Palmer similarity:
>>> from nltk.corpus import wordnet as wn
>>> from pywsd.similarity import max_similarity as max_sim
>>> cat = wn.synsets('cat')
>>> dog = wn.synsets('dog')
>>> wn.wup_similarity(dog[0], cat[0])
0.8571428571428571
>>> max(score for score,sense in max_sim('cat', 'dog', best=False, option='wup'))
0.8571428571428571
And more:
>>> max(score for score,sense in max_sim('cat', 'dog', best=False, option='path'))
0.2
>>> max(score for score,sense in max_sim('cat', 'dog', best=False, option='wup'))
0.8571428571428571
>>> max(score for score,sense in max_sim('cat', 'dog', best=False, option='lin'))
0.8710969790925273
>>> max(score for score,sense in max_sim('cat', 'dog', best=False, option='res'))
7.66654127408746
>>> max(score for score,sense in max_sim('cat', 'dog', best=False, option='jcn'))
0.4305029636162658
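If you want the same information-content based measures in plain NLTK, they are available on the synsets themselves; a rough sketch (note that res/lin/jcn need an information-content corpus, and the exact scores depend on which IC file is loaded, so they may differ from the pywsd numbers above):
>>> from nltk.corpus import wordnet as wn
>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> dog, cat = wn.synsets('dog')[0], wn.synsets('cat')[0]
>>> dog.wup_similarity(cat)
0.8571428571428571
>>> dog.res_similarity(cat, brown_ic)  # score depends on the IC corpus
>>> dog.lin_similarity(cat, brown_ic)  # likewise
>>> dog.jcn_similarity(cat, brown_ic)  # likewise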
Hi, thanks so much for replying. You're right, I think I was confusing terms. It makes sense that the Lesk measure applies mainly to sentence disambiguation. I guess I was thinking of the "lesk" measure as applied here: http://ws4jdemo.appspot.com/?mode=w&s1=&w1=calculate&s2=&w2=solve, but it looks like that's a different comparison.
The reason I asked was that the lesk score from the ws4j demo appeared to give the most accurate results, and my goal (since I know the sense of the word, or in my case, the POS) was to get a list of similar words, and the tools provided by the APIs and programs I've tried (ws4j with Java, NLTK/WordNet) don't quite have that capability.
For example, when I type in "solve", I wanted to find all the verbs that match to some degree, like "calculate", "estimate", "figure_out", "assess", "clear_up" (I'd use a thesaurus like Moby, but that still doesn't have what I'd like).
My original idea was to compare the search word against every other word (based on what I thought was the best measure of relatedness at the time, the lesk measure) and then just list, say, the top 50 results. But since then I've realized that those measures are very sensitive and prone to give me matches that I don't want. So I've shifted focus to iterating through the results.
What I mean is, I've been using ws4j with Java to get the hypernyms and synonyms (and sometimes hyponyms too), and then iterating the search again over all the results, and maybe another iteration (over the previous results, and so on), but just over the synonyms (in order to keep the results from branching away too much).
And finally, I was going to sort the results with a similarity measure. Do you think this is a recommended solution, or is there a better/faster way? I've noticed that how my iterations are set up depends highly on the POS. I've looked at VerbNet and VerbOcean, and those seem promising, so I may just use this tool (http://verbs.colorado.edu/verb-index/inspector/) to find similar verbs, but I think I'll have to come up with my own methods to find similar nouns/adjectives. Sorry for the length and for seemingly jumping off topic so quickly; I'm still in the brainstorming phase, but if you have any suggestions or directions I could go in, that would be helpful, thanks!
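For what it's worth, here is a rough NLTK-only sketch of the iteration described above (collecting neighbours through hypernym/hyponym links and ranking their lemmas by wup_similarity to the query word); this is purely illustrative, not part of pywsd, and the depth and relations used are assumptions to tune:
>>> from nltk.corpus import wordnet as wn
>>> def related_words(word, pos=wn.VERB, depth=2):
...     query = wn.synsets(word, pos=pos)
...     seen = set(query)
...     frontier = set(query)
...     for _ in range(depth):  # expand through hypernym/hyponym links
...         frontier = {n for s in frontier
...                       for n in s.hypernyms() + s.hyponyms()} - seen
...         seen |= frontier
...     scores = {}
...     for s in seen:  # rank each lemma by its best wup_similarity to the query senses
...         sim = max(q.wup_similarity(s) or 0 for q in query)
...         for lemma in s.lemma_names():
...             scores[lemma] = max(scores.get(lemma, 0), sim)
...     return sorted(scores.items(), key=lambda x: -x[1])
...
>>> related_words('solve')[:10]  # e.g. synonyms like 'work_out', 'figure_out', then nearby hyper/hyponyms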
I think you are looking for text similarity software (http://en.wikipedia.org/wiki/Semantic_similarity) instead of WSD software (http://en.wikipedia.org/wiki/Word-sense_disambiguation). While related, they are different tasks.
The issue is going off-topic, so I'll close it because there are other watchers in this repo. Unless users or other WSD developers are interested in building text similarity measures into pywsd, we should only use the issue tracker to report bugs/enhancements/fixes/suggestions.
Please contact me personally through email for further discussion (http://alvations.bitbucket.org/about.html), but I would also suggest that you ask other devs/researchers for advice through one or more of these mailing lists: http://www.clres.com/distlist.html