bjascob / lemminflect Goto Github PK

A python module for English lemmatization and inflection.

License: MIT License

Python 99.76% Shell 0.24%

inflection lemmatization python spacy spacy-extensions nlp nlp-machine-learning

lemminflect's Issues

Pronouns support

Please add pronouns support.
lemminflect.getAllLemmas('his') returns {}

Doc Enhancement

spacy.load("en_core_web_sm")(word)[0]._.inflect("NNS")

gives me "Can't retrieve unregistered extension attribute 'inflect'. Did you forget to call the set_extension method?".

If I follow the message and add "spacy.tokens.Token.set_extension('inflect', method=Inflections().spacyGetInfl)", I now get "Extension 'inflect' already exists on Token. To overwrite the existing extension, set force=True on Token.set_extension".

If I add force=True, it gives me what I want, but there's no mention of this in the tests.

If you'd like, I can add a test to this effect or a note in the README.

Contractions not in the lookup

Contractions are not in the dictionary lookups.
They show up in the LEXICON and in english_dict.txt but not in forms_table.csv.gz. They are likely being eliminated by the ASCII checks and shouldn't be.
contractions = ["'d", "'ll", "'m", "'re", "'s", "'ve"]
words = ["would", "will", "am", "are", "is", "have"]
lemmas = ["will", "will", "be", "be", "be", "have"]

Stanford Morphology class uses POS tags

I'm not sure how you used the lemma annotator for CoreNLP to test the lemmatizer, but the Morphology class definitely does use POS tags if available:

https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/process/Morphology.java

For example, see the WordTag stemStatic(String word, String tag) interface.

FWIW, the next version of CoreNLP will cover ADJ & ADV as well

Word casing not preserved in all cases

LemmInflect currently preserves casing for the "all lower", "all upper" and "first upper" casing styles. For words like "McDonald", the word returned after lemmatization or inflection will be "Mcdonald", since the capitalization of individual letters is not maintained.
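
The behavior described above can be illustrated with a minimal sketch. The helper name apply_casing is hypothetical and this is not LemmInflect's actual code; it only assumes that one of the three casing styles is detected on the input and reapplied to the output.

```python
def apply_casing(original: str, result: str) -> str:
    """Reapply the casing style of `original` to `result` (hypothetical helper)."""
    if original.isupper():
        # "all upper" style
        return result.upper()
    if original[:1].isupper():
        # "first upper" style: only the first letter is tracked,
        # so capitals inside the word are lost.
        return result[:1].upper() + result[1:].lower()
    # "all lower" style (also the fallback for anything else)
    return result.lower()


print(apply_casing("McDonald", "mcdonald"))  # Mcdonald
print(apply_casing("RUNNING", "run"))        # RUN
```

Preserving "McDonald" would require tracking casing per character, which is only well defined when the input and output have aligned letters.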

Incorrect base form for "install"

Is it appropriate to report single words that are incorrect as bugs? I realize the dictionary can't be 100% complete.

Lemminflect 0.2.1:
getAllLemmas('install')
{'VERB': ('install', 'instal')}

I think it should just return 'install'.

Lemma model can select rule for wrong pos type

For the test cases 'quilting/NOUN' and 'plastering/NOUN', the words are not in the lemma lookup, so the OOV rules are called.

getAllLemmasOOV('quilting', 'NOUN') returns 'quilt' (it selects rule "ing,,False")
getAllLemmasOOV('plastering', 'NOUN') returns 'plastering' (it selects rule ",,False")

In the case of 'quilting' the model selects a verb rule. To prevent this, consider:

Add hard-coded rules to choose the next best if the rule doesn't apply
Split the model into 3 parts (verb, noun, adj/adv) and run separately
Add contra-cases to training data so it learns not to do this

In addition, the model classes include the ending letters to remove. However, as above, there is nothing to prevent the model from selecting a "remove ing" rule for a word ending in something else. I'm not aware of this causing issues, but it should be investigated when looking into the first issue.
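
The first option, combined with the suffix check mentioned above, could look roughly like the following sketch. It assumes a rule string has the form "remove,add,flag", as in "ing,,False", where the first field is the ending to strip and the second is the ending to append; the meaning of the third field is not stated here, so it is parsed and ignored. The function name and the idea of a ranked rule list coming from the model are also assumptions, not the library's actual interface.

```python
def apply_first_applicable(word, ranked_rules):
    """Apply the highest-ranked rule whose 'remove' ending matches `word`.

    `ranked_rules` is assumed to be ordered best-first by the model.
    """
    for rule in ranked_rules:
        remove, add, _flag = rule.split(",")  # third field is ignored here
        # An empty 'remove' field matches any word (identity-style rule).
        if word.endswith(remove):
            stem = word[: len(word) - len(remove)]
            return stem + add
    return word  # no rule applied; leave the word unchanged


# The top-ranked rule does not fit the word, so the next best is used.
print(apply_first_applicable("plastering", ["ed,,False", "ing,,False"]))  # plaster
```

This only guards against a rule whose ending does not match the word. It does not address a verb rule being selected for a noun, which would still need one of the other options.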

Ability to find base word without knowing the POS tag

There are some use cases where users would like to find the base word (aka lemma) but don't know what part of speech the word is. This is problematic for words like "painting", which could be either "paint" for a verb or "painting" for a noun. Regardless, it may be useful to simply return "paint" for use in neural network sentence classification, etc.

The proposed approach is to use the dictionary to find the shortest word. If the word is not in the dictionary, then try OOV for nouns and verbs and choose the shortest.
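
A minimal sketch of this approach follows. The dictionary and OOV steps are passed in as functions so the example runs on its own; the toy stand-ins below are invented for illustration, and both are assumed to return a mapping from POS to a tuple of lemmas, as getAllLemmas does in the 'install' report above.

```python
def shortest_lemma(word, lookup, oov):
    """Return the shortest candidate lemma for `word` without a POS tag."""
    # Step 1: collect every lemma the dictionary lists, across all POS.
    candidates = [l for lemmas in lookup(word).values() for l in lemmas]
    # Step 2: if the word is not in the dictionary, try OOV for nouns and verbs.
    if not candidates:
        for upos in ("NOUN", "VERB"):
            for lemmas in oov(word, upos).values():
                candidates.extend(lemmas)
    return min(candidates, key=len) if candidates else word


# Toy stand-ins (assumptions for illustration, not real library data).
TOY_DICT = {"painting": {"NOUN": ("painting",), "VERB": ("paint",)}}

def toy_lookup(word):
    return TOY_DICT.get(word, {})

def toy_oov(word, upos):
    if upos == "VERB" and word.endswith("ing"):
        return {upos: (word[:-3],)}
    return {upos: (word,)}


print(shortest_lemma("painting", toy_lookup, toy_oov))  # paint
print(shortest_lemma("quilting", toy_lookup, toy_oov))  # quilt
```

One caveat of choosing by length: for an entry such as {'VERB': ('install', 'instal')}, the shortest candidate is the variant spelling 'instal'.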

inflection tool for other languages

Hi,

Thanks for providing this library.
Have you ever thought of implementing other languages?
If not, I'm currently looking for a tool that does Russian word inflection and I was wondering whether you know of any such tools?

Best,
Eva

Incorrect inflections of special adjectives like beautiful and handsome

Hi, thanks for building this amazing tool!
Currently, it doesn't seem to handle inflections of special adjectives like beautiful and handsome correctly.

Example:

from lemminflect import getLemma, getInflection

lemma = getLemma('beautiful', upos='ADJ')
inflection1 = getInflection(lemma[0], tag='JJR')
inflection2 = getInflection(lemma[0], tag='JJS')
print(inflection1, inflection2)

gives ('beautifuler',) and ('beautifulest',). It'd be great if lemminflect could output something like ('more', 'beautiful',) or ('more beautiful',)!
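
Until this is handled in the library, one possible workaround is a thin wrapper that returns the "more"/"most" form for adjectives the caller lists explicitly. This is only a sketch: the wrapper name, the caller-supplied set, and the toy inflection function are all assumptions made for illustration. In real use, getInflection could presumably be passed as the inflect argument.

```python
def inflect_adjective(lemma, tag, inflect, periphrastic):
    """Return a tuple of forms, using 'more'/'most' for listed adjectives."""
    markers = {"JJR": "more", "JJS": "most"}
    if tag in markers and lemma in periphrastic:
        return (f"{markers[tag]} {lemma}",)
    # Everything else is delegated to the supplied inflection function.
    return inflect(lemma, tag)


def toy_inflect(lemma, tag):
    """Toy stand-in that always appends a suffix (not the real library)."""
    suffix = {"JJR": "er", "JJS": "est"}.get(tag, "")
    return (lemma + suffix,)


special = {"beautiful", "handsome"}  # maintained by the caller
print(inflect_adjective("beautiful", "JJR", toy_inflect, special))  # ('more beautiful',)
print(inflect_adjective("tall", "JJS", toy_inflect, special))       # ('tallest',)
```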

"Haves"

>>> import lemminflect as li
>>> li.getInflection('have', 'VBZ')
('haves',)

Shouldn't that be has? What am I doing wrong?

Make spaCy integration explicit

import spacy
import lemminflect

results in the last import being reported as unused by linters.

import spacy
import lemminflect

lemminflect.extend_spacy()

or something similar would've been much better.

Incorrect inflections

[['somewhat', ####], ['somew', 'ADJ']],

[['his', ####], ['hi', 'PROPN'], ['hi', 'ADJ'], ['hi', 'ADV']],

[['her', ####], ['he', 'ADJ'], ['h', 'ADV']],

[['could', ####], ['coul', 'ADV']],

[['another', ####], ['anoth', 'ADJ'], ['anoth', 'ADV']],

[['question', ####], ['quest', 'ADV']],

[['vs', ####], ['v', 'NOUN'], ['v', 'PROPN'], ['v', 'VERB'], ['v', 'ADJ'], ['v', 'ADV']],

getAllInflections('arrive')
{'VBD': ('arrived',), 'VBG': ('arriving',), 'VBZ': ('arrives',), 'VB': ('arrive',), 'VBP': ('arrive',)}
