nihopa / nlpre
Python library for Natural Language Preprocessing (NLPre)
All classes should have a valid (and useful!) docstring. See
http://stackoverflow.com/a/24385103/249341
for some good examples. The "Google" format is a good one to use as a template.
The class parenthesis_nester in identify_parenthetical_phrases.py is almost identical to the code block used in remove_parenthesis.py. The class could be imported and used in remove_parenthesis to reduce repetition.
class replace_acronyms():
should be
class replace_acronyms(object):
We need a few unittests that take in a paragraph and run through all known functions of the library. It will be a pain to write the first ones, but it is essential that we make sure nothing goes wrong when we combine the functions.
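One way these integration tests could start is the sketch below. The chained callables here are stand-ins (plain string methods), not the real NLPre classes, which would be dropped in once the test harness works:

```python
import unittest

class PipelineSmokeTest(unittest.TestCase):
    # Stand-in callables; the real test would chain the actual NLPre
    # classes (dedash, titlecaps, separated_parenthesis, ...).
    parsers = [str.strip, str.lower]

    def test_chain(self):
        doc = '  Hello World. (It is a beautiful day.)  '
        for f in self.parsers:
            doc = f(doc)
        self.assertEqual(doc, 'hello world. (it is a beautiful day.)')
```

The point is less the assertions than exercising every combination of modules over one realistic paragraph.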
There is a weird bug that I'm tracking down now, which only shows up occasionally in the new replace_from_dictionary
code. That is, running tests gives inconsistent results (sometimes fails, sometimes works). The output from tox with some added debug statements:
replace_from_dict_tests.Replace_From_Dict_Test.hydroxyethylrutoside_test2 ...
MeSH_Hydroxyethylrutoside is great
0 MeSH_Hydroxyethylrutoside is great
FAIL
======================================================================
FAIL: replace_from_dict_tests.Replace_From_Dict_Test.hydroxyethylrutoside_test2
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/hoppeta/git-repo/NLPre/.tox/py27/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/home/hoppeta/git-repo/NLPre/tests/replace_from_dict_tests.py", line 30, in hydroxyethylrutoside_test2
assert_equal(doc_right, doc_new)
AssertionError: 'MeSH_Hydroxyethylrutoside is great' != '0 MeSH_Hydroxyethylrutoside is great'
----------------------------------------------------------------------
Ran 1 test in 0.024s
FAILED (failures=1)
ERROR: InvocationError: '/home/hoppeta/git-repo/NLPre/.tox/py27/bin/nosetests tests/replace_from_dict_tests.py:Replace_From_Dict_Test.hydroxyethylrutoside_test2 -vs'
________________________________________________ summary ________________________________________________
ERROR: py27: commands failed
It looks like the "0" is not being concatenated into the string.
Pattern is the only dependency keeping this library from being Python 3+ compatible.
This is a large traceback, but it's unfortunately all we get when running in parallel. It looks like the input to the function is truncated a bit too, so it's hard to tell what's going on.
/usr/local/lib/python2.7/dist-packages/joblib/parallel.py in __call__(self=<joblib.parallel.BatchedCalls object>)
67 def __init__(self, iterator_slice):
68 self.items = list(iterator_slice)
69 self._size = len(self.items)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func = <function dispatcher>
args = ({'_filename': 'data_import/RPG_2012.csv', '_ref': '50223', 'text': 'Dynamic Dopamine Images: A New View of the Neuro...vement will be greater in men than women. [[[ ]]]'},)
kwargs = {'target_column': 'text'}
self.items = [(<function dispatcher>, ({'_filename': 'data_import/RPG_2012.csv', '_ref': '50223', 'text': 'Dynamic Dopamine Images: A New View of the Neuro...vement will be greater in men than women. [[[ ]]]'},), {'target_column': 'text'})]
73
74 def __len__(self):
75 return self._size
76
...........................................................................
/home/hoppeta/test/word2vec_pipeline/word2vec_pipeline/parse.py in dispatcher(row={'_filename': 'data_import/RPG_2012.csv', '_ref': '50223', 'text': 'Dynamic Dopamine Images: A New View of the Neuro...vement will be greater in men than women. [[[ ]]]'}, target_column='text')
16
17 def dispatcher(row, target_column):
18 text = row[target_column] if target_column in row else None
19
20 for f in parser_functions:
---> 21 text = unicode(f(text))
text = u'Dynamic Dopamine Images : A New View of the Ne...will be greater in men than women .\n[ [ [ ] ] ]'
f = <nlpre.separated_parenthesis.separated_parenthesis object>
22
23 row[target_column] = text
24 return row
25
...........................................................................
/usr/local/lib/python2.7/dist-packages/nlpre/separated_parenthesis.py in __call__(self=<nlpre.separated_parenthesis.separated_parenthesis object>, text=[u'Based on findings with other stimuli , we hypo... involvement will be greater in men than women .', u'a .', u'b .'])
81
82 text = ' '.join(tokens)
83 doc_out.append(text)
84 else:
85
---> 86 text = self.paren_pop(tokens)
text = [u'Based on findings with other stimuli , we hypo... involvement will be greater in men than women .', u'a .', u'b .']
self.paren_pop = <bound method separated_parenthesis.paren_pop of...arated_parenthesis.separated_parenthesis object>>
tokens = ([([([([], {})], {})], {})], {})
87 doc_out.extend(text)
88
89 return '\n'.join(doc_out)
90
...........................................................................
/usr/local/lib/python2.7/dist-packages/nlpre/separated_parenthesis.py in paren_pop(self=<nlpre.separated_parenthesis.separated_parenthesis object>, parsed_tokens=[[[[]]]])
99 # must convert the ParseResult to a list, otherwise adding it to a list
100 # causes weird results.
101 if isinstance(parsed_tokens, pypar.ParseResults):
102 parsed_tokens = parsed_tokens.asList()
103
--> 104 content = self.paren_pop_helper(parsed_tokens)
content = undefined
self.paren_pop_helper = <bound method separated_parenthesis.paren_pop_he...arated_parenthesis.separated_parenthesis object>>
parsed_tokens = [[[[]]]]
105 return content
106
107 def paren_pop_helper(self, tokens):
108 '''
...........................................................................
/usr/local/lib/python2.7/dist-packages/nlpre/separated_parenthesis.py in paren_pop_helper(self=<nlpre.separated_parenthesis.separated_parenthesis object>, tokens=[[[]]])
134 reorged_tokens = []
135
136 # Iterate through all parenthetical content, recursing on them
137 # This allows content in nested parenthesis to be captured
138 for tokes in token_parens:
--> 139 sents = self.paren_pop_helper(tokes)
sents = undefined
self.paren_pop_helper = <bound method separated_parenthesis.paren_pop_he...arated_parenthesis.separated_parenthesis object>>
tokes = [[]]
140 self.logger.info('Expanded parenthetical content: %s' % sents)
141 reorged_tokens.extend(sents)
142
143 # Bundles outer sentence with inner parenthetical content
...........................................................................
/usr/local/lib/python2.7/dist-packages/nlpre/separated_parenthesis.py in paren_pop_helper(self=<nlpre.separated_parenthesis.separated_parenthesis object>, tokens=[])
124 new_tokens = []
125 token_words = [x for x in tokens if isinstance(x, six.string_types)]
126
127 # If tokens don't include parenthetical content, return as string
128 if len(token_words) == len(tokens):
--> 129 if token_words[-1] not in ['.', '!', '?']:
token_words = []
130 token_words.append('.')
131 return [' '.join(token_words)]
132 else:
133 token_parens = [x for x in tokens if isinstance(x, list)]
IndexError: list index out of range
___________________________________________________________________________
Fatal error: local() encountered an error (return code 1) while executing 'python word2vec_pipeline parse'
from nlpre import replace_acronyms
text = '''
BEACH (beige and Chediak Higashi) domain containing proteins (BDCPs) are a highly conserved protein family in eukaryotes.
'''
ABBR = { (('BEACH', 'domain', 'containing', 'proteins'), 'BDCPs'): 1}
P1 = replace_acronyms(ABBR)
print P1(text)
Traceback (most recent call last):
File "tx.py", line 31, in <module>
print P1(text)
File "/home/hoppeta/git-repo/NLPre/nlpre/replace_acronyms.py", line 229, in __call__
highest_phrase = '_'.join(highest_phrase)
TypeError: sequence item 1: expected string, ParseResults found
A large part of the text processing time is still spent replacing keywords; examine the use of FlashText:
In this paper we introduce, the FlashText algorithm for replacing keywords or finding keywords in a given text. FlashText can search or replace keywords in one pass over a document. The time complexity of this algorithm is not dependent on the number of terms being searched or replaced. For a document of size N (characters) and a dictionary of M keywords, the time complexity will be O(N). This algorithm is much faster than Regex, because regex time complexity is O(MxN). It is also different from Aho Corasick Algorithm, as it doesn't match substrings. FlashText is designed to only match complete words (words with boundary characters on both sides). For an input dictionary of {Apple}, this algorithm won't match it to 'I like Pineapple'. This algorithm is also designed to go for the longest match first. For an input dictionary {Machine, Learning, Machine learning} on a string 'I like Machine learning', it will only consider the longest match, which is Machine Learning. We have made python implementation of this algorithm available as open-source on GitHub, released under the permissive MIT License.
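The FlashText idea can be illustrated with a small self-contained sketch (an educational toy, not the real flashtext package): a character trie is walked once over the document, matches only start and end at word boundaries, and the longest match wins:

```python
class FlashReplacer:
    # Toy sketch of the FlashText idea: one pass over the text, whole-word
    # matching, longest match first. Not the flashtext library API.
    END = '__end__'

    def __init__(self, mapping):
        self.trie = {}
        for phrase, repl in mapping.items():
            node = self.trie
            for ch in phrase:
                node = node.setdefault(ch, {})
            node[self.END] = repl

    def replace(self, text):
        out, i, n = [], 0, len(text)
        while i < n:
            # Only start a match at a word boundary
            if i == 0 or not text[i - 1].isalnum():
                node, j, last = self.trie, i, None
                while j < n and text[j] in node:
                    node = node[text[j]]
                    j += 1
                    # Record a match only if it ends on a word boundary;
                    # keep walking in case a longer match exists.
                    if self.END in node and (j == n or not text[j].isalnum()):
                        last = (j, node[self.END])
                if last:
                    out.append(last[1])
                    i = last[0]
                    continue
            out.append(text[i])
            i += 1
        return ''.join(out)
```

For example, with a dictionary containing both "Machine" and "Machine learning", only the longest match is replaced, and "Apple" never matches inside "Pineapple".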
Longer biomedical texts include references which often are concatenated with regular text. This module aims to either remove or partition out the references. For example
... key feature in Drosophila3-5 and elegans(7).
... key feature in Drosophila and elegans.
Add more examples as comments to this issue as they are identified.
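A first-pass sketch of what such a module could look like. The function name and both regexes are assumptions, not NLPre code; they cover only the two example patterns shown above (citation numbers fused to a word, and bare parenthesized numbers):

```python
import re

# Citation numbers glued to the end of a word: Drosophila3-5 -> Drosophila
REF_SUPERSCRIPT = re.compile(r'(?<=[a-zA-Z])\d+(?:-\d+)?\b')
# Bare parenthesized citation numbers: (7), (3, 5) -> removed
REF_PAREN = re.compile(r'\(\d+(?:[,-]\s*\d+)*\)')

def strip_references(text):
    text = REF_SUPERSCRIPT.sub('', text)
    return REF_PAREN.sub('', text)
```

Partitioning the references out instead of deleting them would just mean collecting the matches before substitution.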
Import this from the pipeline_word2vec module.
Most of the classes take 'doc' as their input variable. However, a couple of them take 'text' or 'org_doc' instead. Since they all take the same kind of input (a document that is a string), I think we should standardize the input variable to reduce confusion.
Sometimes there are corrupted texts on iSearch, where every word in the document is corrupted. See Appl ID 7889277 for an example. I think this can be pretty easily solved by checking every word in a document against a dictionary; if a certain percentage aren't listed, you can assume the text is corrupted.
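A minimal sketch of that check (the function name and the 0.5 threshold are made up for illustration):

```python
def looks_corrupted(doc, vocabulary, threshold=0.5):
    # Flag the document when too small a fraction of its words appear
    # in a known dictionary. `vocabulary` is any set-like of lowercase words.
    words = [w.strip('.,;:!?()').lower() for w in doc.split()]
    words = [w for w in words if w]
    if not words:
        return True
    known = sum(w in vocabulary for w in words)
    return known / len(words) < threshold
```

The right threshold would need tuning against real iSearch documents, since biomedical jargon will legitimately miss a general-purpose dictionary.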
"remove_parenthesis.py" isn't an accurate description of what this module now does, since it now expands parenthetical content rather than delete it. Does "expand_parenthetical_content.py" work as a name?
Similarly, I'd like to rename "replace_from_dict.py" to "replace_from_mesh_dict.py"
These changes need to be reflected in the init file and the README--is there anywhere else they need to be reflected?
Instead of deleting all of the content in a parenthesis, I want to treat it as a new sentence to be parsed. This should work for nested parentheses. I.e., "AB.C(DE.FG)H" would be processed as "AB.CH.DE.FG".
from nlpre import pos_tokenizer as parser
text = '''We find the answer is "not quite".'''
P = parser(POS_blacklist= ['connector', 'cardinal', 'pronoun', 'symbol', 'punctuation', 'modal_verb', 'adverb', 'verb', 'w_word'])
print P(text)
Traceback (most recent call last):
File "tx.py", line 20, in <module>
print P(text)
File "/home/hoppeta/git-repo/NLPre/nlpre/pos_tokenizer.py", line 117, in __call__
pos = self.POS_map[tag]
KeyError: u'"'
A minor problem I've noticed when using pattern to parse sentences of tokens. When I used tokens as representations of words (i.e., "A B C (D E) F. G"), the token sentences weren't being properly split on periods. "F." is being treated as either an abbreviation or a match in the code block below. I'm not sure if we will ever encounter this in production, but it's worth noting:
if t.endswith("."):
if t in abbreviations or \
RE_ABBR1.match(t) is not None or \
RE_ABBR2.match(t) is not None or \
RE_ABBR3.match(t) is not None:
from nlpre import separated_parenthesis
text = ('''Superoxide anion (A[B?])''')
print separated_parenthesis()(text)
Traceback (most recent call last):
File "tx.py", line 11, in <module>
print separated_parenthesis()(text)
File "/usr/local/lib/python2.7/dist-packages/nlpre/separated_parenthesis.py", line 68, in __call__
tokens = self.grammar.grammar.parseString(sent)
File "/usr/local/lib/python2.7/dist-packages/pyparsing.py", line 1617, in parseString
raise exc
pyparsing.ParseException: Expected {W:(0123...) | nested () expression | nested [] expression | nested {} expression} (at char 0), (line:1, col:1)
In the new module (from #11), please add unit tests. I think there will be a lot of cases that won't work that should (hence the tests!). Example of a known issue, "and":
The Environmental Protection Agency (EPA) is not a government organization (GO) of
Health and Human Services (HHS).
Finds only
Counter({(('government', 'organization'), 'GO'): 1, (('Environmental', 'Protection', 'Agency'), 'EPA'): 1})
This is probably due to them being stripped out in the pipeline before. This should be an easy fix as we just need to add POS to the list of tags.
from nlpre import pos_tokenizer as parser
text = ("fins's")
print parser([])(text)
Traceback (most recent call last):
File "tx.py", line 12, in <module>
print parser(blacklist)(text)
File "/usr/local/lib/python2.7/dist-packages/nlpre/pos_tokenizer.py", line 113, in __call__
pos = self.POS_map[tag]
KeyError: u'POS'
Is it possible to have the logging options set globally? As far as I can tell, setting
logging.basicConfig(level=logging.ERROR)
should silence the logger, but it currently does not. We should probably make the default logging less verbose as well.
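The conventional fix for a library (sketched here; this is not current NLPre behavior) is to hang every module's logger under one namespace with a NullHandler, so applications control verbosity globally:

```python
import logging

# In the package __init__: a single 'nlpre' namespace logger that emits
# nothing by default. Each module then uses
# logging.getLogger('nlpre.<module>'), and a user can silence or enable
# the whole library with logging.getLogger('nlpre').setLevel(...).
logger = logging.getLogger('nlpre')
logger.addHandler(logging.NullHandler())
```

This also explains why basicConfig appears not to work: if modules attach their own handlers, records are emitted regardless of the root configuration.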
In production, "adjectives" was accidentally used instead of "adjective". This needs to throw an error.
Many of the docstrings are incomplete and/or in the wrong place. For example, some docstrings are at the end of a function, which doesn't trigger when help(function) is called. Additionally, some of the minor functions do not follow the standard convention of describing the function in the class def, then the input rules for __init__ and __call__.
New module to strip (or replace) hyperlinks in documents. Hyperlink detection should work even if the link is in parentheses like (www.google.com) or [https:XXX].
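A hypothetical starting point for the module (the pattern and function name are assumptions): the character class deliberately stops at whitespace, parentheses, and brackets, so links wrapped in (…) or […] are caught without eating the closing delimiter:

```python
import re

# Matches http:, https:, or www.-style links, stopping at whitespace
# and at ( ) [ ] so surrounding delimiters survive.
LINK = re.compile(r'(?:https?:|www\.)[^\s()\[\]]+')

def strip_hyperlinks(text, replacement=''):
    return LINK.sub(replacement, text)
```

Trailing punctuation glued to a bare link (e.g. a sentence-final period) would need extra handling; this sketch ignores that.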
For instance, the sentence "Hello world. (Good Evening)(goodbye)" will crash the module. I can't imagine we'd ever run into this unless there was an error transcribing documents, but I hit this when my OBSSR LDA reprocessing crashed.
I'm committing and pushing a test case and fix to the parens debugging branch.
Somehow, the combination of these two sentences causes the parser to fail here. Taking away either one will allow it to run.
text = '''
A large region upstream (~30 kb) of GATA 4.
Small interfering RNA (siRNA) mediated depletion of EZH2.
'''
from nlpre import replace_acronyms as parser
print parser({})(text)
Traceback (most recent call last):
File "tx.py", line 15, in <module>
print parser({})(text)
File "/usr/local/lib/python2.7/dist-packages/nlpre/replace_acronyms.py", line 200, in __call__
doc_counter = self.IPP(document)
File "/usr/local/lib/python2.7/dist-packages/nlpre/identify_parenthetical_phrases.py", line 41, in __call__
subtokens = self._check_matching(word, k, tokens)
File "/usr/local/lib/python2.7/dist-packages/nlpre/identify_parenthetical_phrases.py", line 129, in _check_matching
if let not in tokens_to_remove]
TypeError: 'str' object is not callable
In the code, there are instances of print statements that should be properly handled with logging. Most of these can be handled with logging.info.
Add reasonable unittests to unidecoder. Greek letters are the most common, but it might be worth checking a few other diacritics like:
α-Helix β-sheet Αα Νν Ββ Ξξ Γγ Οο
Lëtzebuergesch
vóórkomen
perispōménē
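For the diacritic cases, a stdlib-only baseline is worth noting (a sketch, not the unidecoder implementation): NFKD normalization plus an ASCII encode strips combining accents, but it silently drops Greek letters rather than transliterating them, which is exactly why the module needs the unidecode package:

```python
import unicodedata

def ascii_fold(text):
    # Decompose accented characters, then drop anything non-ASCII.
    # Handles vóórkomen -> voorkomen, but DROPS α, β, ... entirely,
    # unlike unidecode which transliterates them (β -> b).
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('ascii'))
```

Unit tests comparing this baseline against unidecode on the examples above would pin down exactly which behavior the library wants.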
Currently in replace_acronyms,
non-Hodgkin lymphoma (NHL)
gets turned into
non-Hodgkin lymphoma ( non_Hodgkin_lymphoma )
but it would be ideal to turn it into
non-Hodgkin_lymphoma ( non_Hodgkin_lymphoma )
otherwise downstream parsers (like replace_from_dict) can mangle this.
From the original w2v pipeline, there was replacement code to handle the phrases found in parenthesis. This needs to be imported in.
Just noticed this as I was leaving. If a single sentence is put in parenthesis, the parenthesis are not being deleted properly. For instance, "Hello world. (It is a beautiful day.) Goodbye world" is split into three sentences: 'Hello world.' , '(It is a beautiful day.)', 'Goodbye world'.
I would expect it to split into: 'Hello world.', '(It is a beautiful day.', ') Goodbye world'. I'm not sure why it's not.
In replace_from_dict.py, we have a double for loop that iterates through sent in sentences and then word in keywords. In the inner loop we split the sent into tokens - I believe this can be done in the outer for loop instead.
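A sketch of the refactor (the function and its replacement semantics are simplified stand-ins for the real replace_from_dict code): the sentence is split into tokens once, in the outer loop, instead of being re-split for every keyword:

```python
def replace_keywords(sentences, keywords, prefix='MeSH_'):
    out = []
    for sent in sentences:
        tokens = sent.split()          # hoisted out of the inner loop
        for word in keywords:
            tokens = [prefix + t if t == word else t for t in tokens]
        out.append(' '.join(tokens))
    return out
```

For S sentences and K keywords this removes K-1 redundant tokenizations per sentence.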
Given that there is text
The etiology of osteoarthritis (OA) is at present unknown. ... which are associated with primary OA utilizing the over 35 families ...
And a prior abbreviation of OA -> Ocean acidification
the unmatched instances of OA will be replaced with "Ocean acidification", which is incorrect. We need to skip the replacement when the abbreviation cannot be matched to its definition in the document.
There is no compelling reason to require pandas in the library; it was mostly used as a convenience. Remove it and replace it with the csv module. This will also speed up testing, as pandas is a huge library to install.
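The replacement is small; something like this sketch covers the column-reading convenience pandas was providing (the function name and 'text' default are assumptions about the call sites):

```python
import csv

def iter_column(lines, column='text'):
    # `lines` is any iterable of CSV lines (an open file works).
    # Yields one cell per row from the named column, lazily.
    for row in csv.DictReader(lines):
        yield row[column]
```

Since csv is in the standard library, the tox environments no longer need to build pandas at all.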
Currently the code only checks whether a period already exists at the end of parenthetical content. Check whether other punctuation exists as well (i.e., question marks and exclamation points).
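The check reduces to a small helper, sketched here (the name is hypothetical). It also guards the empty-token-list case that produced the IndexError in the traceback above:

```python
SENT_END = ('.', '!', '?')

def ensure_terminal_punct(tokens):
    # Append a period only when no sentence-ending punctuation of any
    # kind (not just '.') already closes the content; an empty token
    # list gets a lone period instead of raising IndexError on [-1].
    if not tokens or tokens[-1] not in SENT_END:
        return tokens + ['.']
    return list(tokens)
```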
We should have a logger catch when a module fails. My idea is to wrap every module in a try/except statement, unless there's a more elegant way to do it.
The only issue is that this will reduce coverage, unless we can create failing tests for every module. I'm not sure if I can create tests for every module that will cause them to fail. If I found a bug, I'd just fix it. However, it seems important to have a try/except statement to catch errors that we haven't accounted for.
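One shape this could take (a sketch, not a decided design) is a decorator that logs the failure and re-raises, so the exception still surfaces and the happy-path coverage is untouched:

```python
import functools
import logging

def log_failures(func):
    # Wraps a module's __call__ (or any function): on any exception,
    # log the full traceback under the function's module logger, then
    # re-raise so callers still see the error.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logging.getLogger(func.__module__).exception('%s failed', func.__name__)
            raise
    return wrapper
```

Because it re-raises, existing tests keep failing loudly; the decorator only adds context, which sidesteps the coverage concern above.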
The titlecaps class calls sentence_tokenizer from tokenizers.py to split the string into sentences. However, sentence_tokenizer returns the sentences as unicode strings, which breaks other modules downstream. Should I re-run unidecoder after titlecaps, or edit sentence_tokenizer?
Even though speed isn't our top concern, replace_from_dictionary is orders of magnitude slower than most functions.
function                          time      frac
unidecoder                        0.000008  0.000018
token_replacement                 0.000010  0.000022
dedash                            0.000535  0.001172
titlecaps                         0.003216  0.007043
decaps_text                       0.003802  0.008327
identify_parenthetical_phrases    0.009862  0.021598
replace_acronyms                  0.012591  0.027574
separated_parenthesis             0.013224  0.028960
pos_tokenizer                     0.068994  0.151094
replace_from_dictionary           0.344384  0.754191
Parenthetical content is now expanded rather than deleted. That is, "ABC(DE(FG)H)I" is now processed as "ABCI.DEH.FG". However, the parser will still split all sentences on periods, regardless of whether those periods are in parenthetical content. That is, "AB(CD.EF)G" is processed as "ABCD.EFG". It should instead be processed as "ABG.CD.EF".
Now that coverage is near or at 100%, we need to add it to the test suite. Part of the problem is the failing of the command:
coverage run --source nlpre setup.py test
Which should work, but fails with
ImportError: No module named tests
Coverage.py warning: No data was collected.
Once this is fixed, we can add a badge with coveralls.
A helper function to take in documents and process them in a pipeline using joblib would be a useful addition.
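A stdlib sketch of the shape such a helper could take. The issue suggests joblib; this version uses threads from concurrent.futures instead, so the parser callables need not be picklable (all names here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(doc, parsers):
    # Apply each parser callable in order, as the w2v pipeline does.
    for f in parsers:
        doc = f(doc)
    return doc

def parallel_parse(docs, parsers, workers=4):
    # Fan the documents out across a worker pool; joblib.Parallel with
    # delayed(run_pipeline) would be a drop-in alternative.
    with ThreadPoolExecutor(workers) as pool:
        return list(pool.map(lambda d: run_pipeline(d, parsers), docs))
```

For CPU-bound parsers a process pool (or joblib's default backend) would parallelize better; threads keep the sketch simple.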
Offending code is here:
69 for word, i, j in keywords:
70 if n < i:
71 tokens.append(doc[n:i])
72 tokens.append(self.prefix+word)
73 n = j
---> 74 tokens.append(doc[j:len(doc)])
UnboundLocalError: local variable 'j' referenced before assignment
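The error means `keywords` was empty, so the loop body never ran and `j` was never bound. A sketch of the fix (function name and prefix are illustrative): track a cursor `n` initialized before the loop and slice from it afterwards, instead of reusing the loop variable:

```python
def tokenize_with_prefix(doc, keywords, prefix='MeSH_'):
    # keywords: iterable of (word, start, end) spans within doc.
    tokens, n = [], 0
    for word, i, j in keywords:
        if n < i:
            tokens.append(doc[n:i])
        tokens.append(prefix + word)
        n = j
    # Slice from the cursor, not from `j`: safe when keywords is empty.
    tokens.append(doc[n:])
    return tokens
```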
Should we add periods to the end of parenthetical sentence content? I.e., do we want "Hello (hello world1) world2. Hello world3." -> "Hello world2 .\nHello world1 .\nHello world3." ? Currently parenthetical content has the newline appended correctly, but we don't manually append a period. It seems like we should if these modules are meant to be modular.
There still look to be avenues for speed improvements in replace_from_dictionary, which is still the slowest part of the pipeline. Look into this.
Add unidecode as a library function. Unit tests will be a little bit tricky, since we need to input UTF-8 strings in the test. Example: β-hairpin becomes "b-hairpin".
Have an option to remove the content found in separated_parenthesis.py. If the value is set to None (which should be the default), all content is removed. If set to 0, all content is kept (current behavior). If the value is n, the partial sentence is retained only if at least n tokens are in the parenthetical content.
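The switch itself is a one-liner worth pinning down before wiring it into the class (names here are assumptions about the eventual API):

```python
def keep_parenthetical(tokens, min_tokens=None):
    # Proposed semantics:
    #   None  -> drop all parenthetical content (proposed default)
    #   0     -> keep everything (current behavior)
    #   n > 0 -> keep only content with at least n tokens
    if min_tokens is None:
        return False
    if min_tokens == 0:
        return True
    return len(tokens) >= min_tokens
```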
This is the code to remove possessive splits:
# Remove possesive splits
doc = doc.replace(" 's ", ' ')
This will only replace the possessive 's if there is a space before it, which I don't think would ever happen.
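A regex would catch the marker with or without the leading space (a sketch; note it would also catch contractions like "it's", which is an assumption about what runs before this step):

```python
import re

def strip_possessive(doc):
    # Remove 's whether tokenization inserted a space before it or not.
    return re.sub(r"\s*'s(?=\s|$)", '', doc)
```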
Right now the README is a bit of a mess; work on cleaning it up and getting it ready for a submission to PyPI.