nihopa / nlpre
Python library for Natural Language Preprocessing (NLPre)
All classes should have a valid (and useful!) docstring. See
http://stackoverflow.com/a/24385103/249341
for some good examples. The "Google" format is a good one to use as a template.
The class parenthesis_nester in identify_parenthetical_phrases.py is almost identical to the code block used in remove_parenthesis.py. The class could be imported and used in remove_parenthesis to reduce repetition.
class replace_acronyms():
should be
class replace_acronyms(object):
We need a few unittests that take in a paragraph and run through all known functions of the library. It will be a pain to write the first ones, but it is essential that we make sure nothing goes wrong when we combine the functions.
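One way these integration tests could start is the sketch below. The chained callables here are stand-ins (plain string methods), not the real NLPre classes, which would be dropped in once the test harness works:

```python
import unittest

class PipelineSmokeTest(unittest.TestCase):
    # Stand-in callables; the real test would chain the actual NLPre
    # classes (dedash, titlecaps, separated_parenthesis, ...).
    parsers = [str.strip, str.lower]

    def test_chain(self):
        doc = '  Hello World. (It is a beautiful day.)  '
        for f in self.parsers:
            doc = f(doc)
        self.assertEqual(doc, 'hello world. (it is a beautiful day.)')
```

The point is less the assertions than exercising every combination of modules over one realistic paragraph.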
There is a weird bug that I'm tracking down now, which only shows up occasionally in the new replace_from_dictionary
code. That is, running tests gives inconsistent results (sometimes fails, sometimes works). The output from tox with some added debug statements:
replace_from_dict_tests.Replace_From_Dict_Test.hydroxyethylrutoside_test2 ...
MeSH_Hydroxyethylrutoside is great
0 MeSH_Hydroxyethylrutoside is great
FAIL
======================================================================
FAIL: replace_from_dict_tests.Replace_From_Dict_Test.hydroxyethylrutoside_test2
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/hoppeta/git-repo/NLPre/.tox/py27/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/home/hoppeta/git-repo/NLPre/tests/replace_from_dict_tests.py", line 30, in hydroxyethylrutoside_test2
assert_equal(doc_right, doc_new)
AssertionError: 'MeSH_Hydroxyethylrutoside is great' != '0 MeSH_Hydroxyethylrutoside is great'
----------------------------------------------------------------------
Ran 1 test in 0.024s
FAILED (failures=1)
ERROR: InvocationError: '/home/hoppeta/git-repo/NLPre/.tox/py27/bin/nosetests tests/replace_from_dict_tests.py:Replace_From_Dict_Test.hydroxyethylrutoside_test2 -vs'
________________________________________________ summary ________________________________________________
ERROR: py27: commands failed
It looks like the "0" is not being concatenated into the string.
Pattern is the only dependency keeping this library from being Python 3+ compatible.
This is a large traceback, but it's unfortunately all we get when running in parallel. It looks like the input to the function is truncated a bit too, so it's hard to tell what's going on.
/usr/local/lib/python2.7/dist-packages/joblib/parallel.py in __call__(self=<joblib.parallel.BatchedCalls object>)
67 def __init__(self, iterator_slice):
68 self.items = list(iterator_slice)
69 self._size = len(self.items)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func = <function dispatcher>
args = ({'_filename': 'data_import/RPG_2012.csv', '_ref': '50223', 'text': 'Dynamic Dopamine Images: A New View of the Neuro...vement will be greater in men than women. [[[ ]]]'},)
kwargs = {'target_column': 'text'}
self.items = [(<function dispatcher>, ({'_filename': 'data_import/RPG_2012.csv', '_ref': '50223', 'text': 'Dynamic Dopamine Images: A New View of the Neuro...vement will be greater in men than women. [[[ ]]]'},), {'target_column': 'text'})]
73
74 def __len__(self):
75 return self._size
76
...........................................................................
/home/hoppeta/test/word2vec_pipeline/word2vec_pipeline/parse.py in dispatcher(row={'_filename': 'data_import/RPG_2012.csv', '_ref': '50223', 'text': 'Dynamic Dopamine Images: A New View of the Neuro...vement will be greater in men than women. [[[ ]]]'}, target_column='text')
16
17 def dispatcher(row, target_column):
18 text = row[target_column] if target_column in row else None
19
20 for f in parser_functions:
---> 21 text = unicode(f(text))
text = u'Dynamic Dopamine Images : A New View of the Ne...will be greater in men than women .\n[ [ [ ] ] ]'
f = <nlpre.separated_parenthesis.separated_parenthesis object>
22
23 row[target_column] = text
24 return row
25
...........................................................................
/usr/local/lib/python2.7/dist-packages/nlpre/separated_parenthesis.py in __call__(self=<nlpre.separated_parenthesis.separated_parenthesis object>, text=[u'Based on findings with other stimuli , we hypo... involvement will be greater in men than women .', u'a .', u'b .'])
81
82 text = ' '.join(tokens)
83 doc_out.append(text)
84 else:
85
---> 86 text = self.paren_pop(tokens)
text = [u'Based on findings with other stimuli , we hypo... involvement will be greater in men than women .', u'a .', u'b .']
self.paren_pop = <bound method separated_parenthesis.paren_pop of...arated_parenthesis.separated_parenthesis object>>
tokens = ([([([([], {})], {})], {})], {})
87 doc_out.extend(text)
88
89 return '\n'.join(doc_out)
90
...........................................................................
/usr/local/lib/python2.7/dist-packages/nlpre/separated_parenthesis.py in paren_pop(self=<nlpre.separated_parenthesis.separated_parenthesis object>, parsed_tokens=[[[[]]]])
99 # must convert the ParseResult to a list, otherwise adding it to a list
100 # causes weird results.
101 if isinstance(parsed_tokens, pypar.ParseResults):
102 parsed_tokens = parsed_tokens.asList()
103
--> 104 content = self.paren_pop_helper(parsed_tokens)
content = undefined
self.paren_pop_helper = <bound method separated_parenthesis.paren_pop_he...arated_parenthesis.separated_parenthesis object>>
parsed_tokens = [[[[]]]]
105 return content
106
107 def paren_pop_helper(self, tokens):
108 '''
...........................................................................
/usr/local/lib/python2.7/dist-packages/nlpre/separated_parenthesis.py in paren_pop_helper(self=<nlpre.separated_parenthesis.separated_parenthesis object>, tokens=[[[]]])
134 reorged_tokens = []
135
136 # Iterate through all parenthetical content, recursing on them
137 # This allows content in nested parenthesis to be captured
138 for tokes in token_parens:
--> 139 sents = self.paren_pop_helper(tokes)
sents = undefined
self.paren_pop_helper = <bound method separated_parenthesis.paren_pop_he...arated_parenthesis.separated_parenthesis object>>
tokes = [[]]
140 self.logger.info('Expanded parenthetical content: %s' % sents)
141 reorged_tokens.extend(sents)
142
143 # Bundles outer sentence with inner parenthetical content
...........................................................................
/usr/local/lib/python2.7/dist-packages/nlpre/separated_parenthesis.py in paren_pop_helper(self=<nlpre.separated_parenthesis.separated_parenthesis object>, tokens=[])
124 new_tokens = []
125 token_words = [x for x in tokens if isinstance(x, six.string_types)]
126
127 # If tokens don't include parenthetical content, return as string
128 if len(token_words) == len(tokens):
--> 129 if token_words[-1] not in ['.', '!', '?']:
token_words = []
130 token_words.append('.')
131 return [' '.join(token_words)]
132 else:
133 token_parens = [x for x in tokens if isinstance(x, list)]
IndexError: list index out of range
___________________________________________________________________________
Fatal error: local() encountered an error (return code 1) while executing 'python word2vec_pipeline parse'
from nlpre import replace_acronyms
text = '''
BEACH (beige and Chediak Higashi) domain containing proteins (BDCPs) are a highly conserved protein family in eukaryotes.
'''
ABBR = { (('BEACH', 'domain', 'containing', 'proteins'), 'BDCPs'): 1}
P1 = replace_acronyms(ABBR)
print P1(text)
Traceback (most recent call last):
File "tx.py", line 31, in <module>
print P1(text)
File "/home/hoppeta/git-repo/NLPre/nlpre/replace_acronyms.py", line 229, in __call__
highest_phrase = '_'.join(highest_phrase)
TypeError: sequence item 1: expected string, ParseResults found
A large part of the text processing time is still spent replacing keywords; examine the use of FlashText:
In this paper we introduce, the FlashText algorithm for replacing keywords or finding keywords in a given text. FlashText can search or replace keywords in one pass over a document. The time complexity of this algorithm is not dependent on the number of terms being searched or replaced. For a document of size N (characters) and a dictionary of M keywords, the time complexity will be O(N). This algorithm is much faster than Regex, because regex time complexity is O(MxN). It is also different from Aho Corasick Algorithm, as it doesn't match substrings. FlashText is designed to only match complete words (words with boundary characters on both sides). For an input dictionary of {Apple}, this algorithm won't match it to 'I like Pineapple'. This algorithm is also designed to go for the longest match first. For an input dictionary {Machine, Learning, Machine learning} on a string 'I like Machine learning', it will only consider the longest match, which is Machine Learning. We have made python implementation of this algorithm available as open-source on GitHub, released under the permissive MIT License.
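The FlashText idea can be illustrated with a small self-contained sketch (an educational toy, not the real flashtext package): a character trie is walked once over the document, matches only start and end at word boundaries, and the longest match wins:

```python
class FlashReplacer:
    # Toy sketch of the FlashText idea: one pass over the text, whole-word
    # matching, longest match first. Not the flashtext library API.
    END = '__end__'

    def __init__(self, mapping):
        self.trie = {}
        for phrase, repl in mapping.items():
            node = self.trie
            for ch in phrase:
                node = node.setdefault(ch, {})
            node[self.END] = repl

    def replace(self, text):
        out, i, n = [], 0, len(text)
        while i < n:
            # Only start a match at a word boundary
            if i == 0 or not text[i - 1].isalnum():
                node, j, last = self.trie, i, None
                while j < n and text[j] in node:
                    node = node[text[j]]
                    j += 1
                    # Record a match only if it ends on a word boundary;
                    # keep walking in case a longer match exists.
                    if self.END in node and (j == n or not text[j].isalnum()):
                        last = (j, node[self.END])
                if last:
                    out.append(last[1])
                    i = last[0]
                    continue
            out.append(text[i])
            i += 1
        return ''.join(out)
```

For example, with a dictionary containing both "Machine" and "Machine learning", only the longest match is replaced, and "Apple" never matches inside "Pineapple".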
Longer biomedical texts include references which often are concatenated with regular text. This module aims to either remove or partition out the references. For example
... key feature in Drosophila3-5 and elegans(7).
... key feature in Drosophila and elegans.
Add more examples as comments to this issue as they are identified.
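A first-pass sketch of what such a module could look like. The function name and both regexes are assumptions, not NLPre code; they cover only the two example patterns shown above (citation numbers fused to a word, and bare parenthesized numbers):

```python
import re

# Citation numbers glued to the end of a word: Drosophila3-5 -> Drosophila
REF_SUPERSCRIPT = re.compile(r'(?<=[a-zA-Z])\d+(?:-\d+)?\b')
# Bare parenthesized citation numbers: (7), (3, 5) -> removed
REF_PAREN = re.compile(r'\(\d+(?:[,-]\s*\d+)*\)')

def strip_references(text):
    text = REF_SUPERSCRIPT.sub('', text)
    return REF_PAREN.sub('', text)
```

Partitioning the references out instead of deleting them would just mean collecting the matches before substitution.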
Import this from the pipeline_word2vec module.
Most of the classes take 'doc' as their input variable. However, a couple of them take 'text' or 'org_doc' instead. Since they all take the same kind of input (a document that is a string), I think we should standardize the input variable to reduce confusion.
Sometimes there are corrupted texts on iSearch, where every word in the document is corrupted. See Appl ID 7889277 for an example. I think this can be pretty easily solved by checking every word in a document against a dictionary; if a certain percentage aren't listed, you can assume the text is corrupted.
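A minimal sketch of that check (the function name and the 0.5 threshold are made up for illustration):

```python
def looks_corrupted(doc, vocabulary, threshold=0.5):
    # Flag the document when too small a fraction of its words appear
    # in a known dictionary. `vocabulary` is any set-like of lowercase words.
    words = [w.strip('.,;:!?()').lower() for w in doc.split()]
    words = [w for w in words if w]
    if not words:
        return True
    known = sum(w in vocabulary for w in words)
    return known / len(words) < threshold
```

The right threshold would need tuning against real iSearch documents, since biomedical jargon will legitimately miss a general-purpose dictionary.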
"remove_parenthesis.py" isn't an accurate description of what this module now does, since it now expands parenthetical content rather than delete it. Does "expand_parenthetical_content.py" work as a name?
Similarly, I'd like to rename "replace_from_dict.py" to "replace_from_mesh_dict.py"
These changes need to be reflected in the init file and the README--is there anywhere else they need to be reflected?
Instead of deleting all of the content in a parenthesis, I want to treat it as a new sentence to be parsed. This should work for nested parentheses. I.e., "AB.C(DE.FG)H" would be processed as "AB.CH.DE.FG".
from nlpre import pos_tokenizer as parser
text = '''We find the answer is "not quite".'''
P = parser(POS_blacklist= ['connector', 'cardinal', 'pronoun', 'symbol', 'punctuation', 'modal_verb', 'adverb', 'verb', 'w_word'])
print P(text)
Traceback (most recent call last):
File "tx.py", line 20, in <module>
print P(text)
File "/home/hoppeta/git-repo/NLPre/nlpre/pos_tokenizer.py", line 117, in __call__
pos = self.POS_map[tag]
KeyError: u'"'
A minor problem I've noticed when using pattern to parse sentences of tokens. When I used tokens as representations of words (i.e., "A B C (D E) F. G"), the token sentences weren't being properly split on periods. "F." is being treated as either an abbreviation or a match in the code block below. I'm not sure if we will ever encounter this in production, but it's worth noting:
if t.endswith("."):
if t in abbreviations or \
RE_ABBR1.match(t) is not None or \
RE_ABBR2.match(t) is not None or \
RE_ABBR3.match(t) is not None:
from nlpre import separated_parenthesis
text = ('''Superoxide anion (A[B?])''')
print separated_parenthesis()(text)
Traceback (most recent call last):
File "tx.py", line 11, in <module>
print separated_parenthesis()(text)
File "/usr/local/lib/python2.7/dist-packages/nlpre/separated_parenthesis.py", line 68, in __call__
tokens = self.grammar.grammar.parseString(sent)
File "/usr/local/lib/python2.7/dist-packages/pyparsing.py", line 1617, in parseString
raise exc
pyparsing.ParseException: Expected {W:(0123...) | nested () expression | nested [] expression | nested {} expression} (at char 0), (line:1, col:1)
In the new module (from #11), please add unit tests. I think there will be a lot of cases that won't work that should (hence the tests!). Example of a known issue, "and":
The Environmental Protection Agency (EPA) is not a government organization (GO) of
Health and Human Services (HHS).
Finds only
Counter({(('government', 'organization'), 'GO'): 1, (('Environmental', 'Protection', 'Agency'), 'EPA'): 1})
This is probably due to them being stripped out in the pipeline before. This should be an easy fix as we just need to add POS to the list of tags.
from nlpre import pos_tokenizer as parser
text = ("fins's")
print parser([])(text)
Traceback (most recent call last):
File "tx.py", line 12, in <module>
print parser(blacklist)(text)
File "/usr/local/lib/python2.7/dist-packages/nlpre/pos_tokenizer.py", line 113, in __call__
pos = self.POS_map[tag]
KeyError: u'POS'
Is it possible to have the logging options set globally? As far as I can tell, setting
logging.basicConfig(level=logging.ERROR)
should silence the logger, but it currently does not. We should probably make the default logging less verbose as well.
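The conventional fix for a library (sketched here; this is not current NLPre behavior) is to hang every module's logger under one namespace with a NullHandler, so applications control verbosity globally:

```python
import logging

# In the package __init__: a single 'nlpre' namespace logger that emits
# nothing by default. Each module then uses
# logging.getLogger('nlpre.<module>'), and a user can silence or enable
# the whole library with logging.getLogger('nlpre').setLevel(...).
logger = logging.getLogger('nlpre')
logger.addHandler(logging.NullHandler())
```

This also explains why basicConfig appears not to work: if modules attach their own handlers, records are emitted regardless of the root configuration.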
In production, "adjectives" was accidentally used instead of "adjective". This needs to throw an error.
Many of the docstrings are incomplete and/or in the wrong place. For example, some docstrings are at the end of a function, which doesn't trigger when help(function) is called. Additionally, some of the minor functions do not follow the standard convention of describing the function in the class def, then the input rules for __init__ and __call__.
New module to strip (or replace) hyperlinks in documents. Hyperlink detection should work even if the link is in parentheses like (www.google.com) or [https:XXX].
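A hypothetical starting point for the module (the pattern and function name are assumptions): the character class deliberately stops at whitespace, parentheses, and brackets, so links wrapped in (…) or […] are caught without eating the closing delimiter:

```python
import re

# Matches http:, https:, or www.-style links, stopping at whitespace
# and at ( ) [ ] so surrounding delimiters survive.
LINK = re.compile(r'(?:https?:|www\.)[^\s()\[\]]+')

def strip_hyperlinks(text, replacement=''):
    return LINK.sub(replacement, text)
```

Trailing punctuation glued to a bare link (e.g. a sentence-final period) would need extra handling; this sketch ignores that.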
For instance, the sentence "Hello world. (Good Evening)(goodbye)" will crash the module. I can't imagine we'd ever run into this unless there was an error transcribing documents, but I hit this when my OBSSR LDA reprocessing crashed.
I'm committing and pushing a test case and fix to the parens debugging branch.
Somehow, the combination of these two sentences causes the parser to fail here. Taking away either one will allow it to run.
text = '''
A large region upstream (~30 kb) of GATA 4.
Small interfering RNA (siRNA) mediated depletion of EZH2.
'''
from nlpre import replace_acronyms as parser
print parser({})(text)
Traceback (most recent call last):
File "tx.py", line 15, in <module>
print parser({})(text)
File "/usr/local/lib/python2.7/dist-packages/nlpre/replace_acronyms.py", line 200, in __call__
doc_counter = self.IPP(document)
File "/usr/local/lib/python2.7/dist-packages/nlpre/identify_parenthetical_phrases.py", line 41, in __call__
subtokens = self._check_matching(word, k, tokens)
File "/usr/local/lib/python2.7/dist-packages/nlpre/identify_parenthetical_phrases.py", line 129, in _check_matching
if let not in tokens_to_remove]
TypeError: 'str' object is not callable
In the code, there are instances of print statements that should be properly handled with logging. Most of these can be handled with logging.info.
Add reasonable unittests to unidecoder. Greek letters are the most common, but it might be worth checking a few other diacritics like:
α-Helix β-sheet Αα Νν Ββ Ξξ Γγ Οο
Lëtzebuergesch
vóórkomen
perispōménē
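For the diacritic cases, a stdlib-only baseline is worth noting (a sketch, not the unidecoder implementation): NFKD normalization plus an ASCII encode strips combining accents, but it silently drops Greek letters rather than transliterating them, which is exactly why the module needs the unidecode package:

```python
import unicodedata

def ascii_fold(text):
    # Decompose accented characters, then drop anything non-ASCII.
    # Handles vóórkomen -> voorkomen, but DROPS α, β, ... entirely,
    # unlike unidecode which transliterates them (β -> b).
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('ascii'))
```

Unit tests comparing this baseline against unidecode on the examples above would pin down exactly which behavior the library wants.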
Currently in replace_acronyms,
non-Hodgkin lymphoma (NHL)
gets turned into
non-Hodgkin lymphoma ( non_Hodgkin_lymphoma )
but it would be ideal to turn it into
non-Hodgkin_lymphoma ( non_Hodgkin_lymphoma )
otherwise downstream parsers (like replace_from_dict) can mangle this.
From the original w2v pipeline, there was replacement code to handle the phrases found in parenthesis. This needs to be imported in.
Just noticed this as I was leaving. If a single sentence is put in parenthesis, the parenthesis are not being deleted properly. For instance, "Hello world. (It is a beautiful day.) Goodbye world" is split into three sentences: 'Hello world.' , '(It is a beautiful day.)', 'Goodbye world'.
I would expect it to split into: 'Hello world.', '(It is a beautiful day.', ') Goodbye world'. I'm not sure why it's not.
In replace_from_dict.py, we have a double for loop that iterates through sent in sentences and then word in keywords. In the inner loop we split the sent into tokens - I believe this can be done in the outer for loop instead.
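A sketch of the refactor (the function and its replacement semantics are simplified stand-ins for the real replace_from_dict code): the sentence is split into tokens once, in the outer loop, instead of being re-split for every keyword:

```python
def replace_keywords(sentences, keywords, prefix='MeSH_'):
    out = []
    for sent in sentences:
        tokens = sent.split()          # hoisted out of the inner loop
        for word in keywords:
            tokens = [prefix + t if t == word else t for t in tokens]
        out.append(' '.join(tokens))
    return out
```

For S sentences and K keywords this removes K-1 redundant tokenizations per sentence.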
Given that there is text
The etiology of osteoarthritis (OA) is at present unknown. ... which are associated with primary OA utilizing the over 35 families ...
And a prior abbreviation of OA -> Ocean acidification
the unmatched instances of OA will be replaced with "Ocean acidification", which is incorrect. We need to skip the replacement when the abbreviation cannot be matched to its definition in the document.
There is no compelling reason to require pandas in the library; it was mostly used as a convenience. Remove it and replace it with the csv module. This will also speed up testing, as pandas is a huge library to install.
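The replacement is small; something like this sketch covers the column-reading convenience pandas was providing (the function name and 'text' default are assumptions about the call sites):

```python
import csv

def iter_column(lines, column='text'):
    # `lines` is any iterable of CSV lines (an open file works).
    # Yields one cell per row from the named column, lazily.
    for row in csv.DictReader(lines):
        yield row[column]
```

Since csv is in the standard library, the tox environments no longer need to build pandas at all.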
Currently the code only checks whether a period already exists at the end of parenthetical content. Check whether other punctuation exists as well (i.e., question marks and exclamation points).
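The check reduces to a small helper, sketched here (the name is hypothetical). It also guards the empty-token-list case that produced the IndexError in the traceback above:

```python
SENT_END = ('.', '!', '?')

def ensure_terminal_punct(tokens):
    # Append a period only when no sentence-ending punctuation of any
    # kind (not just '.') already closes the content; an empty token
    # list gets a lone period instead of raising IndexError on [-1].
    if not tokens or tokens[-1] not in SENT_END:
        return tokens + ['.']
    return list(tokens)
```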
We should have a logger catch when a module fails. My idea is to wrap every module in a try/except statement, unless there's a more elegant way to do it.
The only issue is that this will reduce coverage, unless we can create failing tests for every module. I'm not sure if I can create tests for every module that will cause them to fail. If I found a bug, I'd just fix it. However, it seems important to have a try/except statement to catch errors that we haven't accounted for.
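One shape this could take (a sketch, not a decided design) is a decorator that logs the failure and re-raises, so the exception still surfaces and the happy-path coverage is untouched:

```python
import functools
import logging

def log_failures(func):
    # Wraps a module's __call__ (or any function): on any exception,
    # log the full traceback under the function's module logger, then
    # re-raise so callers still see the error.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logging.getLogger(func.__module__).exception('%s failed', func.__name__)
            raise
    return wrapper
```

Because it re-raises, existing tests keep failing loudly; the decorator only adds context, which sidesteps the coverage concern above.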
The titlecaps class calls sentence_tokenizer from tokenizers.py to split the string into sentences. However, sentence_tokenizer returns the sentences as unicode strings, which breaks other modules downstream. Should I re-run unidecoder after titlecaps, or edit sentence_tokenizer?
Even though speed isn't our top concern, replace_from_dictionary is orders of magnitude slower than most functions.
function                          time      frac
unidecoder                        0.000008  0.000018
token_replacement                 0.000010  0.000022
dedash                            0.000535  0.001172
titlecaps                         0.003216  0.007043
decaps_text                       0.003802  0.008327
identify_parenthetical_phrases    0.009862  0.021598
replace_acronyms                  0.012591  0.027574
separated_parenthesis             0.013224  0.028960
pos_tokenizer                     0.068994  0.151094
replace_from_dictionary           0.344384  0.754191
Parenthetical content is now expanded rather than deleted. That is, "ABC(DE(FG)H)I" is now processed as "ABCI.DEH.FG". However, the parser will still split all sentences on periods, regardless of whether those periods are in parenthetical content. That is, "AB(CD.EF)G" is processed as "ABCD.EFG". It should instead be processed as "ABG.CD.EF".
Now that coverage is near or at 100%, we need to add it to the test suite. Part of the problem is the failing of the command:
coverage run --source nlpre setup.py test
Which should work, but fails with
ImportError: No module named tests
Coverage.py warning: No data was collected.
Once this is fixed, we can add a badge with coveralls.
A helper function to take in documents and process them in a pipeline using joblib would be a useful addition.
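A stdlib sketch of the shape such a helper could take. The issue suggests joblib; this version uses threads from concurrent.futures instead, so the parser callables need not be picklable (all names here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(doc, parsers):
    # Apply each parser callable in order, as the w2v pipeline does.
    for f in parsers:
        doc = f(doc)
    return doc

def parallel_parse(docs, parsers, workers=4):
    # Fan the documents out across a worker pool; joblib.Parallel with
    # delayed(run_pipeline) would be a drop-in alternative.
    with ThreadPoolExecutor(workers) as pool:
        return list(pool.map(lambda d: run_pipeline(d, parsers), docs))
```

For CPU-bound parsers a process pool (or joblib's default backend) would parallelize better; threads keep the sketch simple.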
Offending code is here:
69 for word, i, j in keywords:
70 if n < i:
71 tokens.append(doc[n:i])
72 tokens.append(self.prefix+word)
73 n = j
---> 74 tokens.append(doc[j:len(doc)])
UnboundLocalError: local variable 'j' referenced before assignment
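The error means `keywords` was empty, so the loop body never ran and `j` was never bound. A sketch of the fix (function name and prefix are illustrative): track a cursor `n` initialized before the loop and slice from it afterwards, instead of reusing the loop variable:

```python
def tokenize_with_prefix(doc, keywords, prefix='MeSH_'):
    # keywords: iterable of (word, start, end) spans within doc.
    tokens, n = [], 0
    for word, i, j in keywords:
        if n < i:
            tokens.append(doc[n:i])
        tokens.append(prefix + word)
        n = j
    # Slice from the cursor, not from `j`: safe when keywords is empty.
    tokens.append(doc[n:])
    return tokens
```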
Should we add periods to the end of parenthetical sentence content? I.e., do we want "Hello (hello world1) world2. Hello world3." -> "Hello world2 .\nHello world1 .\nHello world3." ? Currently parenthetical content has the newline appended correctly, but we don't manually append a period. It seems like we should if these modules are meant to be modular.
There still look to be avenues for speed improvements in replace_from_dictionary, which is still the slowest part of the pipeline. Look into this.
Add unidecode as a library function. Unit tests will be a little bit tricky, since we need to input UTF-8 strings in the test. Example: β-hairpin becomes "b-hairpin".
Have an option to remove the content found in separated_parenthesis.py. If the value is set to None (which should be the default), all content is removed. If set to 0, all content is kept (current behavior). If the value is n, the partial sentence is retained only if at least n tokens are in the parenthetical content.
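The switch itself is a one-liner worth pinning down before wiring it into the class (names here are assumptions about the eventual API):

```python
def keep_parenthetical(tokens, min_tokens=None):
    # Proposed semantics:
    #   None  -> drop all parenthetical content (proposed default)
    #   0     -> keep everything (current behavior)
    #   n > 0 -> keep only content with at least n tokens
    if min_tokens is None:
        return False
    if min_tokens == 0:
        return True
    return len(tokens) >= min_tokens
```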
This is the code to remove possessive splits:
# Remove possesive splits
doc = doc.replace(" 's ", ' ')
This will only replace the possessive 's if there is a space before it, which I don't think would ever happen.
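A regex would catch the marker with or without the leading space (a sketch; note it would also catch contractions like "it's", which is an assumption about what runs before this step):

```python
import re

def strip_possessive(doc):
    # Remove 's whether tokenization inserted a space before it or not.
    return re.sub(r"\s*'s(?=\s|$)", '', doc)
```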
Right now the README is a bit of a mess; work on cleaning it up and getting it ready for a submission to PyPI.