ermanh / trieregex
Build efficient trie-based regular expressions from large word lists
License: MIT License
Hi,
TrieRegEx has really helped me improve my regex performance.
However, when I use TrieRegEx inside a Spark UDF, regex() sometimes returns an empty string.
To simplify, I can reproduce the same behaviour by creating TrieRegEx instances inside a loop:
# Without del
from trieregex import TrieRegEx

i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    # Build a fresh trie on every iteration
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break
I get:
ERROR on loop i:5 # the loop number varies between runs
TRIE_VALUES: '' (len: 0)
My workaround for this case is to add a del, like this:
# With del
from trieregex import TrieRegEx

i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break
    # Drop the instance explicitly before the next iteration
    del TRIE_VALUES
With the code above it works well.
However, if I use TrieRegEx inside a pandas UDF, I hit the same bug.
My pandas UDF looks something like this:
def trieregex_udf(df):
    # Read source
    values = ...  # read_values()
    trie = TrieRegEx(*values)
    regex = trie.regex()
    # Apply regex to DF
    output = ...
    return output

output = df.groupby("id").applyInPandas(trieregex_udf, schema="v string").toPandas()
Sometimes trie.regex() returns an empty string.
The problem seems to appear only when TrieRegEx is instantiated inside the UDF; if I build the regex outside and pass the resulting string in, everything works well.
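The workaround described above (build the regex once, then hand only the resulting string to the function that runs per group) can be sketched without Spark. This is a minimal sketch under assumptions: `build_pattern` and `apply_pattern` are hypothetical names, and a plain escaped alternation stands in for `TrieRegEx(*values).regex()`. It does not explain why the empty string appears; it only shows the shape of the workaround plus a guard, since an empty pattern matches every input.

```python
import re
from functools import partial


def build_pattern(words):
    # Stand-in for TrieRegEx(*words).regex(): a plain escaped alternation,
    # longest words first so longer alternatives win over their prefixes.
    ordered = sorted(set(words), key=lambda w: (-len(w), w))
    return "|".join(re.escape(w) for w in ordered)


def apply_pattern(rows, pattern):
    # Hypothetical per-group function: it only receives the finished
    # pattern string and never builds a trie itself.
    compiled = re.compile(pattern)
    return [bool(compiled.search(row)) for row in rows]


# Built once, up front, outside the per-group function.
PATTERN = build_pattern(["REGEX1", "REGEX2"])

# Fail fast: an empty pattern would silently match every row.
assert PATTERN, "pattern is empty"

find_matches = partial(apply_pattern, pattern=PATTERN)
print(find_matches(["has REGEX1", "nothing here"]))  # prints [True, False]
```

In the Spark case the same idea would mean closing over the pattern string in the function passed to applyInPandas, so each worker receives a plain string instead of constructing its own trie.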
I get an error when trying to use this library with a list of 10K celebrity names.
Traceback (most recent call last):
  File "/home/hhh/ratings-api/clean.py", line 74, in <module>
    logger.debug(trie.regex())
  File "/home/hhh/.local/lib/python3.9/site-packages/trieregex/memoizer.py", line 20, in __call__
    self.cache[stringed] = self.func(*args)
  File "/home/hhh/.local/lib/python3.9/site-packages/trieregex/trieregex.py", line 111, in regex
    return f'{escape(key)}{self.regex(trie[key], False)}'
  File "/home/hhh/.local/lib/python3.9/site-packages/trieregex/memoizer.py", line 18, in __call__
    stringed = str(args)
RecursionError: maximum recursion depth exceeded while getting the repr of an object
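The traceback above fails inside the memoizer's str(args) call, which has to build the repr of the nested trie on every recursive step. As a possible fallback while the cause is investigated, here is a minimal sketch of the trie-to-regex technique with no memoization. It is not the library's implementation: `build_trie` and `trie_to_regex` are hypothetical names, and the recursion depth of this sketch is bounded by the longest word rather than by the number of words.

```python
import re


def build_trie(words):
    # Nested dicts keyed by character; the "" key marks the end of a word.
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[""] = {}
    return root


def trie_to_regex(node):
    # A word may end here, so everything below this node is optional.
    optional = "" in node
    branches = [
        re.escape(char) + trie_to_regex(child)
        for char, child in sorted(node.items())
        if char != ""
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not optional:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    return body + "?" if optional else body


names = ["tea", "ten", "to"]
pattern = trie_to_regex(build_trie(names))
print(pattern)  # prints t(?:e(?:a|n)|o)
```

Wrapping the result as rf"\b(?:{pattern})\b" before compiling restricts matches to whole words.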
Thanks for making this. Tries are extremely powerful, and your module makes them easy to use.
I'm wondering whether I can also use it to optimize a list of regular expressions instead of a list of words.
By regular expressions I mean anything I would feed into re.search(regex, string), for example: flags like (?mi), non-capturing groups, etc.
I tried doing that, but your module simply escaped my regular expressions, i.e. treated them as words to be matched verbatim. So it seems the algorithm you are using to generate the trie only works for words, or strings to be matched verbatim, and not for regular expressions.
Am I overlooking something? Do you think it's even possible to do for a list of regular expressions what you've done for lists of words here? Basically, a general regex optimizer.
Thanks!
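To illustrate the difference described above: escaping turns every metacharacter into a literal, while wrapping each pattern in a non-capturing group and joining with "|" keeps its meaning. This is a minimal sketch using only the standard re module; the two sample patterns are made up for illustration, and the result is a plain union, not an optimized one.

```python
import re

patterns = [r"colou?r", r"gr[ae]y"]

# Escaping treats every metacharacter as a literal character.
literal = "|".join(re.escape(p) for p in patterns)

# A non-capturing group per pattern preserves its meaning and precedence
# when the patterns are joined with "|".
union = "|".join(f"(?:{p})" for p in patterns)

print(re.findall(literal, "color grey"))  # prints []
print(re.findall(union, "color grey"))    # prints ['color', 'grey']
```

Global inline flags such as (?mi) are only accepted at the start of an expression in Python 3.11 and later, so in a union like this it is safer to pass flags to re.compile instead of embedding them in each pattern.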