ermanh / trieregex Goto Github PK

52.0 2.0 5.0 82 KB

Build efficient trie-based regular expressions from large word lists

License: MIT License

Python 100.00%

trieregex's Introduction

trieregex

trieregex creates efficient regular expressions (regexes) by storing a list of words in a trie structure, and translating the trie into a more compact pattern.

The speed performance of a trie-based regex (e.g., r'under(?:sta(?:nd|te)|take|go)') -- compared to a flat regex union (i.e., r'(?:understand|understate|undertake|undergo)') -- becomes apparent when using extremely large word lists, and especially when more specific or complicated contexts are specified at the boundaries. The processing time of this package itself is also minimized with memoization.

Installation

pip install trieregex

Usage

import re
from trieregex import TrieRegEx as TRE

words = ['lemon', 'lime', 'pomelo', 'orange', 'citron']
more_words = ['grapefruit', 'grape', 'tangerine', 'tangelo']

# Initialize class instance
tre = TRE()

# Add word(s)
tre = TRE(*words)  # word(s) can be added upon instance creation, or after
tre.add('kumquat')  # add one word
tre.add(*more_words)  # add a list of words 

# Remove word(s)
tre.remove('citron')  # remove one word
tre.remove(*words)  # remove a list of words

# Check if a word exists in the trie
tre.has('citron')  # Returns: False
tre.has('tangerine')  # Returns: True

# Create regex pattern from the trie
tre.regex()  # Returns: '(?:tange(?:rine|lo)|grape(?:fruit)?|kumquat)'

# Add boundary context and compile for matching
pattern = re.compile(f'\\b{tre.regex()}\\b')  # OR rf'\b{tre.regex()}\b'
pattern  # Returns: re.compile('\\b(?:tange(?:rine|lo)|grape(?:fruit)?|kumquat)\\b')
pattern.findall("A kumquat is tastier than a loquat")  # Returns: ['kumquat']

# Inspect unique initial characters in the trie
tre.initials()  # Returns: ['g', 'k', 't']

# Inspect unique final characters in the trie
tre.finals()  # Returns: ['e', 'o', 't']

The last two methods are intended for checking what boundary contexts are appropriate for the final regex for pattern matching (discussed below).

(No) Boundaries

trieregex does not include any default boundaries (such as r'\b') in the pattern returned from its TrieRegEx.regex() method, that way the user can determine what is appropriate for each particular use case.

Consider a fictitious brand name !Citrus with an exclamation mark at the beginning, and trying to catch it in a target string by using r'\b' for its boundaries in the regex:

string = 'I love !Citrus products!'
re.findall(r'\b(!Citrus)\b', string)  # Returns: []

The re.findall() call returns an empty list because the first r'\b' is not matched. While r'\b' stands for the boundary between a word character and a non-word character, the exclamation mark and its preceding space character are both non-word characters.

An appropriate regex for catching '!Citrus' can be written as follows, where the context before the exclamation mark is either start-of-string (r'^') or a non-word character (r'[^\w]'):

re.findall(r'(?:^|[^\w])(!Citrus)\b', string)  # Returns: ['!Citrus']

This package is designed to allow any pattern to be added to its trie, not just normal words bound by space and punctuation, so that the user can define their own boundaries in the regex, and have the option to leave the data unnormalized when desired.

Python version comptability

Due to f-strings and type hints, this package is only compatible with Python versions >=3.6.

For Python versions >=2.7, <3.6, backports such as future-fstrings and typing are available.

trieregex's People

Contributors

Stargazers

Watchers

Forkers

diceroll123 joinedsenses kolanich-libs zhaoxjmail supercoolin

trieregex's Issues

Using trieregex for a list not of words, but of regular expressions?

Thanks for making this. Tries are extremely powerful, and your module makes them easy to use.

I'm wondering if I can use it also to optimize a list of regular expressions instead of a list of words.

By regular expressions I mean anything I would feed into re.find(regex, string), for example. So flags like (?mi), non-capturing groups, etc..

I tried doing that, but your module simply escaped my regular expressions, i.e. treated them as words to be matched verbatim. So it seems the algorithm you are using to generate the trie only works for words, or strings to be matched verbatim, and not for regular expressions.

Am I overlooking something? Do you think it's even possible to do for a list of regular expressions what you've done for lists of words here? Basically, a general regex optimizer.

Thanks!

RecursionError: maximum recursion depth exceeded while getting the repr of an object

I have an error when trying to use this library with a list of 10K celebrity names.

Traceback (most recent call last):
  File "/home/hhh/ratings-api/clean.py", line 74, in <module>
    logger.debug(trie.regex())
  File "/home/hhh/.local/lib/python3.9/site-packages/trieregex/memoizer.py", line 20, in __call__
    self.cache[stringed] = self.func(*args)
  File "/home/hhh/.local/lib/python3.9/site-packages/trieregex/trieregex.py", line 111, in regex
    return f'{escape(key)}{self.regex(trie[key], False)}'
  File "/home/hhh/.local/lib/python3.9/site-packages/trieregex/memoizer.py", line 18, in __call__
    stringed = str(args)
RecursionError: maximum recursion depth exceeded while getting the repr of an object

It seems that the regex method return empty string inside a loop or a spark UDF

Hi,
TrieRegex really help me to improve my regex performance.
However, I try to use TrieRegex inside a spark UDF and seem that sometime TrieRegex return an empty string.

To simplify, I understand that the same behaviour is present if I use the TrieRegex inside a loop:

# Del
i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break

I have:

ERROR on loop i:5 # Where the number can change
TRIE_VALUES: '' (len: 0)

My workaround for this case is to add a del like this:

# Del
i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break
    del TRIE_VALUES

With the code above it works well.
However, if I use TrieRegex inside a PandasUDF, I have the same bug.

My pandas udf is something like this:

def trieregex_udf(df):
     # Read source
     values = ### read_values()
     trie = TrieRegex(*patterns)
     regex = trie.regex()
     # Apply regex to DF
     output = .....
     return output

output = df.groupby("id").applyInPandas(trieregex_udf, schema="v string").toPandas()

sometimes the trie.regex() return an empty string
It seems that the problem is present only in case of instance the TrieRegex inside the udf, if I pass the result regex everything work well.

ermanh / trieregex Goto Github PK

trieregex's Introduction

trieregex

Installation

Usage

(No) Boundaries

Python version comptability

trieregex's People

Contributors

Stargazers

Watchers

Forkers

trieregex's Issues

Using trieregex for a list not of words, but of regular expressions?

RecursionError: maximum recursion depth exceeded while getting the repr of an object

It seems that the regex method return empty string inside a loop or a spark UDF

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent