Code Monkey home page Code Monkey logo

trieregex's Introduction

trieregex

pypi Version python

trieregex creates efficient regular expressions (regexes) by storing a list of words in a trie structure, and translating the trie into a more compact pattern.

The speed performance of a trie-based regex (e.g., r'under(?:sta(?:nd|te)|take|go)') -- compared to a flat regex union (i.e., r'(?:understand|understate|undertake|undergo)') -- becomes apparent when using extremely large word lists, and especially when more specific or complicated contexts are specified at the boundaries. The processing time of this package itself is also minimized with memoization.

Installation

pip install trieregex

Usage

import re
from trieregex import TrieRegEx as TRE

words = ['lemon', 'lime', 'pomelo', 'orange', 'citron']
more_words = ['grapefruit', 'grape', 'tangerine', 'tangelo']

# Initialize class instance
tre = TRE()

# Add word(s)
tre = TRE(*words)  # word(s) can be added upon instance creation, or after
tre.add('kumquat')  # add one word
tre.add(*more_words)  # add a list of words 

# Remove word(s)
tre.remove('citron')  # remove one word
tre.remove(*words)  # remove a list of words

# Check if a word exists in the trie
tre.has('citron')  # Returns: False
tre.has('tangerine')  # Returns: True

# Create regex pattern from the trie
tre.regex()  # Returns: '(?:tange(?:rine|lo)|grape(?:fruit)?|kumquat)'

# Add boundary context and compile for matching
pattern = re.compile(f'\\b{tre.regex()}\\b')  # OR rf'\b{tre.regex()}\b'
pattern  # Returns: re.compile('\\b(?:tange(?:rine|lo)|grape(?:fruit)?|kumquat)\\b')
pattern.findall("A kumquat is tastier than a loquat")  # Returns: ['kumquat']

# Inspect unique initial characters in the trie
tre.initials()  # Returns: ['g', 'k', 't']

# Inspect unique final characters in the trie
tre.finals()  # Returns: ['e', 'o', 't']

The last two methods are intended for checking what boundary contexts are appropriate for the final regex for pattern matching (discussed below).

(No) Boundaries

trieregex does not include any default boundaries (such as r'\b') in the pattern returned from its TrieRegEx.regex() method, that way the user can determine what is appropriate for each particular use case.

Consider a fictitious brand name !Citrus with an exclamation mark at the beginning, and trying to catch it in a target string by using r'\b' for its boundaries in the regex:

string = 'I love !Citrus products!'
re.findall(r'\b(!Citrus)\b', string)  # Returns: []

The re.findall() call returns an empty list because the first r'\b' is not matched. While r'\b' stands for the boundary between a word character and a non-word character, the exclamation mark and its preceding space character are both non-word characters.

An appropriate regex for catching '!Citrus' can be written as follows, where the context before the exclamation mark is either start-of-string (r'^') or a non-word character (r'[^\w]'):

re.findall(r'(?:^|[^\w])(!Citrus)\b', string)  # Returns: ['!Citrus']

This package is designed to allow any pattern to be added to its trie, not just normal words bound by space and punctuation, so that the user can define their own boundaries in the regex, and have the option to leave the data unnormalized when desired.

Python version comptability

Due to f-strings and type hints, this package is only compatible with Python versions >=3.6.

For Python versions >=2.7, <3.6, backports such as future-fstrings and typing are available.

trieregex's People

Contributors

ermanh avatar

Stargazers

Ryan Sudhakaran avatar Ivan Neutov avatar Hakim avatar Colin Evrard avatar dr43400 avatar Cynde avatar Phillip Sawyer avatar Hugefiver avatar Xiang.Lin avatar  avatar  avatar Cédric Bonhomme avatar Alexandre Dulaunoy avatar Richard Gomez avatar Diwank Singh Tomer avatar ry avatar MeganerdDev avatar  avatar Rico Koschmitzky avatar James Hale avatar Gabriel Freire avatar  avatar Simao (Sam) Eduardo avatar Luzia Troebinger avatar Michael I Chen avatar Kevin avatar Mark Gillard avatar Gicque avatar  avatar Ebraheem Almuaili avatar Akmal avatar Luis Belloch avatar Andoni Alonso  avatar Pablo San José avatar Arron Vinyard avatar Santiago M. Mola avatar Evo Stamatov avatar Adam Eury avatar Steve C avatar Matthew Blewitt avatar  avatar Timothy Choi avatar Michał M. Więcław avatar Dan Salo avatar Haroldo de Oliveira Pinheiro avatar Tasuku SUENAGA a.k.a. gunyarakun avatar Kyle Booten avatar Kenil Cheng avatar Ahmed Benjelloun avatar  avatar Leo Chu avatar Albert Au Yeung avatar

Watchers

James Cloos avatar  avatar

trieregex's Issues

Using trieregex for a list not of words, but of regular expressions?

Thanks for making this. Tries are extremely powerful, and your module makes them easy to use.

I'm wondering if I can use it also to optimize a list of regular expressions instead of a list of words.

By regular expressions I mean anything I would feed into re.find(regex, string), for example. So flags like (?mi), non-capturing groups, etc..

I tried doing that, but your module simply escaped my regular expressions, i.e. treated them as words to be matched verbatim. So it seems the algorithm you are using to generate the trie only works for words, or strings to be matched verbatim, and not for regular expressions.

Am I overlooking something? Do you think it's even possible to do for a list of regular expressions what you've done for lists of words here? Basically, a general regex optimizer.

Thanks!

RecursionError: maximum recursion depth exceeded while getting the repr of an object

I have an error when trying to use this library with a list of 10K celebrity names.

Traceback (most recent call last):
  File "/home/hhh/ratings-api/clean.py", line 74, in <module>
    logger.debug(trie.regex())
  File "/home/hhh/.local/lib/python3.9/site-packages/trieregex/memoizer.py", line 20, in __call__
    self.cache[stringed] = self.func(*args)
  File "/home/hhh/.local/lib/python3.9/site-packages/trieregex/trieregex.py", line 111, in regex
    return f'{escape(key)}{self.regex(trie[key], False)}'
  File "/home/hhh/.local/lib/python3.9/site-packages/trieregex/memoizer.py", line 18, in __call__
    stringed = str(args)
RecursionError: maximum recursion depth exceeded while getting the repr of an object

It seems that the regex method return empty string inside a loop or a spark UDF

Hi,
TrieRegex really help me to improve my regex performance.
However, I try to use TrieRegex inside a spark UDF and seem that sometime TrieRegex return an empty string.

To simplify, I understand that the same behaviour is present if I use the TrieRegex inside a loop:

# Del
i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break

I have:

ERROR on loop i:5 # Where the number can change
TRIE_VALUES: '' (len: 0)

My workaround for this case is to add a del like this:

# Del
i = 0
VALUES = ["REGEX1", "REGEX2"]
while i < 20:
    TRIE_VALUES = TrieRegEx(*VALUES)
    i = i + 1
    if len(TRIE_VALUES.regex()) < 1:
        print(f"ERROR on loop i:{i}")
        print(f"TRIE_VALUES: '{TRIE_VALUES.regex()}' (len: {len(TRIE_VALUES.regex())})")
        break
    del TRIE_VALUES

With the code above it works well.
However, if I use TrieRegex inside a PandasUDF, I have the same bug.

My pandas udf is something like this:

def trieregex_udf(df):
     # Read source
     values = ### read_values()
     trie = TrieRegex(*patterns)
     regex = trie.regex()
     # Apply regex to DF
     output = .....
     return output

output = df.groupby("id").applyInPandas(trieregex_udf, schema="v string").toPandas()

sometimes the trie.regex() return an empty string
It seems that the problem is present only in case of instance the TrieRegex inside the udf, if I pass the result regex everything work well.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.