Code Monkey home page Code Monkey logo

Comments (6)

pemistahl avatar pemistahl commented on May 17, 2024

Hi Vítor, thanks for reaching out to me. This tool is simply not meant to scale to billion-sized inputs. I would not even know how to accomplish that. For DFA generation, I use a separate library so I'm limited by that one. The tool's primary purpose is to help creating regular expressions with a focus on correctness (which is hard enough), not performance. I will work on a new feature that allows to create various character classes and ranges of characters so that more generalized expressions are possible.

Feel free to open a PR if you see a way to improve performance or other aspects. Even though I'm relatively proficient in Rust, I'm far from being an expert.

from grex.

vitorsr avatar vitorsr commented on May 17, 2024

from grex.

pemistahl avatar pemistahl commented on May 17, 2024

Thank you for your suggestions. Looks interesting. I will leave this issue open and come back to it as soon as I continue working on performance improvements.

from grex.

vitorsr avatar vitorsr commented on May 17, 2024

I changed the title so it looks less like clickbait to outside viewers.

I'll share a few test sets to consider and more as they come along.

  • Case 1.1
$ python3 -c 'from itertools import combinations_with_replacement;'\
'_ = [print("".join(i)) for i in combinations_with_replacement("ab", 256)]' | grex -r -
^(?:a{255}b|a{254}b{2}|a{253}b{3}|a{252}b{4}|a{251}b{5}|a{250}b{6}|a{249}b{7}|a{248}b{8}|a{247}b{9}|a{246}b{10}|a{24...

Expected output: ^[ab]{256}$.
grex does not yield the expected output. Adding more characters to the repeated alphabet or increasing the size of combinations chokes the program. There is insufficient character set (or interval) detection.

  • Case 1.2
$ python3 -c 'from itertools import combinations_with_replacement;'\
'_ = [print("".join(i)) for i in combinations_with_replacement("ab", 256)]' | grex -
^(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:...

Expected output: ^[ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab....
grex does not yield the expected output. The expression builder favors nesting to a fault without the repetitions or substitution flags.

  • Case 1.3
$ python3 -c 'from itertools import combinations_with_replacement;'\
'_ = [print("".join(i)) for i in combinations_with_replacement("ab", 256)]' | grex -w -r -
^\w{256}$

Expected output: ^\w{256}$.
grex yields the correct output. However this shouldn't take so long - after substitution all entries are the same. There needs to be deduplication guards to avoid choking the program.

  • Case 2.1
$ grex -r '<{{ aaa }}>' '<{{ aab }}>' '<{{ aac }}>' '<{{ abb }}>' '<{{ abc }}>' '<{{ acc }}>' '<{{ bbb }}>' '<{{ bbc }}>' '<{{ bcc }}>' '<{{ ccc }}>' '</{{ aaa }}>' '</{{ aab }}>' '</{{ aac }}>' '</{{ abb }}>' '</{{ abc }}>' '</{{ acc }}>' '</{{ bbb }}>' '</{{ bbc }}>' '</{{ bcc }}>' '</{{ ccc }}>'
^<(?:/\{{2} (?:(?:ab|b{2})c|a{2}[bc]|a(?:b{2}|c{2})|bc{2}|a{3}|b{3}|c{3}) \}{2}>|\{{2} (?:(?:ab|b{2})c|a{2}[bc]|a(?:b{2}|c{2})|bc{2}|a{3}|b{3}|c{3}) \}{2}>)$

Expected output: ^</?\{{2} [abc]{3} \}{2}>$.
grex does not yield the expected output. This case illustrates insufficient common prefix and suffix detection as well as the issues pointed out in Case 1.1 and 1.2.

These test cases also only use a single thread. Lots of operations including (independent) iterations and sorting could be exchanged by parallel counterpart calls for a free win.

from grex.

scheidelerl avatar scheidelerl commented on May 17, 2024

Hey,
so far I like the tool very much, but if I may I would like to give more input on this topic.

I have a very simple data set where the mixture is initial digits followed by one or two lowercase non-digit characters or followed by one or two lowercase non-digit characters with a roman numeral from I-VI in parentheses.

I would offer the following regex. Surely this can also be done more simply, but here:

^[0-9]+[a-z]{0,2}(\s\((IV|V?I{0,3})\))?$

regex101 analysis.

Test file: test.txt

Here's what I got with grex -f text.txt:

^(?:(?:(?:13|69)6 \(II|440 \(II|725 \(II|(?:2(?:56j|85|98)|4(?:2(?:2b|9)|1[07]|4[38])|7(?:10e|02|22)|1(?:63|75|81)|3(?:43|12)|5(?:15|83)|6(?:0[0358]|1[34])|89[08]|502) \(I|(?:13|69)6 \(V|(?:(?:783|(?:(?:23|31)|44))|883) \(I)I?\)|(?:296|409|708|821) \(I(?:II?\)|\))|9(?:(?:57i|22) \(II?\)|9(?:8 \(I(?:II?\)|\))|9a|9|[0-7])?|5(?:7[ad-fh]|[0-689])|7(?:7[a-h]|[0-689])|57?|77?|1[0-9ab]|2[013-9]|6[013-9a]|[0348][0-9]|1|2|6|[0348])?|(?:255 \(I[IV]|(?:13|69)6 \(IV|440 \(IV|725 \(IV)\)|2(?:5(?:5 \(I\)|6(?:a[a-s]|[bce-ik-oq-y])|7[a-c]|[0-489])|1[0-57-9]|4[0-9a-c]|6[0-4689]|7[1-689]|8[0-46-8]|3[0-9]|9[0-579])|(?:13|69)6 \(I\)|440 \(I\)|7(?:2(?:5 \(I\)|[1346-9])|1(?:0(?:[ab][a-z]|[df-rt-z])|[1-57-9])|10c[a-i]|7(?:4(?:a[a-i]|[b-kw-z])|[0-35-8])|3(?:9[ac-g]|[0-8])|8(?:3[ab]|[124-69])|0[013-79])|1(?:0(?:1(?:9[a-j]|[0-8])|2[01346-9]|6[4-9a-f]|7[013-9])|1(?:0[0-7]|[1-9])|3(?:6b|[0-579])|6(?:2[bc]|[014-689])|2[0-25-9]|4[013-689]|5[235-9]|7[0-46-9a-c]|8[02-9])|(?:1(?:060|54|67)|277|35[58]|7(?:16|20|8[78]))[ab]|(?:1(?:0(?:22|63|72)|50)|265|3(?:4[12]|60)|4(?:24|35)|7(?:79|80))a|1(?:0(?:19?|2|6|7)?|10?|62?|2|3|4|5|7|8)?|2(?:5(?:6a?)?|1|4|6|7|8)?|7(?:1(?:0[ab]?)?|10c|7(?:4a?)?|39?|0|2|8)?|1(?:060|54|67)|1(?:0(?:22|63|72)|50)|105[013-9]|1(?:38|47)[a-e]|3(?:2(?:4[a-eg-l]|[0-35-9])|45[a-e]|8(?:4[abd-f]|6[a-e]|[0-357-9])|0[013-9]|3[0-79]|4[046-9a-k]|5[0-49]|6[1-9]|7[0-46-9]|1[013-79])|4(?:2(?:2[ac]|[0135-8])|0[0-8]|1[1-689]|3[0-46-9ab]|4[124-79])|108[02-9]|(?:1(?:42|51)|267)[a-d]|(?:270|3(?:38|56|75))[a-c]|79(?:0[ab]|[1-9])|8(?:3(?:6[a-fh-mq-z]|[0-57-9])|4(?:5[a-d]|8[a-c]|[0-4679ab])|5(?:0[ab]|[1-9])|7(?:5[a-l]|[0-46-9])|8(?:3[a-c]|[0-24-9])|0[02-8]|1[02-9]|2[02-9]|6[0-9a]|9[1-79a]|[a-h])|(?:1(?:0[0349]|9)|2[02]|39|4[5-9]|5[2-7]|6[2-68]|7[45])[0-9]|105|1(?:38|47)|3(?:24?|45|8(?:4|6)?|0|3|4|5|6|7)?|4(?:22?|0|1|3)?|108|1(?:42|51)|267|270|3(?:38|56|75)|277|35[58]|7(?:16|20|8[78])|790?|8(?:36?|4(?:5|8)?|50?|75?|0|1|2|6|9)?|1(?:0[0349]|9)|265|3(?:4[12]|60)|4(?:24|35)|7(?:79|80)|5(?:1[01346-9]|8[0-24-9]|9[0-9a-d])|6(?:0[124679]|1[0-25-9]|7[0-9c]|9[0-57-9])|76[0-9a]|50[013-9]|5(?:1|8)?|6(?:0|1|7|9)?|76|50|2[02]|39|4[5-9]|5[2-7]|6[2-68]|7[45])$

I tried to find out with the analysis of regex101 if grex follows some kind of rule set in which it is grouped to possibly provide helpful information for the enhancement. Since I didn't really have the skills to get into the code right now. But I love regex. The pattern grex is trying to group is too wild for me to put into words.

The analysis of regex101 indicates that grex has formed a large non-capturing group and 59 alternatives to follow.

I do play around with -r, --min-repetitions and --min-substring-length and -dand -w.

  • with -r the thing goes much wilder. here. I believe because we do have unique lines.
  • I think --min-repetitions and --min-substring-length doesn't do anything useful in this case.
  • -d and -w would do a lot for grouping, but in this case I will be specific - especially with the roman numeral. here

If this comment does not provide important or purposeful information, let me know and I will delete it. I just wanted to share my play around.

from grex.

pemistahl avatar pemistahl commented on May 17, 2024

Hi @scheidelerl, thank you for your valuable input, very much appreciated. :)

I know about the shortcomings of the current algorithm, that's why the issues #48 and #122 already exist. I want to work on these, I'm just lacking free time at the moment. But the project is still active and I will continue working on it. I just can't tell you when.

from grex.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.