Comments (6)
Hi Vítor, thanks for reaching out to me. This tool is simply not meant to scale to billion-sized inputs. I would not even know how to accomplish that. For DFA generation, I use a separate library so I'm limited by that one. The tool's primary purpose is to help creating regular expressions with a focus on correctness (which is hard enough), not performance. I will work on a new feature that allows to create various character classes and ranges of characters so that more generalized expressions are possible.
Feel free to open a PR if you see a way to improve performance or other aspects. Even though I'm relatively proficient in Rust, I'm far from being an expert.
from grex.
from grex.
Thank you for your suggestions. Looks interesting. I will leave this issue open and come back to it as soon as I continue working on performance improvements.
from grex.
I changed the title so it looks less like clickbait to outside viewers.
I'll share a few test sets to consider and more as they come along.
- Case 1.1
$ python3 -c 'from itertools import combinations_with_replacement;'\
'_ = [print("".join(i)) for i in combinations_with_replacement("ab", 256)]' | grex -r -
^(?:a{255}b|a{254}b{2}|a{253}b{3}|a{252}b{4}|a{251}b{5}|a{250}b{6}|a{249}b{7}|a{248}b{8}|a{247}b{9}|a{246}b{10}|a{24...
Expected output: ^[ab]{256}$
.
grex does not yield the expected output. Adding more characters to the repeated alphabet or increasing the size of combinations chokes the program. There is insufficient character set (or interval) detection.
- Case 1.2
$ python3 -c 'from itertools import combinations_with_replacement;'\
'_ = [print("".join(i)) for i in combinations_with_replacement("ab", 256)]' | grex -
^(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:...
Expected output: ^[ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab...
.
grex does not yield the expected output. The expression builder favors nesting to a fault without the repetitions or substitution flags.
- Case 1.3
$ python3 -c 'from itertools import combinations_with_replacement;'\
'_ = [print("".join(i)) for i in combinations_with_replacement("ab", 256)]' | grex -w -r -
^\w{256}$
Expected output: ^\w{256}$
.
grex yields the correct output. However this shouldn't take so long - after substitution all entries are the same. There needs to be deduplication guards to avoid choking the program.
- Case 2.1
$ grex -r '<{{ aaa }}>' '<{{ aab }}>' '<{{ aac }}>' '<{{ abb }}>' '<{{ abc }}>' '<{{ acc }}>' '<{{ bbb }}>' '<{{ bbc }}>' '<{{ bcc }}>' '<{{ ccc }}>' '</{{ aaa }}>' '</{{ aab }}>' '</{{ aac }}>' '</{{ abb }}>' '</{{ abc }}>' '</{{ acc }}>' '</{{ bbb }}>' '</{{ bbc }}>' '</{{ bcc }}>' '</{{ ccc }}>'
^<(?:/\{{2} (?:(?:ab|b{2})c|a{2}[bc]|a(?:b{2}|c{2})|bc{2}|a{3}|b{3}|c{3}) \}{2}>|\{{2} (?:(?:ab|b{2})c|a{2}[bc]|a(?:b{2}|c{2})|bc{2}|a{3}|b{3}|c{3}) \}{2}>)$
Expected output: ^</?\{{2} [abc]{3} \}{2}>$
.
grex does not yield the expected output. This case illustrates insufficient common prefix and suffix detection as well as the issues pointed out in Case 1.1 and 1.2.
These test cases also only use a single thread. Lots of operations including (independent) iterations and sorting could be exchanged by parallel counterpart calls for a free win.
from grex.
Hey,
so far I like the tool very much, but if I may I would like to give more input on this topic.
I have a very simple data set where the mixture is initial digits followed by one or two lowercase non-digit characters or followed by one or two lowercase non-digit characters with a roman numeral from I-VI in parentheses.
I would offer the following regex. Surely this can also be done more simply, but here:
^[0-9]+[a-z]{0,2}(\s\((IV|V?I{0,3})\))?$
regex101 analysis.
Test file: test.txt
Here's what I got with grex -f text.txt
:
^(?:(?:(?:13|69)6 \(II|440 \(II|725 \(II|(?:2(?:56j|85|98)|4(?:2(?:2b|9)|1[07]|4[38])|7(?:10e|02|22)|1(?:63|75|81)|3(?:43|12)|5(?:15|83)|6(?:0[0358]|1[34])|89[08]|502) \(I|(?:13|69)6 \(V|(?:(?:783|(?:(?:23|31)|44))|883) \(I)I?\)|(?:296|409|708|821) \(I(?:II?\)|\))|9(?:(?:57i|22) \(II?\)|9(?:8 \(I(?:II?\)|\))|9a|9|[0-7])?|5(?:7[ad-fh]|[0-689])|7(?:7[a-h]|[0-689])|57?|77?|1[0-9ab]|2[013-9]|6[013-9a]|[0348][0-9]|1|2|6|[0348])?|(?:255 \(I[IV]|(?:13|69)6 \(IV|440 \(IV|725 \(IV)\)|2(?:5(?:5 \(I\)|6(?:a[a-s]|[bce-ik-oq-y])|7[a-c]|[0-489])|1[0-57-9]|4[0-9a-c]|6[0-4689]|7[1-689]|8[0-46-8]|3[0-9]|9[0-579])|(?:13|69)6 \(I\)|440 \(I\)|7(?:2(?:5 \(I\)|[1346-9])|1(?:0(?:[ab][a-z]|[df-rt-z])|[1-57-9])|10c[a-i]|7(?:4(?:a[a-i]|[b-kw-z])|[0-35-8])|3(?:9[ac-g]|[0-8])|8(?:3[ab]|[124-69])|0[013-79])|1(?:0(?:1(?:9[a-j]|[0-8])|2[01346-9]|6[4-9a-f]|7[013-9])|1(?:0[0-7]|[1-9])|3(?:6b|[0-579])|6(?:2[bc]|[014-689])|2[0-25-9]|4[013-689]|5[235-9]|7[0-46-9a-c]|8[02-9])|(?:1(?:060|54|67)|277|35[58]|7(?:16|20|8[78]))[ab]|(?:1(?:0(?:22|63|72)|50)|265|3(?:4[12]|60)|4(?:24|35)|7(?:79|80))a|1(?:0(?:19?|2|6|7)?|10?|62?|2|3|4|5|7|8)?|2(?:5(?:6a?)?|1|4|6|7|8)?|7(?:1(?:0[ab]?)?|10c|7(?:4a?)?|39?|0|2|8)?|1(?:060|54|67)|1(?:0(?:22|63|72)|50)|105[013-9]|1(?:38|47)[a-e]|3(?:2(?:4[a-eg-l]|[0-35-9])|45[a-e]|8(?:4[abd-f]|6[a-e]|[0-357-9])|0[013-9]|3[0-79]|4[046-9a-k]|5[0-49]|6[1-9]|7[0-46-9]|1[013-79])|4(?:2(?:2[ac]|[0135-8])|0[0-8]|1[1-689]|3[0-46-9ab]|4[124-79])|108[02-9]|(?:1(?:42|51)|267)[a-d]|(?:270|3(?:38|56|75))[a-c]|79(?:0[ab]|[1-9])|8(?:3(?:6[a-fh-mq-z]|[0-57-9])|4(?:5[a-d]|8[a-c]|[0-4679ab])|5(?:0[ab]|[1-9])|7(?:5[a-l]|[0-46-9])|8(?:3[a-c]|[0-24-9])|0[02-8]|1[02-9]|2[02-9]|6[0-9a]|9[1-79a]|[a-h])|(?:1(?:0[0349]|9)|2[02]|39|4[5-9]|5[2-7]|6[2-68]|7[45])[0-9]|105|1(?:38|47)|3(?:24?|45|8(?:4|6)?|0|3|4|5|6|7)?|4(?:22?|0|1|3)?|108|1(?:42|51)|267|270|3(?:38|56|75)|277|35[58]|7(?:16|20|8[78])|790?|8(?:36?|4(?:5|8)?|50?|75?|0|1|2|6|9)?|1(?:0[0349]|9)|265|3(?:4[12]|60)|4(?:24|35)|7(?:79|80)|5(?:1[01346-9]|8[0-24-9]|9[0-9a-d])|6(?:0[124679]|1[0-25-9]|7[0-9c]|9[0-57-9])|76[0-9a]|50[013-9]|5(?:1|8)?|6(?:0|1|7|9)?|76|50|2[02]|39|4[5-9]|5[2-7]|6[2-68]|7[45])$
I tried to find out with the analysis of regex101 if grex follows some kind of rule set in which it is grouped to possibly provide helpful information for the enhancement. Since I didn't really have the skills to get into the code right now. But I love regex. The pattern grex is trying to group is too wild for me to put into words.
The analysis of regex101 indicates that grex has formed a large non-capturing group and 59 alternatives to follow.
I do play around with -r
, --min-repetitions
and --min-substring-length
and -d
and -w
.
- with
-r
the thing goes much wilder. here. I believe because we do have unique lines. - I think
--min-repetitions
and--min-substring-length
doesn't do anything useful in this case. -d
and-w
would do a lot for grouping, but in this case I will be specific - especially with the roman numeral. here
If this comment does not provide important or purposeful information, let me know and I will delete it. I just wanted to share my play around.
from grex.
Hi @scheidelerl, thank you for your valuable input, very much appreciated. :)
I know about the shortcomings of the current algorithm, that's why the issues #48 and #122 already exist. I want to work on these, I'm just lacking free time at the moment. But the project is still active and I will continue working on it. I just can't tell you when.
from grex.
Related Issues (20)
- Generate BRE and ERE regex HOT 1
- Treat diffs as separate groups HOT 2
- Add support for Unicode 15 HOT 1
- "Scalable Synthesis of Regular Expressions From Only Positive Examples"
- Document how stdin is read in `--help` message HOT 3
- Create Python bindings for the library
- ETA for a new release?
- Please add support for arm arch HOT 3
- JavaScript/TypeScript bindings to the browser and nodejs HOT 2
- Couldn't compile ndarray HOT 3
- Improve command-line tool by adding STDIN support HOT 1
- installation problem HOT 1
- Deprecate enum `Feature` in favor of additional methods in `RegExpBuilder`
- Allow to specify characters that have to be converted to character class HOT 2
- usage error HOT 1
- Feature Request: Allow to pipe text into grex HOT 2
- Add optional CLI feature if using grex as a library HOT 5
- Very slow for large number of lines HOT 1
- Add WASM support
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from grex.