Comments (23)
Still too many states
![image](https://private-user-images.githubusercontent.com/44467788/300229538-0a674a36-0633-4f13-bab8-c5c3d0879bd7.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTIxNjM0MjcsIm5iZiI6MTcxMjE2MzEyNywicGF0aCI6Ii80NDQ2Nzc4OC8zMDAyMjk1MzgtMGE2NzRhMzYtMDYzMy00ZjEzLWJhYjgtYzVjM2QwODc5YmQ3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MDMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDAzVDE2NTIwN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTMwMzRiYzdhYTI3Zjc2NmMzOTc1ZThiZDJiNGY4OGM0MTkyNjM4NDNhMzQzZWE3NTdkNmRlYzVmZmE0ODFjMDQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.GBXpfOMLozGuiicrLT5zDzHON7oFvOHcsn3-d2VajNA)
Also, how do I get the DFA?
from regex.
can I set any sort of configuration to avoid the end of input (EOI)
Nope. I'm not aware of any compelling reason to do so.
sanity check (\x00-\xFF)
I don't know what you mean by this. There is no "sanity check" in the DFAs generated by this crate.
from regex.
I think this guide might also be helpful to you: https://www.freecodecamp.org/news/asking-effective-questions-a-practical-guide-for-developers/
from regex.
Why are you using a onepass DFA?
Why not use a dense DFA and enable the minimize option? https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.Config.html#method.minimize
from regex.
And the onepass DFA you show is the same as the DFA graph you show.
from regex.
Why are you using a onepass DFA?
Why not use a dense DFA and enable the minimize option? https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.Config.html#method.minimize
I did and there are too many states as compared to what should be there.
![image](https://private-user-images.githubusercontent.com/44467788/300227890-0d36e15c-9347-4be9-bf0c-019e4d9c6646.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTIxNjM0MjcsIm5iZiI6MTcxMjE2MzEyNywicGF0aCI6Ii80NDQ2Nzc4OC8zMDAyMjc4OTAtMGQzNmUxNWMtOTM0Ny00YmU5LWJmMGMtMDE5ZTRkOWM2NjQ2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MDMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDAzVDE2NTIwN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTI0MjY3MDI0NDUwNjJkY2I1M2Q0NGQ2Njc5YjZiZWI5MTkxNDVkM2E3ZjkwMTQ5ZDIzZGY1OTYzOTJlMmYxMTQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.VKnIy9BKmSQUnAnLTqEGuBdWxAjPNjYyTfJk3bUfSks)
from regex.
Only because it supports unanchored and anchored searches. You can change that also in the configuration. The DFA you show only does an anchored search.
from regex.
https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.Config.html#method.start_kind
from regex.
No, that's a minimal DFA. There's exactly one extra state to account for the end-of-input transition.
And you have the DFA. I don't know what you're asking for otherwise. Most of its functionality is made available via the Automaton trait.
You haven't shared any code, and the only thing I have to go on is that you want a minimal DFA. But just because the thing you have isn't precisely identical to what you might find in a textbook about DFAs doesn't mean it isn't minimal.
I would really encourage you to read more of the docs here. I realize there is a lot of it, but this is a necessarily complex API to support there engineering necessary to make a general purpose regex engine work with finite automata.
Otherwise, it seems to me like there might be an XY problem here. A minimal DFA is a means to an end, not an end to itself.
from regex.
Thank you so much for your reply and time. Actually I need to create a JSON which I'd use for a js project via wasm. Yes, it gives the states but the transitions can't be comprehended. I don't know how \x01 is referring to (a|b), like some sort of mapping would have had been helpful.
from regex.
Please read the docs for the DFA configuration. And then the Automaton docs. They should answer everything. This crate uses alphabet compression, and there is a knob called byte_classes that explains this. And the mapping can be accessed: https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#method.byte_classes
like some sort of mapping would have had been helpful.
You say this as if the mapping isn't accessible to you. But it is. It's all there.
from regex.
If you're using the DFA from this crate but not its search routines, you'll need to read the Automaton docs very carefully to implement your own search routines. This is not your average simplistic DFA library. :)
from regex.
Got it, I think more or less I have everything that I need, one last question, can I set any sort of configuration to avoid the end of input (EOI) and sanity check (\x00-\xFF) so as to remove them from the DFA?
from regex.
Why are both state 3 and 5 there? State 5 is not needed at all and the elements from 4 can just point to 3. How is this min DFA?
![image](https://private-user-images.githubusercontent.com/44467788/300276627-9c6c1208-ab76-42ed-8998-ac585b47f622.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTIxNjM0MjcsIm5iZiI6MTcxMjE2MzEyNywicGF0aCI6Ii80NDQ2Nzc4OC8zMDAyNzY2MjctOWM2YzEyMDgtYWI3Ni00MmVkLTg5OTgtYWM1ODViNDdmNjIyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MDMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDAzVDE2NTIwN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTFhZmMyMWI3MGZlYTkzMGYwNTE3OWMwOWFkZjY4ZGExNzUzYTNiN2UwNzI2ZTJjZGY1MWM4MmFiNTE4MjFlNGEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.o3IYs1bcbcEDKBPZxjrOBE6ft0YxeX34v1OX84JJSK8)
from regex.
Because 3 is a match state. 5 is not.
from regex.
For regex [^\\.]+
, this shouldn't be the min DFA.
![image](https://private-user-images.githubusercontent.com/44467788/300546844-3873bf76-25e1-4c22-a957-e1cfd175f853.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTIxNjM0MjcsIm5iZiI6MTcxMjE2MzEyNywicGF0aCI6Ii80NDQ2Nzc4OC8zMDA1NDY4NDQtMzg3M2JmNzYtMjVlMS00YzIyLWE5NTctZTFjZmQxNzVmODUzLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MDMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDAzVDE2NTIwN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWFlZWExYzUwODMwZTdiYjkzYTE4NWE0YWUzZDA5MTZjYjI0NDM3MGM0NzM1YTIzYjE5NDA5NWQ1ZTlhNWJlM2ImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.MmCDQf31Z8DFRxoVhiUyrlOU0qntNI8szZuQ2pice2Y)
from regex.
Yes it should.
from regex.
No, this is the correct DFA, there should only be 2 states in theory.
![image](https://private-user-images.githubusercontent.com/44467788/300550955-a16c730a-181b-41d9-beff-ce5948cb89d9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTIxNjM0MjcsIm5iZiI6MTcxMjE2MzEyNywicGF0aCI6Ii80NDQ2Nzc4OC8zMDA1NTA5NTUtYTE2YzczMGEtMTgxYi00MWQ5LWJlZmYtY2U1OTQ4Y2I4OWQ5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA0MDMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNDAzVDE2NTIwN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTcyNDgyYTg1ZWM0YjYzODczY2ViZmIxZGVhZjIxZDJiNDE3MmY2NTllMDZlMWU2MTdiY2I2MmY0Mzc3YmQ0OTUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.-ISSs4LQA75qNpX71c1VGcl9rlVTZlRlRhKa85baGpc)
from regex.
I'll be frank here: I can't keep going back-and-forth with you like this. It isn't my job to constantly prove that your unstated (and unknowable, to me) expectations of what a "minimal DFA" are for any given regex are wrong. That's not to say that it's impossible for this library to have any bugs, but if you're going to assert that its output is incorrect, I really do need you to spend a little more time on your end understanding the library you're using. In the example of your most recent comment for example, there are multiple different ways to interpret it:
- Do you know that
[^\\.]
could just as easily be written as[^.]
? Or have you typed a pattern that does something different than you expect by accident? I don't know. - Do you know that, by default,
[^a]
matches the UTF-8 encoding of all valid Unicode scalar values except fora
? And thus, it has to generate a DFA that can match between 1 and 4 bytes corresponding to valid UTF-8 sequences. So it's not just, "will match any byte except fora
." I don't know if you know this or not. - Do you know about the UTF-8 semantics but are asking how to use non-UTF-8 semantics and shrink the DFA even more? That would be
(?-u:[^.])+
. I don't know if that's what you're asking. - Finally, do you know and understand all of the above and yet have still found a legitimate bug in DFA minimization's output? I don't know, maybe, but you haven't shown me what the correct output ought to be.
Your comments leave me guessing as to which of these possibilities is actually in play here. This is an enormous amount of work on my end, and while I want to make myself available to answer questions, I have to put up a boundary here and ask you do a little more work on your end because I can't keep doing this.
No, this is the correct DFA
Stop asserting this. It isn't the correct DFA because it doesn't match the semantics of the regex you're using.
from regex.
I'll tell you what I want, maybe it might not be in the scope of this crate. Suppose I have a text dabcdabc
and I have a regex abc
. Now this will have 2 matches, both abc
with different indexes. Now I want to have a min DFA, where I can parse these matches, i.e., abc
. So, in this case, 0 -a-> 1 -b-> 2 -c-> 3*
.
I understand that the crate right now works in a way where I can capture either anywhere in between the text (unanchored) or from the beginning or end (anchored). For my case, the first use of the crate is straightforward, getting the matches, I'm having difficulty in the second DFA where I need a DFA which is able to parse the match, that is it. I want to this because I want to create a zero knowledge proof of the regex's capture's existence (working on zk-regex). If the second DFA is not already present in the crate itself (as it may not be needed), then also it's completely fine. Sorry for bothering and disturbing you so much.
from regex.
This crate does not let you say, "I want a DFA that has precisely these states." The crate let's you say, "I want a DFA that matches this pattern, with these semantics and minimize it such that it is as small as possible given the configuration and assumptions made by this crate." The "assumptions made by this crate" is not meant to be a weasel phrase. There is nothing in this crate that will materially impact the minimal size of a DFA. It may cause minimal DFAs to be a couple states bigger (a constant factor), but that's it. Separately from that, a minimal DFA may be bigger than you expect if you don't understand the semantics of the regex pattern you've written. [^a]
for example is not a single DFA state. It is many. But (?-u:[^a])
is a single DFA state. Unicode matters here and it is enabled by default.
I understand that the crate right now works in a way where I can capture either anywhere in between the text (unanchored) or from the beginning or end (anchored). For my case, the first use of the crate is straightforward, getting the matches, I'm having difficulty in the second DFA where I need a DFA which is able to parse the match, that is it.
So you want an anchored search? Then that's this, as I mentioned above: https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.Config.html#method.start_kind
... but you also said:
Suppose I have a text dabcdabc and I have a regex abc. Now this will have 2 matches, both abc with different indexes. Now I want to have a min DFA, where I can parse these matches
So if you want to find the match abc
in dabcdabc
and starting at the beginning, then you need an unanchored search. Not an anchored search. Your DFA of 0 -a-> 1 -b-> 2 -c-> 3*
is incorrect.
Sorry for bothering and disturbing you so much.
I appreciate that. But to be clear, I don't want an apology. I want you to do some work on your end. I tried to explain why your approach was extremely time consuming for me, because I don't know what it is that you want. And even now, with your latest response, it seems this entire issue is even more confused than it was. You say you want to find matches anywhere in a haystack, but then go on to say that you want an anchored search.
I think at this point I really do have to insist that you provide a minimal working program, even if it doesn't do what you want. Write down the commands to compile and run it. Then tell me the output that you see. And then tell me the output that you want to see. This should have been how you started this issue.
from regex.
use regex::Regex;
pub fn main() {
// Step 1 - Find all matches of a regex in a string
let text = "dabcdabcd";
let re = Regex::new(r"abc").unwrap();
let mut results = vec![];
for mat in re.find_iter(text) {
results.push(mat.as_str());
}
// Step 2 - Create a DFA that only matches the matches in the results vector
/*
Firstly, I get all the matches of the regex in the text,
and stores them in the results vector. After this, I want a
DFA which can only parse the matches in the results vector.
No characters before or after, just the matches. I don't
want anchored or unanchored kind since I don't want to match
anything except the matches in the results vector, neither
before nor after. So think of it as this way, that I just
want to check if the DFA can parse the matches in the results
or not. Hence for this, the DFA comes out to be
0 -a-> 1 -b-> 2 -c-> 3*
*/
}
I think it should be clear now.
from regex.
I'm going to close this because I think I've been about as helpful as I can possibly here. I do not understand the disconnect. Sorry.
from regex.
Related Issues (20)
- Mention MSRV change in CHANGELOG.md HOT 1
- regex-lite: make the std feature optional HOT 9
- Add/replace std::OnceCell mention in the Readme HOT 1
- Use the same name of a capture group in different alternatives of disjunction (| operator) HOT 1
- Big bump in code size with the switch to regex-automata in 1.9.0 HOT 2
- `regex-syntax` error messages highlight incorrect characters/not handling graphemes correctly HOT 4
- Add char_range() method for the match type HOT 2
- `regex::bytes::Regex::is_match` with a simple pattern with long sequences of wildcards is significantly slower than a naïve alternative HOT 2
- UnicodeSetsMode support (`v` flag mode, `\q`) HOT 9
- Detect if a replacement may allocate HOT 3
- Add method to get full match from `Captures` HOT 3
- Have a way to iterate over sub matches with names included HOT 1
- O(m * n) lookaround
- `meta::Cache::reset` can panic
- Inconsistent behavior with zero-width matches on empty strings
- Valid prefix search (with ^) goes into dead state HOT 3
- The regex parse error while the expre is correct ! HOT 2
- Onepass DFA always has empty captures (user error) HOT 2
- dfa/onepass.rs: index out of bounds HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from regex.