Code Monkey home page Code Monkey logo

Comments (23)

Bisht13 avatar Bisht13 commented on July 17, 2024 1

Still too many states

image

Also, how do I get the DFA?

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024 1

can I set any sort of configuration to avoid the end of input (EOI)

Nope. I'm not aware of any compelling reason to do so.

sanity check (\x00-\xFF)

I don't know what you mean by this. There is no "sanity check" in the DFAs generated by this crate.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024 1

I think this guide might also be helpful to you: https://www.freecodecamp.org/news/asking-effective-questions-a-practical-guide-for-developers/

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Why are you using a onepass DFA?

Why not use a dense DFA and enable the minimize option? https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.Config.html#method.minimize

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

And the onepass DFA you show is the same as the DFA graph you show.

from regex.

Bisht13 avatar Bisht13 commented on July 17, 2024

Why are you using a onepass DFA?

Why not use a dense DFA and enable the minimize option? https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.Config.html#method.minimize

I did and there are too many states as compared to what should be there.

image

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Only because it supports unanchored and anchored searches. You can change that also in the configuration. The DFA you show only does an anchored search.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.Config.html#method.start_kind

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

No, that's a minimal DFA. There's exactly one extra state to account for the end-of-input transition.

And you have the DFA. I don't know what you're asking for otherwise. Most of its functionality is made available via the Automaton trait.

You haven't shared any code, and the only thing I have to go on is that you want a minimal DFA. But just because the thing you have isn't precisely identical to what you might find in a textbook about DFAs doesn't mean it isn't minimal.

I would really encourage you to read more of the docs here. I realize there is a lot of it, but this is a necessarily complex API to support there engineering necessary to make a general purpose regex engine work with finite automata.

Otherwise, it seems to me like there might be an XY problem here. A minimal DFA is a means to an end, not an end to itself.

from regex.

Bisht13 avatar Bisht13 commented on July 17, 2024

Thank you so much for your reply and time. Actually I need to create a JSON which I'd use for a js project via wasm. Yes, it gives the states but the transitions can't be comprehended. I don't know how \x01 is referring to (a|b), like some sort of mapping would have had been helpful.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Please read the docs for the DFA configuration. And then the Automaton docs. They should answer everything. This crate uses alphabet compression, and there is a knob called byte_classes that explains this. And the mapping can be accessed: https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#method.byte_classes

like some sort of mapping would have had been helpful.

You say this as if the mapping isn't accessible to you. But it is. It's all there.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

If you're using the DFA from this crate but not its search routines, you'll need to read the Automaton docs very carefully to implement your own search routines. This is not your average simplistic DFA library. :)

from regex.

Bisht13 avatar Bisht13 commented on July 17, 2024

Got it, I think more or less I have everything that I need, one last question, can I set any sort of configuration to avoid the end of input (EOI) and sanity check (\x00-\xFF) so as to remove them from the DFA?

from regex.

Bisht13 avatar Bisht13 commented on July 17, 2024

Why are both state 3 and 5 there? State 5 is not needed at all and the elements from 4 can just point to 3. How is this min DFA?

image

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Because 3 is a match state. 5 is not.

from regex.

Bisht13 avatar Bisht13 commented on July 17, 2024

For regex [^\\.]+, this shouldn't be the min DFA.

image

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Yes it should.

from regex.

Bisht13 avatar Bisht13 commented on July 17, 2024

No, this is the correct DFA, there should only be 2 states in theory.

image

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

I'll be frank here: I can't keep going back-and-forth with you like this. It isn't my job to constantly prove that your unstated (and unknowable, to me) expectations of what a "minimal DFA" are for any given regex are wrong. That's not to say that it's impossible for this library to have any bugs, but if you're going to assert that its output is incorrect, I really do need you to spend a little more time on your end understanding the library you're using. In the example of your most recent comment for example, there are multiple different ways to interpret it:

  • Do you know that [^\\.] could just as easily be written as [^.]? Or have you typed a pattern that does something different than you expect by accident? I don't know.
  • Do you know that, by default, [^a] matches the UTF-8 encoding of all valid Unicode scalar values except for a? And thus, it has to generate a DFA that can match between 1 and 4 bytes corresponding to valid UTF-8 sequences. So it's not just, "will match any byte except for a." I don't know if you know this or not.
  • Do you know about the UTF-8 semantics but are asking how to use non-UTF-8 semantics and shrink the DFA even more? That would be (?-u:[^.])+. I don't know if that's what you're asking.
  • Finally, do you know and understand all of the above and yet have still found a legitimate bug in DFA minimization's output? I don't know, maybe, but you haven't shown me what the correct output ought to be.

Your comments leave me guessing as to which of these possibilities is actually in play here. This is an enormous amount of work on my end, and while I want to make myself available to answer questions, I have to put up a boundary here and ask you do a little more work on your end because I can't keep doing this.

No, this is the correct DFA

Stop asserting this. It isn't the correct DFA because it doesn't match the semantics of the regex you're using.

from regex.

Bisht13 avatar Bisht13 commented on July 17, 2024

I'll tell you what I want, maybe it might not be in the scope of this crate. Suppose I have a text dabcdabc and I have a regex abc. Now this will have 2 matches, both abc with different indexes. Now I want to have a min DFA, where I can parse these matches, i.e., abc. So, in this case, 0 -a-> 1 -b-> 2 -c-> 3*.

I understand that the crate right now works in a way where I can capture either anywhere in between the text (unanchored) or from the beginning or end (anchored). For my case, the first use of the crate is straightforward, getting the matches, I'm having difficulty in the second DFA where I need a DFA which is able to parse the match, that is it. I want to this because I want to create a zero knowledge proof of the regex's capture's existence (working on zk-regex). If the second DFA is not already present in the crate itself (as it may not be needed), then also it's completely fine. Sorry for bothering and disturbing you so much.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

This crate does not let you say, "I want a DFA that has precisely these states." The crate let's you say, "I want a DFA that matches this pattern, with these semantics and minimize it such that it is as small as possible given the configuration and assumptions made by this crate." The "assumptions made by this crate" is not meant to be a weasel phrase. There is nothing in this crate that will materially impact the minimal size of a DFA. It may cause minimal DFAs to be a couple states bigger (a constant factor), but that's it. Separately from that, a minimal DFA may be bigger than you expect if you don't understand the semantics of the regex pattern you've written. [^a] for example is not a single DFA state. It is many. But (?-u:[^a]) is a single DFA state. Unicode matters here and it is enabled by default.

I understand that the crate right now works in a way where I can capture either anywhere in between the text (unanchored) or from the beginning or end (anchored). For my case, the first use of the crate is straightforward, getting the matches, I'm having difficulty in the second DFA where I need a DFA which is able to parse the match, that is it.

So you want an anchored search? Then that's this, as I mentioned above: https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.Config.html#method.start_kind

... but you also said:

Suppose I have a text dabcdabc and I have a regex abc. Now this will have 2 matches, both abc with different indexes. Now I want to have a min DFA, where I can parse these matches

So if you want to find the match abc in dabcdabc and starting at the beginning, then you need an unanchored search. Not an anchored search. Your DFA of 0 -a-> 1 -b-> 2 -c-> 3* is incorrect.

Sorry for bothering and disturbing you so much.

I appreciate that. But to be clear, I don't want an apology. I want you to do some work on your end. I tried to explain why your approach was extremely time consuming for me, because I don't know what it is that you want. And even now, with your latest response, it seems this entire issue is even more confused than it was. You say you want to find matches anywhere in a haystack, but then go on to say that you want an anchored search.

I think at this point I really do have to insist that you provide a minimal working program, even if it doesn't do what you want. Write down the commands to compile and run it. Then tell me the output that you see. And then tell me the output that you want to see. This should have been how you started this issue.

from regex.

Bisht13 avatar Bisht13 commented on July 17, 2024
use regex::Regex;

pub fn main() {
    // Step 1 - Find all matches of a regex in a string

    let text = "dabcdabcd";
    let re = Regex::new(r"abc").unwrap();
    let mut results = vec![];
    for mat in re.find_iter(text) {
        results.push(mat.as_str());
    }

    // Step 2 - Create a DFA that only matches the matches in the results vector

    /*
        Firstly, I get all the matches of the regex in the text,
        and stores them in the results vector. After this, I want a
        DFA which can only parse the matches in the results vector.
        No characters before or after, just the matches. I don't
        want anchored or unanchored kind since I don't want to match
        anything except the matches in the results vector, neither
        before nor after. So think of it as this way, that I just
        want to check if the DFA can parse the matches in the results
        or not. Hence for this, the DFA comes out to be
                        0 -a-> 1 -b-> 2 -c-> 3*
    */
}

I think it should be clear now.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

I'm going to close this because I think I've been about as helpful as I can possibly here. I do not understand the disconnect. Sorry.

from regex.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.