Code Monkey home page Code Monkey logo

Comments (8)

jclark avatar jclark commented on August 22, 2024 1

I think the regex module should produce a "natural" semtype for the regex. I expect this will add some complexity to the current implementation, involving some sort of intermediate data structure between the regex string and the semtype. I don't think it necessarily needs to be an AST for the regex. Rather it might be something that represents the semtype that is to be built.

from nballerina.

jclark avatar jclark commented on August 22, 2024

There's something weird going on here.

If I change the innermost string on the second regex from glPrQ to gQ, it works fine (and quickly). If I add one more character and make it glQ, I get bored of waiting - it seems to be producing an unbounded number of new BDDs.

from nballerina.

jclark avatar jclark commented on August 22, 2024

#1076 is also slower than I would expect. My guess is that it is failing to recognize some equivalent BDDs as equivalent.

from nballerina.

jclark avatar jclark commented on August 22, 2024

The BDDs can be printed out with bddToString. It just seems to be exploring a large number of different permutations. I suspect the nesting of * is causing exponential behaviour. The BDDs aren't consistently getting bigger .

Let's see what happens after #1075 is fixed.

from nballerina.

jclark avatar jclark commented on August 22, 2024

Still get the problem after merging #1078 to fix #1075.

from nballerina.

heshanpadmasiri avatar heshanpadmasiri commented on August 22, 2024

I think the root of the problem in both #1074 and #1076 is that regex produces a type definition that is very complicated but ultimately equal to a much more simpler type. This is because the regex module can't (at least to the extent I have thought about it) produce that simpler type under a particular situation with nested stars (when the nested part is the beginning of the star (ex: ((a)b)) due to a limitation in Semtypes.

To explain why regex has this limitation, I shall start by explaining how the regex module does "*". If we have "(R1)*R2" (where R1 and R2 them self are regular expressions) we will define its type (say T as follows)

type T = T_R_star | TR2;

Here TR2 is the type representing R2 (which is not relevant to the problem). T_R_star is a type representing R1 and at the end must recurse back to R. For example take "(ab)*cd"

type T T_ab | T_cd;
type T_cd ["c", T_d];
type T_d ["d", ()];

type T_ab ["a", T_b];
type T_b ["b", T]; // <- note the recursion here

To generate T regex do the followings. First, it will generate the type for T_cd which is independent of T. Then it will generate a reference for T_ab and define T as the union of that reference and T_cd.(code). Then it tries to recursively define T_ab by passing T which now properly defined (so that at the end of T_ab it can recurse to T). However, this approach introduces a limitation that T_ab must be a list (or mapping since only those two types can give you a reference). (This is also why you can't do this in a different order like starting with T_ab because we can't create a reference for T since it is a union). This is fine if whatever inside the * starts with a character (not a * or |) because that ensures T_ab is a list. But if we have something like "((a)*b)*cd" (both #1074 and #1076 have this character) we would like our type definitions to be like

type T T_a_star_b|T_cd;
type T_cd ["c", T_d];
type T_d ["d", ()];

type T_a_star_b T_a_star | T_b;
type T_a_star ["a", T_a_star_b];
type T_b ["b", T];

Now we have a problem with T_a_star_b because it is a union we can't get a reference for that type and as a result, we can't define T. To work around this limitation regex produce an equivalent type to T_a_star_b which is a list. The intuitive reason why such an equivalent type must exist is because of the observation whatever T_a_star_b matches must be of the form [A, B] since we don't allow empty * (or |) and all non-empty string sequences end up in that form. This part of the code is supposed (I haven't formally proven this but so far every test case successfully passes this and even for this stack overflowing case I am getting 2 members as expected so I think it is still working) do that. Now the problem is as the number of "things" inside T_a_star_b grows (that is we have more characters than 2 or maybe stars and unions inside) the type of A and B gets hugely more complicated.

Now when we do normal regular expression match this is not a big deal since you are matching a character with A. But when you do the inclusion, that is when you compare A with another A (say A') I think that comparison is very complicated and this is why as you increase the number of characters inside the star things gets exponentially slower.

Now if we are to avoid this I think there are 3 options

  1. Make our subtype check "better" so it can do the comparison between A and A' fast (I am not sure this is worth it since I don't think in actual code anybody will write anything like A here)
  2. Avoid having to do the "equivalent type" thing. I can think of two ways of doing this
    2.1 Allow references to unions (similar to lists and maps). (I am not sure how useful this will become outside of regex)
    2.2 Rewrite regex to produce ballerina code (and then translate that into semtype) instead of trying to do it directly. (I don't think we can avoid this problem if we stick to producing semtype without adding 2.1)

from nballerina.

jclark avatar jclark commented on August 22, 2024

Can you not do what the front-end does (which is to expand definitions like T_a_star_b until you get something that's an atomic type)?

from nballerina.

heshanpadmasiri avatar heshanpadmasiri commented on August 22, 2024

@jclark if I recall correctly we did considered that option. The way we explored back then was to build a proper parser and AST for regex so when we come to T we know what the rest of definition will look like. I think we tried to avoid doing that since we are dealing with a regular language which shouldn't need an AST. (In retrospect I think we avoided that because we had "equivalent" type solution which didn't caused us any problems back then)

from nballerina.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.