See parser.hpp for "docs", and https://github.com/xparq/parsertoy/issues for change tracking.
xparq / parsertoy Goto Github PK
View Code? Open in Web Editor NEWProgrammable "meta-regexes" (And a porting experiment from PHP to C++)
Programmable "meta-regexes" (And a porting experiment from PHP to C++)
See parser.hpp for "docs", and https://github.com/xparq/parsertoy/issues for change tracking.
And then, out of consistency:
ERROR
, CONST
, OUT
), too, also stampeded by the Windows headers!
parser.hpp
, that's too bad, but at least #undef
ed at the end of the header...Yet another nail in the coffin of that sad, aborted COPYLESS_GRAMMAR
experiment?
(See e.g. https://abseil.io/tips/117)
Also: that double-copy case ("PROD: directly from ATOM-RULE, plus PROD from it") in the main
test set is still beyond me... :-/
Well, this is a huge big inconvenient question: the fully interpreted -- but less efficient -- mode, using the rules as-is, is mutually exclusive with running a precompiled version of the rules. Some of it could be merged, but the compiled version basically disqualifies self-modifying, or even just dynamically creating rules.
(At least eval
(#28) could easily use the compiler, but even that feels so sad...)
(Requires #5/2: parent/child tree, for searching...)
It's impossible to deduce implicitly, which exact enclosing rule (level) a standalone (nullary) _SELF
should refer to.
So, there should be support in the grammar def. syntax to mark a rule for later reference.
Also note that named references -- albeit inefficient (full syntax-tree search for every deref.! :-o ) -- allow named rule definitions after a referencing _USE, whereas the implicit marker + _SELF logic couldn't really do that.
(For the const ones, of course.)
(Requires #2)
Again, just like that of proper regexes... (Nameless ones just collecting into a vector
, named ones into a map
. UPDATE: Well, not that simple, though, see task 2...)
This could be a fundamental building block of subsequent parsing features.
Note: it should be used "wisely", i.e. to not nest capture ops carelessly. I mean it's fine, but probably makes little sense. E.g. I could detect it and issue a warning I guess...
this
as the key loses the "proper monotonic ordering for pseudo-indexing by enumeration" property!
operator[](size_t index)
) for unnamed results (via Parser
)Rather than fiddling with auto-assigning contiguous indexes to multiple matching unnamed _SAVE
rules, and having distinct (disjunct) capture-results collections for the two modes, just using the stringified this
pointer for the unnamed ones seems better, with a thin iterator wrapper for enumerating them. They can even go into the very same results map, perhaps!
["name"]
getters, and it can at least be a little simpler with numeric indexes and keeping the map ordered, and not having to slalom around the explicit name keys either.In "looping" rules (like ANY/MANY), or for SAVE_AS rules with the same name, repeated captures should append to a list, not just overwrite the last one...
Only... this could wreak havoc with recursion; just trying to think about backtracking from partial matches hurts already.
Clearing the results on backtracking would be like stack unwinding with exception handling: there should be a reference baseline "catch" point in the backtracking path, where any tentatively successful "sub-results" should be unrolled on failure.
And that point is just there implicitly right before every match: if that match failed, everything that had been collected during that (i.e. by successful child (sub-) matches) should be undone, as if the match hadn't happened at all.
SAVE "name"
are there already (#4), so all it takes is an operator like USE "name"
-- which is also there already (#15)...It's just that SAVE
and DEF
(which USE
is the counterpart of, currently) use different namespaces.
"name"
itself should tell where to look for the item, like URLs?
"\name"
or "$name"
, or both -- gives an instant way to make it work, and leaves the door open for a preprocessing phase to convert it to whatever it wants.USE
would need to know actually, if the named capture has actually matched already and it has not been backtracked, so it still has valid data! :-o So, it:
""
may not actually mean that! :-/ )(Still no use of a proper make
, as any change in the header would trigger a full rebuild of everything anyway.)
The arg. is a direct C++ RULE obj. address, which is auto-linked across object copy/move events via a _from
chain (map) built by the ctors... (Copy ctors update the target for _from to this, and also create 1 new entry for the source obj, setting it also to this; move ctors just update the target for _from: links[_from] = this
)
During execution, when a direct address arg is found, it's original value is looked up (as a key in the links map), and replaced with the final address. If the link is not found, it's an error (and the instruction is aborted (-> "kinda exceptions")).
And then, to prevent subsequent redundant lookups (the exact reason why _DIRECT adressing exist in addition to named references), the instruction opcodes themselves are replaced with their ..._RESOLVED counterparts...
Note: this precludes the entire syntax rule tree ("program") from being const
! But then it at least opens the door for constructing rules run-time! Or... it means compilation (#31; as already mentioned in #15) -- which, then OTOH, would kinda prevent dynamically modifying/creating grammar then...
This suggest the usual technique of allocating bits of the opcode for encoding the addressing mode, to allow generalizing it to any instructions.
It's c++20, and I've just learned about it; it could make init()
(#18) redundant (not just implicit).
(https://stackoverflow.com/questions/57845131/what-is-constinit-in-c20)
...then merge the two-headed build script set into just one build.cmd
!
Either as a rule or a pattern... So, a RULE
or NAMED_PATTERNS["_END"]
?...
Note: this is much like regex '$'. Seems like a RULE
OP then, but it can also well be a "virtual" (non-consuming) named pattern like _EMPTY
, or others...
It could also be the default. (Adjust the tests then!)
(I mean built-in language.)
The native C++ syntax for grammar definition is pretty light, but it's still cumbersome compared to how light it could be in principle...
Umm, also, the C++ aggreg. init. syntax can't support defining recursive productions directly...
Well, but its only benefit compared to
CONST_OPERATORS[opcode] = [](Parser& p, size_t pos, const Rule& rule, OUT size_t& len) -> bool { ... }
seems to be that it can check if it's been set already.
And perhaps that the [CONST_]OPERATORS map can be hidden, can also be considered an advantage, but it's a stretch...
OTOH:
If really checking for existence (instead of overriding), errors would need to be handled by callers...
Also, custom OPCODEs would probably still need to be defined symbolically -- somewhere... --, to be practical for actually defining grammar rules. :-/
Use it in init()
, and also in test/extending
!
But, again, there should be two sets: one for const, another for non-const, right? What does C++ do nowadays with Function
types differing in c-v only?
! Hehh, congratulations!... :) It doesn't accept a param. with a c-v mismatch, yet it refuses to dispatch on it, so: "ambiguous call" to define, if trying to overload it based on the c-v diff... The good old begin/end cbegin/cend crap, still...
I've added it tentatively for some features I don't think make very much sense after all... (#5)
But in case it would still need to be reintroduced, here's this issue to pin that change to, so the commit could be easier to find.
See #32!
OK, not replacing anything there, the checks themselves don't rely on those. (And shouldn't! I'm lookin' at you, output-capture-testing-I-haven't-implemented-yet-but-wanna!)
Well, to avoid the brittle address-tracking horror of #16, a named rule can't pre-register it's this
address at construction time, where things are in flux.
At runtime, though:
the first time the engine sees a name, it can safely remember its address. (Well, its parent's address, technically, as a _NAME opcode "rule" is always in the PROD of some owner rule -- the real one that the name belongs to.)
_DEF "name"
and _USE "name"
have now been implemented, doing just that (72e4508).or, if the first time it sees a branch OP (or other reference to a name), and it can't find it in the name->rule lookup table, it does a full search and updates the table. (Or errs out.)
This would also require a preprocessing/compilation stage over the grammar. (It's fully interpreted yet.) -> #31
Make a distinction between NAME
and LABEL
!... Name should really just be the name of a rule... err... but should it also be an ID
then?... And LABEL
is really any branching (GOTO/CALL) destination... Branching can apply to both names and labels, but not all labels are names! Labels should probably be more indirect, and also support positions (slots) as branching targets, where the actual rule can get replaced by another (independent of those rules' own name/identity)!
So, labels are a subset of names. Every name can be used as a label, but labels can't always be used as names.
(A limited, but probably still practical early feature for purely hierarchical grammars with no cross-refs. (#15) and recursion (#14).)
The "only thing" is just to find the proper tree node during a match...
The trick is that without "function calls" (macro substitution, or recursion) or other dynamic features in the grammar, there are no dynamic stack/context frames either, and the structure of a valid input has a one-to-one correspondence with the static structure of the grammar. I'm not sure if this is practical at all, or not; we'll see. (I suspect that simple cases could be handled just with match capturing, whereas ASTs would be useful for complex grammars -- exactly, where static grammars lose traction... So, I guess not very practical...)
Grammar construction itself can at least assign a unique ID (like their this
ptr) to every rule (temporaries would take some IDs to their grave, no "other harm" done), so a matching rule could immediately know where to write (e.g. using its ID as a map index).
Maintaining the parent/child hierarchy of a rule tree is somewhat more involved, but still looks kinda trivial: whenever a rule is constructed from a PRODUCTION
(i.e. not as an ATOM
), after the copy/move of the aggregate:
_SEQ
deduction; -> #7), andthis
) as parent.Then it's as simple as a successful match just always updating the corresponding node!
That's it. (As long as it's all-or-nothing (as usual with at least plain context-free syntaxes, AFAIK): any failed match anywhere would invalidate the whole thing anyway... The parser's recursive matcher does all the backtracking implicitly.)
UPDATE: init()
is now called implicitly when creating the first ATOM rule!
(And still also in Parser(), as a backup, to also support OP-only rules, and, well, just in case...)
It's currently done by the Parser ctor (because at least that has only one...), but RULE objects also need it: they can't even be created
before having that done... So, it's completely futile in Parser(...), as its syntax argument would get setup from a bunch of temporaries
created before the constructor's body had a chance to run...
Now, it could perhaps still be done in its first member init (e.g. to set some dummy initialized
member), but this is too much cryptic
trickery just to fight C++ for little benefit. Just document that init()
must be called for now...
Also, RULE has a whole bunch of ctors, so where to put it there?... :-/
(Note: the OP
(handler) map looks like another shared object, but it's static
only to make it a singleton, not for sharing. That one at least is only actually used by Parser
!)
An invariant here is extensibility: currently the pattern names (IDs) are just strings, so that the building blocks (atoms) of grammars can be extended by users seamlessly, either by adding to or changing the NAMED_PATTERNS[]
map, or by using string literals that can themselves be regex patterns (not just plain strings).
This really is shared data across all the RULE
and Parser
objects etc., and it can't be a const
object either.
(I don't really want to even put its interface into Parser
(or RULE
), as it would feel less generic (like RULE::PATTERNS
or Parser::PATTERNS
), and that alone wouldn't solve the static init fiasco anwyay.)
Obviously, it's relevant for Prod rules mostly.
The default greediness (the regex always eating its way up to the end of the entire source string, if at all possible) is not always what users want, especially when matching long structured text (like code) consisting of short atoms.
And I haven't seen "greedy" mentioned even once at the extended regex reference page, while it's readily available with ECMA's, per-pattern.
And ECMA is the JS and C++ native regex grammar anyway.
Or just not initializing it to 0 for non-consuming operators.
-> That was it, indeed, in the new _DEF
op. (probably called from a _SEQ
loop)!
Now in match():
#ifdef NDEBUG
len = 0; //! This doesn't help (it's even bad, for false sense of sec., and
//! easily masking bugs with that benign-looking 0 in the output),
//! as 0 is a valid output, which should still be ignored -- as any
//! others! -- if match() returned false!
//! OK, disabling it in debug builds for better diagnostics, but
//! enabling in release mode for some cushioning!...
#else
len = (unsigned)-666; // And indeed, this *did* crash! :) (e.g. #29)
#endif
I think I've earned it... :) Then finally (precompiled) regex objects could also be kept there (more) easily.
What if I want to match "//" as a C++ comment?!... One could write a regex for that (e.g. ////
should just work), but...
A simple thing like { REGEX(".*"), ...}
is only half of the solution, as the ATOM type is still a string in both cases...
I should pr'ly just bite the bullet and split them, which would be required anyway for supporting precompiled regexes...
First of all, what exactly an "empty rule" is, is pretty much undefined:
{}
, as it gets autoconverted to NIL
. Actually: {NIL}
, even...""
instead?! And should that be the same as {""}
, too?...NIL
and ""
mean the same, so therefore match the same way?""
always matching (unanchored), while NIL
was supposed to be a forced failed match (as opposed to the forced successful match of T
)?!...Since every rule is boolean, it's an unavoidable question...
If false (NIL), then hitting an empty rule could cancel the matching of the entire syntax, rather than just being redundant, as one might expect, and as is with the _EMPTY pattern!...
OTOH, in LISP, ()
is false (NIL).
Currently it's inconsistent, too: using "" would create a NIL rule, but using _EMPTY would match anything successfully.
_Not sure which one is right... _EMPTY might actually be wanted for matching the empty string exactly, and not matching a non-empty one... So, it's ambiguous.
_EMPTY
) vs "implicit" empty (""
)?Anyway, there should be a straightforward and reliable way to choose to match the empty string only (in the sense that any non-empty string should not match). That should be achievable with anchoring (-> #21).
It also feels natural to have both a forced-false (NIL) and a forced-true (T) rule, so this really is a conundrum now... The empty Production kinda must be NIL (to mimic LISP), but then it would not match anything, which would be inconsistent with the empty Atom ("") matching everything...
(This is an umbrella issue to keep all the related information kinda-sorta organized, and allow removing most of the notes from the source.)
Currently disabled; it only made the code more complex and less readable, with all the #ifdef
s etc. :-/
Should copy it over to a separate branch and mature it out, if at all possible, and free the main branch from its burden.
COPYLESS_GRAMMAR for empty rules! The problem is that unlike all the other objects that we just reference in this mode, an empty rule is actually a non-empty PROD vector
that we need to create ourselves... somehow...
A static obj could kinda do, if my ctors could be made constexpr
, but std::vector itself is dynamically allocated anyway, and isn't constexpr
, I guess...
The COPYLESS_GRAMMAR API flavor should refuse to take rvalue objects!
From the header comment, FTR:
!!DO NOT USE THIS:
- #define COPYLESS_GRAMMAR to avoid copying the grammar rules, in case the
life-cycle management of the source objects is of no concern. (Albeit the
COPYLESS setup API should refuse to take rvalue objects!)
Alas, it doesn't really work:
a) The PRODUCTION constructs are std::vectors, which do copy whatever we put in there...
b) There's no sane way to define the grammar rules directly using C++ code without a
plethora of temporaries (to copy)...
c) And grammars are not very big anyway, so... (It just felt so bad, in general, e.g.
for heap fragmentation and the churning of dozens of tiny temporary vectors...)
I seem to have implemented COPYLESS_GRAMMAR for nothing! :)
Move-construction should be the way to go...
Almost there -- _SAVE
(and possibly other extensions) in the old tests are not yet added etc.
(Not that there are "features", BTW... This can only check the syntax of a source, nothing else yet.)
I can see no reason for that currently, but anyway, old notes moved here from the source, FTR:
-> https://stackoverflow.com/questions/32580066/using-pcre2-in-a-c-project
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
...
PCRE2_SPTR subject = (PCRE2_SPTR) "this"s.c_str();
PCRE2_SPTR pattern = (PCRE2_SPTR) "([a-z]+)|\\s"s.c_str();
...
int errorcode; PCRE2_SIZE erroroffset;
pcre2_code *re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED,
PCRE2_ANCHORED | PCRE2_UTF, &errorcode,
&erroroffset, NULL);
PCRE is UNICODE-aware: http://pcre.org/original/doc/html/pcreunicode.html
-> Header-only C++ wrapper: jpcre2!...
All the nullary operators (NIL, T, _SELF) would be so much happier.
The non-uniformity burden of that is not confined to the related OP handlers, though, unfortunately!...
Deferring it until match()
is causing problems already (incl. blocking #5 etc.).
In any context, where a sequence rule must be identified, or operated on, this ambiguity/duality is a burden.
OTOH... This would cost things like _CAPTURE
becoming more cumbersome, which can now be just tucked in front of any implicit sequence, and it just passes the remainder of a rule as-is to match()
(which will imply a SEQ). Moving the conversion from run-time (match
) to "static" grammar construction, CAPTURE
itself -- and any other user-defined operators that create productions run-time!!!... -- would have to construct a proper SEQ out of their argument list (i.e. for prefix ops (which is: all...) it's "just the rest of the list"); just an iterator to a "next" in-place position of a const production would no longer work! A _SEQ
opcode would have to be inserted!... :-/
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.