xparq / parsertoy Goto Github PK

View Code? Open in Web Editor NEW

0.0 1.0 0.0 571 KB

Programmable "meta-regexes" (And a porting experiment from PHP to C++)

Batchfile 0.65% C++ 99.35%

parsertoy's Introduction

See parser.hpp for "docs", and https://github.com/xparq/parsertoy/issues for change tracking.

parsertoy's People

Contributors

Watchers

parsertoy's Issues

Parsing::define(OPCODE, OPERATOR)

Well, but its only benefit compared to

CONST_OPERATORS[opcode] = [](Parser& p, size_t pos, const Rule& rule, OUT size_t& len) -> bool { ... }

seems to be that it can check if it's been set already.
And perhaps that the [CONST_]OPERATORS map can be hidden, can also be considered an advantage, but it's a stretch...

OTOH:

If really checking for existence (instead of overriding), errors would need to be handled by callers...
Also, custom OPCODEs would probably still need to be defined symbolically -- somewhere... --, to be practical for actually defining grammar rules. :-/
Use it in init(), and also in test/extending!
But, again, there should be two sets: one for const, another for non-const, right? What does C++ do nowadays with Function types differing in c-v only?
! Hehh, congratulations!... :) It doesn't accept a param. with a c-v mismatch, yet it refuses to dispatch on it, so: "ambiguous call" to define, if trying to overload it based on the c-v diff... The good old begin/end cbegin/cend crap, still...
- Leaving it commented out for now.

_NAME, _CALL, _GOTO (and/or: _DEF, _USE, ...)

Well, to avoid the brittle address-tracking horror of #16, a named rule can't pre-register it's this address at construction time, where things are in flux.

At runtime, though:

the first time the engine sees a name, it can safely remember its address. (Well, its parent's address, technically, as a _NAME opcode "rule" is always in the PROD of some owner rule -- the real one that the name belongs to.)
- Well, not really: rules can be copied/created even run-time, so those addresses must be updated then, too! :-/
  - Short of that, only fresh new full searches can provide a reliable association!
    _DEF "name" and _USE "name" have now been implemented, doing just that (72e4508).
or, if the first time it sees a branch OP (or other reference to a name), and it can't find it in the name->rule lookup table, it does a full search and updates the table. (Or errs out.)
This would also require a preprocessing/compilation stage over the grammar. (It's fully interpreted yet.) -> #31
- Which would also imply that a (re)compilation would also be needed for every single run-time change to the grammar!

Make a distinction between NAME and LABEL!... Name should really just be the name of a rule... err... but should it also be an ID then?... And LABEL is really any branching (GOTO/CALL) destination... Branching can apply to both names and labels, but not all labels are names! Labels should probably be more indirect, and also support positions (slots) as branching targets, where the actual rule can get replaced by another (independent of those rules' own name/identity)!

So, labels are a subset of names. Every name can be used as a label, but labels can't always be used as names.

`_move()` should clean up the sources (if can't just `swap`)

Obviously, it's relevant for Prod rules mostly.

EMPTY_STRING from local statics to (one) namespace static (per module)

Remove that distracting explicit parent/child tree setup...

I've added it tentatively for some features I don't think make very much sense after all... (#5)

But in case it would still need to be reintroduced, here's this issue to pin that change to, so the commit could be easier to find.

Use `constinit` for the static pattern etc. tables

It's c++20, and I've just learned about it; it could make init() (#18) redundant (not just implicit).

(https://stackoverflow.com/questions/57845131/what-is-constinit-in-c20)

Tests: Replace those DBG(), DUMP() etc. with NDEBUG-proof alternatives

See #32!

OK, not replacing anything there, the checks themselves don't rely on those. (And shouldn't! I'm lookin' at you, output-capture-testing-I-haven't-implemented-yet-but-wanna!)

COPYLESS_GRAMMAR

(This is an umbrella issue to keep all the related information kinda-sorta organized, and allow removing most of the notes from the source.)

Currently disabled; it only made the code more complex and less readable, with all the #ifdefs etc. :-/

Should copy it over to a separate branch and mature it out, if at all possible, and free the main branch from its burden.
COPYLESS_GRAMMAR for empty rules! The problem is that unlike all the other objects that we just reference in this mode, an empty rule is actually a non-empty PROD vector that we need to create ourselves... somehow...

A static obj could kinda do, if my ctors could be made constexpr, but std::vector itself is dynamically allocated anyway, and isn't constexpr, I guess...
The COPYLESS_GRAMMAR API flavor should refuse to take rvalue objects!

From the header comment, FTR:

!!DO NOT USE THIS:
- #define COPYLESS_GRAMMAR to avoid copying the grammar rules, in case the
  life-cycle management of the source objects is of no concern. (Albeit the
  COPYLESS setup API should refuse to take rvalue objects!)
  Alas, it doesn't really work:
  a) The PRODUCTION constructs are std::vectors, which do copy whatever we put in there...
  b) There's no sane way to define the grammar rules directly using C++ code without a
     plethora of temporaries (to copy)...
  c) And grammars are not very big anyway, so... (It just felt so bad, in general, e.g.
     for heap fragmentation and the churning of dozens of tiny temporary vectors...)
  I seem to have implemented COPYLESS_GRAMMAR for nothing! :)
  Move-construction should be the way to go...

match() should handle pure opcode rules after all

All the nullary operators (NIL, T, _SELF) would be so much happier.

The non-uniformity burden of that is not confined to the related OP handlers, though, unfortunately!...

Add _SELF for recursive rules

(Requires #5/2: parent/child tree, for searching...)

It's impossible to deduce implicitly, which exact enclosing rule (level) a standalone (nullary) _SELF should refer to.

So, there should be support in the grammar def. syntax to mark a rule for later reference.

Currently, there's _DEF/_USE (#15), but TBH, that feels more like a workaround than a solution.
Come up with a light syntax + OPCODE semantics to mark a rule anonymously, for referencing it thereafter implicitly (by _SELF).
- _SELF could then just mean _USE , and then "just" a solution for a (nullary) position marker would be needed.
- The problem with that is... actually the same horrible burden, though, as the direct address references of e.g. #16...

Also note that named references -- albeit inefficient (full syntax-tree search for every deref.! :-o ) -- allow named rule definitions after a referencing _USE, whereas the implicit marker + _SELF logic couldn't really do that.

Generalize parse() to run() and match() to eval()

run() has been added now, but it's really just a purely syntactic reminder:
- for it to really be useful, the usual concepts related to running process -- input, output, error management etc. -- i.e. a runtime context must also be developed.
match(), as eval(), similarly, should then also use the runtime context struct instead of all the parsing-related (in/out) parameters directly

Aggregate captures with _SAVE/_SAVE_AS

In "looping" rules (like ANY/MANY), or for SAVE_AS rules with the same name, repeated captures should append to a list, not just overwrite the last one...

IOW, match collection should be transactional.

Only... this could wreak havoc with recursion; just trying to think about backtracking from partial matches hurts already.

Clearing the results on backtracking would be like stack unwinding with exception handling: there should be a reference baseline "catch" point in the backtracking path, where any tentatively successful "sub-results" should be unrolled on failure.

And that point is just there implicitly right before every match: if that match failed, everything that had been collected during that (i.e. by successful child (sub-) matches) should be undone, as if the match hadn't happened at all.

Actually, with appends, just by remembering the last "watermark" positions in the results containers (e.g. vectors), unrolling on a failed match could work very efficiently.

PCRE2?

I can see no reason for that currently, but anyway, old notes moved here from the source, FTR:

-> https://stackoverflow.com/questions/32580066/using-pcre2-in-a-c-project

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

...
PCRE2_SPTR subject = (PCRE2_SPTR) "this"s.c_str();
PCRE2_SPTR pattern = (PCRE2_SPTR) "([a-z]+)|\\s"s.c_str();
...
int errorcode; PCRE2_SIZE erroroffset;
pcre2_code *re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED,
				PCRE2_ANCHORED | PCRE2_UTF, &errorcode,
				&erroroffset, NULL);

PCRE is UNICODE-aware: http://pcre.org/original/doc/html/pcreunicode.html

-> Header-only C++ wrapper: jpcre2!...

Really add a _SEQ opcode at construction time to implicit sequences!

Deferring it until match() is causing problems already (incl. blocking #5 etc.).
In any context, where a sequence rule must be identified, or operated on, this ambiguity/duality is a burden.

OTOH... This would cost things like _CAPTURE becoming more cumbersome, which can now be just tucked in front of any implicit sequence, and it just passes the remainder of a rule as-is to match() (which will imply a SEQ). Moving the conversion from run-time (match) to "static" grammar construction, CAPTURE itself -- and any other user-defined operators that create productions run-time!!!... -- would have to construct a proper SEQ out of their argument list (i.e. for prefix ops (which is: all...) it's "just the rest of the list"); just an iterator to a "next" in-place position of a const production would no longer work! A _SEQ opcode would have to be inserted!... :-/

Don't apply _IMPLIED_SEQ for 1-long PROD!

Jon Blow's algo for expressions & precedence

https://youtu.be/fIPO4G42wYE

Solve the static NAMED_PATTERNS (map) init

UPDATE: init() is now called implicitly when creating the first ATOM rule!
(And still also in Parser(), as a backup, to also support OP-only rules, and, well, just in case...)

Old memo from the source, just FTR:

It's currently done by the Parser ctor (because at least that has only one...), but RULE objects also need it: they can't even be created
before having that done... So, it's completely futile in Parser(...), as its syntax argument would get setup from a bunch of temporaries
created before the constructor's body had a chance to run...

Now, it could perhaps still be done in its first member init (e.g. to set some dummy initialized member), but this is too much cryptic
trickery just to fight C++ for little benefit. Just document that init() must be called for now...

Also, RULE has a whole bunch of ctors, so where to put it there?... :-/

(Note: the OP (handler) map looks like another shared object, but it's static only to make it a singleton, not for sharing. That one at least is only actually used by Parser!)

An invariant here is extensibility: currently the pattern names (IDs) are just strings, so that the building blocks (atoms) of grammars can be extended by users seamlessly, either by adding to or changing the NAMED_PATTERNS[] map, or by using string literals that can themselves be regex patterns (not just plain strings).

This really is shared data across all the RULE and Parser objects etc., and it can't be a const object either.

(I don't really want to even put its interface into Parser (or RULE), as it would feel less generic (like RULE::PATTERNS or Parser::PATTERNS), and that alone wouldn't solve the static init fiasco anwyay.)

Switch regex grammar (from POSIX2) to ECMA (for its PERL-style non-greedy ops.)

The default greediness (the regex always eating its way up to the end of the entire source string, if at all possible) is not always what users want, especially when matching long structured text (like code) consisting of short atoms.

And I haven't seen "greedy" mentioned even once at the extended regex reference page, while it's readily available with ECMA's, per-pattern.

And ECMA is the JS and C++ native regex grammar anyway.

Build: Toolset switching (at least) for Windows

...then merge the two-headed build script set into just one build.cmd!

RULE -> std::variant

I think I've earned it... :) Then finally (precompiled) regex objects could also be kept there (more) easily.

Fix: Apparently using the `len` output of a failed `match` somewhere

Or just not initializing it to 0 for non-consuming operators.
-> That was it, indeed, in the new _DEF op. (probably called from a _SEQ loop)!

Now in match():

#ifdef NDEBUG
		len = 0; //! This doesn't help (it's even bad, for false sense of sec., and
		         //! easily masking bugs with that benign-looking 0 in the output),
		         //! as 0 is a valid output, which should still be ignored -- as any
		         //! others! -- if match() returned false!
			 //! OK, disabling it in debug builds for better diagnostics, but
			 //! enabling in release mode for some cushioning!...
#else
		len = (unsigned)-666; // And indeed, this *did* crash! :) (e.g. #29)
#endif

std::span for sub-Productions, perhaps?

(For the const ones, of course.)

Give in and change the `ATOM` type name, and the others monopolized by the Win32 headers

ATOM -> Atom

And then, out of consistency:

RULE -> Rule
PRODUCTION -> Production
Take care of my internally used toy macros (like ERROR, CONST, OUT), too, also stampeded by the Windows headers!
- Well, not eliminated/renamed yet, so if defined before including parser.hpp, that's too bad, but at least #undefed at the end of the header...
- OK, fukit, I'll leave it at that; will revisit when there's another case of clashes.

Simple AST builder for static grammars?

(A limited, but probably still practical early feature for purely hierarchical grammars with no cross-refs. (#15) and recursion (#14).)

The "only thing" is just to find the proper tree node during a match...

The trick is that without "function calls" (macro substitution, or recursion) or other dynamic features in the grammar, there are no dynamic stack/context frames either, and the structure of a valid input has a one-to-one correspondence with the static structure of the grammar. I'm not sure if this is practical at all, or not; we'll see. (I suspect that simple cases could be handled just with match capturing, whereas ASTs would be useful for complex grammars -- exactly, where static grammars lose traction... So, I guess not very practical...)

Grammar construction itself can at least assign a unique ID (like their this ptr) to every rule (temporaries would take some IDs to their grave, no "other harm" done), so a matching rule could immediately know where to write (e.g. using its ID as a map index).

Maintaining the parent/child hierarchy of a rule tree is somewhat more involved, but still looks kinda trivial: whenever a rule is constructed from a PRODUCTION (i.e. not as an ATOM), after the copy/move of the aggregate:

any implicit operators must be added explicitly (no more runtime _SEQ deduction; -> #7), and
each "argument" rule of the prod. must be tagged with the currently (having-been) created rule (i.e. this) as parent.
(Does this simply mean every cell but the first (OP)?! As long as only prefix operators exist, further bending to the already existing LISP allusions...)

Then it's as simple as a successful match just always updating the corresponding node!
That's it. (As long as it's all-or-nothing (as usual with at least plain context-free syntaxes, AFAIK): any failed match anywhere would invalidate the whole thing anyway... The parser's recursive matcher does all the backtracking implicitly.)

NOTES:

Detecting whether the grammar is static or dynamic is probably easy.
Even for simple grammars, where the only "dynamism" is repetitions, like processing an INI file, this fails to be useful: multiple matches of the same rule already can't be meaningfully mapped to a tree.
This automatic parent/child tagging at construction feels shaky. For one, it immediately kills #19, as it enforces copying every repeated (reused) rule separately (to avoid the grammar becoming a DAG).

Simple language for bootstrapping grammar construction

(I mean built-in language.)

The native C++ syntax for grammar definition is pretty light, but it's still cumbersome compared to how light it could be in principle...

Umm, also, the C++ aggreg. init. syntax can't support defining recursive productions directly...

-> #14

_CALL_DIRECT, GOTO_DIRECT with auto-link

The arg. is a direct C++ RULE obj. address, which is auto-linked across object copy/move events via a _from chain (map) built by the ctors... (Copy ctors update the target for _from to this, and also create 1 new entry for the source obj, setting it also to this; move ctors just update the target for _from: links[_from] = this)

During execution, when a direct address arg is found, it's original value is looked up (as a key in the links map), and replaced with the final address. If the link is not found, it's an error (and the instruction is aborted (-> "kinda exceptions")).

And then, to prevent subsequent redundant lookups (the exact reason why _DIRECT adressing exist in addition to named references), the instruction opcodes themselves are replaced with their ..._RESOLVED counterparts...

Note: this precludes the entire syntax rule tree ("program") from being const! But then it at least opens the door for constructing rules run-time! Or... it means compilation (#31; as already mentioned in #15) -- which, then OTOH, would kinda prevent dynamically modifying/creating grammar then...

This suggest the usual technique of allocating bits of the opcode for encoding the addressing mode, to allow generalizing it to any instructions.

Add `_SAVE` & `_SAVE_AS` for unnamed/named captures

(Requires #2)

Again, just like that of proper regexes... (Nameless ones just collecting into a vector, named ones into a map. UPDATE: Well, not that simple, though, see task 2...)

This could be a fundamental building block of subsequent parsing features.

Note: it should be used "wisely", i.e. to not nest capture ops carelessly. I mean it's fine, but probably makes little sense. E.g. I could detect it and issue a warning I guess...

_SAVE
_SAVE, but actually putting the result into some "reliably indexable" slot! :) Just putting it into some map using this as the key loses the "proper monotonic ordering for pseudo-indexing by enumeration" property!
- DEF/USE etc. or dynamic rule changes etc. could also make this a mess -- but we only need to not crash. :) The rest can be simply declared UB.
- Well, it just has to be done by a compilation phase (#31), and index them strictly by the order of appearance (as regexes do) -- even if in a DEF rule (but multiple USEs is UB!).
convenience getter (e.g. operator[](size_t index)) for unnamed results (via Parser)
also for named ones then...
_SAVE_AS "name"

Rather than fiddling with auto-assigning contiguous indexes to multiple matching unnamed _SAVE rules, and having distinct (disjunct) capture-results collections for the two modes, just using the stringified this pointer for the unnamed ones seems better, with a thin iterator wrapper for enumerating them. They can even go into the very same results map, perhaps!

OK, so not in the same map...
- First of all, users may want to query the count of named/unnamed captures separately.
- Second, the numeric indexing of unnamed results is a tad more nuanced than just the simple ["name"] getters, and it can at least be a little simpler with numeric indexes and keeping the map ordered, and not having to slalom around the explicit name keys either.
- The issue of name clashes between auto-indexes and "given names" disappears with 0 effort this way.

TBD: Should matching an "EMPTY rule" succeed?

First of all, what exactly an "empty rule" is, is pretty much undefined:

there isn't really an empty rule like {}, as it gets autoconverted to NIL. Actually: {NIL}, even...
should it be "" instead?! And should that be the same as {""}, too?...
should NIL and "" mean the same, so therefore match the same way?
then how about "" always matching (unanchored), while NIL was supposed to be a forced failed match (as opposed to the forced successful match of T)?!...

Since every rule is boolean, it's an unavoidable question...

If false (NIL), then hitting an empty rule could cancel the matching of the entire syntax, rather than just being redundant, as one might expect, and as is with the _EMPTY pattern!...
OTOH, in LISP, () is false (NIL).
Currently it's inconsistent, too: using "" would create a NIL rule, but using _EMPTY would match anything successfully.
_Not sure which one is right... _EMPTY might actually be wanted for matching the empty string exactly, and not matching a non-empty one... So, it's ambiguous.
- Maybe the distinction is "explicit" empty (_EMPTY) vs "implicit" empty ("")?
- NO: The distinction is that match() by default works with substrings! So _EMPTY duly matches the empty substring of everything!
Anyway, there should be a straightforward and reliable way to choose to match the empty string only (in the sense that any non-empty string should not match). That should be achievable with anchoring (-> #21).
It also feels natural to have both a forced-false (NIL) and a forced-true (T) rule, so this really is a conundrum now... The empty Production kinda must be NIL (to mimic LISP), but then it would not match anything, which would be inconsistent with the empty Atom ("") matching everything...
- Maybe it's time to recognize and separate the result of matching something from the "result" -- the intrinsic value -- of the rule that's being matched! So, matching NIL (or false, for that matter) against "anything" can still be true! But should it?... Still don't know...

Clean the obsolete junk from test/parsing.cpp

Disambiguate USER_REGEX patterns and literal "/.../" strings!

What if I want to match "//" as a C++ comment?!... One could write a regex for that (e.g. //// should just work), but...

A simple thing like { REGEX(".*"), ...} is only half of the solution, as the ATOM type is still a string in both cases...

I should pr'ly just bite the bullet and split them, which would be required anyway for supporting precompiled regexes...

Grammar preprocessing/compilation phase (after construction, before parsing)

Well, this is a huge big inconvenient question: the fully interpreted -- but less efficient -- mode, using the rules as-is, is mutually exclusive with running a precompiled version of the rules. Some of it could be merged, but the compiled version basically disqualifies self-modifying, or even just dynamically creating rules.

(At least eval (#28) could easily use the compiler, but even that feels so sad...)

Milestone: Feature parity with parsertoy-php

Almost there -- _SAVE (and possibly other extensions) in the old tests are not yet added etc.

(Not that there are "features", BTW... This can only check the syntax of a source, nothing else yet.)

Backreferences...

Named captures like SAVE "name" are there already (#4), so all it takes is an operator like USE "name" -- which is also there already (#15)...

It's just that SAVE and DEF (which USE is the counterpart of, currently) use different namespaces.

How to properly address the addressing bit of the operations?... With addressing bits, literally? Or should it be part of the operand? "name" itself should tell where to look for the item, like URLs?
- OK, do it with the name; see below... (Leaving the addressing part for now, TBD after #31!)
It should leave the door open to further extensions. (Which really cries for the URL-like semantics...)
- But then, if dispatching on "name": how to make it efficient? Compilation (#31)?
  - Yes, leaving it to the name -- e.g. like "\name" or "$name", or both -- gives an instant way to make it work, and leaves the door open for a preprocessing phase to convert it to whatever it wants.
  - Well, not so fast with that "instant" solution... :-/ Actually, USE would need to know actually, if the named capture has actually matched already and it has not been backtracked, so it still has valid data! :-o So, it:
    - needs a flag that the slot has valid data ("" may not actually mean that! :-/ )
    - needs the "transactional" backtracking cleanup mechanism (which would be required for #10 anyway!)...
It should be unified with the named patterns namespace, too.
- OK, so probably a different opcode should be generated for each; seems enough to unify them for the "API", with name prefixes, and then leaving it to the preprocessor to adjust.

Fix the NDEBUG build

I think there are cases for copy elision at grammar construction

Yet another nail in the coffin of that sad, aborted COPYLESS_GRAMMAR experiment?

(See e.g. https://abseil.io/tips/117)

Also: that double-copy case ("PROD: directly from ATOM-RULE, plus PROD from it") in the main test set is still beyond me... :-/

Fix the confusing op name _MANY

To _ATLEAST1 (and/or _MIN1)
And add a real _MANY, too (as _MIN2)

Build: Auto regression testing on build

Fix the failing cases in test/main (well, not really failing, but unfinished investigation for the move semantics)
Add a batch builder + runner for all the tests
Call it from the main builder script

(Still no use of a proper make, as any change in the header would trigger a full rebuild of everything anyway.)

Build: Add a native Linux GCC build script, too

Add _END for anchoring, to reject extraneous input after a full match

Either as a rule or a pattern... So, a RULE or NAMED_PATTERNS["_END"]?...

Note: this is much like regex '$'. Seems like a RULE OP then, but it can also well be a "virtual" (non-consuming) named pattern like _EMPTY, or others...

It could also be the default. (Adjust the tests then!)