goodmami / pe
Fastest general-purpose parsing library for Python with a familiar API
License: MIT License
The specification's section about escape sequences says that the \xNN, \uNNNN, and \UNNNNNNNN escapes are UTF-8, UTF-16, and UTF-32, respectively, but that may not be accurate. For instance, in Python literals, '\xNN' is just a character whose value is given by two hexadecimal digits, and '\xNN\xNN' is thus two characters and not a sequence of two bytes.
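Python's own behavior illustrates the single-character interpretation:

```python
# In Python string literals, \xNN is a single character, not a UTF-8 byte.
s = '\xe9'       # one character, U+00E9 ('é')
assert len(s) == 1

t = '\xc3\xa9'   # two characters, NOT the UTF-8 encoding of 'é'
assert len(t) == 2 and t != s

# The actual UTF-8 encoding of 'é' is the byte sequence b'\xc3\xa9':
assert s.encode('utf-8') == b'\xc3\xa9'
```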
Going with each escape being a single character would make parsing grammars easier, at least. Tests are needed, whichever way this goes.
There are currently two "common" optimizations:

- ![...] . becomes a negated character class

Here are some more:

- [a] -> "a": a single-character non-negated character class becomes a single-character literal
- "a" "bc" "d" -> "abcd": a sequence of literals becomes a single literal
- [ab] / "m" / [yz] -> [abmyz]: a choice of non-negated classes or single-character literals becomes a single class with the union of the others
- (![abc] .) / (![cde] .) -> (![c] .): a choice of negated classes becomes a single class with the intersection of the others
- [cdeabzab] -> [a-ez]: duplicates in a class (negated or not) are removed; contiguous runs are replaced with ranges

Some of these may become available as other optimizations are performed, so a single pass may not fully optimize the grammar.
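As an illustration of the last rule, the deduplicate-and-range step could be sketched like this (a standalone sketch, not pe's actual implementation):

```python
def normalize_class(chars: str) -> str:
    """Deduplicate characters and collapse contiguous runs into ranges."""
    cps = sorted(set(map(ord, chars)))  # assumes a non-empty class
    runs = []
    start = prev = cps[0]
    for cp in cps[1:]:
        if cp != prev + 1:          # run ended; start a new one
            runs.append((start, prev))
            start = cp
        prev = cp
    runs.append((start, prev))
    parts = []
    for lo, hi in runs:
        if hi - lo >= 2:            # a-c or longer is worth a range
            parts.append(f'{chr(lo)}-{chr(hi)}')
        else:
            parts.extend(chr(c) for c in range(lo, hi + 1))
    return '[' + ''.join(parts) + ']'

assert normalize_class('cdeabzab') == '[a-ez]'
assert normalize_class('ab') == '[ab]'   # two-character runs stay literal
```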
A parser is used to read pe grammars, so an invalid grammar will raise a ParseError just as a valid grammar will for bad input:
>>> pe.match('"a"+', 'b', flags=pe.STRICT|pe.MEMOIZE) # valid grammar, bad input
Traceback (most recent call last):
...
pe._errors.ParseError:
line 0, character 0
b
^
ParseError: `(?=(?P<_1>a+))(?P=_1)`
>>> pe.match('+"a"', 'b', flags=pe.STRICT|pe.MEMOIZE) # invalid grammar
Traceback (most recent call last):
...
pe._errors.ParseError:
line 0, character 0
+"a"
^
ParseError: `[\ \ ]`, `\
`, `\
`, `\#`, `\#`, `[a-zA-Z_]`, `\&`, `!`, `\~`, `[a-zA-Z_]`, `\(`, `'`, `"`, `"`, `\[`, `\.`, `\.`, `\.`
The latter case should raise a GrammarError instead.
Mypy removed support for implicit optionals (e.g., x: int = None), so these need to be made explicit.
Something is going on with common optimizations on character classes:
>>> import pe
>>> print(pe.compile("[ab] / [bc]", flags=pe.COMMON).grammar)
Start <- [abbc] / [bc]
This occurs when the pe.COMMON optimization is applied to a choice of character classes. The issue becomes more pronounced when nonterminals are involved with the pe.INLINE optimization.
>>> grammar = """
... Start <- A / B / Foo
... A <- [Aa]
... B <- [Bb]
... Foo <- A / B / [Cc]
... """
>>> print(pe.compile(grammar, flags=pe.INLINE|pe.COMMON).grammar)
Start <- A / B / Foo
A <- [AaBbAaBbBbCcBbCc]
B <- [Bb]
Foo <- A / B / [Cc]
Without captures, a repetition operator has the same behavior with the packrat and machine parsers:
>>> import pe
>>> pe.match('"a"+', 'aaa', parser="packrat")
<Match object; span=(0, 3), match='aaa'>
>>> pe.match('"a"+', 'aaa', parser="machine")
<Match object; span=(0, 3), match='aaa'>
But when captures are used, they differ:
>>> pe.match('~"a"+', 'aaa', parser="packrat")
<Match object; span=(0, 3), match='aaa'>
>>> pe.match('~"a"+', 'aaa', parser="machine")
<Match object; span=(0, 1), match='a'>
Keep things current by upgrading to Cython 3.0. Migration should be straightforward, but there are some backward incompatibilities: https://cython.readthedocs.io/en/latest/src/userguide/migrating_to_cy30.html
Rules in pe are just expressions with actions and they can be embedded in other expressions, but often they are only named expressions in a grammar. The specification, however, anticipates rules that are not named expressions by placing their precedence between a sequence and a choice.
Assume we have an operator -> that assigns an action to the preceding sequence expression, where everything to the right of the operator until the end of the line is the action. Then we could have:
Value <- Object -> dict
/ Array -> list
/ String -> json_decode
/ Number -> float
/ TRUE -> constant(True)
/ FALSE -> constant(False)
/ NULL -> constant(None)
For security, the action should not simply be run through eval, so it might need to be parsed as well, e.g., as some subset of Python. Above, it takes some callable that will be called with the args at that point, but it could instead use the result of an expression:
Value <- Object -> dict(args)
/ Array -> list(args)
/ String -> json_decode(s)
/ Number -> float(s)
/ TRUE -> True
/ FALSE -> False
/ NULL -> None
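The "subset of Python" idea could be prototyped with the ast module; this is a hypothetical sketch (the ALLOWED table and parse_action are illustrative, not part of pe):

```python
import ast

# Only allow bare names from a whitelist and literal constants;
# anything else is rejected rather than passed to eval().
ALLOWED = {'dict': dict, 'list': list, 'float': float}

def parse_action(src: str):
    node = ast.parse(src, mode='eval').body
    if isinstance(node, ast.Name) and node.id in ALLOWED:
        return ALLOWED[node.id]
    if isinstance(node, ast.Constant):  # True, False, None, numbers, strings
        value = node.value
        return lambda *args: value
    raise ValueError(f'unsupported action expression: {src!r}')

assert parse_action('dict') is dict
assert parse_action('None')() is None
assert parse_action('True')() is True
```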
This could make it convenient for defining failures as well:
Value <- Object -> dict(args)
...
/ . -> fail(f"unexpected JSON value: {s}")
With the cythonized machine parser, a pattern such as ~("a" / "b") will emit the matched string for "b" but not for "a". This is because the 'capture' flag is set on the last instruction of the choice's subprogram rather than after the entire choice.
The documentation has some common patterns defined, but it would also be useful to have some defined in code. Until there's a feature to add external definitions (e.g., a parameter to pe.compile() or some kind of import syntax within a grammar), these patterns will only be useful for non-parsed grammars using operator primitives.
More inspiration:
Being able to ignore whitespace when parsing without explicitly putting it in the grammar can make the grammars much shorter and easier to write and comprehend (assuming one understands that the implicit whitespace exists). There are some challenges:

- the < operator (link)
- in a grammar like

  A <- "a" B
  B <- "b" "c"

  whitespace would be parsed after "a" and before B, then again before "b" in the B rule (that is, twice in a row)

The pos attribute on re.Match objects is not the start of a match but the position in the string where the matching started. With re.match() these are the same, but they can differ with re.search().
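The difference can be seen directly with the re module:

```python
import re

m = re.search('b', 'abc')
assert m.pos == 0       # where the matching attempt began
assert m.start() == 1   # where the match itself begins

m = re.match('a', 'abc')
assert m.pos == m.start() == 0  # identical for re.match()
```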
Since pe is modeled on re, this difference would be confusing. Using the same API would require Match.start() and Match.end() methods. This is fine, but currently there's no way to get the start and end of specific groups, so perhaps just changing pos to a start attribute (as end currently is) is sufficient. However, it's possible that a later version could allow getting the start and end of specific subexpressions (perhaps with the memo?), so using methods now would reduce the need to break the API later.
Aside from the often-advertised speed gains, ruff also allows for configuration via pyproject.toml
, unlike flake8.
The following snippet
A <- B C?
B <- "B"
C <- [a-z]+
produces the error
pe._errors.GrammarError: multiple repeat operators
upon compiling it.
It seems that pe thinks the ? and + operators immediately follow each other, ignoring the rule indirection. However, I think a grammar like the above is perfectly normal, so this seems like a bug to me.
The regex optimization is sometimes producing overly complex regexes. Consider:
>>> _ = pe.compile(r'"a"*', flags=pe.DEBUG|pe.OPTIMIZE)
## Grammar ##
Start <- "a"*
## Modified Grammar ##
Start <- `(?=(?P<_1>(?:a)*))(?P=_1)`
Here, the single character 'a' does not need a non-capturing group as the following are equivalent:
(?=(?P<_1>(?:a)*))(?P=_1)
(?=(?P<_1>a*))(?P=_1)
I doubt this has much effect on performance, however. Now consider this:
>>> _ = pe.compile(r'"a"* / "b"', flags=pe.DEBUG|pe.OPTIMIZE) # collapses to single regex
## Grammar ##
Start <- "a"* / "b"
## Modified Grammar ##
Start <- `(?=(?P<_2>(?=(?P<_1>(?:a)*))(?P=_1)|b))(?P=_2)`
>>> _ = pe.compile(r'"a"* / ~"b"', flags=pe.DEBUG|pe.OPTIMIZE) # alternative cannot be collapsed
## Grammar ##
Start <- "a"* / ~"b"
## Modified Grammar ##
Start <- `(?=(?P<_2>(?=(?P<_1>(?:a)*))(?P=_1)))(?P=_2)` / ~`b`
There's no issue (aside from the superfluous non-capturing group) with the first one, but in the second, where the alternative is blocked from collapsing into a single regex because of the semantic effect of the capture, the outer lookahead/backreference remains. That is, these are equivalent:
(?=(?P<_2>(?=(?P<_1>(?:a)*))(?P=_1)))(?P=_2)
(?=(?P<_1>(?:a)*))(?P=_1)
The example is from https://lists.csail.mit.edu/pipermail/peg/2021-October/000793.html. The parsing behavior is correct here, but the regex could be cleaner, which might help with debugging.
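The claimed equivalence is easy to check with Python's re module (the group names here are illustrative):

```python
import re

# Outer lookahead/backreference wrapped around the inner one:
complex_re = re.compile(r'(?=(?P<g2>(?=(?P<g1>(?:a)*))(?P=g1)))(?P=g2)')
# The simplified form:
simple_re = re.compile(r'(?=(?P<g1>(?:a)*))(?P=g1)')

for s in ('', 'a', 'aaa', 'aab', 'b'):
    # both always match (possibly the empty string) with the same span
    assert complex_re.match(s).group() == simple_re.match(s).group()
```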
Add an operator that computes the first character of each alternative of a Choice and then does an immediate jump when one is unambiguous.
E.g.,
A <- "abc" / [x-z] / E / .
E <- "e"
Could be a dispatch like:
'a' >> "abc"
'x' >> [x-z]
'y' >> [x-z]
'z' >> [x-z]
'e' >> E
else >> .
Some situations could get tricky, so this should back off to a regular ordered choice if it's not clear:
A <- "abc" / "d" / "aaa"
becomes
'd' >> "d"
'a' >> "abc" / "aaa"
This is basically computing "first sets". The Dispatch() operator function (or whatever it's called) would just be an optimization and wouldn't necessarily get PEG notation associated with it.
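A minimal sketch of the first-set computation (build_dispatch is a hypothetical name; pe has no such function):

```python
from collections import defaultdict

def build_dispatch(alternatives):
    """Map each possible first character to its alternatives, in order."""
    table = defaultdict(list)
    for firsts, expr in alternatives:
        for ch in firsts:
            table[ch].append(expr)
    return dict(table)

table = build_dispatch([('a', '"abc"'), ('d', '"d"'), ('a', '"aaa"')])
assert table['d'] == ['"d"']             # unambiguous: direct jump
assert table['a'] == ['"abc"', '"aaa"']  # ambiguous: fall back to ordered choice
```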
pe does not work with overt syntax trees, whether concrete (CST) or abstract (AST), but they can be useful for debugging. I think it is straightforward to implement both:
For CSTs, a recursive function similar to the optimizers would traverse the parse graph and make each expression a rule with an action that generates a CST node. This modified grammar is separate from the one normally used for parsing and is used only for debugging.
For ASTs, node-generating rules are inserted only where it makes sense. One method is to replace existing actions with the AST node action, since nodes with actions are generally semantically meaningful. It would also be nice to accept a collection of nonterminal names and those specifically get node actions, which allows the user to create arbitrary ASTs and can also show what the normal semantic actions below those nodes are returning. Semantic actions above AST nodes would either have to be disabled, or the AST nodes need to somehow pass up their values so the actions don't break on unexpected input.
The machine parser, in pure Python mode, is currently about 10% slower than the packrat parser on the JSON benchmark. In PyPy, however, the machine parser is roughly twice as fast, and the Cython version is roughly the same speed as the PyPy version. This means the packrat parser is only more performant in pure Python without PyPy, so maybe it's not worth the maintenance burden of keeping both around.
The caveat is that the machine parser does not have memoization yet (see #15), so it will not do well on grammars where that makes a difference. I think the packrat parser can go once memoization is added to the machine parser.
For a serious user of pe in some other package, the startup time to parse and optimize a grammar can become significant, particularly for larger grammars. It would therefore be helpful to be able to save and restore the parser object. This could probably be done with pickle, but it might be better to print out a Python module that constructs the parser directly (which is much faster than parsing and optimizing a new one). This could also make the parser object more inspectable.
When debugging multiline strings, the newlines mess up the output. Here's the most minimal non-trivial grammar I could come up with.
import pe
from pe.actions import Capture

multiline_parser = pe.compile(
    r'''
    File <- Spacing Integer (Spacing Integer)* Spacing EOF
    Integer <- "-"? ("0" / [1-9] [0-9]*)
    Spacing <- [\t\n\f\r ]*
    EOF <- !.
    ''',
    actions={
        'Integer': Capture(int),
    },
    flags=pe.DEBUG,
)

test = """
1
2
"""
multiline_parser.match(test).groups()
The first few lines of the output look like:
## Grammar ##
File <- Spacing Integer (Spacing Integer)* Spacing EOF
Integer <- "-"? ("0" / [1-9] [0-9]*) -> Capture(<class 'int'>)
Spacing <- [\t\n\f\r ]*
EOF <- !.
1
2
| Spacing Integer (Spacing Integer)* Spacing EOF
1
2
| Spacing
1
2
| [\t\n\f\r ]*
1
2
| [\t\n\f\r ]
1
2
| [\t\n\f\r ]
With the change suggested in #30 this would change to:
## Grammar ##
File <- Spacing Integer (Spacing Integer)* Spacing EOF
Integer <- "-"? ("0" / [1-9] [0-9]*) -> Capture(<class 'int'>)
Spacing <- [\t\n\f\r ]*
EOF <- !.
\n1\n2\n | Spacing Integer (Spacing Integer)* Spacing EOF
\n1\n2\n | Spacing
\n1\n2\n | [\t\n\f\r ]*
\n1\n2\n | [\t\n\f\r ]
1\n2\n | [\t\n\f\r ]
1\n2\n | Integer
1\n2\n | "-"? ("0" / [1-9] [0-9]*) -> Capture(<class 'int'>)
1\n2\n | "-"? ("0" / [1-9] [0-9]*)
1\n2\n | "-"?
1\n2\n | "-"
1\n2\n | "0" / [1-9] [0-9]*
1\n2\n | "0"
1\n2\n | [1-9] [0-9]*
1\n2\n | [1-9]
\n2\n | [0-9]*
\n2\n | [0-9]
\n2\n | (Spacing Integer)*
An action that pairs off emitted values before applying a function would be useful. Consider something that constructs a simple dictionary:
Dict <- '{' Members? '}'
Members <- Member (',' Member)*
Member <- Key ':' Value
With some actions like this:
actions = {
    'Dict': Pack(dict),
    'Member': Pack(tuple),
}
Considering that Pack(tuple) takes 2 function calls (Pack, then tuple) for every member, this takes 2N + 1 function calls to construct the list. Consider instead this action:
actions = {
    'Dict': Pair(dict),
}
where Pair(dict) does something like dict(zip(args[::2], args[1::2])). That's just 3 calls (Pair, zip, then dict), plus the 2 __getitem__ calls on args. Here I'm only counting calls from user code, as presumably CPython is more efficient at constructing the tuples in zip. Preliminary tests show only slight gains in speed for small lists, but the gains seem to grow for longer lists. A bigger potential benefit is the reduction in recursive calls: in one example, the recursion limit (e.g., the depth of nested structures) went from 70 to 100 from this change alone.
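The proposed Pair could look something like this (a sketch; Pair is not an existing pe action):

```python
def Pair(func):
    """Pair off alternating key/value args before applying func."""
    def call(*args):
        # args[::2] are keys, args[1::2] are values
        return func(zip(args[::2], args[1::2]))
    return call

action = Pair(dict)
assert action('a', 1, 'b', 2) == {'a': 1, 'b': 2}
```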
The goal, particularly for the machine parser, is to use a faster platform for parsing than standard Python, while still being a Python module. It's not clear what is the best option, in terms of speed and efficiency, startup time, availability/portability, maintainability, licensing, etc. Here are some options:
Sometimes character classes fail unexpectedly. I noticed this when a double-quoted string unexpectedly failed on a pattern like "a1" but not on "1a".
It turns out this is because the CharacterClass scanner for the machine parsers checks each range with a loop index, but it doesn't reset this index for the next character.
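A minimal illustration of that kind of bug (an illustrative sketch, not pe's actual scanner code): with the ranges of [0-9a-z], the shared index makes "a1" fail while "1a" matches.

```python
RANGES = [('0', '9'), ('a', 'z')]

def buggy_scan(s):
    i = 0  # BUG: the range index is shared across characters
    for ch in s:
        while i < len(RANGES) and not (RANGES[i][0] <= ch <= RANGES[i][1]):
            i += 1
        if i >= len(RANGES):
            return False
    return True

assert buggy_scan('1a')      # '1' matches range 0, then 'a' matches range 1
assert not buggy_scan('a1')  # matching 'a' advanced i past the digit range
# The fix is to reset i to 0 for each character.
```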
Take a Grammar object and print the PEG notation for it.
This would help with debugging.
The machine parser is currently working well, except that it doesn't do memoization so it is susceptible to bad performance on some kinds of grammars. Memoization needs to be added.
This sounds like it might be a bad idea, but it could be very useful for an expression or an action to have direct access to the memo for storing or retrieving arbitrary information. The purpose is for expressions that accumulate information across different branches. It's something like persistent bound values.
I can imagine some expression like foo@Bar (syntax undecided) that stores the values of Bar in the memo at key foo, but that might not be enough. Perhaps the Action class could take the memo as another argument, and then some Memo action could do the same as the expression. Users could create subclasses of Action or Memo to do more than just store the value.
This would need some restrictions, such as only allowing string keys so they don't overwrite actual memo data.
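One possible shape for such a Memo action (names and signature are hypothetical; pe's Action does not currently receive the memo):

```python
class Memo:
    """Store emitted values in a shared store under a string key."""
    def __init__(self, key):
        if not isinstance(key, str):
            # restrict keys so real memo entries can't be clobbered
            raise TypeError('memo keys must be strings')
        self.key = key

    def __call__(self, store, *args):
        store.setdefault(self.key, []).append(args)
        return args

store = {}
Memo('foo')(store, 1, 2)
assert store == {'foo': [(1, 2)]}
```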
Pe lists the supported Python versions as 3.6--3.9, but the currently active versions are 3.7--3.11. These should be aligned.
So far I have not seen a need for left recursion, particularly because pe emits values in a flat list. But some people seem to really care about it, so this issue is for tracking its implementation.
Using the memo to escape the left-recursive loop looks promising, and this method was chosen for Python's new pegen parser. Guido wrote about it here: https://medium.com/@gvanrossum_83706/left-recursive-peg-grammars-65dab3c580e1
This is a feature where an object is created as parsing begins and it can be used in actions during parsing.
Imagine parsing something like a large TOML file, where the top-level object is not created until the full file has been read in. You won't know if there's a table or key collision until the very end. Instead, if we could create an object that assembles the document as it parses, we could check as we go. E.g.:
from pe.actions import Sidecar

...

class TOMLSidecar:
    def __init__(self):
        self.doc = {}

    def register_table(self, name: str) -> None:
        if name in self.doc:  # simplified for convenience
            raise TOMLDecodeError(f"table already defined: {name}")
        self.doc[name] = {}

    ...

TOML_GRAMMAR = """
...
Table <- "[" WS Key WS "]"
...
"""

toml_parser = pe.compile(
    TOML_GRAMMAR,
    actions={
        "Table": Sidecar.call("register_table"),
        ...
    },
    sidecar_factory=TOMLSidecar,
)

The Sidecar.call("register_table") does something like this at parse time:

getattr(self.sidecar, "register_table")(*args, **kwargs)
This would probably subsume the use case of #10.
Some notes:

- Sidecar.call(...) would not be an Action object even though it is used in actions={...}, so maybe it doesn't belong in pe.actions
Values are currently computed as soon as an expression succeeds, even if that value will later be discarded. Since a parse is inherently a tree, a stack should be able to compute the full derivation. Thus we could have a value stack with (start, end, function) triples. This means that a new tuple is created for each expression that emits values, but that is already the case, as the emitted values args is an immutable tuple and not a list. The difference then is that the function is not computed until it is needed. Since there are now callables for Capture and Bind, all features can be represented this way (although in some tests, Capture and Bind were a bit slower than the ~ and : operators).
The default behavior of mypy has changed.

def foo(x: str = None) -> str:
    ...

should now be

from typing import Optional

def foo(x: Optional[str] = None) -> str:
    ...
It would be very handy if parsers generated by pe were able to produce proper error messages when they fail to parse a given input. This includes pointing out the longest possible sequence of the input that the parser was able to match and then telling the user what the parser would have expected in order to continue matching from there.
E.g., when using a grammar such as

match < "Hello" "world"

matching an input of Hello should say something like At line 1, col 5: Expected "world", or if there were multiple alternatives:

match < "Hello" ("world" / "universe")

At line 1, col 5: Expected one of: "world", "universe"
For character classes, one can simply print their definition as the expected value, and for rule references, the name of the rule.
See e.g. the error messages that ANTLR produces.
The specification is very detailed about how the - character is interpreted inside a character class (link), but it isn't entirely accurate or complete.
It says that a - at the end of a character class is a literal character, but this does not hold if the class has two characters (e.g., [+-]). In that case, it is in fact a range ending at the ] character, which is treated literally in this context. This causes problems in that the character class is not terminated where the user expects it to be.
The trouble is that this is how the PEG grammar is defined, so changing it would be a non-monotonic extension (current extensions only add to the syntax, not take away). This might be a good case for a warn action. If the user wants to silence the warning, they can escape the - or the ], depending on what they intended.
A simple but useful extension is bounded repetition. The main issue is the syntax to use for it.
One obvious choice is {count} or {start,end} as in regex. A counterpoint to this is that {...} is used for semantic actions by some parsing frameworks, but I don't anticipate pe using those.
The behavior would follow regular expressions. Example:
>>> pe.match(".{1,3}", "abcdefg")
<Match object; span=(0, 3), match='abc'>
>>> pe.match(".{2}", "abcdefg")
<Match object; span=(0, 2), match='ab'>
>>> pe.match(".{2,}", "abcdefg")
<Match object; span=(0, 7), match='abcdefg'>
>>> pe.match(".{,2}", "abcdefg")
<Match object; span=(0, 2), match='ab'>
In the {start,end} case, either operand can be omitted to indicate an open range, from 0 to infinity. Thus, .{,} is equivalent to .*, and .{1,} is equivalent to .+.
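The proposed semantics mirror Python's re quantifiers:

```python
import re

# Bounded repetition as regular expressions already implement it:
assert re.match('.{1,3}', 'abcdefg').group() == 'abc'
assert re.match('.{2}', 'abcdefg').group() == 'ab'
assert re.match('.{2,}', 'abcdefg').group() == 'abcdefg'
assert re.match('.{,2}', 'abcdefg').group() == 'ab'
```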
In publishing the last release (link to run), the following warning came up:

Build source distribution
The python-version input is not set. The version of Python currently in PATH will be used.

It seems the workflow needs to be updated.