goodmami / pe
Fastest general-purpose parsing library for Python with a familiar API
License: MIT License
The specification's section about escape sequences says that the \xNN, \uNNNN, and \UNNNNNNNN escapes are UTF-8, UTF-16, and UTF-32, respectively, but that may not be accurate. For instance, in Python literals, '\xNN' is just a character whose value is given by two hexadecimal digits, and '\xNN\xNN' is thus two characters and not a sequence of two bytes.
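Python's own behavior illustrates the single-character interpretation:

```python
# In Python string literals, \xNN is a single character, not a UTF-8 byte.
s = '\xe9'       # one character, U+00E9 ('é')
assert len(s) == 1

t = '\xc3\xa9'   # two characters, NOT the UTF-8 encoding of 'é'
assert len(t) == 2 and t != s

# The actual UTF-8 encoding of 'é' is the byte sequence b'\xc3\xa9':
assert s.encode('utf-8') == b'\xc3\xa9'
```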
Going with each escape being a single character would make parsing grammars easier, at least. Tests are needed, whichever way this goes.
There are currently two "common" optimizations:

- ![...] . becomes a negated character class

Here are some more:

- [a] -> "a": a single-character non-negated character class becomes a single-character literal
- "a" "bc" "d" -> "abcd": a sequence of literals becomes a single literal
- [ab] / "m" / [yz] -> [abmyz]: a choice of non-negated classes or single-character literals becomes a single class with the union of the others
- (![abc] .) / (![cde] .) -> (![c] .): a choice of negated classes becomes a single class with the intersection of the others
- [cdeabzab] -> [a-ez]: duplicates in a class (negated or not) are removed; contiguous runs are replaced with ranges

Some of these may become available as other optimizations are performed, so a single pass may not fully optimize the grammar.
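As an illustration of the last rule, the deduplicate-and-range step could be sketched like this (a standalone sketch, not pe's actual implementation):

```python
def normalize_class(chars: str) -> str:
    """Deduplicate characters and collapse contiguous runs into ranges."""
    cps = sorted(set(map(ord, chars)))  # assumes a non-empty class
    runs = []
    start = prev = cps[0]
    for cp in cps[1:]:
        if cp != prev + 1:          # run ended; start a new one
            runs.append((start, prev))
            start = cp
        prev = cp
    runs.append((start, prev))
    parts = []
    for lo, hi in runs:
        if hi - lo >= 2:            # a-c or longer is worth a range
            parts.append(f'{chr(lo)}-{chr(hi)}')
        else:
            parts.extend(chr(c) for c in range(lo, hi + 1))
    return '[' + ''.join(parts) + ']'

assert normalize_class('cdeabzab') == '[a-ez]'
assert normalize_class('ab') == '[ab]'   # two-character runs stay literal
```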
A parser is used to read pe grammars, so an invalid grammar will raise a ParseError just as a valid grammar will for bad input:
>>> pe.match('"a"+', 'b', flags=pe.STRICT|pe.MEMOIZE) # valid grammar, bad input
Traceback (most recent call last):
...
pe._errors.ParseError:
line 0, character 0
b
^
ParseError: `(?=(?P<_1>a+))(?P=_1)`
>>> pe.match('+"a"', 'b', flags=pe.STRICT|pe.MEMOIZE) # invalid grammar
Traceback (most recent call last):
...
pe._errors.ParseError:
line 0, character 0
+"a"
^
ParseError: `[\ \ ]`, `\
`, `\
`, `\#`, `\#`, `[a-zA-Z_]`, `\&`, `!`, `\~`, `[a-zA-Z_]`, `\(`, `'`, `"`, `"`, `\[`, `\.`, `\.`, `\.`
The latter case should raise a GrammarError instead.
Mypy removed support for implicit optionals (e.g., x: int = None), so these need to be made explicit.
Something is going on with common optimizations on character classes:
>>> import pe
>>> print(pe.compile("[ab] / [bc]", flags=pe.COMMON).grammar)
Start <- [abbc] / [bc]
This occurs when the pe.COMMON optimization is applied to a choice of character classes. The issue becomes more pronounced when nonterminals are involved with the pe.INLINE optimization.
>>> grammar = """
... Start <- A / B / Foo
... A <- [Aa]
... B <- [Bb]
... Foo <- A / B / [Cc]
... """
>>> print(pe.compile(grammar, flags=pe.INLINE|pe.COMMON).grammar)
Start <- A / B / Foo
A <- [AaBbAaBbBbCcBbCc]
B <- [Bb]
Foo <- A / B / [Cc]
Without captures, a repetition operator has the same behavior with the packrat and machine parsers:
>>> import pe
>>> pe.match('"a"+', 'aaa', parser="packrat")
<Match object; span=(0, 3), match='aaa'>
>>> pe.match('"a"+', 'aaa', parser="machine")
<Match object; span=(0, 3), match='aaa'>
But when captures are used, they differ:
>>> pe.match('~"a"+', 'aaa', parser="packrat")
<Match object; span=(0, 3), match='aaa'>
>>> pe.match('~"a"+', 'aaa', parser="machine")
<Match object; span=(0, 1), match='a'>
Keep things current by upgrading to Cython 3.0. Migration should be straightforward, but there are some backward incompatibilities: https://cython.readthedocs.io/en/latest/src/userguide/migrating_to_cy30.html
Rules in pe are just expressions with actions and they can be embedded in other expressions, but often they are only named expressions in a grammar. The specification, however, anticipates rules that are not named expressions by placing their precedence between a sequence and a choice.
Assume we have an operator -> that assigns an action to the preceding sequence expression, where everything to the right of the operator until the end of the line is the action. Then we could have:
Value <- Object -> dict
/ Array -> list
/ String -> json_decode
/ Number -> float
/ TRUE -> constant(True)
/ FALSE -> constant(False)
/ NULL -> constant(None)
For security, the action should not simply be run through eval, so it might need to be parsed as well, e.g., as some subset of Python. Above, it takes some callable that will be called with the args at that point, but it could instead use the result of an expression:
Value <- Object -> dict(args)
/ Array -> list(args)
/ String -> json_decode(s)
/ Number -> float(s)
/ TRUE -> True
/ FALSE -> False
/ NULL -> None
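The "subset of Python" idea could be prototyped with the ast module; this is a hypothetical sketch (the ALLOWED table and parse_action are illustrative, not part of pe):

```python
import ast

# Only allow bare names from a whitelist and literal constants;
# anything else is rejected rather than passed to eval().
ALLOWED = {'dict': dict, 'list': list, 'float': float}

def parse_action(src: str):
    node = ast.parse(src, mode='eval').body
    if isinstance(node, ast.Name) and node.id in ALLOWED:
        return ALLOWED[node.id]
    if isinstance(node, ast.Constant):  # True, False, None, numbers, strings
        value = node.value
        return lambda *args: value
    raise ValueError(f'unsupported action expression: {src!r}')

assert parse_action('dict') is dict
assert parse_action('None')() is None
assert parse_action('True')() is True
```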
This could make it convenient for defining failures as well:
Value <- Object -> dict(args)
...
/ . -> fail(f"unexpected JSON value: {s}")
With the cythonized machine parser, a pattern such as ~("a" / "b") will emit the matched string for "b" but not for "a". This is because the 'capture' flag is set on the last instruction of the choice's subprogram rather than after the entire choice.
The documentation has some common patterns defined, but it would also be useful to have some defined in code. Until there's a feature to add external definitions (e.g., a parameter to pe.compile() or some kind of import syntax within a grammar), these patterns will only be useful for non-parsed grammars using operator primitives.
More inspiration:
Being able to ignore whitespace when parsing without explicitly putting it in the grammar can make the grammars much shorter and easier to write and comprehend (assuming one understands that the implicit whitespace exists). There are some challenges:

- the < operator (link)
- in a grammar like

  A <- "a" B
  B <- "b" "c"

  whitespace would be parsed after "a" and before B, then again before "b" in the B rule (that is, twice in a row)

The pos attribute on re.Match objects is not the start of a match but the position in the string where the matching started. With re.match() these are the same, but they can differ with re.search().
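The difference can be seen directly with the re module:

```python
import re

m = re.search('b', 'abc')
assert m.pos == 0       # where the matching attempt began
assert m.start() == 1   # where the match itself begins

m = re.match('a', 'abc')
assert m.pos == m.start() == 0  # identical for re.match()
```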
Since pe is modeled on re, this difference would be confusing. Using the same API would require Match.start() and Match.end() methods. This is fine, but currently there's no way to get the start and end of specific groups, so perhaps just changing pos to a start attribute (as end currently is) is sufficient. However, it's possible that a later version could allow getting the start and end of specific subexpressions (perhaps with the memo?), so using methods now would reduce the need to break the API later.
Aside from the often-advertised speed gains, ruff also allows for configuration via pyproject.toml
, unlike flake8.
The following snippet
A <- B C?
B <- "B"
C <- [a-z]+
produces the error
pe._errors.GrammarError: multiple repeat operators
upon compiling it.
It seems that pe thinks the ? and + operators immediately follow each other, ignoring the rule indirection. However, I think a grammar like the above is perfectly normal, so this seems like a bug to me.
The regex optimization is sometimes producing overly complex regexes. Consider:
>>> _ = pe.compile(r'"a"*', flags=pe.DEBUG|pe.OPTIMIZE)
## Grammar ##
Start <- "a"*
## Modified Grammar ##
Start <- `(?=(?P<_1>(?:a)*))(?P=_1)`
Here, the single character 'a' does not need a non-capturing group as the following are equivalent:
(?=(?P<_1>(?:a)*))(?P=_1)
(?=(?P<_1>a*))(?P=_1)
I doubt this has much effect on performance, however. Now consider this:
>>> _ = pe.compile(r'"a"* / "b"', flags=pe.DEBUG|pe.OPTIMIZE) # collapses to single regex
## Grammar ##
Start <- "a"* / "b"
## Modified Grammar ##
Start <- `(?=(?P<_2>(?=(?P<_1>(?:a)*))(?P=_1)|b))(?P=_2)`
>>> _ = pe.compile(r'"a"* / ~"b"', flags=pe.DEBUG|pe.OPTIMIZE) # alternative cannot be collapsed
## Grammar ##
Start <- "a"* / ~"b"
## Modified Grammar ##
Start <- `(?=(?P<_2>(?=(?P<_1>(?:a)*))(?P=_1)))(?P=_2)` / ~`b`
There's no issue (aside from the superfluous non-capturing group) with the first one, but in the second, where the alternative is blocked from collapsing into a single regex because of the semantic effect of the capture, the outer lookahead/backreference remains. That is, these are equivalent:
(?=(?P<_2>(?=(?P<_1>(?:a)*))(?P=_1)))(?P=_2)
(?=(?P<_1>(?:a)*))(?P=_1)
The example is from https://lists.csail.mit.edu/pipermail/peg/2021-October/000793.html. The parsing behavior is correct here, but the regex could be cleaner, which might help with debugging.
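The claimed equivalence is easy to check with Python's re module (the group names here are illustrative):

```python
import re

# Outer lookahead/backreference wrapped around the inner one:
complex_re = re.compile(r'(?=(?P<g2>(?=(?P<g1>(?:a)*))(?P=g1)))(?P=g2)')
# The simplified form:
simple_re = re.compile(r'(?=(?P<g1>(?:a)*))(?P=g1)')

for s in ('', 'a', 'aaa', 'aab', 'b'):
    # both always match (possibly the empty string) with the same span
    assert complex_re.match(s).group() == simple_re.match(s).group()
```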
Add an operator that computes the first character of each alternative of a Choice and then does an immediate jump when one is unambiguous.
E.g.,
A <- "abc" / [x-z] / E / .
E <- "e"
Could be a dispatch like:
'a' >> "abc"
'x' >> [x-z]
'y' >> [x-z]
'z' >> [x-z]
'e' >> E
else >> .
Some situations could get tricky, so this should back off to a regular ordered choice if it's not clear:
A <- "abc" / "d" / "aaa"
becomes
'd' >> "d"
'a' >> "abc" / "aaa"
This is basically computing "first sets". The Dispatch() operator function (or whatever it's called) would just be an optimization and wouldn't necessarily get PEG notation associated with it.
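A minimal sketch of the first-set computation (build_dispatch is a hypothetical name; pe has no such function):

```python
from collections import defaultdict

def build_dispatch(alternatives):
    """Map each possible first character to its alternatives, in order."""
    table = defaultdict(list)
    for firsts, expr in alternatives:
        for ch in firsts:
            table[ch].append(expr)
    return dict(table)

table = build_dispatch([('a', '"abc"'), ('d', '"d"'), ('a', '"aaa"')])
assert table['d'] == ['"d"']             # unambiguous: direct jump
assert table['a'] == ['"abc"', '"aaa"']  # ambiguous: fall back to ordered choice
```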
pe does not work with overt syntax trees, whether concrete (CST) or abstract (AST), but they can be useful for debugging. I think it is straightforward to implement both:
For CSTs, a recursive function similar to the optimizers would traverse the parse graph and make each expression a rule with an action that generates a CST node. This modified grammar is separate from the one normally used for parsing and is used only for debugging.
For ASTs, node-generating rules are inserted only where it makes sense. One method is to replace existing actions with the AST node action, since nodes with actions are generally semantically meaningful. It would also be nice to accept a collection of nonterminal names and those specifically get node actions, which allows the user to create arbitrary ASTs and can also show what the normal semantic actions below those nodes are returning. Semantic actions above AST nodes would either have to be disabled, or the AST nodes need to somehow pass up their values so the actions don't break on unexpected input.
The machine parser, in pure Python mode, is currently about 10% slower than the packrat parser on the JSON benchmark. In PyPy, however, the machine parser is roughly twice as fast, and the Cython version is roughly the same speed as the PyPy version. This means the packrat parser is only more performant in pure Python without PyPy, so maybe it's not worth the maintenance burden of keeping both around.
The caveat is that the machine parser does not have memoization yet (see #15), so it will not do well on grammars where that makes a difference. I think the packrat parser can go once memoization is added to the machine parser.
For a serious user of pe in some other package, the startup time to parse and optimize a grammar can become significant, particularly for larger grammars. It would therefore be helpful to be able to save and restore the parser object. This could probably be done with pickle, but it might be better to print out a Python module that constructs the parser directly (which is much faster than parsing and optimizing a new one). This could also make the parser object more inspectable.
When debugging multiline strings, the newlines mess up the output. Here's the most minimal non-trivial grammar I could come up with.
import pe
from pe.actions import Capture

multiline_parser = pe.compile(
    r'''
    File <- Spacing Integer (Spacing Integer)* Spacing EOF
    Integer <- "-"? ("0" / [1-9] [0-9]*)
    Spacing <- [\t\n\f\r ]*
    EOF <- !.
    ''',
    actions={
        'Integer': Capture(int),
    },
    flags=pe.DEBUG,
)

test = """
1
2
"""
multiline_parser.match(test).groups()
The first few lines of the output look like:
## Grammar ##
File <- Spacing Integer (Spacing Integer)* Spacing EOF
Integer <- "-"? ("0" / [1-9] [0-9]*) -> Capture(<class 'int'>)
Spacing <- [\t\n\f\r ]*
EOF <- !.
1
2
| Spacing Integer (Spacing Integer)* Spacing EOF
1
2
| Spacing
1
2
| [\t\n\f\r ]*
1
2
| [\t\n\f\r ]
1
2
| [\t\n\f\r ]
With the change suggested in #30 this would change to:
## Grammar ##
File <- Spacing Integer (Spacing Integer)* Spacing EOF
Integer <- "-"? ("0" / [1-9] [0-9]*) -> Capture(<class 'int'>)
Spacing <- [\t\n\f\r ]*
EOF <- !.
\n1\n2\n | Spacing Integer (Spacing Integer)* Spacing EOF
\n1\n2\n | Spacing
\n1\n2\n | [\t\n\f\r ]*
\n1\n2\n | [\t\n\f\r ]
1\n2\n | [\t\n\f\r ]
1\n2\n | Integer
1\n2\n | "-"? ("0" / [1-9] [0-9]*) -> Capture(<class 'int'>)
1\n2\n | "-"? ("0" / [1-9] [0-9]*)
1\n2\n | "-"?
1\n2\n | "-"
1\n2\n | "0" / [1-9] [0-9]*
1\n2\n | "0"
1\n2\n | [1-9] [0-9]*
1\n2\n | [1-9]
\n2\n | [0-9]*
\n2\n | [0-9]
\n2\n | (Spacing Integer)*
An action that pairs off emitted values before applying a function would be useful. Consider something that constructs a simple dictionary:
Dict <- '{' Members? '}'
Members <- Member (',' Member)*
Member <- Key ':' Value
With some actions like this:
actions = {
    'Dict': Pack(dict),
    'Member': Pack(tuple),
}
Considering that Pack(tuple) takes 2 function calls (Pack, then tuple) for every member, this takes 2N + 1 function calls to construct the list. Consider instead this action:
actions = {
    'Dict': Pair(dict),
}
where Pair(dict) does something like dict(zip(args[::2], args[1::2])). That's just 3 calls (Pair, zip, then dict), plus the 2 __getitem__ calls on args. Here I'm only counting calls from user code, as presumably CPython is more efficient at constructing the tuples in zip. Preliminary tests show only slight gains in speed for small lists, but the gains seem to grow for longer lists. A bigger potential benefit is the reduction in recursive calls: in one example, the recursion limit (e.g., the depth of nested structures) went from 70 to 100 from this change alone.
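The proposed Pair could look something like this (a sketch; Pair is not an existing pe action):

```python
def Pair(func):
    """Pair off alternating key/value args before applying func."""
    def call(*args):
        # args[::2] are keys, args[1::2] are values
        return func(zip(args[::2], args[1::2]))
    return call

action = Pair(dict)
assert action('a', 1, 'b', 2) == {'a': 1, 'b': 2}
```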
The goal, particularly for the machine parser, is to use a faster platform for parsing than standard Python, while still being a Python module. It's not clear what is the best option, in terms of speed and efficiency, startup time, availability/portability, maintainability, licensing, etc. Here are some options:
Sometimes character classes fail unexpectedly. I noticed this when a double-quoted string unexpectedly failed on a pattern like "a1" but not on "1a".
It turns out this is because the CharacterClass scanner for the machine parsers checks each range with a loop index, but it doesn't reset this index for the next character.
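A minimal illustration of that kind of bug (an illustrative sketch, not pe's actual scanner code): with the ranges of [0-9a-z], the shared index makes "a1" fail while "1a" matches.

```python
RANGES = [('0', '9'), ('a', 'z')]

def buggy_scan(s):
    i = 0  # BUG: the range index is shared across characters
    for ch in s:
        while i < len(RANGES) and not (RANGES[i][0] <= ch <= RANGES[i][1]):
            i += 1
        if i >= len(RANGES):
            return False
    return True

assert buggy_scan('1a')      # '1' matches range 0, then 'a' matches range 1
assert not buggy_scan('a1')  # matching 'a' advanced i past the digit range
# The fix is to reset i to 0 for each character.
```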
Take a Grammar object and print the PEG notation for it.
This would help with debugging.
The machine parser is currently working well, except that it doesn't do memoization so it is susceptible to bad performance on some kinds of grammars. Memoization needs to be added.
This sounds like it might be a bad idea, but it could be very useful for an expression or an action to have direct access to the memo for storing or retrieving arbitrary information. The purpose is for expressions that accumulate information across different branches. It's something like persistent bound values.
I can imagine some expression like foo@Bar (syntax undecided) that stores the values of Bar in the memo at key foo, but that might not be enough. Perhaps the Action class could take the memo as another argument, and then some Memo action could do the same as the expression. Users could create subclasses of Action or Memo to do more than just store the value.
This would need some restrictions, such as only allowing string keys so they don't overwrite actual memo data.
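One possible shape for such a Memo action (names and signature are hypothetical; pe's Action does not currently receive the memo):

```python
class Memo:
    """Store emitted values in a shared store under a string key."""
    def __init__(self, key):
        if not isinstance(key, str):
            # restrict keys so real memo entries can't be clobbered
            raise TypeError('memo keys must be strings')
        self.key = key

    def __call__(self, store, *args):
        store.setdefault(self.key, []).append(args)
        return args

store = {}
Memo('foo')(store, 1, 2)
assert store == {'foo': [(1, 2)]}
```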
Pe lists the supported Python versions as 3.6--3.9, but the currently active versions are 3.7--3.11. These should be aligned.
So far I have not seen a need for left recursion, particularly because pe emits values in a flat list. But some people seem to really care about it, so this issue is for tracking its implementation.
Using the memo to escape the left-recursive loop looks promising, and this method was chosen for Python's new pegen parser. Guido wrote about it here: https://medium.com/@gvanrossum_83706/left-recursive-peg-grammars-65dab3c580e1
This is a feature where an object is created as parsing begins and it can be used in actions during parsing.
Imagine parsing something like a large TOML file, where the top-level object is not created until the full file has been read in. You won't know if there's a table or key collision until the very end. Instead, if we could create an object that assembles the document as it parses, we could check as we go. E.g.:
from pe.actions import Sidecar

...

class TOMLSidecar:
    def __init__(self):
        self.doc = {}

    def register_table(self, name: str) -> None:
        if name in self.doc:  # simplified for convenience
            raise TOMLDecodeError(f"table already defined: {name}")
        self.doc[name] = {}

    ...

TOML_GRAMMAR = """
...
Table <- "[" WS Key WS "]"
...
"""

toml_parser = pe.compile(
    TOML_GRAMMAR,
    actions={
        "Table": Sidecar.call("register_table"),
        ...
    },
    sidecar_factory=TOMLSidecar,
)

The Sidecar.call("register_table") does something like this at parse time:

getattr(self.sidecar, "register_table")(*args, **kwargs)
This would probably subsume the use case of #10.
Some notes:

- Sidecar.call(...) would not be an Action object even though it is used in actions={...}, so maybe it doesn't belong in pe.actions
Values are currently computed as soon as an expression succeeds, even if that value will later be discarded. Since a parse is inherently a tree, a stack should be able to compute the full derivation. Thus we could have a value stack with (start, end, function) triples. This means that a new tuple is created for each expression that emits values, but that is already the case, as the emitted values args is an immutable tuple and not a list. The difference then is that the function is not computed until it is needed. Since there are now callables for Capture and Bind, all features can be represented this way (although in some tests, Capture and Bind were a bit slower than the ~ and : operators).
The default behavior of mypy has changed.

def foo(x: str = None) -> str:
    ...

should now be

from typing import Optional

def foo(x: Optional[str] = None) -> str:
    ...
It would be very handy if parsers generated by pe were able to produce proper error messages when they fail to parse a given input. This includes pointing out the longest possible sequence of the input that the parser was able to match and then telling the user what the parser would have expected in order to continue matching from there.
E.g., when using a grammar such as

match < "Hello" "world"

matching an input of Hello should say something like At line 1, col 5: Expected "world", or if there were multiple alternatives:

match < "Hello" ("world" / "universe")

At line 1, col 5: Expected one of: "world", "universe"
For character classes, one can simply print their definition as the expected value, and for rule references, the name of the rule.
See e.g. the error messages that ANTLR produces.
The specification is very detailed about how the - character is interpreted inside a character class (link), but it isn't entirely accurate or complete.
It says that a - at the end of a character class is a literal character, but this does not hold if the class has two characters (e.g., [+-]). In that case, it is in fact a range ending at the ] character, which is treated literally in this context. This causes problems in that the character class is not terminated where the user expects it to be.
The trouble is that this is how the PEG grammar is defined, so changing it would be a non-monotonic extension (current extensions only add to the syntax, not take away). This might be a good case for a warn action. If the user wants to silence the warning, they can escape the - or the ], depending on what they intended.
A simple but useful extension is bounded repetition. The main issue is the syntax to use for it.
One obvious choice is {count} or {start,end} as in regex. A counterpoint to this is that {...} is used for semantic actions by some parsing frameworks, but I don't anticipate pe using those.
The behavior would follow regular expressions. Example:
>>> pe.match(".{1,3}", "abcdefg")
<Match object; span=(0, 3), match='abc'>
>>> pe.match(".{2}", "abcdefg")
<Match object; span=(0, 2), match='ab'>
>>> pe.match(".{2,}", "abcdefg")
<Match object; span=(0, 7), match='abcdefg'>
>>> pe.match(".{,2}", "abcdefg")
<Match object; span=(0, 2), match='ab'>
In the {start,end} case, either operand can be omitted to indicate an open range, from 0 to infinity. Thus, .{,} is equivalent to .*, and .{1,} is equivalent to .+.
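The proposed semantics mirror Python's re quantifiers:

```python
import re

# Bounded repetition as regular expressions already implement it:
assert re.match('.{1,3}', 'abcdefg').group() == 'abc'
assert re.match('.{2}', 'abcdefg').group() == 'ab'
assert re.match('.{2,}', 'abcdefg').group() == 'abcdefg'
assert re.match('.{,2}', 'abcdefg').group() == 'ab'
```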
In publishing the last release (link to run), the following warning came up:

Build source distribution
The python-version input is not set. The version of Python currently in PATH will be used.

It seems the workflow needs to be updated.