Comments (3)
keep the original grammar as well and if there is a parse error re-parse with the unoptimized grammar in order to provide more helpful error messages.
Pe already stores the original and optimized grammars in the parser objects:
Lines 10 to 13 in 4167657
The reason for storing both is precisely what you describe; however, I have not yet implemented the reparsing feature. One challenge is that I don't see an easy way to hand the parser state at the point of failure over to the unoptimized grammar. I can reparse the whole input, but that can take a while, especially with the unoptimized grammar.
from pe.
I agree, and I think better error messages are the thing most needed by this project.
If you use the packrat parser, you will get close to what you are asking for with the pe.STRICT flag:
>>> pe.match('"a"+ "b"+', "aacc", flags=pe.STRICT)
Traceback (most recent call last):
[...]
pe._errors.ParseError: ParseError: failed to parse; use memoization for more details
Also use the pe.MEMOIZE flag to get more precise error messages:
>>> pe.match('"a"+ "b"+', "aacc", flags=pe.STRICT|pe.MEMOIZE)
Traceback (most recent call last):
[...]
pe._errors.ParseError:
  line 0, character 0
    aacc
    ^
ParseError: `(?=(?P<_1>a+))(?P=_1)(?=(?P<_2>b+))(?P=_2)`
The next issue is that the patterns are optimized, which means:
- Multiple patterns/rules get merged into a single terminal regex, making pe unable to see parse failures internal to the regex (thus the ^ points to the first a and not the first c)
- The optimized regex is hard to follow and doesn't resemble the original grammar
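The first point can be reproduced with Python's re module alone. The following is only a rough sketch, not pe's implementation: match_merged and match_stepwise are hypothetical helpers that compare the single merged regex from the error message above against matching the two terminals one at a time while tracking how far matching got.

```python
import re

# The merged regex copied from the error message above.
MERGED = re.compile(r"(?=(?P<_1>a+))(?P=_1)(?=(?P<_2>b+))(?P=_2)")

# Hypothetical unmerged terminals, one regex per literal in '"a"+ "b"+'.
TERMINALS = [re.compile("a"), re.compile("b")]


def match_merged(text):
    """Return (end, failure_pos) using the single merged regex.

    When the regex fails, the only position known is where it started.
    """
    m = MERGED.match(text)
    if m:
        return m.end(), None
    return None, 0


def match_stepwise(text):
    """Return (end, failure_pos), matching each terminal one or more times."""
    pos = 0
    for terminal in TERMINALS:
        start = pos
        while (m := terminal.match(text, pos)):
            pos = m.end()
        if pos == start:  # "+" requires at least one match
            return None, pos
    return pos, None


print(match_merged("aacc"))    # (None, 0)
print(match_stepwise("aacc"))  # (None, 2)
print(match_merged("aabb"))    # (4, None)
```

The merged form fails as a whole, so the failure can only be attributed to offset 0, while the stepwise form knows that the two "a" terminals succeeded before "b" failed at offset 2.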
To overcome this, you will need to first compile the grammar and then match, because the flags option of pe.match() is only for the matching and not the compiling (the compilation and matching flags are currently conflated... I may separate them in a future version):
>>> p = pe.compile('"a"+ "b"+', parser="packrat", flags=pe.NONE)
>>> p.match("aacc", flags=pe.STRICT|pe.MEMOIZE)
Traceback (most recent call last):
[...]
pe._errors.ParseError:
  line 0, character 2
    aacc
      ^
ParseError: `a`, `b`
Another strategy is to use the pe.DEBUG flag at compilation time, which is also currently only available for the packrat parser:
>>> p = pe.compile('"a"+ "b"+', parser="packrat", flags=pe.DEBUG)
## Grammar ##
Start <- "a"+ "b"+
>>> p.match("aacc", flags=pe.STRICT|pe.MEMOIZE)
aacc | "a"+ "b"+
aacc | "a"+
aacc | "a"
acc | "a"
cc | "a"
cc | "b"+
cc | "b"
Traceback (most recent call last):
[...]
pe._errors.ParseError:
  line 0, character 2
    aacc
      ^
ParseError: `a`, `b`
In a terminal that supports ANSI colors, you'll see the terminals that failed in red and those that succeeded in green.
Getting good error messages from a general-purpose recursive descent parser is a significant challenge in itself, but doing that when you have an optimized grammar is even more difficult. I welcome any help in this area.
Since the machine parser does not yet do memoization or debug rules, a strategy for grammar development is to test things out with the packrat parser, then switch to the machine parser for speed once it's working.
from pe.
Yeah, producing good error messages is incredibly difficult.
For the issue with having an optimized grammar, one potential workaround could be to keep the original grammar as well and if there is a parse error re-parse with the unoptimized grammar in order to provide more helpful error messages. That should probably be optional though due to the additional overhead of parsing the same input twice.
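As a rough illustration of that workaround, here is a minimal sketch using only the standard library. It is not pe code: ParseFailure, parse_with_fallback, fast_parse, and detailed_parse are all hypothetical names, with fast_parse standing in for the optimized grammar and detailed_parse for the unoptimized one. The reparse argument makes the second pass optional.

```python
import re


class ParseFailure(Exception):
    """Hypothetical error type carrying the failure position."""

    def __init__(self, pos, expected=()):
        self.pos = pos
        self.expected = tuple(expected)
        super().__init__(f"failed at {pos}; expected {self.expected}")


def parse_with_fallback(fast_parse, detailed_parse, text, reparse=True):
    """Try the fast parser; on failure, optionally re-parse for a better error."""
    try:
        return fast_parse(text)
    except ParseFailure as fast_error:
        if not reparse:
            raise
        try:
            detailed_parse(text)  # second pass over the same input
        except ParseFailure as detailed_error:
            raise detailed_error from fast_error
        raise  # the two parsers disagree; keep the original error


def fast_parse(text):
    """Stand-in for an optimized grammar: one opaque regex."""
    if re.fullmatch(r"a+b+", text) is None:
        raise ParseFailure(0)
    return text


def detailed_parse(text):
    """Stand-in for the unoptimized grammar: one terminal at a time."""
    pos = 0
    for ch in "ab":
        start = pos
        while text.startswith(ch, pos):
            pos += 1
        if pos == start:
            raise ParseFailure(pos, [ch])
    if pos != len(text):
        raise ParseFailure(pos, ["end of input"])
    return text


try:
    parse_with_fallback(fast_parse, detailed_parse, "aacc")
except ParseFailure as err:
    print(err.pos, err.expected)  # 2 ('b',)
```

The cost of the second pass is only paid on failure, so successful parses keep the speed of the optimized path.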
from pe.