Comments (17)
There are definitely a few space (and run-time) issues with explicit ambiguity, that I need to work on.
In the mean time, is it really necessary? I didn't know Fortran is an ambiguous language.
from lark.
Fortran is ambiguous. Without semantic analysis there are many ambiguities. Function calls look like accessing array elements, defining one-line internal functions look like array assignments, keywords are not protected (they can be used as variable names), etc.
The advantage of Earley parsers is that you do not have to waste time analyzing and perfecting the grammar but can resolve ambiguities during semantic analysis. Also there are many incompatible extensions of older compilers which makes managing (and fixing) several similar but subtly different grammars rather painful.
In any case, just collecting the minimal amount of information needed to construct parse forests during parsing may be an idea.
I haven't checked your code in detail but if you have a rule of the form A : B_1 B_2 ... B_n where each of the B_i allow two parse trees then building the parse forest immediately means building 2**n trees.
The size of the parse forest in the examples above seems to indicate that something like that is happening.
One way around this is to keep pointers from A to each of the subtrees B_i and insert a step resolving at least some of the ambiguities before building complete trees.
from lark.
I have a slightly different version of Earley, that might solve your issue. I didn't test it thoroughly, but it's worth giving it a try.
Checkout the branch earley_fix2
and let me know if it improved anything.
from lark.
Thank you.
The simple test cases all work very fast now. 0.03~0.14s
However, for one of my larger test files (which can be parsed with another Earley parser) I get the exception below. You might be pruning too severely somewhere.
Traceback (most recent call last):
File "", line 1, in
File " ..lark_parser-0.4.1-py3.6.egg/lark/lark.py", line 188, in parse
return self.parser.parse(text)
File "..lark_parser-0.4.1-py3.6.egg/lark/parser_frontends.py", line 142, in parse
return self.parser.parse(text)
File "..lark_parser-0.4.1-py3.6.egg/lark/parsers/xearley.py", line 147, in parse
raise ParseError('Incomplete parse: Could not find a solution to input')
lark.common.ParseError: Incomplete parse: Could not find a solution to input
from lark.
That's very likely. This suspicion is the main reason I haven't merged this fix to main yet. If you can provide more details, I'll appreciate it. It might help me narrow down on the issue.
from lark.
I'll give it a try. It may take a few days.
from lark.
Great, thanks.
from lark.
I have mixed news.
- the incomplete parse was a mistake in the grammar conversion script a missing space (causing problems for very long token name)
- earley_fix2 version 0155d3d works ok.
- Simple programs are much faster than my parser (<0.1s vs 1s)
- slightly more complicate programs about the same 2s vs 2s
- but larger programs either take much longer 5s vs 2.5s, 16s vs 4s, 100s vs 5s or run out of memory (16GB) after a while.
- now I tried to set up version ec7da3b (0155... does not exist any longer it seems) on a 64GB desktop. I'll let it run over night and see. EDIT: ran out of memory after 2 hours
EDIT2:
I ran the parser on some 600 fortran files with about 200 include files altogether about 12k lines, using a 2min timeout
- there were about 50 timeouts
- one file which contained 7 consecutive function definitions failed but once I split it into 2 files with 3 and 4 functions respectively both files worked fine. Two relevant rules are given below (executable_program is the start symbol). It seems that parsing 3 or 4 program units works fine but 7 program units fail. Since memory usage was below 1% the whole time according to top it is unlikely that memory was the problem.
Two grammar rules:
executable_program : comment* program_unit program_unit*
program_unit : block_data
| function_subprogram
| main_program
| module
| subroutine_subprogram
from lark.
Simple programs are much faster than my parser
What do you mean? What are you comparing?
It seems that parsing 3 or 4 program units works fine but 7 program units fail
If you can provide an example (ideally, a minimal grammar + input) that behaves badly, I'll be more than happy to look into it.
from lark.
Sorry I should have been clearer. I have a working python Earley parser which is part of a larger in-house system (not for distribution) which is what I use for comparison. I try to build a similar system using open source tools which then can be open sourced. As far as the parser goes there are two conditions: all context free grammars + proper handling of ambiguous parses. So far lark is my main choice.
Unfortunately the example which triggers the problem is covered by a NDA.
I'll see what I can do in terms of coming up with a simple example which triggers the problem but it may take a while. So far all simple examples work fine.
from lark.
The problem seems size related.
I have a string to be parsed of length 131070 (a stream of tokens generated from Fortran code). Trying to parse generates the 'incomplete parse' message. If I simplify the original Fortran code just a tiny bit, for example by removing a single argument from a function call or a function definition, or by simplifying a condition in a if statement, or by removing an arbitrary assignment somewhere and regenerate the token stream (results in a string of length ~131060) then parsing works. I tried about 20 different simplifications and they all worked.
So it seems there is a hidden size limit somewhere. My first guess was that there is a limit for match objects in python and 131072 = 2**17 looks suspicious like a 'large constant' people would use but I checked the python source code of the re module and did not find anything suspicious (which of course does not mean much). I suspected the re module because of the 99 group limit which I have run in before.
from lark.
That's a very strange error. I can't think of anything in the implementation that would care about the length of the input string (except when it comes to run-time and memory, of course).
You should try to run it in pypy
if you can. They have their own regexp implementation, and it seems better than CPython's.
from lark.
I tried replacing re by regex as well as using pypy but neither changes anything.
from lark.
I close the issue for now. I played around a bit more without getting any new insights. The only next step I can think of is to either study the python regex library properly or to reimplement it and I don't really have the time/energy to do so at the moment.
from lark.
Well, another possible step is to provide me with an example that demonstrates this behavior, so that I can try to solve it.
from lark.
As I said before the only examples I have which trigger the problem are covered by NDAs. I have not been able to construct smaller examples. Also all examples I tried and which I could have shared (with some extra effort) run out of memory before the problem shows up.
So all I can say is that some large examples (token strings generated from Fortran code) fail to parse although all types of very minor reductions (removing an assignment somewhere, reducing the number of variables in a function call, removing a declaration) parse correctly.
Sorry for not being able to provide you with such an example.
from lark.
No worries, I'll get to it one way or another. When I make meaningful improvements to the algorithm, I hope I can ask you to check again.
from lark.
Related Issues (20)
- Data structure for getting possible terminal sequences? HOT 2
- AssertionError when using templates HOT 4
- lark.exceptions.UnexpectedCharacters: No terminal matches ',' in the current parser context, at line 1 col 8 HOT 1
- Ability to inline branches not documented in the grammar reference HOT 3
- Parser error messages are not deterministic HOT 2
- Docs: Example link to blog.erezsh.com dead HOT 1
- parsing.py example - lark.exceptions.GrammarError HOT 1
- Please remove the duplicate PYPI record HOT 1
- Transformer raises AttributeError when a tree is only a token HOT 1
- Import lark grammar written in one python project into another HOT 2
- Need help figuring out why some characters are captured in __ANON_ HOT 2
- Need help with terminals not showing up as expected
- Transforming tree after standalone parser results in different AST HOT 4
- Making a comment by using regular expression HOT 5
- earley very, very slow HOT 24
- Cant read `meta` from Tree or Token? HOT 5
- How to define lark grammar for best parsing performance HOT 8
- Unable to parse Arabic text HOT 3
- Incorrect start_pos / end_pos in the tree HOT 8
- Add `outlines` in the list of projects using Lark HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from lark.