I am trying to parse Fortran95 programs with lark. The grammar is medium size but most

I have mixed news. the incomplete parse was a mistake in the g

Simple programs are much faster than my parser <p dir="

parsing time seems to increase dramatically with depth of parse tree about lark HOT 17 CLOSED

lark-parser commented on May 16, 2024

parsing time seems to increase dramatically with depth of parse tree

from lark.

Comments (17)

erezsh commented on May 16, 2024

There are definitely a few space (and run-time) issues with explicit ambiguity, that I need to work on.

In the mean time, is it really necessary? I didn't know Fortran is an ambiguous language.

from lark.

till314 commented on May 16, 2024

Fortran is ambiguous. Without semantic analysis there are many ambiguities. Function calls look like accessing array elements, defining one-line internal functions look like array assignments, keywords are not protected (they can be used as variable names), etc.

The advantage of Earley parsers is that you do not have to waste time analyzing and perfecting the grammar but can resolve ambiguities during semantic analysis. Also there are many incompatible extensions of older compilers which makes managing (and fixing) several similar but subtly different grammars rather painful.

In any case, just collecting the minimal amount of information needed to construct parse forests during parsing may be an idea.

I haven't checked your code in detail but if you have a rule of the form A : B_1 B_2 ... B_n where each of the B_i allow two parse trees then building the parse forest immediately means building 2**n trees.
The size of the parse forest in the examples above seems to indicate that something like that is happening.
One way around this is to keep pointers from A to each of the subtrees B_i and insert a step resolving at least some of the ambiguities before building complete trees.

from lark.

erezsh commented on May 16, 2024

I have a slightly different version of Earley, that might solve your issue. I didn't test it thoroughly, but it's worth giving it a try.

Checkout the branch earley_fix2 and let me know if it improved anything.

from lark.

till314 commented on May 16, 2024

Thank you.

The simple test cases all work very fast now. 0.03~0.14s
However, for one of my larger test files (which can be parsed with another Earley parser) I get the exception below. You might be pruning too severely somewhere.

Traceback (most recent call last):
File "", line 1, in
File " ..lark_parser-0.4.1-py3.6.egg/lark/lark.py", line 188, in parse
return self.parser.parse(text)
File "..lark_parser-0.4.1-py3.6.egg/lark/parser_frontends.py", line 142, in parse
return self.parser.parse(text)
File "..lark_parser-0.4.1-py3.6.egg/lark/parsers/xearley.py", line 147, in parse
raise ParseError('Incomplete parse: Could not find a solution to input')
lark.common.ParseError: Incomplete parse: Could not find a solution to input

from lark.

erezsh commented on May 16, 2024

That's very likely. This suspicion is the main reason I haven't merged this fix to main yet. If you can provide more details, I'll appreciate it. It might help me narrow down on the issue.

from lark.

till314 commented on May 16, 2024

I'll give it a try. It may take a few days.

from lark.

erezsh commented on May 16, 2024

Great, thanks.

from lark.

till314 commented on May 16, 2024

I have mixed news.

the incomplete parse was a mistake in the grammar conversion script a missing space (causing problems for very long token name)
earley_fix2 version 0155d3d works ok.

Simple programs are much faster than my parser (<0.1s vs 1s)
slightly more complicate programs about the same 2s vs 2s
but larger programs either take much longer 5s vs 2.5s, 16s vs 4s, 100s vs 5s or run out of memory (16GB) after a while.

now I tried to set up version ec7da3b (0155... does not exist any longer it seems) on a 64GB desktop. I'll let it run over night and see. EDIT: ran out of memory after 2 hours

EDIT2:
I ran the parser on some 600 fortran files with about 200 include files altogether about 12k lines, using a 2min timeout

there were about 50 timeouts
one file which contained 7 consecutive function definitions failed but once I split it into 2 files with 3 and 4 functions respectively both files worked fine. Two relevant rules are given below (executable_program is the start symbol). It seems that parsing 3 or 4 program units works fine but 7 program units fail. Since memory usage was below 1% the whole time according to top it is unlikely that memory was the problem.

Two grammar rules:
executable_program : comment* program_unit program_unit*
program_unit : block_data
| function_subprogram
| main_program
| module
| subroutine_subprogram

from lark.

erezsh commented on May 16, 2024

Simple programs are much faster than my parser

What do you mean? What are you comparing?

It seems that parsing 3 or 4 program units works fine but 7 program units fail

If you can provide an example (ideally, a minimal grammar + input) that behaves badly, I'll be more than happy to look into it.

from lark.

till314 commented on May 16, 2024

Sorry I should have been clearer. I have a working python Earley parser which is part of a larger in-house system (not for distribution) which is what I use for comparison. I try to build a similar system using open source tools which then can be open sourced. As far as the parser goes there are two conditions: all context free grammars + proper handling of ambiguous parses. So far lark is my main choice.

Unfortunately the example which triggers the problem is covered by a NDA.

I'll see what I can do in terms of coming up with a simple example which triggers the problem but it may take a while. So far all simple examples work fine.

from lark.

till314 commented on May 16, 2024

The problem seems size related.

I have a string to be parsed of length 131070 (a stream of tokens generated from Fortran code). Trying to parse generates the 'incomplete parse' message. If I simplify the original Fortran code just a tiny bit, for example by removing a single argument from a function call or a function definition, or by simplifying a condition in a if statement, or by removing an arbitrary assignment somewhere and regenerate the token stream (results in a string of length ~131060) then parsing works. I tried about 20 different simplifications and they all worked.

So it seems there is a hidden size limit somewhere. My first guess was that there is a limit for match objects in python and 131072 = 2**17 looks suspicious like a 'large constant' people would use but I checked the python source code of the re module and did not find anything suspicious (which of course does not mean much). I suspected the re module because of the 99 group limit which I have run in before.

from lark.

erezsh commented on May 16, 2024

That's a very strange error. I can't think of anything in the implementation that would care about the length of the input string (except when it comes to run-time and memory, of course).

You should try to run it in pypy if you can. They have their own regexp implementation, and it seems better than CPython's.

from lark.

till314 commented on May 16, 2024

I tried replacing re by regex as well as using pypy but neither changes anything.

from lark.

till314 commented on May 16, 2024

I close the issue for now. I played around a bit more without getting any new insights. The only next step I can think of is to either study the python regex library properly or to reimplement it and I don't really have the time/energy to do so at the moment.

from lark.

erezsh commented on May 16, 2024

Well, another possible step is to provide me with an example that demonstrates this behavior, so that I can try to solve it.

from lark.

till314 commented on May 16, 2024

As I said before the only examples I have which trigger the problem are covered by NDAs. I have not been able to construct smaller examples. Also all examples I tried and which I could have shared (with some extra effort) run out of memory before the problem shows up.

So all I can say is that some large examples (token strings generated from Fortran code) fail to parse although all types of very minor reductions (removing an assignment somewhere, reducing the number of variables in a function call, removing a declaration) parse correctly.

Sorry for not being able to provide you with such an example.

from lark.

erezsh commented on May 16, 2024

No worries, I'll get to it one way or another. When I make meaningful improvements to the algorithm, I hope I can ask you to check again.

from lark.

parsing time seems to increase dramatically with depth of parse tree about lark HOT 17 CLOSED

Comments (17)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent