Comments (32)
Hello :-),
Firstly, the token NameStartChar
could be written \w+
instead of just \w
, it will save CPU time and simplify the grammar. Secondly, here is my trace with your original grammar:
> enter Production (#Production)
> > enter NCName
token NameStartChar, consumed A
> > > enter 6
> > > > enter 5 (#NCName)
token NameStartChar, consumed b
< < < < ekzit 5
< < < ekzit 6
< < ekzit NCName
token Production_a, consumed ::=
> > enter Choice
> > > enter SequenceOrDifference
> > > > enter 17
> > > > > enter NCName
token NameStartChar, consumed C
> > > > > > enter 6
< < < < < < ekzit 6
< < < < < ekzit NCName
> > > > > enter 16
> > > > > > enter 15 (#SequenceOrDifference)
> > > > > > > enter 14
> > > > > > > > enter NCName
token NameStartChar, consumed d
> > > > > > > > > enter 6
< < < < < < < < < ekzit 6
< < < < < < < < ekzit NCName
< < < < < < < ekzit 14
< < < < < < ekzit 15
< < < < < ekzit 16
< < < < ekzit 17
< < < ekzit SequenceOrDifference
> > > enter 10
< < < ekzit 10
< < ekzit Choice
< ekzit Production
I don't have enter 17
as you have. Are you sure of your trace?
from compiler.
Hi, a NameStartChar
is just one ... "char" so i will not add the +
after \w
. Performance is not an issue. But thanks for the tip.
I don't know why in your trace all the trace nodes get different numbers. In your trace it is:
> > > > > > enter 6
on line 16
After that line i expect:
> > > > > > > enter 5 (#NCName)
similar like line 4 (in your trace)
from compiler.
@flip111 It's not only about performances here, but your AST will have one leaves per “char”, which does not make any sense, but ok, anyway, you will see.
We should have the same trace, exactly. The compiler is deterministic. How do you get the trace?
from compiler.
@Hywan in IRC you say:
::token:: will consume the token but it will not be present in the AST, while <token> will consume the token and it will be present in the AST.
Since i use ::NameStartChar::
and not <NameStartChar>
individual chars should NOT be present in the AST by themselves? At least how i understand???
I get the trace with the command that you showed me on the command line (compiler shell with dump command) .. unfortunately i can not get to that computer right now, so i will report back on that.
from compiler.
Yes, <token>
will be present in the AST while ::token::
will just be consumed.
from compiler.
I have the command i use, it's bin\hoa compiler:pp src/case.pp src/case.ebnf -v dump -s -t
from compiler.
And the grammar is identical with the one you gave me?
from compiler.
I might have pasted a SLIGHTLY different version of the trace, i get the same trace now as you. But the original problem stays the same.
from compiler.
Well, we can't do that. It is part to you to write the grammar is order to remove those ambiguity. This is top-down parser, so I can't solve your issue without rewriting the grammar ;-).
from compiler.
Ok but the compiler gives me no information on why it makes this choice, so it becomes really hard for grammar writers. I think the solution is indeed to rewrite the grammar. But i like to program not in a style of "trial & error" but to know why something goes in a certain way.
from compiler.
These is no error. You have two paths. The first one is used because this is a top-down parser. Also, all of this happens because your grammar is not well-written… :-).
from compiler.
It's not the compiler error, but my error in writing the grammar in this way. What do you mean with "the first one is used"? The rule #SequenceOrDifference
comes after #NCName
.
from compiler.
Not if you unfold the paths. Try to see. Unfold the whole grammar.
from compiler.
So you are saying it will always try to match things higher up in the hierarchy, rather then lower elements?
from compiler.
Normally no. I didn't look your example in details but the behavior is as expected. Just change your grammar ;-).
from compiler.
Is the issue closed?
from compiler.
No not yet, as we discussed i will make a PR first to add support to shows paths to the compiler, then i will continue to debug this issue
from compiler.
Ok so it took 3 months and it was really difficult, but the PR is ready here #22 i will await comments on this PR first.
from compiler.
So i unfolded all the paths, but actually there is not much to unfold. The possible path that is taken is this:
1: NameStartChar > NameStartChar * > Production_a > NameStartChar > NameStartChar * > ( SequenceOrDifference_a > NameStartChar > NameStartChar * ) ?
This is how the tokens are matched:
1: NameStartChar [TOKEN A] > NameStartChar * [TOKEN b] > Production_a [TOKEN ::=] > NameStartChar [TOKEN C] > NameStartChar * > ( SequenceOrDifference_a > NameStartChar [TOKEN d] > NameStartChar * ) ?
This is how i expected them to be matched:
1: NameStartChar [TOKEN A] > NameStartChar * [TOKEN b] > Production_a [TOKEN ::=] > NameStartChar [TOKEN C] > NameStartChar * [TOKEN d] > ( SequenceOrDifference_a > NameStartChar > NameStartChar * ) ?
This is a trace of the paths, starting with red arrow for tokens A and B (the first time NCName is processed). Then token C in blue (the second time NCName is processed). Then green where i expect it to go, and orange where i think it's going.
So my conclusion is that the repetition is not greedy and thus leaving a rule takes precedence. Please confirm on this assumption.
from compiler.
Ping @Hywan
from compiler.
@flip111 I suppose yes.
from compiler.
@Hywan is this intended behavior by design? I feel like trying to match the current active rule should be the highest priority, but i don't have the experience or research (reference) to compare this with other compilers. Do you think the way it is now is the best choice for how the compiler works/should work?
I also realize now i could just rewrite the rule to ::NameStartChar:: +
but still if this is the intended mode of operation then one should be very careful when designing the grammar.
from compiler.
I assign @CircleCode to this issue :-)
from compiler.
ping?
from compiler.
pong
from compiler.
No more activity on this issue. Can we close or is it not resolved yet?
from compiler.
As you can see, you assigned @CircleCode but he did not reply. The open question is this intended behavior by design? can not be answered on my own (because i'm not the designer of this library). So yes it's still open...
from compiler.
@CircleCode is on the opposite side of the earth (based on its usual location) for a while. I take the lead on this issue and try to answer to the question as soon as possible.
from compiler.
Ok thank you :)
from compiler.
Ok after two years i think this issue is resolved. What happens is that even though the parser is DEPTH-FIRST and yes it should match more (::NameStartChar::)*
in rule NCName
. What i think happens is that it runs out of tokens this way (which is good) but then it goes a level higher and comes back to rule SequenceOrDifference
and this rule is (NCName() (::SequenceOrDifference_a:: NCName() | NCName()*))?
and now it consumed all the tokens when diving into the first NCName()
so then it continues with (::SequenceOrDifference_a:: NCName() | NCName()*)
And here something very sneaky but logical occurs. Since it's looking for ( something here )
which in itself is NOT optional or zero or more
it will have problems closing this rule, therefor it will backtrack on the path which was just taken to see if it can "free up any tokens", it frees up one ::NameStartChar::
. Now it can match this token in the second NCName()
in (::SequenceOrDifference_a:: NCName() | NCName()*)
. EVEN THOUGH this part has a *
zero or more
quantifier behind it (so this inner-part doesn't infact have to match).
So the confusion here (on my part) was to assume that the parser knows that it doesn't have to match the outer ( )
because one of the inner choices was entirely optional. In a nutshell foo*
is not the same as (foo*)
from compiler.
Good to hear!
from compiler.
This should be documented that (foo*)
and foo*
are not the same.
from compiler.
Related Issues (20)
- Multiple start-symbols support
- Parsing tree is just the first token HOT 1
- Dependabot can't resolve your PHP dependency files
- Backtrack issue when rules overlap HOT 1
- PHP 7.4 deprecation warning in Bin/Pp.php
- Future of Compiler package HOT 4
- Enhance context output of UnrecognizedToken exception
- Enhance context output of UnexpectedToken exception
- Unexpected namespace assignment for PCRE containing colon
- Dependabot can't resolve your PHP dependency files
- Dependabot can't resolve your PHP dependency files
- mbstring problem
- Debug grammar tooling problems
- Bug when saving parser class
- Question: how to access/traverse nodes of grammar HOT 5
- Inlining code of the Parser and license HOT 2
- Madness with exceptions HOT 5
- Remove dependency to `ext/ctype` HOT 4
- Broken visualization of invalid input token in multiline input HOT 7
- Unrecognized Token in Lexer always reports Line 1? HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from compiler.