Problems with debugging PP about compiler HOT 32 CLOSED

hoaproject commented on May 16, 2024

Problems with debugging PP

from compiler.

Comments (32)

Hywan commented on May 16, 2024

Hello :-),

Firstly, the token NameStartChar could be written \w+ instead of just \w, it will save CPU time and simplify the grammar. Secondly, here is my trace with your original grammar:

>  enter Production (#Production)
>  >  enter NCName
         token NameStartChar, consumed A
>  >  >  enter 6
>  >  >  >  enter 5 (#NCName)
               token NameStartChar, consumed b
<  <  <  <  ekzit 5
<  <  <  ekzit 6
<  <  ekzit NCName
      token Production_a, consumed ::=
>  >  enter Choice
>  >  >  enter SequenceOrDifference
>  >  >  >  enter 17
>  >  >  >  >  enter NCName
                  token NameStartChar, consumed C
>  >  >  >  >  >  enter 6
<  <  <  <  <  <  ekzit 6
<  <  <  <  <  ekzit NCName
>  >  >  >  >  enter 16
>  >  >  >  >  >  enter 15 (#SequenceOrDifference)
>  >  >  >  >  >  >  enter 14
>  >  >  >  >  >  >  >  enter NCName
                           token NameStartChar, consumed d
>  >  >  >  >  >  >  >  >  enter 6
<  <  <  <  <  <  <  <  <  ekzit 6
<  <  <  <  <  <  <  <  ekzit NCName
<  <  <  <  <  <  <  ekzit 14
<  <  <  <  <  <  ekzit 15
<  <  <  <  <  ekzit 16
<  <  <  <  ekzit 17
<  <  <  ekzit SequenceOrDifference
>  >  >  enter 10
<  <  <  ekzit 10
<  <  ekzit Choice
<  ekzit Production

I don't have enter 17 as you have. Are you sure of your trace?

from compiler.

flip111 commented on May 16, 2024

Hi, a NameStartChar is just one ... "char" so i will not add the + after \w. Performance is not an issue. But thanks for the tip.

I don't know why in your trace all the trace nodes get different numbers. In your trace it is:

>  >  >  >  >  >  enter 6

on line 16

After that line i expect:

>  >  >  >  >  >  >  enter 5 (#NCName)

similar like line 4 (in your trace)

from compiler.

Hywan commented on May 16, 2024

@flip111 It's not only about performances here, but your AST will have one leaves per “char”, which does not make any sense, but ok, anyway, you will see.

We should have the same trace, exactly. The compiler is deterministic. How do you get the trace?

from compiler.

flip111 commented on May 16, 2024

@Hywan in IRC you say:
::token:: will consume the token but it will not be present in the AST, while <token> will consume the token and it will be present in the AST.
Since i use ::NameStartChar:: and not <NameStartChar> individual chars should NOT be present in the AST by themselves? At least how i understand???

I get the trace with the command that you showed me on the command line (compiler shell with dump command) .. unfortunately i can not get to that computer right now, so i will report back on that.

from compiler.

Hywan commented on May 16, 2024

Yes, <token> will be present in the AST while ::token:: will just be consumed.

from compiler.

flip111 commented on May 16, 2024

I have the command i use, it's bin\hoa compiler:pp src/case.pp src/case.ebnf -v dump -s -t

from compiler.

Hywan commented on May 16, 2024

And the grammar is identical with the one you gave me?

from compiler.

flip111 commented on May 16, 2024

I might have pasted a SLIGHTLY different version of the trace, i get the same trace now as you. But the original problem stays the same.

from compiler.

Hywan commented on May 16, 2024

Well, we can't do that. It is part to you to write the grammar is order to remove those ambiguity. This is top-down parser, so I can't solve your issue without rewriting the grammar ;-).

from compiler.

flip111 commented on May 16, 2024

Ok but the compiler gives me no information on why it makes this choice, so it becomes really hard for grammar writers. I think the solution is indeed to rewrite the grammar. But i like to program not in a style of "trial & error" but to know why something goes in a certain way.

from compiler.

Hywan commented on May 16, 2024

These is no error. You have two paths. The first one is used because this is a top-down parser. Also, all of this happens because your grammar is not well-written… :-).

from compiler.

flip111 commented on May 16, 2024

It's not the compiler error, but my error in writing the grammar in this way. What do you mean with "the first one is used"? The rule #SequenceOrDifference comes after #NCName.

from compiler.

Hywan commented on May 16, 2024

Not if you unfold the paths. Try to see. Unfold the whole grammar.

from compiler.

flip111 commented on May 16, 2024

So you are saying it will always try to match things higher up in the hierarchy, rather then lower elements?

from compiler.

Hywan commented on May 16, 2024

Normally no. I didn't look your example in details but the behavior is as expected. Just change your grammar ;-).

from compiler.

Hywan commented on May 16, 2024

Is the issue closed?

from compiler.

flip111 commented on May 16, 2024

No not yet, as we discussed i will make a PR first to add support to shows paths to the compiler, then i will continue to debug this issue

from compiler.

flip111 commented on May 16, 2024

Ok so it took 3 months and it was really difficult, but the PR is ready here #22 i will await comments on this PR first.

from compiler.

flip111 commented on May 16, 2024

So i unfolded all the paths, but actually there is not much to unfold. The possible path that is taken is this:

1: NameStartChar > NameStartChar * > Production_a > NameStartChar > NameStartChar * > ( SequenceOrDifference_a > NameStartChar > NameStartChar * ) ?

This is how the tokens are matched:

1: NameStartChar [TOKEN A] > NameStartChar * [TOKEN b] > Production_a [TOKEN ::=] > NameStartChar [TOKEN C] > NameStartChar * > ( SequenceOrDifference_a > NameStartChar [TOKEN d] > NameStartChar * ) ?

This is how i expected them to be matched:

1: NameStartChar [TOKEN A] > NameStartChar * [TOKEN b] > Production_a [TOKEN ::=] > NameStartChar [TOKEN C] > NameStartChar * [TOKEN d] > ( SequenceOrDifference_a > NameStartChar > NameStartChar * ) ?

This is a trace of the paths, starting with red arrow for tokens A and B (the first time NCName is processed). Then token C in blue (the second time NCName is processed). Then green where i expect it to go, and orange where i think it's going.

So my conclusion is that the repetition is not greedy and thus leaving a rule takes precedence. Please confirm on this assumption.

from compiler.

flip111 commented on May 16, 2024

Ping @Hywan

from compiler.

Hywan commented on May 16, 2024

@flip111 I suppose yes.

from compiler.

flip111 commented on May 16, 2024

@Hywan is this intended behavior by design? I feel like trying to match the current active rule should be the highest priority, but i don't have the experience or research (reference) to compare this with other compilers. Do you think the way it is now is the best choice for how the compiler works/should work?

I also realize now i could just rewrite the rule to ::NameStartChar:: + but still if this is the intended mode of operation then one should be very careful when designing the grammar.

from compiler.

Hywan commented on May 16, 2024

I assign @CircleCode to this issue :-)

from compiler.

Hywan commented on May 16, 2024

ping?

from compiler.

flip111 commented on May 16, 2024

pong

from compiler.

Hywan commented on May 16, 2024

No more activity on this issue. Can we close or is it not resolved yet?

from compiler.

flip111 commented on May 16, 2024

As you can see, you assigned @CircleCode but he did not reply. The open question is this intended behavior by design? can not be answered on my own (because i'm not the designer of this library). So yes it's still open...

from compiler.

Hywan commented on May 16, 2024

@CircleCode is on the opposite side of the earth (based on its usual location) for a while. I take the lead on this issue and try to answer to the question as soon as possible.

from compiler.

flip111 commented on May 16, 2024

Ok thank you :)

from compiler.

flip111 commented on May 16, 2024

Ok after two years i think this issue is resolved. What happens is that even though the parser is DEPTH-FIRST and yes it should match more (::NameStartChar::)* in rule NCName. What i think happens is that it runs out of tokens this way (which is good) but then it goes a level higher and comes back to rule SequenceOrDifference and this rule is (NCName() (::SequenceOrDifference_a:: NCName() | NCName()*))? and now it consumed all the tokens when diving into the first NCName() so then it continues with (::SequenceOrDifference_a:: NCName() | NCName()*)

And here something very sneaky but logical occurs. Since it's looking for ( something here ) which in itself is NOT optional or zero or more it will have problems closing this rule, therefor it will backtrack on the path which was just taken to see if it can "free up any tokens", it frees up one ::NameStartChar::. Now it can match this token in the second NCName() in (::SequenceOrDifference_a:: NCName() | NCName()*). EVEN THOUGH this part has a * zero or more quantifier behind it (so this inner-part doesn't infact have to match).

So the confusion here (on my part) was to assume that the parser knows that it doesn't have to match the outer ( ) because one of the inner choices was entirely optional. In a nutshell foo* is not the same as (foo*)

from compiler.

Hywan commented on May 16, 2024

Good to hear!

from compiler.

flip111 commented on May 16, 2024

This should be documented that (foo*) and foo* are not the same.

from compiler.

Problems with debugging PP about compiler HOT 32 CLOSED

Comments (32)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent