Grammar: <div class="snippet-clipboard-content notranslate position-relative overf

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Sorry, I was a bit late, but in <a class="commit-link" data-hovercard-type="commit" da

In <a class="commit-link" data-hovercard-type="commit" data-hovercard-url="https://git

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Parser drops part of input about tatsu HOT 15 CLOSED

neogeny commented on May 16, 2024

Parser drops part of input

from tatsu.

Comments (15)

apalala commented on May 16, 2024

A parser will be happy with part of the input unless the end-of-file expression ($) is mentioned in the grammar. This version should work as expected:

identifier = /\w+/ ;
expre = '{' expre '}' | expre '->' identifier | identifier ;
start = expre $ ;

from tatsu.

manueljacob commented on May 16, 2024

The fact that TatSu will drop the end of the input by default is a strange feature IMHO, but I got used to that already and this bug report is about a different issue. :)

In this case the parser drops the start of the input, not the end. However, the problem can also happen in the middle of the input: for example, in { { size } test }, { size } is dropped.

from tatsu.

apalala commented on May 16, 2024

@manueljacob Could you write a failing unit test for this case? I have some upcoming time to take a good look at issues like this.

from tatsu.

apalala commented on May 16, 2024

BTW, nobody ever told me that being specific about a grammar expecting to see the end of input was unexpected for users.

It would seriously break backwards compatibility to change that behavior, but I'll think about it.

from tatsu.

manueljacob commented on May 16, 2024

Sorry, I was a bit late, but in 6a84bfc I added a simplified test case.

The problem may be in the _recursive_call() method of the ParseContext class. When parsing "foo bar", in the beginning of the method the beginning "foo" is parsed with the right side of the expr rule (identifier). Then, because expr is a recursive rule, in the end of the method the rule is tried again, but this time with the buffer position at "bar".

from tatsu.

manueljacob commented on May 16, 2024

In 825ffde I moved and renamed the test, and modified it to look like the other tests in this file.
EDIT: More precisely, I created a fresh branch in my fork with one commit, including these changes.

from tatsu.

apalala commented on May 16, 2024

@manueljacob My fix for this bug broke several other things, so I reverted it.

from tatsu.

frozenblit commented on May 16, 2024

Hi, I've noticed a similar bug after adding an unary operator to the example calc grammar:

    start = expression $ ;
    expression
        =
        | expression ('+'|'-') term
        | term
        ;
    term
        =
        | term ('*'|'/') unary
        | unary
        ;
    unary
        =
        | '-' factor
        | factor
        ;
    factor
        =
        | '(' ~ @:expression ')'
        | '-' factor
        | number
        ;
    number = /\d+/ ;

For the input -3*-3*-5, the parser returns

[ [ ['-', '3'], '*', ['-', '3']], '*', ['-', '5']]

However, -3*-3*-5-5 returns only ['-', '5']. Opening the left-recursion in term by hand parses correctly.

Are those bugs related?

from tatsu.

Victorious3 commented on May 16, 2024

I did a test with Whimsy using this grammar:
(Based on this failing test case https://github.com/neogeny/TatSu/blob/master/test/grammar/left_recursion_test.py#L349)

class DroppedInputGrammar: Grammar() {

    fun identifier() = word { build_str (
        syntax = { repeat1 { alphanum() } },
        value = { it.toString() }
    )}

    val expression = leftrec {
        self -> choice {
            self() && word(",") && self() ||
            identifier()
        }
    }

    override fun root() = expression()
}

With the following results:

foo      -> [foo]
foo bar  -> [foo, bar]
foo, bar -> [foo, bar]

Different, but not quite the expected output either. I might have done something wrong formulating the grammar tho. (Tagging @norswap again, thanks for your great help so far!)

What I'm suspecting is that nothing is telling the parser to stop when doing the left recursion.
It keeps matching "expression->identifier" until the input position doesn't grow anymore. In this case that means eating the first match. Tatsu returns only the last one and Whimsy returns every match.

~~Something would have to check if we are matching a non left recursive branch and then fail when this happens multiple times in sequence...~~
Could also be a problem with memoization since I haven't changed anything specific to that yet, but I don't know if it applies to this case, since the output I get from Whimsy isn't the expected one either.

from tatsu.

apalala commented on May 16, 2024

Interesting experiment. Why didn't I think of trying other parser generators? :-) It would be good to know what something YACC-like does with that grammar.

Memoization can be turned off in TatSu, but I think you have to instantiate Parser() to do it. It can also be switched off temporarily. I think it is done to handle negative lookaheads.

from tatsu.

norswap commented on May 16, 2024

@Victorious3 This really made me go WTF until I came to my senses.

There's an issue in the grammar, you didn't put the sequential stuff in a seq:

val expression = leftrec {
        self -> choice {
            seq { self() && word(",") && self() } ||
            identifier()
        }
    }

Without it what happens is that on round 1 you match an identifier. On round 2, the first call to self() recalls this identifier (the input position is restored past "foo"), but when it fails to match the comma, it tries the other side of the || expression without resetting the input position to the start of the input, causing it to match a second identifier after the first one. Hence the unexpected result.

It's way too easy to make such a mistake. This is terrible UX and I'm sorry. No such problem in Autumn 4 :p

Oh and thank you too! I's cool to have people scrutinizing all this stuff, it's how good software gets made. Keep up the good work 👍

from tatsu.

Victorious3 commented on May 16, 2024

@norswap Getting the correct output now.

Tatsu's Trace:

↙start ~1:1
foo bar
↙expr↙start ~1:1
foo bar
↙expr↙expr↙start ~1:1
foo bar
⟲ expr↙expr↙start ~1:1
foo bar
↙identifier↙expr↙start ~1:1
foo bar
≡'foo' /\w+/
 bar
≡identifier↙expr↙start ~1:4
 bar
↙expr↙expr↙start ~1:5
bar
≡expr↙expr↙start ~1:5
bar
≢','
bar
↙identifier↙expr↙start ~1:5
bar
≡'bar' /\w+/
≡identifier↙expr↙start
↙expr↙expr↙start
≡expr↙expr↙start
≢','
↙identifier↙expr↙start
≢'' /\w+/
≡expr↙start
≡start

It's good until the first comma fails, then it tries identifier at the same position again...
What does wrapping it in seq do exactly?

@apalala Maybe the changes from #32 would apply now with my fix
EDIT: No, I just tried it. At least not without changes. The other tests still fail
Tatsu doesn't follow https://github.com/ncellar/autumn_v1/blob/61a2bc95403b06859b9dc7e12f19a95b539c2002/src/com/norswap/autumn/extensions/leftrec/LeftRecursive.java exactly, I suppose #32 was meant to rectify that in some way

from tatsu.

norswap commented on May 16, 2024

It's what I explain in my last message, seq makes sure that if the predicate (the &&-separated chain of parser calls) fail, the input position is reset to what it was at the start of the chain.

The basic heuristic: you're never supposed to mix && and || when calling parsers: && go inside seq only and || inside choice.

from tatsu.

Victorious3 commented on May 16, 2024

I was wondering because whatever Tatsu does, it does not reset the input position, similar to the behavior you described when leaving out seq. Something could be wrong in how Tatsu does sequences, but since they work fine everywhere else I find it unlikely.
It's probably the implementation of the bounded left recursion that's lacking, I might try again from scratch on that one.

from tatsu.

Victorious3 commented on May 16, 2024

Trying to get a fix for this into #75 as well, having some issues with whitespace (failing to retrieve seeds/saving them at the wrong position) but other than it seems to work. I also disabled memoization completely for the time being.

from tatsu.

Parser drops part of input about tatsu HOT 15 CLOSED

Comments (15)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent