Comments (11)
It does build the parse-tree incrementally. You can find it at ip.parser_state.value_stack, where ip: InteractiveParser. But it might be a bit tricky to use it effectively.
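For illustration, a minimal sketch (the toy grammar here is my own; parse_interactive only works with the LALR parser, and iter_parse is available in recent lark versions):

from lark import Lark

grammar = """
start: A B C
A: "a"
B: "b"
C: "c"
"""
parser = Lark(grammar, parser="lalr")
ip = parser.parse_interactive("abc")

# iter_parse() feeds the lexed tokens one at a time; after each step the
# partially built tokens/subtrees sit on ip.parser_state.value_stack.
for token in ip.iter_parse():
    print(token.type, list(ip.parser_state.value_stack))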
from lark.
I am a bit confused about this incremental parsing. I want to feed it one token at a time, but it requires a Token class that also has a type. This type is the current production rule, I guess? Why is this necessary when the parse tree is constructed incrementally? The current rule should be obvious then, or not? Also, value_stack is a stack and not a tree.
Edit: It seems these types are generated by the parser, but I don't know them. I only have the tokens themselves (which should be enough; why link them all to a different string anyway?).
from lark.
Like you said, tokens have a type and a value. The value is the actual bit of text being parsed, and the type is the category of that text. The parser doesn't look at the value at all. It only looks at the type, in order to choose how to update the state-machine.
value_stack is a stack, of course; it's a stack of trees (or tokens).
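As a rough sketch of what that means in practice (toy grammar assumed), feed_token never re-checks the text against the terminal pattern; it only dispatches on the type:

from lark import Lark, Token

parser = Lark("""
start: WORD+
WORD: /[a-z]+/
""", parser="lalr")

ip = parser.parse_interactive()
ip.feed_token(Token("WORD", "hello"))
# even a value that doesn't match the WORD regexp is accepted here,
# because the parser only looks at the type
ip.feed_token(Token("WORD", "123!"))
print(ip.accepts())                 # terminal types that may come next
print(ip.parser_state.value_stack)  # the tokens/subtrees collected so far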
from lark.
Yes, but I don't know the type. I just want to add tokens (values) one at a time. I thought at first the type had something to do with the rules they are used in, but it's just another string that the parser uses instead of the actual (value) string. Seems kind of unnecessary. How would I even get these types?
from lark.
You define them. In the grammar. In the above grammar, it's the A, B, C, i.e. the left side of the :. What exactly the token types are depends on your grammar. If you use literals directly in your grammar, lark will automatically choose names. If you print parser.terminals, you can see a list of all defined terminals. This isn't unnecessary: the parser doesn't care about the original text at all, it only cares about the token types. One token type can represent an infinite number of actual text strings. If you want to skip the lexer step, you need to do the lexer's job yourself and provide the token types.
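For instance (a made-up mini grammar), literals used directly in a rule still become terminals; lark just invents names for them:

from lark import Lark

grammar = """
start: "grace" NOTE "."
NOTE: /[A-G]/
"""
parser = Lark(grammar, parser="lalr")
for t in parser.terminals:
    print(t.name, t.pattern.to_regexp())
# "grace" and "." get auto-generated names (GRACE, DOT),
# while NOTE keeps the name defined in the grammar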
from lark.
I'm sorry, I am new to this. I have the following grammar:
score: "bos" event "eos"
event: E meta* ottava? group+ "|"?
group: rest | chord | "grace" chord
rest: R P? "."*
chord: "whole" note+ "."* | "half" D note+ "."* | D note+ "."* | D P? "flag"+ note+ "."*
note: A? "tie"? N "tie"?
meta: clef | time | key
time: digit+ "over" digit+
key: "#"+ | "b"+ | "nat"+
ottava: O P
E: "treble" | "bass" | "staff_change"
R: /rest_[0-7]/
P: "start" | "continue" | "stop"
D: "up" | "dn"
A: "#" | "b" | "nat" | "x" | "bb"
N: "C" | "D" | "E" | "F" | "G" | "A" | "B"
O: "8va" | "8vb"
clef: "gclef" | "fclef"
digit: /[0-9]/
The expressions left of the ':' are my non-terminals. Now if I print the terminals
parser = Lark(grammar, parser="lalr", start='score')
print(parser.terminals)
I get, for example:
TerminalDef('E', '(?:staff_change|treble|bass)')
TerminalDef('R', 'rest_[0-7]')
TerminalDef('VBAR', '\|')
TerminalDef('GRACE', 'grace')
TerminalDef('DOT', '\.')
TerminalDef('__ANON_0', '[0-9]')
I discovered that the parser works when I use these categories as types, e.g. Token('DOT', '.'). But I didn't define (and don't know) these types. If a terminal only appears on the right side of productions, like ".", "grace", or "flag", then they get their own new type like DOT: "." or GRACE: "grace"? Should I define them myself (does every non-terminal need a production rule like nonterminal: terminal to cover them)?
Some types like 'E', 'R', and so on are defined as sets by me, but still, why is 'digit' turned into __ANON_0?
You said a lexer would typically do this. Can I then also partially lex one token at a time to get the type?
from lark.
If a terminal only appears on the right side of productions, like ".", "grace", or "flag", then they get their own new type like DOT: "." or GRACE: "grace"?
Yes, lark automatically gives names, as I mentioned before.
Should I define them myself ?
That makes the names more predictable, but it's not necessary.
does every non-terminal need a production rule like nonterminal:terminal to cover them
Not sure what you mean. Are you sure you are using the terms nonterminal and terminal correctly?
In lark, everything in uppercase is a terminal, everything in lowercase is a nonterminal.
Some types like 'E', 'R', and so on are defined as sets by me but still why is 'digit' turned into __ANON_0?
Because lark couldn't find a better name. Every terminal needs a name.
You said a lexer would typically do this. Can I then also partially lex one token at a time to get the type?
Sure, you can try using parser.lex for that.
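Something along these lines might work (toy grammar of my own; keep in mind that lex() tokenizes without any parser context, which can matter when terminal patterns overlap):

from lark import Lark

parser = Lark("""
start: E CLEF
E: "treble" | "bass"
CLEF: "gclef" | "fclef"
%import common.WS
%ignore WS
""", parser="lalr")

for tok in parser.lex("treble gclef"):
    print(tok.type, tok.value)
# should print roughly: E treble, then CLEF gclef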
from lark.
Not sure what you mean. Are you sure you are using the terms nonterminal and terminal correctly?
I am not sure. From the json_tutorial:
rule_name : list of rules and TERMINALS to match
| another possible list of items
| etc.
TERMINAL: "some text to match"
here rule_name is a non-terminal. And an example for a TERMINAL could be:
NUMBER : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
STRING : /".*?(?<!\\)"/
What throws me off is that I read that the left side of production rules must be non-terminals?! I also thought terminals are tokens like "flag" or "." in my case. But you say terminals are these categories?
Because lark couldn't find a better name. Every terminal needs a name.
But I defined a name: digit. That's kind of my point: if lark switches these names around, then I can't reliably provide them myself. I didn't think of the lexer. I thought a lexer only breaks a string of text into individual tokens, tbh. Is generating these types also part of lexing, and will they then be the same as the parser's?
from lark.
But I defined a name: digit
You defined a name for the nonterminal, not for the terminal. You need to define a name for the terminal, in uppercase.
I thought a Lexer only breaks a string of tokens into individual tokens tbh. Is generating these types also part of lexing and will they then be the same as parsers?
As I mentioned above: a big part of the job of the lexer is generating these token types. For the same grammar, the terminal names will always be the same.
from lark.
You defined a name for the nonterminal, not for the terminal. You need to define a name for the terminal, in uppercase.
Ah, because I wrote it in lowercase? I had this in a university course, and there nonterminals got capital letters; I guess that's where the confusion comes from 😖
And the type will then be the defined terminal, got it. I should then probably change my grammar to:
...
E: "treble" | "bass" | "staff_change"
R: /rest_[0-7]/
P: "start" | "continue" | "stop"
D: "up" | "dn"
A: "#" | "b" | "nat" | "x" | "bb"
N: "C" | "D" | "E" | "F" | "G" | "A" | "B"
O: "8va" | "8vb"
CLEF: "gclef" | "fclef"
DIGIT: /[0-9]/
But now I still need a way to get the type for a token.
Sure, you can try using parser.lex for that.
next(parser.lex('bos')) returns a token, but the wrong one in this case:
Token('KEY', 'b')
Shouldn't the lexer match Token('BOS', 'bos')?
This works fine when using the parser on entire input strings, so I guess the question remains: how do I use the lexer incrementally?
from lark.
After searching the source code for a while, I figured it is probably easiest to do this myself:
import re
from lark import Token

class DynamicLexer:
    def __init__(self, terminals):
        # keep a (name, regexp) pair for every terminal defined in the grammar
        self.terminals = [(terminal.name, terminal.pattern.to_regexp()) for terminal in terminals]

    def lex_token(self, token_string):
        # return a token for the first terminal whose pattern fully matches the string
        for name, pattern_str in self.terminals:
            if re.fullmatch(pattern_str, token_string):
                return Token(name, token_string)
        raise ValueError(f"Token '{token_string}' could not be matched.")
# example usage
lexer = DynamicLexer(parser.terminals)
print(lexer.lex_token('bos'))
print(lexer.lex_token('bb'))
# expect this to raise an error
print(lexer.lex_token('bbt'))
This very simple class uses the terminals from the parser (so they always match your defined grammar) and returns a correct token with type if and only if the token string can be fully matched. No clue if this is redundant or something, but I don't see how to make it work with lark directly.
One can now wrap a parser to work on token strings only:
class IncrementalParser:
    def __init__(self, parser):
        self.terminals = [(terminal.name, terminal.pattern.to_regexp()) for terminal in parser.terminals]
        self.parser = parser.parse_interactive()

    def lex_token(self, token_string):
        for name, pattern_str in self.terminals:
            if re.fullmatch(pattern_str, token_string):
                return Token(name, token_string)
        raise ValueError(f"Token '{token_string}' could not be matched.")

    def feed_token(self, token_string):
        token = self.lex_token(token_string)
        self.parser.feed_token(token)

    def accepts(self):
        return self.parser.accepts()
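A possible usage sketch (assuming grammar holds the score grammar from earlier in the thread):

parser = Lark(grammar, parser="lalr", start="score")
ip = IncrementalParser(parser)

# feed token strings one at a time and watch which terminals could follow
for tok in ["bos", "treble", "gclef", "up", "C", "."]:
    ip.feed_token(tok)
    print(tok, "->", sorted(ip.accepts()))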
from lark.