
Comments (11)

erezsh commented on June 12, 2024

It does build the parse-tree incrementally. You can find it at ip.parser_state.value_stack, where ip: InteractiveParser. But it might be a bit tricky to use it effectively.

from lark.

Daniel63656 commented on June 12, 2024

I am a bit confused about this incremental parsing. I want to feed it one token at a time, but it requires a Token class that also has a type. Is this type the current production rule? Why is this necessary when the parse tree is constructed incrementally? The current rule should be obvious then, shouldn't it? Also, value_stack is a stack and not a tree.

Edit: It seems these types are generated by the parser, but I don't know them. I only have the tokens themselves (which should be enough; why link them all to a different string anyway?)


erezsh commented on June 12, 2024

Like you said, tokens have a type and a value. The value is the actual bit of text being parsed, and the type is the category of that text. The parser doesn't look at the value at all. It only looks at the type, in order to choose how to update the state-machine.

value_stack is a stack, of course; it's a stack of trees (or tokens).


Daniel63656 commented on June 12, 2024

Yes, but I don't know the type. I just want to add tokens (values) one at a time. At first I thought the type had something to do with the rules they are used in, but it's just another string that the parser uses instead of the actual (value) string. Seems kind of unnecessary. How would I even get these types?


MegaIng commented on June 12, 2024

You define them. In the grammar. In the above grammar, it's the A, B, C, i.e. the left side of the :. What exactly the token types are depends on your grammar. If you use literals directly in your grammar, lark will automatically choose names. If you print parser.terminals, you can see a list of all defined terminals. This isn't unnecessary: the parser doesn't care about the original text at all, it only cares about the token types. One token type can represent an infinite number of actual text strings. If you want to skip the lexer step, you need to do the lexer's job yourself and provide the token types.


Daniel63656 commented on June 12, 2024

I'm sorry, I am new to this. I have the following grammar:

score: "bos" event "eos"
event: E meta* ottava? group+ "|"?
group: rest | chord | "grace" chord
rest: R P? "."*
chord: "whole" note+ "."* | "half" D note+ "."* | D note+ "."* | D P? "flag"+ note+ "."*
note: A? "tie"? N "tie"?
meta: clef | time | key
time: digit+ "over" digit+
key: "#"+ | "b"+ | "nat"+
ottava: O P

E: "treble" | "bass" | "staff_change"
R: /rest_[0-7]/
P: "start" | "continue" | "stop"
D: "up" | "dn"
A: "#" | "b" | "nat" | "x" | "bb"
N: "C" | "D" | "E" | "F" | "G" | "A" | "B"
O: "8va" | "8vb"
clef: "gclef" | "fclef"
digit: /[0-9]/

The expressions left of the ':' are my non-terminals. Now if I print the terminals
parser = Lark(grammar, parser="lalr", start='score')
print(parser.terminals)

I get for example
TerminalDef('E', '(?:staff_change|treble|bass)')
TerminalDef('R', 'rest_[0-7]')
TerminalDef('VBAR', '\|')
TerminalDef('GRACE', 'grace')
TerminalDef('DOT', '\.')
TerminalDef('__ANON_0', '[0-9]')

I discovered that the parser works when I use these categories as types, e.g. Token('DOT', '.'). But I didn't define (and don't know) these types. If a terminal only appears on the right side of productions, like ".", "grace", or "flag", does it get its own new type, like
DOT: "." or GRACE: "grace"? Should I define them myself (does every non-terminal need a production rule like nonterminal: terminal to cover them)?

Some types, like 'E', 'R', and so on, are defined as sets by me, but then why is 'digit' turned into __ANON_0?

You said a lexer would typically do this. Can I then also partially lex one token at a time to get the type?


MegaIng commented on June 12, 2024

If a terminal only appears on the right side of productions like ".", "grace", or "flag" then they get their own new type like
DOT: "." or GRACE:"grace" ?

Yes, lark automatically gives names, as I mentioned before.

Should I define them myself ?

That makes the names more predictable, but it's not necessary.

does every non-terminal need a production rule like nonterminal:terminal to cover them

Not sure what you mean. Are you sure you are using the terms nonterminal and terminal correctly?

In lark, everything in uppercase is a terminal, everything in lowercase is a nonterminal.

Some types like 'E', 'R', and so on are defined as sets by me but still why is 'digit' turned into __ANON_0?

Because lark couldn't find a better name. Every terminal needs a name.

You said a lexer would typically do this. Can I then also partially lex one token at a time to get the type?

Sure, you can try using parser.lex for that.


Daniel63656 commented on June 12, 2024

Not sure what you mean. Are you sure you are using the terms nonterminal and terminal correctly?

I am not sure. From the json_tutorial:
rule_name : list of rules and TERMINALS to match
| another possible list of items
| etc.

TERMINAL: "some text to match"

here rule_name is a non-terminal. And an example for a TERMINAL could be :
NUMBER : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
STRING : /".*?(?<!\\)"/

What throws me off is that I read that the left side of production rules must be non-terminals?! I also thought terminals are tokens like "flag" or "." in my case. But you say terminals are these categories?

Because lark couldn't find a better name. Every terminal needs a name.

But I defined a name: digit. That's kind of my point: if lark switches these names around, then I can't reliably provide them myself. I didn't think of the lexer. I thought a lexer only breaks a string into individual tokens, tbh. Is generating these types also part of lexing, and will they then be the same as the parser's?


MegaIng commented on June 12, 2024

But I defined a name: digit

You defined a name for the nonterminal, not for the terminal. You need to define a name for the terminal, in uppercase.

I thought a Lexer only breaks a string of tokens into individual tokens tbh. Is generating these types also part of lexing and will they then be the same as parsers?

As I mentioned above: a big part of the lexer's job is generating these token types. For the same grammar, it will always produce the same terminal names.


Daniel63656 commented on June 12, 2024

You defined a name for the nonterminal, not for the terminal. You need to define a name for the terminal, in uppercase.

Ah, because I wrote it in lowercase? I had this in a university course, and there non-terminals got capital letters; I guess that's where the confusion comes from 😖
And the type will then be the defined terminal, got it. I should then probably change my grammar to:

...
E: "treble" | "bass" | "staff_change"
R: /rest_[0-7]/
P: "start" | "continue" | "stop"
D: "up" | "dn"
A: "#" | "b" | "nat" | "x" | "bb"
N: "C" | "D" | "E" | "F" | "G" | "A" | "B"
O: "8va" | "8vb"
CLEF: "gclef" | "fclef"
DIGIT: /[0-9]/

But now I still need a way to get the type for a token

Sure, you can try using parser.lex for that.

next(parser.lex('bos')) returns a token, but the wrong one in this case:
Token('KEY', 'b')
Shouldn't the lexer match it to Token('BOS', 'bos')?

This works fine when using the parser on entire input strings, so I guess the question remains: how do I use a lexer incrementally?


Daniel63656 commented on June 12, 2024

After searching the source code for a while, I figured it is probably easiest to do this myself:

import re
from lark import Token

class DynamicLexer:
    def __init__(self, terminals):
        # store (name, regexp) pairs for every terminal the grammar defines
        self.terminals = [(terminal.name, terminal.pattern.to_regexp()) for terminal in terminals]

    def lex_token(self, token_string):
        for name, pattern_str in self.terminals:
            if re.fullmatch(pattern_str, token_string):
                return Token(name, token_string)
        raise ValueError(f"Token '{token_string}' could not be matched.")


# example usage
lexer = DynamicLexer(parser.terminals)
print(lexer.lex_token('bos'))
print(lexer.lex_token('bb'))
# expect this to raise an error
print(lexer.lex_token('bbt'))

This very simple class uses the terminals from the parser (so they always match your defined grammar) and returns a correctly typed token if and only if the token string can be fully matched. No clue if this is redundant, but I don't see how to make it work with lark directly.

One can now wrap a parser to work on token strings only:

class IncrementalParser:
    def __init__(self, parser):
        self.terminals = [(terminal.name, terminal.pattern.to_regexp()) for terminal in parser.terminals]
        self.parser = parser.parse_interactive()

    def lex_token(self, token_string):
        for name, pattern_str in self.terminals:
            if re.fullmatch(pattern_str, token_string):
                return Token(name, token_string)
        raise ValueError(f"Token '{token_string}' could not be matched.")
    
    def feed_token(self, token_string):
        token = self.lex_token(token_string)
        self.parser.feed_token(token)

    def accepts(self):
        return self.parser.accepts()

