Comments (11)
It does build the parse-tree incrementally. You can find it at ip.parser_state.value_stack, where ip: InteractiveParser. But it might be a bit tricky to use it effectively.
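For illustration, a minimal sketch (the toy grammar here is my own; parse_interactive only works with the LALR parser, and iter_parse is available in recent lark versions):

from lark import Lark

grammar = """
start: A B C
A: "a"
B: "b"
C: "c"
"""
parser = Lark(grammar, parser="lalr")
ip = parser.parse_interactive("abc")

# iter_parse() feeds the lexed tokens one at a time; after each step the
# partially built tokens/subtrees sit on ip.parser_state.value_stack.
for token in ip.iter_parse():
    print(token.type, list(ip.parser_state.value_stack))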
from lark.
I am a bit confused about this incremental parsing. I want to feed it one token at a time, but it requires a Token class that also has a type. This type is the current production rule, I guess? Why is this necessary when the parse tree is constructed incrementally? The current rule should be obvious then, or not? Also, value_stack is a stack and not a tree.
Edit: It seems these types are generated by the parser, but I don't know them. I only have the tokens themselves (which should be enough; why link them all to a different string anyway?).
from lark.
Like you said, tokens have a type and a value. The value is the actual bit of text being parsed, and the type is the category of that text. The parser doesn't look at the value at all. It only looks at the type, in order to choose how to update the state-machine.
value_stack is a stack, of course; it's a stack of trees (or tokens).
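As a rough sketch of what that means in practice (toy grammar assumed), feed_token never re-checks the text against the terminal pattern; it only dispatches on the type:

from lark import Lark, Token

parser = Lark("""
start: WORD+
WORD: /[a-z]+/
""", parser="lalr")

ip = parser.parse_interactive()
ip.feed_token(Token("WORD", "hello"))
# even a value that doesn't match the WORD regexp is accepted here,
# because the parser only looks at the type
ip.feed_token(Token("WORD", "123!"))
print(ip.accepts())                 # terminal types that may come next
print(ip.parser_state.value_stack)  # the tokens/subtrees collected so far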
from lark.
Yes, but I don't know the type. I just want to add tokens (values) one at a time. I thought at first the type had something to do with the rules they are used in, but it's just another string that the parser uses instead of the actual (value) string. Seems kind of unnecessary. How would I even get these types?
from lark.
You define them. In the grammar. In the above grammar, it's the A, B, C, i.e. the left side of the :. What exactly the token types are depends on your grammar. If you use literals directly in your grammar, lark will automatically choose names. If you print parser.terminals, you can see a list of all defined terminals. This isn't unnecessary: the parser doesn't care about the original text at all, it only cares about the token types. One token type can represent an infinite number of actual text strings. If you want to skip the lexer step, you need to do the lexer's job yourself and provide the token types.
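For instance (a made-up mini grammar), literals used directly in a rule still become terminals; lark just invents names for them:

from lark import Lark

grammar = """
start: "grace" NOTE "."
NOTE: /[A-G]/
"""
parser = Lark(grammar, parser="lalr")
for t in parser.terminals:
    print(t.name, t.pattern.to_regexp())
# "grace" and "." get auto-generated names (GRACE, DOT),
# while NOTE keeps the name defined in the grammar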
from lark.
I'm sorry, I am new to this. I have the following grammar:
score: "bos" event "eos"
event: E meta* ottava? group+ "|"?
group: rest | chord | "grace" chord
rest: R P? "."*
chord: "whole" note+ "."* | "half" D note+ "."* | D note+ "."* | D P? "flag"+ note+ "."*
note: A? "tie"? N "tie"?
meta: clef | time | key
time: digit+ "over" digit+
key: "#"+ | "b"+ | "nat"+
ottava: O P
E: "treble" | "bass" | "staff_change"
R: /rest_[0-7]/
P: "start" | "continue" | "stop"
D: "up" | "dn"
A: "#" | "b" | "nat" | "x" | "bb"
N: "C" | "D" | "E" | "F" | "G" | "A" | "B"
O: "8va" | "8vb"
clef: "gclef" | "fclef"
digit: /[0-9]/
The expressions left of the ':' are my non-terminals. Now if I print the terminals
parser = Lark(grammar, parser="lalr", start='score')
print(parser.terminals)
I get, for example:
TerminalDef('E', '(?:staff_change|treble|bass)')
TerminalDef('R', 'rest_[0-7]')
TerminalDef('VBAR', '\|')
TerminalDef('GRACE', 'grace')
TerminalDef('DOT', '\.')
TerminalDef('__ANON_0', '[0-9]')
I discovered that the parser works when I use these categories as types, e.g. Token('DOT', '.'). But I didn't define (and don't know) these types. If a terminal only appears on the right side of productions, like ".", "grace", or "flag", then they get their own new type like DOT: "." or GRACE: "grace"? Should I define them myself (does every non-terminal need a production rule like nonterminal: terminal to cover them)?
Some types like 'E', 'R', and so on are defined as sets by me, but still, why is 'digit' turned into __ANON_0?
You said a lexer would typically do this. Can I then also partially lex one token at a time to get the type?
from lark.
If a terminal only appears on the right side of productions, like ".", "grace", or "flag", then they get their own new type like DOT: "." or GRACE: "grace"?
Yes, lark automatically gives names, as I mentioned before.
Should I define them myself ?
That makes the names more predictable, but it's not necessary.
does every non-terminal need a production rule like nonterminal:terminal to cover them
Not sure what you mean. Are you sure you are using the terms nonterminal and terminal correctly?
In lark, everything in uppercase is a terminal, everything in lowercase is a nonterminal.
Some types like 'E', 'R', and so on are defined as sets by me but still why is 'digit' turned into __ANON_0?
Because lark couldn't find a better name. Every terminal needs a name.
You said a lexer would typically do this. Can I then also partially lex one token at a time to get the type?
Sure, you can try using parser.lex for that.
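Something along these lines might work (toy grammar of my own; keep in mind that lex() tokenizes without any parser context, which can matter when terminal patterns overlap):

from lark import Lark

parser = Lark("""
start: E CLEF
E: "treble" | "bass"
CLEF: "gclef" | "fclef"
%import common.WS
%ignore WS
""", parser="lalr")

for tok in parser.lex("treble gclef"):
    print(tok.type, tok.value)
# should print roughly: E treble, then CLEF gclef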
from lark.
Not sure what you mean. Are you sure you are using the terms nonterminal and terminal correctly?
I am not sure. From the json_tutorial:
rule_name : list of rules and TERMINALS to match
| another possible list of items
| etc.
TERMINAL: "some text to match"
here rule_name is a non-terminal. And an example for a TERMINAL could be:
NUMBER : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
STRING : /".*?(?<!\\)"/
What throws me off is that I read that the left side of production rules must be non-terminals?! I also thought terminals are tokens like "flag" or "." in my case. But you say terminals are these categories?
Because lark couldn't find a better name. Every terminal needs a name.
But I defined a name: digit. That's kind of my point: if lark switches these names around, then I can't reliably provide them myself. I didn't think of the lexer. I thought a lexer only breaks a string of text into individual tokens, tbh. Is generating these types also part of lexing, and will they then be the same as the parser's?
from lark.
But I defined a name: digit
You defined a name for the nonterminal, not for the terminal. You need to define a name for the terminal, in uppercase.
I thought a Lexer only breaks a string of tokens into individual tokens tbh. Is generating these types also part of lexing and will they then be the same as parsers?
As I mentioned above: a big part of the job of the lexer is generating these token types. For the same grammar, the terminal names will always be the same.
from lark.
You defined a name for the nonterminal, not for the terminal. You need to define a name for the terminal, in uppercase.
Ah, because I wrote it in lowercase? I had this in a university course, and there nonterminals got capital letters; I guess that's where the confusion comes from 😖
And the type will then be the defined terminal, got it. I should then probably change my grammar to:
...
E: "treble" | "bass" | "staff_change"
R: /rest_[0-7]/
P: "start" | "continue" | "stop"
D: "up" | "dn"
A: "#" | "b" | "nat" | "x" | "bb"
N: "C" | "D" | "E" | "F" | "G" | "A" | "B"
O: "8va" | "8vb"
CLEF: "gclef" | "fclef"
DIGIT: /[0-9]/
But now I still need a way to get the type for a token.
Sure, you can try using parser.lex for that.
next(parser.lex('bos')) returns a token, but the wrong one in this case:
Token('KEY', 'b')
Shouldn't the lexer match Token('BOS', 'bos')?
This works fine when using the parser on entire input strings, so I guess the question remains: how do I use the lexer incrementally?
from lark.
After searching the source code for a while, I figured it is probably easiest to do this myself:
import re
from lark import Token

class DynamicLexer:
    def __init__(self, terminals):
        # keep a (name, regexp) pair for every terminal defined in the grammar
        self.terminals = [(terminal.name, terminal.pattern.to_regexp()) for terminal in terminals]

    def lex_token(self, token_string):
        # return a token for the first terminal whose pattern fully matches the string
        for name, pattern_str in self.terminals:
            if re.fullmatch(pattern_str, token_string):
                return Token(name, token_string)
        raise ValueError(f"Token '{token_string}' could not be matched.")
# example usage
lexer = DynamicLexer(parser.terminals)
print(lexer.lex_token('bos'))
print(lexer.lex_token('bb'))
# expect this to raise an error
print(lexer.lex_token('bbt'))
This very simple class uses the terminals from the parser (so they always match your defined grammar) and returns a correct token with type if and only if the token string can be fully matched. No clue if this is redundant or something, but I don't see how to make it work with lark directly.
One can now wrap a parser to work on token strings only:
class IncrementalParser:
    def __init__(self, parser):
        self.terminals = [(terminal.name, terminal.pattern.to_regexp()) for terminal in parser.terminals]
        self.parser = parser.parse_interactive()

    def lex_token(self, token_string):
        for name, pattern_str in self.terminals:
            if re.fullmatch(pattern_str, token_string):
                return Token(name, token_string)
        raise ValueError(f"Token '{token_string}' could not be matched.")

    def feed_token(self, token_string):
        token = self.lex_token(token_string)
        self.parser.feed_token(token)

    def accepts(self):
        return self.parser.accepts()
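A possible usage sketch (assuming grammar holds the score grammar from earlier in the thread):

parser = Lark(grammar, parser="lalr", start="score")
ip = IncrementalParser(parser)

# feed token strings one at a time and watch which terminals could follow
for tok in ["bos", "treble", "gclef", "up", "C", "."]:
    ip.feed_token(tok)
    print(tok, "->", sorted(ip.accepts()))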
from lark.