lalrpop / lalrpop
LR(1) parser generator for Rust
Home Page: http://lalrpop.github.io/lalrpop/
License: Apache License 2.0
We want the ability to add your own tokenizer. This should permit very lightweight specifications but scale to really complex things. I envision the first step as generating a tokenizer based on the terminals that people use (#4) but it'd be nice to actually just permit tokenizer specifications as well, where people can write custom action code based on the strings that have been recognized.
Some things I think we'll want:
- Tok, for just one token.
- (), we expect you to return zero tokens.
- (Tok, Tok), you always return two tokens.
- Vec<Tok>, we expect you to return a dynamic number of tokens.
Just trying to follow the tutorial with the nightly (1.4) build under Windows I get E0277
warnings at src/lexer/nfa/mod.rs
lines 293 and 294 and then this at the end:
Compiling lalrpop v0.5.0
failed to run custom build command for `lalrpop v0.5.0`
Process didn't exit successfully: `...\target\debug\build\lalrpop-f7721ad1ac348df2\build-script-build` (exit code: 101)
--- stderr
thread '<main>' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 5, message: "Access is denied." } }', ../src/libcore\result.rs:734
I really have no idea what this could be about; do you? FWIW, build-script-build.exe is executable and can be run without arguments, but I don't know how Cargo runs it (it doesn't show any details even with -v).
This works:
Phrase: String = <a:r#"""#> <s:r#"[^"]*"#> <b:r#"""#> => s.to_owned();
This doesn't:
Phrase: String = <a:r"\""> <s:r#"[^"]*"#> <b:r"\""> => s.to_owned();
lalrpop escapes " into \", which is not what I'd expect.
If you define a rule like this:
thing = {
"a" => "("
};
You'll get a parse error at semicolon like this:
--- stdout
src/huh.lalrpop:5:2: 5:2 error: unexpected token: `;`
};
^
For the given grammar:
use std::str::FromStr;
grammar;
pub Id: String = <s:r"[_a-zA-Z][_a-zA-Z0-9]*"> => String::from(s);
pub Expr: Vec<String> = {
<i:(Id ",")*> <u:Id?> => {
let mut total_vec: Vec<String> = i.into_iter().map(|t|{t.0}).collect();
if u.is_some() {
total_vec.push(u.unwrap());
}
total_vec
}
};
pub Name: String = {
"HEADER(" <s:Id> ")" => s
};
pub SEMI_SEP = r";+";
pub NEWLINE_SEP = r"\n+"; // will only compile with "\n+"; the regex doesn't work here.
File<Sep>: (String, Vec<Vec<String>>) = {
<name:Name> <lines:(Sep Expr)*> => (name, lines.into_iter().map(|t|{t.1}).collect::<Vec<Vec<String>>>())
};
pub NewlineFile = File<NEWLINE_SEP>; // WILL NOT COMPILE
pub SemiFile = File<SEMI_SEP>;
and the following test:
pub mod test;
#[test]
fn semi_test() {
let (name, ids) = test::parse_SemiFile(r#"HEADER(this_is_the_name);id1,id2;id3;id4,id5;;id6"#).unwrap();
assert_eq!(String::from("this_is_the_name"), name);
assert_eq!(4, ids.len());
}
#[test]
fn newline_test() {
let (name, ids) = test::parse_NewlineFile(r#"HEADER(this_is_the_name)
id1,id2
id3
id4,id5
id6"#).unwrap();
assert_eq!(String::from("this_is_the_name"), name);
assert_eq!(4, ids.len());
}
fn main() {}
semi_test
works flawlessly; newline_test
fails. If using the raw string (i.e. the regular expression instead of the newline literal) it doesn't even compile, and if using the normal string, it simply fails.
So—am I misunderstanding something here, or is newline special cased to not work where the semicolon would work?
I'm trying to replicate part of the rust grammar as found at https://github.com/rust-lang/rust/blob/master/src/grammar/parser-lalr.y and operator precedence is vital for not having my grammar be a huge mess of layers that encode the precedence.
Naturally, LALRPop should self-host. I think that's a good primary goal to work towards. Here is what is needed:
I imagine that we can have a cargo package called something like "lalrpop-bootstrap" that is just a clone of a fixed version of lalrpop which we update periodically, and lalrpop can depend on that.
In the ParseError
variant UnrecognizedToken
, I included a spot to list the tokens that would have been accepted -- but I never wrote the code to fill it in. It might also be interesting to include nonterminals, though I imagine many tools would just want to screen them out. One question is how to format this list -- do we use the names from the grammar? The regular expressions? I guess so, not much else to use, unless we add a way for users to configure what gets added in the lalrpop file.
cc @LukasKalbertodt, who was asking around this.
There are a number of constraints we need to enforce:
- ~X and ~foo:Y within one production is bad
- ~~X makes no sense (perhaps this should just not parse)
If I try to use version 0.8.0 of LALRPOP, I get the following build failure on both a Windows 7 and a Windows 10 machine. The Rust version was 1.5 stable. I ran into #41, which isn't fixed in 0.7.0, so I am kinda blocked by this issue.
Compiling lalrpop v0.8.0
failed to run custom build command for `lalrpop v0.8.0`
Process didn't exit successfully: `C:\Users\Mikko\Desktop\lalrpop_test\target\debug\build\lalrpop29dda9a615e8fbeb\build-script-build` (exit code: 101)
--- stderr
thread '<main>' panicked at 'called `Result::unwrap()` on an `Err` value: Error
{ repr: Os { code: 5, message: "Access is denied." } }', ../src/libcore\result.rs:738
All the other parts of LALRPOP, i.e. lalrpop-util, lalrpop-snap and lalrpop-intern compile just fine.
We need to check that named and anon symbols are not combined. This should be part of a validation step.
I want to use this library to parse the OpenDDL language, which is interesting in that whitespace is irrelevant at any position. As such, is there a way to tell the library to ignore certain characters?
Also, there is a small exception to this. In a string, whitespace is relevant. So I would like to ignore whitespace in all places but one.
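A hand-written external lexer can express exactly this "skip whitespace everywhere except inside strings" rule. Here is a minimal sketch in plain Rust; the Tok type and lex function are invented for illustration and are not lalrpop APIs:

```rust
// Hypothetical external lexer: whitespace is skipped between tokens,
// but preserved verbatim inside double-quoted strings.
#[derive(Debug, PartialEq)]
enum Tok {
    Word(String),
    Str(String),
}

fn lex(input: &str) -> Vec<Tok> {
    let mut toks = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next(); // insignificant whitespace: skip
        } else if c == '"' {
            chars.next();
            let mut s = String::new();
            // inside a string, whitespace is significant and kept as-is
            while let Some(c2) = chars.next() {
                if c2 == '"' { break; }
                s.push(c2);
            }
            toks.push(Tok::Str(s));
        } else {
            let mut w = String::new();
            while let Some(&c2) = chars.peek() {
                if c2.is_whitespace() || c2 == '"' { break; }
                w.push(c2);
                chars.next();
            }
            toks.push(Tok::Word(w));
        }
    }
    toks
}
```

The parser would then consume the Tok stream instead of the raw string.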
One of the features I've wanted to have for a long time is inlined nonterminals. The idea is that if you declare a nonterminal as #[inline]:
#[inline]
A = B | C | ();
then every use of A
will be expanded out to its various alternatives. In other words, a reference like:
X = A "foo";
would become:
X = B "foo" | C "foo" | "foo";
It has been my experience that inlining nonterminals is very helpful for removing shift-reduce conflicts, particularly around optional content. Accordingly, the X*
and X?
shorthands would expand to inlined nonterminals.
This should solve shift-reduce conflicts like:
T = "L" | "&" "L"? T
Here we get a conflict in the state & (*) L
, where it's unclear whether the L
is the optional L
or part of the T
. In the latter case, we'd have to reduce () => "L"?
. But if we inlined the "L"?
expansion, we'd have:
T = "L" | "&" T | "&" "L" T
In which case, we could just shift the "L"
and wait until we see the next token to decide what to do.
User annotation is needed here because inlining affects the visible order of execution.
Note though that inlining is only really useful in an LR(1)
setting. If we chose to adopt some kind of universal algorithm by default this would become less important.
The ability to disable implicit skipping of whitespace would be useful. Is it possible to add this as an option at the top of the file or somewhere else? I'm trying to parse a language with significant whitespace before tokens which is being automatically trimmed. If disabling that behavior is already possible, I was unable to find out how.
Title says it all.
I wondered if it would be possible to accept several types as input. Right now, as I understand it, only &str
is a valid input. Suppose someone already wrote a lexer and now has a token stream. Or someone actually has a non string "thing" that needs to be parsed somehow. Or, god forbid, someone parsing a non-UTF8 string (&[u8]
).
In my case: I already wrote a Java 8 lexer and wondered if I could use a/this parser generator on top of my lexer.
So are there any plans to parse something else than a &str
?
A bit offtopic:
Java 8 has global unicode escapes: You can write \uXXXX anywhere in your source file and the corresponding unicode char is substituted for the escape sequence. I wonder if this would be easily implementable with this parser generator. Of course one could preprocess the source string before passing it to lalrpop, but that is not particularly nice.
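For what it's worth, the preprocessing route is only a few lines. A simplified sketch (it ignores Java's full rules, e.g. runs of repeated `u`s, escaped backslashes, and surrogate pairs):

```rust
// Rewrite every \uXXXX escape into its character before handing the
// source to the parser. Deliberately simplified relative to the JLS.
fn substitute_unicode_escapes(src: &str) -> String {
    let mut out = String::new();
    let chars: Vec<char> = src.chars().collect();
    let mut i = 0;
    while i < chars.len() {
        if chars[i] == '\\' && i + 5 < chars.len() && chars[i + 1] == 'u' {
            let hex: String = chars[i + 2..i + 6].iter().collect();
            if let Ok(n) = u32::from_str_radix(&hex, 16) {
                if let Some(c) = char::from_u32(n) {
                    out.push(c);
                    i += 6;
                    continue;
                }
            }
        }
        out.push(chars[i]);
        i += 1;
    }
    out
}
```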
As the generated parser only needs one token of lookahead, it should be able to use an iterator instead of a string, and not require the input to end with EOF (in other words, allow infinite data).
This has the benefit that when reading from, for example, a socket, you do not have to wait until all data has arrived before you can start parsing.
Specifically, for an issue I am having, I want to use lalrpop in two stages, a tokenizer and a parser. That would allow reading tokens without either reading all tokens at once or never matching because the tokenizer doesn't find EOF.
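The shape being asked for here is essentially a lazy tokenizer: an Iterator over tokens that pulls characters on demand. A sketch of that shape in plain Rust (WordTokens is an invented name, and it merely splits on whitespace to show the streaming structure):

```rust
// A tokenizer that implements Iterator, so a downstream parser can
// pull tokens one at a time instead of waiting for EOF.
struct WordTokens<I: Iterator<Item = char>> {
    chars: std::iter::Peekable<I>,
}

impl<I: Iterator<Item = char>> Iterator for WordTokens<I> {
    type Item = String;
    fn next(&mut self) -> Option<String> {
        // skip separators between tokens
        while matches!(self.chars.peek(), Some(c) if c.is_whitespace()) {
            self.chars.next();
        }
        let mut w = String::new();
        while let Some(&c) = self.chars.peek() {
            if c.is_whitespace() { break; }
            w.push(c);
            self.chars.next();
        }
        if w.is_empty() { None } else { Some(w) }
    }
}
```

Because the source is itself an iterator, this works equally well over a string, a file, or bytes arriving from a socket.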
Take for example this grammar definition:
grammar;
pub a = {
"a" => ","
};
pub aa = {
"a" => "a"
};
src/a.lalrpop:8:12: 8:12 error: unterminated string literal; missing `"`?
"a" => "a"
^
Currently S/R and R/R failures give you some pretty opaque error messages that basically just dump the state of the LR-table construction process. We can do better. Menhir, for example, tries to give more "grammar-centric" errors, and I'd like to do the same. Another option is also to identify common scenarios and try to target those in particular, as well.
Some examples of scenarios worth targeting:
(Sorry if this isn't the right place for this question; this just seemed like the easiest way to get help on this)
I'm attempting to write a parser for JavaScript using lalrpop, and I'm struggling to disambiguate the post-increment/decrement operators (++
, --
) from regular addition and subtraction (+
, -
). Currently, the parser file looks like this:
// Statement nonterminal
pub Stmt: Stmt = {
<Var> "=" <Exp> ";" => Stmt::Assign(<>),
"var" <Var> "=" <Exp> ";" => Stmt::Decl(<>),
};
// Expression nonterminal
pub Exp: Exp = {
<e:Exp> <o:AddOp> <m:MulExp> => Exp::BinExp(Box::new(e), o, Box::new(m)),
MulExp,
};
AddOp: BinOp = {
"+" => BinOp::Plus,
"-" => BinOp::Minus,
};
// Parses multiplication
MulExp: Exp = {
<m:MulExp> <o:MulOp> <t:Term> => Exp::BinExp(Box::new(m), o, Box::new(t)),
Term,
};
MulOp: BinOp = {
"*" => BinOp::Star,
"/" => BinOp::Slash,
};
// Parses numbers and parenthetical expressions, and variables
Term: Exp = {
Float => Exp::Float(<>),
Var => Exp::Var(<>),
"(" <Exp> ")",
};
Float: f64 = {
r"-?[0-9]+(\.[0-9]+)?" => f64::from_str(<>).unwrap()
};
Var: String = {
r"[A-Za-z_][0-9A-Za-z_]*" => String::from(<>)
};
And the AST looks like this (excluding method/trait implementations I've defined for the structs):
pub enum BinOp {
Minus,
Plus,
Slash,
Star,
}
pub enum Exp {
BinExp(Box<Exp>, BinOp, Box<Exp>),
Float(f64),
Var(String),
PostDec(Box<Exp>),
PostInc(Box<Exp>),
PreDec(Box<Exp>),
PreInc(Box<Exp>),
}
pub enum Stmt {
Assign(String, Exp),
Decl(String, Exp),
}
(Sorry, I know that's quite a bit to look at!)
My issue is that if I add the ability to parse post-incrementation operations, e.g. x++
, then I get a shift-reduce conflict because the parser can't disambiguate whether to shift the first +
or to reduce and start parsing an addition expression (and similarly for post-decrementation and subtraction).
I was wondering if there is any functionality in lalrpop at this time to be able to disambiguate between these two cases. I admit that my knowledge of parsing is not extremely advanced, so I may be missing some obvious way to deal with these cases. Any tips/suggestions would be greatly appreciated.
Thanks!
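One conventional fix, sketched below under the assumption that ++ and -- arrive as single tokens, is to give the postfix operators their own tier that binds tighter than multiplication, so the parser commits to the postfix form before it ever has to weigh shifting a + against reducing an addition. The PostfixExp name is invented:

```
// Hypothetical extra precedence tier between MulExp and Term:
MulExp: Exp = {
    <m:MulExp> <o:MulOp> <t:PostfixExp> => Exp::BinExp(Box::new(m), o, Box::new(t)),
    PostfixExp,
};

PostfixExp: Exp = {
    <e:PostfixExp> "++" => Exp::PostInc(Box::new(e)),
    <e:PostfixExp> "--" => Exp::PostDec(Box::new(e)),
    Term,
};
```

This is the same layering idea already used for AddOp versus MulOp, extended one level down.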
I'd like to create an AST, but instead of Box
ing branches, I'd like to use an arena for the allocations.
Instead of
pub enum Expr {
Number(i32),
Op(Box<Expr>, Opcode, Box<Expr>),
}
I'd like
pub enum Expr<'a> {
Number(i32),
Op(Box<&'a Expr<'a>>, Opcode, Box<&'a Expr<'a>>),
}
However, this requires plumbing an arena all the way through the parser. Is this possible yet?
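One workaround that sidesteps lifetimes entirely, while grammar-level context parameters are still open, is an index-based arena: nodes hold plain IDs rather than references, so nothing lifetime-shaped needs to be plumbed through the generated parser's types. A sketch with invented names:

```rust
// An arena that hands out indices instead of references.
#[derive(Copy, Clone, Debug, PartialEq)]
struct ExprId(usize);

enum Expr {
    Number(i32),
    Op(ExprId, char, ExprId),
}

struct ExprArena {
    nodes: Vec<Expr>,
}

impl ExprArena {
    fn new() -> Self { ExprArena { nodes: Vec::new() } }
    fn alloc(&mut self, e: Expr) -> ExprId {
        self.nodes.push(e);
        ExprId(self.nodes.len() - 1)
    }
    fn get(&self, id: ExprId) -> &Expr {
        &self.nodes[id.0]
    }
}
```

Action code then only needs access to a &mut ExprArena, and the AST types stay lifetime-free.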
Is there any way to import one parser file's nonterminals for use in another? The .lalrpop
file I'm writing has been getting pretty large, so I was hoping to modularize it a bit. However, I'm not sure how to go about doing it (or if it's possible at all).
I have a .lalrpop
file in my project that results in a shift-reduce conflict. When building, lalrpop partially generates the .rs
file which looks like this:
#![allow(unused_imports)]
#![allow(unused_variables)]
// here are my imports from the lalrpop file
extern crate lalrpop_util as __lalrpop_util;
use self::__lalrpop_util::ParseError as __ParseError;
This file then prevents lalrpop from rebuilding the lalrpop
file if it wasn't modified. When just executing cargo build
again, rustc
says that the function parse_GoalSymbol
does not exist. It's even worse when you are just testing whether some lalrpop file is valid (without using the pub symbols in your Rust code): then the second cargo build works without failure, which is super confusing.
I think the best solution is to not write anything to the .rs
file until everything was processed correctly.
You should be able to use the struct
and enum
keywords to generate type definitions from a nonterminal. These should, I think, look as much as possible like their Rust counterparts:
struct BinaryExpr {
<left:Expr>,
<op:Op>,
<right:Expr>,
}
this is equivalent to BinaryExpr: BinaryExpr = <left:Expr> <op:Op> <right:Expr>;
combined with a type definition like:
pub struct BinaryExpr {
left: Expr,
op: BinaryOp,
right: Expr,
}
Now, it'd probably also be nice to be able to make enums somehow, but these are a bit trickier. I imagine we might want a way to make "aliases", e.g.:
enum BinaryOp {
Add = "+",
Sub = "-",
...
}
One thing that's unclear to me is whether you should be able to just dump random type definitions into your LALRPop grammar as well? That might be handy for cases where the grammar rules don't line up so well with the type, but you still want everything in one place for now.
And for sure we'll want the ability to #[derive]
, and probably some kind of auto-generated "pretty printing" trait that uses the grammar to generate back something like the original string that was input.
This feature will definitely take some tweaking to get right!
The current documentation does not explain fallible rules -- that is, rules that produce a Result
which counts as a parse error.
You should be able to thread context values and types through your grammar. I envision that you should be able to write something like:
grammar<'input, T:Eq>(value: &'input T) {
...
}
and then reference 'input
and T
from the types of nonterminals and value
from the action code. More specifically, I imagine your action code would be given an &mut
borrow of value
.
One of the main uses for this is so that when we synthesize a tokenizer, we can have a lifetime ('input
) corresponding to the input string. I'm not entirely clear though on whether that connects to this? Perhaps the tokenizer should just operate on an iterator over (usize, char)
pairs (like str.char_indices()
) and users can choose to supply the string as context input?
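For reference, str::char_indices already produces exactly those (usize, char) pairs, so a tokenizer built on it can report byte spans without holding on to the string itself. A small illustration that computes the spans of whitespace-separated words:

```rust
// Consume (usize, char) pairs and report (start, end) byte spans.
fn spans_of_words(input: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start = None;
    for (i, c) in input.char_indices() {
        if c.is_whitespace() {
            if let Some(s) = start.take() {
                spans.push((s, i));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        spans.push((s, input.len()));
    }
    spans
}
```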
src
is currently the hardcoded path at which processing starts, which results in a pretty ugly "No such file or directory" error without further information when this directory doesn't exist.
Could it just start in the current directory instead, or would that cause problems? (I don't think Cargo passes the path to the crate's root file to build scripts, which could provide another good starting directory)
If you create a .lalrpop file that contains only the line grammar;
, you will get a panic when building your project.
$ cargo build
Compiling miniprolog v0.1.0 (file:///Users/dagit/local-data/sandbox/rust/miniprolog)
failed to run custom build command for `miniprolog v0.1.0 (file:///Users/dagit/local-data/sandbox/rust/miniprolog)`
Process didn't exit successfully: `/Users/dagit/local-data/sandbox/rust/miniprolog/target/debug/build/miniprolog-50573cc21f0fa1d2/build-script-build` (exit code: 101)
--- stderr
thread '<main>' panicked at 'assertion failed: !other_transitions.is_empty()', /Users/dagit/.multirust/toolchains/stable/cargo/registry/src/github.com-0a35038f75765ae4/lalrpop-0.6.1/src/lexer/dfa/mod.rs:174
stack backtrace:
1: 0x104c69255 - sys::backtrace::write::h71ee98355e9ff89fUss
2: 0x104c727a0 - panicking::on_panic::h3058b136d38637c267w
3: 0x104c2bb52 - rt::unwind::begin_unwind_inner::h1a353d5ea12e1abeVBw
4: 0x1042a09bc - rt::unwind::begin_unwind::h5864343035433208673
5: 0x10436af11 - lexer::dfa::DFABuilder<'nfa>::build::h72d848d8eba64160CJi
6: 0x104363138 - lexer::dfa::build_dfa::h76649dab246e12c8P2h
7: 0x104482f29 - normalize::token_check::construct::h8a5ede6247deacc07Tp
8: 0x104418e22 - normalize::token_check::validate::hb4e530ee2dfc6236MIp
9: 0x10441731b - normalize::normalize_helper::ha10a27530ab278bcBpo
10: 0x1042a0b54 - normalize::normalize::h8de0f84f9cda2da8ppo
11: 0x104280068 - build::parse_and_normalize_grammar::h8860a3dd1118077eoVa
12: 0x1042774ec - build::process_dir::h489121058815528936
13: 0x104276b78 - build::process_root::hfd9d6d1192435efeeNa
14: 0x104268105 - main::h241e4db26acdb9e6faa
15: 0x104c7200d - __rust_try
16: 0x104c7372d - rt::lang_start::hd654f015947477d622w
17: 0x10426881e - main
If the file is completely empty you get a parse error instead of a panic.
Is there a way to do this?
pub LiteralDecimal<Out>: Out = <int:r"([0-9]_?)*[0-9]"> => int.parse::<Out>();
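On the Rust side, what that action code needs is a FromStr bound and the turbofish form of parse; the action would also have to handle the Result, since parse is fallible. A sketch of the helper it would boil down to (parse_decimal is an invented name):

```rust
use std::str::FromStr;

// Parse a decimal literal that may contain `_` separators into any
// FromStr type, propagating the parse error.
fn parse_decimal<Out: FromStr>(digits: &str) -> Result<Out, Out::Err> {
    // strip the optional `_` separators before parsing
    let cleaned: String = digits.chars().filter(|&c| c != '_').collect();
    cleaned.parse::<Out>()
}
```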
The current tokenizer generation always uses two fixed precedence categories, so all regular expressions have equal weight. This is useful for giving a keyword like "class" precedence over an identifier regex, but there are times when we would like to give some regexes higher precedence than others. For example, if parsing a case-insensitive language like Pascal, you would like to use regexes like r"[vV][aA][rR]" rather than having to enumerate all combinations. But this won't work because of precedence conflicts. Another problem is that the tokenizer implicitly skips whitespace, and there is no way to extend this set to skip other things, like comments.
I've been contemplating a match
declaration to address prioritization, which might look like:
match {
r"[vV][aA][rR]",
r"[a-zA-Z][a-zA-Z0-9]+",
}
The idea here is that when you have tokens listed in a match declaration, we can create custom precedence levels, so that e.g. here the "var" regex takes precedence over the identifier regex. Tokens not listed would be implicitly added to the end of the list, with literals first and regexes second.
I'm less clear on what to do here. I contemplated adding things to the match declaration with an "empty action", which would signal "do nothing":
match {
r"{.*}" => { } // pascal style comment
}
or having something like an empty if declaration:
if r"{.*}";
I think I prefer the first alternative, but it doesn't seem great. Another thing that is unclear is if the implicit whitespace skipping should be disabled or retained. I think I prefer to retain it unless the user asks for it to be disabled, because it's always a bit surprising when you add something and find it implicitly removes another default. That is, adding comments to the list of things to skip implicitly removes whitespace. But not having the implicit whitespace at all feels like clearly the wrong default.
Eventually I'd like to support lex-like specifications, where tokenizers can have a state stack -- and perhaps go further as described in #10. It'd be nice if we chose a syntax here that scaled gracefully to that case.
So some uncertainties here!
The current tutorial and documentation does not explain how one writes and integrates a custom lexer.
When I try to use the following grammar I get this error in dmesg
and the build script stops with signal 4:
traps: build-script-bu[7480] trap invalid opcode ip:7f1febafb8f0 sp:7fff55edf848 error:0 in libstd-198068b3.so[7f1feb9ff000+1c1000]
Grammar (It's just an example I can reproduce this with other grammars as well)
grammar;
pub Message: Vec<String> =
<Token*>;
Token: Token = {
Ipv4,
Prec1
};
Prec1 = {
Float,
Prec2
};
Prec2 = {
r".*" => { <>.to_string() }
};
Ipv4: Token = {
"." <o0:Octet> "." <o1:Octet> "." <o2:Octet> "." <o3:Octet> => {
format!("{}.{}.{}.{}", o0, o1, o2, o3)
}
};
Octet =
r"(25[0-5]|2[0-4][0-9]|[01]?[1-9][0-9]?)\.";
Float: Token = {
r"[-+]?[0-9]*\.?[0-9]+" => { <>.to_string() }
};
Attempting to do another snapshot, I encountered a bug where the 'input
lifetime parameter is incorrectly pruned from the Nonterminal
type for lrgrammar.lalrpop
. Attempting to isolate into a smaller test case now.
We need to check for invalid terminals, they should be either a valid Rust ID or else a defined alias.
Neat project! I'm not sure if you knew this or not, but in the last few months, the regex
crate grew a sub-crate called regex-syntax
which exposes a full blown, well tested, regex parser/AST tree: https://doc.rust-lang.org/regex/regex_syntax/index.html
I briefly looked over yours, and it looks like it supports everything except negative matches. Not sure if that's a deal breaker, but thought I'd drop a note anyway!
According to @dagit, LALRPOP reports a lexer ambiguity for a grammar like:
use syntax::*;
grammar;
pub VAR = {
r"[A-Z][_a-zA-Z0-9]*"
};
STRING = {
r#"""# <r#"[^"]*"#> r#"""#
};
which produces output:
src/a.lalrpop:10:11: 10:20 error: ambiguity detected between the terminal `r#"[^\"]*"#` and the terminal `r#"[A-Z][_a-zA-Z0-9]*"#`
r#"""# <r#"[^"]*"#> r#"""#
~~~~~~~~~~
This seems to be incorrect.
Hi,
First of all, thanks for this great library :)
I'd like to instantiate different implementations of objects with the same grammar. This is like an abstract factory where the grammar uses the factory to create the objects.
If I could define a generic type parameter with a bound on a parser then I could use the factory.
What do you think about this use-case?
So I'd like to add arbitrary F: Factory-like bounds to every parse_* function.
In the new "explanatory errors", we will expand a reduce as much as we have to in order to show whence the lookahead arises. This means we will find some concrete symbol (or list of symbols) that has the lookahead in its first set. However, those symbols can be nonterminals. It'd be nice if we expanded them to show the actual lookahead as a concrete token.
Here is an example grammar where this problem arises:
grammar;
pub E: () = {
"L",
"&" OPT_L E
};
OPT_L: () = {
(),
"L"
};
You get an error report from this with a lookahead of "L"
and symbols like this as the "reduce" example:
"&" ╷ ╷ E
│ └─OPT_L─┘ │
└─E───────────┘
Note that you can't actually see the "L"
here. I think it'd be clearer if we expanded the E
and hence printed something like:
"&" ╷ ╷ "L"
│ └─OPT_L─┘ │ │
│ └E┘
└─E─────────────┘
This might better explain the ambiguity. I think the best place to do this would probably be rather late in the error-reporting code, since right now we have some nice invariants about the structure of the reductions (e.g., they are properly nested) which would be disturbed. We would also have to make the rendering code for Example
a bit smarter in order to handle having that invariant disturbed.
If we're going to self-host LALRPop, writing a tokenizer by hand is probably the easiest first step. After all, it needs some funky bits, like handling r#"foo"#
strings and code blocks.
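The raw-string part really is the fiddly bit: the lexer must count the #s after r, then scan for a closing quote followed by at least that many #s. A sketch of that scan (raw_string_end is an invented helper):

```rust
// Given input starting at `r`, return the index just past the closing
// delimiter of a raw string like r#"foo"#, or None if unterminated.
fn raw_string_end(input: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    if bytes.first() != Some(&b'r') { return None; }
    // count the opening `#`s
    let hashes = bytes[1..].iter().take_while(|&&b| b == b'#').count();
    if bytes.get(1 + hashes) != Some(&b'"') { return None; }
    let mut i = 2 + hashes;
    while i < bytes.len() {
        // a `"` followed by at least `hashes` `#`s closes the string
        if bytes[i] == b'"'
            && bytes[i + 1..].iter().take_while(|&&b| b == b'#').count() >= hashes
        {
            return Some(i + 1 + hashes);
        }
        i += 1;
    }
    None
}
```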
Probably would be nice to clean up the warnings 😄
I'm trying to write a grammar along the lines of parser.mly
in this MiniML project. However, for the type recognition, I'm getting this error:
Process didn't exit successfully: `/...` (exit code: 1)
--- stdout
when in this state:
Ty = Ty (*) "->" Ty [EOF]
Ty = Ty (*) "->" Ty ["->"]
Ty = Ty "->" Ty (*) [EOF]
Ty = Ty "->" Ty (*) ["->"]
and looking at a token `"->"`,
we can reduce to a `Ty`
but we can also shift
My reduced case of the issue is as follows:
ast.rs
#[derive(Debug)]
pub enum Type {
Int,
Bool,
Arrow(Box<Type>, Box<Type>),
}
lalrfail.lalrpop
use ast::Type;
grammar;
pub Ty: Type = {
"int" => Type::Int,
"bool" => Type::Bool,
<t1:Ty> "->" <t2:Ty> => Type::Arrow(Box::new(t1), Box::new(t2)),
};
I believe this is what you mean in the tutorial when you say you aim to someday cover:
Advice for resolving shift-reduce and reduce-reduce conflicts
but while the other TODO items at least have links to things I could explore to try to figure out on my own, I'm at a dead end here.
Until you have a chance to write that up, is there a small explanation somewhere of what's going on and maybe some high level tips for fixing it? If you can get me started, maybe I can figure it out for myself and then put something in the wiki about it, if you want.
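Until that write-up exists: Ty = Ty "->" Ty is ambiguous for an input like int -> bool -> int, and the LR construction surfaces that as the shift/reduce conflict above. The usual resolution is to pick an associativity by hand; for an arrow type you normally want right associativity, which you get by restricting the left operand to an atomic tier. A sketch (ATy is an invented name):

```
pub Ty: Type = {
    <t1:ATy> "->" <t2:Ty> => Type::Arrow(Box::new(t1), Box::new(t2)),
    ATy,
};

ATy: Type = {
    "int" => Type::Int,
    "bool" => Type::Bool,
};
```

With this layering, int -> bool -> int parses unambiguously as int -> (bool -> int).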
I am puzzled by how the library handles whitespace. I have the following syntax:
pub Identifier: String = <ident:r"[\_A-z][\_A-z0-9]*"> => String::from(ident);
I have a unit test that tests this syntax on " Word" (with a leading space). It passes. Why is this? It should fail.
According to @dagit, the following grammar produces infinite memory usage, presumably this is a bug in the DFA/NFA code:
grammar;
pub period: () = {
r"." => ()
};
If I try to build LALRPOP on the latest nightly, the build fails rather spectacularly by first spewing out a couple thousand warnings and then hitting an internal compiler error. Changing the machine and/or the OS doesn't affect the result. Also, both 0.7.0 and the current master branch fail in exactly the same way. Here is a log of the failing build:
In this grammar:
grammar;
pub E: () = {
"X" "{" <a:AT*> <e:ET> <b:AT*> "}" => (),
};
AT: () = {
"type" ";"
};
ET: () = {
"enum" "{" "}"
};
inlining produces a state like this, which is wrong:
// State 3
// AT = (*) "type" ";" ["enum"]
// AT = (*) "type" ";" ["type"]
// AT+ = (*) AT ["enum"]
// AT+ = (*) AT ["type"]
// AT+ = (*) AT+ AT ["enum"]
// AT+ = (*) AT+ AT ["type"]
// E = "X" "{" (*) AT+ ET AT+ "}" [EOF] <---- ???
// E = "X" "{" (*) ET "}" [EOF]
// ET = (*) "enum" "{" "}" ["}"]
//
// "enum" -> Shift(S7)
// "type" -> Shift(S8)
//
// AT -> S4
// AT+ -> S5
// ET -> S6
In particular, we are missing "X" "{" AT+ ET "}"
, which will cause parse failures.
grammar;
pub Thing = {
r"(%%|[^%])+"
};
We've got a lot of the pieces, but you should be able to have LALRPop generate a tokenizer for you. The idea to start is roughly that terminals like "foo" will be interpreted as literals and that we add r"foo"
for regular expressions. You can sprinkle them wherever and things "just work".
Hello,
i try to parse php expressions using lalrpop:
https://github.com/timglabisch/rustphp/blob/c79060d6495a55174fc2ad5710d5774d8ec94d67/src/calculator1.lalrpop
my problem is that cargo run becomes incredibly slow:
time cargo run
1234.76s user 19.88s system 99% cpu 20:58.78 total
the generated file is huge (~50 MB):
cat src/calculator1.rs | wc -l
1044621
is there something fundamentally wrong or is this expected?
With this simple grammar:
grammar;
#[inline]
Foo = { "a", "b" };
I get:
src/grammar/test.lalrpop:3:3: 3:8 error: unrecognized annotation `inline`
#[inline]
~~~~~~
I suppose this isn't correct, is it? I read a bit through the lalrpop-source files and it seems that this error really shouldn't happen. The output is in normalize/prevalidation.rs:83
. Maybe something is wrong with the string interner?
Something like this ~~X
or ~n:~X
makes no sense.