lalrpop / lalrpop
LR(1) parser generator for Rust
Home Page: http://lalrpop.github.io/lalrpop/
License: Apache License 2.0
We want the ability to add your own tokenizer. This should permit very lightweight specifications but scale to really complex things. I envision the first step as generating a tokenizer based on the terminals that people use (#4) but it'd be nice to actually just permit tokenizer specifications as well, where people can write custom action code based on the strings that have been recognized.
Some things I think we'll want:
- Tok, for just one token.
- (), we expect you to return zero tokens.
- (Tok, Tok), you always return two tokens.
- Vec<Tok>, we expect you to return a dynamic number of tokens.
Just trying to follow the tutorial with the nightly (1.4) build under Windows I get E0277
warnings at src/lexer/nfa/mod.rs
lines 293 and 294 and then this at the end:
Compiling lalrpop v0.5.0
failed to run custom build command for `lalrpop v0.5.0`
Process didn't exit successfully: `...\target\debug\build\lalrpop-f7721ad1ac348df2\build-script-build` (exit code: 101)
--- stderr
thread '<main>' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 5, message: "Access is denied." } }', ../src/libcore\result.rs:734
I really have no idea what this could be about; do you? FWIW, build-script-build.exe is executable and can be run without arguments, but I don't know how Cargo runs it (it doesn't show any details even with -v).
This works:
Phrase: String = <a:r#"""#> <s:r#"[^"]*"#> <b:r#"""#> => s.to_owned();
This doesn't:
Phrase: String = <a:r"\""> <s:r#"[^"]*"#> <b:r"\""> => s.to_owned();
lalrpop escapes " into \", which is not what I'd expect.
If you define a rule like this:
thing = {
"a" => "("
};
You'll get a parse error at semicolon like this:
--- stdout
src/huh.lalrpop:5:2: 5:2 error: unexpected token: `;`
};
^
For the given grammar:
use std::str::FromStr;
grammar;
pub Id: String = <s:r"[_a-zA-Z][_a-zA-Z0-9]*"> => String::from(s);
pub Expr: Vec<String> = {
<i:(Id ",")*> <u:Id?> => {
let mut total_vec: Vec<String> = i.into_iter().map(|t|{t.0}).collect();
if u.is_some() {
total_vec.push(u.unwrap());
}
total_vec
}
};
pub Name: String = {
"HEADER(" <s:Id> ")" => s
};
pub SEMI_SEP = r";+";
pub NEWLINE_SEP = r"\n+"; // will only compile with "\n+"; the regex doesn't work here.
File<Sep>: (String, Vec<Vec<String>>) = {
<name:Name> <lines:(Sep Expr)*> => (name, lines.into_iter().map(|t|{t.1}).collect::<Vec<Vec<String>>>())
};
pub NewlineFile = File<NEWLINE_SEP>; // WILL NOT COMPILE
pub SemiFile = File<SEMI_SEP>;
and the following test:
pub mod test;
#[test]
fn semi_test() {
let (name, ids) = test::parse_SemiFile(r#"HEADER(this_is_the_name);id1,id2;id3;id4,id5;;id6"#).unwrap();
assert_eq!(String::from("this_is_the_name"), name);
assert_eq!(4, ids.len());
}
#[test]
fn newline_test() {
let (name, ids) = test::parse_NewlineFile(r#"HEADER(this_is_the_name)
id1,id2
id3
id4,id5
id6"#).unwrap();
assert_eq!(String::from("this_is_the_name"), name);
assert_eq!(4, ids.len());
}
fn main() {}
semi_test
works flawlessly; newline_test
fails. If using the raw string (i.e. the regular expression instead of the newline literal) it doesn't even compile, and if using the normal string, it simply fails.
So—am I misunderstanding something here, or is newline special cased to not work where the semicolon would work?
I'm trying to replicate part of the rust grammar as found at https://github.com/rust-lang/rust/blob/master/src/grammar/parser-lalr.y and operator precedence is vital for not having my grammar be a huge mess of layers that encode the precedence.
Naturally, LALRPop should self-host. I think that's a good primary goal to work towards. Here is what is needed:
I imagine that we can have a cargo package called something like "lalrpop-bootstrap" that is just a clone of a fixed version of lalrpop which we update periodically, and lalrpop can depend on that.
In the ParseError
variant UnrecognizedToken
, I included a spot to list the tokens that would have been accepted -- but I never wrote the code to fill it in. It might also be interesting to include nonterminals, though I imagine many tools would just want to screen them out. One question is how to format this list -- do we use the names from the grammar? The regular expressions? I guess so, not much else to use, unless we add a way for users to configure what gets added in the lalrpop file.
cc @LukasKalbertodt, who was asking around this.
There are a number of constraints we need to enforce:
- ~X and ~foo:Y within one production is bad
- ~~X makes no sense (perhaps this should just not parse)
If I try to use version 0.8.0 of LALRPOP, I get the following build failure on both a Windows 7 and a Windows 10 machine. The Rust version was 1.5 stable. I ran into #41, which isn't fixed in 0.7.0, so I am kinda blocked by this issue.
Compiling lalrpop v0.8.0
failed to run custom build command for `lalrpop v0.8.0`
Process didn't exit successfully: `C:\Users\Mikko\Desktop\lalrpop_test\target\debug\build\lalrpop29dda9a615e8fbeb\build-script-build` (exit code: 101)
--- stderr
thread '<main>' panicked at 'called `Result::unwrap()` on an `Err` value: Error
{ repr: Os { code: 5, message: "Access is denied." } }', ../src/libcore\result.rs:738
All the other parts of LALRPOP, i.e. lalrpop-util, lalrpop-snap and lalrpop-intern compile just fine.
We need to check that named and anon symbols are not combined. This should be part of a validation step.
I want to use this library to parse the OpenDDL language, which is interesting in that whitespace is irrelevant at any position. As such, is there a way to tell the library to ignore certain characters?
Also, there is a small exception to this. In a string, whitespace is relevant. So I would like to ignore whitespace in all places but one.
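A hand-written external lexer can express exactly this "skip whitespace everywhere except inside strings" rule. Here is a minimal sketch in plain Rust; the Tok type and lex function are invented for illustration and are not lalrpop APIs:

```rust
// Hypothetical external lexer: whitespace is skipped between tokens,
// but preserved verbatim inside double-quoted strings.
#[derive(Debug, PartialEq)]
enum Tok {
    Word(String),
    Str(String),
}

fn lex(input: &str) -> Vec<Tok> {
    let mut toks = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next(); // insignificant whitespace: skip
        } else if c == '"' {
            chars.next();
            let mut s = String::new();
            // inside a string, whitespace is significant and kept as-is
            while let Some(c2) = chars.next() {
                if c2 == '"' { break; }
                s.push(c2);
            }
            toks.push(Tok::Str(s));
        } else {
            let mut w = String::new();
            while let Some(&c2) = chars.peek() {
                if c2.is_whitespace() || c2 == '"' { break; }
                w.push(c2);
                chars.next();
            }
            toks.push(Tok::Word(w));
        }
    }
    toks
}
```

The parser would then consume the Tok stream instead of the raw string.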
One of the features I've wanted to have for a long time is inlined nonterminals. The idea is that if you declare a nonterminal as #[inline]:
#[inline]
A = B | C | ();
then every use of A
will be expanded out to its various alternatives. In other words, a reference like:
X = A "foo";
would become:
X = B "foo" | C "foo" | "foo";
It has been my experience that inlining nonterminals is very helpful for removing shift-reduce conflicts, particularly around optional content. Accordingly, the X*
and X?
shorthands would expand to inlined nonterminals.
This should solve shift-reduce conflicts like:
T = "L" | "&" "L"? T
Here we get a conflict in the state & (*) L
, where it's unclear whether the L
is the optional L
or part of the T
. In the latter case, we'd have to reduce () => "L"?
. But if we inlined the "L"?
expansion, we'd have:
T = "L" | "&" T | "&" "L" T
In which case, we could just shift the "L"
and wait until we see the next token to decide what to do.
User annotation is needed here because inlining affects the visible order of execution.
Note though that inlining is only really useful in an LR(1)
setting. If we chose to adopt some kind of universal algorithm by default this would become less important.
The ability to disable implicit skipping of whitespace would be useful. Is it possible to add this as an option at the top of the file or somewhere else? I'm trying to parse a language with significant whitespace before tokens which is being automatically trimmed. If disabling that behavior is already possible, I was unable to find out how.
Title says it all.
I wondered if it would be possible to accept several types as input. Right now, as I understand it, only &str
is a valid input. Suppose someone already wrote a lexer and now has a token stream. Or someone actually has a non string "thing" that needs to be parsed somehow. Or, god forbid, someone parsing a non-UTF8 string (&[u8]
).
In my case: I already wrote a Java 8 lexer and wondered if I could use a/this parser generator on top of my lexer.
So are there any plans to parse something else than a &str
?
A bit offtopic:
Java 8 has global unicode escapes: You can write \uXXXX anywhere in your source file and the corresponding unicode char is substituted for the escape sequence. I wonder if this would be easily implementable with this parser generator. Of course one could preprocess the source string before passing it to lalrpop, but that is not particularly nice.
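For what it's worth, the preprocessing route is only a few lines. A simplified sketch (it ignores Java's full rules, e.g. runs of repeated `u`s, escaped backslashes, and surrogate pairs):

```rust
// Rewrite every \uXXXX escape into its character before handing the
// source to the parser. Deliberately simplified relative to the JLS.
fn substitute_unicode_escapes(src: &str) -> String {
    let mut out = String::new();
    let chars: Vec<char> = src.chars().collect();
    let mut i = 0;
    while i < chars.len() {
        if chars[i] == '\\' && i + 5 < chars.len() && chars[i + 1] == 'u' {
            let hex: String = chars[i + 2..i + 6].iter().collect();
            if let Ok(n) = u32::from_str_radix(&hex, 16) {
                if let Some(c) = char::from_u32(n) {
                    out.push(c);
                    i += 6;
                    continue;
                }
            }
        }
        out.push(chars[i]);
        i += 1;
    }
    out
}
```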
As the generated parser only needs one token of lookahead, it should be able to use an iterator instead of a string, and not require the input to end with EOF (in other words, allow infinite data).
This has the benefit that when reading from, for example, a socket, you do not have to wait until all data has arrived before you can start parsing.
Specifically, for an issue I am having, I want to use lalrpop in two stages, a tokenizer and a parser. That would allow reading tokens without either reading all tokens at once or never matching because the tokenizer doesn't find EOF.
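The shape being asked for here is essentially a lazy tokenizer: an Iterator over tokens that pulls characters on demand. A sketch of that shape in plain Rust (WordTokens is an invented name, and it merely splits on whitespace to show the streaming structure):

```rust
// A tokenizer that implements Iterator, so a downstream parser can
// pull tokens one at a time instead of waiting for EOF.
struct WordTokens<I: Iterator<Item = char>> {
    chars: std::iter::Peekable<I>,
}

impl<I: Iterator<Item = char>> Iterator for WordTokens<I> {
    type Item = String;
    fn next(&mut self) -> Option<String> {
        // skip separators between tokens
        while matches!(self.chars.peek(), Some(c) if c.is_whitespace()) {
            self.chars.next();
        }
        let mut w = String::new();
        while let Some(&c) = self.chars.peek() {
            if c.is_whitespace() { break; }
            w.push(c);
            self.chars.next();
        }
        if w.is_empty() { None } else { Some(w) }
    }
}
```

Because the source is itself an iterator, this works equally well over a string, a file, or bytes arriving from a socket.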
Take for example this grammar definition:
grammar;
pub a = {
"a" => ","
};
pub aa = {
"a" => "a"
};
src/a.lalrpop:8:12: 8:12 error: unterminated string literal; missing `"`?
"a" => "a"
^
Currently S/R and R/R failures give you some pretty opaque error messages that basically just dump the state of the LR-table construction process. We can do better. Menhir, for example, tries to give more "grammar-centric" errors, and I'd like to do the same. Another option is also to identify common scenarios and try to target those in particular, as well.
Some examples of scenarios worth targeting:
(Sorry if this isn't the right place for this question; this just seemed like the easiest way to get help on this)
I'm attempting to write a parser for JavaScript using lalrpop, and I'm struggling to disambiguate the post-increment/decrement operators (++
, --
) from regular addition and subtraction (+
, -
). Currently, the parser file looks like this:
// Statement nonterminal
pub Stmt: Stmt = {
<Var> "=" <Exp> ";" => Stmt::Assign(<>),
"var" <Var> "=" <Exp> ";" => Stmt::Decl(<>),
};
// Expression nonterminal
pub Exp: Exp = {
<e:Exp> <o:AddOp> <m:MulExp> => Exp::BinExp(Box::new(e), o, Box::new(m)),
MulExp,
};
AddOp: BinOp = {
"+" => BinOp::Plus,
"-" => BinOp::Minus,
};
// Parses multiplication
MulExp: Exp = {
<m:MulExp> <o:MulOp> <t:Term> => Exp::BinExp(Box::new(m), o, Box::new(t)),
Term,
};
MulOp: BinOp = {
"*" => BinOp::Star,
"/" => BinOp::Slash,
};
// Parses numbers and parenthetical expressions, and variables
Term: Exp = {
Float => Exp::Float(<>),
Var => Exp::Var(<>),
"(" <Exp> ")",
};
Float: f64 = {
r"-?[0-9]+(\.[0-9]+)?" => f64::from_str(<>).unwrap()
};
Var: String = {
r"[A-Za-z_][0-9A-Za-z_]*" => String::from(<>)
};
And the AST looks like this (excluding method/trait implementations I've defined for the structs):
pub enum BinOp {
Minus,
Plus,
Slash,
Star,
}
pub enum Exp {
BinExp(Box<Exp>, BinOp, Box<Exp>),
Float(f64),
Var(String),
PostDec(Box<Exp>),
PostInc(Box<Exp>),
PreDec(Box<Exp>),
PreInc(Box<Exp>),
}
pub enum Stmt {
Assign(String, Exp),
Decl(String, Exp),
}
(Sorry, I know that's quite a bit to look at!)
My issue is that if I add the ability to parse post-incrementation operations, e.g. x++
, then I get a shift-reduce conflict because the parser can't disambiguate whether to shift the first +
or to reduce and start parsing an addition expression (and similarly for post-decrementation and subtraction).
I was wondering if there is any functionality in lalrpop at this time to be able to disambiguate between these two cases. I admit that my knowledge of parsing is not extremely advanced, so I may be missing some obvious way to deal with these cases. Any tips/suggestions would be greatly appreciated.
Thanks!
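One conventional fix, sketched below under the assumption that ++ and -- arrive as single tokens, is to give the postfix operators their own tier that binds tighter than multiplication, so the parser commits to the postfix form before it ever has to weigh shifting a + against reducing an addition. The PostfixExp name is invented:

```
// Hypothetical extra precedence tier between MulExp and Term:
MulExp: Exp = {
    <m:MulExp> <o:MulOp> <t:PostfixExp> => Exp::BinExp(Box::new(m), o, Box::new(t)),
    PostfixExp,
};

PostfixExp: Exp = {
    <e:PostfixExp> "++" => Exp::PostInc(Box::new(e)),
    <e:PostfixExp> "--" => Exp::PostDec(Box::new(e)),
    Term,
};
```

This is the same layering idea already used for AddOp versus MulOp, extended one level down.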
I'd like to create an AST, but instead of Box
ing branches, I'd like to use an arena for the allocations.
Instead of
pub enum Expr {
Number(i32),
Op(Box<Expr>, Opcode, Box<Expr>),
}
I'd like
pub enum Expr<'a> {
Number(i32),
Op(Box<&'a Expr<'a>>, Opcode, Box<&'a Expr<'a>>),
}
However, this requires plumbing an arena all the way through the parser. Is this possible yet?
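One workaround that sidesteps lifetimes entirely, while grammar-level context parameters are still open, is an index-based arena: nodes hold plain IDs rather than references, so nothing lifetime-shaped needs to be plumbed through the generated parser's types. A sketch with invented names:

```rust
// An arena that hands out indices instead of references.
#[derive(Copy, Clone, Debug, PartialEq)]
struct ExprId(usize);

enum Expr {
    Number(i32),
    Op(ExprId, char, ExprId),
}

struct ExprArena {
    nodes: Vec<Expr>,
}

impl ExprArena {
    fn new() -> Self { ExprArena { nodes: Vec::new() } }
    fn alloc(&mut self, e: Expr) -> ExprId {
        self.nodes.push(e);
        ExprId(self.nodes.len() - 1)
    }
    fn get(&self, id: ExprId) -> &Expr {
        &self.nodes[id.0]
    }
}
```

Action code then only needs access to a &mut ExprArena, and the AST types stay lifetime-free.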
Is there any way to import one parser file's nonterminals for use in another? The .lalrpop
file I'm writing has been getting pretty large, so I was hoping to modularize it a bit. However, I'm not sure how to go about doing it (or if it's possible at all).
I have a .lalrpop
file in my project that results in a shift-reduce conflict. When building, lalrpop partially generates the .rs
file which looks like this:
#![allow(unused_imports)]
#![allow(unused_variables)]
// here are my imports from the lalrpop file
extern crate lalrpop_util as __lalrpop_util;
use self::__lalrpop_util::ParseError as __ParseError;
This file then prevents lalrpop from rebuilding the lalrpop
file if it wasn't modified. When just executing cargo build
again, rustc
says that the function parse_GoalSymbol
does not exist. It's even worse when you are just testing whether some lalrpop file is valid (without using the pub symbols in your Rust code): then the second cargo build works without failure, which is super confusing.
I think the best solution is to not write anything to the .rs
file until everything was processed correctly.
You should be able to use the struct
and enum
keywords to generate type definitions from a nonterminal. These should, I think, look as much as possible like their Rust counterparts:
struct BinaryExpr {
<left:Expr>,
<op:Op>,
<right:Expr>,
}
this is equivalent to BinaryExpr: BinaryExpr = <left:Expr> <op:Op> <right:Expr>;
combined with a type definition like:
pub struct BinaryExpr {
left: Expr,
op: BinaryOp,
right: Expr,
}
Now, it'd probably also be nice to be able to make enums somehow, but these are a bit trickier. I imagine we might want a way to make "aliases", e.g.:
enum BinaryOp {
Add = "+",
Sub = "-",
...
}
One thing that's unclear to me is whether you should be able to just dump random type definitions into your LALRPop grammar as well? That might be handy for cases where the grammar rules don't line up so well with the type, but you still want everything in one place for now.
And for sure we'll want the ability to #[derive]
, and probably some kind of auto-generated "pretty printing" trait that uses the grammar to generate back something like the original string that was input.
This feature will definitely take some tweaking to get right!
The current documentation does not explain fallible rules -- that is, rules that produce a Result
which counts as a parse error.
You should be able to thread context values and types through your grammar. I envision that you should be able to write something like:
grammar<'input, T:Eq>(value: &'input T) {
...
}
and then reference 'input
and T
from the types of nonterminals and value
from the action code. More specifically, I imagine your action code would be given an &mut
borrow of value
.
One of the main uses for this is so that when we synthesize a tokenizer, we can have a lifetime ('input
) corresponding to the input string. I'm not entirely clear though on whether that connects to this? Perhaps the tokenizer should just operate on an iterator over (usize, char)
pairs (like str.char_indices()
) and users can choose to supply the string as context input?
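For reference, str::char_indices already produces exactly those (usize, char) pairs, so a tokenizer built on it can report byte spans without holding on to the string itself. A small illustration that computes the spans of whitespace-separated words:

```rust
// Consume (usize, char) pairs and report (start, end) byte spans.
fn spans_of_words(input: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start = None;
    for (i, c) in input.char_indices() {
        if c.is_whitespace() {
            if let Some(s) = start.take() {
                spans.push((s, i));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        spans.push((s, input.len()));
    }
    spans
}
```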
src
is currently the hardcoded path at which processing starts, which results in a pretty ugly "No such file or directory" error without further information when this directory doesn't exist.
Could it just start in the current directory instead, or would that cause problems? (I don't think Cargo passes the path to the crate's root file to build scripts, which could provide another good starting directory)
If you create a .lalrpop file that contains only the line grammar;
, you will get a panic when building your project.
$ cargo build
Compiling miniprolog v0.1.0 (file:///Users/dagit/local-data/sandbox/rust/miniprolog)
failed to run custom build command for `miniprolog v0.1.0 (file:///Users/dagit/local-data/sandbox/rust/miniprolog)`
Process didn't exit successfully: `/Users/dagit/local-data/sandbox/rust/miniprolog/target/debug/build/miniprolog-50573cc21f0fa1d2/build-script-build` (exit code: 101)
--- stderr
thread '<main>' panicked at 'assertion failed: !other_transitions.is_empty()', /Users/dagit/.multirust/toolchains/stable/cargo/registry/src/github.com-0a35038f75765ae4/lalrpop-0.6.1/src/lexer/dfa/mod.rs:174
stack backtrace:
1: 0x104c69255 - sys::backtrace::write::h71ee98355e9ff89fUss
2: 0x104c727a0 - panicking::on_panic::h3058b136d38637c267w
3: 0x104c2bb52 - rt::unwind::begin_unwind_inner::h1a353d5ea12e1abeVBw
4: 0x1042a09bc - rt::unwind::begin_unwind::h5864343035433208673
5: 0x10436af11 - lexer::dfa::DFABuilder<'nfa>::build::h72d848d8eba64160CJi
6: 0x104363138 - lexer::dfa::build_dfa::h76649dab246e12c8P2h
7: 0x104482f29 - normalize::token_check::construct::h8a5ede6247deacc07Tp
8: 0x104418e22 - normalize::token_check::validate::hb4e530ee2dfc6236MIp
9: 0x10441731b - normalize::normalize_helper::ha10a27530ab278bcBpo
10: 0x1042a0b54 - normalize::normalize::h8de0f84f9cda2da8ppo
11: 0x104280068 - build::parse_and_normalize_grammar::h8860a3dd1118077eoVa
12: 0x1042774ec - build::process_dir::h489121058815528936
13: 0x104276b78 - build::process_root::hfd9d6d1192435efeeNa
14: 0x104268105 - main::h241e4db26acdb9e6faa
15: 0x104c7200d - __rust_try
16: 0x104c7372d - rt::lang_start::hd654f015947477d622w
17: 0x10426881e - main
If the file is completely empty you get a parse error instead of a panic.
Is there a way to do this?
pub LiteralDecimal<Out>: Out = <int:r"([0-9]_?)*[0-9]"> => int.parse::<Out>();
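On the Rust side, what that action code needs is a FromStr bound and the turbofish form of parse; the action would also have to handle the Result, since parse is fallible. A sketch of the helper it would boil down to (parse_decimal is an invented name):

```rust
use std::str::FromStr;

// Parse a decimal literal that may contain `_` separators into any
// FromStr type, propagating the parse error.
fn parse_decimal<Out: FromStr>(digits: &str) -> Result<Out, Out::Err> {
    // strip the optional `_` separators before parsing
    let cleaned: String = digits.chars().filter(|&c| c != '_').collect();
    cleaned.parse::<Out>()
}
```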
The current tokenizer generation always uses two fixed precedence categories, so all regular expressions have equal weight. This is useful for giving a keyword like "class" precedence over an identifier regex, but there are times when we would like to give some regexes higher precedence than others. For example, if parsing a case-insensitive language like Pascal, you would like to use regexes like r"[vV][aA][rR]" rather than having to enumerate all combinations. But this won't work because of precedence conflicts. Another problem is that the tokenizer implicitly skips whitespace, and there is no way to extend this set to skip other things, like comments.
I've been contemplating a match
declaration to address prioritization, which might look like:
match {
r"[vV][aA][rR]",
r"[a-zA-Z][a-zA-Z0-9]+",
}
The idea here is that when you have tokens listed in a match declaration, we can create custom precedence levels, so that e.g. here the "var" regex takes precedence over the identifier regex. Tokens not listed would be implicitly added to the end of the list, with literals first and regexes second.
I'm less clear on what to do here. I contemplated adding things to the match declaration with an "empty action", which would signal "do nothing":
match {
r"{.*}" => { } // pascal style comment
}
or having something like an empty if declaration:
if r"{.*}";
I think I prefer the first alternative, but it doesn't seem great. Another thing that is unclear is if the implicit whitespace skipping should be disabled or retained. I think I prefer to retain it unless the user asks for it to be disabled, because it's always a bit surprising when you add something and find it implicitly removes another default. That is, adding comments to the list of things to skip implicitly removes whitespace. But not having the implicit whitespace at all feels like clearly the wrong default.
Eventually I'd like to support lex-like specifications, where tokenizers can have a state stack -- and perhaps go further as described in #10. It'd be nice if we chose a syntax here that scaled gracefully to that case.
So some uncertainties here!
The current tutorial and documentation does not explain how one writes and integrates a custom lexer.
When I try to use the following grammar I get this error in dmesg
and the build script stops with signal 4:
traps: build-script-bu[7480] trap invalid opcode ip:7f1febafb8f0 sp:7fff55edf848 error:0 in libstd-198068b3.so[7f1feb9ff000+1c1000]
Grammar (It's just an example I can reproduce this with other grammars as well)
grammar;
pub Message: Vec<String> =
<Token*>;
Token: Token = {
Ipv4,
Prec1
};
Prec1 = {
Float,
Prec2
};
Prec2 = {
r".*" => { <>.to_string() }
};
Ipv4: Token = {
"." <o0:Octet> "." <o1:Octet> "." <o2:Octet> "." <o3:Octet> => {
format!("{}.{}.{}.{}", o0, o1, o2, o3)
}
};
Octet =
r"(25[0-5]|2[0-4][0-9]|[01]?[1-9][0-9]?)\.";
Float: Token = {
r"[-+]?[0-9]*\.?[0-9]+" => { <>.to_string() }
};
Attempting to do another snapshot, I encountered a bug where the 'input
lifetime parameter is incorrectly pruned from the Nonterminal
type for lrgrammar.lalrpop
. Attempting to isolate into a smaller test case now.
We need to check for invalid terminals, they should be either a valid Rust ID or else a defined alias.
Neat project! I'm not sure if you knew this or not, but in the last few months, the regex
crate grew a sub-crate called regex-syntax
which exposes a full blown, well tested, regex parser/AST tree: https://doc.rust-lang.org/regex/regex_syntax/index.html
I briefly looked over yours, and it looks like it supports everything except negative matches. Not sure if that's a deal breaker, but thought I'd drop a note anyway!
According to @dagit, LALRPOP reports a lexer ambiguity for a grammar like:
use syntax::*;
grammar;
pub VAR = {
r"[A-Z][_a-zA-Z0-9]*"
};
STRING = {
r#"""# <r#"[^"]*"#> r#"""#
};
which produces output:
src/a.lalrpop:10:11: 10:20 error: ambiguity detected between the terminal `r#"[^\"]*"#` and the terminal `r#"[A-Z][_a-zA-Z0-9]*"#`
r#"""# <r#"[^"]*"#> r#"""#
~~~~~~~~~~
This seems to be incorrect.
Hi,
First of all, thanks for this great library :)
I'd like to instantiate different implementations of objects with the same grammar. This is like an abstract factory where the grammar uses the factory to create the objects.
If I could define a generic type parameter with a bound on a parser then I could use the factory.
What do you think about this use-case?
So I'd like to add arbitrary F: Factory-like bounds to every parse_* function.
In the new "explanatory errors", we will expand a reduce as much as we have to in order to show whence the lookahead arises. This means we will find some concrete symbol (or list of symbols) that has the lookahead in its first set. However, those symbols can be nonterminals. It'd be nice if we expanded them to show the actual lookahead as a concrete token.
Here is an example grammar where this problem arises:
grammar;
pub E: () = {
"L",
"&" OPT_L E
};
OPT_L: () = {
(),
"L"
};
You get an error report from this with a lookahead of "L"
and symbols like this as the "reduce" example:
"&" ╷ ╷ E
│ └─OPT_L─┘ │
└─E───────────┘
Note that you can't actually see the "L"
here. I think it'd be clearer if we expanded the E
and hence printed something like:
"&" ╷ ╷ "L"
│ └─OPT_L─┘ │ │
│ └E┘
└─E─────────────┘
This might better explain the ambiguity. I think the best place to do this would probably be rather late in the error-reporting code, since right now we have some nice invariants about the structure of the reductions (e.g., they are properly nested) which would be disturbed. We would also have to make the rendering code for Example
a bit smarter in order to handle having that invariant disturbed.
If we're going to self-host LALRPop, writing a tokenizer by hand is probably the easiest first step. After all, it needs some funky bits, like handling r#"foo"#
strings and code blocks.
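The raw-string part really is the fiddly bit: the lexer must count the #s after r, then scan for a closing quote followed by at least that many #s. A sketch of that scan (raw_string_end is an invented helper):

```rust
// Given input starting at `r`, return the index just past the closing
// delimiter of a raw string like r#"foo"#, or None if unterminated.
fn raw_string_end(input: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    if bytes.first() != Some(&b'r') { return None; }
    // count the opening `#`s
    let hashes = bytes[1..].iter().take_while(|&&b| b == b'#').count();
    if bytes.get(1 + hashes) != Some(&b'"') { return None; }
    let mut i = 2 + hashes;
    while i < bytes.len() {
        // a `"` followed by at least `hashes` `#`s closes the string
        if bytes[i] == b'"'
            && bytes[i + 1..].iter().take_while(|&&b| b == b'#').count() >= hashes
        {
            return Some(i + 1 + hashes);
        }
        i += 1;
    }
    None
}
```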
Probably would be nice to clean up the warnings 😄
I'm trying to write a grammar along the lines of parser.mly
in this MiniML project. However, for the type recognition, I'm getting this error:
Process didn't exit successfully: `/...` (exit code: 1)
--- stdout
when in this state:
Ty = Ty (*) "->" Ty [EOF]
Ty = Ty (*) "->" Ty ["->"]
Ty = Ty "->" Ty (*) [EOF]
Ty = Ty "->" Ty (*) ["->"]
and looking at a token `"->"`,
we can reduce to a `Ty`
but we can also shift
My reduced case of the issue is as follows:
ast.rs
#[derive(Debug)]
pub enum Type {
Int,
Bool,
Arrow(Box<Type>, Box<Type>),
}
lalrfail.lalrpop
use ast::Type;
grammar;
pub Ty: Type = {
"int" => Type::Int,
"bool" => Type::Bool,
<t1:Ty> "->" <t2:Ty> => Type::Arrow(Box::new(t1), Box::new(t2)),
};
I believe this is what you mean in the tutorial when you say you aim to someday cover:
Advice for resolving shift-reduce and reduce-reduce conflicts
but while the other TODO items at least have links to things I could explore to try to figure out on my own, I'm at a dead end here.
Until you have a chance to write that up, is there a small explanation somewhere of what's going on and maybe some high level tips for fixing it? If you can get me started, maybe I can figure it out for myself and then put something in the wiki about it, if you want.
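Until that write-up exists: Ty = Ty "->" Ty is ambiguous for an input like int -> bool -> int, and the LR construction surfaces that as the shift/reduce conflict above. The usual resolution is to pick an associativity by hand; for an arrow type you normally want right associativity, which you get by restricting the left operand to an atomic tier. A sketch (ATy is an invented name):

```
pub Ty: Type = {
    <t1:ATy> "->" <t2:Ty> => Type::Arrow(Box::new(t1), Box::new(t2)),
    ATy,
};

ATy: Type = {
    "int" => Type::Int,
    "bool" => Type::Bool,
};
```

With this layering, int -> bool -> int parses unambiguously as int -> (bool -> int).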
I am puzzled by how the library handles whitespace. I have the following syntax:
pub Identifier: String = <ident:r"[\_A-z][\_A-z0-9]*"> => String::from(ident);
I have a unit test that tests this syntax on " Word" (with a leading space). It passes. Why is this? It should fail.
According to @dagit, the following grammar produces infinite memory usage, presumably this is a bug in the DFA/NFA code:
grammar;
pub period: () = {
r"." => ()
};
If I try to build LALRPOP on the latest nightly, the build fails rather spectacularly by first spewing out a couple thousand warnings and then hitting an internal compiler error. Changing the machine and/or the OS doesn't affect the result. Also, both 0.7.0 and the current master branch fail in exactly the same way. Here is a log of the failing build:
In this grammar:
grammar;
pub E: () = {
"X" "{" <a:AT*> <e:ET> <b:AT*> "}" => (),
};
AT: () = {
"type" ";"
};
ET: () = {
"enum" "{" "}"
};
inlining produces a state like this, which is wrong:
// State 3
// AT = (*) "type" ";" ["enum"]
// AT = (*) "type" ";" ["type"]
// AT+ = (*) AT ["enum"]
// AT+ = (*) AT ["type"]
// AT+ = (*) AT+ AT ["enum"]
// AT+ = (*) AT+ AT ["type"]
// E = "X" "{" (*) AT+ ET AT+ "}" [EOF] <---- ???
// E = "X" "{" (*) ET "}" [EOF]
// ET = (*) "enum" "{" "}" ["}"]
//
// "enum" -> Shift(S7)
// "type" -> Shift(S8)
//
// AT -> S4
// AT+ -> S5
// ET -> S6
In particular, we are missing "X" "{" AT+ ET "}"
, which will cause parse failures.
grammar;
pub Thing = {
r"(%%|[^%])+"
};
We've got a lot of the pieces, but you should be able to have LALRPop generate a tokenizer for you. The idea to start is roughly that terminals like "foo" will be interpreted as literals and that we add r"foo"
for regular expressions. You can sprinkle them wherever and things "just work".
Hello,
i try to parse php expressions using lalrpop:
https://github.com/timglabisch/rustphp/blob/c79060d6495a55174fc2ad5710d5774d8ec94d67/src/calculator1.lalrpop
my problem is that cargo run becomes incredibly slow:
time cargo run
1234.76s user 19.88s system 99% cpu 20:58.78 total
the generated file is huge (~50 MB):
cat src/calculator1.rs | wc -l
1044621
is there something fundamentally wrong or is this expected?
With this simple grammar:
grammar;
#[inline]
Foo = { "a", "b" };
I get:
src/grammar/test.lalrpop:3:3: 3:8 error: unrecognized annotation `inline`
#[inline]
~~~~~~
I suppose this isn't correct, is it? I read a bit through the lalrpop-source files and it seems that this error really shouldn't happen. The output is in normalize/prevalidation.rs:83
. Maybe something is wrong with the string interner?
Something like this ~~X
or ~n:~X
makes no sense.