grmtools's Introduction

Grammar and parsing libraries for Rust

grmtools is a suite of Rust libraries and binaries for parsing text, both at compile-time and run-time. Most users will probably be interested in the compile-time Yacc feature, which allows traditional .y files to be used (mostly) unchanged in Rust.

Quickstart

A minimal example using this library consists of two files (in addition to the grammar and lexing definitions). First we need to create a file build.rs in the root of our project with the following content:

use cfgrammar::yacc::YaccKind;
use lrlex::CTLexerBuilder;

fn main() {
    CTLexerBuilder::new()
        .lrpar_config(|ctp| {
            ctp.yacckind(YaccKind::Grmtools)
                .grammar_in_src_dir("calc.y")
                .unwrap()
        })
        .lexer_in_src_dir("calc.l")
        .unwrap()
        .build()
        .unwrap();
}

This will generate and compile a parser and lexer, where the definitions for the lexer can be found in src/calc.l:

%%
[0-9]+ "INT"
\+ "+"
\* "*"
\( "("
\) ")"
[\t ]+ ;

and where the definitions for the parser can be found in src/calc.y:

%start Expr
%avoid_insert "INT"
%%
Expr -> Result<u64, ()>:
      Expr '+' Term { Ok($1? + $3?) }
    | Term { $1 }
    ;

Term -> Result<u64, ()>:
      Term '*' Factor { Ok($1? * $3?) }
    | Factor { $1 }
    ;

Factor -> Result<u64, ()>:
      '(' Expr ')' { $2 }
    | 'INT'
      {
          let v = $1.map_err(|_| ())?;
          parse_int($lexer.span_str(v.span()))
      }
    ;
%%
// Any functions here are in scope for all the grammar actions above.

fn parse_int(s: &str) -> Result<u64, ()> {
    match s.parse::<u64>() {
        Ok(val) => Ok(val),
        Err(_) => {
            eprintln!("{} cannot be represented as a u64", s);
            Err(())
        }
    }
}

We can then use the generated lexer and parser within our src/main.rs file as follows:

use std::env;

use lrlex::lrlex_mod;
use lrpar::lrpar_mod;

// Using `lrlex_mod!` brings the lexer for `calc.l` into scope. By default the
// module name will be `calc_l` (i.e. the file name, minus any extensions,
// with a suffix of `_l`).
lrlex_mod!("calc.l");
// Using `lrpar_mod!` brings the parser for `calc.y` into scope. By default the
// module name will be `calc_y` (i.e. the file name, minus any extensions,
// with a suffix of `_y`).
lrpar_mod!("calc.y");

fn main() {
    // Get the `LexerDef` for the `calc` language.
    let lexerdef = calc_l::lexerdef();
    let args: Vec<String> = env::args().collect();
    // Now we create a lexer with the `lexer` method with which we can lex an
    // input.
    let lexer = lexerdef.lexer(&args[1]);
    // Pass the lexer to the parser and lex and parse the input.
    let (res, errs) = calc_y::parse(&lexer);
    for e in errs {
        println!("{}", e.pp(&lexer, &calc_y::token_epp));
    }
    match res {
        Some(r) => println!("Result: {:?}", r),
        _ => eprintln!("Unable to evaluate expression.")
    }
}

For more information on how to use this library please refer to the grmtools book, which also includes a more detailed quickstart guide.

Examples

lrpar contains several examples showing how to use the lrpar/lrlex libraries, including how to generate parse trees and ASTs, use start conditions/states, and execute code while parsing.

Documentation

Documentation is available for both the latest release and master of each component: the grmtools book, cfgrammar, lrpar, lrlex and lrtable.

Documentation for all past and present releases

grmtools's People

Contributors

bendrissou, benjscho, bors[bot], gabi-250, jacob-hughes, jaheba, loewenheim, ltratt, madhavpcm, michelleyw, pablosichert, ptersilie, ratmice, rljacobson, rneswold, smartinscottlogic, snim2, tennyzhuang, utkarshkukreti, valarauca, vext01


grmtools's Issues

Add `rust,noplaypen` to example grammar code in the documentation

Much of the example code in the documentation is not syntax highlighted, presumably because it is a mix of both Rust and lrpar syntax rather than pure Rust. However, highlighting it as Rust code seems to work quite well. Also, the failure mode of Highlight.js is quite benign: in the worst case, a few tokens are left the default text color (black/dark gray).

The playpen feature (the ▶️ button) doesn't make much sense on those code blocks, as the code will not compile on https://play.rust-lang.org/. The noplaypen annotation turns this off.

Here's an example:

```rust,noplaypen
Assign -> ASTAssign: "ID" "+" Expr
{
    let id = $lexer.span_str($1.as_ref().unwrap().span()).to_string();
    ASTAssign::new(id, $3)
}

%%

struct ASTAssign {
    id: String
}

impl ASTAssign {
    fn new(name: String) -> Self {
        ASTAssign { name }
    }
}
```

I've already done this for the grmtools parsing idioms page but thought I'd ask if this PR would be welcome before I do the rest of the docs.

Adding a %grammar-kind declaration?

Before I try and come up with a patch, I figured it would be good to discuss this in an issue,
I was considering potentially adding a declaration %grammar-kind Original(NoAction), etc

One of the problems with this is that we would probably want to parse the value by just using Deserialize on YaccGrammarKind (that would at least be the easiest way), but this brings a few issues:

  1. cfgrammar has optional deserialization support, so if we deserialized that way, %grammar-kind would only work with the serde feature enabled. Alternatively, we could implement this by hand instead of via serde.
  2. Some declarations depend upon a specific %grammar-kind, so we may have to move some checks from parse_declarations to ast.complete_and_validate.

But it could potentially reduce the number of places that YaccGrammarKind needs to be specified (build.rs, nimbleparse command line, etc). So it seemed like it might be worth considering.

Declaring extra parameters to be passed to parse with %parse-param

I couldn't find a way to do this in lrpar, but Bison has a %parse-param directive which allows you to declare additional parameters to pass to the generated parse function.

I would guess there would also need to be something for lifetimes in addition to that.
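
In the meantime, one workaround is to keep the extra state somewhere that both the action code (which can call any item defined after the grammar's second %%) and the calling code can reach, for example a thread-local. A rough sketch, where ParseCtx and with_ctx are hypothetical names and not part of lrpar:

use std::cell::RefCell;

// Hypothetical user-side context; nothing here is part of lrpar.
#[derive(Default, Debug)]
pub struct ParseCtx {
    pub warnings: Vec<String>,
}

thread_local! {
    static CTX: RefCell<ParseCtx> = RefCell::new(ParseCtx::default());
}

// Both grammar actions and the code that calls parse() can use this accessor.
pub fn with_ctx<R>(f: impl FnOnce(&mut ParseCtx) -> R) -> R {
    CTX.with(|c| {
        let mut ctx = c.borrow_mut();
        f(&mut *ctx)
    })
}

fn main() {
    // An action body could call `with_ctx(|ctx| ctx.warnings.push(...))`;
    // after parsing, the caller drains the context in the same way.
    with_ctx(|ctx| ctx.warnings.push("example warning".to_string()));
    with_ctx(|ctx| println!("{:?}", ctx.warnings));
}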

How to use lrlex with Rust code

  • With flex, we can use C code and a C context, like this:
struct compile_context ctx;

int test_compile() {
    struct compile_store store;

    yaccparse(&store);
    return 0;
}

This C code calls the yacc-exported function yaccparse, whose parameter is a pointer to a struct compile_store.
ctx is a global context; we could also say it is a global variable shared with yacc.
In the .l file, we can use C code with ctx:

%{
extern struct compile_context ctx;
%}
%%
testsymbol    {
                  /* do something with ctx: for example, store the count of
                     testsymbol in ctx, or extract substrings from yytext and
                     store them in ctx. */
              }

If there are two occurrences of testsymbol in the input, the C code in {} will execute twice.
My two questions are:

  • how to write Rust code with lrlex
  • how to get the equivalent of yytext in a .l file with lrlex

Does it support two parsers in one project?

I attempted to write two parsers (.y files) in one project and build them like this:

use cfgrammar::yacc::{YaccKind, YaccOriginalActionKind};
use lrlex::CTLexerBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    CTLexerBuilder::new()
        .lrpar_config(|ctp| {
            ctp.yacckind(YaccKind::Grmtools)
                .grammar_in_src_dir("parser.y")
                .unwrap()
        })
        .lexer_in_src_dir("lexer.l")?
        .build()?;


    CTLexerBuilder::new()
        .lrpar_config(|ctp| {
            ctp.yacckind(YaccKind::Original(YaccOriginalActionKind::GenericParseTree))
                .grammar_in_src_dir("func.y")
                .unwrap()
        })
        .lexer_in_src_dir("lexer.l")?
        .build()?;

    Ok(())
}

And it seems to work properly.

This is my .y file:

%start Stat
%%
Stat -> Result<DrawableKind<'input>, ()>:
      'ORIGIN' 'IS' Origin 'SEMICOLON' { $3 }
    | 'ROT' 'IS' Rot 'SEMICOLON' { $3 }
    | 'SCALE' 'IS' Scale 'SEMICOLON' { $3 }
    | 'FOR' DrawFor 'SEMICOLON' { $2 }
    | 'EXIT' { Ok(DrawableKind::Exit) }
    ;

Origin -> Result<DrawableKind<'input>, ()>:
    'LB' Expr 'COMMA' Expr 'RB' {
      Ok(DrawableKind::Origin($2?, $4?))
    }
    ;

Rot -> Result<DrawableKind<'input>, ()>:
    Expr {
      Ok(DrawableKind::Rot($1?))
    }
    ;

Scale -> Result<DrawableKind<'input>, ()>:
    'LB' Expr 'COMMA' Expr 'RB' {
      Ok(DrawableKind::Scale($2?, $4?))
    }
    ;

DrawFor -> Result<DrawableKind<'input>, ()>:
    Alphabet 'FROM' Expr 'TO' Expr 'STEP' Expr 'DRAW' 'LB' Alphabet 'COMMA' Alphabet 'RB'
        {
            Ok(DrawableKind::DrawableFor(
                ForStruct {
                    ch: $1?,
                    from: $3?,
                    to: $5?,
                    step: $7?,
                    x: $10?,
                    y: $12?,
                }
            ))
        }
    ;

Alphabet -> Result<&'input str, ()>:
      'ALPHABET'
        {
            let v = $1.map_err(|_| ())?;
            Ok($lexer.span_str(v.span()))
        }
    | 'FLOAT'
        {
           let v = $1.map_err(|_| ())?;
           Ok($lexer.span_str(v.span()))
        }
    ;

Expr -> Result<f64, ()>:
      Expr 'PLUS' Term { Ok($1? + $3?) }
    | Expr 'MINUS' Term { Ok($1? - $3?) }
    | Term { $1 }
    ;

Term -> Result<f64, ()>:
      Term 'MUL' Factor { Ok($1? * $3?) }
    | Term 'DIV' Factor { Ok($1? / $3?) }
    | Factor { $1 }
    ;

Factor -> Result<f64, ()>:
      'LB' Expr 'RB' { $2 }
    | 'FLOAT'
      {
          let v = $1.map_err(|_| ())?;
          parse_float($lexer.span_str(v.span()))
      }
    ;
%%
// Any functions here are in scope for all the grammar actions above.
use crate::rt_util::*;

fn parse_float(s: &str) -> Result<f64, ()> {
    match s.parse::<f64>() {
        Ok(val) => {
            Ok(val)
        },
        Err(_) => {
            eprintln!("{} cannot be represented as a f64", s);
            Err(())
        }
    }
}

and this is my second .y file (func.y):

%start E
%avoid_insert "ALPHABET"
%%
E: E 'PLUS' T
    | E 'MINUS' T
    | T ;

T: T 'MUL' F
    | T 'DIV' F
    | F ;

F: 'LB' E 'RB'
      | 'FUNC' 'LB' 'ALPHABET' 'RB'
      | 'ALPHABET';

When I don't build the second parser, it works fine...

Is it possible to perform side effects while parsing?

I am wondering if it is possible, instead of parsing source text into an AST, to output a vector of bytes in postfix order. For example, assuming constants are prefixed by 17 (for no particular reason) and + is encoded as 5, I would want 2 + 11 to be parsed into the vector [17, 2, 17, 11, 5]. This would require allocating the vector at the start of the parsing process and then pushing to it as expressions are parsed. I haven't seen anything in the documentation about how to do this, so I suspect it may not be supported.

Context: The bytecode interpreter in Crafting Interpreters uses this kind of architecture. It's easy to do if you write a recursive descent parser as the book advocates, but I'm interested in using a parser generator.
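
One way to get this effect without mutable state is to make each rule's action value the bytecode for its subtree, so a parent rule concatenates its children's vectors and appends its own opcode. A minimal Rust sketch of that composition, with emit_const/emit_add as hypothetical stand-ins for what the action bodies would compute from $1/$3:

const OP_CONST: u8 = 17; // encodings taken from the example above
const OP_ADD: u8 = 5;

// Stand-in for an 'INT' rule's action: emit the constant in postfix form.
fn emit_const(n: u8) -> Vec<u8> {
    vec![OP_CONST, n]
}

// Stand-in for an `Expr '+' Term` action: children first, then the operator.
fn emit_add(lhs: Vec<u8>, rhs: Vec<u8>) -> Vec<u8> {
    let mut out = lhs;
    out.extend(rhs);
    out.push(OP_ADD);
    out
}

fn main() {
    // "2 + 11" => [17, 2, 17, 11, 5]
    let bytecode = emit_add(emit_const(2), emit_const(11));
    assert_eq!(bytecode, vec![17, 2, 17, 11, 5]);
}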

Non-Relative Paths given to `include_bytes!`

I'm working with a build.rs that is roughly:

let rules_id = CTParserBuilder::new()
    .yacckind(YaccKind::Grmtools)
    .recoverer(RecoveryKind::None)
    .error_on_conflicts(true)
    .process_file(
        "src/internals/parser/generated/parser.y",
        "src/internals/parser/generated/parser.rs")?;

Now the build.rs works and my parser gets generated, but as the project gets compiled I see:

error: couldn't read src/internals/parser/generated/src/internals/parser/generated/parser.grm: No such file or directory (os error 2)
error: couldn't read src/internals/parser/generated/src/internals/parser/generated/parser.sgraph: No such file or directory (os error 2)
error: couldn't read src/internals/parser/generated/src/internals/parser/generated/parser.stable: No such file or directory (os error 2)

Within my generated parser.rs I see:

::lrpar::ctbuilder::_reconstitute( include_bytes!("src/internals/parser/generated/parser.grm"),
include_bytes!("src/internals/parser/generated/parser.sgraph"),
include_bytes!("src/internals/parser/generated/parser.stable"));

The problem is that the documentation for include_bytes! specifies: "The file is located relative to the current file."

Expose more than one rule?

Question / Feature Request: Is there any way to parse a specific rule as the starting parser? For example, if I have:

%start Expr
%%
Expr -> ...;

Int -> ...;
%%

I also want to be able to parse a string as Int, not just Expr.

(I'm trying to port my parser from LALRPOP to lrpar (mainly because of the operator precedence feature) which exposes a parser for any rule prefixed with the keyword pub.)

Implement std::error::Error for lrpar::lex::Lexeme<T>

It would be nice if std::error::Error was implemented for lrpar::lex::Lexeme<T>.

That way, if we're in the context of a -> Result<T, Box<dyn std::error::Error>> function, we can do

let value = $1?;

instead of

let value = $1.map_err(|error| format!("{:?}", error))?;

In context:

Integer -> Result<i32, Box<dyn Error>>:
    "INT" {
        let value = $1.map_err(|error| format!("{:?}", error))?;
        let string = $lexer.lexeme_str(&value);
        Ok(string.parse::<i32>()?)
    }
;

$1 is of type Result<Lexeme<u32>, Lexeme<u32>> in the generated code.
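
For reference, a sketch of what the requested impls could look like; Lexeme here is a simplified stand-in struct, since the real type (and whatever bounds it needs) lives in lrpar:

use std::{error::Error, fmt};

// Simplified stand-in for lrpar's Lexeme; only the shape matters here.
#[derive(Debug)]
pub struct Lexeme<T> {
    tok_id: T,
    start: usize,
    len: usize,
}

impl<T: fmt::Debug> fmt::Display for Lexeme<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Lexeme {{ tok_id: {:?}, start: {}, len: {} }}",
               self.tok_id, self.start, self.len)
    }
}

// Error only requires Debug + Display, so the impl itself is empty. With this
// in place, `let value = $1?;` works in actions returning
// Result<_, Box<dyn Error>>, because `?` can convert any Error type into a
// Box<dyn Error>.
impl<T: fmt::Debug> Error for Lexeme<T> {}

fn main() {
    let err: Box<dyn Error> = Box::new(Lexeme { tok_id: 4u8, start: 0, len: 3 });
    println!("{}", err);
}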

How to use " and spaces in a .l file

I have tried some examples, but they all failed to compile. It is not the same as flex:

%%
# [0-9]+ \"      "TEST"
%%
#\ [0-9]+\ \"      "TEST"

I wrote this with flex before, and it compiled successfully with C code:

testline  #\ [0-9]+\ \"
%%
testline   return TEST;

I found that this project does not support defining testline first, and that using " or return XXX; fails to compile.

token_span sometimes returns the wrong span

I haven't gotten to the bottom of this yet, but token_span, added in my first series of patches, appears to sometimes return the wrong span.

I also think that, with #293, token_span may actually be superfluous, given that it should return the first production rule where it occurs, between rule_spans and symbol spans. You could probably replace it with a running find over the symbol spans.

I will get to the bottom of what is wrong with it regardless of whether we think it should be kept.
Apologies, I kind of thought approaching this in stages would help but it really hasn't seemed to.

Ignoring `$X` within strings for actions

When translating action code we currently replace every $X within it with a reference to the correct value from the action stack. However, since we are using a simple regex to do this, we also replace any occurrence of $[0-9]+ within comments and strings. While we can live with altering comments, changing strings may prevent users from generating reasonable programs.

Since parsing Rust is too big a task, a quick solution to this problem would be to count unescaped quotes and only replace an occurrence of $X if the count at that point is even (i.e. the occurrence is not inside a string). This should at least get rid of replacements within user strings.
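
A rough sketch of that heuristic (not lrpar's actual implementation): walk the action string, count unescaped double quotes, and only report a $ when the count so far is even, i.e. when we are not inside a string literal.

fn dollar_offsets_outside_strings(action: &str) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut quotes = 0usize;
    let mut prev_backslash = false;
    for (i, c) in action.char_indices() {
        match c {
            // An unescaped double quote toggles the "inside a string" state.
            '"' if !prev_backslash => quotes += 1,
            // Only report `$` when we have seen an even number of quotes.
            '$' if quotes % 2 == 0 => offsets.push(i),
            _ => {}
        }
        prev_backslash = c == '\\' && !prev_backslash;
    }
    offsets
}

fn main() {
    let action = r#"format!("costs ${}", $1)"#;
    // Only the `$` of `$1`, outside the string literal, is reported.
    assert_eq!(dollar_offsets_outside_strings(action), vec![21]);
}

This would still mis-handle raw strings and char literals containing a quote, but as a heuristic it covers the common case.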

Repair not following grammar

Hi, I was playing with your amazing tool and used a grammar very similar to the "calculator evaluator" given in Quickstart Guide.

I tried an arbitrary buggy calculator program: 1 + * ( 1 + ) ( + * 1 + 1 * ) ( * 1 )

These were the suggested repair sequences:

Parsing error at line 1 column 5. Repair sequences found:
   1: Delete *
   2: Insert INT
start (line: 1, col: 5), end (line: 1, col:6)

Parsing error at line 1 column 13. Repair sequences found:
   1: Delete ), Shift (, Delete +, Delete *
   2: Delete ), Shift (, Insert INT, Delete +
   3: Insert INT, Shift ), Delete (, Delete +
   4: Delete ), Shift (, Insert INT, Shift +, Insert INT
   5: Insert INT, Shift ), Delete (, Shift +, Insert INT
   6: Insert INT, Shift ), Delete (, Shift +, Delete *
   7: Delete ), Shift (, Insert INT, Shift +, Delete *

Parsing error at line 1 column 29. Repair sequences found:
   1: Insert INT, Shift ), Delete (
   2: Delete ), Shift (, Insert INT
start (line: 1, col: 29), end (line: 1, col:30)

However, the repaired program we get using the first repair sequence is: 1 + ( 1 + ( 1 + 1 * INT ) * 1 ), which doesn't conform to the grammar, and the tool rejects the repaired input when it is sent through again (it gives an error). Is there something I missed?

calc.y

%start Expr
%%
Expr -> Result<u64, ()>:
      Expr 'PLUS' Term { Ok($1? + $3?) }
    | Term { $1 }
    ;

Term -> Result<u64, ()>:
      Term 'MUL' Factor { Ok($1? * $3?) }
    | Factor { $1 }
    ;

Factor -> Result<u64, ()>:
      'PARENOPEN' Expr 'PARENCLOSE' { $2 }
    | 'INT'
      {
          let v = $1.map_err(|_| ())?;
          parse_int($lexer.span_str(v.span()))
      }
    ;
%%
// Any functions here are in scope for all the grammar actions above.

fn parse_int(s: &str) -> Result<u64, ()> {
    match s.parse::<u64>() {
        Ok(val) => Ok(val),
        Err(_) => {
            eprintln!("{} cannot be represented as a u64", s);
            Err(())
        }
    }
}

calc.l

%%
[0-9]+ "INT"
\+ "PLUS"
\* "MUL"
\( "PARENOPEN"
\) "PARENCLOSE"
[\t ]+ ;

write .l with var for lrlex

When we use flex, we can write a .l file like this:

testline  #\ [0-9]+\ \"
%%
{testline}   return TEST;

testline acts like a variable or symbol: when compiling, flex replaces {testline} in the %% section with #\ [0-9]+\ \". This can improve the readability of the .l file when the regexes are complex.
It would be nice to be able to write .l files with such variables for lrlex.

Panicked with 'called `Result::unwrap()` on an `Err` value: Custom("invalid value: integer `4984270261285729579`, expected usize")' after compiling to WebAssembly and running on nodejs

OS: Ubuntu 20.04 64 bit

Steps to reproduce the bug:

  1. Follow the Quickstart guide from the grmtools book.
  2. Edit the Cargo.toml in the following way:
[lib]
crate-type = ["cdylib"]

[build-dependencies]
cfgrammar = "0.11"
lrlex = "0.11"
lrpar = "0.11"

[dependencies]
cfgrammar = "0.11"
lrlex = "0.11"
lrpar = "0.11"
wasm-bindgen = "0.2"
console_error_panic_hook = "*"
  3. Edit the parse function in src/lib.rs in the following way:
#[wasm_bindgen]
pub fn parse(string: &str) -> u64 { 
    console_error_panic_hook::set_once();
    let lexerdef = calc_l::lexerdef();
    let lexer = lexerdef.lexer(string);
    // Pass the lexer to the parser and lex and parse the input.
    let (res, _) = calc_y::parse(&lexer);
    match res {
        Some(r) => r.unwrap(),
        _ => panic!("Failed to parse expression"),
    }
}
  4. Compile using `wasm-pack build --target nodejs`.
  5. Link the package to a node project.
  6. Run the following script using node:
const calc = require("calc");
console.log(calc.parse("2 + 3"))

My initial conjecture is that there is a cast between u64 and usize somewhere, which does not work with the Wasm architecture.

Runtime error in lrpar::ctbuilder::_reconstitute with target wasm32-unknown-unknown

Hi there!

I'm trying to parse a grammar from within WebAssembly, but

let stable = deserialize(stable_buf).unwrap();

produces the following error at runtime:

invalid value: integer `17471392538647552630`, expected usize

Looks like in some part of the library the target architecture wasn't taken into account correctly.

Do you have a guess where the error could be?

Please let me know if there's more information I can provide to debug this issue.

error: environment variable `OUT_DIR` not defined

I try to compile a simple example from documentation. And I get the error:

error: environment variable `OUT_DIR` not defined
 --> src/main.rs:7:1
  |
7 | lrlex_mod!(calc_l);
  | ^^^^^^^^^^^^^^^^^^^
  |
  = note: this error originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

error: environment variable `OUT_DIR` not defined
 --> src/main.rs:9:1
  |
9 | lrpar_mod!(calc_y);
  | ^^^^^^^^^^^^^^^^^^^
  |
  = note: this error originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

Maybe it is a problem with cargo?
My files are:

src/calc_l.l

%%
[0-9]+ "INT"
\+ "PLUS"
\* "MUL"
\( "LBRACK"
\) "RBRACK"
[\t ]+ ;

src/calc_y.y

%start Expr
%%
Expr -> Result<u64, ()>:
      Term 'PLUS' Expr { Ok($1? + $3?) }
    | Term { $1 }
    ;

Term -> Result<u64, ()>:
      Factor 'MUL' Term { Ok($1? * $3?) }
    | Factor { $1 }
    ;

Factor -> Result<u64, ()>:
      'LBRACK' Expr 'RBRACK' { $2 }
    | 'INT'
      {
          let v = $1.map_err(|_| ())?;
          parse_int($lexer.lexeme_str(&v))
      }
    ;
%%
// Any functions here are in scope for all the grammar actions above.

fn parse_int(s: &str) -> Result<u64, ()> {
    match s.parse::<u64>() {
        Ok(val) => Ok(val),
        Err(_) => {
            eprintln!("{} cannot be represented as a u64", s);
            Err(())
        }
    }
}

src/main.rs

use std::io::{self, BufRead, Write};

use lrlex::lrlex_mod;
use lrpar::lrpar_mod;

// Using `lrlex_mod!` brings the lexer for `calc_l.l` into scope.
lrlex_mod!(calc_l);
// Using `lrpar_mod!` brings the parser for `calc_y.y` into scope.
lrpar_mod!(calc_y);

fn main() {
    // We need to get a `LexerDef` for the `calc` language in order that we can lex input.
    let lexerdef = calc_l::lexerdef();
    let stdin = io::stdin();
    loop {
        print!(">>> ");
        io::stdout().flush().ok();
        match stdin.lock().lines().next() {
            Some(Ok(ref l)) => {
                if l.trim().is_empty() {
                    continue;
                }
                // Now we create a lexer with the `lexer` method with which we can lex an input.
                let mut lexer = lexerdef.lexer(l);
                // Pass the lexer to the parser and lex and parse the input.
                let (res, errs) = calc_y::parse(&mut lexer);
                for e in errs {
                    println!("{}", e.pp(&lexer, &calc_y::token_epp));
                }
                match res {
                    Some(r) => println!("Result: {}", r),
                    _ => eprintln!("Unable to evaluate expression.")
                }
            }
            _ => break
        }
    }
}

Error span improvements

In pr #299 which adds spans to various Error types, the Spans returned are based off of the existing
offset data from which we can derive a line & column. As it is we currently always return a span where start == end, since it is just getting us to the desired semver ABI.

  1. SemVer compatible changes (after we add Spans to Errors):
    After that PR we could include in the error more information from the parse functions into YaccParserError and LexBuildError.
    This may require some reorganization of the various private parse functions.
  2. Potential SemVer incompatible changes (after we add Spans to Errors):
    YaccErrorKind, and LexErrorKind could sometimes have useful additional spans, for instance LexErrorKind::DuplicateName
    Could have a span pointing to the first occurrence of the duplicate entry.
  • SemVer compatible improvements
  • SemVer incompatible improvements

Break `>>` into `>` and `>`

Hi! Sorry if this is a loose fit for an issue, but I'd love to get CPCT+ going in my language.

I've been (rather fruitlessly) trying to implement generics in my language, but I ran into this issue; after lots of googling I could not seem to figure it out. lrlex parses >> into a right bit shift, so I'm having trouble writing logic for nested generics in yacc form.

Array<Array<T>>
             ^^ lexer thinks this is a bit shift instead of two greater thans

Does anyone have any ideas that will play nicely with CPCT+? Or is the only way to write a custom lexer to fool the parser (if that's even possible)?

Specifying one action implies specifying all actions?

Currently we allow grammars to have some rules with actions and some without e.g.:

R1: R2; { ... }
R2: ;

I don't think this makes sense: what is R2 expected to return? My gut feeling is that if a user specifies an action to one rule, we should statically complain if any other rules are missing actions. Thoughts @ptersilie?

Allow qualifying types (module::Foo) in parser files

It would be nice if it was allowed to qualify types in parser files.

E.g. being able to write a production:

Integer -> context::Result<i32>:
  [...]
;

Currently, using :: yields an error:

Illegal string at line [...] column 20

The context was included in the .y file as

use crate::parser::context::*;

and contains

pub trait Error = std::error::Error;
pub type Result<T> = std::result::Result<T, Box<dyn Error>>;

[...]

Question: A good idea to call line_col at every node?

This is really a question, and not an issue:

I am building an AST and want to store a source location at every node. The simplest approach is to store the result of
line_col. However, skimming the code suggests that this may not be a good idea. I believe each call to line_col may scan over every prior newline in the file:

fn line_col(&self, span: Span) -> ((usize, usize), (usize, usize)) {

Is that accurate? If so, I suppose the smart approach is to store a Span, and only invoke line_col when I need to print a location.
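
If line_col does scan the input each time, storing the Span and deferring the conversion is the cheap option. A sketch of that approach, where Expr is a hypothetical AST type and the closure stands in for a call to the lexer's line_col:

use cfgrammar::Span;

// Hypothetical AST: each node carries only the cheap, Copy-able Span.
#[derive(Debug)]
enum Expr {
    Int { span: Span, val: u64 },
    Add { span: Span, lhs: Box<Expr>, rhs: Box<Expr> },
}

impl Expr {
    fn span(&self) -> Span {
        match self {
            Expr::Int { span, .. } | Expr::Add { span, .. } => *span,
        }
    }
}

// Only called when a location actually needs to be shown, so the newline scan
// happens once per diagnostic rather than once per node.
fn report(e: &Expr, line_col: impl Fn(Span) -> ((usize, usize), (usize, usize))) {
    let ((line, col), _) = line_col(e.span());
    eprintln!("node {:?} starts at line {} column {}", e, line, col);
}

fn main() {
    let e = Expr::Add {
        span: Span::new(0, 6),
        lhs: Box::new(Expr::Int { span: Span::new(0, 1), val: 2 }),
        rhs: Box::new(Expr::Int { span: Span::new(4, 6), val: 42 }),
    };
    // In real code this closure would be `|sp| lexer.line_col(sp)`.
    report(&e, |_sp| ((1, 1), (1, 7)));
}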

From repair sequence to repaired input strings

This may not be the right venue to ask, but I thought it was worthwhile asking (since I couldn't find it in the README/docs).

Is there an API provided to apply the repair sequence(s) produced to the original input (i.e. producing a new input string that would not raise a parse error)? Or is the expectation that consumers do that themselves? Since users can plug in custom lexers, I'm inclined to believe that the latter is the expectation, but I wanted to confirm before reimplementing something already available in grmtools.

Support for implicit tokens

In Yacc we can define implicit tokens, like the following example from the documentation:

date  :  month_name  day  ','  year   ;

The comma , is enclosed in single quotes; this implies that the comma is to appear literally in the input.

From what I can tell, double quotes do the same thing.

Allow generating `pub mod` instead of just `mod`

I noticed when wanting to pull out lrlex_mod! and lrpar_mod! calls that they unfortunately generate mod instead of pub mod. Could the macro be modified to optionally allow generation of pub?
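
Until such an option exists, one workaround is to wrap the macro invocations in a pub mod of your own and re-export the generated (private) modules. A sketch, assuming the quickstart's calc.l/calc.y:

// In some crate-level file, e.g. src/lib.rs.
pub mod frontend {
    use lrlex::lrlex_mod;
    use lrpar::lrpar_mod;

    lrlex_mod!("calc.l"); // expands to a private `mod calc_l { ... }`
    lrpar_mod!("calc.y"); // expands to a private `mod calc_y { ... }`

    // Re-export the generated modules so other code can reach
    // `frontend::calc_l::lexerdef()` and `frontend::calc_y::parse()`.
    pub use self::{calc_l, calc_y};
}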

panicked at 'time not implemented on wasm32-unknown-unknown'

Hey there,

as already mentioned in #126 I had some problems on target wasm32-unknown-unknown with gracefully handling input that isn't accepted by the grammar.

I got stack traces working now:

panicked at 'time not implemented on wasm32-unknown-unknown', src/libstd/sys/wasm/time.rs:13:9

Stack:

Error
    [...<omitted>...]
    at std::sys::wasm::time::Instant::now::h766434e196ba7d73 (wasm-function[2735]:0xc091e)
    at std::time::Instant::now::hd779be9a0cda9cbc (wasm-function[3030]:0xc1473)
    at lrpar::parser::Parser<StorageT,ActionT>::lr::hce502a5dc7a16768 (wasm-function[61]:0x2c111)
    at lrpar::parser::Parser<StorageT,ActionT>::parse_actions::hfdeea25ce9816fde (wasm-function[318]:0x67fc8)
    [...<omitted>...]

See rust-lang/rust#48564 for more info on that issue in general.

Having a quick glance at the code base, it looks like the calls to std::time::Instant::now could be replaced by calling a function that produces strictly incremental identifiers.

Is that correct, or does some part rely on the actual timestamps?
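
For the call sites that only need a strictly increasing value (rather than real elapsed time), something along these lines would avoid touching the clock; whether that is all of them is exactly the open question above. A sketch:

use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide, strictly increasing counter; it never reads the system clock,
// so it cannot panic on wasm32-unknown-unknown.
static TICKS: AtomicU64 = AtomicU64::new(0);

fn next_tick() -> u64 {
    TICKS.fetch_add(1, Ordering::Relaxed) + 1
}

fn main() {
    let a = next_tick();
    let b = next_tick();
    assert!(b > a); // strictly incremental identifiers
}

If some of the uses genuinely measure elapsed wall-clock time (e.g. if error recovery works against a time budget), a counter alone would not be a drop-in replacement.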

What should the return types of Lexeme::end() and Lexeme::len() be?

Currently these two functions have return types Option<usize> where None means "this lexeme was inserted by error recovery". In a sense this seems like the right thing to do, in that it stops users accidentally thinking they've got a "normal" lexeme. But in practice I'm starting to be less sure about it, because it seems that I always end up doing l.end().unwrap_or_else(|| l.start()).

Indeed, if I look in lrlex/lrpar I always end up doing that:

$ rg end\\\(\\\) lrlex lrpar
lrlex/src/main.rs
74:            &input[l.start()..l.end().unwrap_or_else(|| l.start())]

lrlex/src/lib/lexer.rs
203:                    let len = m.end();
294:        &self.s[st..l.end().unwrap_or(st)]

lrlex/src/lib/parser.rs
113:        let line = self.src[i..i + line_len].trim_end();
141:        let re_str = line[..rspace].trim_end().to_string();

lrpar/src/lib/parser.rs
61:                    let lt = &input[lexeme.start()..lexeme.end().unwrap_or_else(|| lexeme.start())];
322:                    span.push(la_lexeme.end().unwrap_or_else(|| la_lexeme.start()));
444:                            span_uw.push(la_lexeme.end().unwrap_or_else(|| la_lexeme.start()));
976:                    let len = m.end();

lrpar/examples/calc_parsetree/src/main.rs
89:                        self.s[lexeme.start()..lexeme.end().unwrap_or_else(|| lexeme.start())]

This is making me wonder if we've been too precise for our own good. But I'm really not sure: I really can see people shooting themselves in the foot if they've forgotten that some tokens can be inserted as the result of error recovery. Thoughts are appreciated!
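
If the Option return types stay, the repetition could at least be captured in a small helper on the user side. A sketch with a hypothetical stand-in type mirroring the Option-returning accessors:

// Stand-in with the same shape as the real lexeme type's accessors.
struct Lexeme {
    start: usize,
    end: Option<usize>, // None => inserted by error recovery
}

impl Lexeme {
    fn start(&self) -> usize { self.start }
    fn end(&self) -> Option<usize> { self.end }
    /// End of the lexeme, falling back to `start` for inserted lexemes.
    fn end_or_start(&self) -> usize {
        self.end().unwrap_or_else(|| self.start())
    }
}

fn main() {
    let inserted = Lexeme { start: 7, end: None };
    let normal = Lexeme { start: 0, end: Some(3) };
    assert_eq!(inserted.end_or_start(), 7);
    assert_eq!(normal.end_or_start(), 3);
}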

Fully qualify structs in generated code to avoid name conflicts

If you include some custom types in the context of a parser file, specifically

struct Result {}

it will change the meaning of Result (i.e. ::std::result::Result) in the generated code.

If the generated code fully qualified the usage of all structs (::some::path::to::Struct), no such name conflicts would be possible.

The way we serialise output makes cross compiling across architectures hard

At the moment we serialise a number of usize things which means if you cross-compile for a different machine word size (e.g. you cross-compile on a 64-bit machine with a WASM target) you end up with deserialisation errors. There are two solutions:

  1. The quick hack is to remove the troublesome usize things in the serialisation path (see the sketch below).
  2. The long term solution is to have a better (almost certainly custom written) serialisation format which can deal with such things.
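
The core of option 1 is to stop writing raw usize values and use a fixed-width integer on disk, converting back (and failing cleanly) on load. A sketch of that idea, independent of the actual serialisation format:

use std::convert::TryFrom;

// Serialise sizes as a fixed-width u64 regardless of the host's word size.
fn encode_len(len: usize) -> u64 {
    u64::try_from(len).expect("usize wider than u64 is not supported")
}

// On load, convert back to the target's usize; on a 32-bit target such as
// wasm32 this fails cleanly for oversized values instead of panicking deep
// inside deserialisation.
fn decode_len(raw: u64) -> Result<usize, String> {
    usize::try_from(raw).map_err(|_| format!("{} does not fit in usize on this target", raw))
}

fn main() {
    let on_disk = encode_len(12345);
    assert_eq!(decode_len(on_disk), Ok(12345));
    // decode_len(u64::MAX) would succeed on a 64-bit target but fail on a
    // 32-bit one; that asymmetry is what the format has to account for.
}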

adding a nimbleparse.toml

I've been thinking of trying to write a nimbleparse-lsp which can be used to automatically run nimbleparse on various test input when editing grammars.

For this, I think it would be nice to have something like a nimbleparse.toml which specifies lexer, parser, yacc kind, error recovery style, and either test inputs or file extension for inputs and test input directories.

Then running nimbleparse with no options, or with --toml or --toml nimbleparse.toml, would run over the various test cases.
We could also add an option to CTLexerBuilder, e.g. process_toml (in which case calling the file nimbleparse.toml might need a different name), to specify the lexer and parsers.

Is this a thing you all would be interested in having integrated into either nimbleparse or the builder API, or would it be best to experiment with it in the lsp server first in isolation and figure out the details there?

Generated code incompatible with `-D rust-2018-idioms`

With -D rust-2018-idioms and warnings treated as errors, hidden lifetime parameters in the generated code cause compilation failures, e.g.

603 |                       mut __gt_args: ::std::vec::Drain<'_, ::lrpar::parser::AStackType<lrlex::lexemes::DefaultLexeme<u8>, __GtActionsKind<'input>>>,
    |                                                        +++
error: hidden lifetime parameters in types are deprecated

I'd disable this by injecting #![allow(elided_lifetimes_in_paths)] at the start of the generated module but I can't find a way to do this (is there one?)

Increase visibility of this project

Hey there,

I just ported over a grammar written in Bison/Flex to grmtools – and I was super pleased with how the process went. (Sorry for bombarding you with a lot of small issues, those were just things I noted while implementing.)

Considering how mature this library seems to me, I'm wondering if you deliberately chose not to advertise it more (to avoid more maintenance effort)?

I spent a lot of time researching and trying parser generators for Rust, but none of the existing ones ticked all the boxes I was looking for, until I found this one.

In case you're interested in this project having more visibility (which in my opinion would be useful to a lot of people) I think the Readme could be a bit more elaborate and the repository could be tagged with some topics. Submitting it to relevant aggregators (/r/rust, hackernews, ...) after having a representative landing page would certainly help as well.

Thank you for the huge effort and developing this project in public!

Support file paths in process_file_in_src / lrlex_mod! / lrpar_mod!

It would be nice if the path specified in process_file_in_src would be replicated over to $OUT_DIR and lrlex_mod!/lrpar_mod! would support strings instead of identifiers.

That way, the usage of

.process_file_in_src("some/path/to/file.y")

and

lrpar_mod!("some/path/to/file.y");

would be symmetric.

It would also have the benefit of not creating conflicts for .process_file_in_src("a/file.y") and .process_file_in_src("b/file.y").

Apparently infinite recursive rule

One of the "fun" things about my project is running the parser on strange, half edited/incomplete changes.
Here is one such case I encountered that way, and have minimized.

Given the input character a, this causes an infinite loop pushing to pstack between
https://github.com/softdevteam/grmtools/blob/master/lrpar/src/lib/parser.rs#L297
https://github.com/softdevteam/grmtools/blob/master/lrtable/src/lib/statetable.rs#L461-L466
Adding a case like Some(i) if i == usize::from(stidx) + 1 => None to goto fixes it (i.e. when the return value of goto == prior).

I'm filing this as a bug report rather than sending a PR because I haven't yet tested the fix against valid parsers, or tried to work out whether this case always leads to infinite recursion or whether it can ever come up in a valid way.

%%
a "a"
[\t\n ] ;
%%
Start: Bar;
Foo: "a" | ;
Bar: Foo | Foo Bar;

Use more descriptive names for pp / epp

This is just a nit-pick – but I think it would be nice to use more descriptive identifiers for pp/epp, e.g. pretty_print:

pub fn pp<'a>(
    &self,
    lexer: &dyn Lexer<StorageT>,
    epp: &dyn Fn(TIdx<StorageT>) -> Option<&'a str>
) -> String {

Those functions might not be needed at all if you implement std::fmt::Display / std::fmt::Debug, but I haven't looked into this case here in detail yet.

nimbleparse's output doesn't track the input properly

#20 accidentally broke nimbleparse's output. For the calculator grammar and the input 2 3 + it reports:

Expr
 Term
  Factor
   INT 2

Error at line 1 col 3. Repairs found:
  Delete "3", Delete "3"
  Insert "+", Shift "3", Delete "3"
  Insert "*", Shift "3", Delete "3"
  Delete "3", Shift "3", Insert "INT"
  Insert "+", Shift "3", Shift "3", Insert "INT"
  Insert "*", Shift "3", Shift "3", Insert "INT"

The first repair sequence should be:

  Delete "3", Delete "+"

(and the other repair sequences also need tweaking).

I know what's causing this (the repair sequences are only using the token at the point of the error, failing to take account of shifts and deletes), but I think it's going to require an API change of some sort, so I might have to try a few things to find a good trade-off.

%ignore like in flex

Flex uses %ignore to ignore patterns matching a given regex.

Something like :
%ignore <regex>

I'm not sure, but this doesn't seem to work in lrlex; is there an alternative that I have failed to find?

Support to work with subgrammars

Suppose I have subgrammars:

<subgrammar1> with rules "1","2","+"
<subgrammar2> with rules "3","4","-"

And I have an <input file>:

1+2;
3-4;

Then I need to get
<grammar> = <subgrammar1> + <subgrammar2>
to parse all the lines of <input file>.

What options exist?
Thank you.
(I hope my question is clearer now.)

Add support for %expect

Bison has %expect declarations which tell the parsing system "I know there are n conflicts, please don't warn me if you find n too". Although undocumented in http://dinosaur.compilertools.net/yacc/index.html, Berkeley Yacc seems to support %expect too, so I think grmtools should add that. See section 3.7.9 of https://www.gnu.org/software/bison/manual/bison.html for more details.

Bison also supports a %expect-rr option, although the manual suggests it should only apply to GLR parsers. I don't see a good reason why we couldn't accept that for normal LR grammars as well, though perhaps only if the user specifies YaccKind::Grmtools?

Permit stack operations on start conditions

In #318, start state logic was added for start states defined by name.
In the POSIX lex standard, start states can be used by numeric id

Q: Should this include support for expanding the target start state logic to support increment and decrementing the current start state, as well as setting to an explicit target?

Allow leading "|" in first alternative of a production

It would be nice if it was allowed to place a leading | in the first alternative of a production.

E.g.:

Production -> [...]:
 |  Foo { [...] }
 |  Bar { [...] }
 |  Baz { [...] }
;

As far as I'm aware this wouldn't change any semantics, since the ε-alternative is also required to have a body {}.

Token ids are not accessible

For the calculator grammar, the following two files are generated:

///  calc_l.rs
mod calc_l {use lrlex::{LexerDef, Rule};

pub fn lexerdef() -> LexerDef<u8> {
    // code not shown for the sake of brevity 
} // fn ends here!!!
...
#[allow(dead_code)]
const T_INT: u8 = 4;
...
} // mod ends here!!!
///  calc_y.rs
mod calc_y {use lrpar::{Lexeme, Node, parse_rcvry, ParseError, reconstitute, RecoveryKind};

pub fn parse(lexemes: &[Lexeme<u8>])
          -> Result<Node<u8>, (Option<Node<u8>>, Vec<ParseError<u8>>)> { 
    // code not shown for the sake of brevity 
} // fn ends here!!!
} // mod ends here!!!
...
#[allow(dead_code)]
const R_TERM: u8 = 2;
...

One can match Node::NonTerms as shown here. R_TERM can be used because it is included here, and because it is not part of the calc_y module. If the user wanted to match a Node::Term based on a token id, they wouldn't be able to do the following:

match node {
    // does not compile as T_INT does not exist in the current scope
    // and it is a private constant in the `calc_l` module
    Term{lexeme} if lexeme.tok_id() == T_INT => {
        println!("hello");
     },
     _ => unreachable!();
}

Let me know if I am missing something obvious. If I am correct, then simply moving this statement to line 190 would solve the issue.
