parse`[that]`

Parser combinators for TypeScript and Rust.

Write your language in BBNF (better backus-naur form) .

Handles left recursion and left factoring. Performance focused whilst maintaining readability and ease of use.

Usage

TypeScript:

import { string, regex } from "@mkbabb/parse-that";

const heyy = regex(/hey+t/);
heyy.parse("heyyyyyyyyyt"); // => "heyyyyyyyyyt"

Rust:

use parse_that::string;

let heyy = string("heyyyyyyyyyt");
heyy.parse("heyyyyyyyyyt"); // => "heyyyyyyyyyt"

Or with a grammar:

import { string, match, generateParserFromBBNF } from "@mkbabb/parse-that";

const grammar = `
    expr = term, { ("+" | "-"), term };
    term = factor, { ("*" | "/"), factor };
    factor = number | "(", expr, ")";
    number = /[0-9]+/;
    whatzupwitu = "whatzupwitu";
`;

const [nonterminals, ast] = generateParserFromBBNF(grammar);
const expr = nonterminals.expr;
expr.parse("1 + whatzupwitu * 3"); // => [1, "+", ["whatzupwitu", "*", 3]]

nice.

parse[that]

Performance

TypeScript

Here's a benchmark comparing the following libraries, all parsing the same JSON grammar:

Chevrotain
Parsimmon
this library, with standard combinators: here
this library, using a generated parser from BBNF: here

The file used is a 3.8 MB JSON file, containing ~10K lines of JSON. Whitespace is randomly inserted to make the file a bit more difficult to parse (makes the file about 5x the size). Benchmark is run 100 times.

Results

name	hz	min	max	mean	p75	p99	p995	p999	rme	samples
Standard	8.1013	119.20	127.90	123.44	127.08	127.90	127.90	127.90	±2.64%	10
BBNF	5.1979	183.18	216.12	192.38	195.59	216.12	216.12	216.12	±3.97%	10
Chevrotain	6.8699	129.91	167.46	145.56	158.21	167.46	167.46	167.46	±5.70%	10
Parsimmon	4.0256	246.69	250.34	248.41	249.09	250.34	250.34	250.34	±1.54%	10

hz: ops/sec - higher is better

Standard - test/benchmarks/json.bench.ts > JSON Parser
    1.18x faster than Chevrotain
    1.56x faster than BBNF
    2.01x faster than Parsimmon

Probably not the most scientific comparison, but it's generally about 10-30% faster than what I've tested it against.

Have a look inside benchmarks if you're curious.

Rust

The approach taken in the Rust implementation is almost identical to that of the TypeScript one - there's much to improve here. The benchmark is run on a 2.3 MB JSON file (canada.json), containing ~10K lines of JSON. Benchmark is run 100 times.

Results

TODO: but it averages around 80 MB/s on my machine.

Debugging

Debugging is made 🌈pretty🌈 by using the debug combinator - but you must run in development mode (vite build --mode development) to see the output.

As output, you'll see a few things:

A header containing:
- parsing status (Ok or Err)
- current offset into the input string
- debug node's name
- stringified current parser - a bit like the BBNF format
A body containing:
- A maximum of 10 lines of the input string, with the current offset into the parse string denoted by ^
- Line numbers for each line currently displayed

The blue color indicates that that variable is an BBNF nonterminal - yellow is the stringified parser.

Thanks to chalk and colored for the colors!

BBNF and the Great Parser Generator

Better Backus-Naur Form is a simple and readable way to describe a language. A better EBNF.

See the BBNF for BBNF (meta right) at bbnf.bbnf.

With your grammar in hand, call the generateParserFromBBNF function within bbnf.ts and you'll be returned two objects:

[nonterminals, ast]: [BBNFNonterminals, BBNFAST] = generateParserFromBBNF(grammar);

a JavaScript object containing all of the parsed nonterminals in your language, and the AST for your language. Each nonterminal is a Parser object - use it as you would any other parser.

Fully featured, and self-parsing, so the BBNF parser-generator is written in BBNF. Checkout the self-parsing example (+ formatting), at bbnf.test.ts.

Key differences with EBNF

It's a mesh between your run-of-the-mill EBNF and W3C's EBNF. So stuff like ? and * are allowed - but it also supports [ ] and { } for optional and repeated elements.

Set-like subtraction is supported, so you can do things like a - b to mean "a, but not b".

Here's a list of operators:

A? | [ A ]: optional A
A* | { A }: repeated A (0 or more)
A+: repeated A (1 or more)
A | B: A or B - higher precedence than A, B
A, B: A followed by B
A - B: A, but not B
A >> B: A, then B, but only return B
A << B: A, then B, but only return A
( A ): grouping - maximum precedence

And yes emojis are supported. Epsilon even has a special value for it - ε (or just use the word epsilon).

Left recursion & more

This library fully supports left recursion (either direct or indirect) - highly ambiguous grammars.

So grammars like (trivial direct left recursion):

expr = expr , "+" , expr
     | integer
     | string ;

and like (highly ambiguous):

ms = "s";
mSL = ( mSL , mSL , ms ) ? ;

mz = "z" ;
mZ = mZ | mY | mz ;

mY = mZ, mSL ;

are supported.

Our scheme is multifaceted and optimized depending upon a few factors:

Using BBNF

Since BBNF allows for one to easily grab ahold of the AST, we can optimize its structure rather easily. Left recursion is "removed" (it's actually just factored out) using a four-pass algorithm:

1. Sort the nonterminals topologically
1. Remove indirect left recursion
1. Remove left recursion
1. Factorize

For the above example each pass would look something like this:

Input grammar:

expr = expr , "+" , expr
     | integer
     | string ;

Sorted & removed left recursion:

expr = integer, expr_0
     | string, expr_0 ;
expr_0 = "+" , expr , expr_0
        | ε ;

Factorized:

expr = (integer | string) , expr_0 ;
expr_0 = "+" , expr , expr_0
        | ε

If any remaining left recursion is found, it's handled via the combinators.

Combinator support

Left recursion can be handled via two combinators: memoize and mergeMemos. mergeMemos must be applied directly one's left-recursive call, and memoize must be applied to the entire parser.

Here's an example:

...
const expression = Parser.lazy(() =>
    all(expression, operators.then(expression).opt()).mergeMemos().or(number)
)
    .memoize();
...

How this is done under the hood is by using a memoization table and a few clever tricks to detect if the current parser is in a left-recursive call. See the left recursion document for more information, and the memoization tests for examples.

Caveats

Though left recursion is supported, it's absolutely not optimal. If it can be factored out via the BBNF parser generator generally the performance will quite fine, but if it cannot you may run into some performance issues. This stems, among other things, primarily from JavaScript's lack of tail call optimization. Again, see the left recursion document for more information.

`pretty`

A pretty-printing submodule, inspired by Prettier, is included in the Rust implementation. I wanted to have something similar in Rust for my own projects, but I couldn't find anything that was simple enough to use. So I wrote my own. Check out the pretty docs for more information.

API & examples

See api.md for all of the API information.

See the test directory for fully explained and working examples.

Sources, acknowledgements, & c.

Other great parsing libraries 🎉:

Parsimmon
bread-n-butter
parsy
Chevrotain
nom
[pest] (https://github.com/pest-parser/pest)

mkbabb / parse-that Goto Github PK

parse-that's Introduction

parse[that]

Usage

Table of Contents

Performance

TypeScript

Results

hz: ops/sec - higher is better

Rust

Results

Debugging

Thanks to chalk and colored for the colors!

BBNF and the Great Parser Generator

Key differences with EBNF

Left recursion & more

Using BBNF

Combinator support

Caveats

pretty

API & examples

Sources, acknowledgements, & c.

parse-that's People

Contributors

Stargazers

Watchers

Recommend Projects

Recommend Topics

Recommend Org

parse`[that]`

`pretty`