
introduction

Owl is a parser generator which targets the class of visibly pushdown languages. It is:

  • Efficient — parsers generated by Owl always run in linear time.
  • Understandable — like regular expressions, its parsing model (and error messages[1]) can be understood without talking about parser states, backtracking, lookahead, or any other implementation details.
  • Easy to use — using Owl's interpreter mode, you can design, test, and debug your grammar without writing any code. An Owl grammar compiles to a single C header file which provides a straightforward parse tree API.

Here's a grammar for a simple programming language with expressions, assignment, and a print statement (see it online with syntax highlighting).

program = stmt*
stmt =
   'print' expr : print
   identifier '=' expr : assign
expr =
   [ '(' expr ')' ] : parens
   identifier : variable
   number : literal
 .operators prefix
   '-' : negative
 .operators infix left
   '*' : times
   '/' : divided-by
 .operators infix left
   '+' : plus
   '-' : minus

More examples are available in the example/ directory.

how to use it

You can build the owl tool from this repository using make:

$ git clone https://github.com/ianh/owl.git
$ cd owl/
$ make

Run make install to copy the owl tool into /usr/local/bin/owl.

On Windows, MSYS2 is the supported way to build owl. In the MSYS shell, run pacman -S make gcc git to install the proper dependencies before running the commands above. To install, use make PREFIX=/usr install (which copies to /usr/bin/owl instead).

Owl has two modes of operation—interpreter mode and compilation mode.

In interpreter mode, Owl reads your grammar file, then parses standard input on the fly, producing a visual representation of the parse tree as soon as you hit ^D:

$ owl test/expr.owl
1 + 2
^D
. 1            + 2
  expr:literal   expr:literal
  expr:plus------------------

You can specify a file to use as input with --input or -i:

$ echo "8 * 7" > multiply.txt
$ owl test/expr.owl -i multiply.txt
. 8            * 7
  expr:literal   expr:literal
  expr:times-----------------

You can also use Owl's interpreter on the web.

In compilation mode, Owl reads your grammar file, but doesn't parse any input right away. Instead, it generates C code with functions that let you parse the input later.

$ owl -c test/expr.owl -o parser.h

You can #include this generated parser into a C program:

#include "parser.h"

Wherever you #define OWL_PARSER_IMPLEMENTATION, the implementation of the parser will also be included. Make sure to do this somewhere in your program:

#define OWL_PARSER_IMPLEMENTATION
#include "parser.h"

For more about how to use this header, see the docs on using the generated parser.
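As a quick illustration, a program using the generated parser might look roughly like this. owl_tree_create_from_string comes from the generated header; owl_tree_print and owl_tree_destroy are assumed names here, so check the generated-parser docs for the exact API:

#define OWL_PARSER_IMPLEMENTATION
#include "parser.h"

int main(void) {
    /* Parse a string using whatever grammar was compiled into parser.h. */
    struct owl_tree *tree = owl_tree_create_from_string("1 + 2");
    owl_tree_print(tree);    /* assumed: prints the tree like interpreter mode */
    owl_tree_destroy(tree);
    return 0;
}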

rules and grammars

Rules in Owl are written like regular expressions with a few extra features. Here's a rule that matches a comma-separated list of numbers:

number-list = number (',' number)*

Note that Owl operates on tokens (like ',' and number), not individual characters like a regular expression would. Owl comes with the built-in token classes number, identifier, integer, and string; keywords 'like' 'this' match their literal text.

To create a parse tree, you can write rules that refer to each other:

variable = 'var' identifier (':' type)?
type = 'int' | 'string'

Rules can only refer to later rules, not earlier ones: plain recursion isn't allowed. Only two restricted kinds of recursion are available: guarded recursion and expression recursion.

Guarded recursion is recursion inside [ guard brackets ]. Here's a grammar to parse {"arrays", "that", {"look", "like"}, "this"}:

element = array | string
array = [ '{' element (',' element)* '}' ]

The symbols just inside the brackets — '{' and '}' here — are the begin and end tokens of the guard bracket. Begin and end tokens can't appear anywhere else in the grammar except as other begin and end tokens. This is what guarantees the language is visibly pushdown: all recursion is explicitly delineated by special symbols.

Expression recursion lets you define unary and binary operators using the .operators keyword:

expression =
    identifier | number | parens : value
  .operators prefix
    '-' : negate
  .operators infix left
    '+' : add
    '-' : subtract
parens = [ '(' expression ')' ]

Operators in the same .operators clause have the same precedence level; clauses nearer the top of the list are higher in precedence.
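For example, with the grammar above, 1 - 2 + 3 groups as (1 - 2) + 3, since '+' and '-' share a precedence level and associate to the left, while -1 + 2 groups as (-1) + 2, since the prefix clause sits higher in the list. The interpreter would display something along these lines (the exact column alignment may differ):

. 1                - 2                + 3
  expression:value   expression:value   expression:value
  expression:subtract---------------
  expression:add-----------------------------------------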

These forms of recursion may seem limiting, but you can go surprisingly far with them. For example, if you're willing to use ? and : as begin and end tokens, the C ternary operator can be written as an infix operator:

expr =
    ...
  .operators infix left
    [ '?' middle ':' ] : ternary

For a more thorough guide to Owl's grammar format, check out the grammar reference. The example/ directory also has some example grammars.

what owl isn't

non-goals

  • Parsing arbitrary context-free languages or other, more general languages.
  • Constant-space streaming-style parsing — the way Owl parses operators depends on having a syntax tree representation (plus, see the "memory use" bullet below).

current limitations

  • Large grammars — Owl uses precomputed DFAs and action tables, which can blow up in size as grammars get more complex. In the future, it would be nice to build the DFAs incrementally.
  • Memory use — generated parsers store a small amount of information (single-digit bytes) per token while parsing, in order to resolve nondeterminism. If a decision about what to match depends on a token which appears much later in the text, the parser needs to store enough information to go back and make that decision at the point the token appears. Instead of analyzing how long to wait before making these decisions, it just waits until the end, gathering data for the entire input before creating the parse tree.
  • Error messages — generated parsers can show you the token where an error happened[2], but they can't suggest how to fix it.
  • Generating code in other languages — only C is supported right now.

footnotes

[1] This includes errors caused by ambiguities, where your grammar can match the same text in two different ways. Owl shows you the text in question alongside its two conflicting matches:

error: this grammar is ambiguous

. a ( b ) 

  can be parsed in two different ways: as

. a ( b )   
  expr:call 
  stmt:expr 
  program-- 

  or as

. a          ( b )       
  expr:ident expr:parens 
  stmt:expr- stmt:expr-- 
  program--------------- 

If you want more context for why this is important, LL and LR in Context: Why Parsing Tools Are Hard is a good summary. Ambiguity is an inherent problem with context-free grammars; limiting ourselves to visibly pushdown languages lets us solve this problem (at the expense of excluding some useful grammars).


. . . . . . .

[2] To be precise, the parser reports the token where the input stopped being a valid prefix of the grammar. It's possible that the "real" error location was earlier in the input.


notes and bibliography

This article about Rust's macro system was what got me thinking about all of this in the first place.

The question I had was: what would a parser that operated directly on token trees look like? In particular, could you just use regular expressions to parse each level of the tree?

visibly pushdown languages

It turns out there's a body of research which answers this question. Combining an ordinary regular language with explicit nesting tokens gives you what's called a visibly pushdown language (or VPL), a class studied by Rajeev Alur and P. Madhusudan, and known earlier as an input-driven pushdown language.

These languages share a lot of nice properties with regular languages, including closure under union and concatenation (which are useful for composing VPL/IDPLs out of atomic pieces).

VPLs are often used for program analysis, where the nesting tokens represent calls and returns in an execution trace. There haven't been many attempts to apply this theory to text parsing, though; a presentation about visibly pushdown applicative parser combinators in Haskell is the only one I've found.

automata

To match (balanced) visibly pushdown languages, Owl uses a pair of finite-state automata, a "base automaton" and a "bracket automaton", with a few rules for moving between them. The next few paragraphs briefly describe how these automata work.

The idea is to treat an entire bracketed group of tokens ( ... ) as a single bracket symbol. The bracket automaton matches these groups, and the accepting state it ends in determines which bracket symbol the group stands for.

Each symbol is classified as one of the following:

  • the begin symbols (1 … (m — "begin tokens" that appear on the left side of a guard bracket
  • the end symbols )1 … )n — "end tokens" that appear on the right side of a guard bracket
  • the regular symbols a1 … al — all tokens except the begin and end tokens
  • the bracket symbols ()1 … ()k — these don't appear in the input, but are produced by matching against the bracket automaton.

Execution starts at the start state of the base automaton. A stack of states is used during execution; it's initialized to be empty. The automaton reads each symbol from the input and acts according to its classification:

  • a regular symbol
    • transitions normally;
  • a begin symbol
    • pushes the current state onto the stack,
    • moves to the start state of the bracket automaton,
    • and transitions normally from that start state;
  • an end symbol
    • transitions normally,
    • checks the bracket symbol corresponding to the current accepting bracket automaton state,
    • pops the current state from the stack,
    • and transitions normally from the popped state using that bracket symbol.

A sequence of input symbols is recognized if this process leaves us in an accepting state of the base automaton.
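To make the procedure concrete, here is a minimal C sketch of that loop. All of the helpers (classify, base_step, bracket_step, bracket_symbol_for, is_base_accepting) and the start-state constants are hypothetical scaffolding standing in for Owl's real tables, and balanced input is assumed:

#include <stdbool.h>
#include <stddef.h>

enum symbol_kind { REGULAR, BEGIN, END };
enum { BASE_START = 0, BRACKET_START = 0, MAX_DEPTH = 1024 };

/* Hypothetical transition tables and helpers, defined elsewhere;
   Owl's actual representation differs. */
enum symbol_kind classify(int sym);
int base_step(int state, int sym);
int bracket_step(int state, int sym);
int bracket_symbol_for(int accepting_state);
bool is_base_accepting(int state);

bool recognize(const int *input, size_t n) {
    int stack[MAX_DEPTH];
    size_t depth = 0;   /* depth > 0 means the bracket automaton is active */
    int state = BASE_START;
    for (size_t i = 0; i < n; i++) {
        int sym = input[i];
        switch (classify(sym)) {
        case REGULAR:   /* transition normally in whichever automaton is active */
            state = depth ? bracket_step(state, sym) : base_step(state, sym);
            break;
        case BEGIN:     /* push the current state, restart the bracket automaton */
            stack[depth++] = state;
            state = bracket_step(BRACKET_START, sym);
            break;
        case END: {     /* close the group, pop, and follow its bracket symbol */
            state = bracket_step(state, sym);
            int bracket_sym = bracket_symbol_for(state);
            state = stack[--depth];
            state = depth ? bracket_step(state, bracket_sym)
                          : base_step(state, bracket_sym);
            break;
        }
        }
    }
    return depth == 0 && is_base_accepting(state);
}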

Here's a quick drawing of what the two automata look like for this grammar of nested arrays: a = [ '[' (a (',' a)*)? ']' ].

[diagram: the base automaton (left) and the bracket automaton (right) for this grammar]

The square-shaped state is the labeled accepting state of the bracket automaton.

determinization

Owl's determinization is an iterative version of the usual subset construction. The first iteration ignores the bracket symbols, following only transitions involving regular, begin, and end symbols. Once accepting state sets appear in the determinized bracket automaton, the bracket symbols corresponding to the states in each set are followed simultaneously (since their appearance in a state set means they can occur together) during the next iteration of the subset construction.

Because these accepting state sets grow monotonically (we only ever find new ways of reaching them) and are bounded above in size, this iterative process eventually stops making progress. At this point, determinization is finished, and the accepting state sets in the bracket automaton become the new bracket symbols in the deterministic automata.

parse tree construction

The next step is to generate a parse tree. When Owl builds its automata from a grammar, it encodes parse tree actions (like "start operand", "set slot choice", etc.) along certain transitions. How can Owl preserve this information during determinization so it can reconstruct the list of actions?

Two papers, including Dubé and Feeley's work on action maps, describe how to do that.

The basic idea is: each transition in the deterministic automaton is derived from several transitions in the original automaton (typically with the same symbol). Collect the original transitions which produce each deterministic transition together in a table.

Then, when you find a path through the deterministic automaton, start from the end and walk backward along the path, using the table for each transition to reconstruct the original path through the original automaton.
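Here's a small C sketch of that backward walk. The table layout is illustrative, not Owl's actual data structures; merged_edges(), find_edge_ending_at(), and emit_actions() are assumed helpers:

/* Recover the original NFA path (and its parse tree actions) from a
   DFA path by walking backward from the accept state. */
struct nfa_edge { int from; int to; const void *actions; };

void reconstruct(const int *dfa_path, size_t path_len, int nfa_accept_state) {
    int nfa_state = nfa_accept_state;
    for (size_t i = path_len; i-- > 0; ) {
        /* Of the NFA edges merged into this DFA transition, pick the
           one that ends at the NFA state we need to reach. */
        struct nfa_edge e = find_edge_ending_at(merged_edges(dfa_path[i]), nfa_state);
        emit_actions(e.actions);   /* actions are recovered back to front */
        nfa_state = e.from;        /* continue walking backward */
    }
}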

Adapting this technique to Owl's iterative determinization algorithm is straightforward: since determinized bracket symbols can represent multiple non-deterministic symbols, the table has an extra field for the non-deterministic symbol, but everything else is the same.

I considered some other approaches too:

A Thompson-style interpreter would sidestep the problem by interpreting the original, non-deterministic automaton.

Each thread would have to store its own action list, which could end up taking a lot of memory for some inputs—that's why I avoided this approach in the first place. But the Dubé-Feeley-style action maps also take up a lot of memory. In retrospect, I think this approach could be worth a try.

Laurikari's tagged NFAs seem like they would work well for a limited number of sub-matches, but I couldn't think of a way to scale them up to a full parse tree.

operator-precedence grammars

A variation of Dijkstra's two-stack "shunting-yard" algorithm is used to associate operators with their operands (see src/x-construct-parse-tree.h for the implementation). I mostly just read the Wikipedia page for this one, though the original reference is apparently Dijkstra's 1961 ALGOL 60 translation report.
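For reference, here's a minimal sketch of the classic two-stack algorithm in its evaluator form; Owl's variation builds parse tree nodes instead of computing values, but the control flow is similar. This sketch handles only single-digit operands and left-associative binary + - * /, with no parentheses:

/* Minimal shunting-yard sketch: evaluates expressions like "1+2*3". */
static int prec(char op) { return (op == '+' || op == '-') ? 1 : 2; }

static void apply(double *vals, int *nv, char op) {
    double b = vals[--*nv], a = vals[--*nv];
    switch (op) {
    case '+': vals[(*nv)++] = a + b; break;
    case '-': vals[(*nv)++] = a - b; break;
    case '*': vals[(*nv)++] = a * b; break;
    case '/': vals[(*nv)++] = a / b; break;
    }
}

double evaluate(const char *s) {
    double vals[64]; char ops[64];
    int nv = 0, no = 0;
    for (; *s; s++) {
        if (*s >= '0' && *s <= '9') {
            vals[nv++] = *s - '0';
        } else {
            /* Popping operators of greater or equal precedence first
               is what makes these operators left-associative. */
            while (no > 0 && prec(ops[no - 1]) >= prec(*s))
                apply(vals, &nv, ops[--no]);
            ops[no++] = *s;
        }
    }
    while (no > 0) apply(vals, &nv, ops[--no]);
    return vals[0];   /* evaluate("1+2*3") == 7, evaluate("1-2-3") == -4 */
}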

ambiguity checking

In order to check the grammar for ambiguity, Owl creates a "product automaton" over pairs of states; a path through the product automaton corresponds to two paths in the original automaton that accept the same input. This paper was very helpful when I was figuring out how to search the product automaton for ambiguity in a systematic way.

Owl uses a variation of their "epsilon filtering" technique to create the product automaton without introducing bogus ambiguities.
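In outline, the product construction looks like this (a sketch over explicit edge lists, not Owl's implementation, with the epsilon-filtering subtlety omitted; pair_id() and add_product_edge() are assumed helpers):

/* edges[] lists the original automaton's transitions, possibly several
   per state and symbol, since the automaton may be nondeterministic. */
struct edge { int from, to, symbol; };

void build_product(const struct edge *edges, size_t num_edges) {
    for (size_t i = 0; i < num_edges; i++) {
        for (size_t j = 0; j < num_edges; j++) {
            if (edges[i].symbol != edges[j].symbol)
                continue;
            /* One run per component: both advance on the same symbol.
               Reaching an accepting pair along two runs that perform
               different parse tree actions witnesses an ambiguity. */
            add_product_edge(pair_id(edges[i].from, edges[j].from),
                             edges[i].symbol,
                             pair_id(edges[i].to, edges[j].to));
        }
    }
}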

parse tree formatting

The way that Owl's interpreter and ambiguity checker display parse trees was inspired by the icicle plot visualization in Ohm (a PEG parser).

Owl turns the icicles upside down (a stalagmite plot?) on the principle that the text itself forms the leaves of the tree.

. . . . . . .

Ian Henderson <[email protected]>
(Feel free to email me with any questions or comments!)


owl's Issues

Parser doesn't compile with multiple custom tokens

The grammar:

.token foo
.token bar

program = 'f'

Produces a parser.h that fails to compile due to defining one write_custom_token function for each custom token:

static void write_custom_token(size_t offset, size_t length, uint32_t token, uint64_t data, void *info) {
    struct owl_tree *tree = info;
    size_t token_offset = tree->next_offset;
    write_tree(tree, token_offset - tree->next_foo_token_offset);
    write_tree(tree, offset);
    write_tree(tree, length);
    write_tree(tree, data);
    tree->next_foo_token_offset = token_offset;
}
static void write_custom_token(size_t offset, size_t length, uint32_t token, uint64_t data, void *info) { // <== Redefinition of write_custom_token
    struct owl_tree *tree = info;
    size_t token_offset = tree->next_offset;
    write_tree(tree, token_offset - tree->next_bar_token_offset);
    write_tree(tree, offset);
    write_tree(tree, length);
    write_tree(tree, data);
    tree->next_bar_token_offset = token_offset;
}

Confusing start symbol and missing ambiguity warning.

Consider the following grammar:

start = kabstract | ident
kabstract = "abstract"
ident = atoz_lower+
atoz_lower = "a"|"b"|"c"|"d"|"e"|"f"|"g"|"h"|"i"|"j"|"k"|"l"|"m"|"n"|"o"|"p"|"q"|"r"|"s"|"t"|"u"|"v"|"w"|"x"|"y"|"z"

If we use the playground to parse abstract, we get the following result:

. abstract 
  kabstract

This is a little confusing to me:

  • it's not clear what the start symbol is, or how one might select it themselves. Does owl use all unused nonterminals as the alternatives of a pseudo start nonterminal symbol as the start symbol?
  • kabstract and ident are ambiguous because "abstract" can be matched via kabstract or via "a" "b" "s" "t" "r" "a" "c" "t" via ident. Why does owl not report any ambiguities here?

How would you handle optional trailing semicolons?

This is a really nice project, I am considering porting a shell scripting language to a formal owl definition and an owl-based parser as I really like the design of this project.

However, one blocker for us is that in scripting languages with an optional terminating semicolon, a line break may be used instead of an explicit ;. The hard-coded interpretation of \n as a mere token separator is already a blocking issue here, but even if it weren't, I feel as though it would require a separate class of tokens resting somewhere between whitespace and literal tokens to handle, if I'm not mistaken?

In particular, I am looking to parse syntax like

for f in foo; bar; baz; end

as being equal to

for f in foo
    bar # or bar;
    baz # or baz;
end

This example can work by replacing all linebreaks with semicolons prior to parsing (feasibility/overhead aside) but that doesn't always work, for example:

echo hello |
    cat

as some symbols are allowed to wrap on to the next line without hard-escaping the new line with a backslash, but wouldn't be accepted if it were input as echo hello |; cat as that's two separate statements.

Any ideas?

Hexadecimal input is parsed as number instead of integer

I was thinking about adding support for hexadecimal arguments in my grammar, when I noticed that OWL already parses things like "0xff". But why is every hex input parsed as a number and not an integer? In my application it messes with the type checking system, because some function might expect an integer and would refuse to accept "0xff".

For example, the grammar

input = argument{','}
argument =
  integer : int
  number : num

with the input

1, 1.2, 0xff, 0xff.f

yields:

1           , 1.2         , 0xff        , 0xff.f      
argument:int  argument:num  argument:num  argument:num

Would it be possible to distinguish hex integers and hex floating point numbers without writing a new tokenize function?

Deforested tree construction possible?

Hello @ianh,

Here you wrote:

So the last thing to do is to incrementally update the tree itself based on the change to the sequence of construction actions. I'm not sure how to do this, but I'm sure someone has written about it somewhere.

This reminds me a lot of GLR parsing. One thing that has made GLR parsing much simpler for me is not to construct an explicit tree, but to maintain a postordered list of tree events (see "Post-Order Parser Output" for a brief explanation).

A postordered list of events is inherent to LR parsing. A preordered list of events is inherent to LL parsing. I was wondering, how feasible would it be to support generating such a list of events with the approach that owl takes? That is, would it be possible to compress your construction actions into just 2 categories of events: begin_subtree_of_type_x and end_subtree_of_type_x?

If so, do you think that your approach would support a preordered or a postordered list of events?

With such a list of events, updating subtrees could be made very simple with a Piece table, which I'm currently investigating in a different project.

Identical "bracket" symbols not supported?

Consider the following grammar:

sl = [ "'" "a"* "'" ]

which is meant to model a single quote single line string ('', 'aaa', ...).

My mental model of how it would work with VPLs is the following:

sl = [ "'" "a"* "'" ]
^      ^        |  |     
|----- |        |  |
|      |        |  |
'-> state 0     |  '-> state 0 again
       |        |
       |        |
       |        '-> we see a "'", we pop to state 0
       |
       '-> state 1 (any "'" will pop to state 0)


- we start in state 0.
- we switch to state 1 on an "'".
- we do our regular parsing in state 1.
- we see a "'" in state 1, so we pop state 1 and we go back to state 0.

The goal here is to support recursive definitions like the following and, as expected, the following works:

sl = [ "<" ("a" | [ "{" sl "}" ] )* ">" ]

However, if the "bracket symbols" are the same, it doesn't work anymore:

sl = [ "'" ("a" | [ "{" sl "}" ] )* "'" ]

error: token ''' can't be used as both a start and an end keyword

  sl = [ "'" ("a" | [ "{" sl "}" ] )* "'" ]
         ~~~                          ~~~  

I'm wondering, is my mental model of VPLs wrong, and having identical "start" and "end" symbols is simply not possible, or is this a limitation of owl and not a limitation of VPLs?

'integer' token type?

Lots of programming languages support integers, and it'd be nice to support them in Owl too. If both types of tokens were used in the same grammar, both would be matched, with the longer match winning—if integer weren't used in the grammar, number would behave the same as it does now.

Unable to install (make)

>make
cc -O3 -std=c11 -pedantic -Wall -Wno-missing-braces -Wno-overlength-strings -DNOT_UNIX  -o owl src/*.c
cc: error: src/*.c: Invalid argument
cc: fatal error: no input files
compilation terminated.
make: *** [owl] Error 1

What build tools do I need to get this to work?

Out-of-bounds Read Vulnerability

Hello,

I am currently working on OpenCRS, and I used your app for testing the Vulnerability Detection module.

While doing so, I discovered an out-of-bounds read that could prove itself as a vulnerability.

Steps to reproduce

  1. Compile using the base Makefile.
  2. Run with valgrind to see the error report for the invalid read: valgrind ./owl -T -g -.

Patches

I already forked the repository and proposed a patch in the Pull Request #32

Example: Dart lexer.

I have a rough specification of the lexical structure of Dart, which I think is a VPL. It looks like it just needs a little massaging to make it a great real-world example grammar for owl.

One of my parsing systems attempts to simulate VPAs by having a set of regular automata, and a stack to track which regular automaton is active. Push and pop actions associated with certain literals switch between regular automata. The specification is designed around that idea.

Here's a snippet from my declarative specification for the lexical structure of Dart, written in Dart (it's a little verbose, I apologize):
    final key_blockcomment = LexiKey();
    final key_mldq = LexiKey();
    final key_mlsq = LexiKey();
    final key_sldq = LexiKey();
    final key_slsq = LexiKey();
    final key_rmldq = LexiKey();
    final key_rmlsq = LexiKey();
    final key_rsldq = LexiKey();
    final key_rslsq = LexiKey();
    final key_base = LexiKey();
    return LexiLexerInfoImpl(
      root: inner(
        name: "top",
        atoms: [],
        children: () {
          final lang_base = leaf(
            name: "base",
            atoms: [
              apgm(v: "'", n: "startslsq", k: key_slsq),
              apgm(v: '"', n: "startsldq", k: key_sldq),
              argm(r: r"'''((\\? )*\\?(\n|\r|\r\n))?", n: "startmlsq", k: key_mlsq),
              argm(r: r'"""((\\? )*\\?(\n|\r|\r\n))?', n: "startmldq", k: key_mldq),
              apgm(v: "r'", n: "startrslsq", k: key_rslsq),
              apgm(v: 'r"', n: "startrsldq", k: key_rsldq),
              argm(r: r"r'''((\\? )*\\?(\n|\r|\r\n))?", n: "startrmlsq", k: key_rmlsq),
              argm(r: r'r"""((\\? )*\\?(\n|\r|\r\n))?', n: "startrmldq", k: key_rmldq),
              arnm(r: r'[0-9]+([eE][+\-]?[0-9]+)?', n: "integer"),
              arnm(r: r'[0-9]*\.[0-9]+([eE][+\-]?[0-9]+)?', n: "decimal"),
              arnm(r: r'0[xX][0-9ABCDEFabcdef]+', n: "hex"),
              apgm(v: r'{', k: key_base, n: "lcur"),
              aplm(v: r'}', n: "rcur"),
            ],
            handle: key_base,
          );
          return [
            () {
              final base_identifier = arnm(r: r'[_$a-zA-Z][_$0-9a-zA-Z]*', n: "identifier");
              final base_kabstract = apnm(v: r'abstract', n: "kabstract");
              final base_kas = apnm(v: r'as', n: "kas");
              final base_kassert = apnm(v: r'assert', n: "kassert");
              final base_kasync = apnm(v: r'async', n: "kasync");
              final base_kawait = apnm(v: r'await', n: "kawait");
              final base_ksealed = apnm(v: r'sealed', n: "ksealed");
              final base_kbase = apnm(v: r'base', n: "kbase");
              final base_kwhen = apnm(v: r'when', n: "kwhen");
              final base_kbreak = apnm(v: r'break', n: "kbreak");
              final base_kcase = apnm(v: r'case', n: "kcase");
              final base_kcatch = apnm(v: r'catch', n: "kcatch");
              final base_kclass = apnm(v: r'class', n: "kclass");
              final base_kconst = apnm(v: r'const', n: "kconst");
              final base_kcontinue = apnm(v: r'continue', n: "kcontinue");
              final base_kcovariant = apnm(v: r'covariant', n: "kcovariant");
              final base_kdefault = apnm(v: r'default', n: "kdefault");
              final base_kdeferred = apnm(v: r'deferred', n: "kdeferred");
              final base_kdo = apnm(v: r'do', n: "kdo");
              final base_kdynamic = apnm(v: r'dynamic', n: "kdynamic");
              final base_kelse = apnm(v: r'else', n: "kelse");
              final base_kenum = apnm(v: r'enum', n: "kenum");
              final base_kexport = apnm(v: r'export', n: "kexport");
              final base_kextends = apnm(v: r'extends', n: "kextends");
              final base_kextension = apnm(v: r'extension', n: "kextension");
              final base_kexternal = apnm(v: r'external', n: "kexternal");
              final base_kfactory = apnm(v: r'factory', n: "kfactory");
              final base_kfalse = apnm(v: r'false', n: "kfalse");
              final base_kfinal = apnm(v: r'final', n: "kfinal");
              final base_kfinally = apnm(v: r'finally', n: "kfinally");
              final base_kfor = apnm(v: r'for', n: "kfor");
              final base_kfunction = apnm(v: r'Function', n: "kfunction");
              final base_kget = apnm(v: r'get', n: "kget");
              final base_khide = apnm(v: r'hide', n: "khide");
              final base_kif = apnm(v: r'if', n: "kif");
              final base_kimplements = apnm(v: r'implements', n: "kimplements");
              final base_kimport = apnm(v: r'import', n: "kimport");
              final base_kin = apnm(v: r'in', n: "kin");
              final base_kinterface = apnm(v: r'interface', n: "kinterface");
              final base_kis = apnm(v: r'is', n: "kis");
              final base_klate = apnm(v: r'late', n: "klate");
              final base_klibrary = apnm(v: r'library', n: "klibrary");
              final base_kmixin = apnm(v: r'mixin', n: "kmixin");
              final base_knew = apnm(v: r'new', n: "knew");
              final base_knull = apnm(v: r'null', n: "knull");
              final base_kof = apnm(v: r'of', n: "kof");
              final base_kon = apnm(v: r'on', n: "kon");
              final base_koperator = apnm(v: r'operator', n: "koperator");
              final base_kpart = apnm(v: r'part', n: "kpart");
              final base_krequired = apnm(v: r'required', n: "krequired");
              final base_krethrow = apnm(v: r'rethrow', n: "krethrow");
              final base_kreturn = apnm(v: r'return', n: "kreturn");
              final base_kset = apnm(v: r'set', n: "kset");
              final base_kshow = apnm(v: r'show', n: "kshow");
              final base_kstatic = apnm(v: r'static', n: "kstatic");
              final base_ksuper = apnm(v: r'super', n: "ksuper");
              final base_kswitch = apnm(v: r'switch', n: "kswitch");
              final base_ksync = apnm(v: r'sync', n: "ksync");
              final base_kthis = apnm(v: r'this', n: "kthis");
              final base_kthrow = apnm(v: r'throw', n: "kthrow");
              final base_ktrue = apnm(v: r'true', n: "ktrue");
              final base_ktry = apnm(v: r'try', n: "ktry");
              final base_ktypedef = apnm(v: r'typedef', n: "ktypedef");
              final base_kvar = apnm(v: r'var', n: "kvar");
              final base_kvoid = apnm(v: r'void', n: "kvoid");
              final base_kwhile = apnm(v: r'while', n: "kwhile");
              final base_kwith = apnm(v: r'with', n: "kwith");
              final base_kyield = apnm(v: r'yield', n: "kyield");
              return inner(
                name: "kw",
                atoms: [
                  base_identifier,
                  // region keywords
                  base_kabstract,
                  base_kas,
                  base_kassert,
                  base_kasync,
                  base_kawait,
                  base_ksealed,
                  base_kbase,
                  base_kwhen,
                  base_kbreak,
                  base_kcase,
                  base_kcatch,
                  base_kclass,
                  base_kconst,
                  base_kcontinue,
                  base_kcovariant,
                  base_kdefault,
                  base_kdeferred,
                  base_kdo,
                  base_kdynamic,
                  base_kelse,
                  base_kenum,
                  base_kexport,
                  base_kextends,
                  base_kextension,
                  base_kexternal,
                  base_kfactory,
                  base_kfalse,
                  base_kfinal,
                  base_kfinally,
                  base_kfor,
                  base_kfunction,
                  base_kget,
                  base_khide,
                  base_kif,
                  base_kimplements,
                  base_kimport,
                  base_kin,
                  base_kinterface,
                  base_kis,
                  base_klate,
                  base_klibrary,
                  base_kmixin,
                  base_knew,
                  base_knull,
                  base_kof,
                  base_kon,
                  base_koperator,
                  base_kpart,
                  base_krequired,
                  base_krethrow,
                  base_kreturn,
                  base_kset,
                  base_kshow,
                  base_kstatic,
                  base_ksuper,
                  base_kswitch,
                  base_ksync,
                  base_kthis,
                  base_kthrow,
                  base_ktrue,
                  base_ktry,
                  base_ktypedef,
                  base_kvar,
                  base_kvoid,
                  base_kwhile,
                  base_kwith,
                  base_kyield,
                  // endregion
                ],
                children: [
                  lang_base,
                ],
                r: LexiResolutionHandlerExplicitImpl(
                  resolvers: [
                    LexiResolverImpl(match: [base_identifier, base_kabstract], prefer: base_kabstract),
                    LexiResolverImpl(match: [base_identifier, base_kas], prefer: base_kas),
                    LexiResolverImpl(match: [base_identifier, base_kwhen], prefer: base_kwhen),
                    LexiResolverImpl(match: [base_identifier, base_ksealed], prefer: base_ksealed),
                    LexiResolverImpl(match: [base_identifier, base_kbase], prefer: base_kbase),
                    LexiResolverImpl(match: [base_identifier, base_kassert], prefer: base_kassert),
                    LexiResolverImpl(match: [base_identifier, base_kasync], prefer: base_kasync),
                    LexiResolverImpl(match: [base_identifier, base_kawait], prefer: base_kawait),
                    LexiResolverImpl(match: [base_identifier, base_kbreak], prefer: base_kbreak),
                    LexiResolverImpl(match: [base_identifier, base_kcase], prefer: base_kcase),
                    LexiResolverImpl(match: [base_identifier, base_kcatch], prefer: base_kcatch),
                    LexiResolverImpl(match: [base_identifier, base_kclass], prefer: base_kclass),
                    LexiResolverImpl(match: [base_identifier, base_kconst], prefer: base_kconst),
                    LexiResolverImpl(match: [base_identifier, base_kcontinue], prefer: base_kcontinue),
                    LexiResolverImpl(match: [base_identifier, base_kcovariant], prefer: base_kcovariant),
                    LexiResolverImpl(match: [base_identifier, base_kdefault], prefer: base_kdefault),
                    LexiResolverImpl(match: [base_identifier, base_kdeferred], prefer: base_kdeferred),
                    LexiResolverImpl(match: [base_identifier, base_kdo], prefer: base_kdo),
                    LexiResolverImpl(match: [base_identifier, base_kdynamic], prefer: base_kdynamic),
                    LexiResolverImpl(match: [base_identifier, base_kelse], prefer: base_kelse),
                    LexiResolverImpl(match: [base_identifier, base_kenum], prefer: base_kenum),
                    LexiResolverImpl(match: [base_identifier, base_kexport], prefer: base_kexport),
                    LexiResolverImpl(match: [base_identifier, base_kextends], prefer: base_kextends),
                    LexiResolverImpl(match: [base_identifier, base_kextension], prefer: base_kextension),
                    LexiResolverImpl(match: [base_identifier, base_kexternal], prefer: base_kexternal),
                    LexiResolverImpl(match: [base_identifier, base_kfactory], prefer: base_kfactory),
                    LexiResolverImpl(match: [base_identifier, base_kfalse], prefer: base_kfalse),
                    LexiResolverImpl(match: [base_identifier, base_kfinal], prefer: base_kfinal),
                    LexiResolverImpl(match: [base_identifier, base_kfinally], prefer: base_kfinally),
                    LexiResolverImpl(match: [base_identifier, base_kfor], prefer: base_kfor),
                    LexiResolverImpl(match: [base_identifier, base_kfunction], prefer: base_kfunction),
                    LexiResolverImpl(match: [base_identifier, base_kget], prefer: base_kget),
                    LexiResolverImpl(match: [base_identifier, base_khide], prefer: base_khide),
                    LexiResolverImpl(match: [base_identifier, base_kif], prefer: base_kif),
                    LexiResolverImpl(match: [base_identifier, base_kimplements], prefer: base_kimplements),
                    LexiResolverImpl(match: [base_identifier, base_kimport], prefer: base_kimport),
                    LexiResolverImpl(match: [base_identifier, base_kin], prefer: base_kin),
                    LexiResolverImpl(match: [base_identifier, base_kinterface], prefer: base_kinterface),
                    LexiResolverImpl(match: [base_identifier, base_kis], prefer: base_kis),
                    LexiResolverImpl(match: [base_identifier, base_klate], prefer: base_klate),
                    LexiResolverImpl(match: [base_identifier, base_klibrary], prefer: base_klibrary),
                    LexiResolverImpl(match: [base_identifier, base_kmixin], prefer: base_kmixin),
                    LexiResolverImpl(match: [base_identifier, base_knew], prefer: base_knew),
                    LexiResolverImpl(match: [base_identifier, base_knull], prefer: base_knull),
                    LexiResolverImpl(match: [base_identifier, base_kof], prefer: base_kof),
                    LexiResolverImpl(match: [base_identifier, base_kon], prefer: base_kon),
                    LexiResolverImpl(match: [base_identifier, base_koperator], prefer: base_koperator),
                    LexiResolverImpl(match: [base_identifier, base_kpart], prefer: base_kpart),
                    LexiResolverImpl(match: [base_identifier, base_krequired], prefer: base_krequired),
                    LexiResolverImpl(match: [base_identifier, base_krethrow], prefer: base_krethrow),
                    LexiResolverImpl(match: [base_identifier, base_kreturn], prefer: base_kreturn),
                    LexiResolverImpl(match: [base_identifier, base_kset], prefer: base_kset),
                    LexiResolverImpl(match: [base_identifier, base_kshow], prefer: base_kshow),
                    LexiResolverImpl(match: [base_identifier, base_kstatic], prefer: base_kstatic),
                    LexiResolverImpl(match: [base_identifier, base_ksuper], prefer: base_ksuper),
                    LexiResolverImpl(match: [base_identifier, base_kswitch], prefer: base_kswitch),
                    LexiResolverImpl(match: [base_identifier, base_ksync], prefer: base_ksync),
                    LexiResolverImpl(match: [base_identifier, base_kthis], prefer: base_kthis),
                    LexiResolverImpl(match: [base_identifier, base_kthrow], prefer: base_kthrow),
                    LexiResolverImpl(match: [base_identifier, base_ktrue], prefer: base_ktrue),
                    LexiResolverImpl(match: [base_identifier, base_ktry], prefer: base_ktry),
                    LexiResolverImpl(match: [base_identifier, base_ktypedef], prefer: base_ktypedef),
                    LexiResolverImpl(match: [base_identifier, base_kvar], prefer: base_kvar),
                    LexiResolverImpl(match: [base_identifier, base_kvoid], prefer: base_kvoid),
                    LexiResolverImpl(match: [base_identifier, base_kwhile], prefer: base_kwhile),
                    LexiResolverImpl(match: [base_identifier, base_kwith], prefer: base_kwith),
                    LexiResolverImpl(match: [base_identifier, base_kyield], prefer: base_kyield),
                  ],
                ),
              );
            }(),
            lang_base,
            inner(
              name: "sy",
              atoms: [
                apnm(v: r'[', n: "slbra"),
                apnm(v: r']', n: "srbra"),
                apnm(v: r'(', n: "slpar"),
                apnm(v: r')', n: "srpar"),
                apnm(v: r',', n: "scomma"),
                apnm(v: r':', n: "scolon"),
                apnm(v: r';', n: "ssemicolon"),
                apnm(v: r'&', n: "sand"),
                apnm(v: r'&=', n: "sandeq"),
                apnm(v: r'^', n: "scaret"),
                apnm(v: r'^=', n: "scareteq"),
                apnm(v: r'>', n: "sg"),
                apnm(v: r'<', n: "sl"),
                apnm(v: r'%', n: "spercent"),
                apnm(v: r'%=', n: "spercenteq"),
                apnm(v: r'+', n: "splus"),
                apnm(v: r'+=', n: "spluseq"),
                apnm(v: r'|', n: "spipe"),
                apnm(v: r'|=', n: "spipeeq"),
                apnm(v: r'??', n: "sqq"),
                apnm(v: r'??=', n: "sqqeq"),
                apnm(v: r'~/', n: "stildeslash"),
                apnm(v: r'~/=', n: "stildeslasheq"),
                apnm(v: r'*', n: "sstar"),
                apnm(v: r'*=', n: "sstareq"),
                apnm(v: r'/', n: "sslash"),
                apnm(v: r'/=', n: "sslasheq"),
                apnm(v: r'-', n: "sminus"),
                apnm(v: r'-=', n: "sminuseq"),
                apnm(v: r'=', n: "seq"),
                apnm(v: r'==', n: "seqeq"),
                apnm(v: r'&&', n: "sandand"),
                apnm(v: r'@', n: "sat"),
                apnm(v: r'.', n: "sdot"),
                apnm(v: r'..', n: "sdotdot"),
                apnm(v: r'...', n: "sdotdotdot"),
                apnm(v: r'...?', n: "sdotdotdotq"),
                apnm(v: r'--', n: "sminusminus"),
                apnm(v: r'!', n: "sbang"),
                apnm(v: r'!=', n: "sbangeq"),
                apnm(v: r'++', n: "splusplus"),
                apnm(v: r'#', n: "shash"),
                apnm(v: r'||', n: "spipepipe"),
                apnm(v: r'?', n: "squestion"),
                apnm(v: r'?.', n: "squestiondot"),
                apnm(v: r'?..', n: "squestiondotdot"),
                apnm(v: r'~', n: "stilde"),
              ],
              children: [
                lang_base,
              ],
            ),
            leaf(
              name: "blockcomment",
              handle: key_blockcomment,
              atoms: [
                argm(r: r"(\*+[^/*]+|/+[^/*]+|[^/*]+)*/*/\*", k: key_blockcomment, n: "bcstart"),
                arlm(r: r"(\*+[^/*]+|/+[^/*]+|[^/*]+)*\**\*/", n: "bcend"),
              ],
            ),
            inner(
              name: "t",
              atoms: [
                arnm(r: r'( |\t|\n|\r)+', n: "ws"),
                arnm(r: r'#![^\n\r]*(\r|\n|\r\n)?', n: "scripttag"),
                // TODO • can't parse '/**' because this will mess up parsing '/**/' I think I'll need a new language for the body of bcstart if it starts with *.
                apgm(v: r'/*', k: key_blockcomment, n: "bcinit"),
                // TODO • have a separate language for line comments to parse '///' correctly.
                arnm(r: r'//([^\n\r]+)?(\r|\n|\r\n)?', n: "singlelinenondoccomment"),
              ],
              children: [
                lang_base,
              ],
              r: const LexiResolutionHandlerExplicitImpl(
                resolvers: [],
              ),
            ),
            inner(
              name: "bsc",
              atoms: [
                arnm(r: r'\\n', n: "escaped_lf"),
                arnm(r: r'\\r', n: "escaped_cr"),
                arnm(r: r'\\b', n: "escaped_b"),
                arnm(r: r'\\t', n: "escaped_ht"),
                arnm(r: r'\\v', n: "escaped_vt"),
                arnm(r: r'\\x[0-9a-fA-F][0-9a-fA-F]', n: "escaped_byte"),
                arnm(r: r'\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]', n: "escaped_unicodesimple"),
                arnm(r: r'\\u{[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?}', n: "escaped_unicoderaw"),
                arnm(r: r'\\[^bnrtuvx]', n: "escaped_unescaped"),
                arnm(r: r'$[_a-zA-Z][_0-9a-zA-Z]*', n: "simpleinterpolation"),
                argm(r: r'${', k: key_base, n: "advancedinterpolation"),
              ],
              children: [
                leaf(
                  name: "slsq",
                  atoms: [
                    aplm(v: "'", n: "slsqpop"),
                    arnm(r: r"[^'$\\\n\r]+", n: "slsqcontent"),
                  ],
                  handle: key_slsq,
                ),
                leaf(
                  name: "sldq",
                  atoms: [
                    aplm(v: '"', n: "sldqpop"),
                    arnm(r: r'[^"$\\\n\r]+', n: "sldqcontent"),
                  ],
                  handle: key_sldq,
                ),
                leaf(
                  name: "mlsq",
                  atoms: [
                    aplm(v: "'''", n: "mlsqpop"),
                    arnm(r: r"[^'$\\]+", n: "mlsqcontent"),
                    arnm(r: r"('|'')", n: "mlsqq"),
                  ],
                  handle: key_mlsq,
                ),
                leaf(
                  name: "mldq",
                  atoms: [
                    aplm(v: '"""', n: "mldqpop"),
                    arnm(r: r'[^"$\\]+', n: "mldqcontent"),
                    arnm(r: r'("|"")', n: "mldqq"),
                  ],
                  handle: key_mldq,
                ),
              ],
            ),
            leaf(
              name: "rslsq",
              atoms: [
                aplm(v: "'", n: "rslsqend"),
                arnm(r: r"[^\n\r']+", n: "rslsqcontent"),
              ],
              handle: key_rslsq,
            ),
            leaf(
              name: "rsldq",
              atoms: [
                aplm(v: '"', n: "rsldqend"),
                arnm(r: r'[^\n\r"]+', n: "rsldqcontent"),
              ],
              handle: key_rsldq,
            ),
            leaf(
              name: "rmlsq",
              atoms: [
                aplm(v: r"'''", n: "rmlsqend"),
                arnm(r: r"(('|'')?[^'])+", n: "rmlsqcontent"),
              ],
              handle: key_rmlsq,
            ),
            leaf(
              name: "rmldq",
              atoms: [
                aplm(v: '"""', n: "rmldqend"),
                arnm(r: r'(("|"")?[^"])+', n: "rmldqcontent"),
              ],
              handle: key_rmldq,
            ),
          ];
        }(),
      ),
      start: key_base,
    );

Here's a visual representation that I use for that specification:
dot.pdf

Here's the ANTLR specification of Dart.

ANTLR uses order to disambiguate between identifiers and keywords. The way I handle the specification, only keywords and identifiers need to be disambiguated in favor of the keyword.

I plan to translate that specification to owl just to see how it works out.

token ' ' can't be used as both a whitespace and a normal keyword

Consider the grammar of smileys with an arbitrary long space as a nose:

face = eyes nose mouth
eyes =
  ':' : normal
  ';' : winking
nose =
  ' '* : space
mouth =
  ')' : smile

The playground says:

error: token ' ' can't be used as both a whitespace and a normal keyword

    ' '* : space
    ~~~         

It's not quite clear to me why a nose can't be a space. Are space characters like ' ' used as whitespace by default? Is there a way to disable any default whitespace rules?

Is the disjointness requirement for call and exit symbols really necessary?

According to #43 (comment), call and return symbols need to be disjoint.

Therefore, #38 is not possible because string literals viewed as a VPL would have an identical call and return symbol, and would not be a VPL.

However, I'm wondering, could it be that this requirement by Rajeev Alur & P. Madhusudan for VPLs is too restrictive?

I don't have an intuitive understanding of the inner workings of owl, and this might be obvious, but @ianh can you think of a counterexample where having identical call and return symbols would break any of the invariants that owl depends on?

What would happen (where would owl break) if owl ignored this requirement, and a grammar like [ '[' ... '[' ] would be accepted and compiled by owl?


non-breaking space is not a right turnstile

(@ianh I hope you don't mind me reporting any issues I find)

This one is really not urgent, but the following grammar:

foo = a 

Leads to the following error message:

error: 'ᅡᅠ' isn't a valid token

  foo = aᅡᅠ
         ~~

The space after the a is a non-breaking space (command+space), and it is being reported as a right turnstile.

Reserved words for field names

The grammar

a = identifier@return

won't compile because return is a C keyword.

We need to reserve all the C keywords, along with range and type (because these conflict with built-in fields).

Other types of numbers?

owl is fantastic, thanks for sharing it! I've started experimenting with it for creating a little language that defines a system of differential equations.

The predefined number is convenient, but I would like to have more flexible parsing of numbers. In some expressions, I want integers only. For example, in parsing the expression in an index operation such as x[k - 1], the numerical values inside the brackets must be integers. I want to treat x[k - 1.5] as invalid.

For other expressions, I would like to parse complex numbers, where the imaginary part of the number is indicated with a j, e.g. z = -1.0 + 2.5j (I'm using the Python convention for complex numbers).

Is there already an obvious way to handle these cases, or will I have to modify the tokenizer to not automatically consume numerical values as doubles? After only a brief look at the code, I suspect modifying the tokenizer would involve adding two more numerical token types, say integer and imaginary. If the sequence of characters in the input stream contains a number that is terminated by j (with no whitespace), create an imaginary token, and if the number is only digits (and not terminated by j), create an integer.

ER: Regex-defined custom tokens

It occurs to me that a C-based custom token matcher is overkill in most cases. For example:

.token hex '0x0BADCAFE' '0x248c'
.token oct '0644' '0777771'
.token bin '11001011b' '10b'

Could be completely defined in the language specification using regexes:

.token hex /0x[0-9A-Fa-f]+/
.token oct /0[1-7][0-7]*/
.token bin /[01]+b/

syntactic sugar for 'separated by'

I find myself writing some variation of RuleName ? ("," RuleName)* repeatedly, even in simple grammars. This is not a tremendous hardship, but it would be more elegant to have a "separated by" operator, similar to those that come in a lot of parser combinator libraries. For example:

FnDefinition = "function" identifier "(" identifier ? ("," identifier)* ")" Block

would instead be:

FnDefinition = "function" identifier "(" identifier % ","  ")" Block

Again, I recognize that this is merely syntactic sugar, but so are the + and ? operators, and they reduce repetition and increase readability.

Build errors and warnings

I am running PopOS (an Ubuntu derivative):

$ uname -a
Linux pop-os 5.11.0-7620-generic #21~1626191760~20.10~55de9c3-Ubuntu SMP Wed Jul 21 20:34:49 UTC  x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 10.3.0-1ubuntu1~20.10) 10.3.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

When I run make with commit db462c6, I get the following errors and warnings:

$ make
cc -O3 -std=c11 -pedantic -Wall -Wno-missing-braces -Wno-overlength-strings   -o owl src/*.c -ldl
In file included from src/owl.c:9:
src/test.h:12:5: error: unknown type name ‘pid_t’
   12 |     pid_t child;
      |     ^~~~~
In file included from src/test.c:2:
src/test.h:12:5: error: unknown type name ‘pid_t’
   12 |     pid_t child;
      |     ^~~~~
src/test.c: In function ‘spawn_child’:
src/test.c:70:19: warning: implicit declaration of function ‘fdopen’; did you mean ‘fopen’? [-Wimplicit-function-declaration]
   70 |         t->file = fdopen(fd[1], "w");
      |                   ^~~~~~
      |                   fopen
src/test.c:70:17: warning: assignment to ‘FILE *’ from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
   70 |         t->file = fdopen(fd[1], "w");
      |                 ^
src/test.c:58:5: warning: ignoring return value of ‘pipe’ declared with attribute ‘warn_unused_result’ [-Wunused-result]
   58 |     pipe(fd);
      |     ^~~~~~~~
make: *** [Makefile:24: owl] Error 1

See, for example, this Stack Overflow question for a discussion of the problem with the definition of pid_t. I can fix that by adding #include <sys/types.h> in src/test.h.

The issue with fdopen is that it is POSIX-only, and not part of C11.

hex integer

If I try to implement another "built-in token class" like hexinteger (I will follow the implementation for other builtins), would the PR be accepted?

Unfortunately, for my use case I can't use the integer matcher with 0x-prefixed values like 0xABC.

Compiler warning in generated code

FYI: I just tried re-building this owl experiment

https://github.com/WarrenWeckesser/experiments/tree/master/c/gufunc-out-expr

using 93a6b7c. I get the following compiler warning (converted to an error because of -Werror):

$ make
gcc -Wall -Werror -pedantic -std=c99 main.c -lm -o main
In file included from main.c:9:
dim-expr-parser.h: In function ‘owl_default_tokenizer_advance’:
dim-expr-parser.h:787:25: error: variable ‘string’ set but not used [-Werror=unused-but-set-variable]
  787 |             const char *string = text + content_offset;
      |                         ^~~~~~
cc1: all warnings being treated as errors
make: *** [Makefile:5: main] Error 1

The following is the generated code around line 787:

        else if (token == 4294967295U) {
            size_t content_offset = offset + 1;
            size_t content_length = token_length - 2;
            const char *string = text + content_offset;  // <<< line 787
            size_t string_length = content_length;
            if (has_escapes) {
                for (size_t i = 0;
                i < content_length;
                ++i) {
                    if (text[content_offset + i] == '\\') {
                        string_length--;
                        i++;
                    }
                }
                char *unescaped = allocate_string_contents(string_length, tokenizer->info);
                size_t j = 0;
                for (size_t i = 0;
                i < content_length;
                ++i) {
                    if (text[content_offset + i] == '\\') i++;
                    unescaped[j++] = ESCAPE_CHAR(text[content_offset + i], tokenizer->info);
                }
                string = unescaped;
            }
            IGNORE_TOKEN_WRITE(offset, token_length, string, string_length, has_escapes, tokenizer->info);
        }

This is not a major issue for me, since this is still just an experiment with owl, but I thought you'd like to know.

Potential crash after calloc due to empty node->slots

In construct_node_alloc() there is a possibly empty allocation, because number_of_slots can be 0:

owl/src/1-parse.h, lines 1675 to 1676 (commit b7f1cf4):

node->slots = calloc(number_of_slots, sizeof(size_t));
if (!node->slots) abort();

In this case the behavior of calloc() is not defined:

If nmemb or size is 0, then calloc() returns either NULL, or a unique pointer value that can later be successfully passed to free().
https://linux.die.net/man/3/calloc

I was trying to compile owl for an embedded system, but abort() was called when creating the tree. So I fixed the parser with the following sed command, considering the number of slots within the if-statement:

sed -i '' 's/(!node->slots)/(number_of_slots \&\& !node->slots)/' src/parser.h

I guess a similar fix would help to ensure a more consistent behavior across different compilers/architectures.
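For reference, after applying that sed command the patched check reads:

node->slots = calloc(number_of_slots, sizeof(size_t));
if (number_of_slots && !node->slots) abort();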

Unexpected result when getting string or identifier

Given the following text:

emit(<"a" )

calling

parsed_string_get(parsed_expr_get(parsed_arg_get(arg).expr).string).string

results in a" ) even though a was expected.

Given the following grammar:

#using owl.v4
program = stmt*
stmt =
   identifier ['(' arg* ')'] : func
   identifier '=' expr : assign
   'stream' identifier ':' identifier : creates
   'tick' identifier : ticks
   'move' move : moves
arg = '<' expr
move =
   identifier '<' expr : movel
   expr '>' identifier : mover
expr =
   [ '(' expr ')' ] : parens
   number : literal
   identifier : ident
   string : str
 .operators prefix
   '-' : negative
 .operators infix left
   '*' : times
   '/' : divided-by
 .operators infix left
   '+' : plus
   '-' : minus

Incremental parsing of visibly pushdown languages?

Hello @ianh,

I'm wondering, do you have any insights into incremental parsing where VPLs could offer some kind of advantage over, e.g., parsing context free languages using, e.g., LR/LL parsing?

Or, specifically, how would you go about making owl work in an incremental fashion, that is, if owl has parsed some source once, and only small parts of it have changed, subsequent parses should not have to reparse everything from scratch.

I'd appreciate any thoughts you might have on this.

Option to switch priority between user-defined tokens and built-in tokens

Hi there,

Awesome project. I love the strict approach taken to validation and the owl utility is really nifty and makes writing a definition for an existing language a breeze.

However I'm running into an issue that I think can be rather easily solved in the library: the current token classes are non-comprehensive, in that there are symbols that don't fall into the category number | string | identifier.

For context, I'm actually trying to use owl to model a shell scripting language, where there are two issues that don't play well with owl's current architecture: tokens not matching a reserved pattern should be treated as strings ("literal data") and line breaks are special characters that shouldn't be treated as generic whitespace (I'll file another issue for that separately).

This first issue should be solvable by simply introducing a final token class other or unmatched which should take literally anything so long as no keyword is expected at that point, and can be used to accept un-quoted content where it does not introduce ambiguities into the parser.

Documentation on Custom Tokenization

I can't find any resources on how to define a custom tokenizer other than the hex-color example. That example seems to be considerably simpler than the custom tokenizers Owl can currently support (for example, it doesn't use the tokenization info).

Additionally, knowing more details about how the default tokenizer works would be good as well. Particularly how whitespace is handled. I would have no idea how to make a whitespace parser from looking at the current examples and readme.

Can't include parser.h in C++ project due to "class" keyword

I'm trying to use owl in a C++ project. Therefore I need to include parser.h (at least without implementation) in the main.cpp, where I call e.g. owl_tree_create_from_string(). The implementation is included in an almost empty file parser.c. The C++ compiler, however, has trouble parsing parser.h while compiling main.cpp due to the keyword class, which occurs at unexpected places.

I fixed the generated parser.h with the following sed command, appending an underscore after every "class":

sed -i '' 's/class/class_/g' src/parser.h

This is a bit more than necessary, because "class" also occurs as part of longer words. But this fix works.

It might be reasonable to slightly change the naming in 1-parse.h to make it C++ compatible. After all owl works nicely in C as well as C++.

Feature request: Unicode properties

Owl is awesome, thank you!

My proposals:

  • range(cp1, cp2) or range[cp1, cp2] - cp1 and cp2 are codepoints here (hex or decimal)
  • block(name) - Unicode's block name (Basic_Latin, Latin-1_Supplement, etc.)
  • property(name) - Unicode's property name (White_Space, Hyphen, Ps, Mn, etc.)
  • script(name) - Unicode's script name (Common, Latin, etc.)

What do you think?

How would one handle a macro/define keyword, is it possible currently?

I'm trying to implement a macro keyword. Is it possible to do something like?
'def' identifier remaining_line: define

If not, where would be a good place for me to start looking to add such a feature?

Also this is a really cool project! I've tested out quite a few different parser generators and this one just blows everything out of the water so far for what I'm trying to use it for.

Shadowing of local variable

offset gets shadowed here:

output_line(out, " size_t offset = tree->next_%%rule_token_offset;");

This results in warning in most compilers. Don't know the project, but at first glance offset%%rule-index might work.
