zevv / npeg Goto Github PK

PEGs for Nim, another take

License: MIT License

Nim 100.00%

peg parser grammar regex gerexp regular-expressions parser-generator nim

npeg's Introduction

"Because friends don't let friends write parsers by hand"

NPeg is a pure Nim pattern matching library. It provides macros to compile patterns and grammars (PEGs) to Nim procedures which will parse a string and collect selected parts of the input. PEGs are not unlike regular expressions, but offer more power and flexibility, and have less ambiguities. (More about PEGs on Wikipedia)

Some use cases where NPeg is useful are configuration or data file parsers, robust protocol implementations, input validation, lexing of programming languages or domain specific languages.

Some NPeg highlights:

Grammar definitions and Nim code can be freely mixed. Nim code is embedded using the normal Nim code block syntax, and does not disrupt the grammar definition.
NPeg-generated parsers can be used both at run and at compile time.
NPeg offers various methods for tracing, optimizing and debugging your parsers.
NPeg can parse sequences of any data types, also making it suitable as a stage-two parser for lexed tokens.
NPeg can draw cool diagrams

Quickstart
Usage
- Simple patterns
- Grammars
Syntax
- Atoms
- Operators
Precedence operators
Captures
More about grammars
Error handling
Advanced topics
- Parsing other types then strings
Some notes on using PEGs
Tracing and debugging
Compile-time configuration
Tracing and debugging
Random stuff and frequently asked questions
Examples
Future directions / Todos / Roadmap / The long run

Quickstart

Here is a simple example showing the power of NPeg: The macro peg compiles a grammar definition into a parser object, which is used to match a string and place the key-value pairs into the Nim table words:

import npeg, strutils, tables

type Dict = Table[string, int]

let parser = peg("pairs", d: Dict):
  pairs <- pair * *(',' * pair) * !1
  word <- +Alpha
  number <- +Digit
  pair <- >word * '=' * >number:
    d[$1] = parseInt($2)

var words: Dict
doAssert parser.match("one=1,two=2,three=3,four=4", words).ok
echo words

Output:

{"two": 2, "three": 3, "one": 1, "four": 4}

A brief explanation of the above code:

The macro peg is used to create a parser object, which uses pairs as the initial grammar rule to match. The variable d of type Dict will be available inside the code block parser for storing the parsed data.
The rule pairs matches one pair, followed by zero or more times (*) a comma followed by a pair.
The rules word and number match a sequence of one or more (+) alphabetic characters or digits, respectively. The Alpha and Digit rules are pre-defined rules matching the character classes {'A'..'Z','a'..'z'} and {'0'..'9'}.
The rule pair matches a word, followed by an equals sign (=), followed by a number.
The word and number in the pair rule are captured with the > operator. The Nim code fragment below this rule is executed for every match, and stores the captured word and number in the words Nim table.

Usage

The patt() and peg() macros can be used to compile parser functions:

patt() creates a parser from a single anonymous pattern.
peg() allows the definition of a set of (potentially recursive) rules making up a complete grammar.

The result of these macros is an object of the type Parser which can be used to parse a subject:

proc match(p: Parser, s: string) = MatchResult
proc matchFile(p: Parser, fname: string) = MatchResult

The above match functions returns an object of the type MatchResult:

MatchResult = object
  ok: bool
  matchLen: int
  matchMax: int
  ...

ok: A boolean indicating if the matching succeeded without error. Note that a successful match does not imply that all of the subject was matched, unless the pattern explicitly matches the end-of-string.
matchLen: The number of input bytes of the subject that successfully matched.
matchMax: The highest index into the subject that was reached during parsing, even if matching was backtracked or did not succeed. This offset is usually a good indication of the location where the matching error occurred.

The string captures made during the parsing can be accessed with:

proc captures(m: MatchResult): seq[string]

Simple patterns

A simple pattern can be compiled with the patt macro.

For example, the pattern below splits a string by white space:

let parser = patt *(*' ' * > +(1-' '))
echo parser.match("   one two three ").captures

Output:

@["one", "two", "three"]

The patt macro can take an optional code block which is used as code block capture for the pattern:

var key, val: string
let p = patt >+Digit * "=" * >+Alpha:
  (key, val) = ($1, $2)

assert p.match("15=fifteen").ok
echo key, " = ", val

Grammars

The peg macro provides a method to define (recursive) grammars. The first argument is the name of initial patterns, followed by a list of named patterns. Patterns can now refer to other patterns by name, allowing for recursion:

let parser = peg "ident":
  lower <- {'a'..'z'}
  ident <- *lower
doAssert parser.match("lowercaseword").ok

The order in which the grammar patterns are defined affects the generated parser. Although NPeg could always reorder, this is a design choice to give the user more control over the generated parser:

when a pattern P1 refers to pattern P2 which is defined before P1, P2 will be inlined in P1. This increases the generated code size, but generally improves performance.
when a pattern P1 refers to pattern P2 which is defined after P1, P2 will be generated as a subroutine which gets called from P1. This will reduce code size, but might also result in a slower parser.

Syntax

The NPeg syntax is similar to normal PEG notation, but some changes were made to allow the grammar to be properly parsed by the Nim compiler:

NPeg uses prefixes instead of suffixes for *, +, - and ?.
Ordered choice uses | instead of / because of operator precedence.
The explicit * infix operator is used for sequences.

NPeg patterns and grammars can be composed from the following parts:

Atoms:

   0              # matches always and consumes nothing
   1              # matches any character
   n              # matches exactly n characters
  'x'             # matches literal character 'x'
  "xyz"           # matches literal string "xyz"
 i"xyz"           # matches literal string, case insensitive
  {'x'..'y'}      # matches any character in the range from 'x'..'y'
  {'x','y','z'}   # matches any character from the set

Operators:

   P1 * P2        # concatenation
   P1 | P2        # ordered choice
   P1 - P2        # matches P1 if P2 does not match
  (P)             # grouping
  !P              # matches everything but P
  &P              # matches P without consuming input
  ?P              # matches P zero or one times
  *P              # matches P zero or more times
  +P              # matches P one or more times
  @P              # search for P
   P[n]           # matches P n times
   P[m..n]        # matches P m to n times

Precedence operators:

  P ^ N           # P is left associative with precedence N
  P ^^ N          # P is right associative with precedence N

String captures:  

  >P              # Captures the string matching  P 

Back references:

  R("tag", P)     # Create a named reference for pattern P
  R("tag")        # Matches the given named reference

Error handling:

  E"msg"          # Raise an `NPegParseError` exception

In addition to the above, NPeg provides the following built-in shortcuts for common atoms, corresponding to POSIX character classes:

  Alnum  <- {'A'..'Z','a'..'z','0'..'9'}, # Alphanumeric characters
  Alpha  <- {'A'..'Z','a'..'z'},          # Alphabetic characters
  Blank  <- {' ','\t'},                   # Space and tab
  Cntrl  <- {'\x00'..'\x1f','\x7f'},      # Control characters
  Digit  <- {'0'..'9'},                   # Digits
  Graph  <- {'\x21'..'\x7e'},             # Visible characters
  Lower  <- {'a'..'z'},                   # Lowercase characters
  Print  <- {'\x21'..'\x7e',' '},         # Visible characters and spaces
  Space  <- {'\9'..'\13',' '},            # Whitespace characters
  Upper  <- {'A'..'Z'},                   # Uppercase characters
  Xdigit <- {'A'..'F','a'..'f','0'..'9'}, # Hexadecimal digits

Atoms

Atoms are the basic building blocks for a grammar, describing the parts of the subject that should be matched.

Integer literal: 0 / 1 / n

The int literal atom n matches exactly n number of bytes. 0 always matches, but does not consume any data.
Character and string literals: 'x' / "xyz" / i"xyz"

Characters and strings are literally matched. If a string is prefixed with i, it will be matched case insensitive.
Character sets: {'x','y'}

Characters set notation is similar to native Nim. A set consists of zero or more comma separated characters or character ranges.
```
 {'x'..'y'}    # matches any character in the range from 'x'..'y'
 {'x','y','z'} # matches any character from the set 'x', 'y', and 'z'
```
The set syntax {} is flexible and can take multiple ranges and characters in one expression, for example {'0'..'9','a'..'f','A'..'F'}.

Operators

NPeg provides various prefix and infix operators. These operators combine or transform one or more patterns into expressions, building larger patterns.

Concatenation: P1 * P2
```
o──[P1]───[P2]──o
```
The pattern P1 * P2 returns a new pattern that matches only if first P1 matches, followed by P2.

For example, "foo" * "bar" would only match the string "foobar".

Note: As an alternative for the * asterisk, the unicode glyph ∙ ("bullet operator", 0x2219) can also be used for concatenation.
Ordered choice: P1 | P2
```
o─┬─[P1]─┬─o
  ╰─[P2]─╯
```
The pattern P1 | P2 tries to first match pattern P1. If this succeeds, matching will proceed without trying P2. Only if P1 can not be matched, NPeg will backtrack and try to match P2 instead. Once either P1 or P2 has matched, the choice will be final ("commited"), and no more backtracking will be possible for this choice.

For example ("foo" | "bar") * "fizz" would match both "foofizz" and "barfizz".

NPeg optimizes the | operator for characters and character sets: The pattern 'a' | 'b' | 'c' will be rewritten to a character set {'a','b','c'}.
Difference: P1 - P2

The pattern P1 - P2 matches P1 only if P2 does not match. This is equivalent to !P2 * P1:
```
   ━━━━
o──[P2]─»─[P1]──o
```
NPeg optimizes the - operator for characters and character sets: The pattern {'a','b','c'} - 'b' will be rewritten to the character set {'a','c'}.
Grouping: (P)

Brackets are used to group patterns similar to normal arithmetic expressions.
Not-predicate: !P
```
   ━━━
o──[P]──o
```
The pattern !P returns a pattern that matches only if the input does not match P. In contrast to most other patterns, this pattern does not consume any input.

A common usage for this operator is the pattern !1, meaning "only succeed if there is not a single character left to match" - which is only true for the end of the string.
And-predicate: &P
```
   ━━━
   ━━━
o──[P]──o
```
The pattern &P matches only if the input matches P, but will not consume any input. This is equivalent to !!P. This is denoted by a double negation in the railroad diagram, which is not very pretty unfortunately.
Optional: ?P
```
  ╭──»──╮
o─┴─[P]─┴─o
```
The pattern ?P matches if P can be matched zero or more times, so essentially succeeds if P either matches or not.

For example, ?"foo" * bar" matches both "foobar" and "bar".
Match zero or more times: *P
```
  ╭───»───╮
o─┴┬─[P]─┬┴─o
   ╰──«──╯
```
The pattern *P tries to match as many occurrences of pattern P as possible - this operator always behaves greedily.

For example, *"foo" * "bar" matches "bar", "fooboar", "foofoobar", etc.
Match one or more times: +P
```
o─┬─[P]─┬─o
  ╰──«──╯
```
The pattern +P matches P at least once, but also more times. It is equivalent to the P * *P - this operator always behave greedily.
Search: @P

This operator searches for pattern P using an optimized implementation. It is equivalent to s <- *(1 - P) * P, which can be read as "try to match as many characters as possible not matching P, and then match P:
```
  ╭─────»─────╮
  │  ━━━      │
o─┴┬─[P]─»─1─┬┴»─[P]──o
   ╰────«────╯
```
Note that this operator does not allow capturing the skipped data up to the match; if this is required you can manually construct a grammar to do this.
Match exactly n times: P[n]

The pattern P[n] matches P exactly n times.

For example, "foo"[3] only matches the string "foofoofoo":
```
o──[P]─»─[P]─»─[P]──o
```
Match m to n times: P[m..n]

The pattern P[m..n] matches P at least m and at most n times.

For example, "foo[1,3]" matches "foo", "foofoo" and "foofoofo":
```
        ╭──»──╮ ╭──»──╮
o──[P]─»┴─[P]─┴»┴─[P]─┴─o
```

Precedence operators

Note: This is an experimental feature, the implementation or API might change in the future.

Precedence operators allows for the construction of "precedence climbing" or "Pratt parsers" with NPeg. The main use for this feature is building parsers for programming languages that follow the usual precedence and associativity rules of arithmetic expressions.

Left associative precedence of N: P ^ N

   <1<   
o──[P]──o

Right associative precedence of N: P ^^ N

   >1> 
o──[P]──o

During parsing NPeg keeps track of the current precedence level of the parsed expression - the default is 0 if no precedence has been assigned yet. When the ^ operator is matched, either one of the next three cases applies:

P ^ N where N > 0 and N is lower then the current precedence: in this case the current precedence is set to N and parsing of pattern P continues.
P ^ N where N > 0 and N is higher or equal then the current precedence: parsing will fail and backtrack.
P ^ 0: resets the current precedence to 0 and continues parsing. This main use case for this is parsing sub-expressions in parentheses.

The heart of a Pratt parser in NPeg would look something like this:

exp <- prefix * *infix

parenExp <- ( "(" * exp * ")" ) ^ 0

prefix <- number | parenExp

infix <- {'+','-'}    * exp ^  1 |
         {'*','/'}    * exp ^  2 |
         {'^'}        * exp ^^ 3:

More extensive documentation will be added later, for now take a look at the example in tests/precedence.nim.

Captures

     ╭╶╶╶╶╶╮
s o────[P]────o
     ╰╶╶╶╶╶╯

NPeg supports a number of ways to capture data when parsing a string. The various capture methods are described here, including a concise example.

The capture examples below build on the following small PEG, which parses a comma separated list of key-value pairs:

const data = "one=1,two=2,three=3,four=4"

let parser = peg "pairs":
  pairs <- pair * *(',' * pair) * !1
  word <- +Alpha
  number <- +Digit
  pair <- word * '=' * number

let r = parser.match(data)

String captures

The basic method for capturing is marking parts of the peg with the capture prefix >. During parsing NPeg keeps track of all matches, properly discarding any matches which were invalidated by backtracking. Only when parsing has fully succeeded it creates a seq[string] of all matched parts, which is then returned in the MatchData.captures field.

In the example, the > capture prefix is added to the word and number rules, causing the matched words and numbers to be appended to the result capture seq[string]:

let parser = peg "pairs":
  pairs <- pair * *(',' * pair) * !1
  word <- +Alpha
  number <- +Digit
  pair <- >word * '=' * >number

let r = parser.match(data)

The resulting list of captures is now:

@["one", "1", "two", "2", "three", "3", "four", "4"]

Code block captures

Code block captures offer the most flexibility for accessing matched data in NPeg. This allows you to define a grammar with embedded Nim code for handling the data during parsing.

Note that for code block captures, the Nim code gets executed during parsing, even if the match is part of a pattern that fails and is later backtracked.

When a grammar rule ends with a colon :, the next indented block in the grammar is interpreted as Nim code, which gets executed when the rule has been matched. Any string captures that were made inside the rule are available to the Nim code in the injected variable capture[] of type seq[Capture]:

type Capture = object
  s*: string      # The captured string
  si*: int        # The index of the captured string in the subject

The total subject matched by the code block rule is available in capture[0] Any additional explicit > string captures made by the rule or any of its child rules will be available as capture[1], capture[2], ...

For convenience there is syntactic sugar available in the code block capture blocks:

The variables $0 to $9 are rewritten to capture[n].s and can be used to access the captured strings. The $ operator uses then usual Nim precedence, thus these variables might need parentheses or different ordering in some cases, for example $1.parseInt should be written as parseInt($1).
The variables @0 to @9 are rewritten to capture[n].si and can be used to access the offset in the subject of the matched captures.

Example:

let p = peg foo:
  foo <- >(1 * >1) * 1:
    echo "$0 = ", $0
    echo "$1 = ", $1
    echo "$2 = ", $2
       
echo p.match("abc").ok

Will output

$0 = abc
$1 = ab
$2 = b

Code block captures consume all embedded string captures, so these captures will no longer be available after matching.

A code block capture can also produce captures by calling the push(s: string) function from the code block. Note that this is an experimental feature and that the API might change in future versions.

The example has been extended to capture each word and number with the > string capture prefix. When the pair rule is matched, the attached code block is executed, which adds the parsed key and value to the words table.

from strutils import parseInt
var words = initTable[string, int]()

let parser = peg "pairs":
  pairs <- pair * *(',' * pair) * !1
  word <- +Alpha
  number <- +Digit
  pair <- >word * '=' * >number:
    words[$1] = parseInt($2)

let r = parser.match(data)

After the parsing finished, the words table will now contain:

{"two": 2, "three": 3, "one": 1, "four": 4}

Custom match validations

Code block captures can be used for additional validation of a captured string: the code block can call the functions fail() or validate(bool) to indicate if the match should succeed or fail. Failing matches are handled as if the capture itself failed and will result in the usual backtracking. When the fail() or validate() functions are not called, the match will succeed implicitly.

For example, the following rule will check if a passed number is a valid uint8 number:

uint8 <- >Digit[1..3]:
  let v = parseInt($a)
  validate v>=0 and v<=255

The following grammar will cause the whole parse to fail when the error rule matches:

error <- 0:
  fail()

Note: The Nim code block is running within the NPeg parser context and in theory could access to its internal state - this could be used to create custom validator/matcher functions that can inspect the subject string, do lookahead or lookback, and adjust the subject index to consume input. At the time of writing, NPeg lacks a formal API or interface for this though, and I am not sure yet what this should look like - If you are interested in doing this, contact me so we can discuss the details.

Passing state

NPeg allows passing of data of a specific type to the match() function, this value is then available inside code blocks as a variable. This mitigates the need for global variables for storing or retrieving data in access captures.

The syntax for passing data in a grammar is:

peg(name, identifier: Type)

For example, the above parser can be rewritten as such:

type Dict = Table[string, int]

let parser = peg("pairs", userdata: Dict):
  pairs <- pair * *(',' * pair) * !1
  word <- +Alpha
  number <- +Digit
  pair <- >word * '=' * >number:
    userdata[$1] = parseInt($2)

var words: Dict
let r = parser.match(data, words)

Backreferences

Backreferences allow NPeg to match an exact string that matched earlier in the grammar. This can be useful to match repetitions of the same word, or for example to match so called here-documents in programming languages.

For this, NPeg offers the R operator with the following two uses:

The R(name, P) pattern creates a named reference for pattern P which can be referred to by name in other places in the grammar.
The pattern R(name) matches the contents of the named reference that earlier been stored with R(name, P) pattern.

For example, the following rule will match only a string which will have the same character in the first and last position:

patt R("c", 1) * *(1 - R("c")) * R("c") * !1

The first part of the rule R("c", 1) will match any character, and store this in the named reference c. The second part will match a sequence of zero or more characters that do not match reference c, followed by reference c.

More about grammars

Ordering of rules in a grammar

Repetitive inlining of rules might cause a grammar to grow too large, resulting in a huge executable size and slow compilation. NPeg tries to mitigate this in two ways:

Patterns that are too large will not be inlined, even if the above ordering rules apply.
NPeg checks the size of the total grammar, and if it thinks it is too large it will fail compilation with the error message NPeg: grammar too complex.

Check the section "Compile-time configuration" below for more details about too complex grammars.

The parser size and performance depends on many factors; when performance and/or code size matters, it pays to experiment with different orderings and measure the results.

When in doubt, check the generated parser instructions by compiling with the -d:npegTrace or -d:npegDotDir flags - see the section Tracing and Debugging for more information.

At this time the upper limit is 4096 rules, this might become a configurable number in a future release.

For example, the following grammar will not compile because recursive inlining will cause it to expand to a parser with more then 4^6 = 4096 rules:

let p = peg "z":
  f <- 1
  e <- f * f * f * f
  d <- e * e * e * e
  c <- d * d * d * d
  b <- c * c * c * c
  a <- b * b * b * b
  z <- a * a * a * a

The fix is to change the order of the rules so that instead of inlining NPeg will use a calling mechanism:

let p = peg "z":
  z <- a * a * a * a
  a <- b * b * b * b
  b <- c * c * c * c
  c <- d * d * d * d
  d <- e * e * e * e
  e <- f * f * f * f
  f <- 1

When in doubt check the generated parser instructions by compiling with the -d:npegTrace flag - see the section Tracing and Debugging for more information.

Templates, or parameterized rules

When building more complex grammars you may find yourself duplicating certain constructs in patterns over and over again. To avoid code repetition (DRY), NPeg provides a simple mechanism to allow the creation of parameterized rules. In good Nim-fashion these rules are called "templates". Templates are defined just like normal rules, but have a list of arguments, which are referred to in the rule. Technically, templates just perform a basic search-and-replace operation: every occurrence of a named argument is replaced by the exact pattern passed to the template when called.

For example, consider the following grammar:

numberList <- +Digit * *( ',' * +Digit)
wordList <- +Alpha * *( ',' * +Alpha)

This snippet uses a common pattern twice for matching lists: p * *( ',' * p). This matches pattern p, followed by zero or more occurrences of a comma followed by pattern p. For example, numberList will match the string 1,22,3.

The above example can be parameterized with a template like this:

commaList(item) <- item * *( ',' * item )
numberList <- commaList(+Digit)
wordList <- commaList(+Alpha)

Here the template commaList is defined, and any occurrence of its argument 'item' will be replaced with the patterns passed when calling the template. This template is used to define the more complex patterns numberList and wordList.

Templates may invoke other templates recursively; for example the above can even be further generalized:

list(item, sep) <- item * *( sep * item )
commaList(item) <- list(item, ',')
numberList <- commaList(+Digit)
wordList <- commaList(+Alpha)

Composing grammars with libraries

For simple grammars it is usually fine to build all patterns from scratch from atoms and operators, but for more complex grammars it makes sense to define reusable patterns as basic building blocks.

For this, NPeg keeps track of a global library of patterns and templates. The grammar macro can be used to add rules or templates to this library. All patterns in the library will be stored with a qualified identifier in the form libraryname.patternname, by which they can be referred to at a later time.

For example, the following fragment defines three rules in the library with the name number. The rules will be stored in the global library and are referred to in the peg by their qualified names number.dec, number.hex and number.oct:

grammar "number":
  dec <- {'1'..'9'} * *{'0'..'9'}
  hex <- i"0x" * +{'0'..'9','a'..'f','A'..'F'}
  oct <- '0' * *{'0'..'9'}

let p = peg "line":
  line <- int * *("," * int)
  int <- number.dec | number.hex | number.oct

let r = p.match("123,0x42,0644")

NPeg offers a number of pre-defined libraries for your convenience, these can be found in the npeg/lib directory. A library an be imported with the regular Nim import statement, all rules defined in the imported file will then be added to NPeg's global pattern library. For example:

import npeg/lib/uri

Note that templates defined in libraries do not implicitly bind the the rules from that grammar; instead, you need to explicitly qualify the rules used in the template to refer to the grammar. For example:

grammar "foo":
  open <- "("
  close <- ")"
  inBrackets(body): foo.open * body * foo.close

Library rule overriding/shadowing

To allow the user to add custom captures to imported grammars or rules, it is possible to override or shadow an existing rule in a grammar.

Overriding will replace the rule from the library with the provided new rule, allowing the caller to change parts of an imported grammar. A overridden rule is allowed to reference the original rule by name, which will cause the new rule to shadow the original rule. This will effectively rename the original rule and replace it with the newly defined rule which will call the original referred rule.

For example, the following snippet will reuse the grammar from the uri library and capture some parts of the URI in a Nim object:

import npeg/lib/uri

type Uri = object
  host: string
  scheme: string
  path: string
  port: int

var myUri: Uri

let parser = peg "line":
  line <- uri.URI
  uri.scheme <- >uri.scheme: myUri.scheme = $1
  uri.host <- >uri.host:     myUri.host = $1
  uri.port <- >uri.port:     myUri.port = parseInt($1)
  uri.path <- >uri.path:     myUri.path = $1

echo parser.match("http://nim-lang.org:8080/one/two/three")
echo myUri  # --> (host: "nim-lang.org", scheme: "http", path: "/one/two/three", port: 8080)

Error handling

NPeg offers a number of ways to handle errors during parsing a subject string; what method best suits your parser depends on your requirements.

MatchResult

The most simple way to handle errors is to inspect the MatchResult object that is returned by the match() proc:

MatchResult = object
  ok: bool
  matchLen: int
  matchMax: int

The ok field in the MatchResult indicates if the parser was successful: when the complete pattern has been matched this value will be set to true, if the complete pattern did not match the subject the value will be false.

In addition to the ok field, the matchMax field indicates the maximum offset into the subject the parser was able to match the string. If the matching succeeded matchMax equals the total length of the subject, if the matching failed, the value of matchMax is usually a good indication of where in the subject string the error occurred:

let a = patt 4
let r = a.match("123")
if not r.ok:
  echo "Parsing failed at position ", r.matchMax

NpegParseError exceptions

When, during matching, the parser reaches an E"message" atom in the grammar, NPeg will raise an NPegParseError exception with the given message. The typical use case for this atom is to be combine with the ordered choice | operator to generate helpful error messages. The following example illustrates this:

let parser = peg "list":
  list <- word * *(comma * word) * !1
  word <- +Alpha | E"expected word"
  comma <- ',' | E"expected comma"

try:
  echo parser.match("one,two;three")
except NPegParseError as e:
  echo "Parsing failed at position ", e.matchMax, ": ", e.msg

The rule comma tries to match the literal ','. If this can not be matched, the rule E"expected comma" will match instead, where E will raise an NPegParseError exception.

The NPegParseError type contains the same two fields as MatchResult to indicate where in the subject string the match failed: matchLen and matchMax, which can be used as an indication of the location of the parse error:

Parsing failed at position 7: expected comma

Other exceptions

NPeg can raise a number of other exception types during parsing:

NPegParseError: described in the previous section
NPegStackOverflowError: a stack overflow occured in the backtrace or call stack; this is usually an indication of a faulty or too complex grammar.
NPegUnknownBackrefError: An unknown back reference identifier is used in an R() rule.
NPegCaptureOutOfRangeError: A code block capture tries to access a capture that is not available using the $ notation or by accessing the capture[] seq.

All the above errors are inherited from the generic NPegException object.

Parser stack trace

If an exception is raised from within an NPeg parser - either by the E atom or by nim code in a code block capture - NPeg will augment the Nim stack trace with frames indicating where in the grammar the exception occured.

The above example will generate the following stack trace, note the last two entries which are added by NPeg and show the rules in which the exception occured:

/tmp/list.nim(9)         list
./npeg/src/npeg.nim(142) match
./npeg/src/npeg.nim(135) match
/tmp/flop.nim(4)         list <- word * *(comma * word) * eof
/tmp/flop.nim(7)         word <- +{'a' .. 'z'} | E"expected word"
Error: unhandled exception: Parsing error at #14: "expected word" [NPegParseError]

Note: this requires Nim 'devel' or version > 1.6.x; on older versions you can use -d:npegStackTrace to make NPeg dump the stack to stdout.

Advanced topics

Parsing other types then strings

Note: This is an experimental feature, the implementation or API might change in the future.

NPeg was originally designed to parse strings like a regular PEG engine, but has since evolved into a generic parser that can parse any subject of type openArray[T]. This section describes how to use this feature.

The peg() macro must be passed an additional argument specifying the base type T of the subject; the generated parser will then parse a subject of type openArray[T]. When not given, the default type is char, and the parser parsers openArray[char], or more typically, string.
When matching non-strings, some of the usual atoms like strings or character sets do not make sense in a grammar, instead the grammar uses literal atoms. Literals can be specified in square brackets and are interpreted as any Nim code: [foo], [1+1] or ["foo"] are all valid literals.
When matching non-strings, captures will be limited to only a single element of the base type, as this makes more sense when parsing a token stream.

For an example of this feature check the example in tests/lexparse.nim - this implements a classic parser with separate lexing and parsing stages.

Some notes on using PEGs

Anchoring and searching

Unlike regular expressions, PEGs are always matched in anchored mode only: the defined pattern is matched from the start of the subject string. For example, the pattern "bar" does not match the string "foobar".

To search for a pattern in a stream, a construct like this can be used:

p <- "bar"
search <- p | 1 * search

The above grammar first tries to match pattern p, or if that fails, matches any character 1 and recurs back to itself. Because searching is a common operation, NPeg provides the builtin @P operator for this.

Complexity and performance

Although it is possible to write patterns with exponential time complexity for NPeg, they are much less common than in regular expressions, thanks to the limited backtracking. In particular, patterns written without grammatical rules always have a worst-case time O(n^k) (and space O(k), which is constant for a given pattern), where k is the pattern's star height. Moreover, NPeg has a simple and clear performance model that allows programmers to understand and predict the time complexity of their patterns. The model also provides a firm basis for pattern optimizations.

(Adapted from Ierusalimschy, "A Text Pattern-Matching Tool based on Parsing Expression Grammars", 2008)

End of string

PEGs do not care what is in the subject string after the matching succeeds. For example, the rule "foo" happily matches the string "foobar". To make sure the pattern matches the end of string, this has to be made explicit in the pattern.

The idiomatic notation for this is !1, meaning "only succeed if there is not a single character left to match" - which is only true for the end of the string.

Non-consuming atoms and captures

The lookahead(&) and not(!) operators may not consume any input, and make sure that after matching the internal parsing state of the parser is reset to as is was before the operator was started, including the state of the captures. This means that any captures made inside a & and ! block also are discarded. It is possible however to capture the contents of a non-consuming block with a code block capture, as these are always executed, even when the parser state is rolled back afterwards.

Left recursion

NPeg does not support left recursion (this applies to PEGs in general). For example, the rule

A <- A | 'a'

will cause an infinite loop because it allows for left-recursion of the non-terminal A.

Similarly, the grammar

A <- B | 'a' A
B <- A

is problematic because it is mutually left-recursive through the non-terminal B.

Note that loops of patterns that can match the empty string will not result in the expected behavior. For example, the rule *0 will cause the parser to stall and go into an infinite loop.

UTF-8 / Unicode

NPeg has no built-in support for Unicode or UTF-8, instead is simply able to parse UTF-8 documents just as like any other string. NPeg comes with a simple UTF-8 grammar library which should simplify common operations like matching a single code point or character class. The following grammar splits an UTF-8 document into separate characters/glyphs by using the utf8.any rule:

import npeg/lib/utf8

let p = peg "line":
  line <- +char
  char <- >utf8.any

let r = p.match("γνωρίζω")
echo r.captures()   # --> @["γ", "ν", "ω", "ρ", "ί", "ζ", "ω"]

Tracing and debugging

Syntax diagrams

When compiled with -d:npegGraph, NPeg will dump syntax diagrams (also known as railroad diagrams) for all parsed rules.

Syntax diagrams are sometimes helpful to understand or debug a grammar, or to get more insight in a grammars' complexity.

                              ╭─────────»──────────╮                     
                              │      ╭─────»──────╮│                     
                ╭╶╶╶╶╶╶╶╶╶╶╮  │      │  ━━━━      ││         ╭╶╶╶╶╶╶╶╮   
inf o──"INF:"─»───[number]───»┴─","─»┴┬─[lf]─»─1─┬┴┴»─[lf]─»───[url]────o
                ╰╶╶╶╶╶╶╶╶╶╶╯          ╰────«─────╯           ╰╶╶╶╶╶╶╶╯

Optionals (?) are indicated by a forward arrow overhead.
Repeats ('+') are indicated by a backwards arrow underneath.
Literals (strings, chars, sets) are printed in purple.
Non-terminals are printed in cyan between square brackets.
Not-predicates (!) are overlined in red. Note that the diagram does not make it clear that the input for not-predicates is not consumed.
Captures are boxed in a gray rectangle, optionally including the capture name.

Here is a a larger example of an URL parser.

Grammar graphs

NPeg can generate a graphical representation of a grammar to show the relations between rules. The generated output is a .dot file which can be processed by the Graphviz tool to generate an actual image file.

When compiled with -d:npegDotDir=<PATH>, NPeg will generate a .dot file for each grammar in the code and write it to the given directory.

Edge colors represent the rule relation: grey=inline, blue=call, green=builtin
Rule colors represent the relative size/complexity of a rule: black=<10, orange=10..100, red=>100

Large rules result in larger generated code and slow compile times. Rule size can generally be decreased by changing the rule order in a grammar to allow NPeg to call rules instead of inlining them.

Tracing

When compiled with -d:npegTrace, NPeg will dump its intermediate representation of the compiled PEG, and will dump a trace of the execution during matching. These traces can be used for debugging or optimization of a grammar.

For example, the following program:

let parser = peg "line":
  space <- ' '
  line <- word * *(space * word)
  word <- +{'a'..'z'}

discard parser.match("one two")

will output the following intermediate representation at compile time. From the IR it can be seen that the space rule has been inlined in the line rule, but that the word rule has been emitted as a subroutine which gets called from line:

line:
   0: line           opCall 6 word        word
   1: line           opChoice 5           *(space * word)
   2:  space         opStr " "            ' '
   3: line           opCall 6 word        word
   4: line           opPartCommit 2       *(space * word)
   5:                opReturn

word:
   6: word           opSet '{'a'..'z'}'   {'a' .. 'z'}
   7: word           opSpan '{'a'..'z'}'  +{'a' .. 'z'}
   8:                opReturn

At runtime, the following trace is generated. The trace consists of a number of columns:

The current instruction pointer, which maps to the compile time dump.
The index into the subject.
The substring of the subject.
The name of the rule from which this instruction originated.
The instruction being executed.
The backtrace stack depth.

  0|  0|one two                 |line           |call -> word:6                          |
  6|  0|one two                 |word           |set {'a'..'z'}                          |
  7|  1|ne two                  |word           |span {'a'..'z'}                         |
  8|  3| two                    |               |return                                  |
  1|  3| two                    |line           |choice -> 5                             |
  2|  3| two                    | space         |chr " "                                 |*
  3|  4|two                     |line           |call -> word:6                          |*
  6|  4|two                     |word           |set {'a'..'z'}                          |*
  7|  5|wo                      |word           |span {'a'..'z'}                         |*
  8|  7|                        |               |return                                  |*
  4|  7|                        |line           |pcommit -> 2                            |*
  2|  7|                        | space         |chr " "                                 |*
   |  7|                        |               |fail                                    |*
  5|  7|                        |               |return (done)                           |

The exact meaning of the IR instructions is not discussed here.

Compile-time configuration

NPeg has a number of configurable setting which can be configured at compile time by passing flags to the compiler. The default values should be ok in most cases, but if you ever run into one of those limits you are free to configure those to your liking:

-d:npegPattMaxLen=N This is the maximum allowed length of NPeg's internal representation of a parser, before it gets translated to Nim code. The reason to check for an upper limit is that some grammars can grow exponentially by inlining of patterns, resulting in slow compile times and oversized executable size. (default: 4096)
-d:npegInlineMaxLen=N This is the maximum allowed length of a pattern to be inlined. Inlining generally results in a faster parser, but also increases code size. It is valid to set this value to 0; in that case NPeg will never inline patterns and use a calling mechanism instead, this will result in the smallest code size. (default: 50)
-d:npegRetStackSize=N Maximum allowed depth of the return stack for the parser. The default value should be high enough for practical purposes, the stack depth is only limited to detect invalid grammars. (default: 1024)
-d:npegBackStackSize=N Maximum allowed depth of the backtrace stack for the parser. The default value should be high enough for practical purposes, the stack depth is only limited to detect invalid grammars. (default: 1024)
-d:npegGcsafe This is a workaround for the case where NPeg needs to be used from a {.gcsafe.} context when using threads. This will mark the generated matching function to be {.gcsafe.}.

Tracing and debugging

NPeg has a number of compile time flags to enable tracing and debugging of the generated parser:

-d:npegTrace: Enable compile time and run time tracing. Please refer to the section 'Tracing' for more details.
-d:npegGraph: Dump syntax diagrams of all parsed rules at compile time.

These flags are meant for debugging NPeg itself, and are typically not useful to the end user:

-d:npegDebug: Enable more debug info. Meant for NPeg development debugging purposes only.
-d:npegExpand: Dump the generated Nim code for all parsers defined in the program. Meant for NPeg development debugging purposes only.
-d:npegStacktrace: When enabled, NPeg will dump a stack trace of the current position in the parser when an exception is thrown by NPeg itself or by Nim code in code block captures.

Random stuff and frequently asked questions

Why does NPeg not support regular PEG syntax?

The NPeg syntax is similar, but not exactly the same as the official PEG syntax: it uses some different operators, and prefix instead of postfix operators. The reason for this is that the NPeg grammar is parsed by a Nim macro in order to allow code block captures to embed Nim code, which puts some limitations on the available syntax. Also, NPeg's operators are chosen so that they have the right precedence for PEGs.

The result is that the grammer itself is expressed as valid Nim, which has the nice side effect of allowing syntax highlighting and code completion work with your favorite editor.

Can NPeg be used to parse EBNF grammars?

Almost, but not quite. Although PEGS and EBNF look quite similar, there are some subtle but important differences which do not allow a literal translation from EBNF to PEG. Notable differences are left recursion and ordered choice. Also, see "From EBNF to PEG" from Roman R. Redziejowski.

NPeg and generic functions

Nim's macro system is sometimes finicky and not well defined, and NPeg seems to push it to the limit. This means that you might run into strange and unexpected issues, especially when mixing NPeg with generic code.

If you run into weird error messages that do not seem to make sense when using NPeg from generic procs, check the links below for more information and possible workarounds:

Examples

Parsing arithmetic expressions

let parser = peg "line":
  exp      <- term   * *( ('+'|'-') * term)
  term     <- factor * *( ('*'|'/') * factor)
  factor   <- +{'0'..'9'} | ('(' * exp * ')')
  line     <- exp * !1

doAssert parser.match("3*(4+15)+2").ok

A complete JSON parser

The following PEG defines a complete parser for the JSON language - it will not produce any captures, but simple traverse and validate the document:

let s = peg "doc":
  S              <- *Space
  jtrue          <- "true"
  jfalse         <- "false"
  jnull          <- "null"

  unicodeEscape  <- 'u' * Xdigit[4]
  escape         <- '\\' * ({ '{', '"', '|', '\\', 'b', 'f', 'n', 'r', 't' } | unicodeEscape)
  stringBody     <- ?escape * *( +( {'\x20'..'\xff'} - {'"'} - {'\\'}) * *escape)
  jstring         <- ?S * '"' * stringBody * '"' * ?S

  minus          <- '-'
  intPart        <- '0' | (Digit-'0') * *Digit
  fractPart      <- "." * +Digit
  expPart        <- ( 'e' | 'E' ) * ?( '+' | '-' ) * +Digit
  jnumber         <- ?minus * intPart * ?fractPart * ?expPart

  doc            <- JSON * !1
  JSON           <- ?S * ( jnumber | jobject | jarray | jstring | jtrue | jfalse | jnull ) * ?S
  jobject        <- '{' * ( jstring * ":" * JSON * *( "," * jstring * ":" * JSON ) | ?S ) * "}"
  jarray         <- "[" * ( JSON * *( "," * JSON ) | ?S ) * "]"

doAssert s.match(json).ok

let doc = """ {"jsonrpc": "2.0", "method": "subtract", "params": [42, 23], "id": 1} """
doAssert parser.match(doc).ok

Captures

The following example shows how to use code block captures. The defined grammar will parse a HTTP response document and extract structured data from the document into a Nim object:

import npeg, strutils, tables

type
  Request = object
    proto: string
    version: string
    code: int
    message: string
    headers: Table[string, string]

# HTTP grammar (simplified)

let parser = peg("http", userdata: Request):
  space       <- ' '
  crlf        <- '\n' * ?'\r'
  url         <- +(Alpha | Digit | '/' | '_' | '.')
  eof         <- !1
  header_name <- +(Alpha | '-')
  header_val  <- +(1-{'\n'}-{'\r'})
  proto       <- >+Alpha:
    userdata.proto = $1
  version     <- >(+Digit * '.' * +Digit):
    userdata.version = $1
  code        <- >+Digit:
    userdata.code = parseInt($1)
  msg         <- >(+(1 - '\r' - '\n')):
    userdata.message = $1
  header      <- >header_name * ": " * >header_val:
    userdata.headers[$1] = $2
  response    <- proto * '/' * version * space * code * space * msg
  headers     <- *(header * crlf)
  http        <- response * crlf * headers * eof


# Parse the data and print the resulting table

const data = """
HTTP/1.1 301 Moved Permanently
Content-Length: 162
Content-Type: text/html
Location: https://nim.org/
"""

var request: Request
let res = parser.match(data, request)
echo request

The resulting data:

(
  proto: "HTTP",
  version: "1.1",
  code: 301,
  message: "Moved Permanently",
  headers: {
    "Content-Length": "162",
    "Content-Type":
    "text/html",
    "Location": "https://nim.org/"
  }
)

More examples

More examples can be found in tests/examples.nim.

Future directions / Todos / Roadmap / The long run

Here are some things I'd like to have implemented one day. Some are hard and require me to better understand what I'm doing first. In no particular order:

Handling left recursion: PEGs are typically not good at handling grammar invoking left recursion, see https://en.wikipedia.org/wiki/Parsing_expression_grammar#Indirect_left_recursion for an explanation of the problem. However, some smart people have found a way to make this work anyway, but I am not yet able to understand this well enough to implement this in NPeg. https://github.com/zevv/npeg/blob/master/doc/papers/Left_recursion_in_parsing_expression_grammars.pdf
Design and implement a proper API for code block captures. The current API feels fragile and fragmented (capture[], $1/$2, fail(), validate()), and does not offer solid primitives to make custom match functions yet, something better should be in place before NPeg goes v1.0.
Resuming/streaming: The current parser is almost ready to be invoked multiple times, resuming parsing where it left off - this should allow parsing of (infinite) streams. The only problem not solved yet is how to handle captures: when a block of data is parsed it might contain data which must later be available to collect the capture. Not sure how to handle this yet.
Memoization: I guess it would be possible to add (limited) memoization to improve performance, but no clue where to start yet.
Parallelization: I wonder if parsing can parallelized: when reaching an ordered choice, multiple threads should be able to try to parse each individual choice. I do see problems with captures here, though.
I'm not happy about the {.gcsafe.} workaround. I'd be happy to hear any ideas on how to improve this.

npeg's People

Contributors

Stargazers

Watchers

npeg's Issues

Feature request: Parse from seq[Token]

Peg is great. One of its advantages is that it's suitable for both lexing and parsing. This doesn't mean it's always preferable to do both of them in one step. Therefore we should provide an interface where the user can either choose to write 1 peg that does both lexing and parsing, or write 2 pegs, 1 for lexing and 1 (which feeds from the previous) for parsing.

To achieve this we need a match() that accepts seq[Token] where Token is user-defined (probably some codegen to generate the match() proc is involve here)

Cannot compile in stable release (Version 0.19.4)

Error message: .nimble\pkgs\npeg-0.8.0\npeg\buildpatt.nim(48, 22) Error: expected: '}', but got: '{'

More/larger examples of Error reporting

I am really not able to wrap my head around error reporting. I understand how to use it in small fragments. But I am not sure how to structure the grammar so that the error reporting is effective. Any longer example code here or general tips would be helpful.

thanks
Rishav

Large grammars break NPeg

For very large grammars (such as SQL) NPeg's code generation mechanism will fail:

..\npeg\src\npeg\codegen.nim(340, 18) Error: case statement has too many cases for computed goto

(another problem here is that the error message doesn't indicate what part of the grammar is the actual cause)

AssertionDefect when matching simple example

Hi,

I found npeg just now and tried it out with the following example.

`import npeg, strutils, tables

type Dict = Table[string, int]

let parser = peg("program", d: Dict):
program <- *statement
statement <- var_decl
word <- +Alpha
varC <- "var"
semi <- ';'
var_decl <- varC * >word * semi

var words: Table[string, int]
discard parser.match("var hello;", words).ok
echo words`

But it doesn't match - what am I doing wrong?
Thanks for any hint!

Nim parser generator story and npeg's role

Intro:
I realize npeg's original intention was (still is?) to be a nifty tool for parsing relatively simple grammars with pure Nim code. Therefore it's justifiably unfit for larger projects with more complex grammars.

But why is it unfit?
Npeg has come a looooong way! Many of its shortcomings are gone. VM problems are gone and recently you can even write grammars that parse from seq of tokens instead of string (#20)!

But well... the syntax is a rather ugly ad-hoc modification of peg. This is of course necessary in order to be accepted by Nim's parser. Nim-compatible syntax comes with several advantages: Firstly it allows you to leverage Nim's parser instead of implementing one yourself, and secondly it allows the user to write his DSL code in a macro block argument (which is cool and convenient). For large projects though you don't want to use such a syntax just so that you can write directly in a macro block argument and would very much prefer to give this up and use special files (e.g. .npeg instead of .nim) if it meant better syntax.

Another -less important imo, yet worth mentioning- limitation is lack of lexer/parser pipeline. Yes, you can now parse from tokens, but the two steps being separate hurts performance. Ideally lexer and parser should work concurrently.

What should we do?
The thing is... there is really no tool to use with Nim for that matter. Projects like Bison or ANTLR don't support Nim. I think this situation is unacceptable, or at least there is work that must be done in this direction. Lastly I list some options to improve the situation:

contribute to Bison or ANTLR (add Nim support)
extend npeg
- add support for reading standard PEG syntax mixed with Nim code from .npeg
- implement lexer/parser pipeline
- implement at least a linter for .npeg files
keep npeg nifty as it is and begin a new project oriented towards larger scale parsing

Please comment on which one you think is most logical, most preferable, most feasible, easiest.

FR: Deferred code block captures

Note that for code block captures, the Nim code gets executed during parsing, even if the match is part of a pattern that fails and is later backtracked.

It'd be cool to have some sort of API that defers code block execution until after parsing or at least beyond where backtracking is possible. Right now I'm using npeg for validation and to construct a tree, but I have to basically dispatch magic strings (using push) from the code block, and then re-parse those later to perform the checks I need. A simplified example would be: count how many times this rule has fired the code block. This count will currently include partial paths.

javascript support

First, thanks for this project is great. Trying to compile my code to javascript (just to check if that was possible) I got the following error:

$ nim js test_ex_01.nim
...
asciidoc/src/asciidoc/docheader/docheader.nim(18, 32) template/generic instantiation of `peg` from here
/home/jose/.nimble/pkgs/npeg-0.27.0/npeg/codegen.nim(429, 33) template/generic instantiation of `fn`gensym167` from here
/home/jose/.nimble/pkgs/npeg-0.27.0/npeg/codegen.nim(355, 22) Error: type mismatch
Expression: pop(e`gensym166.trace)
  [1] e`gensym166.trace: string

Expected one of (first mismatch at [position]):
[1] proc pop[A, B](t: OrderedTableRef[A, B]; key: A; val: var B): bool
[1] proc pop[A, B](t: TableRef[A, B]; key: A; val: var B): bool
[1] proc pop[A, B](t: var OrderedTable[A, B]; key: A; val: var B): bool
[1] proc pop[A, B](t: var Table[A, B]; key: A; val: var B): bool
[1] proc pop[A](t: CountTableRef[A]; key: A; val: var int): bool
[1] proc pop[A](t: var CountTable[A]; key: A; val: var int): bool
[1] proc pop[T](s: var seq[T]): T
[1] template pop[T](s: var Stack[T]): T

Is this something that could be fixed or simply npeg won't support javascript? (or maybe is something that I am doing wrong).

Attach line information to generate functions

NPeg doesn't appear to attach line information to its generated code. For instance, when the backtrace depth has been exceeded, a stack trace like the following will occur:

.\sample.nim(17) sample
..\npeg\src\npeg.nim(134) match
..\npeg\src\npeg\stack.nim(31) fn`gensym15613
..\npeg\src\npeg\stack.nim(27) grow
Error: unhandled exception: backtrace stack overflow, depth>1024 [NPegException]

Ideally, the traceback would indicate (either directly, or via the generated function name) what rule caused the error.

Streams / resuming

NPeg can at this time only match a complete subject string, which does not allow streaming. Find a way to make NPeg resumable without losing performance, which will then allow easy interfacing with Nim streams.

Failure to consume in particular case

I'm working on a SQL parser that converts a "WHEN" expression into reverse-polish notation. I'm coming across a problem where the system is ... possibly .. failing to consume the end-point numbers.

I must admit I'm new to PEG in general. So, if I'm making a newbie mistake, I do apologize.

Rather than go into great detail, I wrote a tiny program that demonstrates it:

import npeg
import strutils

let parser = peg("equation", p: seq[string]):

  equation <- allExpr
  
  plusMinus <- '+' | '-'

  number <- (integer)

  integer <- >+Digit:
    p.add "literal:" & $1

  whitespace <- +(Blank | Space | Cntrl)

  addExpr <- number * ?whitespace * >plusMinus * ?whitespace * (addExpr | number):
    p.add "op:" & $1

  allExpr <- (addExpr | number)

let testStr = "1 + 2 - 3 + 4 - 5"

var answer: seq[string] = @[]

let successFlag = parser.match(testStr, answer).ok

echo "test = [", testStr, "]"

echo "success = ", $successFlag

echo "answer = "
for item in answer:
  echo "  ", $item

The result I'm getting in answer is:

answer = 
  literal:1
  literal:2
  literal:3
  literal:4
  literal:5
  literal:5
  op:-
  op:+
  op:-
  op:+

but what I'd expect is:

answer = 
  literal:1
  literal:2
  literal:3
  literal:4
  literal:5
  op:-
  op:+
  op:-
  op:+

So, the section that is parsing a number is running twice on the same "5" text. Am I missing something or is there a bug?

PEG section in the Nim book

Dear Sir,

I have just shipped the first draft of the PEG section for the Nim book, see

http://ssalewski.de/nimprogramming.html#_parsing_expression_grammars

That section have not yet been checked by native English proof readers, so there may be still some grammar or spelling errors. Please let me know if my explanations are wrong or still unclear.

I have the hope that this section will make it a bit easier for kids and beginners to start with PEGs.

Some notes to your NPeg README text:

data up to the match; if his is

Plain typo, should be "this".

NPeg provides various prefix, infix and suffix operators.

I still wonder if there are really suffix operators.

even if the match is part of a pattern that fails and is later backtracked.

That full sentence is displayed in italics -- is that intended?

The syntax for defining a generic grammar is as follows:

peg(name, identifier: Type)

The phrase "generic grammar" sounds strange here, as you are not talking about grammar, but about passing an additional variable to the peg() macro.

Best regards, Stefan Salewski

Consolidate error methods

Npeg currently can fail a parse in two ways:

Normal failure when the subject does not match the grammar. In this case the ok field of the MatchResult will be false and the offset of failure is available in MatchResult
An user defined failure with E() pattern of fail() in a code block throw an NPegException with a user defined message

The above requires the caller to check for errors in two ways, which is not ideal.

I propose deprecating the exceptions and always return errors through the MatchResult object instead.

Npeg too heavy for NimVM

Currently the VM program space is 16bit. Npeg uses more than that when passes a non-trivial grammar to analyze. This means NimVM crashes on relatively complex grammars.

Examples of grammars that npeg can't support due to this:
https://play.nim-lang.org/#ix=22KQ
http://ix.io/22KC/nim

When defining grammar block parameterized rules block name needed?

I have a grammar block:

grammar "ib":
    EOL <- "\r\n" | "\n" | "\r"
    whitespace <- *Blank
    Static <- >whitespace

And with normal rules it works fine. But as soon as i add parameterized rules it requires me to add ib. in front of everything.

Erroring code:

grammar "ib":
    EOL <- "\r\n" | "\n" | "\r"
    whitespace <- *Blank
    Static <- >whitespace
    Line(content) <- Static * content * EOL

Working code:

grammar "ib":
    EOL <- "\r\n" | "\n" | "\r"
    whitespace <- *Blank
    Static <- >whitespace
    Line(content) <- ib.Static * content * ib.EOL

I would assume either that this is a bug or i am missing something, either way how do fix this?

Question about code blocks and backtracking

You write:
"Note that for code block captures, the Nim code gets executed during parsing, even if the match is part of a pattern that fails and is later backtracked."

I have been modeling a semi-complicated Prolog like language with lots of committed choices (and hence lots of places for backtracking). My global state is a stack which during code-block execution is popped and pushed eventually rolling up leaf-nodes into higher grains (I'm trying to model my nodes as variant objects like the JSON parser does).

My question is multi-fold: 1) given your statement above that code gets executed even in dead-end parsing is there any API for knowing this and possibly allowing code blocks to commit/rollback the global state? 2) do you have any suggestions of a good technique for this (the best I can think of is a stack of stacks where the inner stacks represent speculative parses that can be dropped/committed) 3) am I overthinking this?

thanks,
Daniel

[Question] Parsing comments with two characters limits

I am trying to parse comments with limits: (* and *).

I am trying the following without success:

import npeg

type
  Comment = string

let tmp = "(*This is a comment*)"  # Nested comments are not allowed

let parser = peg("comment", d: Comment):
  commentStart <- "(*"
  commentEnd <- "*)"
  #content <- *(1-commentEnd) 
  #comment <- commentStart * >content * commentEnd:
  comment <- "(*" * >(1-"*)") * "*)":
    d = $1


var comm:Comment
#assert parser.match(tmp, comm).ok
echo parser.match(tmp, comm).captures

#echo comm

The sort of comments I could expect are like:

(* This is an exmaple *)

(* This is 
 * an example
 *)

(********)
(* Ok   *)
(********)

(*
  enum_v1 : enum_type00;
*)

Input wanted: design and implement a proper API for code blocks

Here's a request to all NPeg users: I'm looking for some ideas and feedback for the proper design of the API for grammar code blocks. At this time the available functions feel ad hoc and scattered, and I don't dare to call this v1.0 before a better interface is in place.

Here is a list of things a code block capture can do, and how this works now:

Access captures: when a code block executes, all closed captures that were made inside the rule, and the code block capture itself are available to the code block as the variable capture which is of type seq[Capture]. The Capture object has a few public members that can be used from the code block: s is the captured string and si is the index into the original subject where the capture was matched. The capture strings s are also accessible through a sugar macro that rewrites the code block as $n:
- capture[0] or $0 is the total capture of the rule
- capture[1..] or $1 are the explicit string captures inside the rule
Force a rule to fail:
- code blocks have access to a proc fail() which will cause the match to fail - this can be used to validate matches with Nim code, for example to check integer ranges, enum values, or symbols that were captured earlier.
- code blocks have acecss to a proc validate(b: bool), which is nothing more then if not b: fail()
For example:

uint8 <- >Digit[1..3]:                                                     
  let v = parseInt($a)                                                         
  validate v>=0 and v<=255

Perform matching in Nim code: the internal match state is implicitly available to the code block capture, which can (ab)use that to do matching in Nim code: the code block can see and adjust the match index, and indicate success or failure, for example:

  let p = peg foo:
    foo <- "ba" * die * "da"
    die <- 0:
      if si < s.len-3 and if cast[string](s[si..si+2]) == "die":
        si += 3
      else:
        fail()

  echo p.match("badieda")

All of the above features were introduced in NPeg one at a time, growing the API lacking a proper design.

My question: does the above feel as kludgy as I think it is, or is it fine as it is now? Any ideas for improvement or what this should look like?

Captures only work when input is consumed

I think this behavior is somewhat surprising.

let p = peg o:
  o <- >&1:
    echo len($1)

doAssert p.match("a").ok

Possible solutions:

make it work
produce a warning/error when > is used along with &
mention it in the documentation

What does !1 do? (in the quickstart) [Question]

In the quickstart, there is a line:

pairs <- pair * *(',' * pair) * !1

I get that this means that pairs is a list of a pair object, followed by 0 or more pairs. But what is the !1 doing there? The only thing I can think of is that it is negating the literal 1.

Prevent code blocks captures from executing

I want to make a pattern that uses another pattern but doesn't execute that pattern's code block capture.

import npeg

# This parser parses words and no words
# Words are one or more alpha characters
# And no words are words with a dash before them
const parser = peg("nodes", data: seq[string]):
  nodes <- *(node * (' ' | !1))
  node <- noWord | word
  noWord <- '-' * word:
    data.add($0)
  word <- +Alpha:
    data.add($0)

var data: seq[string]
assert parser.match("a -b", data).ok
assert data == @["a", "b", "-b"] # I want it to be @["a", "-b"]

There could be, perhaps, an operator that allowed it:

%P # matches P without executing it's code block capture

Investigate parallelization

Would it be possible to implement some kind of parallization when parsing ordered choice or optionals?

NPeg's IR language should not be too hard to parallelize: when encountering an opChoice, both branches of the program can be executed in parallel. When the primary branch matches, the secondary branch is aborted, but when the primary branch fails, the secondary can continue parsing from there.

Challenges:

Keeping track of captures: each thread needs to keep track of each own capture stack.
Code block captures?

Parsing indentation-based grammars

Many languages, including Nim, use indentation to represent a tree structure. Would it be possible to add a mechanism for parsing such languages into NPEG? Indentation-based grammars are not strictly context-free, but it's quite easy to convert them into a context-free grammar by introducing special “indent” and “dedent” tokens, representing a change in the indentation level, which can then work like brackets.

Print return stack on overflow

When NPeg throws a depth error due to an overflow in the return stack, it would be nice to have the exception contain a printout of the stack.

Something like

Error: unhandled exception: return stack overflow, depth>1024
Return stack:
sample.nim(10)             rule_name_one
keywords.nim(100)       rule_name_two
...

Rather than just

Error: unhandled exception: return stack overflow, depth>1024

Backreferences

Being able to reference backreferences in match patterns would be nice, for the cases where something like bash heredocs are desired.

Enhancing parser flexibility: improving type parsing with the push template

I have developed a lexer/parser, similar to the example "tests/lexparse.nim", that offers the ability to parse types beyond just strings.

Using the push template within the parser has proven to be incredibly useful for achieving this functionality. I made a modification to the codegne.nim file:

template push(`s`: string) {.used.}

I made the following change to expand its capabilities:

template push(`s`: string|`sType`) {.used.}

By implementing this modification, the parser now performs smoothly and accurately.

AST building, another take

Currently, AST building with NPeg is quite difficult and bug-prone. You have to maintain a stack of nodes and use a bunch of relatively inefficient workarounds to build a nice, inspectable, traversable abstract syntax tree.

My proposal is to allow the user to specify their own Node type, and add primitives for building abstract syntax trees. Here's my idea for the API:

import std/strutils

type
  NodeKind = enum
    nkNumber, nkIdent, nkCall

  Node = ref object
    case kind: NodeKind
    of nkNumber: number: float
    of nkIdent: ident: string
    of nkCall:
      fn: Node
      args: seq[Node]

let p = peg(expr, ast = Node):
  ident <- {'a'..'z', '_'} * *{'a'..'z', '_', '0'..'9'}:
    # perhaps we could assign to result but i don't know
    # whether NPeg reserves it or not
    build Node(kind: nkIdent, ident: $0)
  number <- +Digit:
    build Node(kind: nkNumber, number: parseFloat($0))
  # AST captures would be done with A(), referencing
  # a capture inside of the code block would be done with node(n)
  # NPeg could detect whether the capture would result in
  # multiple nodes and expose nodes(n) which returns a
  # seq[Node] instead of a single Node
  call <- A(ident) * '(' * A(expr * *(',' * expr)) * ')':
    build Node(kind: nkCall, fn: node(1), arg: nodes(2))
  expr <- call | ident | number

Usage of node(n) on a sequence capture would result in either an error or the first node being returned. Usage of nodes(n) on a singular capture would result in an error or a sequence with only that node being returned.

Templates don't reject code block syntax

Templates obviously do not (and should not) support code blocks.
While this is true, when you do write one, npeg just ignores it instead of erroring.

lazy matching

hey,

I remember in REGEX when I can write (.+) bag and it matches the "faded blue bag"
[it's a very normal regex pattern]

but I can't do something like this in npeg.

const parser = peg("main", data: ...):
  main  <-   >*{ 'a' .. 'z', ' ' }  * " bags":
    echo "matched: ", $1

Pass local state to parser code blocks

Any variables used in code blocks need to be declared before the parser definition, which is not ideal. It would be nice to have a way to pass local state around which can be used in the code blocks.

Just wanted to say npeg is great

Hi, dear developers

Some months ago I started making a concept of a markup language which would be useful for UI design description.

And week ago I started development.
I know that parsing is actually pain, so I decided to take this library for usage.

It's so amazing! It only took 3 days to find out how can I use it and now I have clean code with several rules, which actually can parse the code!

The parser is still in development, but I'm sure I will be able to make first release soon.

Thank you very much! You are great!

Memorization & Bryan Ford

Huge thanks for nPEG — love it a lot, even its quirky syntax — as it gets highlighted, unlike common PEG. Love your other work as well.

PEGs creator — Bryan Ford — has a page which (besides other things) lists PEG implementations in various languages. I suggest to ask him to add Nim and its both PEG implementations. His email and twitter are at the index.

In my opinion it would also be a good opportunity to poke his brain on implementing some memorization for your amazing nPEG. I know that contacting people is not everyone's favorite thing, but here are some reasons for it:

He is a relatively young professor — approachable and curious.
He is the author of packrat — a practical memorization technique. His Master thesis describes it, and contains a Haskell implementation. The thesis predates PEG by 2 years, so it uses TDPL instead — but it's still helpful to get the practical gist of his idea. The mentioned info page has 6 more pdfs on packrat alone, plus more on related topics. Also has links to packrat implementations in Java, Python, Haskell, Scheme, and Ruby — for inspiration and borrowing.
Even just exchanging Hello's with the inventor would be cool, and potentially could be inspiring and helpful.

You might be busy or have other plans, but if there be time to improve nPEG in terms of memorization — contacting the dude can do no harm.

In any case — thank you, and best wishes.

Prettier graphs

I'd love to be able to draw graphs like this from a grammar:

Or ascii:

                ╭─(back)◂─╮
  ──▸(term)─╭───╰─────────╯──╮──▸
            ╰─(one)◂─╮─(+)◂──╯
                     ╰─(-)◂──╯

Get line/position number of capture

If possible, it would be nice to get some meta-information about a capture, such as the position it is located at in the passed-in input.

Investigate precedence climbing

See if Npeg could be extended with precedence levels for rules, which could be used for implementing precendence climbing as part of a parser

[Question] Why is my npeg parser done, then starts backtracking and fails.

I have this line of my parser: tinput <- tblock * !1 after EOL it should check again for the !1 and end but from the trace below you can see it backs up and starts again then fails. Why is this happening and how can i fix it?

 75|  0| 94|                        |EOL            |chr "\r"                                |***
 79|  0| 94|                        |EOL            |chr "\n"                                |***
 82|  0| 94|                        |EOL            |chr "\r"                                |***
 85|  0| 94|                        |EOL            |any                                     |***
 89|  0| 94|                        |tcontent       |return                                  |***
 93|  0| 94|                        |tblock         |commit 91                               |***
 91|  0| 94|                        |tblock         |choice 94                               |**
 92|  0| 94|                        |tblock         | call tcontent:9                        |***
  9|  0| 94|                        |tcontent       |choice 60                               |***
 10|  0| 94|                        |Static         | capopen ckAction -> 94                 |****
 11|  0| 94|                        |Static         |  capopen ckVal -> 94                   |****
 12|  0| 94|                        |whitespace     |   span {'\t',' '}                      |****
 13|  0| 94|                        |Static         |  capclose ckVal -> 94                  |****
 14|  0| 94|                        |Static         | capclose ckAction -> 94                |****
 15|  0| 94|                        |Alpha          | set {'A'..'Z','a'..'z'}                |****
 95|  0| 94|                        |               |opFail (backtrack)                      |****
 60|  0| 94|                        |Static         |capopen ckAction -> 94                  |***
 61|  0| 94|                        |Static         | capopen ckVal -> 94                    |***
 62|  0| 94|                        |whitespace     |  span {'\t',' '}                       |***
 63|  0| 94|                        |Static         | capclose ckVal -> 94                   |***
 64|  0| 94|                        |Static         |capclose ckAction -> 94                 |***
 65|  0| 94|                        |Alnum          |set {'0'..'9','A'..'Z','a'..'z'}        |***
 95|  0| 94|                        |               |opFail (backtrack)                      |***
 94|  0| 94|                        |tblock         |return                                  |**
 48|  0| 94|                        |tcontent       | choice 59                              |**
 49|  0| 94|                        |tcontent       |  choice 57                             |***
 50|  0| 94|                        |Dedent         |   capopen ckAction -> 94               |****
 51|  0| 94|                        |Dedent         |    capopen ckVal -> 94                 |****
 52|  0| 94|                        |whitespace     |     span {'\t',' '}                    |****
 53|  0| 94|                        |Dedent         |    capclose ckVal -> 94                |****
 54|  0| 94|                        |Dedent         |   capclose ckAction -> 94              |****
 95|  0| 94|                        |               |opFail (backtrack)                      |****
 57|  0| 94|                        |tcontent       | commit 58                              |***
 58|  0| 94|                        |               |opFail (backtrack)                      |**
 60|  0| 51|Fooo:\n    Barr\n    Fo |Static         |capopen ckAction -> 51                  |*
 61|  0| 51|Fooo:\n    Barr\n    Fo |Static         | capopen ckVal -> 51                    |*
 62|  0| 51|Fooo:\n    Barr\n    Fo |whitespace     |  span {'\t',' '}                       |*
 63|  0| 51|Fooo:\n    Barr\n    Fo |Static         | capclose ckVal -> 51                   |*
 64|  0| 51|Fooo:\n    Barr\n    Fo |Static         |capclose ckAction -> 51                 |*
 65|  0| 51|Fooo:\n    Barr\n    Fo |Alnum          |set {'0'..'9','A'..'Z','a'..'z'}        |*
 66|  0| 52|ooo:\n    Barr\n    Foo |value          |capopen ckAction -> 52                  |*
 67|  0| 52|ooo:\n    Barr\n    Foo |value          | capopen ckVal -> 52                    |*
 68|  0| 52|ooo:\n    Barr\n    Foo |value          |  span {'0'..'9','A'..'Z','a'..'z'}     |*
 69|  0| 55|:\n    Barr\n    Foooo: |value          | capclose ckVal -> 55                   |*
 70|  0| 55|:\n    Barr\n    Foooo: |value          | chr ":"                                |*
 71|  0| 56|\n    Barr\n    Foooo:  |               | jump :73                               |*
 73|  0| 56|\n    Barr\n    Foooo:  |               |opFail (backtrack)                      |*
  4|  0| 51|Fooo:\n    Barr\n    Fo |tinput         |any                                     |
  5|  0| 52|ooo:\n    Barr\n    Foo |               |jump :7                                 |
  7|  0| 52|ooo:\n    Barr\n    Foo |               |opFail (error)                          |

Another concatenation symbol?

I like this module very much! Thanks for your effort.

However, when I try to use it in my projects, I found it is hard to read the PEG grammar because that concatenation and matches 0 or more times use the same symbol *.

Is it possible to use another symbol for concatenation, for example: ~, or others?

Captures not working inside of proc with generic arguments

The > symbol used in an expression inside of a proc causes the following error: Error: wrong number of arguments. It seems like it is being recognized as the system operator rather than part of the expression.

Two examples of it not working:

import npeg

proc test(T: typedesc, str: string) = 
  let p = peg "start":
    start <- >"Test"
  var m = p.match(str)
  echo m.captures
test(string, "Test")

import npeg

proc test[T](obj: T, str: string) = 
  let p = peg "start":
    start <- >"Test"
  var m = p.match(str)
  echo m.captures
test(1, "Test")

If the '>' is removed, things work. But of course I cannot use the matched element. This is the only symbol that is causing me problems and the parser I've been trying to write is using almost all of them.

import npeg

proc test(T: typedesc, str: string) = 
  let p = peg "start":
    start <- "Test"
  var m = p.match(str)
  echo m.captures
test(string, "Test")

import npeg 

proc test[T](obj: T, str: string) = 
  let p = peg "start":
    start <- "Test"
  var m = p.match(str)
  echo m.captures
test(1, "Test")

A way to let every peg rule to produce its own object...

Here are some codes to demonstrate my problem.

import npeg, strutils

const
  testData = "10:10;12:00;22:40" # last item has no ';'

proc test1() =
  var list: seq[string]
  let peg = peg "start":
    start <- +statement
    statement <- >time * ';':
      list.add $1
    time <- Digit[1..2] * ':' * Digit[1..2]

  discard peg.match(testData)
  echo list

Output: @["10:10", "12:00"]

The output is what we want. However, we like object instead of string. So...

type
  Time = object
    hour: int
    min: int

proc test2() =
  var list: seq[Time]
  let peg = peg "start":
    start <- +statement
    statement <- time * ';'
    time <- >Digit[1..2] * ':' * >Digit[1..2]:
      let (hour, min) = (parseInt($1), parseInt($2))
      if hour in 0..23 and min in 0..59:
        list.add Time(hour: hour, min: min)

  discard peg.match(testData)
  echo list

Output: @[(hour: 10, min: 10), (hour: 12, min: 0), (hour: 22, min: 40)]
We get seq[Object] as output, however, code block capture are always executed even when the parser state is rolled back afterwards. The result is wrong.

proc test3() =
  var list: seq[Time]
  let peg = peg "start":
    start <- +statement
    statement <- time * ';':
      for i in countup(1, capture.len-1, step=2):
        let (hour, min) = (capture[i].s.parseInt, capture[i+1].s.parseInt)
        if hour in 0..23 and min in 0..59:
          list.add Time(hour: hour, min: min)

    time <- >Digit[1..2] * ':' * >Digit[1..2]

  discard peg.match(testData)
  echo list

Output: @[(hour: 10, min: 10), (hour: 12, min: 0)]

Finally, we get what we want. However I think this code is bad because we produce Time object outside of time rule.
If statment rule is statement <- (time | date | something) * ';' , the code will be really ugly.

The way I can resolve the problem for now is:

import marshal

proc test4() =
  let peg = peg "start":
    start <- +statement:
      var list: seq[Time]
      for i in 1..<capture.len:
        list.add to[Time](capture[i].s)
      push($$list)

    statement <- time * ';'

    time <- >Digit[1..2] * ':' * >Digit[1..2]:
      let (hour, min) = (parseInt($1), parseInt($2))
      if hour in 0..23 and min in 0..59:
        push($$Time(hour: hour, min: min))

  var list = to[seq[Time]](peg.match(testData).captures[0])
  echo list

Output: @[(hour: 10, min: 10), (hour: 12, min: 0)]

Ok, it works fine, and the code is clear, each rule produce the object of itself. But it will be very slow due to serialization/deserialization, and marshal cannot works at compile-time (https://github.com/treeform/jsony can).

In the end, is there a better/smarter way to do this?
Sorry for my bad English.

`-d:npegTrace` code allocates a number of closures equal to the number of states

When npegTrace is enabled, a closure is allocated for each state generated by NPeg. This is due to the fact that doTrace is defined inside genTraceCode. For large grammars, this can cause programs to suddenly crash, due to the program attempting to allocate stack memory for all the closure pointers.

Performance and Complexity

Hello!
I was wondering what is the complexity of npeg. Does it depend on the grammar ? It would be nice to have a section in the documentation about it.
For example consider the example in the README.md to parse arithmetic expression.

let parser = peg "line":
  exp      <- term   * *( ('+'|'-') * term)
  term     <- factor * *( ('*'|'/') * factor)
  factor   <- +{'0'..'9'} | ('(' * exp * ')')
  line     <- exp * !1

doAssert parser.match("3*(4+15)+2").ok

Is the match performed in linear time ?

I'm considering building a simple C compiler and I need a parsing library because I'm too lazy to read parser theory to build a parser by hand. I have https://github.com/vanyle/lrparser, but it's an SLR parser and I'm not sure it's powerful enough to parse C (also I haven't implemented any error handling.) Obviously, linear parsing time is a requirement when dealing with tens of thousands of lines of code to parse.

Use with threads and gcsafe code ?

Hi,

I am wondering how to use npeg with gcsafe code or threads. i declared my grammar constant, but I get errors like this when using the grammar match procedure:

Warning: 'decode_encoded_words' is not GC-safe as it calls 'match' [GcUnsafe2]

Is it possible to use npeg with gcsafe code ?

Can You Add Some Example Binary Structures?

I realize that you specifically indicate that npeg was extended over time and parsing binary data structures is a bit experimental at this point. However, it would be wonderful to have some examples of npeg parsed binary structures just to help potential users of the library wrap their heads around how it should work given that use-case.

I could be wrong, but I think a heavily commented working parser for Ethernet frames and IPv4 packets would probably be enough of a reference for a large number of binary use cases to help people get the ball rolling.

This is a very cool project. Writing parsers by hand is indeed quite a pain.

push() captures broken?

From EyeCon on IRC:

import std/[strformat, strutils]
import npeg

let tmp = "Z1Z + A1A / B1B + -C1C"

let parser = peg "exp":
  code <- >(?'-' * Alpha * Alnum[2..4]):
    push("Cod " & $1)
                                       
  parenExp <- ( "(" * exp * ")" ) ^ 0          
                                                 
  prefix <- code | parenExp                    
                                               
  infix <- >{'/'} * exp ^ 1 | >{'+'} * exp ^ 2: 
    push("Opr " & $1)                           
                                                
  exp <- prefix * *infix                        
                                                
echo parser.match(tmp.replace(" ", "")).captures

NPeg does not support left recursion

which would be nice to have.

http://www.lua.inf.puc-rio.br/~sergio/leftpeglist.pdf

Implement handling of left recursion

Handling left recursion is problematic in PEGs, see https://en.wikipedia.org/wiki/Parsing_expression_grammar#Indirect_left_recursion for a nice explanation of the problem.

A paper citing Npeg

Your work is being recognized in academia, too: https://people.seas.harvard.edu/~chong/pubs/gpeg_sle21.pdf
👍

Help - how to handle captures and optionals

I am having two difficutties and probably I am not using probably npeg here.

First issue: capturing optional elements

title <- >( whatever)
attributes <- >(whatever2)
delimitedBlocks <- ?title * >?attributes * >body:
   mycode

Given above's situation, $1 could be tittle, attributes or body. How can I check what it is? If in the code I do echo capture, the field name is empty.

Second issue: creating objects

I want to capture above's data in a ref object, where should I create it?

var blk:Block
new(blk)

I'd like to create the object in one place in order to do something like:

let parserBlocks* = peg("delimitedBlocks", myBlk):
  title <- >( whatever):
    blk.title = $1
  attributes <- >(whatever2):
    blk.attributes = $2
  delimitedBlocks <- ?title * >?attributes * >body:
    blk.body 
    myBlk.blocks &= blk

Composibiity, imports, reuse

Npeg would greatly benefit from a mechanism providing reuse of rules using some kind of import mechanism to pull in pre-defined patterns from libraries.

With this mechanism Npeg could provide a library of predefined common patterns to be shipped with Npeg to simplify common operations like parsing of IPv4/IPv6 addresses, URIs, e-mail addresses.

I'm looking at an implementation but I seem to run into the limits of the Nim VM with what I'm trying to do. I'd appreciate any help on this from someone with more experience in macro land.

Implement slicing operators for captures

Someone commented on an old YouTube video of mine showing Advent of Code solutions using NPeg. After 0e297ac the Captures type is no longer just an alias for a sequence, but rather an object. The object has [] defined for it, but not for slices which I used in my solution. The fix is simple enough, but for backwards compatibility and future usability it would be nice to have.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

zevv / npeg Goto Github PK

npeg's Introduction

Contents

Quickstart

Usage

Simple patterns

Grammars

Syntax

Atoms

Operators

Precedence operators

Captures

String captures

Code block captures

Custom match validations

Passing state

Backreferences

More about grammars

Ordering of rules in a grammar

Templates, or parameterized rules

Composing grammars with libraries

Library rule overriding/shadowing

Error handling

MatchResult

NpegParseError exceptions

Other exceptions

Parser stack trace

Advanced topics

Parsing other types then strings

Some notes on using PEGs

Anchoring and searching

Complexity and performance

End of string

Non-consuming atoms and captures

Left recursion

UTF-8 / Unicode

Tracing and debugging

Syntax diagrams

Grammar graphs

Tracing

Compile-time configuration

Tracing and debugging

Random stuff and frequently asked questions

Why does NPeg not support regular PEG syntax?

Can NPeg be used to parse EBNF grammars?

NPeg and generic functions

Examples

Parsing arithmetic expressions

A complete JSON parser

Captures

More examples

Future directions / Todos / Roadmap / The long run

npeg's People

Contributors

Stargazers

Watchers

Forkers

npeg's Issues

First issue: capturing optional elements

Second issue: creating objects

Recommend Projects

Recommend Topics

Recommend Org