rain-1 / racket-peg Goto Github PK

View Code? Open in Web Editor NEW

23.0 5.0 8.0 249 KB

racket scheme peg-parser

License: GNU General Public License v3.0

Racket 99.17% Shell 0.83%

scheme racket drracket peg parser parse

racket-peg's Introduction

racket-peg

This library implements a PEG parser generator.

Getting Started

run chmod +x ./make to turn ./make executable
Run ./make install to install the peg library. You should be able to (require peg) in your own racket programs after that. Or use #lang peg!
Run ./make update to apply changes, if you're hacking on it.
Run ./make docs to build the documentation.
Run ./make test to check that it is working after installation.

Source Code Map

The PEG parser system is built in two stages. There is a lispy s-expression version. Then there is a self-hosted PEG syntax version.

The lispy version is implemented in:

peg.rkt - The PEG parsing VM and parser macro.
peg-result.rkt - The fundamental data structure used for parse results. It's a kind of automatically joinable sequence.

The PEG aspect is implemented in these files:

peg-src/peg-in-peg.rkt - The syntax of our peg language. In peg.
peg-src/sexp-parser.rkt - A basic s-expression parser. In peg.

Both of the above files are "bootstrapped" using the racket macro expander to produce the following:

peg-in-peg.rkt - expanded version of peg-src/peg-in-peg.peg.
s-exp.rkt - expanded version of peg-src/s-exp.peg.
peg-to-scheme.rkt - support for peg-in-peg.
main.rkt - This adds the #lang peg glue to racket.

Authors

Raymond Nicholson
João Pedro Abreu de Souza

racket-peg's People

Contributors

Stargazers

Watchers

Forkers

samth kstrafe petrolifero atharvamahajan wwall guitarvydas lwhjp liampwll

racket-peg's Issues

add positive lookahead operator

add & operator, that is equivalent to !!

&a == ! (! a)

so, if a matchs, !a fails without consume, and !(!a) match, again, without consumes.

Is just a syntatic sugar but make easy write intersection of languages

generate #lang peg/typed

this will allow the parser to be integrated in typed racket's programs

add quantifier repetition

to write something like password <- [a-zA-Z0-9]{8}[a-zA-Z0-9]* ; to the password has minimun length of 8.
or a <- [a-z]{8,10} to between 8 and 10 char.

or a <- [a-z]{8,} without upper limit, with the semantics of minimun of 8

add epsilon operator

In the same way that positive lookahead, this is a syntatic sugar

epsilon <-- (! .) / (& .) ;

I dont know of a good symbol of that, but I think that will be a good add.

add a primitive to read an s-expression

We will need a primitive that either:

(A) invokes rackets s-expression reader
(B) calls on our own s-expr reader

that allows us to easily include s-expressions inside PEG grammars.

(A) would depend on #2

perhaps (B) is the better option though. Even though we have to "re"implement s-expr parsing, it would blend more seamlessly with the rest of the peg parsing system.

reorganize file structure to match this nice clean layout

https://github.com/racket/draw

add keyword support in s-exp

#lang peg

(struct add (a b) #:transparent);

(define (my-sort lst #:comparator [cmp <])
             (sort lst cmp))

Any of this lines works. In parser level we need to add support to keywords.

Can PEG results objects become more predictable?

I am just a but frustrated with the way peg-result objects work. There is a fair bit of "magic" involved and the results can be somewhat unpredictable.

We can't use #f as a semantic value

#lang peg

a <- 'a' -> #f ;

This don't work because is translated to use #f as semantic value this is a hard-coded code interpreted as don't have semantic action(in real world, this repeat this input) instead of returning #f when I use (peg a "a")

process ports instead of just strings

Make the peg function handle ports instead of just strings

support to "..."

#lang peg


(define (eval-rpn l [aux '()])
  (match l
    ['() (car aux)]
    [(list (? number? a) b ...) (eval-rpn (cdr l) (cons (car l) aux))]
    [(list a b ...) (eval-rpn (cdr l) (cons (a (car aux) (cadr aux)) (cddr aux)))]));

_ < [ \t\n]* ;
number <- value:[0-9]+ -> (string->number value);

top <- v:rpn ! .  -> (eval-rpn v);

rpn <- number (_ rpn _ Bop)* _ ;

Bop <- v:[+\-] -> (if (equal? v "+") + -) ;

this dont work because the identifier parser dont understand "..."

error messages

it needs to produce useful error messages on a parse failure.

make peg-src/sexp-parser.rkt more complete

it doesn't cover the full scheme langauge yet. it doesn't need to cover the complete syntax of scheme. but we should add a bit more than this current bare minimum.

characters

allow more permissive rules

A well-formed PEG always accept or fails, never go in a infinite loop.

There's only two ways a PEG is not-well-formed : left recursion or star over nullable expression.

E <- E '+' T ; is a example of left recursion
A <- ('a'*)* ; is a example of star over nullable expression

If we can detect this states, we can show to the user, improving error message instead of the user see a infinite parsing time .

Other possibility is allow more permissive rules

There are algorithms that receive a left-recursive grammar, turn in a non-left recursive grammar, obtain the parse tree and transform that to a parse tree compatible with the original grammar. If we use this, mostly grammar can be just ported to peg, without any change.

PEG.js do this. I am studing this algorithm, but yet don't understand total.

I don't know if there's algorithm to detect star over nullable expression to do this with the second restriction.

benchmark this

do some timing measurements against other parsers e.g. lalr parser in guile scheme. see how efficient this parser is on a range of languages compared to other systems.

difficult to parse a basic language

https://www.rosettacode.org/wiki/Brace_expansion

#lang peg

item <- [^{},]+ / ~'{' choice ~'}';
choice <- x:item+ xs:(~',' item+)* -> (cons x xs);

; > (peg (+ item) "It{{em,alic}iz,erat}e{d,}")
; '("It" ((("em" . "alic") "iz") . "erat") "e")

The resulting parse is confusing and we can't change the char class to *. it's too hard to parse this simple language.

Can we make the syntax match guiles?

or at least closer.

https://www.gnu.org/software/guile/manual/html_node/PEG-Parsing.html

shorthand for dropping a string

A common idiom is to define rules like OB < '{" ;

What shorthand could be created for this?

maybe "{" could be used, but it does't seem logical.
maybe a drop operator that can be applied to anything, which char though?

Automatically generate the 'expanded' files from the sources

documentation for the peg #lang

use semantic actions in our scheme parser

phase out the sexp-to-scheme.rkt file completely.

change (? e) to (or e (epsilon))

the optional operator (?) is not fundamental. He can be written using "or" in the way on title.

If we do that, as we do when add the (&) operator, in a future when a feature must be added, will be less code to alter, less tests to do.

generate #lang peg/typed/syntax

#17 #14
With this, we can generate typed languages using the peg

require peg instead of peg/peg

Is there any way we can support (require peg) as well as the #lang peg, instead of being forced to do (require peg/peg)?

cannot pkg

cannot make info.rkt right for the racket pkg page.

Parse Failure

#lang peg

item <-- [^{]+ / '{' item '}' ;

;> (peg item "{abc}")
;     {abc}          
;          ^ here
; peg: parse failed in rule peg-rule:item at location 5 with options '(or (+ (and (! #\{) (any-char))) (and "{" item "}"))

semantic value

put capabilities to get semantic value from "#lang peg"

something like

#lang peg

number <- (name value [0-9]*) (lambda () value) ;

import feature

We need an import feature like this:

import name1.rkt ;
import name2.rkt ;
import name3.rkt ;
rule1 <- .. ;
rule2 <- .. ;

To be able to split parsers up into multiple files, and also to import supporting functions for semantic actions.

comments for #lang peg

We would like // comments.

document pegvm semantics

#69

When I think of put other continuation, I bumped into my ignorance about the formal functioning of pegvm.

add empty string support

#lang peg

a <- '' 'a' ;

this don't work. Because expect to have at least one char inside single quotes

Allow qualified import

The great advantage of peg is composition. Imagine someone write a peg to a language and define expressions only as arithmetic Expressions. I want to put Boolean expressions, function call and other things. I can't just import aritexp.rkt because he export expression. I will need to change the code.

One possible alternative can be import only some Peg's, or import with a prefix.

import aritexp from otherexp.rkt;
import exp as aritexp from otherexp.rkt;
import otherexp.rkt with prefix other;

In this way, the number of languages can be multiplied by 10 because of reuse

add '[' support

#lang peg

(define (f l)
      (match
                  [(list a b) a]
                  [(list a) (list a 1)]))

a <- 'a' ;

This file don't work because we only support ( and don't support [ but [ is normal in racket code

Implement a extension of PEG

https://www.youtube.com/watch?v=4QrZxfgZFDg

This is a very future function.

With this, will be possible to parse declaratively and easyly all of C, python and other syntax-ambiguous languages.

line number and column in the error message

peg: parse failed in rule peg-rule:identifier at location 269 with options '(or #\) #\[ #\] #\{ #\} #\" #\, #\' #\` #\; #\# #\| #\\)

Bug with context-sensitive grammar

I took example from Ford's original paper:

A ← aAb/ε
B ← bBc/ε
S ← &(A!b)a∗ B!.

#lang racket

(require peg)

(define-peg A
  (or (and 	(char #\a) A (char #\b)) 	(epsilon)))

(define-peg B
  (or (and 	(char #\b) B (char #\c)) 	(epsilon)))

(define-peg S
  (and (& (and A (! (char #\b)))) (* (char #\a)) B (! (any-char))))

(peg S "aaabbcc") ; should not match

(peg S "aabbcc")  ; works as expected

(peg S "abbcc")   ; works as expected

Interesting that this bug is consistent across several implementations.

Precise defitition of the peg syntax?

What's the easiest way to understand the peg syntax? I saw https://docs.racket-lang.org/peg/index.html#%28part._.P.E.G_.Syntax_.Reference%29 . Is there some additional reference material one should review? Thx

Port to guile

For porting define-for-syntax consider:

(eval-when (expand load eval) (define (gen-id xxx) ...))

define those functions in a separate library and do (import (for (helper-library) expand run))

'capture' should be 'name' in reference

In peg.scribble @subsection{define-peg}:

Change 'capture' with 'name'

bad: (code:line (capture name <rule> ))

good: (code:line (name nm <rule>))

implement memorization of results

So, when we do something like

A <- ifElse / if
ifElse <- if else
if <- ....

When try to parse a simple if with A, ifElse will fail, then will try "if", but will remember "Hey, I match a if starting in this text position, so I will apply this result to the continuation sucess and try to parse the rest"

And in this way, the cost of memory can be bigger if not implemented in a clever waty, but the time cost will be smaller

unicode safe

Currently the system only really works with ASCII. We should support full Unicode. This may be difficult, but here is an issue for it.

No way to use bake from PEG language.

peg-in-peg don't accept itself

To study how to resolv #6 I try extend peg-in-peg but the peg don't accept itself

#lang peg

nt-char <- [a-zA-Z0-9_\-] ;
nonterminal <-- nt-char+ !nt-char SP ;
SP < [ \t\n]* ;

literal <-- SQ (BS ['\\] / !['\\] .)* SQ SP ;
SQ < ['] ;
BS < [\\] ;

charclass <-- LB '^'? (cc-range / cc-escape / cc-single)+ RB SP ;
cc-range <-- cc-char DASH cc-char ;
cc-escape <-- BS . ;
cc-single <-- cc-char ;
cc-char <- !cc-escape-char . ;
cc-escape-char <- '[' / ']' / '-' / '^' / '\\' / 'n' / 't' ;
LB < '[' ;
RB < ']' ;
DASH < '-' ;

peg <-- SP grammar+ ;
grammar <-- (nonterminal ('<--' / '<-' / '<') SP pattern) ';' SP ;
pattern <-- alternative (SLASH SP alternative)* ;
alternative <-- expression+ ;
expression <-- [!]? SP primary ([*+?] SP)? ;
primary <-- '(' SP pattern ')' SP / '.' SP / literal / charclass / nonterminal ;
SLASH < '/' ;"

DRRACKET TERMINAL

(peg peg "nt-char <- [a-zA-Z0-9_\-] ;
nonterminal <-- nt-char+ !nt-char SP ;
SP < [ \t\n]* ;

literal <-- SQ (BS ['\\] / !['\\] .)* SQ SP ;
SQ < ['] ;
BS < [\\] ;

charclass <-- LB '^'? (cc-range / cc-escape / cc-single)+ RB SP ;
cc-range <-- cc-char DASH cc-char ;
cc-escape <-- BS . ;
cc-single <-- cc-char ;
cc-char <- !cc-escape-char . ;
cc-escape-char <- '[' / ']' / '-' / '^' / '\\' / 'n' / 't' ;
LB < '[' ;
RB < ']' ;
DASH < '-' ;

peg <-- SP grammar+ ;
grammar <-- (nonterminal ('<--' / '<-' / '<') SP pattern) ';' SP ;
pattern <-- alternative (SLASH SP alternative)* ;
alternative <-- expression+ ;
expression <-- [!]? SP primary ([*+?] SP)? ;
primary <-- '(' SP pattern ')' SP / '.' SP / literal / charclass / nonterminal ;
SLASH < '/' ;
")

OUTPUT

read: unknown escape sequence - in string

generate #lang peg/syntax

#7 #13

this is a very future functionality.

The peg lang actually generate structs that is cool to do a lot of things, but is not so cool to implement languages inside racket. In #lang peg/syntax the generate object is a syntax object, so the parser file can be the reader file directly.

We don't need to add lexical information on create a language, just create the semantic, import in the parser, and the parser becomes a reader because generate syntax object.

use case :

The user write a language that have a formal semantic defined.
So, he write the grammar under #lang peg, make some tests, extend, etc
after this, he change the lang in the parser to #lang peg/syntax and voilà, a reader.

[n] charclass

I'm not allowed to put [n] in a charclass. fix this.

Add verbose option on peg

There's only two options to working peg : he match or he dont. When I have problems with peg, I miss some path information

#lang peg

number <- [0-9]+;
identifier <- [a-zA-Z]+ ;
terminal <- number /    identifier ;

(peg terminal "123" #:verbose) outputs :

match "123" with terminal -> number -> "123"

or (peg terminal "foobar" #:verbose) outputs :

match "foobar" with terminal -> number -> fail -> terminal ->identifier -> "foobar"

\n escapes inside strings in the peg #lang

Release under LGPLv3+

Hi,

I note that the library is released under GPLv3. Is it possible relicensing under LGPLv3+ so it can be used in other projects without forcing them to become GPL projects?

In case of affirmative answer, let me know, that I will create a pull request with correct headers on all source-files, and a clarification of the license in the README, following the GNU guidelines.

Otherwise if the GPL license was chosen deliberately, it is obviously 100% fair, i.e. it is your work! :-)

Many thanks in any case,
Massimo

Overriding/adding alternatives to rules from imported modules

It would be useful to be able to declare a grammar which extends an existing one with new alternatives for a rule, e.g. something like:

expr.rkt:

#lang peg

program <- expr;

expr <- sum;

sum <- (a:product op:('+' / '-') b:sum) / x:product ->
  (if x x (list (string->symbol op) a b));

product <- (a:value op:('*' / '/') b:product) / x:value ->
  (if x x (list (string->symbol op) a b));

value <- x:number / '(' x:expr ')' -> x;

number <- n:[0-9]+ -> (string->number n);

enhanced-expr.rkt:

#lang peg

(require "expr.rkt");

program <- expr;
value <- 'sqrt(' e:expr ')' -> (list sqrt e);

and then have the REPL behave like

Welcome to Racket v7.4.
> (require "enhanced-expr.rkt")
> (require peg)
> (peg program "1+sqrt(16)")
'(+ 1 (#<procedure:sqrt> 16))

I hope the example is clear enough to express the problem I'm facing. Ideally, I'd like to have a way of adding new alternatives to rules defined in another source file, or to override them and have the ability to access the original alternatives to keep them available.

I'm very new to Racket and I apologize if this isn't a limitation of racket-peg but of the Racket module system and I'm open to any workaround suggestion. I think I understand why it's not working this way - definitions in a module have nothing to do with definitions in required modules, which keep their original bindings - however I have no idea how to workaround the issue. Maybe there is some alternative to require which "embeds" the specified module in the current one and has some policy for definition overrides?