
tatsu's Introduction


At least for the people who send me mail about a new language that they're designing, the general advice is: do it to learn about how to write a compiler. Don't have any expectations that anyone will use it, unless you hook up with some sort of organization in a position to push it hard. It's a lottery, and some can buy a lot of the tickets. There are plenty of beautiful languages (more beautiful than C) that didn't catch on. But someone does win the lottery, and doing a language at least teaches you something.

Dennis Ritchie (1941-2011) Creator of the C programming language and of Unix

TatSu

TatSu is a tool that takes grammars in a variation of EBNF as input, and outputs memoizing (Packrat) PEG parsers in Python.

Why use a PEG parser? Because regular languages (those parsable with Python's re package) "cannot count". Any language with nested structures or balanced delimiters requires more than regular expressions to parse.

TatSu can compile a grammar stored in a string into a tatsu.grammars.Grammar object that can be used to parse any given input, much like the re module does with regular expressions, or it can generate a Python module that implements the parser.

TatSu supports left-recursive rules in PEG grammars using the algorithm by Laurent and Mens. The generated AST has the expected left associativity.
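To illustrate, here is a minimal sketch (not taken from the TatSu docs; the grammar is made up) of a directly left-recursive rule:

```python
import tatsu

# Minimal left-recursive grammar. @@left_recursion enables the
# Laurent & Mens algorithm mentioned above; it is on by default in
# recent TatSu releases and is shown here only for explicitness.
GRAMMAR = r'''
    @@grammar::MINUS
    @@left_recursion :: True

    start = expr $ ;
    expr
        =
        | expr '-' number
        | number
        ;
    number = /\d+/ ;
'''

ast = tatsu.parse(GRAMMAR, '1 - 2 - 3')
print(ast)  # the nesting groups to the left, as in ((1 - 2) - 3)
```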

TatSu requires a maintained version of Python (3.11+ at the moment). While no code in 竜 TatSu yet depends on new language or standard library features, the authors don't want to be constrained by Python version compatibility considerations when developing features that will be part of future releases.

Installation

$ pip install TatSu

Using the Tool

TatSu can be used as a library, much like Python's re, by embedding grammars as strings and generating grammar models instead of generating Python code.

This compiles the grammar and generates an in-memory parser that can subsequently be used to parse input.

parser = tatsu.compile(grammar)

This compiles the grammar and parses the given input, producing an AST as a result.

ast = tatsu.parse(grammar, input)

The result is equivalent to calling:

parser = tatsu.compile(grammar)
ast = parser.parse(input)

Compiled grammars are cached for efficiency.

This compiles the grammar to the Python source code that implements the parser.

parser_source = tatsu.to_python_sourcecode(grammar)
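As a sketch of how the generated source might be used (the grammar and file name here are made up for illustration), the source can be written out as a regular importable module:

```python
import tatsu

grammar = "start = 'hello' $ ;"

# Generate the Python source that implements the parser.
parser_source = tatsu.to_python_sourcecode(grammar, name='Hello')

# Write it out as an importable module.
with open('hello_parser.py', 'w') as f:
    f.write(parser_source)
```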

This is an example of how to use 竜 TatSu as a library:

GRAMMAR = '''
    @@grammar::CALC


    start = expression $ ;


    expression
        =
        | expression '+' term
        | expression '-' term
        | term
        ;


    term
        =
        | term '*' factor
        | term '/' factor
        | factor
        ;


    factor
        =
        | '(' expression ')'
        | number
        ;


    number = /\d+/ ;
'''


if __name__ == '__main__':
    import json
    from tatsu import parse
    from tatsu.util import asjson

    ast = parse(GRAMMAR, '3 + 5 * ( 10 - 20 )')
    print(json.dumps(asjson(ast), indent=2))

TatSu will use the first rule defined in the grammar as the start rule.

This is the output:

[
  "3",
  "+",
  [
    "5",
    "*",
    [
      "10",
      "-",
      "20"
    ]
  ]
]

Documentation

For a detailed explanation of what 竜 TatSu is capable of, please see the documentation.

Questions?

Please use the [tatsu] tag on StackOverflow for general Q&A, and limit GitHub issues to bugs, enhancement proposals, and feature requests.

Changes

See the RELEASES for details.

License

You may use 竜 TatSu under the terms of the BSD-style license described in the enclosed LICENSE.txt file. If your project requires different licensing, please get in touch by email.

tatsu's People

Contributors

acw1251, alanpost, apalala, azazel75, by-exist, conao3, davesque, davidchen, dnicolodi, dtrckd, fcoelho, gegenschall, hipe, joshtgl, manueljacob, matsjoyce, nicholasbishop, okomarov, pallavg, paulhoule, progval, r-chaves, rayjolt, sauerburger, sdcloudt, ssteve, supriyo-biswas, tirkarthi, victorious3



tatsu's Issues

Fails with error if a regexp contains escapes available only in "regex" module

I have a rule like this:

name
    =
    ?'[$_\p{L}][$_\p{L}\p{N}]*'
    ;

TatSu complains and exits with an error:

typescript.ebnf(67:32) regexp error: bad escape \p at position 3 :
    ?'[$_\p{L}][$_\p{L}\p{N}]*'

even though the regexp is valid for the optional regex module.

It seems to me that this is caused by the parser_semantics module doing a direct import of the stdlib re module instead of a from .utils import re like the other modules. utils correctly tries to import regex before falling back to re.

Compiling grammar from command or lib does not yield the same result

I have a grammar with a directive @@whitespace :: /[\t ]+/. If I use tatsu.to_python_sourcecode, it produces code where the Buffer class is initialised with whitespace='[\\t ]+'. But if I use the tatsu command, the code has whitespace=re.compile('[\\t ]+') instead.

I've encountered very weird parsing errors due to this, I presume that white space removal was eating a t at the beginning of a keyword called trans because the error was something like:

... expecting 'trans'
    trans ....
     ^

with white space before trans and the caret under the r.

asjson undefined on lists

With both python2 and python3, asjson() as used in the example code in the README causes an error. Specifically:

print(json.dumps(ast.asjson(), indent=2))

This is pretty easily fixed by changing to this:

print(json.dumps(ast, indent=2))

Is this a bug and its solution, or am I doing something wrong? 🌳
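For reference, the pattern in the current README uses the module-level helper, which also handles plain lists; a minimal sketch (assuming tatsu.util.asjson is available, as in the README example above):

```python
import json
from tatsu.util import asjson

# asjson converts AST objects (and plain lists) into
# JSON-serializable structures before dumping.
ast = ['3', '+', ['5', '*', '7']]
print(json.dumps(asjson(ast), indent=2))
```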

Recommendations for converting a project from ply to tatsu

I have a fairly complex example of parsing some text data. It is currently based on ply and I'm finding it quite hard to debug.

I was wondering if you can give some general guidance on converting a project from ply to TatSu, whether you think it's worth it, and where you see the advantages and disadvantages. As you can tell, I have never used tatsu before, but it looks very promising.

Some of the features that I definitely need from ply:

  • Setting precedence rules
  • Nested contexts, i.e., different rules for token parsing inside of certain contexts
  • Might need semantic actions although that probably depends on specific mechanics of tatsu

In the end I want to translate content from a flat file into database models.

Allow False as a value for @@whitespace

Currently using this inside my grammar:

@@whitespace :: /(?!)/

which doesn't match anything.

@@whitespace :: False

would be more descriptive and probably more performant. True would set it to the default value.
Alternatively it could accept a string and then convert it to a regex, just like it does when calling parse with whitespace=...
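For comparison, whitespace skipping can already be turned off per call; a minimal sketch (assuming, per the docs, that an empty whitespace pattern disables skipping; the grammar is made up):

```python
import tatsu

GRAMMAR = '''
    @@nameguard :: False
    start = 'a' 'b' $ ;
'''

# With whitespace skipping disabled, the tokens must be
# adjacent in the input, so 'ab' parses.
ast = tatsu.parse(GRAMMAR, 'ab', whitespace='')
```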

TatSu needs a mascot...

Like Tatsu!


Sorry, I couldn't help it...I always loved Grako but never liked compiling my grammars. Imagine my excitement when I saw this, especially after I saw the name!

Allow namechars to appear in token when performing nameguard check

Consider:

myrule = 
"key" ~ ";" |
"key-word" ~ ";" |
"key-word-extra" ~ ";" ;

And set namechars = '-'.

This will happily parse "key" and "key-word" but not "key-word-extra", because nameguard requires that the token being matched is alphanumeric:

partial_match = (
                token.isalnum() and
                token[0].isalpha() and
                self.is_name_char(self.current())
            )

I think this check should allow namechars to appear in the token being matched too.
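A sketch of the relaxed check being proposed (hypothetical code, not TatSu's actual implementation): strip the configured namechars from the token before the alphanumeric test.

```python
# Hypothetical relaxed partial-match predicate: a token counts as
# name-like if, after removing namechars, it is alphanumeric and
# starts with a letter.
def is_namelike(token: str, namechars: str = '-') -> bool:
    stripped = ''.join(c for c in token if c not in namechars)
    return bool(stripped) and stripped.isalnum() and token[0].isalpha()

# 'key', 'key-word', and 'key-word-extra' are now all name-like,
# so the nameguard check applies to them uniformly.
```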

calc example: parse_factored() fails against "1*2+3-4"

I played around with example/calc, and modified expressions in calc.py like below:

"3 + 5 * (10 - 20)" -> "1 * 2 + 3 - 4"

Then, calc.py failed on parse_factored(). Is this a problem with the grammar?

Here is the modified calc.py and error log.

my environment

  • Python 3.6.3
  • tatsu 4.2.5

Example code won't run with Python 2.7.12

I ran pip install tatsu in my virtualenv and tried evaluating the example arithmetic parser as found at https://pypi.python.org/pypi/tatsu (the exact code I ran is included below) and the exact output with uncaught exception can be found below:

$ python testtatsu.py
# PPRINT
[ u'3',
  u'+',
  [ u'5',
    u'*',
    [ u'(',
      [ u'10',
        u'-',
        u'20'],
      u')']]]
()
# JSON
Traceback (most recent call last):
  File "testtatsu.py", line 46, in <module>
    print(json.dumps(ast.asjson(), indent=2))
AttributeError: 'list' object has no attribute 'asjson'

I'm running Python 2.7.12 on Ubuntu 16.04. Here's the output of my pip freeze and apologies for some of the packages listed not being relevant:

asn1crypto==0.22.0
bcrypt==3.1.3
cffi==1.11.0
cryptography==2.0.3
enum34==1.1.6
idna==2.6
ipaddress==1.0.18
paramiko==2.3.0
pkg-resources==0.0.0
pyasn1==0.3.5
pycparser==2.18
PyNaCl==1.1.2
six==1.11.0
TatSu==4.2.3

Here's the arithmetic example I ran:

GRAMMAR = '''
    @@grammar::CALC


    start = expression $ ;


    expression
        =
        | expression '+' term
        | expression '-' term
        | term
        ;


    term
        =
        | term '*' factor
        | term '/' factor
        | factor
        ;


    factor
        =
        | '(' expression ')'
        | number
        ;


    number = /\d+/ ;
'''


if __name__ == '__main__':
    import pprint
    import json
    from tatsu import parse

    ast = parse(GRAMMAR, '3 + 5 * ( 10 - 20 )')
    print('# PPRINT')
    pprint.pprint(ast, indent=2, width=20)
    print()

    print('# JSON')
    print(json.dumps(ast.asjson(), indent=2))
    print()

Colorized output broken on Windows

As the title says, running tatsu in CLI mode with "--color", or passing "colorize=True" to e.g. parser.parse(), produces garbled output: the color control characters appear literally.
If I manually initialize colorama and print something myself, colors for my test output show up, but colors output by tatsu still appear as garbled characters.
I suspect there's something problematic with the integration of trace events and logging that's used to output traces.

Return type of { e }+, { e } and { e }*

Regarding { e }+, { e } and { e }*, the documentation states:

The AST returned for a closure is always a list.

It seems this isn't true, however: when just a single element is parsed, the element itself is returned rather than a singleton list, and when nothing is parsed, we get null.

Relevant grammar example:

redirection
    =
    command:base_command { '>' outfile:arg }
    ;

base_command
    =
    | cd: (/cd/ | /'cd'/) [ dest_dir:arg ]
    | args: { arg }+
    ;

arg
    =
    | quoted_arg: /'[^'|;>]*'/
    | unquoted_arg: /[^\s|;>]+/
    ;

(I've included stuff about the command for context, but I think that it can be safely ignored)

Input-AST:

ls > out:

            "redirection": {
                "command": {
                    "args": [
                        {
                            "unquoted_arg": "ls",
                            "quoted_arg": null
                        }
                    ],
                    "cd": null,
                    "dest_dir": null
                },
                "outfile": {
                    "unquoted_arg": "out",
                    "quoted_arg": null
                }
            },

ls >

            "redirection": {
                "command": {
                    "args": [
                        {
                            "unquoted_arg": "ls",
                            "quoted_arg": null
                        }
                    ],
                    "cd": null,
                    "dest_dir": null
                },
                "outfile": null
            },

Value of ignorecase gets lost

If I use the @@ignorecase directive in the EBNF grammar or pass ignorecase=True at compilation time, I find that the value is not carried through correctly to the Context object, thus I have to do

mygrammar.parse(something,ignorecase=True)

which is close enough for rock and roll but does not match the documentation.

Editor support for vscode

I thought I should post this here since the repo contains grammars for other editors, and it could be useful to someone.

I've written a vscode addon a while ago to support syntax highlighting and eventually autocompletion for symbols: https://github.com/Victorious3/vscode-TatSu

It is quite involved, but sadly unfinished since I discovered a glaring issue in vscode that makes writing it a complete pain.

Waiting for this PR: microsoft/vscode-languageserver-node#367
And this: microsoft/vscode#580

Apart from that, writing TypeScript is only marginally less annoying than writing JavaScript, so my motivation to enhance it is limited.


It's not on the Marketplace yet since its performance characteristics are a bit... questionable (due to the issue mentioned above, it parses the entire file on every edit) and it lacks support for goto-symbol.

PEG order does not apply when left&right options

    def test_peg_associativity(self):
        left_grammar = '''
            @@left_recursion :: True
            @@nameguard :: False

            start = A $ ;
            A = | A 'a' | 'a' A | 'a' ;
        '''

        assert [['a', 'a'], 'a'] == parse(left_grammar, 'aaa')  # warning: fails

        right_grammar = '''
            @@left_recursion :: True
            @@nameguard :: False

            start = A $ ;
            A = | 'a' A | A 'a' | 'a' ;
        '''

        assert ['a', ['a', 'a']] == parse(right_grammar, 'aaa')

ModelBuilderSemantics Subclassing

start = expression $;

expression =
    | addition
    | subtraction
    | number;

addition::BinaryOp::Op = left:expression op:'+' ~ right:number;
subtraction::BinaryOp::Op = left:expression op:'-' ~ right:number;

number = /\d+/;

This triggers an exception: https://pastebin.com/raw/qMUUXEKb
It works when you remove left:expression in addition and subtraction

I'm not sure how this feature is supposed to work in the first place, how does it decide what to put on the base class and what on the inherited one? I suppose the workaround would be to create the base class manually and pass it to the ModelBuilderSemantics instance.

In either case, this should be documented better.

Void doesn't end recursion

    def test_nullable_void(self):
        left_grammar = '''
            @@left_recursion :: True
            @@nameguard :: False

            start = A $ ;
            A = | A 'a' | () ;
        '''

        assert [['a', 'a'], 'a'] == parse(left_grammar, 'aaa')  # warning: infinite recursion

Document support for preprocessing code

Handling comments inside my grammar has been such a performance drag that I decided to strip them in a preprocessing step (I have /* nested /* comments */*/). There's a method called _preprocess in Buffer which I ended up overriding for this purpose.

Sadly that completely messes with the line numbers. I found no provision in TatSu for this so I ended up generating my own LineCache in a very similar way to TatSu and converting the "wrong" line numbers to my "real" line numbers before giving my diagnostics. This... worked, but its obviously not ideal.

I have no idea how to generalize my solution but I still think TatSu could support this in some way, so I leave it open for discussion.

Here are my thoughts on it:

  • While adding nested comments would fix my issue for the time being, adding more features to my language will probably bring this up again in the future.
  • Parsing languages which rely on indentation would be another example where this could prove useful.
  • I was thinking about running a different grammar first instead of rolling my own ugly (but fast) lexer, but that wouldn't fix the line numbers either. Maybe this could be included? A "preprocessing grammar"?
  • Maybe I could let my preprocessor insert #line directives (see C). If those were supported by TatSu it could make for a simple but powerful solution. (Such a feature would have to be customizable to avoid clashes). Cons: I'd have to do math to figure out what directives to generate. Math is annoying.

Left recursion is broken

NOTE: Original post was by Oleg Broytman

This simple grammar:

@@grammar :: Minus

start = expression $ ;

expression = minus_expression | sub_expression ;

sub_expression = paren_expression | value ;

minus_expression = expression '-' sub_expression ;

paren_expression = '(' expression ')' ;

value = /[0-9]+/ ;

has indirect left recursion. The generated parser (grako 3.10.0) correctly parses the expressions

3
3 - 2
(3 - 2)
(3 - 2) - 1
3 - 2 - 1

but fails on 3 - (2 - 1):

grako.exceptions.FailedParse: (1:3) Expecting end of text. :
3 - (2 - 1)
  ^
start

Running ./parser.py -t test, where test is a file with content 3 - (2 - 1), I see that parsing stops after 2: the parser recognizes 2 as a value, and thus as an expression, and then expects a closing paren ) to finish paren_expression; not finding one, it fails to recognize (2 - 1) as a paren_expression and aborts the entire expression.

Is it a problem with my grammar or a bug in grako?

Minor change in the grammar:

paren_expression = '(' ( minus_expression | value ) ')' ;

fixes the problem, but shouldn't the first version work too?

Need a minimal hello-tatsu test case; getting errors.

I did not want to hijack another issue where they say more examples are required, but I agree with it.

I wanted to make a "hello tatsu" example to see if I could use it. Right now, I created a file with nothing but one line in it:

hello.

If I create the test.ebnf file like this:

# test of Tatsu
source = sentence* $ ;
sentence = 'hello.' ;

I received the following error:

tatsu.exceptions.FailedRef: in(1:1) could not resolve reference to rule 'start' :
hello.
^

Okay, is "start" a reserved word for the top-most rule? I changed the rule name "source" to "start" but that doesn't change anything. Still get the same error.

Anyway, a walkthrough to introduce the workflow of using Tatsu would be welcome.

ModelBuilderSemantics doesn't include optional attributes

If I have a rule with an optional named component, that component shows up as None in the AST, but is missing from the model class when using ModelBuilderSemantics; see the code below:

import tatsu
from tatsu.model import ModelBuilderSemantics

grammar = r"""
foo::Foo = left:identifier [ ':' right:identifier ] $ ;
identifier = /\w+/ ;
"""

grammar = tatsu.compile(grammar)

a = grammar.parse('foo : bar', semantics=ModelBuilderSemantics())
assert a.left == 'foo'
assert a.right == 'bar'

b = grammar.parse('foo', semantics=ModelBuilderSemantics())
assert b.left == 'foo'
assert b.right is None  # AttributeError: 'Foo' object has no attribute 'right'

Generated parser fails with "could not resolve reference to rule 'start'"

I am trying to use TatSu on a new project (after having used grako successfully on a previous project), but I am running into this issue. I have created a (hopefully) minimal test case that reproduces the same issue I am seeing; my actual grammar is more complex, but I don't think it is the issue.

Setup

Python running in virtual environment:

Python 2.7.10 (default, Jul 15 2017, 17:16:57)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

TatSu version: TatSu==4.2.5
Input grammar: test.ebnf.txt
Parser generated with python -m tatsu test.ebnf.txt > test_parser.py:

------------------------------------------------------------------------
          19  lines in grammar
           3  rules in grammar
          12  nodes in AST

The issue

Below is the output when testing the parser with the input string 4 + 5 on stdin:

(py2env) ryan@mrecho ~/Developer/projects/tmp $ python test_parser.py
4+5
Traceback (most recent call last):
  File "test_parser.py", line 123, in <module>
    ast = generic_main(main, TESTParser, name='TEST')
  File "/Users/ryan/Developer/projects/py2env/lib/python2.7/site-packages/tatsu/util.py", line 338, in generic_main
    colorize=args.color
  File "test_parser.py", line 116, in main
    return parser.parse(text, start=start, filename=filename, **kwargs)
  File "/Users/ryan/Developer/projects/py2env/lib/python2.7/site-packages/tatsu/contexts.py", line 211, in parse
    raise self._furthest_exception
tatsu.exceptions.FailedRef: -(1:1) could not resolve reference to rule 'start' :
4 + 5
^

I get the same result and output when the input string 4 + 5 is placed in test_input.txt and run via python test_parser.py test_input.txt.

Expected result

The expected result would be the output of the following short program:

def main():
    """
    Main script function.
    """
    with open(sys.argv[1], 'r') as grammar_file:
        grammar = grammar_file.read()

    with open(sys.argv[2], 'r') as input_file:
        input_text = input_file.read()

    model = tatsu.compile(grammar, trace=False)
    ast = model.parse(input_text, trace=False, colorize=True)
    print(tatsu.util.asjsons(ast))

if __name__ == "__main__":
    main()

Expected output:

[
  "4",
  "+",
  "5"
]

Error message doesn't contain the line on which the error happens

import tatsu

GRAMMAR_TEXT = """
start = ATTRIBUTE operator value ;

ATTRIBUTE = /[A-Za-z]+/;

operator = '='
         | '!='
         ;

value = NUMBER ;

NUMBER = /[0-9]+/ ;
"""


def parse(data):
    with open('gr.py', 'w') as fd:
        fd.write(tatsu.to_python_sourcecode(GRAMMAR_TEXT))
    grammar = tatsu.compile(GRAMMAR_TEXT)
    return grammar.parse(data)


for sample in ['foo = 1', 'foo']:
    print(sample)
    print(parse(sample))
python t.py
foo = 1
['foo', '=', '1']
foo
Traceback (most recent call last):
  File "t.py", line 27, in <module>
    print(parse(sample))
  File "t.py", line 22, in parse
    return grammar.parse(data)
  File "/Users/wizard/Code/TatSu/tatsu/grammars.py", line 946, in parse
    **kwargs
  File "/Users/wizard/Code/TatSu/tatsu/contexts.py", line 211, in parse
    raise self._furthest_exception
  File "/Users/wizard/Code/TatSu/tatsu/contexts.py", line 203, in parse
    result = rule()
  File "/Users/wizard/Code/TatSu/tatsu/grammars.py", line 694, in parse
    result = self._parse_rhs(ctx, self.exp)
  File "/Users/wizard/Code/TatSu/tatsu/grammars.py", line 701, in _parse_rhs
    result = ctx._call(ruleinfo)
  File "/Users/wizard/Code/TatSu/tatsu/contexts.py", line 509, in _call
    result = self._recursive_call(ruleinfo)
  File "/Users/wizard/Code/TatSu/tatsu/contexts.py", line 540, in _recursive_call
    result = self._invoke_rule(ruleinfo, key)
  File "/Users/wizard/Code/TatSu/tatsu/contexts.py", line 573, in _invoke_rule
    ruleinfo.impl(self)
  File "/Users/wizard/Code/TatSu/tatsu/grammars.py", line 349, in parse
    ctx.last_node = [s.parse(ctx) for s in self.sequence]
  File "/Users/wizard/Code/TatSu/tatsu/grammars.py", line 349, in <listcomp>
    ctx.last_node = [s.parse(ctx) for s in self.sequence]
  File "/Users/wizard/Code/TatSu/tatsu/grammars.py", line 650, in parse
    return rule()
  File "/Users/wizard/Code/TatSu/tatsu/grammars.py", line 694, in parse
    result = self._parse_rhs(ctx, self.exp)
  File "/Users/wizard/Code/TatSu/tatsu/grammars.py", line 701, in _parse_rhs
    result = ctx._call(ruleinfo)
  File "/Users/wizard/Code/TatSu/tatsu/contexts.py", line 509, in _call
    result = self._recursive_call(ruleinfo)
  File "/Users/wizard/Code/TatSu/tatsu/contexts.py", line 540, in _recursive_call
    result = self._invoke_rule(ruleinfo, key)
  File "/Users/wizard/Code/TatSu/tatsu/contexts.py", line 573, in _invoke_rule
    ruleinfo.impl(self)
  File "/Users/wizard/Code/TatSu/tatsu/grammars.py", line 401, in parse
    ctx._error('no available options')
  File "/Users/wizard/Code/TatSu/tatsu/contexts.py", line 435, in _error
    raise self._make_exception(item, exclass=exclass)
tatsu.exceptions.FailedParse: (1:1) no available options :

^
operator
start

Expected something like

  File "/Users/wizard/Code/TatSu/tatsu/contexts.py", line 435, in _error
    raise self._make_exception(item, exclass=exclass)
tatsu.exceptions.FailedParse: (1:1) no available options :
foo
^
operator
start

@@nameguard not honored in simple grammar

I am puzzled by this behavior:

from pprint import pprint
from tatsu import parse

GRAMMAR = """
sequence = {fingering}+ ;
fingering = '1' | '2' | '3' | '4' | '5' | 'x' ;
"""

test = "22"
ast = parse(GRAMMAR, test)
pprint(ast)  # Prints ['2', '2']

test = "xx"
ast = parse(GRAMMAR, test)
pprint(ast)  # tatsu.exceptions.FailedParse: (1:1) no available options :

Setting the grammar to the following works as expected:

GRAMMAR = """
sequence = {fingering}+ ;
fingering = /[1-5x]/ ; 
"""

Am I missing something?

Left recursion bug when not at top level?

I have the following grammar that purports to parse declarations of the form: <type> <name>. The type is either an identifier, or an array of type, so int x and int[] x and int[][] x would be valid declarations.

import tatsu

grammar = r"""
identifier = /\w+/ ;

type = (vector_type | leaf_type) ;
vector_type = base:type '[]' ;
leaf_type = id:identifier ;

decl = type:type name:identifier ;
"""

print(tatsu.parse(grammar, 'int x', start='decl'))

Yet when I try to parse int x, I get:

tatsu.exceptions.FailedToken: (1:1) expecting '[]' :

^
vector_type
type
decl

int[] x works fine, and so does int if parsed at the top level (start='type'). I'm using TatSu 4.3.0.

No boolean literals

`10` -> 10
`string` -> 'string'
`'long string'` -> 'long string'
`True` -> 'True'

Since number literals are passed on, I would expect the same to happen for boolean literals, but instead they get converted to strings. The same happens for None. I can use 0 and 1 instead, but it's not really the same.

Workaround:

t_bool::bool = k_true @:`1` | k_false @:`0`;
t_bool_lit::Boolean = VALUE:t_bool;

Unit tests don't cover generated parser

There's one issue with how the unittests run currently.
There are significant differences between how the parser behaves when running from the generated code and when running in immediate mode.
Tatsu's self test doesn't cover all the features that have been added, so for completeness all the other unit tests should run twice.

Help needed with a certain grammar

I have the following grammar:

@@grammar :: Shell
start                    = word $;
word                     = squoted | dquoted | bareword;
squoted                  = /'[^']*'/;
identifier               = /[a-zA-Z_][a-zA-Z0-9_]*/;
variable                 = '$' ( ( identifier | '?') | '{' ( identifier | '?' ) '}');
commsub                  = '$(' word ')' | '`' word '`';
dquoted                  = '"' { ( commsub | variable | dquoted_nonspec ) }* '"';
dquoted_nonspec          = /(?:\\.|[^\\`])/;
bareword                 = { ( commsub | variable | bareword_nonspec) }*;
bareword_nonspec         = /(?:\\.|[^'"&\s()<>`|;#])/;

According to my understanding, the above rules should cause a string such as "hello" to be parsed as a set of dquoted_nonspec tokens.

However, on trying to parse "hello" with the above grammar, I get a FailedParse exception, since it seems tatsu only tries commsub and does not bother with checking if it is a variable or dquoted_nonspec.

tatsu.exceptions.FailedParse: (1:1) no available options :

^
commsub
dquoted
word
start

Am I doing something wrong here?

Gather works with string literals/regexps but not with names

I have the following grammar (whitespace is disabled with whitespace='' on the parser.parse() method), and it compiles fine when using tatsu.compile():

start = /\s+/.{ /\S+/ }* $;

However, when trying to use this grammar instead:

ws     = /\s+/;
start  = ws.{ /\S+/ }* $;

I get the error mentioned below. Shouldn't both of these be valid and accepted by TatSu?

Traceback (most recent call last):
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 672, in _option
    yield
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 756, in _repeat
    self._isolate(block)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 735, in _isolate
    block()
  File "/me/venv/lib/python3.7/site-packages/tatsu/bootstrap.py", line 99, in block5
    self._rule_()
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 54, in wrapper
    return self._call(ruleinfo)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 509, in _call
    result = self._recursive_call(ruleinfo)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 540, in _recursive_call
    result = self._invoke_rule(ruleinfo, key)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 573, in _invoke_rule
    ruleinfo.impl(self)
  File "/me/venv/lib/python3.7/site-packages/tatsu/bootstrap.py", line 286, in _rule_
    self._token(';')
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 610, in _token
    self._error(token, exclass=FailedToken)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 435, in _error
    raise self._make_exception(item, exclass=exclass)
tatsu.exceptions.FailedToken: (2:12) expecting ';' :
start  = ws.{ /\S+/ }* $;
           ^
rule
grammar
start

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 203, in parse
    result = rule()
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 54, in wrapper
    return self._call(ruleinfo)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 509, in _call
    result = self._recursive_call(ruleinfo)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 540, in _recursive_call
    result = self._invoke_rule(ruleinfo, key)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 573, in _invoke_rule
    ruleinfo.impl(self)
  File "/me/venv/lib/python3.7/site-packages/tatsu/bootstrap.py", line 84, in _start_
    self._grammar_()
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 54, in wrapper
    return self._call(ruleinfo)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 509, in _call
    result = self._recursive_call(ruleinfo)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 540, in _recursive_call
    result = self._invoke_rule(ruleinfo, key)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 573, in _invoke_rule
    ruleinfo.impl(self)
  File "/me/venv/lib/python3.7/site-packages/tatsu/bootstrap.py", line 100, in _grammar_
    self._positive_closure(block5)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 782, in _positive_closure
    self._repeat(block, prefix=sep, dropprefix=omitsep)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 759, in _repeat
    self._error('empty closure')
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 678, in _option
    raise FailedCut(e)
tatsu.exceptions.FailedCut: (2:12) expecting ';' :
start  = ws.{ /\S+/ }* $;
           ^
rule
grammar
start

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/me/venv/lib/python3.7/site-packages/tatsu/tool.py", line 145, in compile
    model = cache[grammar] = gen.parse(grammar, **kwargs)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 208, in parse
    raise self._furthest_exception
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 672, in _option
    yield
  File "/me/venv/lib/python3.7/site-packages/tatsu/bootstrap.py", line 408, in _named_
    self._named_list_()
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 54, in wrapper
    return self._call(ruleinfo)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 509, in _call
    result = self._recursive_call(ruleinfo)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 540, in _recursive_call
    result = self._invoke_rule(ruleinfo, key)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 573, in _invoke_rule
    ruleinfo.impl(self)
  File "/me/venv/lib/python3.7/site-packages/tatsu/bootstrap.py", line 417, in _named_list_
    self._token('+:')
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 610, in _token
    self._error(token, exclass=FailedToken)
  File "/me/venv/lib/python3.7/site-packages/tatsu/contexts.py", line 435, in _error
    raise self._make_exception(item, exclass=exclass)
tatsu.exceptions.FailedToken: (2:12) expecting '+:' :
start  = ws.{ /\S+/ }* $;
           ^
named_list
named
element
sequence
choice
expre
rule
grammar
start

Another left recursion problem

Grammar:

identifier = /\w+/ ;
expr = mul | identifier ;
mul = expr '*' identifier ;

Parsing a * b with the start rule expr gives the expected result: ['a', '*', 'b']. But parsing with the start rule mul gives the following error:

Traceback (most recent call last):
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 203, in parse
    result = rule()
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 54, in wrapper
    return self._call(ruleinfo)
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 513, in _call
    result = self._recursive_call(ruleinfo)
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 546, in _recursive_call
    result = self._invoke_rule(ruleinfo, pos, key)
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 580, in _invoke_rule
    ruleinfo.impl(self)
  File "parser.py", line 96, in _mul_
    self._token('*')
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 617, in _token
    self._error(token, exclass=FailedToken)
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 436, in _error
    raise self._make_exception(item, exclass=exclass)
tatsu.exceptions.FailedToken: /proc/self/fd/12(1:1) expecting '*' :

^
mul

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "parser.py", line 122, in <module>
    ast = generic_main(main, UnknownParser, name='Unknown')
  File "/home/manu/vcs/TatSu/tatsu/util.py", line 335, in generic_main
    colorize=args.color
  File "parser.py", line 115, in main
    return parser.parse(text, startrule, filename=filename, **kwargs)
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 211, in parse
    raise self._furthest_exception
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 679, in _option
    yield
  File "parser.py", line 88, in _expr_
    self._mul_()
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 54, in wrapper
    return self._call(ruleinfo)
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 513, in _call
    result = self._recursive_call(ruleinfo)
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 546, in _recursive_call
    result = self._invoke_rule(ruleinfo, pos, key)
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 580, in _invoke_rule
    ruleinfo.impl(self)
  File "parser.py", line 96, in _mul_
    self._token('*')
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 617, in _token
    self._error(token, exclass=FailedToken)
  File "/home/manu/vcs/TatSu/tatsu/contexts.py", line 436, in _error
    raise self._make_exception(item, exclass=exclass)
tatsu.exceptions.FailedToken: /proc/self/fd/12(1:1) expecting '*' :

^
mul
expr
mul

@@keyword confusion

This is valid:

@@namechars :: '_'
@@keyword :: if else not and or false true var let type

But if you swap those two lines:

@@keyword :: if else not and or false true var let type
@@namechars :: '_'

It gives the following error:

no available options :
@@namechars :: '_'
 ^
decorator
rule
grammar
start

Subclassing wins over Python-style comments

start = expression $;

expression =
#    | addition
    | subtraction
    | number;

#addition::BinaryOp::Op = left:expression op:'+' ~ right:number;
subtraction::BinaryOp = left:expression op:'-' ~ right:number;

number = /\d+/;

Fails with: https://pastebin.com/raw/gzJwygGy

This, however, works just fine:

+(*addition::BinaryOp::Op = left:expression op:'+' ~ right:number;*)
-#addition::BinaryOp::Op = left:expression op:'+' ~ right:number;
 subtraction::BinaryOp = left:expression op:'-' ~ right:number;

Unreasonably long exception traces

This is more of a quality-of-life change, and not a priority.
Something I've noticed from reading several stack traces involving TatSu is that they always span several hundred lines, including re-raising with mostly useless information. This is due to a mixture of using exceptions for flow control and the nature of recursive-descent parsers, which tend to call a lot of methods.

Since the entire stack trace is hardly ever needed, I propose suppressing everything but the exception message at the top level (i.e. the compile/parse functions) unless a flag is set.
Python 3 has

raise exception from None

but someone has probably backported this feature to support 2.7 as well.

Scala has a NoStackTrace mixin which prevents the trace from being generated in the first place; if something like this exists for Python, it could spare quite a bit of memory and some CPU time as well.
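The suppression idea can be sketched with a small top-level wrapper (a sketch only; parse_quietly is a hypothetical helper, not a TatSu API):

```python
import sys

def parse_quietly(parse, text):
    """Call a parse function but surface only the final error message.

    'from None' drops the chained "During handling of the above
    exception..." context; tracebacklimit trims the frames that
    remain when the exception is printed. Illustrative only.
    """
    try:
        return parse(text)
    except Exception as e:
        sys.tracebacklimit = 0
        # Assumes the exception type can be rebuilt from its message.
        raise type(e)(str(e)) from None
```

A flag such as the one proposed above could simply skip the except branch and re-raise the original exception unchanged.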

Allow AST changes

I would like to make changes to an AST data structure after generating one, but such operations are not allowed. Is there a compelling reason for this? Could this be allowed via an option of some sort? Another approach would be a function to copy the AST into an analogous mutable data structure.
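As a workaround under the current behavior, the copying approach is straightforward to sketch (assuming the AST behaves like nested mappings and sequences; this is not a TatSu API):

```python
from collections.abc import Mapping, Sequence

def to_mutable(node):
    """Recursively copy an AST-like structure into plain dicts and
    lists so the copy can be modified freely (illustrative helper)."""
    if isinstance(node, Mapping):
        return {k: to_mutable(v) for k, v in node.items()}
    if isinstance(node, Sequence) and not isinstance(node, (str, bytes)):
        return [to_mutable(v) for v in node]
    return node
```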

Indirect Left Recursion

Taken from the calculator example, simplified:

start = expression $;
number = /\d+/;

addition = expression '+' number;
subtraction = expression '-' number;

expression =
    | addition
    | subtraction
    | number;

This fails on the input 1-1+1 with

(1:4) infinite left recursion :
1-1+1
   ^
expression
subtraction
expression
start

and on 1+1-1 with

(1:4) Expecting end of text :
1+1-1
   ^
start

There is a workaround for this, changing the grammar as follows:

 expression =
+    | >addition
+    | >subtraction
-    | addition
-    | subtraction
     | number;

or like this:

-addition = expression '+' number;
+addition = expression ('+' | '-' ) number;
-subtraction = expression '-' number;

 expression =
     | addition
-    | subtraction
     | number;

yields the correct result with both inputs.

Sadly both workarounds make walking the AST more difficult.
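For what it's worth, the nested [left, op, right] lists that the merged-rule workaround produces can still be evaluated left-associatively with a small walker (a sketch; the exact AST shape is an assumption):

```python
def evaluate(node):
    """Evaluate a nested [left, op, right] AST of '+' and '-'.

    Nesting on the left side gives left associativity; the AST
    shape assumed here is illustrative, not TatSu's exact output.
    """
    if isinstance(node, list):
        left, op, right = node
        lval, rval = evaluate(left), evaluate(right)
        return lval + rval if op == '+' else lval - rval
    return int(node)

# 1+1-1 parsed left-recursively is [['1', '+', '1'], '-', '1']
```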

Trace for 1+1-1:

↙start ~1:1
1+1-1
↙expression↙start ~1:1
1+1-1
↙addition↙expression↙start ~1:1
1+1-1
↙expression↙addition↙expression↙start ~1:1
1+1-1
⟲ expression↙addition↙expression↙start ~1:1
1+1-1
⟲ addition↙expression↙start ~1:1
1+1-1
↙subtraction↙expression↙start ~1:1
1+1-1
↙expression↙subtraction↙expression↙start ~1:1
1+1-1
⟲ expression↙subtraction↙expression↙start ~1:1
1+1-1
⟲ subtraction↙expression↙start ~1:1
1+1-1
↙number↙expression↙start ~1:1
1+1-1
≡'1' /\d+/
+1-1
≡number↙expression↙start ~1:2
+1-1
↙addition↙expression↙start ~1:2
+1-1
↙expression↙addition↙expression↙start ~1:2
+1-1
≡expression↙addition↙expression↙start ~1:2
+1-1
≡'+' 
1-1
↙number↙addition↙expression↙start ~1:3
1-1
≡'1' /\d+/
-1
≡number↙addition↙expression↙start ~1:4
-1
↙expression↙addition↙expression↙start ~1:4
-1
↙addition↙expression↙addition↙expression↙start ~1:4
-1
≡addition↙expression↙addition↙expression↙start ~1:4
-1
≡expression↙addition↙expression↙start ~1:4
-1
≢'+' 
-1
≡addition↙expression↙start ~1:4    # Why does this match even tho '+' didn't on the line above?
-1                                 # Now it should try the subtraction, but doesn't
≡expression↙start ~1:4
-1
≢start ~1:1
1+1-1

Comment handling interferes with simple grammars

Using TatSu version 208e7e8:

>>> import tatsu
>>> g=tatsu.compile("start = '#' $;")
>>> g.parse("#", trace=True)
↙start ~1:1
#'#'start ~1:1
#
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "deps/modules/tatsu/grammars.py", line 945, in parse
    **kwargs
  File "deps/modules/tatsu/contexts.py", line 211, in parse
    raise self._furthest_exception
  File "deps/modules/tatsu/contexts.py", line 203, in parse
    result = rule()
  File "deps/modules/tatsu/grammars.py", line 693, in parse
    result = self._parse_rhs(ctx, self.exp)
  File "deps/modules/tatsu/grammars.py", line 700, in _parse_rhs
    result = ctx._call(ruleinfo)
  File "deps/modules/tatsu/contexts.py", line 509, in _call
    result = self._recursive_call(ruleinfo)
  File "deps/modules/tatsu/contexts.py", line 540, in _recursive_call
    result = self._invoke_rule(ruleinfo, key)
  File "deps/modules/tatsu/contexts.py", line 573, in _invoke_rule
    ruleinfo.impl(self)
  File "deps/modules/tatsu/grammars.py", line 348, in parse
    ctx.last_node = [s.parse(ctx) for s in self.sequence]
  File "deps/modules/tatsu/grammars.py", line 348, in <listcomp>
    ctx.last_node = [s.parse(ctx) for s in self.sequence]
  File "deps/modules/tatsu/grammars.py", line 266, in parse
    return ctx._token(self.token)
  File "deps/modules/tatsu/contexts.py", line 610, in _token
    self._error(token, exclass=FailedToken)
  File "deps/modules/tatsu/contexts.py", line 435, in _error
    raise self._make_exception(item, exclass=exclass)
tatsu.exceptions.FailedToken: (1:1) expecting '#' :

^
start

This can be worked around by setting comments_re and eol_comments_re to /$/, but the default should really be None.
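The mechanism is easy to reproduce in isolation: with an EOL-comment pattern active, the comment skipper consumes the '#' before the token matcher ever sees it (a stdlib sketch of the interaction, not TatSu's actual code):

```python
import re

# A default-style '#' end-of-line comment pattern (illustrative).
EOL_COMMENT = re.compile(r'#[^\n]*')

def skip_comments(text, pos):
    """Advance past a comment at pos, the way a whitespace/comment
    skipper runs before each token match (illustrative only)."""
    m = EOL_COMMENT.match(text, pos)
    return m.end() if m else pos
```

On the input "#", the skipper moves the position past the whole input before the token '#' is tried, so the literal can never match; a /$/ pattern (or no pattern at all) leaves the position untouched.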

Cleanup recovery of grammar comments

# comments retain the # and are output as (* *) comments.

Prettification of comments is not well defined.

It's best to stick to Python-style (#) comments.

the standalone generic_main should support stdin

Once you generate a standalone parser script, its usage statement says that you must provide a filename.

[hariedo@hariedo14:~]$ tatsu -m Test t.ebnf >t.py
------------------------------------------------------------------------
          31  lines in grammar
           2  rules in grammar
           7  nodes in AST

[hariedo@hariedo14:~]$ python t.py
usage: t.py [-h] [-c] [-l] [-n] [-t] [-w WHITESPACE] FILE [STARTRULE]
t.py: error: too few arguments

[hariedo@hariedo14:~]$ python t.py -h
usage: t.py [-h] [-c] [-l] [-n] [-t] [-w WHITESPACE] FILE [STARTRULE]

Simple parser for Test.

positional arguments:
  FILE                  the input file to parse
  STARTRULE             the start rule for parsing

optional arguments:
  -h, --help            show this help message and exit
  -c, --color           use color in traces (requires the colorama library)
  -l, --list            list all rules and exit
  -n, --no-nameguard    disable the 'nameguard' feature
  -t, --trace           output trace information
  -w WHITESPACE, --whitespace WHITESPACE
                        whitespace specification

[hariedo@hariedo14:~]$ echo FOO | python t.py
usage: t.py [-h] [-c] [-l] [-n] [-t] [-w WHITESPACE] FILE [STARTRULE]
t.py: error: too few arguments

I suggest reading input data from stdin when no filename is given. Barring that, read from stdin when '-' (a single dash) is given as the input filename.
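A minimal sketch of the suggested behavior, built on argparse (the FILE argument mirrors the generated usage text; read_input is an illustrative name, not TatSu's actual generic_main):

```python
import argparse
import sys

def read_input(argv):
    """Parse an optional [FILE] argument and fall back to stdin
    when FILE is omitted or given as '-' (proposed behavior)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs='?', default='-', metavar='FILE')
    args = parser.parse_args(argv)
    if args.file == '-':
        return sys.stdin.read()
    with open(args.file) as f:
        return f.read()
```

With this change, `echo FOO | python t.py` would parse "FOO" instead of exiting with a usage error.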

Parser drops part of input

Grammar:

identifier = /\w+/ ;
start = '{' start '}' | start '->' identifier | identifier ;

Input:

{ size }
test

Expected output: ['{', 'size', '}']
Actual output: test

Removing the (left recursive) rule start '->' identifier fixes the problem.

Drop support for Python 2.7

End of life for Python 2.7 is 2020/01/01. Dropping support for 2.7 in TatSu would allow optimizations through new syntax, and exploration of parsing streams with the help of asyncio.

cc @Victorious3

Remove deprecated grammar syntax

Also cleanup the syntax, even at the expense of some incompatibilities with Grako.

  • Reverse the meaning of the s%{ e } and s.{ e } expressions.
  • Reverse the meaning of the op<{ e }+ and op>{ e }+ expressions.

Is this a bug in code generation with templates?

import tatsu
from pprint import pprint
import sys

from tatsu.codegen import ModelRenderer
from tatsu.codegen import CodeGenerator

grammar = """
@@grammar::test

start
    = stmtlist $
    ;

stmtlist :: Stmtlist
    = val:{stmt ';'} +
    ;

stmt :: Stmt
    = val: /\w+/
    ;
"""

prog = """
hello;
world;
nihao;
"""

THIS_MODULE = sys.modules[__name__]

class MyCodeGenerator(CodeGenerator):
    def __init__(self):
        super().__init__(modules=[THIS_MODULE])

class Stmtlist(ModelRenderer):
    template = '''
    gen stmtlist:
    {val:::}
    '''

class Stmt(ModelRenderer):
    template = '''
    gen stmt:
    {val}
    '''

parser = tatsu.compile(grammar, asmodel=True)
model = parser.parse(prog)

code = MyCodeGenerator().render(model)

print(code)

output

gen stmtlist:
{
  "__class__": "Stmt",
  "val": "hello"
};{
  "__class__": "Stmt",
  "val": "world"
};{
  "__class__": "Stmt",
  "val": "nihao"
};

I expect something like:

gen stmtlist:
gen stmt:
hello;
gen stmt:
world;
gen stmt:
nihao;

Associativity in calc example

NOTE: Originally posted against Grako by Austin Whittington

In the grammar that the calc examples use, order of operations is not always preserved. For example, given the expression 4 / 2 / 2, the examples return 4 / 2 / 2 = 4.0, versus the Python prompt, which returns

>>> 4 / 2 / 2
1.0

It looks like this is because the operators behave as right-associative instead of left-associative, probably a result of how the grammar was converted away from left recursion. Using calc/v4 as my guinea pig (because its output is the cleanest), I can change the division rule from

division
    =
    left:factor op:'/' ~ right:term
    ;

to

division
    =
    left:term op:'/' ~ right:factor
    ;

which fixes the associativity (using indirect left recursion), but doesn't solve the problem of showing how to create this type of grammar (with the correct associativity) in a PEG parser without relying on said left recursion.
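The standard PEG answer is to parse a flat repetition — something like division = first:term {'/' rest+:term}* — and restore left associativity by folding the list in a semantic action. The rule sketch is an assumption about the grammar; the fold itself is just:

```python
from functools import reduce

def fold_division(first, rest):
    """Left-fold the flat parse of term ('/' term)*.

    Folding from the left restores left associativity without any
    left recursion in the grammar (illustrative semantic action).
    """
    return reduce(lambda acc, term: acc / term, rest or [], first)

# 4 / 2 / 2 parses as first=4, rest=[2, 2] and folds to (4/2)/2
```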
