
lark's Introduction

Lark - a parsing toolkit for Python

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

Lark can parse all context-free languages. Put simply, it can parse almost any programming language out there, and to some degree most natural languages too.

Who is it for?

  • Beginners: Lark is very friendly for experimentation. It can parse any grammar you throw at it, no matter how complicated or ambiguous, and do so efficiently. It also constructs an annotated parse-tree for you, using only the grammar and an input, and it gives you convenient and flexible tools to process that parse-tree.

  • Experts: Lark implements both Earley(SPPF) and LALR(1), and several different lexers, so you can trade-off power and speed, according to your requirements. It also provides a variety of sophisticated features and utilities.

What can it do?

  • Parse all context-free grammars, and handle any ambiguity gracefully
  • Build an annotated parse-tree automagically, no construction code required.
  • Provide first-rate performance in terms of both Big-O complexity and measured run-time (considering that this is Python ;)
  • Run on every Python interpreter (it's pure-python)
  • Generate a stand-alone parser (for LALR(1) grammars)

And many more features. Read ahead and find out!

Most importantly, Lark will save you time and prevent you from getting parsing headaches.


Install Lark

$ pip install lark --upgrade

Lark has no dependencies.


Syntax Highlighting

Lark provides syntax highlighting for its grammar files (*.lark).

Clones

These are implementations of Lark in other languages. They accept Lark grammars, and provide similar utilities.

Hello World

Here is a little program to parse "Hello, World!" (Or any other similar phrase):

from lark import Lark

l = Lark('''start: WORD "," WORD "!"

            %import common.WORD   // imports from terminal library
            %ignore " "           // Disregard spaces in text
         ''')

print( l.parse("Hello, World!") )

And the output is:

Tree(start, [Token(WORD, 'Hello'), Token(WORD, 'World')])

Notice punctuation doesn't appear in the resulting tree. It's automatically filtered away by Lark.
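
Going one step further, the resulting tree can be processed with a Transformer; here is a minimal sketch (the Greeting class and its output format are illustrative, not part of the library):

from lark import Lark, Transformer

l = Lark('''start: WORD "," WORD "!"

            %import common.WORD   // imports from terminal library
            %ignore " "           // Disregard spaces in text
         ''')

class Greeting(Transformer):
    # Receives the two WORD tokens captured by the start rule.
    def start(self, children):
        greeting, name = children
        return f"{greeting} -> {name}"

print(Greeting().transform(l.parse("Hello, World!")))   # Hello -> World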

Fruit flies like bananas

Lark is great at handling ambiguity. Here is the result of parsing the phrase "fruit flies like bananas":

[image: fruitflies.png]

Read the code here, and see more examples here.
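
The repository's fruit-flies example isn't reproduced here, but as a rough illustration of explicit ambiguity, here is a minimal sketch (the grammar and input are illustrative, not the repository's example):

from lark import Lark

# A deliberately ambiguous grammar: without precedence rules, "1+2*3" has two parses,
# (1+2)*3 and 1+(2*3). With ambiguity='explicit', Earley keeps both instead of picking one.
parser = Lark('''
    start: expr
    expr: expr "+" expr
        | expr "*" expr
        | NUMBER
    %import common.NUMBER
''', parser='earley', ambiguity='explicit')

print(parser.parse("1+2*3").pretty())   # the printed tree contains an _ambig node with both derivations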

List of main features

  • Builds a parse-tree (AST) automagically, based on the structure of the grammar
  • Earley parser
    • Can parse all context-free grammars
    • Full support for ambiguous grammars
  • LALR(1) parser
    • Fast and light, competitive with PLY
    • Can generate a stand-alone parser (read more)
  • EBNF grammar
  • Unicode fully supported
  • Automatic line & column tracking
  • Interactive parser for advanced parsing flows and debugging
  • Grammar composition - Import terminals and rules from other grammars
  • Standard library of terminals (strings, numbers, names, etc.)
  • Import grammars from Nearley.js (read more)
  • Extensive test suite
  • Type annotations (MyPy support)
  • And much more!

See the full list of features here

Comparison to other libraries

Performance comparison

Lark is fast and light (lower is better)

Run-time Comparison

Memory Usage Comparison

Check out the JSON tutorial for more details on how the comparison was made.

For a more thorough and objective comparison, check out the Python Parsing Benchmarks repo.

Feature comparison

Library      | Algorithm      | Grammar     | Builds tree? | Supports ambiguity? | Can handle every CFG? | Line/Column tracking | Generates Stand-alone
Lark         | Earley/LALR(1) | EBNF        | Yes!         | Yes!                | Yes!                  | Yes!                 | Yes! (LALR only)
PLY          | LALR(1)        | BNF         | No           | No                  | No                    | No                   | No
PyParsing    | PEG            | Combinators | No           | No                  | No*                   | No                   | No
Parsley      | PEG            | EBNF        | No           | No                  | No*                   | No                   | No
Parsimonious | PEG            | EBNF        | Yes          | No                  | No*                   | No                   | No
ANTLR        | LL(*)          | EBNF        | Yes          | No                  | Yes?                  | Yes                  | No

(* PEGs cannot handle non-deterministic grammars. Also, according to Wikipedia, it remains unanswered whether PEGs can really parse all deterministic CFGs)

Projects using Lark

  • Poetry - A utility for dependency management and packaging
  • Vyper - Pythonic Smart Contract Language for the EVM
  • PyQuil - Python library for quantum programming using Quil
  • Preql - An interpreted relational query language that compiles to SQL
  • Hypothesis - Library for property-based testing
  • mappyfile - a MapFile parser for working with MapServer configuration
  • tartiflette - GraphQL server by Dailymotion
  • synapse - an intelligence analysis platform
  • Datacube-core - Open Data Cube analyses continental scale Earth Observation data through time
  • SPFlow - Library for Sum-Product Networks
  • Torchani - Accurate Neural Network Potential on PyTorch
  • Command-Block-Assembly - An assembly language, and C compiler, for Minecraft commands
  • EQL - Event Query Language
  • Fabric-SDK-Py - Hyperledger fabric SDK with Python 3.x
  • required - multi-field validation using docstrings
  • miniwdl - A static analysis toolkit for the Workflow Description Language
  • pytreeview - a lightweight tree-based grammar explorer
  • harmalysis - A language for harmonic analysis and music theory
  • gersemi - A CMake code formatter
  • MistQL - A query language for JSON-like structures
  • Outlines - Structured generation with Large Language Models

Full list

License

Lark uses the MIT license.

(The standalone tool is under MPL2)

Contributors

Lark accepts pull-requests. See How to develop Lark

Big thanks to everyone who contributed so far:

Sponsor

If you like Lark, and want to see us grow, please consider sponsoring us!

Contact the author

Questions about code are best asked on gitter or in the issues.

For anything else, I can be reached by email at erezshin at gmail com.

-- Erez

lark's People

Contributors

chanicpanic, chsasank, decorator-factory, erezsh, evandrocoan, evilnose, evtn, henryiii, hf-kklein, jmishra01, julienmalard, kasbah, kevinlatimer, klauer, kmolyuan, ldbo, megaing, michael-k, orcharddweller, ornariece, pjcampi, plannigan, raekye, robroseknows, rogdham, starwarswii, suprasummus, tg-techie, thatxliner, wiene


lark's Issues

Document things like '%import' and '%ignore'

It doesn't appear that syntax features like '%import' and '%ignore' are documented anywhere. This, plus the fact that there is no Python API reference, makes using lark more difficult than it should be.

Collision in LALR grammars

I may very well be a bit rusty on LALR grammars, but I can't find what I'm doing wrong with that one:

_expr          : or_test

?or_test       : and_test
               | or_test  "or"  and_test    -> binary_expr
?and_test      : _primary
               | and_test "and" _primary    -> binary_expr

_primary       : call_expr
               | ident
               | "(" _expr ")"

call_expr      : _expr "(" _expr ")"

ident          : NAME
NAME           : /[^\W\d][\w]*/
NUMBER         : /(0|([1-9][0-9]*))(\.[0-9]+)?([Ee][+-]?[0-9]+)?/
STRING         : /'[^']*'/

%ignore /[\t \f]+/

(This is a simplified version of my complete grammar, which can be seen here: light.g).

I initialize my parser like that:

parser = Lark(f, parser='lalr', start='_expr')

This grammar gets me a lark.common.GrammarError: Collision .... The remainder of the error message doesn't seem to be deterministic, as I always get a different one. It seems like the recursion between the _expr rule and call_expr can't be handled. It looks pretty much like a left-recursion issue, which is quite surprising for an LALR parser.

Everything works fine with the Earley parser.

Is there a Java grammar for Lark

Would like to parse Java. Right now, Im translating a ANTLR Java grammar to the Lark format, but do you know if there is one available already? Thanks!

Parse tree for ambiguous grammar contains `drv` subtree?

Hi -- I recently updated the version of Lark I'm using and am trying to debug the following program (which contains a deliberately ambiguous grammar):

from lark import Lark

grammar = '''
stmt : ifthenelse
     | ifthen
     | assign

ifthen : "if" "cond" "then" stmt

ifthenelse : "if" "cond" "then" stmt "else" stmt

assign : "a:=1"

%import common.WS
%ignore WS
'''

foo = 'if cond then if cond then a:=1 else a:=1'

parser = Lark(grammar, start='stmt', ambiguity='explicit')
print(parser.parse(foo))

When I was using lark 0.2.7, I got the following output.

Once I upgraded to any version of lark 0.3.0 or above, I instead get the following output, which I've pretty-printed for readability:

Tree(_ambig, [
    Tree(stmt, [
        Tree(ifthenelse, [
            Tree(stmt, [
                Tree(ifthen, [
                    Tree(stmt, [Tree(drv, [Token(__ANONSTR_4, 'a:=1')])])])]), 
            Tree(stmt, [Tree(drv, [Token(__ANONSTR_4, 'a:=1')])])])]), 
    Tree(stmt, [
        Tree(ifthen, [
            Tree(stmt, [
                Tree(ifthenelse, [
                    Tree(stmt, [Tree(assign, [])]), 
                    Tree(stmt, [Tree(assign, [])])])])])])])

While I appreciate that the output is significantly shorter (it certainly does make debugging easier!), I'm also a little baffled as to why my parse tree contains the subtree Tree(drv, [Token(__ANONSTR_4, 'a:=1')]) instead of Tree(assign, []), which is what I was expecting.

Is this behavior intentional or a bug? (The documentation didn't seem to mention anything about drv subtrees). Either way, is there any way I can get lark to return Tree(assign, []) instead of the drv thing?

Thanks!

(For context, I'm updating some code that pretty-prints ambiguous trees for debugging purposes and ran into this case. I'm sure I could modify my code to work around this issue if necessary, but figured I might as well ask first.)

parsing time seems to increase dramatically with depth of parse tree

I am trying to parse Fortran95 programs with lark. The grammar is medium size but mostly modeled after the Fortran 95 specification (using the same names for non-terminal symbols). I use a separate lexer to deal with all the weird stuff like Hollerith constants. The programs I want to parse are considerably larger than the test program below.

The sample fortran program

PROGRAM TEST
do i=1,10
if (i<5) write(*,*) i;
write(*,*) i*i;
end do
END

is converted to the token list C below. To compare with slightly simpler programs, I also made one variant with the second write statement removed (token list A below) and one with the first write statement removed (token list B below).

A
'PROGRAM NAME EOS EOS DO NAME EQUAL NATNUMBER COMMA NATNUMBER EOS IF LPAREN NAME LT NATNUMBER RPAREN WRITE LPAREN TIMES COMMA TIMES RPAREN NAME EOS EOS ENDDO EOS END EOS'
B
'PROGRAM NAME EOS EOS DO NAME EQUAL NATNUMBER COMMA NATNUMBER EOS IF LPAREN NAME LT NATNUMBER RPAREN WRITE LPAREN TIMES COMMA TIMES RPAREN NAME TIMES NAME EOS EOS ENDDO EOS END EOS'
C
'PROGRAM NAME EOS EOS DO NAME EQUAL NATNUMBER COMMA NATNUMBER EOS IF LPAREN NAME LT NATNUMBER RPAREN WRITE LPAREN TIMES COMMA TIMES RPAREN NAME EOS EOS WRITE LPAREN TIMES COMMA TIMES RPAREN NAME TIMES NAME EOS EOS ENDDO EOS END EOS'

To give an idea of the resulting parse trees I include the complete output for C below. Parse times are

start=dt.now(); L.parse(A); print(dt.now()-start)
...
0:00:01.807732

start=dt.now(); L.parse(B); print(dt.now()-start)
...
0:00:17.732535

start=dt.now(); L.parse(C); print(dt.now()-start)
...
0:13:35.003853

EDIT START:
It seems that the size of the parse trees may be to blame.
If ambiguity='resolve' is used, then the respective sizes (len(str(...))) of the results are
A => 2728 B => 2978 C => 3528
while ambiguity='explicit' results in
A = 2582677 B=16834357 C=252801325

So it may be that a more space efficient method to store ambiguous (intermediate) parsing results is needed.
EDIT END.

Parse tree for C:
Tree(executable_program, [Tree(program_unit, [Tree(main_program, [Tree(program_stmt, [Token(PROGRAM, 'PROGRAM'), Tree(program_name, [Token(NAME, 'NAME')]), Tree(eos, [Token(EOS, 'EOS'), Token(EOS, 'EOS')])]), Tree(execution_part, [Tree(executable_construct, [Tree(block_do_construct, [Tree(do_stmt, [Tree(nonlabel_do_stmt, [Token(DO, 'DO'), Tree(loop_control, [Tree(do_variable, [Tree(array_section, [Tree(data_ref, [Tree(part_ref, [Tree(part_name, [Token(NAME, 'NAME')])])])])]), Token(EQUAL, 'EQUAL'), Tree(scalar_numeric_expr, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NATNUMBER, 'NATNUMBER')])])])])])])])])])])])])])]), Token(COMMA, 'COMMA'), Tree(scalar_numeric_expr, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NATNUMBER, 'NATNUMBER')])])])])])])])])])])])])])])]), Tree(eos, [Token(EOS, 'EOS')])])]), Tree(do_block, [Tree(block, [Tree(execution_part_construct, [Tree(if_stmt, [Token(IF, 'IF'), Token(LPAREN, 'LPAREN'), Tree(scalar_logical_expr, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(variable, [Tree(array_section, [Tree(data_ref, [Tree(part_ref, [Tree(part_name, [Token(NAME, 'NAME')])])])])])])])])])])]), Tree(rel_op, [Token(LT, 'LT')]), Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NATNUMBER, 'NATNUMBER')])])])])])])])])])])])])])]), Token(RPAREN, 'RPAREN'), Tree(action_stmt, [Tree(write_stmt, [Token(WRITE, 'WRITE'), Token(LPAREN, 'LPAREN'), Tree(io_control_spec_list, [Tree(io_control_spec, [Tree(format, [Token(TIMES, 'TIMES')])]), Token(COMMA, 'COMMA'), Tree(io_control_spec, [Tree(format, [Token(TIMES, 'TIMES')])])]), Token(RPAREN, 'RPAREN'), Tree(output_item_list, [Tree(output_item, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(variable, [Token(NAME, 'NAME')])])])])])])])])])])])])])])]), Tree(eos, [Token(EOS, 'EOS'), Token(EOS, 'EOS')])])])])]), Tree(execution_part_construct, [Tree(write_stmt, [Token(WRITE, 'WRITE'), Token(LPAREN, 'LPAREN'), Tree(io_control_spec_list, [Tree(io_control_spec, [Tree(format, [Token(TIMES, 'TIMES')])]), Token(COMMA, 'COMMA'), Tree(io_control_spec, [Tree(io_unit, [Token(TIMES, 'TIMES')])])]), Token(RPAREN, 'RPAREN'), Tree(output_item_list, [Tree(output_item, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NAME, 'NAME')])])])])]), Tree(mult_op, [Token(TIMES, 'TIMES')]), Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NAME, 'NAME')])])])])])])])])])])])])])])]), Tree(eos, [Token(EOS, 'EOS'), Token(EOS, 'EOS')])])])])]), Tree(end_do, [Tree(end_do_stmt, [Token(ENDDO, 
'ENDDO'), Tree(eos, [Token(EOS, 'EOS')])])])])])]), Tree(end_program_stmt, [Token(END, 'END'), Tree(eos, [Token(EOS, 'EOS')])])])])])

Bug in handling ambiguity?

When running this code:

grammar = """
expression: "c" | "d" | "c" "d"
unit: expression "a"
    | "a" expression
    | "b" unit
    | "b" expression
start: unit*

%import common.WS
%ignore WS
"""

l = Lark(grammar, parser='earley', ambiguity='explicit')
print(l.parse('b c d a a c').pretty())

It is expected to have an ambiguous parse, but there is no '_ambig' node.

At least these options are valid:

unit(
    b
    unit(
        expression(
            c
            d
        )
        a
    )
)
unit(
    a
    expression(
        c
    )
)

and this parse:

unit(
    b
    expression(
        c
    )
)
unit(
    expression(
        d
    )
    a
)
unit(
    a
    expression(
        c
    )
)

The only parse that comes back is the second one.
When one removes the "b" expression option, you get the first one.

Naming rule parts

I wonder if you thought of adding support to naming parts of the rules, such as:

ifstmt: "if" "(" expr: condition ")" "{" (stmt*): body "}"

(syntax could be changed of course)

This would be helpful because, as far as I know, the only way to refer to children of the tree is by tree.children[0] etc. - this feature would allow something akin to tree.condition or tree["condition"].
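
One workaround with the existing API is to give the children readable names inside a transformer; a minimal sketch (the grammar, class and parameter names below are illustrative — the naming happens in the transformer, not in the grammar):

from lark import Lark, Transformer, v_args

parser = Lark('''
    start: ifstmt
    ifstmt: "if" "(" NAME ")" "{" NAME* "}"
    %import common.CNAME -> NAME
    %import common.WS
    %ignore WS
''')

@v_args(inline=True)                      # pass each child as a positional argument
class IfNames(Transformer):
    def ifstmt(self, condition, *body):   # parameter names chosen here stand in for named rule parts
        return {'condition': str(condition), 'body': [str(s) for s in body]}

tree = parser.parse("if (x) { a b }")
print(IfNames().transform(tree).children[0])   # {'condition': 'x', 'body': ['a', 'b']}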

Packaging 0.5.0 test failed : AttributeError: 'module' object has no attribute 'test_parser'

During the refresh of package for openSUSE I got errors during the %check pass

[   22s] + mv _build.python2 build
[   22s] + echo python2
[   22s] + /usr/bin/python2 setup.py test
[   22s] running test
[   22s] running egg_info
[   22s] writing lark_parser.egg-info/PKG-INFO
[   22s] writing top-level names to lark_parser.egg-info/top_level.txt
[   22s] writing dependency_links to lark_parser.egg-info/dependency_links.txt
[   22s] package init file 'lark/grammars/__init__.py' not found (or not a regular file)
[   22s] reading manifest file 'lark_parser.egg-info/SOURCES.txt'
[   22s] reading manifest template 'MANIFEST.in'
[   22s] writing manifest file 'lark_parser.egg-info/SOURCES.txt'
[   22s] running build_ext
[   22s] Traceback (most recent call last):
[   22s]   File "setup.py", line 53, in <module>
[   22s]     "License :: OSI Approved :: MIT License",
[   22s]   File "/usr/lib/python2.7/site-packages/setuptools/__init__.py", line 129, in setup
[   22s]     return distutils.core.setup(**attrs)
[   22s]   File "/usr/lib64/python2.7/distutils/core.py", line 151, in setup
[   22s]     dist.run_commands()
[   22s]   File "/usr/lib64/python2.7/distutils/dist.py", line 953, in run_commands
[   22s]     self.run_command(cmd)
[   22s]   File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
[   22s]     cmd_obj.run()
[   22s]   File "/usr/lib/python2.7/site-packages/setuptools/command/test.py", line 226, in run
[   22s]     self.run_tests()
[   22s]   File "/usr/lib/python2.7/site-packages/setuptools/command/test.py", line 248, in run_tests
[   22s]     exit=False,
[   22s]   File "/usr/lib64/python2.7/unittest/main.py", line 94, in __init__
[   22s]     self.parseArgs(argv)
[   22s]   File "/usr/lib64/python2.7/unittest/main.py", line 113, in parseArgs
[   22s]     self._do_discovery(argv[2:])
[   22s]   File "/usr/lib64/python2.7/unittest/main.py", line 214, in _do_discovery
[   22s]     self.test = loader.discover(start_dir, pattern, top_level_dir)
[   22s]   File "/usr/lib64/python2.7/unittest/loader.py", line 206, in discover
[   22s]     tests = list(self._find_tests(start_dir, pattern))
[   22s]   File "/usr/lib64/python2.7/unittest/loader.py", line 287, in _find_tests
[   22s]     for test in self._find_tests(full_path, pattern):
[   22s]   File "/usr/lib64/python2.7/unittest/loader.py", line 268, in _find_tests
[   22s]     yield self.loadTestsFromModule(module)
[   22s]   File "/usr/lib/python2.7/site-packages/setuptools/command/test.py", line 52, in loadTestsFromModule
[   22s]     tests.append(self.loadTestsFromName(submodule))
[   22s]   File "/usr/lib64/python2.7/unittest/loader.py", line 100, in loadTestsFromName
[   22s]     parent, obj = obj, getattr(obj, part)
[   22s] AttributeError: 'module' object has no attribute 'test_parser'
[   22s] error: Bad exit status from /var/tmp/rpm-tmp.knFpkP (%check)

Full build log is available here
Is there anything I've missed?

Grammar sends Earley implementation to an infinite loop

Earley implementation never terminates when running with the input "a", and the following grammar:

start: a
a: a | "a"

Of course the grammar is badly formed, but since we aspire to parse "any grammar", an error would be a more appropriate behavior.

Question (possibly bug?)

Hello Erez,

I'm having trouble coming up with the correct rule when parsing the sudo grammar. Here's my grammar:

      ?sudo_item : (alias | user_spec)*

      alias : "User_Alias"    user_alias  (":" user_alias)*
              | "Runas_Alias" runas_alias (":" runas_alias)*
              | "Host_Alias"  host_alias  (":" host_alias)*
              | "Cmnd_Alias"  cmnd_alias  (":" cmnd_alias)*

      user_alias  : ALIAS_NAME "=" user_list
      host_alias  : ALIAS_NAME "=" host_list
      runas_alias : ALIAS_NAME "=" runas_list
      cmnd_alias  : ALIAS_NAME "=" cmnd_list

      user_spec      : user_list host_list "=" cmnd_spec_list (":" host_list "=" cmnd_spec_list)*
      cmnd_spec_list : cmnd_spec ("," cmnd_spec)*
      cmnd_spec      : runas_spec? tag_spec* command
      runas_spec     : "(" runas_list? (":" runas_list)? ")"
      command        : "!"* command_name | "!"* cmnd_alias
      command_name   : COMMAND
      file_name      : /[-_.a-z0-9A-Z\/]+/

      host_list  : host      ("," host)*
      user_list  : user      ("," user)*
      runas_list : user      ("," user)*
      cmnd_list  : command   ("," command)*

      host : HOST_NAME

      user :    "!"*       USER_NAME
              | "!"* "%"   GROUP_NAME
              | "!"* "#"   UID
              | "!"* "%#"  GID
              | "!"* "+"   NETGROUP_NAME
              | "!"* "%:"  NONUNIX_GROUP_NAME
              | "!"* "%:#" NONUNIX_GID
              | "!"* user_alias

      tag_spec : (tag_nopwd
                  | tag_pwd
                  | tag_noexec
                  | tag_exec
                  | tag_setenv
                  | tag_nosetenv
                  | tag_log_output
                  | tag_nolog_output) ":"

      tag_pwd          : "PASSWD"
      tag_nopwd        : "NOPASSWD"

      tag_exec         : "EXEC"
      tag_noexec       : "NOEXEC"
      tag_setenv       : "SETENV"
      tag_nosetenv     : "NOSETENV"
      tag_log_output   : "LOG_OUTPUT"
      tag_nolog_output : "NOLOG_OUTPUT"

      UID                : /[0-9]+/
      GID                : /[0-9]+/
      NONUNIX_GID        : /[0-9]+/
      USER_NAME          : /[-_.a-z0-9A-Z]+/
      GROUP_NAME         : CNAME
      NETGROUP_NAME      : CNAME
      NONUNIX_GROUP_NAME : CNAME
      ALIAS_NAME         : CNAME
      HOST_NAME          : /[-_.a-z0-9A-Z\[\]*]+/
      COMMAND            : /[^,:\n]+/

      %import common.CNAME
      %import common.WS
      %ignore /[\\\\]/
      %ignore WS

A sudo rule can look like this:

DBA ALL = (oracle) ALL, !SU

or like this:

DBA ALL = (oracle) ALL, !SU : ALL = (postgres) ALL, !SU

The grammar handles the first case fine. It has trouble parsing the second variant. It boils down to the COMMAND token, which is anything following the runas bit ((postgres), (oracle), etc) and should include anything but [:,\n]. That regex doesn't seem to work.

Also, sudo lines may have the \ line continuation at the end. Is %ignore /[\\\\]/ the correct way to handle that situation? It does seem to operate correctly.

Last question, when i specify parser='lalr', I get this:

MY INPUT: DBA ALL = (oracle) ALL
ERROR:

lark.common.UnexpectedToken: Unexpected token Token(COMMAND, ' ALL = (oracle) ALL') at line 1, column 3.

Does that mean my grammar is not LALR compatible?

Ignoring comments start at the beginning of the line

Hi,

First, thank you for all the work!

Here's my grammar:

      sudo_item : (alias | user_spec)*

      alias : "User_Alias"    user_alias  (":" user_alias)*
              | "Runas_Alias" runas_alias (":" runas_alias)*
              | "Host_Alias"  host_alias  (":" host_alias)*
              | "Cmnd_Alias"  cmnd_alias  (":" cmnd_alias)*

      user_alias  : ALIAS_NAME "=" user_list
      host_alias  : ALIAS_NAME "=" host_list
      runas_alias : ALIAS_NAME "=" runas_list
      cmnd_alias  : ALIAS_NAME "=" cmnd_list

      user_spec      : user_list host_list "=" cmnd_spec_list (":" host_list "=" cmnd_spec_list)*
      cmnd_spec_list : cmnd_spec | cmnd_spec "," cmnd_spec_list
      cmnd_spec      : runas_spec? tag_spec* COMMAND
      runas_spec     : "(" runas_list? (":" runas_list)? ")"

      host_list  : HOST_NAME ("," HOST_NAME)*
      user_list  : user      ("," user)*
      runas_list : USER_NAME ("," USER_NAME)*
      cmnd_list  : COMMAND   ("," COMMAND)*

      user :    "!"*                     USER_NAME
              | "!"* PERCENT             GROUP_NAME
              | "!"* HASH                UID
              | "!"* PERCENT_HASH        GID
              | "!"* PLUS                NETGROUP_NAME
              | "!"* PERCENT_COLON       NONUNIX_GROUP_NAME
              | "!"* PERCENT_COLON_HASH  NONUNIX_GID
              | "!"* user_alias

      tag_spec : (tag_nopwd
                  | tag_pwd
                  | tag_noexec
                  | tag_exec
                  | tag_setenv
                  | tag_nosetenv
                  | tag_log_output
                  | tag_nolog_output) ":"

      tag_pwd          : "PASSWD"
      tag_nopwd        : "NOPASSWD"
      tag_exec         : "EXEC"
      tag_noexec       : "NOEXEC"
      tag_setenv       : "SETENV"
      tag_nosetenv     : "NOSETENV"
      tag_log_output   : "LOG_OUTPUT"
      tag_nolog_output : "NOLOG_OUTPUT"

      PERCENT            : "%"
      HASH               : "#"
      PERCENT_HASH       : "%#"
      PLUS               : "+"
      PERCENT_COLON      : "%:"
      PERCENT_COLON_HASH : "%:#"
      UID                : /[0-9]+/
      GID                : /[0-9]+/
      NONUNIX_GID        : /[0-9]+/
      USER_NAME          : /[-_.a-z0-9A-Z]+/
      GROUP_NAME         : CNAME
      NETGROUP_NAME      : CNAME
      NONUNIX_GROUP_NAME : CNAME

      %import common.CNAME
      %import common.WS
      %ignore "\\\\"
      %ignore WS
      %ignore /# comment/

Here's the input:

sudocfg = """
# comment
"""

The output is:

Traceback (most recent call last):
  File "./parse_sudoers.py", line 86, in <module>
    tree = sudo_parser.parse(sudocfg)
  File "/home/sgerasenko/proj/sudoers/lib/python2.7/site-packages/lark_parser-0.4.1-py2.7.egg/lark/lark.py", line 193, in parse
    return self.parser.parse(text)
  File "/home/sgerasenko/proj/sudoers/lib/python2.7/site-packages/lark_parser-0.4.1-py2.7.egg/lark/parser_frontends.py", line 144, in parse
    return self.parser.parse(text)
  File "/home/sgerasenko/proj/sudoers/lib/python2.7/site-packages/lark_parser-0.4.1-py2.7.egg/lark/parsers/xearley.py", line 128, in parse
    raise ParseError('Incomplete parse: Could not find a solution to input')
lark.common.ParseError: Incomplete parse: Could not find a solution to input

Ultimately my goal is to ignore everything from the beginning of the line to the end of the line though. I've tried /^#[^\\n]+/, but that didn't work either. I'm also not sure if the regex match is re.MULTILINE or not.
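
For reference, a minimal sketch of ignoring whole-line comments in a generic grammar (not the sudoers grammar above): the %ignore regex doesn't need ^ or $ anchors, it only has to match from the # to the end of the line.

from lark import Lark

parser = Lark(r'''
    start: NAME*
    NAME: /[a-z]+/
    COMMENT: /#[^\n]*/     // everything from '#' up to (but not including) the newline
    %ignore COMMENT
    %import common.WS
    %ignore WS
''')

print(parser.parse("foo\n# a comment\nbar\n"))   # the comment line is skipped entirely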

Support tokens amount in range

Hi,
Would a code that adds support for the following:

start -> some_expression{1,3}

Be merged?

(The meaning here is that we are willing to parse 1-3 consecutive occurrences of some_expression)
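
For reference, later Lark releases added a repetition-range operator; a minimal sketch, assuming a reasonably recent version:

from lark import Lark

# "item ~ n..m" parses between n and m consecutive occurrences of item
# ("item ~ n" means exactly n).
parser = Lark('''
    start: WORD ~ 1..3
    %import common.WORD
    %ignore " "
''')

print(parser.parse("one two").pretty())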

Nearley macros not supported

I had a go at converting my nearley grammar but I encountered the error:

File "/home/kaspar/.local/lib/python3.5/site-packages/lark/tools/nearley.py", line 116, in _nearley_to_lark
    assert False, directive
AssertionError: include

So I am guessing @include is not supported yet. I also noticed the TODO for macros. Is my assessment correct that includes and macros are not yet supported?

Assert fails on earley parser

Hi,
Should the assert at line 221 of earley.py, in _compare_rules (assert rule1.origin == rule2.origin), ever fail?

Because it does :)

Issues with parser translated from Nearley

Hey I managed to re-work and translate my nearley grammar. It works but I noticed some issues (you can compare to the JS demo here).

First, I get a Token when I am expecting a string:

import electro_grammar
electro_grammar.parse('100uF 0805')
Out[11]: [{'type': 'capacitor'}, {'capacitance': 0.0001}, None, {'size': Token(__ANONSTR_103, '0805')}, None]

Second, it goes for an infinite loop when I add certain descriptions:

electro_grammar.parse('100uF 0805 5%')
...
electro_grammar.parse('100uF 0805 x7r')
...

When I Ctrl-C I noticed in both stack traces:

earley.py in add(self, items)
    109             if item.is_complete:
    110                 # XXX Potential bug: What happens if there's ambiguity in an empty rule?
--> 111                 if item.rule.expansion and item in self.completed:
    112                     old_tree = self.completed[item].tree
    113                     if old_tree.data != '_ambig':
Full stack trace
In [15]: electro_grammar.parse('100uF 0805 x7r')
^C---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-15-8b9cfb50244a> in <module>()
----> 1 electro_grammar.parse('100uF 0805 x7r')

~/projects/electro-grammar/electro-grammar/python/electro_grammar.py in parse(text)
    683 parser = Lark(grammar, start="n_main")
    684 def parse(text):
--> 685     return TranformNearley().transform(parser.parse(text))
    686 

~/projects/electro-grammar/electro-grammar/python/venv/src/lark-parser/lark/lark.py in parse(self, text)
    191 
    192     def parse(self, text):
--> 193         return self.parser.parse(text)
    194 
    195         # if self.profiler:

~/projects/electro-grammar/electro-grammar/python/venv/src/lark-parser/lark/parser_frontends.py in parse(self, text)
    142 
    143     def parse(self, text):
--> 144         return self.parser.parse(text)
    145 
    146 def get_frontend(parser, lexer):

~/projects/electro-grammar/electro-grammar/python/venv/src/lark-parser/lark/parsers/xearley.py in parse(self, stream, start_symbol)
    119 
    120 
--> 121         predict_and_complete(column)
    122 
    123         # Parse ended. Now build a parse tree

~/projects/electro-grammar/electro-grammar/python/venv/src/lark-parser/lark/parsers/xearley.py in predict_and_complete(column)
     75                         if new_item.similar(item):
     76                             raise ParseError('Infinite recursion detected! (rule %s)' % new_item.rule)
---> 77                     column.add(new_items)
     78 
     79         def scan(i, token, column):

~/projects/electro-grammar/electro-grammar/python/venv/src/lark-parser/lark/parsers/earley.py in add(self, items)
    109             if item.is_complete:
    110                 # XXX Potential bug: What happens if there's ambiguity in an empty rule?
--> 111                 if item.rule.expansion and item in self.completed:
    112                     old_tree = self.completed[item].tree
    113                     if old_tree.data != '_ambig':

~/projects/electro-grammar/electro-grammar/python/venv/src/lark-parser/lark/parsers/earley.py in __eq__(self, other)
     58 
     59     def __eq__(self, other):
---> 60         return self.similar(other) and (self.tree is other.tree or self.tree == other.tree)
     61 
     62     def __hash__(self):

~/projects/electro-grammar/electro-grammar/python/venv/src/lark-parser/lark/tree.py in __eq__(self, other)
     42     def __eq__(self, other):
     43         try:
---> 44             return self.data == other.data and self.children == other.children
     45         except AttributeError:
     46             return False

~/projects/electro-grammar/electro-grammar/python/venv/src/lark-parser/lark/tree.py in __eq__(self, other)
     42     def __eq__(self, other):
     43         try:
---> 44             return self.data == other.data and self.children == other.children
     45         except AttributeError:
     46             return False

~/projects/electro-grammar/electro-grammar/python/venv/src/lark-parser/lark/tree.py in __eq__(self, other)
     40             self.children[i:i+1] = kid.children
     41 
---> 42     def __eq__(self, other):
     43         try:
     44             return self.data == other.data and self.children == other.children

KeyboardInterrupt: 


I realize it is likely due to ambiguity in my grammar and I am working to reduce it (but not sure it can be totally unambiguous). Any hints on how best to resolve would be very much appreciated.

TODO:

  • See why Tokens appear from parsing string literals
  • Investigate if instantiating with parser='earley', lexer=None, earley__all_derivations=False is closer to Nearley's implementation
  • Investigate why this edit is needed for consistent results between Nearley and Lark.

correct (as far as I can tell) but large grammar triggers exception in version 0.4.1

At the end of the file parse_tree_builder.py in the class ParseTreeBuilder at line 150-151 you use:

if hasattr(callback, callback_name):
    raise GrammarError("Rule expansion '%s' already exists in rule %s" % (' '.join(expansion), origin))

The grammar I use triggers this exception. Since I couldn't find any concrete reason why the callback name needs to be constructed according to this rule I changed these two lines to

from random import randint
while hasattr(callback, callback_name):
    callback_name = 'autoalias_%s_%s_%d' % (origin, ''.join(expansion), randint(0, 999))

as a temporary solution which avoided the exception and got my test cases to run.
Are there any problems with this change (besides a certain ugliness)?

Official support (and tests) for Python grammar?

Hi,

I'm currently writing a Python transpiler and looking for a stable Python library that can parse Python 3 code, like Lark.

I noticed that you have the grammars under the examples directory. My questions are:

  1. Are these (particularly the Python 3 one) ever going to be updated? Can I rely on this being current?
  2. I don't see any tests that parse Python code and at least validate if a tree can be generated. Are you planning to add any?

If the answer to question one is "yes," I can probably help with test coverage for parsing Python code.

But, if this is just an example and not something you plan to maintain, I will probably have to look at other libraries (sadly).

Transformer isn't transforming output

Hi,

I have a modified version of your Python 3 example (Python 3 grammar, python_parser starter with an additional transformer) here.

It seems like the transformer, although it correctly visits the nodes and prints something out, doesn't actually replace the value in the tree with the value I provide. (Or maybe it does, and the way I'm printing it out is causing problems.)

Can you please take a look and let me know where I'm going wrong?

Parser code:

        try:
            full_path = os.path.join(path, f)
            try:
                xrange
            except NameError:
                tree = python_parser3.parse(_read(full_path) + '\n')
                raw_tree = python_parser3.parse(_read(full_path) + '\n')
            else:
                tree = python_parser2.parse(_read(full_path) + '\n')
                raw_tree = python_parser2.parse(_read(full_path) + '\n')
            
            HaxeTransformer().transform(tree)
            convert_and_print(tree, full_path)    

Transformer code:

from lark import Transformer

class HaxeTransformer(Transformer):

    _LAST_NODE = None
    MARKER = "!!!"

    def import_stmt(self, node):
        HaxeTransformer._LAST_NODE = node

        output = "from "
        
        node = node[0].children # import_stmt => import_from
        package_name = node[0].children
        for child_node in package_name:
            output = "{}{}.".format(output, child_node.value)
        output = output[:-1] # trim last dot

        class_name = node[1].children[0].children[0].value
        output = "{} import {}".format(output, class_name)

        print("{} => {}".format(output, node))
        return output

    # def file_input(self, node):
    #     return HaxeTransformer.MARKER

    # def compound_stmt(self, node):
    #     return HaxeTransformer.MARKER

    # def funcdef(self, node):
    #     pass#return HaxeTransformer.MARKER

    # def funccall(self, node):
    #     return HaxeTransformer.MARKER
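
One likely explanation, assuming standard Transformer semantics: transform() builds and returns a new tree and leaves its input untouched, so the return value has to be captured (e.g. tree = HaxeTransformer().transform(tree) in the parser code above). A self-contained sketch of the behaviour:

from lark import Lark, Transformer

parser = Lark('''
    start: WORD
    %import common.WORD
''')

class Upper(Transformer):
    def start(self, children):
        return children[0].upper()

tree = parser.parse("hello")
Upper().transform(tree)          # return value discarded: 'tree' itself is unchanged
print(tree)                      # still the original, untransformed tree
print(Upper().transform(tree))   # HELLO -- the transformed result must be captured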

Repeating a formula

I'm building a dice parser. The lark ebnf looks something like this:

?expr : smath

?smath : smath "+" smath        -> add
       | smath "-" smath        -> sub
       | pmath

?pmath : pmath "*" pmath        -> mul
       | pmath "/" pmath        -> div
       | sum
       | number

?sum  : filter                  -> sum

?filter : filter "M" number     -> max
        | filter "m" number     -> min
        | filter "<" number     -> lt
        | filter "<=" number    -> le
        | filter ">" number     -> gt
        | filter ">=" number    -> ge
        | filter "!=" number    -> ne
        | filter "==" number    -> eq
        | dice

?dice : number "d" percentage   -> dice
      | "d" percentage          -> dice


?percentage : "%"               -> perc
            | number

number : NUMBER

%import common.NUMBER
%import common.WS
%ignore WS

This will parse things like:

 4d6M3  -- roll a 6 sided die 4 times and pick the top 3
 6d6<6>1 -- roll a 6 sided die 6 times and keep all the numbers between 2 and 5

What I'd like to do is:

6x4d6M3 -- roll the 6 sided die 4 times and pick the top 3 ... repeat 6 times

or

4d6M3x6 -- roll the 6 sided die 4 times and pick the top 3 ... repeat 6 times

Is there a mechanism to control something like that? More or less, it seems like I'd need to hold off on reducing the formula immediately, so I can first determine how many times to run it and then evaluate it.
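
One possible approach, sketched with an illustrative mini-grammar (not the full dice grammar above): parse the repeat count into its own rule, and evaluate the tree top-down with an Interpreter, so the dice sub-tree is only rolled once the repetition count is known.

import random
from lark import Lark
from lark.visitors import Interpreter

parser = Lark('''
    start: NUMBER "x" dice   -> repeat
         | dice              -> single
    dice: NUMBER "d" NUMBER
    %import common.NUMBER
''')

class Roller(Interpreter):
    # Interpreter visits top-down and does not descend automatically,
    # so 'repeat' decides how many times the dice sub-tree gets evaluated.
    def dice(self, tree):
        count, sides = (int(t) for t in tree.children)
        return sum(random.randint(1, sides) for _ in range(count))

    def repeat(self, tree):
        times, dice_tree = tree.children
        return [self.visit(dice_tree) for _ in range(int(times))]

    def single(self, tree):
        return self.visit(tree.children[0])

print(Roller().visit(parser.parse("6x4d6")))   # six independent rolls of 4d6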

Missing tag & release in github versus pypi

As an enhancement, it would be super cool to have tagged releases here in conjunction with the tgz releases on PyPI.
This would help to keep track of the changes between two releases (there's no official changelog).

Ambiguity bug?

When running this example:

grammar = """
start: ab_ b_ a_ | a_ bb_ a_
a_: "a"
b_: "b"
ab_: "ab"
bb_: "bb"
"""

l = Lark(grammar, parser='earley', ambiguity='explicit', lexer='standard')
res = l.parse('abba')
print(res.pretty())

The result is:

start
  ab_
  b_
  a_

Where one would expect two trees with '_ambig'

Advice on implementing precedence with parentheses?

I'm working on a grammar for a logical expression mini-language, and I have hit a snag: I can't figure out how to implement explicit precedence using parentheses in the grammar. I know about rule priorities (and I'm making use of them), but I want to be able to use parentheses to override precedence.

How would I go about doing something like this in the grammar? Or is it something I would need to implement in the tree transformer?
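
A minimal sketch of the usual approach, assuming a layered expression grammar: precedence comes from the layering of the rules, and parentheses override it simply by letting the innermost rule contain the outermost one again.

from lark import Lark

parser = Lark('''
    ?sum: product
        | sum "+" product
    ?product: atom
        | product "*" atom
    ?atom: NUMBER
        | "(" sum ")"        // a parenthesised sum restarts at the lowest-precedence level
    %import common.NUMBER
    %ignore " "
''', start='sum')

print(parser.parse("1 + 2 * 3").pretty())     # '*' binds tighter than '+'
print(parser.parse("(1 + 2) * 3").pretty())   # parentheses force the '+' to happen first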

regex rule failing with whitespace

I have a regex rule in my grammar that looks like this (modified from the python sample):

comment: /\/\/[^\n]*/

Which matches fine when a comment does NOT have a trailing space, like this:

//I am a comment

However, when there IS a trailing space, like this:

//I am a comment

It fails, resulting in an 'Incomplete parse: Could not find a solution to input' error.

The regex works fine in python, like so:

>>> teststring = "//I am a comment "
>>> c = re.compile('\/\/[^\n]*')
>>> c.match(teststring)
<_sre.SRE_Match object; span=(0, 17), match='//I am a comment '>

which does match the trailing whitespace

Python example grammars are broken

parser=Lark(open("python2.g.txt").read())

fails with

Traceback (most recent call last):
File "D:/Users/502744471/Desktop/Projects/GEPyPlot/code/GMC_parser.py", line 53, in
parser=Lark(open("python2.g.txt").read())
File "C:\ProgramData\Anaconda3\lib\site-packages\lark\lark.py", line 151, in init
self.grammar = load_grammar(grammar, source)
File "C:\ProgramData\Anaconda3\lib\site-packages\lark\load_grammar.py", line 612, in load_grammar
resolve_token_references(token_defs)
File "C:\ProgramData\Anaconda3\lib\site-packages\lark\load_grammar.py", line 513, in resolve_token_references
raise GrammarError("Rules aren't allowed inside tokens (%s in %s)" % (item, name))
lark.common.GrammarError: Rules aren't allowed inside tokens (s in LONG_STRING)

And parser=Lark(open("python3.g.txt").read())

results in

Traceback (most recent call last):
File "D:/Users/502744471/Desktop/Projects/GEPyPlot/code/GMC_parser.py", line 53, in
parser=Lark(open("python3.g.txt").read())
File "C:\ProgramData\Anaconda3\lib\site-packages\lark\lark.py", line 159, in init
self.parser = self._build_parser()
File "C:\ProgramData\Anaconda3\lib\site-packages\lark\lark.py", line 180, in _build_parser
return self.parser_class(self.lexer_conf, parser_conf, options=self.options)
File "C:\ProgramData\Anaconda3\lib\site-packages\lark\parser_frontends.py", line 122, in init
rules = [(n, list(self._prepare_expansion(x)), a, o) for n,x,a,o in parser_conf.rules]
File "C:\ProgramData\Anaconda3\lib\site-packages\lark\parser_frontends.py", line 122, in
rules = [(n, list(self._prepare_expansion(x)), a, o) for n,x,a,o in parser_conf.rules]
File "C:\ProgramData\Anaconda3\lib\site-packages\lark\parser_frontends.py", line 137, in _prepare_expansion
width = sre_parse.parse(regexp).getwidth()
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 855, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 765, in _parse
p = _parse_sub(source, state, sub_verbose, nested + 1)
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 768, in _parse
source.tell() - start)
sre_constants.error: missing ), unterminated subpattern at position 12

python2.g.txt
python3.g.txt

In python3.g.txt I've tried to simplify the test case down by changing the definition of LONG_STRING.

Any ideas? Looking at it casually I think that the regex parser doesn't believe that the STRING token regex is complete and is trying to parse the LONG_STRING token regex as another part of it.

Comments in grammar?

Hi there, a useful feature is to have comments in grammars.
I couldn't find any support for this perusing the code.
Am I missing something or is the feature not there?
Any plans to add?

Thanks!
Nice framework.
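
For reference, Lark grammars do accept //-style line comments, as the Hello World grammar near the top of this page already shows; a minimal sketch:

from lark import Lark

parser = Lark('''
    start: WORD          // the whole input is a single word
    %import common.WORD  // terminals can be imported from the bundled library
''')

print(parser.parse("hello"))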

Create git tags for releases

Is it possible to push tags for the releases you make in PyPI?

This helps a lot to keep track and see what's actually in a package. It's also pretty useful for mirroring repositories.

Ambiguity resolution not working as expected

I'm getting some results that I didn't expect, so maybe my expectations are wrong.

template_parser = lark.Lark("""
  expr : (varref | literal)+
  varref.5 : "${" VARNAME "}"
  literal.1 : LITERAL
  LITERAL : /.+/
  VARNAME : /[a-z]+/
""", start='expr', debug=True, parser='earley')

print(template_parser.parse('${foo}').pretty())

The result is

expr
  literal	$
  literal	{f
  literal	foo
  literal	oo}

This result surprises me in three ways:

  1. I expected to get this, since the varref rule has higher priority than literal:
  expr
    varref	foo
  2. If it's going to go with a literal result, why doesn't it choose this simpler tree?
  expr
    literal	${foo}
  3. How are parts of the input showing up in multiple terminals? For example, f shows up twice, in {f and foo.

Lark failing non-deterministically

Running the following lambda calculus parser on Python 3.6 exhibits non-deterministic behaviour. That is, the exact same code run multiple times gives different results.

from lark import Lark

parser = Lark("""
    ?term: "(" term ")"
         | /true/
         | /false/
         | /if/ term /then/ term /else/ term
         | var
         | /λ/ var /:/ type /./ term
         | term term

    ?type: "(" type ")"
         | /Bool/
         | type /→/ type

    ?var: /[a-z]+'*/

    %import common.WS
    %ignore WS
""", lexer='contextual', start='term', parser='lalr')

print(parser.parse("(ฮปa:(Boolโ†’Bool). a) (ฮปb:Bool. b)").pretty())

It seems that the cause of the non-determinism is PYTHONHASHSEED. On my machine PYTHONHASHSEED=3 produces the following result consistently

$ PYTHONHASHSEED=3 python lark_error.py 
term
  λ
  b
  :
  Bool
  .
  b

Whereas PYTHONHASHSEED=1 raises an exception every time.

$ PYTHONHASHSEED=1 python lark_error.py 
Traceback (most recent call last):
  File "/Users/richard/Documents/lark_test/venv/lib/python3.6/site-packages/lark/parsers/lalr_parser.py", line 31, in get_action
    return states_idx[state][key]
KeyError: 'ANONRE_7'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lark_error.py", line 22, in <module>
    print(parser.parse("(λa:(Bool→Bool). a)  (λb:Bool. b)").pretty())
  File "/Users/richard/Documents/lark_test/venv/lib/python3.6/site-packages/lark/lark.py", line 184, in parse
    return self.parser.parse(text)
  File "/Users/richard/Documents/lark_test/venv/lib/python3.6/site-packages/lark/parser_frontends.py", line 48, in parse
    return self.parser.parse(tokens, self.lexer.set_parser_state)
  File "/Users/richard/Documents/lark_test/venv/lib/python3.6/site-packages/lark/parsers/lalr_parser.py", line 60, in parse
    action, arg = get_action(token.type)
  File "/Users/richard/Documents/lark_test/venv/lib/python3.6/site-packages/lark/parsers/lalr_parser.py", line 35, in get_action
    raise UnexpectedToken(token, expected, seq, i)
lark.common.UnexpectedToken: Unexpected token Token(ANONRE_7, '→') at line 1, column 9.
Expected: dict_keys(['__RPAR', 'ANONRE_9'])
Context: <no context>

If I don't specify PYTHONHASHSEED at all then both results happen about 50% of the time.

(As an aside, even when it "works" it's not producing the parse tree I'm expecting. The whole "λa" branch of the parse seems to be missing. That might be user error though.)

Tree-less LALR hangs when final `start` returns False

If we parse an empty string using tree-less LALR

p = lark.Lark(GRAMMAR, start='file', parser='lalr', transformer=transformer)
p.parse(r'')

then the transformer will hang at def file(self, x): when this function returns False. This happens e.g. in case of an empty list.

Position of the parse trees

Hey,
unless I missed something, the only way to get the location of nodes of the AST is by checking the line/column pair of the final tokens we parse. Consider, for instance, the following rule:

var_decl : "var" NAME ":" type_ident

I could get the position of the NAME token:

class MyTransformer(Transformer):
  def var_decl(self, items):
    location = {'line': items[0].line, 'column': items[1].column}
    # ...

The problem with this solution is that it's difficult to get the actual start location of the token. In my small example, that would be the location of the "v" letter. I would have to let Lark give me the terminals (prefixing my rule with !), which would add noise to my parse tree.
Furthermore, it is often desirable to know the length of a node, that is not only where it starts, but also where it ends, so as to produce better compilation error messages.

Would it be possible to automatically add that information to parse trees? That way, one could get the location information as an additional parameter of the transformers:

class Transformer(object):
  def transform(self, tree):
    items = [self.transform(c) if isinstance(c, Tree) else c for c in tree.children]
    try:
      f = self._get_func(tree.data)
    except AttributeError:
      return self.__default__(tree.data, tree.location, items)
    else:
      return f(items, tree.location)
  # ...

# ...

class MyTransformer(Transformer):
  def var_decl(self, items, location):
    # ...

Unfortunately, my suggestion would be a breaking change, because it would require transformer handlers to take an additional parameter.

Is there a better/easier way to implement what I suggest?
And if you think the whole thing isn't worth the effort, are there downsides of letting Lark add all terminals in parse trees?
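
For reference, later Lark releases address much of this with propagate_positions: each rule's tree then carries a meta object with start and end positions, without keeping the terminals in the tree. A minimal sketch, assuming a reasonably recent version:

from lark import Lark

parser = Lark('''
    var_decl: "var" NAME ":" NAME
    %import common.CNAME -> NAME
    %import common.WS
    %ignore WS
''', start='var_decl', propagate_positions=True)

tree = parser.parse("var x: Int")
m = tree.meta
print(m.line, m.column, m.end_line, m.end_column)   # start and end of the whole var_decl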

Suggestion for keeping some terminals

I'm interested in keeping some - but not all - of the literal terminals in my parse tree.

I notice that there is a keep_all_tokens option which is not fully implemented. There is also the !term syntax which is what I am currently using.

I think it would be a bit nicer if there was a way to specify exactly which tokens are kept as part of the grammar. My first instinct would be to use different quotes. Maybe "literal" could mean discard and 'literal' mean keep. But really any lightweight per-literal syntax in the grammar would do the job.

In my grammar I want to keep all tokens except brackets. My suggestion would let me convert this:

    !?type: bracketed_type
          | "Bool"
          | type "โ†’" type

    ?bracketed_type: "(" type ")"

to this:

    ?type: "(" type ")"
         | 'Bool'
         | type '→' type

Which seems like a small improvement. Totally understand if it's not worth the effort though.

Parsing quasi-xml file with punctuation and special chars

I have the data that looks like this:
<LABEL="some value"> <OTHER="something else">

Each line contains 0 or more such XML tags. The tag labels are all caps but the value is basically any string, including punctuation.

I came up with the following grammar but couldn't get it working on strings where the values contain letters, digits, punctuation and other characters.

ltr_parser = Lark(r"""
?line: facet* 
?facet: "<" label "=\"" value+ "\">"
label: /[A-Z]+/
value: /.+?/

%import common.NUMBER
%import common.WORD
%import common.ESCAPED_STRING   -> STRING
%ignore " "
""", start='line')

text = '<TAG="value value"> <BUZ="jazz 100%">'

tree = ltr_parser.parse(text)
print(tree)

Is there a way to reliably match the "value" that can contain different characters, including punctuation and special symbols?

Thanks
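
One possible fix, sketched below (the grammar is illustrative): make the value match everything up to the closing quote instead of the non-greedy /.+?/.

from lark import Lark

ltr_parser = Lark(r'''
    ?line: facet*
    facet: "<" label "=\"" value "\">"
    label: /[A-Z]+/
    value: /[^"]+/       // anything except the closing double quote
    %ignore " "
''', start='line')

print(ltr_parser.parse('<TAG="value value"> <BUZ="jazz 100%">').pretty())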

Grammar works as expected with Earley, parser fails with LALR

Below you find an adapted version of the calc example. I have explicitly added white space tokens, rather than ignoring white space in general.

It works with Earley, but not with LALR. As I understand, Lark should complain if the grammar is not parsable by LALR.

Note that I have defined a rule (token?) _WS. It corresponds to WS_INLINE, but the added underscore keeps it from appearing in the tree. _WS is capitalized, and should hence be a terminal. On the other hand, it is not defined by either a string or a regexp. Is it intentional that such a construct is allowed?

I tried Lark version 0.3.6.

from lark import Lark, InlineTransformer

calc_grammar = """
    ?sum: product
        | sum [_WS] "+" [_WS] product   -> add
        | sum [_WS] "-" [_WS] product   -> sub

    ?product: atom
        | product [_WS] "*" [_WS]  atom  -> mul
        | product [_WS] "/" [_WS] atom  -> div

    ?atom: NUMBER           -> number
         | "-" atom         -> neg
         | "(" sum ")"

    %import common.NUMBER
    %import common.WS_INLINE

    _WS: WS_INLINE
"""


class CalculateTree(InlineTransformer):
    from operator import add, sub, mul, truediv as div, neg
    number = float


calc_parser = Lark(calc_grammar, parser='earley', start='sum')
# calc_parser = Lark(calc_grammar, parser='lalr', start='sum')


def calc(expr):
    tree = calc_parser.parse(expr)
    return CalculateTree().transform(tree)


if __name__ == '__main__':
    print(calc("1 + 3 * 4"))

Parser fails with Context: <no context>

Hello,

I want to parse out the "interesting" parts of a sentence and discard the rest. For instance, I want to parse out only strings of the form: the word "parsley" followed by an integer, e.g. "parsley 1", "parsley 212"... What is the correct way to do it?

Here's my tiny grammar:

rules = u"""
       sentence: (parsley|rubbish)+
       parsley: PARSLEY INT
       rubbish: RUBBISH
       
       PARSLEY: "parsley"
       RUBBISH: /[^\s]+/

       %import common.INT
       %import common.WS
       %ignore WS
       """

If I parse with Earley, parser doesn't parse anything at all.

>>> parser = Lark(rules, start="sentence")
>>> parser.parse(u"I saw parsley 12")
Tree(sentence, [Tree(rubbish, [u'k']), Tree(rubbish, [u'dk']), Tree(rubbish, [Token(RUBBISH, u'kd')]), Tree(rubbish, [Token(RUBBISH, u'djdj')]), Tree(rubbish, [Token(RUBBISH, u'parsley')]), Tree(rubbish, [Token(RUBBISH, u'12')])])

If I parse with LALR, it works if

>>> parser.parse(u"I saw a parsley 1")
Tree(sentence, [Tree(rubbish, [Token(RUBBISH, u'I')]), Tree(rubbish, [Token(RUBBISH, u'saw')]), Tree(rubbish, [Token(RUBBISH, u'a')]), Tree(parsley, [Token(PARSLEY, u'parsley'), Token(INT, u'1')])])

But if there are "parsley" rule parts it fails:

>>> parser.parse(u"I saw a parsley 1, maybe another parsley without a number")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/lark/lark.py", line 193, in parse
    return self.parser.parse(text)
  File "/usr/local/lib/python2.7/dist-packages/lark/parser_frontends.py", line 31, in parse
    return self.parser.parse(tokens)
  File "/usr/local/lib/python2.7/dist-packages/lark/parsers/lalr_parser.py", line 61, in parse
    action, arg = get_action(token.type)
  File "/usr/local/lib/python2.7/dist-packages/lark/parsers/lalr_parser.py", line 36, in get_action
    raise UnexpectedToken(token, expected, seq, i)
lark.common.UnexpectedToken: Unexpected token Token(RUBBISH, u'without') at line 1, column 41.
Expected: [u'INT']
Context: <no context>
>>> parser.parse(u"I saw a parsley 1, maybe another 12 without")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/lark/lark.py", line 193, in parse
    return self.parser.parse(text)
  File "/usr/local/lib/python2.7/dist-packages/lark/parser_frontends.py", line 31, in parse
    return self.parser.parse(tokens)
  File "/usr/local/lib/python2.7/dist-packages/lark/parsers/lalr_parser.py", line 61, in parse
    action, arg = get_action(token.type)
  File "/usr/local/lib/python2.7/dist-packages/lark/parsers/lalr_parser.py", line 36, in get_action
    raise UnexpectedToken(token, expected, seq, i)
lark.common.UnexpectedToken: Unexpected token Token(INT, u'12') at line 1, column 33.
Expected: [u'PARSLEY', u'RUBBISH', '$end']
Context: <no context>

What is the correct way to parse out strings of interest from a bigger string? Thanx in advance!

Rule prioritization when running earley

Hi,
So I've come to a situation where the grammar rules need to be prioritized; some are more important than others.

What is the best way to achieve this?

I imagine something like confidence score added to the different parse trees, and an API either in the transformer or the rules themselves that determines a production rule confidence.

If we want something simpler, a simple prioritization of rules will also satisfy my needs.

Also - could it be that the order of rules affects their priority, by accident?
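
For reference, rule priorities use a .N suffix on the rule name (the same syntax as varref.5 in the template-parser example elsewhere on this page); a minimal sketch of how Earley's default resolution uses them:

from lark import Lark

parser = Lark('''
    start: greeting | farewell
    greeting.2: WORD       // higher priority
    farewell.1: WORD
    %import common.WORD
''')

# Both alternatives match, so the parse is ambiguous; with the default
# ambiguity='resolve', the higher-priority 'greeting' rule should win.
print(parser.parse("hello").pretty())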

GrammarError: Collision when using `lalr`

I have a language using the following grammar:

    action          : STRING ACTION_OPERATOR (ESCAPED_STRING | STRING)
    attr            : (action | MACRO | conditional)
    attrs           : (attr ";")*  attr ";"?  // Semicolon is only used as separator, and thus optional for final attr
    operator        : OPERATOR
    conditional     : STRING "?" STRING ":"
    
    expr            : MACRO OPERATOR attrs
    line            : expr COMMENT?
    file            : line+
    
    comment         : COMMENT

    MACRO           : STRING
    COMMENT         : /#.*$/m
    
    STRING          : /[a-zA-Z0-9_.-]+/
    
    OPERATOR        : ":" | "+:"
    ACTION_OPERATOR : "===" | "==" | "+=" | "-=" | "="
    
    %import common.WS
    %import common.NEWLINE
    %import common.ESCAPED_STRING
    %ignore WS
    %ignore COMMENT
    %ignore NEWLINE

Parsing my files works fine with earley; however, it is quite slow: 18 seconds versus 1.5 seconds for my implementation in pyparsing, although the latter may still be incomplete. Parsing with lalr gives

GrammarError: Collision in MACRO: [('reduce', <attrs : attr __SEMICOLON>), ('reduce', <__anon_star_0 : attr __SEMICOLON>)]

Could you explain what the reason is for this?


Possibly related issue #10
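
For what it's worth, the collision message points at two identical reductions: after reading attr ";", LALR(1) cannot tell whether that pair closes the (attr ";")* loop or is the final attr ";"?. A sketch of an equivalent phrasing of attrs that avoids that particular reduce/reduce conflict (my rewrite, not a confirmed fix; the grammar may still have other LALR problems, e.g. MACRO and STRING sharing the same pattern):

    attrs           : attr (";" attr)* ";"?  // same language: ";"-separated attrs with an optional trailing ";"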

Limits of tree.InlineTransformer

I was toying around with lark when I realized that certain things will break InlineTransformer, and by extension inline_args.

Let's assume a simple example like this:

import lark

grammar = lark.Lark('''
    hex_uint: HEX_UINT

    HEX_UINT: "0x" ("a".."f" | "A".."F" | "0".."9")+
''', start='hex_uint')

class Shaper(lark.InlineTransformer):
    def hex_uint(self, token):
        return int(token[2:], 16)

tree = grammar.parse('0x10Ff')
print(tree)
print(Shaper().transform(tree))

This will work fine. But if we replace Shaper.hex_uint with a callable class instance, like so:

class Integer(object):
    def __call__(self, token):
        return int(token[2:], 16)

class Shaper(lark.InlineTransformer):
    hex_uint = Integer()

it will break, because utils.inline_args assumes the callable is always a function.
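
A possible workaround until that changes (my sketch, reusing the Integer class from above): expose the callable through a plain method, so inline_args only ever sees an ordinary function.

class Shaper(lark.InlineTransformer):
    _integer = Integer()

    def hex_uint(self, token):
        # Delegate to the callable object; inline_args only sees this method.
        return self._integer(token)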

ParseError on extremely simple grammar

I'm building a grammar to process a kind of assignment statement. I started from the beginning and built one that just finds the left and right sides, but this never completes parsing.

degen_expex = Lark("""
start: lhs "=" rhs

lhs: ESCAPED_STRING

rhs: ESCAPED_STRING

%import common.ESCAPED_STRING
%import common.WS
%ignore WS
""")

deparsed = degen_expex.parse('foo=bar')

My actual grammar is more complex than this, of course. It doesn't complete parsing either, but I feel like it would be pointless to debug that if I can't get the most basic thing working.

I'm on Python 3.6 on Windows, with Lark version 0.3.7.
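
For comparison: common.ESCAPED_STRING only matches double-quoted strings, so the unquoted input 'foo=bar' can never match this grammar. A sketch of the same structure using bare words instead (common.CNAME is my substitution, not necessarily the terminal you want), which does parse:

from lark import Lark

# Sketch: CNAME matches bare identifiers, so 'foo=bar' parses.
degen_expex = Lark("""
start: lhs "=" rhs

lhs: CNAME

rhs: CNAME

%import common.CNAME
%import common.WS
%ignore WS
""")

print(degen_expex.parse('foo=bar').pretty())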

Parsing ambiguous data results in an unhelpful exception.

Consider the following snippet:

from lark import Lark

parser = Lark("""
?inner: "a"

outer: "a" "b" -> one
    | inner "b" -> two
    
start: outer

%import common.WS
%ignore WS
""")

parser.parse("a b")

This grammar definition makes no sense, but never mind that. The output produced is:

Traceback (most recent call last):
  File "test2.py", line 15, in <module>
    parser.parse("a b")
  File "*/site-packages/lark/lark.py", line 190, in parse
    return self.parser.parse(text)
  File "*/site-packages/lark/parser_frontends.py", line 137, in parse
    return self.parser.parse(text)
  File "*/site-packages/lark/parsers/xearley.py", line 131, in parse
    ResolveAmbig().visit(tree) 
  File "*/site-packages/lark/tree.py", line 132, in visit
    getattr(self, subtree.data, self.__default__)(subtree)
  File "*/site-packages/lark/parsers/earley.py", line 282, in _ambig
    _resolve_ambig(tree)
  File "*/site-packages/lark/parsers/earley.py", line 273, in _resolve_ambig
    best = min(tree.children, key=cmp_to_key(_compare_drv))
  File "*/site-packages/lark/parsers/earley.py", line 263, in _compare_drv
    c = _compare_drv(t1, t2)
  File "*/site-packages/lark/parsers/earley.py", line 263, in _compare_drv
    c = _compare_drv(t1, t2)
  File "*/site-packages/lark/parsers/earley.py", line 241, in _compare_drv
    return -compare(tree1, tree2)
  File "*/site-packages/lark/utils.py", line 86, in compare
    elif a > b:
TypeError: '>' not supported between instances of 'Derivation' and 'Token'

I would expect some kind of parser error that notifies me of an unresolvable ambiguity. (I found this because my parser would work in version 0.3.1 but not in 0.3.6, so somewhere along the line the behaviour of ambiguities changed. I've been unable to construct a small example that shows this behaviour though.)

How to write the regex for a single-quoted string?

The regex for a single quoted string is

SINGLE_QUOTED_STRING  : /'[^']*'/

However, using this grammar results in

s = "\\'[^\\']*\\'"

    def _fix_escaping(s):
        s = s.replace('\\"', '"').replace("'", "\\'")
        w = ''
        i = iter(s)
        for n in i:
            w += n
            if n == '\\':
                n2 = next(i)
                if n2 == '\\':
                    w += '\\\\'
                elif n2 not in 'unftr':
                    w += '\\'
                w += n2
    
        to_eval = "u'''%s'''" % w
        try:
            s = literal_eval(to_eval)
        except SyntaxError as e:
>           raise ValueError(s, e)
E           ValueError: ("\\'[^\\']*\\'", SyntaxError('EOL while scanning string literal', ('<unknown>', 1, 20, "u'''\\\\'[^\\\\']*\\\\''''"))) occured when parsing ...

What kind of escaping is needed?
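
One way to sidestep the escaping problem entirely (a guess, not a confirmed fix for this Lark version) is to keep the quote characters out of the regex and compose the terminal from string literals plus a regex:

SINGLE_QUOTED_STRING  : "'" /[^']*/ "'"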

Grammar is case sensitive?

I'm currently experimenting with lark, and am attempting to run the following code:

from lark import Lark

grammar = '''
    Stmt: "foo"
'''

parser = Lark(grammar, start='Stmt')

This results in the following exception:

Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\lark\parsers\lalr_parser.py", line 31, in get_action
    return states_idx[state][key]
KeyError: 'RULE'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tester.py", line 81, in <module>
    parser = construct_parser(grammar, root)
  File "tester.py", line 77, in construct_parser
    return Lark(lark_grammar, start=root)
  File "C:\Python35\lib\site-packages\lark\lark.py", line 143, in __init__
    self.grammar = load_grammar(grammar, source)
  File "C:\Python35\lib\site-packages\lark\load_grammar.py", line 537, in load_grammar
    tree = self.simplify_tree.transform( self.parser.parse(grammar_text+'\n') )
  File "C:\Python35\lib\site-packages\lark\parser_frontends.py", line 31, in parse
    return self.parser.parse(tokens)
  File "C:\Python35\lib\site-packages\lark\parsers\lalr_parser.py", line 60, in parse
    action, arg = get_action(token.type)
  File "C:\Python35\lib\site-packages\lark\parsers\lalr_parser.py", line 35, in get_action
    raise UnexpectedToken(token, expected, seq, i)
lark.common.UnexpectedToken: Unexpected token Token(RULE, 'tmt') at line 2, column 6.
Expected: dict_keys(['_COLON'])
Context: <no context>

However, when I convert the phrase Stmt to stmt, the grammar starts working.

Is this behavior intentional? I didn't see anything in the documentation about this (unless I missed it?), and as far as I'm aware, both grammars are valid EBNF grammars; the casing of the non-terminals and terminals should be irrelevant (at least, according to the EBNF specification on Wikipedia).
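
For what it's worth, Lark's grammar syntax uses case to distinguish the two kinds of symbols: rule names are lowercase and terminal names are UPPERCASE, so a capitalized rule name like Stmt is rejected by the grammar parser itself. Lowercasing the rule name works:

from lark import Lark

# Same grammar, with a lowercase rule name.
grammar = '''
    stmt: "foo"
'''

parser = Lark(grammar, start='stmt')
print(parser.parse("foo"))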

Lark can be made to submit an invalid lexer regex in e9603b5

When I try to load the attached grammar gplus3.lark.txt with LALR parsing and contextual lexing, I get the stack trace here.

I tracked this down to _create_unless(tokens) in lexer.py creating duplicate entries in the returned tokens, which turn into a regex with named groups that have duplicate names and patterns. Turning delayed_strs into a set instead of a list seems to solve the issue and pass all the tests it was passing without the change, at the cost of increased parser-creation time. I don't know much about parsers or their internals, though, and less about how you'd like Lark structured.

Am I somehow abusing Lark here, and should I change my grammar or invocation? Is this actually a fix, or just a band-aid?

Earley fails when parsing comments

I'm attempting to parse a grammar that uses '!' or '//' to start comments running from the token to the end of the line. Everything is fine in LALR(1), but Earley throws a ParseError.

The code in question, with the grammar:

from __future__ import print_function
from lark import Lark

grammar = r"""
COMMENT: /(!|(\/\/))[^\n]*/
%ignore COMMENT

%import common.WS
// _WS: WS

%import common.INT
int_stmnt: "INT"i WS+ INT WS*
"""

parser = Lark(grammar, start="int_stmnt", parser="earley")

tree = parser.parse("int 1 ! This is a comment\n")

print(tree.pretty())

For emphasis, the regular expression I'm using for comments: COMMENT: /(!|(\/\/))[^\n]*/

The traceback from this script when run as is:

Traceback (most recent call last):
  File "lark_earley_bug.py", line 18, in <module>
    tree = parser.parse("int 1 ! This is a comment\n")
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lark/lark.py", line 193, in parse
    return self.parser.parse(text)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lark/parser_frontends.py", line 144, in parse
    return self.parser.parse(text)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/lark/parsers/xearley.py", line 128, in parse
    raise ParseError('Incomplete parse: Could not find a solution to input')
lark.common.ParseError: Incomplete parse: Could not find a solution to input

The error-free output of this script when using the LALR parser:

int_stmnt
   
  1

You'll find that removing the comment also produces the error-free output from Earley.

infinite loop

Hello!

I have (unfortunately) a complicated syntax to parse. I will (no longer) include the entire file at the end. The original syntax had the line

type_expression:"(" radix_type ")"|adjective_cluster type_expression|radix_type

which results in a "recursion error":

    def predict_and_complete(column):
        while True:
            to_predict = {x.expect for x in column.to_predict.get_news()
                          if x.ptr}  # if not part of an already predicted batch
            to_reduce = column.to_reduce.get_news()
            if not (to_predict or to_reduce):
                break

            for nonterm in to_predict:
                column.add( predict(nonterm, column) )
            for item in to_reduce:
                new_items = list(complete(item))
                for new_item in new_items:
                    if new_item.similar(item):
>                       raise ParseError('Infinite recursion detected! (rule %s)' % new_item.rule)
E                       lark.common.ParseError: Infinite recursion detected! (rule <type_expression : adjective_cluster type_expression>)

/usr/local/lib/python3.6/site-packages/lark/parsers/xearley.py:76: ParseError

In an attempt to solve this error, I changed the line to

type_expression:(adjective_cluster)* ("(" radix_type ")"|radix_type)

It successfully suppresses the "recursion error", but now the script runs indefinitely, which may be related to something else entirely. My test code is:

with open('test_easy.miz') as file:
	string = file.read()
ar = ArticleReader()
parseTree = ar.parse(string)
print(parseTree)

and the file I'm trying to parse is

environ

begin

registration
  cluster reflexive -> complete for 1-element RelStr;
  coherence;
end;

definition
  let T be RelStr;
  mode type of T is Element of T;
end;
...

and the parser is

import re

from lark import Lark


class ArticleReader:


	def __init__(self):
		self._init_parser()

	def parse(self, string):
		""" Takes in a string (.miz file content) and parses it """
		# Remove any comments
		commentless_string = re.sub(r'\s*::.*', '', string, flags=re.MULTILINE)  # flags= keyword; the 4th positional arg is count

		# Parse
		parseTree = self._parser.parse(commentless_string)
		return parseTree.pretty()

	def _init_parser(self):
		self._parser = Lark(r"""
			upperalphas: "A"|"B"|"C"|"D"|"E"|"F"|"G"|"H"|"I"|"J"|"K"|"L"|"M"|"N"|"O"|"P"|"Q"|"R"|"S"|"T"|"U"|"V"|"W"|"X"|"Y"|"Z"
			loweralphas: "a"|"b"|"c"|"d"|"e"|"f"|"g"|"h"|"i"|"j"|"k"|"l"|"m"|"n"|"o"|"p"|"q"|"r"|"s"|"t"|"u"|"v"|"w"|"x"|"y"|"z"
			alphas: upperalphas | loweralphas
			nums: "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
			upperalphanums: upperalphas | nums
			alphanums: alphas | nums
			non_zero_digit: "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
			numeral: non_zero_digit ("0" | non_zero_digit)*
			identifier: (alphanums)+
			file_name: upperalphas /("A"|"B"|"C"|"D"|"E"|"F"|"G"|"H"|"I"|"J"|"K"|"L"|"M"|"N"|"O"|"P"|"Q"|"R"|"S"|"T"|"U"|"V"|"W"|"X"|"Y"|"Z"|"0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" | "_"){4,7}/
			symbol: /(?:(?!::)[^\s])+/

			private_definition_parameter:"$1"|"$2"|"$3"|"$4"|"$5"|"$6"|"$7"|"$8"|"$9"|"$10"
			postqualified_variable:identifier
			postqualifying_segment:postqualified_variable ("," postqualified_variable)* [("is"|"are") type_expression]
			postqualification:"where" postqualifying_segment ("," postqualifying_segment)*
			term_expression_list:term_expression ("," term_expression)*
			adjective_arguments:term_expression_list|"(" term_expression_list ")"
			arguments:term_expression|"(" term_expression_list ")"
			term_expression:"(" term_expression ")"|[arguments] functor_symbol [arguments]|left_functor_bracket term_expression_list right_functor_bracket|functor_identifier "(" [term_expression_list] ")"|structure_symbol "(#" term_expression_list "#)"|"the" structure_symbol "of" term_expression|variable_identifier|"{" term_expression (postqualification)* ":" sentence "}"|"the" "set" "of" "all" term_expression (postqualification)*|numeral|term_expression "qua" type_expression|"the" selector_symbol "of" term_expression|"the" selector_symbol|"the" type_expression|private_definition_parameter|"it"
			type_expression_list:type_expression ("," type_expression)*
			radix_type:mode_symbol ["of" term_expression_list]|structure_symbol ["over" term_expression_list]
			structure_type_expression:"(" structure_symbol ["over" term_expression_list] ")"|adjective_cluster structure_symbol ["over" term_expression_list]
			type_expression:(adjective_cluster)* ("(" radix_type ")"|radix_type)
			qualification:("being"|"be") type_expression
			variables:variable_identifier ("," variable_identifier)*
			qualified_segment:variables qualification
			explicitly_qualified_variables:qualified_segment ("," qualified_segment)*
			implicitly_qualified_variables:variables
			qualified_variables:implicitly_qualified_variables|explicitly_qualified_variables|explicitly_qualified_variables "," implicitly_qualified_variables
			quantified_formula_expression:"for" qualified_variables ["st" formula_expression] ("holds" formula_expression|quantified_formula_expression)|"ex" qualified_variables "st" formula_expression
			atomic_formula_expression:[term_expression_list] [("does"|"do") "not"] predicate_symbol [term_expression_list] ([("does"|"do") "not"] predicate_symbol term_expression_list)*|predicate_identifier "[" [term_expression_list] "]"|term_expression "is" adjective (adjective)*|term_expression "is" type_expression
			formula_expression:"(" formula_expression ")"|atomic_formula_expression|quantified_formula_expression|formula_expression "&" formula_expression|formula_expression "&" "..." "&" formula_expression|formula_expression "or" formula_expression|formula_expression "or" "..." "or" formula_expression|formula_expression "implies" formula_expression|formula_expression "iff" formula_expression|"not" formula_expression|"contradiction"|"thesis"
			sentence:formula_expression
			proposition:[label_identifier ":"] sentence
			conditions:"that" proposition ("and" proposition)*
			scheme_number:numeral
			definition_number:numeral
			theorem_number:numeral
			library_scheme_reference:article_name ":" "sch" scheme_number
			library_reference:article_name ":" (theorem_number|"def" definition_number) ("," (theorem_number|"def" definition_number))*
			local_scheme_reference:scheme_identifier
			local_reference:label_identifier
			scheme_reference:local_scheme_reference|library_scheme_reference
			reference:local_reference|library_reference
			references:reference ("," reference)*
			scheme_justification:"from" scheme_reference ["(" references ")"]
			straightforward_justification:["by" references]
			proof:"proof" reasoning "end"
			simple_justification:straightforward_justification|scheme_justification
			justification:simple_justification|proof
			diffuse_statement:[label_identifier ":"] "now" reasoning "end" ";"
			iterative_equality:[label_identifier ":"] term_expression "=" term_expression simple_justification ".=" term_expression simple_justification (".=" term_expression simple_justification)* ";"
			type_change_list:(equating|variable_identifier) ("," (equating|variable_identifier))*
			type_changing_statement:"reconsider" type_change_list "as" type_expression simple_justification ";"
			choice_statement:"consider" qualified_variables "such" conditions simple_justification ";"
			compact_statement:proposition justification ";"
			linkable_statement:compact_statement|choice_statement|type_changing_statement|iterative_equality
			statement:["then"] linkable_statement|diffuse_statement
			example:term_expression|variable_identifier "=" term_expression
			exemplification:"take" example ("," example)* ";"
			diffuse_conclusion:"thus" diffuse_statement|"hereby" reasoning "end" ";"
			conclusion:("thus"|"hence") (compact_statement|iterative_equality)|diffuse_conclusion
			existential_assumption:"given" qualified_variables ["such" conditions] ";"
			collective_assumption:"assume" conditions ";"
			single_assumption:"assume" proposition ";"
			assumption:single_assumption|collective_assumption|existential_assumption
			generalization:"let" qualified_variables ["such" conditions] ";"
			skeleton_item:generalization|assumption|conclusion|exemplification
			reasoning_item:auxiliary_item|skeleton_item
			suppose:"suppose" (proposition|conditions) ";" reasoning "end" ";"
			suppose_list:suppose (suppose)*
			case:"case" (proposition|conditions) ";" reasoning "end" ";"
			case_list:case (case)*
			reasoning:(reasoning_item)* [["then"] "per" "cases" simple_justification ";" (case_list|suppose_list)]
			private_predicate_pattern:predicate_identifier "[" [type_expression_list] "]"
			private_functor_pattern:functor_identifier "(" [type_expression_list] ")"
			private_predicate_definition:"defpred" private_predicate_pattern "means" sentence ";"
			private_functor_definition:"deffunc" private_functor_pattern "=" term_expression ";"
			equating:variable_identifier "=" term_expression
			equating_list:equating ("," equating)*
			constant_definition:"set" equating_list ";"
			private_definition:constant_definition|private_functor_definition|private_predicate_definition
			auxiliary_item:statement|private_definition
			functor_identifier:identifier
			functor_segment:functor_identifier ("," functor_identifier)* "(" [type_expression_list] ")" specification
			predicate_identifier:identifier
			predicate_segment:predicate_identifier ("," predicate_identifier)* "[" [type_expression_list] "]"
			scheme_segment:predicate_segment|functor_segment
			scheme_premise:proposition
			scheme_conclusion:sentence
			scheme_parameters:scheme_segment ("," scheme_segment)*
			scheme_identifier:identifier
			scheme_block:"scheme" scheme_identifier "{" scheme_parameters "}" ":" scheme_conclusion ["provided" scheme_premise ("and" scheme_premise)*] ("proof"|";") reasoning "end"
			scheme_item:scheme_block ";"
			theorem:"theorem" compact_statement
			correctness_condition:("existence"|"uniqueness"|"coherence"|"compatibility"|"consistency"|"reducibility") justification ";"
			correctness_conditions:(correctness_condition)* ["correctness" justification ";"]
			reduction_registration:"reduce" term_expression "to" term_expression ";" correctness_conditions
			property_registration:"sethood" "of" type_expression justification ";"
			identify_registration:"identify" functor_pattern "with" functor_pattern ["when" locus "=" locus ("," locus "=" locus)*] ";" correctness_conditions
			functorial_registration:"cluster" term_expression "->" adjective_cluster ["for" type_expression] ";" correctness_conditions
			conditional_registration:"cluster" adjective_cluster "->" adjective_cluster "for" type_expression ";" correctness_conditions
			adjective:["non"] [adjective_arguments] attribute_symbol
			adjective_cluster:(adjective)*
			existential_registration:"cluster" adjective_cluster "for" type_expression ";" correctness_conditions
			cluster_registration:existential_registration|conditional_registration|functorial_registration
			attribute_loci:loci|"(" loci ")"
			attribute_symbol:symbol
			attribute_antonym:"antonym" attribute_pattern "for" attribute_pattern ";"
			attribute_synonym:"synonym" attribute_pattern "for" attribute_pattern ";"
			attribute_pattern:locus "is" [attribute_loci] attribute_symbol
			attribute_definition:"attr" attribute_pattern "means" definiens ";" correctness_conditions
			predicate_symbol:symbol|"="
			predicate_antonym:"antonym" predicate_pattern "for" predicate_pattern ";"
			predicate_synonym:"synonym" predicate_pattern "for" predicate_pattern ";"
			predicate_property:("symmetry"|"asymmetry"|"connectedness"|"reflexivity"|"irreflexivity") justification ";"
			predicate_pattern:[loci] predicate_symbol [loci]
			predicate_definition:"pred" predicate_pattern ["means" definiens] ";" correctness_conditions (predicate_property)*
			right_functor_bracket:symbol|"}"|"]"
			left_functor_bracket:symbol|"{"|"["
			functor_symbol:symbol
			functor_loci:locus|"(" loci ")"
			functor_synonym:"synonym" functor_pattern "for" functor_pattern ";"
			functor_property:("commutativity"|"idempotence"|"involutiveness"|"projectivity") justification ";"
			functor_pattern:[functor_loci] functor_symbol [functor_loci]|left_functor_bracket loci right_functor_bracket
			functor_definition:"func" functor_pattern [specification] [("means"|"equals") definiens] ";" correctness_conditions (functor_property)*
			mode_property:"sethood" justification ";"
			partial_definiens:(sentence|term_expression) "if" sentence
			partial_definiens_list:partial_definiens ("," partial_definiens)*
			conditional_definiens:[":" label_identifier ":"] partial_definiens_list ["otherwise" (sentence|term_expression)]
			label_identifier:identifier
			simple_definiens:[":" label_identifier ":"] (sentence|term_expression)
			definiens:simple_definiens|conditional_definiens
			mode_synonym:"synonym" mode_pattern "for" mode_pattern ";"
			mode_symbol:symbol|"set"
			mode_pattern:mode_symbol ["of" loci]
			mode_definition:"mode" mode_pattern ([specification] ["means" definiens] ";" correctness_conditions|"is" type_expression ";") (mode_property)*
			specification:"->" type_expression
			selector_symbol:symbol
			field_segment:selector_symbol ("," selector_symbol)* specification
			variable_identifier:identifier
			locus:variable_identifier
			fields:field_segment ("," field_segment)*
			loci:locus ("," locus)*
			structure_symbol:symbol
			ancestors:structure_type_expression ("," structure_type_expression)*
			structure_definition:"struct" ["(" ancestors ")"] structure_symbol ["over" loci] "(#" fields "#)" ";"
			redefinition:"redefine" (mode_definition|functor_definition|predicate_definition|attribute_definition)
			definition:structure_definition|mode_definition|functor_definition|predicate_definition|attribute_definition
			permissive_assumption:assumption
			loci_declaration:"let" qualified_variables ["such" conditions] ";"
			notation_declaration:attribute_synonym|attribute_antonym|functor_synonym|mode_synonym|predicate_synonym|predicate_antonym
			definition_item:loci_declaration|permissive_assumption|auxiliary_item
			notation_block:"notation" (loci_declaration|notation_declaration)* "end"
			registration_block:"registration" (loci_declaration|cluster_registration|identify_registration|property_registration|reduction_registration|auxiliary_item)* "end"
			definitional_block:"definition" (definition_item|definition|redefinition)* "end"
			notation_item:notation_block ";"
			registration_item:registration_block ";"
			definitional_item:definitional_block ";"
			reserved_identifiers:identifier ("," identifier)*
			reservation_segment:reserved_identifiers "for" type_expression
			reservation:"reserve" reservation_segment ("," reservation_segment)* ";"
			text_item:reservation|definitional_item|registration_item|notation_item|theorem|scheme_item|auxiliary_item
			section:"begin" (text_item)*
			text_proper:section (section)*
			requirement:file_name
			requirement_directive:"requirements" requirement ("," requirement)* ";"
			article_name:file_name
			library_directive:("notations"|"constructors"|"registrations"|"definitions"|"expansions"|"equalities"|"theorems"|"schemes") article_name ("," article_name)* ";"
			vocabulary_name:file_name
			vocabulary_directive:"vocabularies" vocabulary_name ("," vocabulary_name)* ";"
			directive:vocabulary_directive|library_directive|requirement_directive
			environment_declaration:"environ" (directive)*
			article:environment_declaration text_proper

			%import common.WS
			%ignore WS
		""", start='article')

Any help at all is appreciated! I can try to make a shorter test input if you think it's necessary. Thank you!
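
In case it helps narrow things down, here is a much smaller sketch of what I suspect is the same problem (my guess at the cause, not a confirmed diagnosis): adjective_cluster can match the empty string, so "adjective_cluster type_expression" can recurse without ever consuming input.

from lark import Lark

# Minimal sketch of the suspected cause: a rule recursing through a nullable
# rule never consumes input, which appears to be what the "infinite recursion"
# check (or the hang) is about.
parser = Lark(r"""
	start: type_expression
	type_expression: adjective_cluster type_expression | WORD
	adjective_cluster: WORD*

	%import common.WORD
	%import common.WS
	%ignore WS
""")

parser.parse("red ball")  # expected to hit the recursion error / hang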

Regex DeprecationWarning on Python 3.6

Hi erezsh. Thank you for authoring this wonderful library! I'm very happy with how simple and elegant it has made generating a parser.

I've been in the process of developing a DSL on Python 3.5 without issue, but when testing on Python 3.6, some DeprecationWarnings started to come up. The rules in question are taken from the lark example python3.g grammar, as I'm using the same rules to match Python 3 numeric and string literals. I googled around and found a few promising leads from other projects, but my knowledge of regexes isn't really strong enough to make headway on the specific issue.

<unknown>:1: DeprecationWarning: invalid escape sequence \w
<unknown>:1: DeprecationWarning: invalid escape sequence \#
<unknown>:1: DeprecationWarning: invalid escape sequence \d
<unknown>:1: DeprecationWarning: invalid escape sequence \d
<unknown>:1: DeprecationWarning: invalid escape sequence \d
<unknown>:1: DeprecationWarning: invalid escape sequence \d
<unknown>:1: DeprecationWarning: invalid escape sequence \d
<unknown>:1: DeprecationWarning: invalid escape sequence \d
<unknown>:1: DeprecationWarning: invalid escape sequence \d
<unknown>:1: DeprecationWarning: invalid escape sequence \d
<unknown>:1: DeprecationWarning: invalid escape sequence \d
<unknown>:1: DeprecationWarning: invalid escape sequence \d
<unknown>:1: DeprecationWarning: invalid escape sequence \d
<unknown>:1: DeprecationWarning: invalid escape sequence \d
/home/brian/.local/lib/python3.6/site-packages/lark/lexer.py:118: DeprecationWarning: Flags not at the start of the expression '(?:(?i)[ub]?r?("(?!"' (truncated)
  re.compile(t.pattern.to_regexp())
/home/brian/.local/lib/python3.6/site-packages/lark/parser_frontends.py:43: DeprecationWarning: Flags not at the start of the expression '(?:(?i)[ub]?r?("(?!"' (truncated)
  self.lexer = ContextualLexer(lexer_conf.tokens, d, ignore=lexer_conf.ignore, always_accept=always_accept)
/home/brian/.local/lib/python3.6/site-packages/lark/lexer.py:118: DeprecationWarning: Flags not at the start of the expression '[+-]?(?:(?:(?:(?:(?:' (truncated)
  re.compile(t.pattern.to_regexp())
/home/brian/.local/lib/python3.6/site-packages/lark/parser_frontends.py:43: DeprecationWarning: Flags not at the start of the expression '[+-]?(?:(?:(?:(?:(?:' (truncated)
  self.lexer = ContextualLexer(lexer_conf.tokens, d, ignore=lexer_conf.ignore, always_accept=always_accept)
/home/brian/.local/lib/python3.6/site-packages/lark/lexer.py:201: DeprecationWarning: Flags not at the start of the expression '(?:(?i)[ub]?r?("(?!"' (truncated)
  lexer = Lexer(state_tokens, ignore=ignore)
/home/brian/.local/lib/python3.6/site-packages/lark/lexer.py:201: DeprecationWarning: Flags not at the start of the expression '[+-]?(?:(?:(?:(?:(?:' (truncated)
  lexer = Lexer(state_tokens, ignore=ignore)
/home/brian/.local/lib/python3.6/site-packages/lark/lexer.py:95: DeprecationWarning: Flags not at the start of the expression '(?P<SIGNED>[+-]?(?:(' (truncated)
  mre = re.compile(u'|'.join(u'(?P<%s>%s)'%(t.name, t.pattern.to_regexp()+postfix) for t in tokens[:max_size]))
/home/brian/.local/lib/python3.6/site-packages/lark/lexer.py:118: DeprecationWarning: Flags not at the start of the expression 'd(?:(?i)[ub]?r?("(?!' (truncated)
  re.compile(t.pattern.to_regexp())
/home/brian/.local/lib/python3.6/site-packages/lark/parser_frontends.py:43: DeprecationWarning: Flags not at the start of the expression 'd(?:(?i)[ub]?r?("(?!' (truncated)
  self.lexer = ContextualLexer(lexer_conf.tokens, d, ignore=lexer_conf.ignore, always_accept=always_accept)
/home/brian/.local/lib/python3.6/site-packages/lark/lexer.py:201: DeprecationWarning: Flags not at the start of the expression 'd(?:(?i)[ub]?r?("(?!' (truncated)
  lexer = Lexer(state_tokens, ignore=ignore)
/home/brian/.local/lib/python3.6/site-packages/lark/lexer.py:118: DeprecationWarning: Flags not at the start of the expression '(?:(?:(?:(?:(?:(?:(?' (truncated)
  re.compile(t.pattern.to_regexp())
/home/brian/.local/lib/python3.6/site-packages/lark/parser_frontends.py:43: DeprecationWarning: Flags not at the start of the expression '(?:(?:(?:(?:(?:(?:(?' (truncated)
  self.lexer = ContextualLexer(lexer_conf.tokens, d, ignore=lexer_conf.ignore, always_accept=always_accept)
/home/brian/.local/lib/python3.6/site-packages/lark/lexer.py:206: DeprecationWarning: Flags not at the start of the expression 'd(?:(?i)[ub]?r?("(?!' (truncated)
  self.root_lexer = Lexer(tokens, ignore=ignore)
/home/brian/.local/lib/python3.6/site-packages/lark/lexer.py:206: DeprecationWarning: Flags not at the start of the expression '(?:(?i)[ub]?r?("(?!"' (truncated)
  self.root_lexer = Lexer(tokens, ignore=ignore)
/home/brian/.local/lib/python3.6/site-packages/lark/lexer.py:206: DeprecationWarning: Flags not at the start of the expression '[+-]?(?:(?:(?:(?:(?:' (truncated)
  self.root_lexer = Lexer(tokens, ignore=ignore)
/home/brian/.local/lib/python3.6/site-packages/lark/lexer.py:206: DeprecationWarning: Flags not at the start of the expression '(?:(?:(?:(?:(?:(?:(?' (truncated)
  self.root_lexer = Lexer(tokens, ignore=ignore)

Here is the grammar (the file can also be found here). It seems like the issues are coming from the string and numeric literals (the first warning above comes from SHORT_STRING, as far as I can tell):

start : (word | atom)+

word : "(" NAME DOCSTR? (word | atom)* ")"

?atom : SIGNED   -> number
      | STRING   -> string
      | OPERATOR -> operator
      | NAME     -> call
      | "True"   -> true_
      | "False"  -> false_
      | "None"   -> none_
      | var
      | list
      | dot
      | array
      | import_

var : ":" NAME
list : "[" atom* "]"
dot : NAME ("." NAME)+
array : "@" NAME ("." NAME)*
import_ : STRING "import"



// Tokens
OPERATOR : "++" | "+"  | "--" | "-"  | "**" | "*"  | "//" | "/"  | "%"
         | "==" | "!=" | "<=" | ">=" | "<"  | ">"  | "~"
         | "~"  | "&"  | "|"  | "^"  | "<<" | ">>"

NAME : /[a-zA-Z_]\w*/

COMMENT : /\#[^\n]*/
WHITESPACE : /[ \t\f\r\n]+/
%ignore COMMENT
%ignore WHITESPACE


// String literals
DOCSTR : /d/ STRING
STRING : SHORT_STRING | LONG_STRING
SHORT_STRING : /(?i)[ub]?r?("(?!"").*?(?<!\\\\)(\\\\\\\\)*?"|'(?!'').*?(?<!\\\\)(\\\\\\\\)*?')/
LONG_STRING : /(?i)(?s)[ub]?r?(""".*?(?<!\\\\)(\\\\\\\\)*?"""|'''.*?(?<!\\\\)(\\\\\\\\)*?''')/


// Numeric literals
SIGNED : /[+-]?/ NUMBER
NUMBER : IMAG_NUMBER | FLOAT_NUMBER | HEX_NUMBER | OCT_NUMBER | BIN_NUMBER | DEC_NUMBER | ZERO
ZERO : "0"
DEC_NUMBER : /(?i)[1-9]\d*l?/
HEX_NUMBER : /(?i)0x[\da-f]*l?/
OCT_NUMBER : /(?i)0o[0-7]*l?/
BIN_NUMBER : /(?i)0b[0-1]*l?/
FLOAT_NUMBER : /(?i)((\d+\.\d*|\.\d+)(e[-+]?\d+)?|\d+(e[-+]?\d+))/
IMAG_NUMBER : /(?i)\d+j|${FLOAT_NUMBER}j/
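
From staring at the warnings, my reading (a sketch, not Lark's actual code) is that an inline (?i) flag is legal at the start of a single terminal's regex, but once several terminals are joined into one big alternation the flag ends up in the middle of the combined pattern, which Python 3.6 deprecates (and later versions reject). Scoped flags, added in Python 3.6, keep the flag local to one branch:

import re

# Sketch only: an illustration of the combined-pattern problem, not Lark code.
# An inline "(?i)" in the middle of a pattern such as
#     (?P<NAME>[a-zA-Z_]\w*)|(?P<STRING>(?i)[ub]?r?".*?")
# triggers the DeprecationWarning. A scoped flag keeps it local to one branch:
scoped = r'(?P<NAME>[a-zA-Z_]\w*)|(?P<STRING>(?i:[ub]?r?".*?"))'
print(re.compile(scoped).match('"Hello"').lastgroup)  # -> STRING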

incorrect ignore behavior

Hi Erez,

Given a grammar like this:

sudo_parser = Lark(r"""
    sudo_item : CNAME*

    %import common.CNAME
    %import common.WS
    %ignore /[\\\\]$/m
    %ignore /^#.*\n/m
    %ignore /^Defaults.*$/m
    %ignore WS
""", start='sudo_item')

This input works:

# ======> BEGINNING OF sudoers.d/00_Cmnd_Aliases

This doesn't:

# ======> BEGINNING OF sudoers.d/00_Cmnd_Aliases
#New aliases

I'm using the master branch.

Note that if I use $ in the comment pattern, neither the first nor the second case works. Let me know if I'm doing something wrong.
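
In case it's relevant, a guess at a workaround (simplified to just the comment rule; dropping the Defaults and line-continuation ignores): since the lexer matches terminals from the current input position, "^"/"$" anchors in %ignore patterns may not behave like they would in a standalone re.MULTILINE search. Matching the comment without anchors handles both inputs:

from lark import Lark

# Sketch: ignore comments with an unanchored pattern instead of ^...$ anchors.
sudo_parser = Lark(r"""
    sudo_item : CNAME*

    %import common.CNAME
    %import common.WS
    %ignore /#[^\n]*/
    %ignore WS
""", start='sudo_item')

print(sudo_parser.parse("# ======> BEGINNING OF sudoers.d/00_Cmnd_Aliases\n#New aliases\n"))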
