hxparse's Issues

haxelib

This is not a real issue, but more of a question:

Any plans to release this library on haxelib? Do you plan to support this library further or was this some kind of "I just wanted to make something cool" experiment? 😄

Repository not quite compatible with haxelib

When using haxelib, it's common to switch between haxelib and git versions.

The problem with hxparse at the moment is that the directory haxelib should treat as the library root is src, not the repository root.

Running haxelib set hxparse git does not work; one needs to set a dev version and manually point the directory to ....../libs/hxparse/git/src

Just something to keep in mind; one solution is simply to move the contents of src to the top level of the repository.
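
For reference, the manual workaround amounts to something like the following, where the checkout path is hypothetical:

$ haxelib dev hxparse /path/to/libs/hxparse/git/src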

Builds fail

I get the following output when trying to build hxparse with haxe build.hxml:

$ haxe build.hxml
haxelib run hxcpp Build.xml haxe -Dhaxe3="1" -Dhaxe_ver="3.103" -Dhxcpp_api_level="311" -Dunifill="1" -I"/usr/lib/haxe/lib/unifill/0,1,1/" -I"test/" -I"src/" -I"/usr/lib/haxe/extraLibs/" -I"/usr/local/lib/haxe/extraLibs/" -I"" -I"/usr/lib/haxe/std/cpp/_std/" -I"/usr/local/lib/haxe/std/cpp/_std/" -I"/usr/lib/haxe/std/" -I"/usr/local/lib/haxe/std/"
(...)
./src/PrintfLexer.cpp -oobj/darwin/fd282d4a_PrintfLexer.o
Error: ./src/PrintfParser.cpp:22:10: fatal error: 'haxe/ds/GenericCell.h' file not found
#include <haxe/ds/GenericCell.h>

Almost certainly caused by HaxeFoundation/haxe#2016.

Running haxe test.hxml also fails:

$ haxe test.hxml
Error: Resource file not found : TestMacro.hx

usage from macros

Is it feasible to run a parser from a macro context?
I am busy testing this, but asking may prove quicker and more definitive.

I had this error come up when attempting to run a parser at compile time:
haxelib/hxparse/4,0,0/src/hxparse/Parser.hx:154: lines 154-161 : A generic class can't have static fields

Hacking around a bit further, I managed to get more errors along the lines of @:build not being allowed from a macro; given how hxparse works, maybe it doesn't make sense to try?

Thanks!

A generic class can't have static fields error when a subclass of Parser is used inside a macro

I created a subclass of Parser and tried to use it inside a macro. It failed with the compiler error "A generic class can't have static fields".

As you can see in #61, I was able to get my Parser subclass working inside a macro by fencing static public macro function parse(). However, I understand that this isn't the correct solution.

I suppose that a similar workaround that I can use for now is to copy all of the code except static public macro function parse() from Parser into a new class, and use that copy instead. However, I would prefer to use the real Parser so that I don't need to manually merge any changes that you make to hxparse in the future.

Exponent on JSON parser example

I think the regex should be:
"-?(([1-9][0-9]*)|0)(.[0-9]+)?([eE][\+\-]?[0-9]+)?" => TNumber(lexer.current),

ie "[0-9]+" not "[0-9]?"

"final" variable name conflicts with haxe 4 "final" keyword

In State.hx, I replaced final by finalId:

package hxparse;

/**
	Represents a state in the state machine generated by the `LexEngine`.
**/
class State {
	/**
		The transition vector, where the index corresponds to a char code.
	**/
	public var trans:haxe.ds.Vector<State>;

	/**
		The ids of the final states.
	**/
	public var finalId:Int;

	/**
		Creates a new State.
	**/
	public function new() {
		finalId = -1;
		trans = new haxe.ds.Vector(256);
	}
}

Parser control flow

The current control flow does not compile to Flash due to "Cannot compile try/catch as a right-side expression in Flash9". We have to find a way of making this work while still allowing manual Stream.Failure, which the Haxe parser's parseClassField relies on.
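
To illustrate the limitation (a generic, self-contained Haxe sketch of the rejected shape, not the actual generated code):

class TryValue {
    static function parseA():Int return throw "no match";
    static function parseB():Int return 0;
    static function main() {
        // A try/catch used in value (right-side) position: the catch
        // branch supplies the fallback value. Flash 9 rejected this shape.
        var node = try parseA() catch (e:Dynamic) parseB();
        trace(node); // 0
    }
}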

Switch Ruleset<T> within Lexer context

I'm using hxparse for my experimental project hxdtl, a Haxe implementation of DTL (the Django Template Language). DTL is made for text processing, using fairly simple templates.

I've found it uncomfortable (and almost impossible) to use a single Ruleset for processing templates, so I've created three different rulesets, each serving a different section of the templates. But the resulting rule functions become too complex when trying to reflect the current lexer state (InCode, InText). I tried using the Lexer.token function to control the ruleset flow, but that approach fails when we want to switch rulesets and return a token at the same time. To solve this problem I've tried the following solutions:

  • Modifying the Lexer.token declaration to return an array of tokens and concatenating these arrays in LexerStream. But this is too ugly and inflexible.
  • Adding a LexerStream field to Lexer, so that we are free to set LexerStream.ruleset from the Lexer context (see the sketch below).

This last solution seems nice, but I'm in doubt whether such tight coupling is good. Maybe we need to consider another way of composing/switching rulesets? Thinking more generally, I like the idea of parser composition, so that it would be possible to reuse existing parsers and build new ones.
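
A minimal sketch of the second approach against the current API, where LexerTokenSource plays the role of LexerStream; the back-reference is kept in a static field here, and the token type is hypothetical:

enum TplToken {
    TText(s:String);
    TCodeOpen;
    TIdent(s:String);
    TCodeClose;
    TEof;
}

class TplLexer extends hxparse.Lexer implements hxparse.RuleBuilder {
    // Back-reference to the token source, set after construction, so a
    // rule action can switch the active ruleset and still return a token.
    public static var source:hxparse.LexerTokenSource<TplToken>;

    public static var text = @:rule [
        "{{" => { source.ruleset = code; TCodeOpen; },
        "[^{]+" => TText(lexer.current),
        "" => TEof
    ];

    public static var code = @:rule [
        "}}" => { source.ruleset = text; TCodeClose; },
        "[a-zA-Z_]+" => TIdent(lexer.current),
        "[\r\n\t ]+" => lexer.token(code),
        "" => TEof
    ];
}

Wiring it up would then look something like:

var lexer = new TplLexer(byte.ByteData.ofString("hello {{ name }}"), "tpl");
TplLexer.source = new hxparse.LexerTokenSource(lexer, TplLexer.text);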

Anyway, I appreciate your work at hxparse, especially ParserBuilder and LexEngine classes.

npeek

The current stream only allows a single token of lookahead. We need an efficient way of supporting further lookahead, which is required by Haxe's "else" parsing.

[REQUEST] An example for parsing simple dynamic language like 'Kaleidoscope'

I am very curious about writing parsers and compilers. I come from a background of using PEGs to generate parsers, so the concept of writing a parser by hand is entirely new to me. I love Haxe and I want to experiment with parsers using hxparse. I have taken a look at the haxeparser library as an implementation guide for parsing a full programming language, but things got complex, as it seems to depend on built-in Haxe macros for tokenizing expressions.

I'll be glad if there is a simple example or tutorial that explains step-by-step how to parse a simple dynamic language like the Kaleidoscope example.

Thank you!!

Peek(1) affects next token.

I'm using different rulesets in a custom parser for the erazor lib.

Frequently I use peek(1) under a ruleset to check whether it should process what comes next or leave it to another ruleset.

The problem is that if a function using ruleset A decides, after peek(1), not to process what comes next, and another function using ruleset B takes over, the first token B sees is the one already lexed by A's peek(1) under ruleset A.

example:

function A() {
    stream.ruleset = MyLexer.ASomeRules;

    var tok = peek(1);
    switch (tok) {
        case SOME_RULE: B(); // let B process what comes next
    }
}

function B() {
    stream.ruleset = MyLexer.BOtherRules;

    // Throws "SOME_RULE has no match in this switch": the token peeked
    // by A under ASomeRules is still in the lookahead buffer.
    switch stream {
        case [OTHER_RULE]: null;
    }
}

Investigate ways to implement this without exceptions

hxparse was heavily inspired by how the real Haxe parser works. Unfortunately, OCaml deals with exceptions much better than the Haxe targets do. My initial profiling suggests that we spend far too much time on exception handling.

Lexer ruleset generator matches rules in the wrong order

If two rules match at the same time and the matches are equally long, the first one should win; instead, the last one matches.

Example snippet of tokenizer rules from a modified LaTeX lexer:

static public var tok = @:rule [
    "\\\\begin{[a-zA-Z]+}" => TBegin(lexer.current.substr(7, lexer.current.length - 8)),
    "\\\\begin{flowchart}" => {
        trace("Matched!");
        TBegin(lexer.current.substr(7, lexer.current.length - 8));
    }
];

This traces out "Matched!", but it shouldn't, since it's the last rule. If the rules are switched, it doesn't trace it.

I'm personally fine with this, but apparently it's supposed to be the other way around. Something to do with DFA building in LexEngine.

please expose the AST in the API.

Hello,
Is there a way to expose the AST, or whatever tree is built, when using this Parser?
I couldn't find this in the API or docs.
Thanks.

Sub parser seems to end early.

Hey Simon,

Hopefully I'm just missing something obvious, but I've got a couple of lexers which need to grab chunks, which are then parsed. The problem I have come across is that the parser seems to end early. Here's a minimal test.

If you comment out line 54, it will end early.

position info wrong in case of unicode characters

When using HaxeLexer 1.0.0 and hxparse 4.0.0, the token positions are incorrect. Positions are correct when switching back to hxparse 3.0.0.
Sample file (test.hx):

/*
* üä
*/
class Test {
  public function new ()
  {}
}

output with hxparse 3.0.0:

Comment(* üä)                         { file => test.hx, max => 12, min => 0 }
Kwd(KwdClass)                         { file => test.hx, max => 18, min => 13 }
    Const(CIdent(Test))               { file => test.hx, max => 23, min => 19 }
        BrOpen                        { file => test.hx, max => 25, min => 24 }
            Kwd(KwdPublic)            { file => test.hx, max => 34, min => 28 }
            Kwd(KwdFunction)          { file => test.hx, max => 43, min => 35 }
                Kwd(KwdNew)           { file => test.hx, max => 47, min => 44 }
                    POpen             { file => test.hx, max => 49, min => 48 }
                        PClose        { file => test.hx, max => 50, min => 49 }
                    BrOpen            { file => test.hx, max => 54, min => 53 }
                        BrClose       { file => test.hx, max => 55, min => 54 }
            BrClose                   { file => test.hx, max => 57, min => 56 }

output with hxparse 4.0.0:

Comment(* üä)                         { file => test.hx, max => 16, min => 0 }
Kwd(KwdClass)                         { file => test.hx, max => 22, min => 17 }
    Const(CIdent(Test))               { file => test.hx, max => 27, min => 23 }
        BrOpen                        { file => test.hx, max => 29, min => 28 }
            Kwd(KwdPublic)            { file => test.hx, max => 38, min => 32 }
            Kwd(KwdFunction)          { file => test.hx, max => 47, min => 39 }
                Kwd(KwdNew)           { file => test.hx, max => 51, min => 48 }
                    POpen             { file => test.hx, max => 53, min => 52 }
                        PClose        { file => test.hx, max => 54, min => 53 }
                    BrOpen            { file => test.hx, max => 58, min => 57 }
                        BrClose       { file => test.hx, max => 59, min => 58 }
            BrClose                   { file => test.hx, max => 61, min => 60 }

You can see that the positions from hxparse 4.0.0 are higher starting with the Unicode comment; the numbers grow with every additional Unicode character used.

I am testing on a 64-bit Linux system (if that makes any difference).
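
A quick check of the arithmetic (a standalone sketch, my addition): the comment is exactly 12 bytes in UTF-8, matching the 3.0.0 positions, while ü and ä are one code point but two bytes each, which lines up with the 4.0.0 positions drifting by four over those two characters:

class ByteCheck {
    static function main() {
        var comment = "/*\n* üä\n*/";
        // UTF-8 byte length: 12, matching the max => 12 that hxparse 3.0.0 reports.
        trace(haxe.io.Bytes.ofString(comment).length);
    }
}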

Stack overflow in RuleBuilder

While working with the Haxe language server, I ended up with the compilation server blocked a few times by a stack overflow error. The third time gave me a call stack from the hxparse macros:

Uncaught exception Stack overflow
/opt/haxe/std/haxe/macro/Context.hx:439: characters 10-30 : Called from here
/git/vshaxe/.haxelib/hxparse/git/src/hxparse/RuleBuilder.hx:190: characters 11-28 : Called from here
/git/vshaxe/.haxelib/hxparse/git/src/hxparse/RuleBuilder.hx:38: characters 21-42 : Called from here
/git/vshaxe/.haxelib/hxparse/git/src/hxparse/RuleBuilder.hx:49: characters 4-11 : Called from here
/git/vshaxe/.haxelib/haxeparser/git/src/haxeparser/HaxeLexer.hx:26: character 1 : Called from here

Sadly it happened right when I renamed the configuration that enabled the recording, so I have no recording of this.

Edit: got an 11k-line repro script 😆

Fallthrough behaviour in Parser

It looks like when you have a match on the first token and a failure on the following tokens in the same case, any following cases in the original pattern match are ignored. Is this expected behaviour?

case [TLParen,TAtom(UnboundSym('defun')), sym = symbol(), xs = params(), xprs = sexpr(), TRParen] : Defun(sym,xs,xprs); //fail
case _                                                                                            : SE(sexpr()); //nothing
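
If this is the commit-on-first-token semantics that stream parsers traditionally have, one workaround (my restructuring, not from the issue) is to match only the distinguishing prefix and then require the rest in the body, so a later mismatch surfaces as an error instead of the whole case silently failing:

case [TLParen, TAtom(UnboundSym('defun'))]:
    var sym = symbol();
    var xs = params();
    var xprs = sexpr();
    // Require the closing paren explicitly; a missing one now raises
    // Unexpected at this point rather than skipping the case.
    switch stream {
        case [TRParen]: Defun(sym, xs, xprs);
    }
case _:
    SE(sexpr());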

haxelib release update

Please release a versioned update on haxelib.

I've been using hxparse (it's wonderfully easy to get into), but until there is a haxelib release covering the past year-plus of updates and the Haxe 3.2 changes, there is no way I can comfortably share this with any users.

I think I saw @fussybeaver and @prog4mr having the same thoughts.

Is it possible to match international characters?

Hi Simon,

I'm attempting to parse some of the old roundups, which now and then contain international characters - â, ê, etc.

Putting these into a range rule or on their own doesn't work. I've tried escaped char codes, again in range rules and on their own, which doesn't work either. I assume the form is \\226.

I've tested String.charCodeAt and StringTools.fastCodeAt on the character â; both return 195. String.fromCharCode(195) returns Ã.

Looking at an extended ASCII table, the correct code for â is 226. It also returns 195 when testing the ê character, which should be 234.

String.fromCharCode seems to always return the correct character based on the codes from the extended ASCII table.

I'm using the latest Haxe build and hxparse via github, targeting neko and macros.
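
For what it's worth (my note, not from the thread): 195 is 0xC3, the first byte of the two-byte UTF-8 encodings of â (0xC3 0xA2) and ê (0xC3 0xAA), which would explain both characters reporting 195 on a byte-oriented target like neko. A standalone check:

class Utf8Check {
    static function main() {
        for (c in ["â", "ê"]) {
            var b = haxe.io.Bytes.ofString(c); // encodes as UTF-8
            trace('$c -> ${b.get(0)} ${b.get(1)}'); // â -> 195 162, ê -> 195 170
        }
    }
}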

Haxelib: create package and share

I'm preparing the release of HxDtl 0.1.0. The library depends on hxparse, and I'm thinking about possible ways of distributing hxdtl:

  • Git - hxparse is already included as a submodule, so developers get the dependency when pulling
  • Haxelib - I've created a haxelib.xml file, but it requires hxparse to be packaged as a haxelib too

So, are you able to create a haxelib.xml for hxparse and submit it to http://lib.haxe.org?

Can you detail the general flow of the parser and how consuming tokens works a bit more?

Hello, and thank you for this library.
So far, it seems to reduce the boilerplate required to build parsers in Haxe.

After reaching some level of complexity, I realised that I am not sure about how the flow of the program works exactly, and I find myself doing a lot of trial and error.

My main source of confusion is how tokens are matched against the list of cases. I tried to just put the "shapes" that I expect in the case arrays, but that is not working, and additionally it triggers a warning as stated in #60.

So, for example, this code:

  public function parseType() {
    return switch stream {
      case [DocType(t), Pipe, DocType(t2)]:
        'Either($t, $t2)';
      case [DocType(t)]: // It says this case is unused; the same happens if I change the order, whatever I put here becomes unused
        t + "";
    }
  }

This is not working as expected; it essentially only works when there is a Pipe in between and fails otherwise. Not sure if this is expected to work.

Also, I would like to know more about skipping cases.
For skipping, I am following a similar approach to tokens: in the cases I want to skip, I just call the same parse function. I tried leaving the cases empty, but that never worked for me, so things like this do not work:

return switch stream {
      case [LParen | SPC]: // This ignore does not work; things just fail.
      case [DocType(t), Pipe, DocType(t2)]:
        'Either($t, $t2)';

This however seems to work:

return switch stream {
      case [LParen | SPC]: callItself();
      case [DocType(t), Pipe, DocType(t2)]:
        'Either($t, $t2)';

Last but not least, how is error recovery supposed to be done? I don't want to fill everything with try/catch, and a single big one does not give much context about what could fail. Of course I can try to "prevent" known errors and raise different messages, or just do workarounds, but that is something I am not sure about either.
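
As a starting point, one could wrap the top-level parse call and report the failure position; a minimal sketch, assuming hxparse's ParserError base class (which NoMatch and Unexpected extend) and a hypothetical MyParser with the parseType() method from above:

static function run(parser:MyParser) {
    try {
        trace(parser.parseType());
    } catch (e:hxparse.ParserError) {
        // ParserError carries the position where parsing failed.
        trace('Parse error at ${e.pos}: $e');
    }
}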

Thanks for your great work.

Warning : This case is unused

I use hxparse with haxeparser.
hxparse emits some warnings:

hxparse/ParserBuilderImpl.hx:136: lines 136-139 : Warning : This case is unused

Not really important, since it will be built anyway.

Parser failed for Uppercase letter.

I am trying to parse Lua code based on the Lua 5.1 grammar. I have parsed most things with much success; however, when I try to parse an identifier which begins with an uppercase character, I get an Unexpected exception.

package test;

import byte.ByteData;
import xlua.Lexer;
import xlua.Parser;

class Main {
    public static function main() {
        var str = "function Plane:fly(a, b, ...) return y end";
        var byteData = ByteData.ofString(str);
        var lex = new Lexer(byteData, "test.lua");
        var ts = new hxparse.LexerTokenSource(lex, Lexer.tok);
        var p = new Parser(ts);
        trace(p.parseLua());
    }
}

In the code above I am trying to parse a named function in Lua.

The parser throws this error:

Called from ? line 1
Called from Main.hx line 14
Called from xlua/Parser.hx line 25
Called from hxparse/ParserBuilderImpl.hx line 112
Called from hxparse/Parser.hx line 36
Called from hxparse/LexerTokenSource.hx line 13
Called from xlua/Lexer.hx line 59
Called from hxparse/Lexer.hx line 100
Uncaught exception - Unexpected P

These are my lexer rules:

class Lexer extends hxparse.Lexer implements hxparse.RuleBuilder {

	static function mkPos(p:hxparse.Position) {
		return {
			file: p.psource,
			min: p.pmin,
			max: p.pmax
		};
	}


	static function mk(lexer:hxparse.Lexer, td) {
		return new xlua.Data.Token(td, mkPos(lexer.curPos()));
	}

	// @:mapping generates a map with lowercase enum constructor names as keys
	// and the constructor itself as value
	static var keywords = @:mapping(3) Data.Keyword;

    static var buf = new StringBuf();

    static var ident = "_*[a-z][a-zA-Z0-9_]*|_+|_+[0-9][_a-zA-Z0-9]*";

    static var integer = "([1-9][0-9]*)|0";

    // @:rule wraps the expression to the right of => with function(lexer) return
    public static var tok = @:rule [
        "" => mk(lexer, Eof),
        "[\r\n\t ]+" => {
			#if keep_whitespace
			var space = lexer.current;
			var token:Token = lexer.token(tok);
			token.space = space;
			token;
			#else
			lexer.token(tok);
			#end
		},
        "0x[0-9a-fA-F]+" => mk(lexer, Const(CInt(lexer.current))),
        integer => mk(lexer, Const(CInt(lexer.current))),
        integer + "\\.[0-9]+" => mk(lexer, Const(CFloat(lexer.current))),
        "\\.[0-9]+" => mk(lexer, Const(CFloat(lexer.current))),
        integer + "[eE][\\+\\-]?[0-9]+" => mk(lexer,Const(CFloat(lexer.current))),
        integer + "\\.[0-9]*[eE][\\+\\-]?[0-9]+" => mk(lexer,Const(CFloat(lexer.current))),
        "-- [^\n\r]*" => mk(lexer, CommentLine(lexer.current.substr(2))),
        "+\\+" => mk(lexer,Unop(OpIncrement)),
        "~" => mk(lexer,Unop(OpNegBits)),
        "%=" => mk(lexer,Binop(OpAssignOp(OpMod))),
        "&=" => mk(lexer,Binop(OpAssignOp(OpAnd))),
        "|=" => mk(lexer,Binop(OpAssignOp(OpOr))),
        "^=" => mk(lexer,Binop(OpAssignOp(OpXor))),
        "+=" => mk(lexer,Binop(OpAssignOp(OpAdd))),
        "-=" => mk(lexer,Binop(OpAssignOp(OpSub))),
        "*=" => mk(lexer,Binop(OpAssignOp(OpMult))),
        "/=" => mk(lexer,Binop(OpAssignOp(OpDiv))),
        "<<=" => mk(lexer,Binop(OpAssignOp(OpShl))),
        "==" => mk(lexer,Binop(OpEq)),
        "~=" => mk(lexer,Binop(OpNotEq)),
        "<=" => mk(lexer,Binop(OpLte)),
        "and" => mk(lexer,Binop(OpBoolAnd)),
        "or" => mk(lexer,Binop(OpBoolOr)),
        "<<" => mk(lexer,Binop(OpShl)),
        "\\.\\.\\." => mk(lexer, TriplDot),
        "~" => mk(lexer,Unop(OpNot)),
        "<" => mk(lexer,Binop(OpLt)),
        ">" => mk(lexer,Binop(OpGt)),
        ":" => mk(lexer, Col),
        "," => mk(lexer, Comma),
        "\\." => mk(lexer, Dot),
        "%" => mk(lexer,Binop(OpMod)),
        "&" => mk(lexer,Binop(OpAnd)),
        "|" => mk(lexer,Binop(OpOr)),
        "^" => mk(lexer,Binop(OpXor)),
        "+" => mk(lexer,Binop(OpAdd)),
        "*" => mk(lexer,Binop(OpMult)),
        "/" => mk(lexer,Binop(OpDiv)),
        "-" => mk(lexer,Binop(OpSub)),
        "=" => mk(lexer,Binop(OpAssign)),
        "in" => mk(lexer,Binop(OpIn)),
        "[" => mk(lexer, BkOpen),
        "]" => mk(lexer, BkClose),
        "{" => mk(lexer, BrOpen),
        "}" => mk(lexer, BrClose),
        "\\(" => mk(lexer, POpen),
        "\\)" => mk(lexer, PClose),
		'"' => {
			buf = new StringBuf();
			var pmin = lexer.curPos();
			var pmax = try lexer.token(string) catch (e:haxe.io.Eof) throw new LexerError(UnterminatedString, mkPos(pmin));
			var token = mk(lexer, Const(CString(unescape(buf.toString(), mkPos(pmin)))));
			token.pos.min = pmin.pmin; token;
		},
		"'" => {
			buf = new StringBuf();
			var pmin = lexer.curPos();
			var pmax = try lexer.token(string2) catch (e:haxe.io.Eof) throw new LexerError(UnterminatedString, mkPos(pmin));
			var token = mk(lexer, Const(CString(unescape(buf.toString(), mkPos(pmin)))));
			token.pos.min = pmin.pmin; token;
		},
		'-- \\*' => {
			buf = new StringBuf();
			var pmin = lexer.curPos();
			var pmax = try lexer.token(comment) catch (e:haxe.io.Eof) throw new LexerError(UnclosedComment, mkPos(pmin));
			var token = mk(lexer, Comment(buf.toString()));
			token.pos.min = pmin.pmin; token;
		},
        "#" + ident => mk(lexer, Sharp(lexer.current.substr(1))),
		ident => {
			var kwd = keywords.get(lexer.current);
			if(kwd != null)
				mk(lexer, Kwd(kwd));
			else
				mk(lexer, Const(CIdent(lexer.current)));
		} 
    ];
}

The error comes from calling:

lexer.token(tok);

I would appreciate it if you could help me figure out what I am doing wrong.
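
For reference (my observation, not from the thread): the ident pattern above only allows a lowercase letter in the leading position, so Plane cannot start any match and the lexer fails with Unexpected P. A sketch of an identifier pattern following Lua's grammar, which accepts either case:

// Lua identifiers: a letter or underscore, then letters, digits or
// underscores, with uppercase allowed anywhere.
static var ident = "[_a-zA-Z][_a-zA-Z0-9]*";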

regex flags?

I have been looking for where the regex is handled to see the default flags, but I can't seem to find it.

Secondly, how would I go about matching with a /m flag?
This might go against the stream approach, but there are cases where I can't seem to make sense of the parsing; for example, a //comment in a file would require a /m flag in most cases I know of.

Any suggestions?

