hoaproject / compiler



The Hoa\Compiler library.

Home Page: https://hoa-project.net/

PHP 99.10% Pascal 0.90%

php hoa library compiler grammar parser grammar-based-testing


compiler's Issues

detect invalid regex in lexer

In Lexer.php, preg_match can return false when the regex is invalid.

The false return value should be checked and reported at https://github.com/hoaproject/Compiler/blob/master/Llk/Lexer.php#L270
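A minimal sketch of such a check in plain PHP follows. The helper name `safeMatch` is hypothetical and not part of Hoa\Compiler; it only illustrates how a `false` return could be separated from a plain "no match".

```php
<?php

/**
 * Hypothetical helper: run preg_match and turn a `false` return
 * (pattern that does not compile, or PCRE runtime error) into an exception.
 */
function safeMatch(string $pattern, string $subject, int $offset = 0): array
{
    // Silence the E_WARNING that PHP emits for a pattern that does not compile.
    $result = @preg_match($pattern, $subject, $matches, 0, $offset);

    if (false === $result) {
        throw new RuntimeException(
            sprintf('Regex %s failed (PCRE error code %d).', $pattern, preg_last_error())
        );
    }

    // Empty array when the pattern is valid but does not match.
    return $matches;
}
```

A lexer could call such a helper instead of preg_match directly, so that an invalid token definition is reported with the offending pattern rather than being treated as "no match".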

Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

Broken visualization of invalid input token in multiline input

With multiline input, the arrow indicating the position of the invalid token is misplaced:

/**
 * @Foo(hello
 */

When the error is at the ( (i.e. its matching ) is missing), the exception message currently contains:

/**
 * @Foo(hello
 */
                   ↑

What would be expected instead:

/**
 * @Foo(hello
        ↑
 */

or:

/**
 * @Foo(hello
        ↑
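The second expected form can be sketched in a few lines of plain PHP. The function `pointAt` is a hypothetical name, and the sketch assumes that the line and column (both 1-based, column counted in characters) are already known.

```php
<?php

/**
 * Hypothetical helper: show the input up to the offending line,
 * then an arrow under the given column.
 */
function pointAt(string $text, int $line, int $column): string
{
    $lines = explode("\n", $text);

    // Keep only the lines up to and including the offending one.
    $shown = array_slice($lines, 0, $line);

    // Pad with spaces so the arrow sits under the requested column.
    $shown[] = str_repeat(' ', $column - 1) . '↑';

    return implode("\n", $shown);
}

echo pointAt("/**\n * @Foo(hello\n */", 2, 8), "\n";
```

Cutting the output after the offending line avoids having to splice the arrow between two lines of the original text.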

ID overwriting does not work properly in PP

Take the following grammar:

%skip  s \s
%token a a
%token b b
%token c c

#root:
    ( <a> ) ( <b> #other )* <c>

echo 'abbc' | hoa compiler:pp Grammar.pp 0 -v dump
>  #root
>  >  token(a, a)
>  >  token(b, b)
>  >  token(b, b)
>  >  token(c, c)

Because we have ( <a> ) and not only <a>, we have the #root ID and not #other as expected.

Add Location to TreeNode

There's currently no way of locating an erroneous token within source during semantic analysis.

Refer to: http://discourse.hoa-project.net/t/hoa-compiler-locate-nodes-in-source/256


add a dev / debug mode

Writing a grammar can be tricky.

Of course pp and Hoa\Compiler ease the job, but a dev / debug mode that helps the developer understand the grammar in special cases and gives verbose traces would be a real plus.


Dependency errors when installed with prefer-lowest

While attempting to run the suite locally with lowest dependencies (Composer's --prefer-lowest), it crashes right away:

$ vendor/bin/hoa test:run -a
PHP Deprecated:  The (unset) cast is deprecated in /tmp/Compiler/vendor/hoa/consistency/Prelude.php on line 73

Deprecated: The (unset) cast is deprecated in /tmp/Compiler/vendor/hoa/consistency/Prelude.php on line 73

Error: The (unset) cast is deprecated in /tmp/Compiler/vendor/hoa/test/.bootstrap.atoum.php at line 7
$ php -v | head -n 1
PHP 7.2.3 (cli) (built: Mar 12 2018 20:39:08) ( NTS )

When using Compiler as dependency and installing with --prefer-lowest, it produces lots of errors, regarding the deprecated cast, but also non-existent class Hoa\Iterator\Buffer. See Travis log here: https://travis-ci.org/schmittjoh/serializer/jobs/371673251#L580

As a library, Compiler must declare, as the lowest allowed versions of its dependencies, only versions it actually works with.

Incorrect calculation of line and position in utf-8 files

Example code with an error on the last line:

type A {
    some(arg: String = "😺 😸 😹 😻 😼 😽 🙀 😿 😾"): Any
}

😿

Expected (line 5 and column 1):

Unrecognized token "😿" at line 5 and column 1:
😿
↑ <- arrow here

Actual (line 1 and column 89):

Unrecognized token "😿" at line 1 and column 89:
type A {
    some(arg: String = "😺 😸 😹 😻 😼 😽 🙀 😿 😾"): Any
}

😿some

                                                             ↑ <- arrow here
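A minimal sketch of a UTF-8-aware computation follows. It assumes the lexer knows the byte offset of the offending token and that the mbstring extension is available; `locate` is a hypothetical name, not an existing Hoa method.

```php
<?php

/**
 * Hypothetical helper: convert a byte offset into a 1-based line number
 * and a 1-based column counted in characters rather than bytes.
 */
function locate(string $text, int $byteOffset): array
{
    $before = substr($text, 0, $byteOffset);

    // Newlines are single bytes in UTF-8, so counting them byte-wise is safe.
    $line = substr_count($before, "\n") + 1;

    // Start of the current line, as a byte offset.
    $lastNewline = strrpos($before, "\n");
    $lineStart   = false === $lastNewline ? 0 : $lastNewline + 1;

    // Count characters, not bytes, between the line start and the token.
    $column = mb_strlen(substr($before, $lineStart), 'UTF-8') + 1;

    return ['line' => $line, 'column' => $column];
}
```

With the example above, the offset of the final 😿 resolves to line 5 and column 1, because the multi-byte emoji on line 2 no longer inflate the count.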

Ability to parse a "blob" (sequence of bytes) whose length is defined by a previous token

For example 5 abcde
would be parsed into

token(number, 5)
token(fixedlengthstring, abcde)

with the constraint that the length of the 2nd string token "abcde" is equal to the value of the first number token 5.

Here are 3 examples that would need this feature (maybe that using the Compiler is not the right tool for the job, in which case this issue can be closed) :

1️⃣ parsing the response of a IMAP FETCH command

* 1 FETCH (BODY[HEADER]<0> {100}
The first 100 byte literals of the headers would be here)

In this example, the size of the next data token is included not far before the data to parse.

2️⃣ Creating an AST from a PDF file:

5 0 obj
<< /Length 42 >>
stream
This (possibly encoded) stream contains 42 bytes of dataendstream
endobj

This one is a bit trickier, since the size of the data stream (42 bytes) is contained in the previous "Dictionary Node" (from an AST point of view), thus requiring knowledge of previous nodes already emitted.

3️⃣ Other protocols use this as well:

the Content-Length header (HTTP), same as 2️⃣
the payload len in a websocket frame, though this one is even more challenging because it would in addition require being able to read a streamed input
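For comparison, this is roughly what a hand-written reader for the simplest case (5 abcde) looks like outside of a grammar. It is only a sketch in plain PHP: the function name `readLengthPrefixed` and the returned array shape are assumptions made for illustration, and nothing here relies on Hoa\Compiler.

```php
<?php

/**
 * Hypothetical reader for "<length> <blob>": the number decides how many
 * bytes the following token consumes.
 */
function readLengthPrefixed(string $input, int $offset = 0): array
{
    // First token: a decimal length followed by exactly one space.
    if (1 !== preg_match('/\G(\d+) /', $input, $m, 0, $offset)) {
        throw new InvalidArgumentException('Expected "<length> " at offset ' . $offset);
    }

    $length = (int) $m[1];
    $start  = $offset + strlen($m[0]);

    // Second token: exactly $length bytes, whatever they contain.
    if (strlen($input) - $start < $length) {
        throw new InvalidArgumentException('Input is shorter than the announced length.');
    }

    return [
        'tokens' => [
            ['number', $m[1]],
            ['fixedlengthstring', substr($input, $start, $length)],
        ],
        'next' => $start + $length, // Byte offset where parsing would resume.
    ];
}
```

The difficulty raised in the issue is visible here: the second token cannot be described by a regular expression alone, because its extent depends on the value of the first one.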

add *(sep) and +(sep) syntactic sugar

picked from http://www.gazelle-parser.org/docs/manual.html#_repetition

*(sep)

The *(sep) modifier specifies 0 or more occurrences of the previous component, where each occurrence is separated by sep. It is a more straightforward way of writing (component (sep component)*)?. sep can be any valid component (or in unusual cases, expression of components) that can appear on a right-hand-side of a rule.

+(sep)

The +(sep) modifier specifies 1 or more occurrences of the previous component, where each occurrence is separated by sep. It is a more straightforward way of writing component (sep component)*. sep can be any valid component (or in unusual cases, expression of components) that can appear on a right-hand-side of a rule.

Debug grammar tooling problems

Hi, I tried this code https://keynote.hoa-project.net/PHPTour14/Demonstration/Generation_exhaustive.php to show all possible compiler output. Is it the same as paths? My CPU just goes to 100% and nothing gets printed.

I'm trying to diagnose why I get a PHP Fatal error: Uncaught Hoa\Compiler\Llk\Parser::parse(): (0) Unexpected token ... There is no indication of which token was expected instead, which alternatives were tried, or what the parser backtracked on. For this I tried the code in the guide https://hoa-project.net/En/Literature/Hack/Compiler.html#Traces

Also, a side question about the hoa tool: I cannot find it. I was expecting it to be in vendor/bin, but it's not there. How can I get this tool?

Introduce pragma to disable UTF-8 in the lexer

In some grammars, we need the lexer to not be in Unicode mode. For instance, in Hoa\Json. So we need to introduce pragma, and the first one would aim to disable Unicode. Something like:

%pragma unicode false

Later, we will need to implement the following one too:

%pragma backtrack_limit 3

(for an LL(3) compiler)

Maybe the names are not correct. I need your help here!

%skip token selection

In my task I need to know information about some tokens that have been marked as "%skip".

For example, here is the code:

%skip T_COMMENT \/\*.*\*\/

/** Docblock */ 
class /** skipped */ Test /** skipped */
{
    /** skipped */
}

Output:

#Class
    #Name 
        token(T_NAME, Test)

But I need something like this:

#Class
  #Docblock
      token(T_COMMENT, /** Docblock */)
  #Name
      token(T_NAME, Test)

I can replace the %skip with a %token. But I will have to make it (%token) optional after every active token in the grammar file.

Any ideas how to implement this?


Inlining code of the Parser and license

Hi,

I'd like to use your library, but I want to automatically inline parent class \Hoa\Compiler\Llk\Parser and minimize generated class. So far it looks like

class FooParser extends \Hoa\Compiler\Llk\Parser
{
    public function __construct()
    {
        // ...
    }
}

The point is to decouple final product from the compiler.

I want to ask whether this is acceptable to you. Of course, the generated class would have a reference to your library, e.g.

/**
 * This parser was autogenerated by merging Foo-related stuff and Hoa Compiler Parser into this one.
 * In other words, all the credits regarding generic Parser and Lexer goes to maintainers of https://github.com/hoaproject/Compiler.
 * But rights on the Foo-related part of the software (describing formal grammar, etc.) belongs to Santa-Claus.
 *
 * Generated on Wednesday, November 7, 2018 (UTC) using Hoa Compiler "3.17.08.08".
 */
final class FooParser
{
    // ...
}

The text of your license states that I also have to include it in full. Okay, but does that mean my MIT-licensed product (for example) becomes a product with a mixed license?

What do you think?

Problems with debugging PP

Hi, I have a problem with some pp syntax. I think the wrong repetition is getting precedence, but I don't understand why. See below for the relevant info.

notation under test:

Ab ::= Cd

%skip S [\x09\x0A\x0D\x20]
%token Production_a ::=
%token NameStartChar \w
%token Choice_a \|
%token SequenceOrDifference_a -
%token Item_a [\?\*\+]

#Production:
    NCName() ::Production_a:: Choice()

#NCName:
    ::NameStartChar:: (::NameStartChar::)*

#Choice:
    SequenceOrDifference() (::Choice_a:: SequenceOrDifference())*

#SequenceOrDifference:
    (NCName() (::SequenceOrDifference_a:: NCName() | NCName()*))?

Trace:

 #  namespace     token name            token value                     offset
-------------------------------------------------------------------------------
 0  default       NameStartChar         A                                    0
 1  default       NameStartChar         b                                    1
 2  default       Production_a          ::=                                  3
 3  default       NameStartChar         C                                    7
 4  default       NameStartChar         d                                    8
 5  default       EOF                                                        9


>  enter Production (#Production)
>  >  enter NCName
         token NameStartChar, consumed A
>  >  >  enter 17
>  >  >  >  enter 16 (#NCName)
               token NameStartChar, consumed b
<  <  <  <  ekzit 16
<  <  <  ekzit 17
<  <  ekzit NCName
      token Production_a, consumed ::=
>  >  enter Choice
>  >  >  enter SequenceOrDifference
>  >  >  >  enter 12
>  >  >  >  >  enter NCName
                  token NameStartChar, consumed C
>  >  >  >  >  >  enter 17
<  <  <  <  <  <  ekzit 17
<  <  <  <  <  ekzit NCName
>  >  >  >  >  enter 11
>  >  >  >  >  >  enter 10 (#SequenceOrDifference)
>  >  >  >  >  >  >  enter 9
>  >  >  >  >  >  >  >  enter NCName
                           token NameStartChar, consumed d
>  >  >  >  >  >  >  >  >  enter 17
<  <  <  <  <  <  <  <  <  ekzit 17
<  <  <  <  <  <  <  <  ekzit NCName
<  <  <  <  <  <  <  ekzit 9
<  <  <  <  <  <  ekzit 10
<  <  <  <  <  ekzit 11
<  <  <  <  ekzit 12
<  <  <  ekzit SequenceOrDifference
>  >  >  enter 5
<  <  <  ekzit 5
<  <  ekzit Choice
<  ekzit Production


>  #Production
>  >  #NCName
>  >  #Choice
>  >  >  #SequenceOrDifference
>  >  >  >  #NCName
>  >  >  >  #NCName

At:

                  token NameStartChar, consumed C
>  >  >  >  >  >  enter 17

Expect:

>  >  >  >  >  >  >  enter 16 (#NCName)

But get:

<  <  <  <  <  <  ekzit 17

I think there is a preference for the second NCName() here:

#SequenceOrDifference:
    (NCName() (::SequenceOrDifference_a:: NCName() |    >>> NCName()* <<<    ))?

Instead of trying to match another ::NameStartChar:: here:

#NCName:
    ::NameStartChar:: ( >>> ::NameStartChar:: <<<)*

Please note that I put >>> <<< just to indicate the flow of the grammar; refer to case.pp for the actual grammar.

Lexer speedup

If support for token namespaces is removed, the Lexer can be sped up significantly. This requires completely rewriting the algorithm.

Benchmarks:

3000 code lines
11867 tokens
5 times

Stand:

i7 6700k
Win 10
PHP 7.1.11 (cli)

Hoa Original

Sources: Original Lexer

28.6555s
28.4433s
29.2020s
28.9820s
30.1294s

AVG: 29.0824s (408 token/s)

Completely rewritten (Hoa-like)

Sources: Rewritten Lexer

29.4850s
30.6486s
31.3297s
31.1340s
32.0120s

AVG: 30.9218s (383.7 token/s)

Fast Lexer (without namespaces support)

Sources: FastLexer

0.2046s
0.2085s
0.2194s
0.2124s
0.2009s

AVG: 0.2091s (56752.7 token/s)

Would it make sense to adapt it for Hoa, and to use the FastLexer implementation when the user's grammar contains only the default namespace?
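One common way to get this kind of speedup is to compile all token definitions into a single regular expression and let PCRE report which alternative matched. The sketch below shows that technique in plain PHP; whether the linked FastLexer works exactly this way is an assumption, and `fastLex` is a hypothetical name. It does not support namespaces, and it assumes that token patterns do not contain the `~` delimiter.

```php
<?php

/**
 * Hypothetical single-regex lexer: one alternative per token,
 * tagged with (*MARK:name) so the matching token can be identified.
 */
function fastLex(array $tokens, string $source): array
{
    $alternatives = [];
    foreach ($tokens as $name => $pattern) {
        $alternatives[] = '(?:' . $pattern . ')(*MARK:' . $name . ')';
    }

    // \G anchors every match at the end of the previous one.
    $regex = '~\G(?:' . implode('|', $alternatives) . ')~u';

    if (false === preg_match_all($regex, $source, $matches, PREG_SET_ORDER)) {
        throw new RuntimeException('The combined regex could not be applied.');
    }

    $result   = [];
    $consumed = 0;
    foreach ($matches as $match) {
        $result[]  = [$match['MARK'], $match[0]];
        $consumed += strlen($match[0]);
    }

    // Matching stops silently at the first unrecognized byte, so check coverage.
    if ($consumed !== strlen($source)) {
        throw new RuntimeException('Unrecognized token at byte offset ' . $consumed);
    }

    return $result;
}
```

As in a declaration-ordered lexer, the first alternative that matches wins, so the order of the `$tokens` array matters.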

Remove $nodeId from Token rule definition.

The third argument of the Token rule is not needed, neither at runtime nor anywhere else. We can get rid of it.

new Token($id, $name, $nodeId, $unificationId, $kept);
//                    ^^^^^^^ - excess

PSR Support

What do you think about supporting generally accepted standards? e.g.:

Right now the Hoa code looks rather Baroque, and reading it is quite difficult %)

State restore error

I suspect that the backtrack method does not work correctly and does not find the previous correct chain of rules, but I have not found the cause yet :\

Grammar

%skip T_WHITESPACE \s+
%token T_DIGIT \d+
%token T_WORD \w+

#grammar:
    digits() | words()

#digits:
    <T_DIGIT>*

#words:
    <T_WORD>*

Sample

2 3 4 a b c

Expected

>  #grammar
>  >  #digits
>  >  >  token(T_DIGIT, 2)
>  >  >  token(T_DIGIT, 3)
>  >  >  token(T_DIGIT, 4)
>  >  #words
>  >  >  token(T_WORD, a)
>  >  >  token(T_WORD, b)
>  >  >  token(T_WORD, c)

Actual

Hoa\Compiler\Exception\UnexpectedToken : Unexpected token "a" (T_WORD) at line 1 and column 7:
2 3 4 a b c
      ↑
 ~/vendor/hoa/compiler/Llk/Parser.php:1

Add a research papers section in the documentation

Like in http://hoa-project.net/En/Literature/Hack/Test.html#Research_papers.

Introduce an `%import` pragma

While working on TML I needed to write a second grammar for testing purposes. This grammar is almost the same as the original language grammar, but with some rules overloaded.

I played a bit with the compiler and came up with an %import directive. It works quite well in my case, and I was wondering whether you would want such a feature in the compiler. Here is what it looks like:

// src/tml.pp

%token             T_FN                    \.[a-zA-Z_][a-zA-Z0-9_]*
%token             T_VAR                   @[a-zA-Z_][a-zA-Z0-9_]*
%token             T_NUMBER                \-?[1-9][0-9]*

#tml:
    ( fn() | expr() | assign() | str() )+

#expr:
    ( <T_NUMBER> | rvar() ) ( operator() expr() )?

// ...

// tests/tml.pp

%import ../src/tml.pp

#tml:
      expr()
    | assign()
    | str()

#expr:
    <T_NUMBER> ( operator() expr() | division() )?

// Avoids division by zero
division:
    <T_NUMBER[0]> <T_OP_DIVIDE> <T_NUMBER[0]>

operator:
      <T_OP_PLUS>
    | <T_OP_MINUS>
    | <T_OP_MULTI>

Basically, the tests/tml.pp will be loaded (Compiler\Llk\Llk::load) and parsed. The %import directive will be reached and the src/tml.pp grammar will in turn be loaded and parsed. Then we continue with tests/tml.pp thus overloading rules and tokens from src/tml.pp.

The %import can be used at the top of the file to overload the imported grammar, or anywhere else to produce different results. Grammars are imported relative to the file importing them. If an imported grammar in turn imports another grammar, the import will be relative to the file where the %import directive is written.

What do you think ?

Question: how to access/traverse nodes of grammar

After the PP is read, how can the grammar nodes be accessed / traversed programmatically, before reading in any actual input for the parser?

`compiler:pp -s` and `compiler:pp -v dump` should have colors

@jubianchi did a good POC with it. It is a great idea. We should do that. Maybe @jubianchi could explain more.


allow nesting namespaces

In some cases, it can be useful to come back to the previous namespace (i.e. "the NS we were in before the current one"). Such a feature could be expressed with 2 syntaxes (as examples; one can imagine others).
The following examples illustrate these 2 syntaxes on the json.pp file (look at the %token string:_quote token):

%skip   space          \s
// Scalars.
%token  true           true
%token  false          false
%token  null           null
// Strings.
%token  quote_         "        -> string
%token  string:string  [^"]+
%token  string:_quote  "        -> __PREVIOUS_NS__
// Objects.
%token  brace_         {
%token _brace          }
// Arrays.
%token  bracket_       \[
%token _bracket        \]
// Rest.
%token  colon          :
%token  comma          ,
%token  number         \d+

value:
    <true> | <false> | <null> | string() | object() | array() | number()

string:
    ::quote_:: <string> ::_quote::

number:
    <number>

#object:
    ::brace_:: pair() ( ::comma:: pair() )* ::_brace::

#pair:
    string() ::colon:: value()

#array:
    ::bracket_:: value() ( ::comma:: value() )* ::_bracket::

where __PREVIOUS_NS__ is an alias for __PREVIOUS_1_NS__, and the general form being __PREVIOUS_#_NS__ (# is the number of namespace the compiler should go back)

%skip   space          \s
// Scalars.
%token  true           true
%token  false          false
%token  null           null
// Strings.
%token  quote_         "        -> string
%token  string:string  [^"]+
%token  string:_quote  "        <-
// Objects.
%token  brace_         {
%token _brace          }
// Arrays.
%token  bracket_       \[
%token _bracket        \]
// Rest.
%token  colon          :
%token  comma          ,
%token  number         \d+

value:
    <true> | <false> | <null> | string() | object() | array() | number()

string:
    ::quote_:: <string> ::_quote::

number:
    <number>

#object:
    ::brace_:: pair() ( ::comma:: pair() )* ::_brace::

#pair:
    string() ::colon:: value()

#array:
    ::bracket_:: value() ( ::comma:: value() )* ::_bracket::

where <- means “leave current NS” (implying “go back to previous one”), and with <- being allowed several times (for example <- <- <- goes 3 ns back)

Some questions about the structure of rules

What is the difference between $nodeId and $defaultId in grammar rules?

(new Concatenation($id, $children, $nodeId))->setDefaultId($defaultId);
//                                 ^^^^^^^ - There ------- ^^^^^^^^^^

Why do I need $nodeId in tokens?

new Token($id, $name, $nodeId, $unificationId, $kept);
//                    ^^^^^^^ - excess?

Why determine that the rule is transition, if the identifier can uniquely point to it?

I.e., we can have completely numeric indexes in the rules array, and whether or not a rule appears in the AST will be determined by the presence of a name in the "Entry" element of the trace. This should speed up initialization and fetching (PHP 7+ optimizes arrays with numeric indexes).

Dependency to mbstring

Hi, as the lexer uses mb_strlen, I think that the composer.json should require ext-mbstring or symfony/polyfill-mbstring, don't you ? :)

Madness with exceptions

Maybe it's just a PhpStorm's bug, but I really didn't find the root Exception class.

use Hoa\Exception as HoaException;

— what is that? Aliasing sub-namespace or aliasing class Exception in the namespace Hoa?

If first, how does it work:

class Exception extends HoaException

If second, I see no class Exception in the namespace Hoa (only Hoa\Exception\Exception).

It looks like black magic. What's the purpose?

Also, aliasing is still poorly supported by the most popular IDE (PhpStorm), e.g. find usages ignores aliased classes. I know that it's not an argument ("they should fix it"), but it also leads to confusion when reading code. Is it possible to resolve all aliases?

And I still don't get the purpose of Consistency::flexEntity.

lexer fail for empty tokens

In matchesLexem (https://github.com/hoaproject/Compiler/blob/master/Llk/Lexer.php#L185-L199), if a token allows the empty string to match (for example [a-z]*), then strpos (https://github.com/hoaproject/Compiler/blob/master/Llk/Lexer.php#L191) will report an error because the needle is an empty string.

Either Hoa should complain about tokens allowing an empty match (and the documentation should say it is forbidden),
or Hoa should support empty matches (but I'm not sure it makes sense to have empty tokens…)

Unrecognized Token in Lexer always reports Line 1?

👋 The lexer always reports Line 1 when it encounters an unrecognized token:

https://github.com/hoaproject/Compiler/blob/master/Llk/Lexer.php#L151

This is particularly problematic when the file being parsed has hundreds of lines. I guess this is also related to #97

PP supports in editors

Hello :-),

Supported editors:

Should we add these resources in this library or in Hoa\Devtools @hoaproject/hoackers?

PHP Warning in Parser class

Error message

PHP Warning:  strrpos(): Offset is greater than the length of haystack string in ~/vendor/hoa/compiler/Llk/Parser.php on line 198

Grammar.pp:

%skip  T_IGNORE                 [\xfeff\x20\x09\x0a\x0d]+

%token T_COLON                  :
%token T_BRACE_OPEN             {
%token T_BRACE_CLOSE            }

%token T_NAME                   ([_A-Za-z][_0-9A-Za-z]*)
%token T_SCHEMA_DEFINITION      schema
%token T_TYPE_DEFINITION        type
%token T_ENUM_DEFINITION        enum
%token T_UNION_DEFINITION       union
%token T_INTERFACE_DEFINITION   interface

%token T_SCALAR_INTEGER         Int
%token T_SCALAR_FLOAT           Float
%token T_SCALAR_STRING          String
%token T_SCALAR_BOOLEAN         Boolean
%token T_SCALAR_ID              ID

#Document:
    TypeDefinition()*

#TypeDefinition:
    ::T_TYPE_DEFINITION:: <T_NAME>? ::T_BRACE_OPEN:: Fields()* ::T_BRACE_CLOSE::

#Fields:
    <T_NAME> ::T_COLON:: FieldValue()

#FieldValue:
    <T_SCALAR_INTEGER> | <T_SCALAR_FLOAT> | <T_SCALAR_STRING> | <T_SCALAR_BOOLEAN> | <T_SCALAR_ID> | <T_NAME>

Source code for parsing:

type Test {

}

Fix the README examples

Both the JSON grammar and the generated data are out of date since recent commits on Hoa\Compiler and Hoa\Json. They must be updated.

Parsing tree is just the first token

When running this grammar https://github.com/hamlet-framework/type/blob/master/src/Reader/grammar.pp

against this test set: https://github.com/hamlet-framework/type/blob/master/tests/Reader/ParserTest.php

the output is:

\Hamlet\Cast\Type
>  token(id, \Hamlet\Cast\Type)
array
>  token(array, array)
int
>  token(built_in, int)
array<string, array<string, array{DateTime}>>
>  token(array, array)
array|null|false|1|1.1
>  token(array, array)
array{id:int|null,name?:string|null}
>  token(array, array)
('a'|'b'|'c')
>  token(string:string, a)
'a'|'b'
>  token(string:string, a)
string[][]
>  token(built_in, string)
(1|false)[]
>  token(int_number, 1)
(A::FOO | A::BAR)
>  token(id, A)
int[]
>  token(built_in, int)
callable(('a'|'b'), int):(string|array{\DateTime}|callable():void)
>  token(callable, callable)
array{0: string, 1: string, foo: stdClass, 28: false}
>  token(array, array)
A::class|B::class
>  token(id, A)
Closure(bool):int
>  token(id, Closure)
array<string,\DateTime>
>  token(array, array)
Generator<T0, int, mixed, T0>
>  token(id, Generator)
Generator<T0, int, mixed, T0> & object
>  token(id, Generator)

Why is the output not the complete tree but just the first token?

lexer can match empty value if it changes NS

llk.php disallows a lexeme matching an empty value with the following code

if('' === $matches[0])
    throw new \Hoa\Compiler\Exception\Lexer(
        'A lexeme must not match an empty value, which is the ' .
        'case of "%s" (%s).', 3, array($lexeme, $regex));

but if this lexeme changes the current NS, it can permit a new match in the new NS. So this has to be allowed (maybe only in certain circumstances).

In fact, I have no idea whether such a pattern would be useful, but I found this piece of code theoretically wrong.

doc improvement required

see http://discourse.hoa-project.net/t/hoa-compiler-feedback-help/205

As I said, for me the documentation needs some improvements:

Insist on the importance of order in the PP syntax (the order of the lexemes)
Describe this part http://discourse.hoa-project.net/t/hoa-compiler-feedback-help/205/2 a bit more

Honestly, those were the two points that were missing and for which I needed help.


Multiple start-symbols support

Would be great to have support for multiple start-symbols.

I guess it would not be difficult to add since the start rule is just found in the rules

Compiler/Llk/Parser.php

Line 173 in de036d5

$rule = $this->getRootRule();

Compiler/Llk/Parser.php

Lines 774 to 783 in de036d5

public function getRootRule()
{
    foreach ($this->_rules as $rule => $_) {
        if (!is_int($rule)) {
            break;
        }
    }

    return $rule;
}
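The lookup above returns the first rule whose key is not an integer. A minimal sketch of how an optional start rule could be layered on top of that logic is shown below; `pickStartRule` and its `$requested` parameter are hypothetical and only illustrate the idea on a plain array.

```php
<?php

/**
 * Hypothetical variant of the root-rule lookup: use the requested rule
 * when one is given, otherwise fall back to the first named rule.
 */
function pickStartRule(array $rules, ?string $requested = null): string
{
    if (null !== $requested) {
        if (!array_key_exists($requested, $rules)) {
            throw new InvalidArgumentException('Unknown rule: ' . $requested);
        }

        return $requested;
    }

    // Same idea as getRootRule(): skip the numeric (transitional) rules.
    foreach ($rules as $name => $_) {
        if (!is_int($name)) {
            return $name;
        }
    }

    throw new LogicException('The grammar has no named rule.');
}
```

Keeping the fallback means existing callers that never pass a start rule would keep their current behavior.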

misaligned unicode token when printing token sequence

Since https://github.com/hoaproject/Compiler/blob/master/Bin/Pp.php#L238 uses printf, Unicode strings are not aligned correctly. As pointed out in http://stackoverflow.com/questions/16003505/php-sprintf-with-foreign-characters printf is not UTF-8-aware, and @geraldcroes suggests in an answer:

You can do the trick by doing : utf8_encode(sprintf('format', utf8_decode($yourstring));... Of course you'll have to check every arguments if many are given.
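An alternative to the decode/encode round trip is to pad by character count instead of relying on the width specifier of printf. The sketch below assumes the mbstring extension; `padRight` is a hypothetical helper. Note that mb_strlen counts code points, so double-width or combining characters may still be slightly misaligned in a terminal.

```php
<?php

/**
 * Hypothetical helper: left-align a UTF-8 string in a column
 * of $width characters.
 */
function padRight(string $value, int $width): string
{
    // sprintf('%-3s', ...) pads by bytes; this pads by characters instead.
    $missing = $width - mb_strlen($value, 'UTF-8');

    return $value . str_repeat(' ', max(0, $missing));
}

// "\u{00E9}" is "é": one character, two bytes in UTF-8.
echo '[', sprintf('%-3s', "\u{00E9}"), ']', "\n"; // Padded as if it were 2 wide.
echo '[', padRight("\u{00E9}", 3), ']', "\n";     // Padded to 3 characters.
```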

Space in token

Hi,

I would like to match MATCH with a space at the end. When using regular expressions a space just remains a space. Can the token be enclosed somehow in quotes?

Thanks for support

error when first token is unexpected

When the first token is unexpected, https://github.com/hoaproject/Compiler/blob/master/Llk/Parser.php#L216 will fail because strrpos's offset will be greater than $text's length.

Backtrack issue when rules overlap

Hi!
This library looks really awesome, so I'm playing with it, but I'm facing a "basic" issue and I can't figure out whether it's a limitation, a bug, or my mistake… Can you help me?

Here is a minimal grammar to illustrate my situation:

%token a a
%token word \w+

#root:
    <word> | <a>

I want to match all words, but a is a special keyword, I want to match it distinctly. The problem comes when I try to parse "ab": a is recognized as a token and then the parser is stuck on b character with an UnexpectedToken exception. In my understanding, the parser should backtrack, discard the choice of the token a and follow with the token word… Am I wrong?

ℹ️

If I invert the order of rules, "a" input is identified as a word 👎
I could use %token word a\w+|[^a]\w* at first rule but… looks very weird and hard to maintain IMHO
I could discard the token a, matching words only and use AST to identify my specific keywords, but I think it's the role of the syntax analyzer, isn't it?

Thanks in advance for your help, and your nice work on this library :) 👍
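The behavior described is what one gets when tokenization runs before parsing and the lexer picks the first declared token that matches at the current offset, so the parser has no lexing decision to backtrack over. The toy lexer below (plain PHP, not Hoa's implementation; `toyLex` is a hypothetical name) reproduces the effect under that assumption and shows one possible workaround: a word boundary on the keyword token.

```php
<?php

/**
 * Toy first-match lexer: tokens are tried in declaration order
 * and the first one that matches at the current offset wins.
 */
function toyLex(array $tokens, string $input): array
{
    $out    = [];
    $offset = 0;

    while ($offset < strlen($input)) {
        foreach ($tokens as $name => $regex) {
            $pattern = '#\G(?:' . $regex . ')#';
            if (1 === preg_match($pattern, $input, $m, 0, $offset) && '' !== $m[0]) {
                $out[]   = $name . ':' . $m[0];
                $offset += strlen($m[0]);
                continue 2; // Token found, go on with the rest of the input.
            }
        }

        throw new RuntimeException('Unrecognized input at offset ' . $offset);
    }

    return $out;
}

print_r(toyLex(['a' => 'a',   'word' => '\w+'], 'ab')); // "ab" is split in two.
print_r(toyLex(['a' => 'a\b', 'word' => '\w+'], 'ab')); // "ab" stays one word.
```

With `a\b`, the keyword only matches when it is not followed by another word character, so "a" is still lexed as the keyword while "ab" falls through to word.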

Recursive / circular rules leads to an infinite loop

Hello,
It seems that when a rule is recursive or has a circular dependency with other rules, the parser goes into an infinite loop. During the execution of Hoa\Compiler\Llk\Parser::unfold, instead of decreasing, Hoa\Compiler\Llk\Parser::_todo keeps growing, so the while loop never stops.

I have this behavior with hoa/compiler 3.17.08.08
Example to reproduce:

<?php
use Hoa\Compiler\Llk\Llk;
use Hoa\Compiler\Visitor\Dump;
use Hoa\File\ReadWrite;

$file = new ReadWrite('php://temp');
$file->writeString(<<<PP
%skip whitespace \s
%token and &&
%token integer \d+
%token foo_ \(
%token _foo \)

rule:
    _rule() | ::foo_:: _rule() ::_foo::  
_rule:
    (::integer:: | rule()) ::and:: (::integer:: | rule())
PP
);

$ast = Llk::load($file)->parse(<<<CODE
1 && (2 && 3) && 4
CODE
);

echo (new Dump())->visit($ast);

Add $node->getOffset() support

In addition to syntax analysis (lex), there is also semantic analysis of the code. When a semantic error occurs, we need to know exactly where it happened. For example:

function a(int $b) {}
a(null);

// Error: Integer required in XXX file on line 2 and offset 3, but null given.
// a(null);
//    ^

The only way to get this information is the TreeNode instance:

class TreeNode {
    +public function getOffset(): int;
}

I tried to implement this in the form of Pull Request, but the source code is pretty confusing =\

support off-side rule languages

It seems Hoa\Compiler cannot parse Off-side rule languages.

Maybe it would be sufficient to have the compiler automatically add INDENT (respectively UNINDENT) tokens each time the indentation increases (respectively decreases) by one level.

The tricky part seems to be the matching between spaces, tab, and indent length…
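The usual approach is a pre-pass that keeps a stack of indentation widths and emits a token whenever the width changes. Below is a minimal sketch of that idea in plain PHP. It is independent of Hoa\Compiler, `indentTokens` is a hypothetical name, and it sidesteps the tricky part by assuming indentation is made of spaces only.

```php
<?php

/**
 * Hypothetical pre-pass: turn leading spaces into INDENT / UNINDENT
 * markers, using a stack of the indentation widths currently open.
 */
function indentTokens(string $source): array
{
    $tokens = [];
    $stack  = [0];

    foreach (explode("\n", $source) as $line) {
        if ('' === trim($line)) {
            continue; // Blank lines do not change the indentation level.
        }

        $width = strlen($line) - strlen(ltrim($line, ' '));

        if ($width > end($stack)) {
            $stack[]  = $width;
            $tokens[] = 'INDENT';
        }
        while ($width < end($stack)) {
            array_pop($stack);
            $tokens[] = 'UNINDENT';
        }
        if ($width !== end($stack)) {
            throw new RuntimeException('Inconsistent indentation: ' . $line);
        }

        $tokens[] = 'LINE(' . ltrim($line, ' ') . ')';
    }

    // Close every block that is still open at the end of the input.
    while (count($stack) > 1) {
        array_pop($stack);
        $tokens[] = 'UNINDENT';
    }

    return $tokens;
}
```

Because the stack stores widths rather than levels, a block may be indented by any number of spaces as long as it later returns to a width that is still open.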

rule() containing #nodes

Here is a grammar I extracted from Hoa\Math

%skip   space     \s
%token  number    (0|[1-9]\d*)(\.\d+)?([eE][\+\-]?\d+)?
%token  plus      \+
%token  minus     \-|−

exp:
    <number>
    (( ::plus:: #add | ::minus:: #sub ) exp())?

I changed it into this

%skip   space     \s
%token  number    (0|[1-9]\d*)(\.\d+)?([eE][\+\-]?\d+)?
%token  plus      \+
%token  minus     \-|−

exp:
    <number>
    exp2()?

exp2:
    ( ::plus:: #add | ::minus:: #sub ) exp()

I was hoping it would work as the previous grammar did, but here are the ASTs:
First grammar

#sub

token(number, 1)
#add

token(number, 2)
token(number, 3)

Second grammar

token(number, 1)

Is it normal that the rule exp2() is not handled at all (even though the trace shows that it goes through it)?
Thank you

Dependabot can't resolve your PHP dependency files

Dependabot can't resolve your PHP dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:

Your requirements could not be resolved to an installable set of packages.
  Problem 1
    - hoa/math 1.16.01.15 requires hoa/compiler ~3.0 -> satisfiable by hoa/compiler[3.16.01.11, 3.16.01.14, 3.16.08.15, 3.16.10.24, 3.17.01.10, 3.17.08.08].
    - hoa/math 1.16.01.29 requires hoa/compiler ~3.0 -> satisfiable by hoa/compiler[3.16.01.11, 3.16.01.14, 3.16.08.15, 3.16.10.24, 3.17.01.10, 3.17.08.08].
    - hoa/math 1.16.05.22 requires hoa/compiler ~3.0 -> satisfiable by hoa/compiler[3.16.01.11, 3.16.01.14, 3.16.08.15, 3.16.10.24, 3.17.01.10, 3.17.08.08].
    - hoa/math 1.16.08.29 requires hoa/compiler ~3.0 -> satisfiable by hoa/compiler[3.16.01.11, 3.16.01.14, 3.16.08.15, 3.16.10.24, 3.17.01.10, 3.17.08.08].
    - hoa/math 1.17.01.13 requires hoa/compiler ~3.0 -> satisfiable by hoa/compiler[3.16.01.11, 3.16.01.14, 3.16.08.15, 3.16.10.24, 3.17.01.10, 3.17.08.08].
    - hoa/math 1.17.05.16 requires hoa/compiler ~3.0 -> satisfiable by hoa/compiler[3.16.01.11, 3.16.01.14, 3.16.08.15, 3.16.10.24, 3.17.01.10, 3.17.08.08].
    - Can only install one of: hoa/compiler[3.16.01.14, No version set (parsed as 1.0.0)].
    - Can only install one of: hoa/compiler[3.16.08.15, No version set (parsed as 1.0.0)].
    - Can only install one of: hoa/compiler[3.16.10.24, No version set (parsed as 1.0.0)].
    - Can only install one of: hoa/compiler[3.17.01.10, No version set (parsed as 1.0.0)].
    - Can only install one of: hoa/compiler[3.17.08.08, No version set (parsed as 1.0.0)].
    - hoa/compiler 3.16.01.11 requires hoa/file ~0.0 -> satisfiable by hoa/file[0.14.09.16, 0.14.09.17, 0.14.09.23, 0.14.11.09, 0.14.11.26, 0.14.12.10, 0.15.02.19, 0.15.05.12, 0.15.05.27, 0.15.11.09] but these conflict with your requirements or minimum-stability.
    - hoa/math 1.16.01.14 requires hoa/zformat ~0.0 -> no matching package found.
    - Installation request for hoa/compiler No version set (parsed as 1.0.0) -> satisfiable by hoa/compiler[No version set (parsed as 1.0.0)].
    - Installation request for hoa/math ~1.0 -> satisfiable by hoa/math[1.16.01.14, 1.16.01.15, 1.16.01.29, 1.16.05.22, 1.16.08.29, 1.17.01.13, 1.17.05.16].

Potential causes:
 - A typo in the package name
 - The package is not available in a stable-enough version according to your minimum-stability setting
   see <https://getcomposer.org/doc/04-schema.md#minimum-stability> for more details.
 - It's a private package and you forgot to add a custom repository to find it

Read <https://getcomposer.org/doc/articles/troubleshooting.md> for further common problems.

If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

You can mention @dependabot in the comments below to contact the Dependabot team.

Hoa Compiler and (E)BNF

Hi, I have some questions about the compiler. I'm considering implementing an existing Domain Specific Language (DSL) in PHP, so I would need a compiler for it. However, I noticed that most existing languages are defined in Backus–Naur Form (BNF) or Extended Backus–Naur Form (EBNF), which raises a few questions:

Is the PP language mainly intended for creating new DSLs rather than porting existing ones?
Why was the PP language created instead of a compiler for (E)BNF?
Would it be worth porting an EBNF grammar to PP, or would I be better off using another compiler?
If conversion is possible, please give a small example. http://doctrine-orm.readthedocs.org/en/latest/reference/dql-doctrine-query-language.html#query-language

For completeness, here is a list of resources I found on PHP, compilers and (E)BNF:

EBNF & some other stuff
http://karmin.ch/ebnf/index
http://sourceforge.net/projects/lime-php/
https://github.com/hafriedlander/php-peg
http://php.comsci.us/syntax/statement/ebnf.php
http://php.comsci.us/syntax/statement/bnf.php
http://marc.info/?l=php-internals&m=129387252319019
https://github.com/ferno/loco/blob/master/ebnf.php

BNF
http://www.garshol.priv.no/download/text/bnf.html
http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_language_parser.y
http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_language_scanner.l
http://www.icosaedro.it/articoli/php-syntax-ebnf.txt
http://www.icosaedro.it/articoli/php-syntax-yacc.txt
http://www.phpclasses.org/package/7142-PHP-Parse-language-source-with-a-BNF-grammar-syntax.html
https://github.com/ferno/loco/blob/master/bnf.php
http://code.google.com/p/pragmatic-parser/source/browse/trunk/parser.class.php?r=2

doc : improve explanation on namespace

After reading the topic http://discourse.hoa-project.net/t/hoa-compiler-keyword-identifier-clash/252, I understand that namespaces help avoid collisions between tokens and also isolate each namespace from the others (correct?).

So it could be interesting to add a section about this to the namespace documentation, covering why to use namespaces and what their purpose is, with a good example ;)

Remove dependency to `ext/ctype`

It has been noted in #93 (comment) that hoa/compiler depends on ext/ctype just because of a single call to ctype_digit in a non-critical path. I think it's a good idea to remove this dependency :-).
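One way to drop the call without another extension is a small pure-PHP check. This is only a sketch: `isDigits()` is a hypothetical name, and the issue does not show how the library actually calls `ctype_digit`, so the snippet assumes a string argument made of ASCII characters.

```php
<?php

/**
 * Hypothetical stand-in for ctype_digit() on string input:
 * true only for a non-empty string made solely of the digits 0-9.
 */
function isDigits(string $value): bool
{
    // strspn() returns the length of the leading run of allowed characters;
    // the string is all digits when that run covers the whole string.
    return $value !== '' && strspn($value, '0123456789') === strlen($value);
}

var_dump(isDigits('2017')); // bool(true)
var_dump(isDigits('20a7')); // bool(false)
var_dump(isDigits(''));     // bool(false)
```

`strspn()` and `strlen()` are part of the PHP core, so no extension has to be declared for them.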

use of non multibyte string functions can lead to erroneous error messages

While manipulating $offset for #6, I noticed the following: if you use UTF-8 characters with PP, error messages (at least) can be wrong. Consider, for example, the following PP grammar

%skip       SPACE       \s
%token      FOO         foo
%token      BAR         bar
%token      BAZ         baz
%token      CHECK       ✓

#doc:
    <FOO> <BAR> <CHECK> <BAZ>

with the following string

foo bar ✓ pouet baz

error message will be

Hoa\Compiler\Llk\Lexer::lexMe(): (0) Unrecognized token "p" at line 1 and column 13:
foo bar ✓ pouet baz
            ↑
in /media/Data/Matthieu/Documents/hoa/Libs/Compiler/Llk/Lexer.php at line 1.%

but p is actually at column 11, not 13

Note: I suppose used char
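The discrepancy can be reproduced outside the lexer. The sketch below assumes the input is UTF-8 and that the mbstring extension is available; `byteColumn()` and `charColumn()` are hypothetical helpers for illustration, not functions of Hoa\Compiler.

```php
<?php

/** 1-based column obtained by counting bytes before the offset. */
function byteColumn(string $text, int $byteOffset): int
{
    return strlen(substr($text, 0, $byteOffset)) + 1;
}

/** 1-based column obtained by counting UTF-8 characters before the offset. */
function charColumn(string $text, int $byteOffset): int
{
    return mb_strlen(substr($text, 0, $byteOffset), 'UTF-8') + 1;
}

$input  = 'foo bar ✓ pouet baz';
$offset = strpos($input, 'pouet'); // byte offset of the unrecognized "p"

echo byteColumn($input, $offset), "\n"; // prints 13
echo charColumn($input, $offset), "\n"; // prints 11
```

The check mark occupies three bytes in UTF-8, which accounts for the two extra columns in the byte-based count.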

Strings are not well formed by the Sampler.

How to reproduce:

Grammar.pp

%skip   space          \s
// Strings.
%token  quote_         "        -> string
%token  string:string  [a-z]+
%token  string:_quote  "        -> default

value:
    string()

string:
    ::quote_:: <string> ::_quote::

Run script:

$sampler = new Hoa\Compiler\Llk\Sampler\Coverage(
    Hoa\Compiler\Llk\Llk::load(
        new Hoa\File\Read('Grammar.pp')
    ),
    new Hoa\Regex\Visitor\Isotropic(
        new Hoa\Math\Sampler\Random()
    )
);

foreach($sampler as $i => $value)
    echo $value;

This will output something like:

" gvqjcd "

But in JSON there is no space after the default:quote_ and string:string tokens.

The spaces are added by Sampler::generateToken() after each token.

I've temporarily fixed it with a small patch, but I don't know how it behaves with other languages...

    protected function generateToken ( \Hoa\Compiler\Llk\Rule\Token $token ) {

        $toNamespace = $this->completeToken($token);
        $this->setCurrentNamespace($toNamespace);

        $string = $this->_tokenSampler->visit(
            $token->getAST()
        );

        if (   (   'quote_'  != $token->getTokenName()
                || 'default' != $token->getNamespace())
            && (   'string'  != $token->getTokenName()
                || 'string'  != $token->getNamespace()))
            $string .= ' '; // @todo: use skip token.

        return $string;
    }

I think we must add a PP keyword (or something else) to specify that no space must be added after a given token.

PS: By the way, <escaped> in Json\Grammar.pp produces invalid JSON.

Bug when saving parser class

I have the token \, which I need to escape once because PP uses regular expressions, so it becomes:

%token token62 \\

When this is written as a PHP class, the backslash is only escaped once, like so:

'token62' => '\\\',

It doesn't matter if there are more surrounding characters (I have a few other rules that suffer from the same bug).

    public function getRootRule()
    {
        foreach ($this->_rules as $rule => $_) {
            if (!is_int($rule)) {
                break;
            }
        }

        return $rule;
    }
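For the escaping problem reported above, note that `'\\\'` is not even a terminated string literal in PHP, since the final `\'` is read as an escaped quote. A minimal sketch of emitting token regexes safely is to let `var_export()` do the quoting. `exportTokens()` is a hypothetical helper and is not how the library's own generator is written.

```php
<?php

/**
 * Hypothetical helper: render a token map as PHP source code.
 * var_export() escapes backslashes and single quotes, so the
 * generated literal evaluates back to the original regex.
 */
function exportTokens(array $tokens): string
{
    $lines = [];

    foreach ($tokens as $name => $regex) {
        $lines[] = '    ' . var_export($name, true) . ' => ' . var_export($regex, true) . ',';
    }

    return "[\n" . implode("\n", $lines) . "\n]";
}

// The regex from the grammar is two characters long: a backslash escaping a backslash.
$tokens = ['token62' => '\\\\'];
$code   = exportTokens($tokens);

echo $code, "\n";

// Evaluating the generated source (for demonstration only) yields the same map.
$roundTrip = eval('return ' . $code . ';');
var_dump($roundTrip === $tokens); // bool(true)
```

In the generated source the two-character regex appears as four backslashes between single quotes, which PHP reads back as the original two.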
