
Comments (8)

w4454962 commented on June 5, 2024

PEGTL takes 4 seconds to parse a 27 MB, 300,000-line JASS file into a syntax tree.

https://github.com/w4454962/jass-parser

LuaJIT + LPeg takes 2 seconds to parse the same 27 MB, 300,000-line JASS file into a syntax tree.

https://github.com/w4454962/jass-parser-luajit

Parsing the same content with equivalent rules in bison/yacc takes only 0.7 seconds.

https://github.com/lep/pjass

w4454962 commented on June 5, 2024

Is there a way to implement a PEGTL version that comes close to the performance of the bison/yacc approach?

d-frey commented on June 5, 2024

It looks like you just created a static stack to avoid allocations. Have you tried using a cache like the one in our example?

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <new>

// parse_tree::basic_node comes from <tao/pegtl/contrib/parse_tree.hpp>,
// assumed to be included and its namespace in scope.

namespace internal
{
   // A non-thread-safe allocation cache, assuming that the size (sz) is always the same!
   template< std::size_t N >
   struct cache
   {
      std::size_t pos = 0;
      std::array< void*, N > data;

      cache() = default;
      cache( const cache& ) = delete;
      cache( cache&& ) = delete;

      ~cache()
      {
         // Return any still-cached blocks to the global heap.
         while( pos != 0 ) {
            ::operator delete( data[ --pos ] );
         }
      }

      cache& operator=( const cache& ) = delete;
      cache& operator=( cache&& ) = delete;

      void* get( std::size_t sz )
      {
         // Reuse a cached block when available, otherwise allocate.
         if( pos != 0 ) {
            return data[ --pos ];
         }
         return ::operator new( sz );
      }

      void put( void* p )
      {
         // Keep the block for later reuse while there is room in the cache.
         if( pos < N ) {
            data[ pos++ ] = p;
         }
         else {
            ::operator delete( p );
         }
      }
   };

   static cache< 32 > the_cache;

}  // namespace internal

// This is not necessary for the example, but serves as a demonstration of an additional optimization.
struct node
   : parse_tree::basic_node< node >
{
   void* operator new( std::size_t sz )
   {
      assert( sz == sizeof( node ) );
      return internal::the_cache.get( sz );
   }

   void operator delete( void* p )
   {
      internal::the_cache.put( p );
   }
};
```
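To see the effect in isolation, here is a minimal self-contained sketch of the same idea, with a plain struct standing in for the parse-tree node (the `demo_` names are illustrative, not from the PEGTL): after a `delete`, the very next `new` of the same size gets the cached block back instead of going to the heap.

```cpp
#include <array>
#include <cstddef>
#include <new>

// Same idea as above, reduced to a standalone demo: a fixed-size
// free list that hands back previously freed blocks of one size.
template< std::size_t N >
struct demo_cache
{
   std::size_t pos = 0;
   std::array< void*, N > data;

   ~demo_cache()
   {
      while( pos != 0 ) {
         ::operator delete( data[ --pos ] );
      }
   }

   void* get( std::size_t sz )
   {
      return ( pos != 0 ) ? data[ --pos ] : ::operator new( sz );
   }

   void put( void* p )
   {
      if( pos < N ) {
         data[ pos++ ] = p;
      }
      else {
         ::operator delete( p );
      }
   }
};

static demo_cache< 32 > the_demo_cache;

struct demo_node
{
   int value = 0;
   void* operator new( std::size_t sz ) { return the_demo_cache.get( sz ); }
   void operator delete( void* p ) { the_demo_cache.put( p ); }
};

// After a delete, the next new of the same size reuses the cached block.
bool cache_reuses_block()
{
   demo_node* a = new demo_node();
   void* first = a;
   delete a;                        // goes into the cache, not back to the heap
   demo_node* b = new demo_node();  // comes back out of the cache
   const bool reused = ( static_cast< void* >( b ) == first );
   delete b;
   return reused;
}
```

This saves one heap round-trip per node on the hot path, which is exactly where a parser that allocates one node per rule spends much of its time.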

As for the comparison to luajit + lpeg or even to bison/yacc: I don't think we will ever be that fast with our parse tree. And I don't have time to look into this more, sorry.

w4454962 commented on June 5, 2024

OK, I did overlook the static cache code in the demo; it already exists and matches my idea. Thank you for your reply. If the speed can't be improved further, we will have to give up on PEGTL.

ColinH commented on June 5, 2024

The PEGTL is highly modular and there are many things that can be done to improve parsing performance.

However the parse_tree included with the PEGTL is more of an educational example that shows how to parse into a certain kind of data structure built around standard (template) library facilities, and less of a highly optimised way to build a tree when speed is important.

One important thing to keep an eye on, independent of the data structure you generate while parsing, is to keep the grammar as simple as possible, and in particular to tune it for minimal backtracking (since backtracking always implies wasted work).

For example you have several rules that are implemented as sor< one< A, B, ... > > instead of the equivalent one< A, B, ... >. Now this particular case is handled efficiently by the PEGTL and does not produce a run-time overhead, however every superfluous sub-rule in the grammar makes it more complicated to parse (no pun intended) and reason about as a human.

Similarly you frequently use star< seq< A, B, ... > > instead of the equivalent, and shorter, star< A, B, ... > (star has an implicit seq), unless of course you really need the seq to attach actions to, or for parse_tree nodes. There are other examples of this kind of potentially superfluous rules in your grammar.

Then there is the way you parse expressions, where the space rule before the operators will typically be parsed multiple times before the right alternative, as determined by the operator after the space, is found.
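The simplifications above can be sketched side by side. The empty templates below only mimic the shape of the PEGTL combinators so the sketch is self-contained (in real code they come from `tao::pegtl`), and the leaf rules (`space`, `plus_op`, `minus_op`, `digit`, `comma`) are hypothetical stand-ins:

```cpp
// Empty stand-ins mimicking the shape of the PEGTL combinators.
template< typename... > struct seq {};
template< typename... > struct sor {};
template< typename... > struct star {};
template< char... > struct one {};

// Hypothetical leaf rules.
struct space {};
struct plus_op {};
struct minus_op {};
struct digit {};
struct comma {};

// 1. sor< one< ... > > is redundant: one<> already tries each character.
using redundant_sor = sor< one< '+', '-' > >;
using simpler_sor   = one< '+', '-' >;

// 2. star< seq< A, B > > can be written star< A, B >: star has an implicit seq.
using verbose_star = star< seq< comma, digit > >;
using simpler_star = star< comma, digit >;

// 3. Hoisting a common prefix out of an sor avoids re-parsing it when an
//    alternative fails: here `space` is matched once instead of once per branch.
using reparses_space = sor< seq< space, plus_op >, seq< space, minus_op > >;
using hoisted_space  = seq< space, sor< plus_op, minus_op > >;
```

The first two rewrites mainly help readability; the third one actually removes repeated work at run time.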

And another thing to look at is whether the sub-rules of an sor are, if possible, ordered from most likely to least likely (as they will be attempted to match in the order listed in the grammar).

Then there is the possibly best chance of improving performance, bypassing the generic parse_tree and directly parsing into a more "strongly typed" data structure that minimises heap allocations. I can't say more about this without spending significantly more time with your code.
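One common shape for such a structure, sketched here without any reference to the jass-parser code (all names are illustrative): keep every node in one contiguous arena and refer to children by index, so building the tree costs amortised vector growth instead of one heap allocation per node, and node text is a view into the input buffer rather than a copy.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Node kinds a strongly typed tree might distinguish; names are illustrative.
enum class kind { function, statement, expression };

struct arena_node
{
   kind k;
   std::string_view text;  // points into the input buffer, no copy
   std::size_t first_child = 0;
   std::size_t child_count = 0;
};

// All nodes live in one vector; an index identifies a node.
struct arena_tree
{
   std::vector< arena_node > nodes;

   std::size_t add( kind k, std::string_view text )
   {
      nodes.push_back( { k, text, 0, 0 } );
      return nodes.size() - 1;
   }
};

// Build a tiny two-node tree: a function with one statement child.
arena_tree make_demo_tree()
{
   arena_tree t;
   const std::size_t f = t.add( kind::function, "function foo" );
   const std::size_t s = t.add( kind::statement, "return 1" );
   t.nodes[ f ].first_child = s;
   t.nodes[ f ].child_count = 1;
   return t;
}
```

Actions (or a custom parse-tree selector) would append into such an arena while parsing; `nodes.reserve()` with an estimate of the node count removes most of the remaining allocations.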

w4454962 commented on June 5, 2024


I have heavily optimized the higher-level logic, but the speed has not improved by an order of magnitude, only from 4.2 s to 4.0 s. It may be necessary to optimize PEGTL itself, or to optimize parse_tree more deeply.

w4454962 commented on June 5, 2024

Also, the cache method demonstrated in the PEGTL example is about 1 s slower than my own cache code. I don't know how to design a more efficient parse_tree; it seems to have hit a bottleneck.

ColinH commented on June 5, 2024

The idea is not to make the parse tree as fast as possible, but to create a more specific and optimised data structure while parsing, one that minimises memory allocations and wasted work...
