Comments (8)
It takes 4 seconds for pegtl to parse 300000 lines of 27mb jass syntax tree
https://github.com/w4454962/jass-parser
luajit + lpeg parsing 300000 lines of 27mb jass syntax tree takes 2 seconds
https://github.com/w4454962/jass-parser-luajit
It only takes 0.7 seconds to parse the same content using the same rules of bison/yacc
from pegtl.
Is there a way to implement a pegtl version that is close to the running efficiency of the bison / yacc scheme?
from pegtl.
It looks like you just created a static stack to avoid allocations. Have you tried to use a cache like in our example at
PEGTL/src/example/pegtl/parse_tree.cpp
Lines 99 to 160 in 3fbf672
As for the comparison to luajit + lpeg or even to bison/yacc: I don't think we will ever be that fast with our parse tree. And I don't have time to look into this more, sorry.
from pegtl.
OK, I really ignored the static cache code in the demo. It already exists and is the same as my idea. Thank you for your reply. If the speed can't be faster, we can only give up pegtl
from pegtl.
The PEGTL is highly modular and there are many things that can be done to improve parsing performance.
However the parse_tree
included with the PEGTL is more of an educational example that shows how to parse into a certain kind of data structure built around standard (template) library facilities, and less of a highly optimised way to build a tree when speed is important.
One important thing to keep an eye on if you care about performance that is independent of the data structure you generate while parsing is to keep the grammar as simple as possible, and in particular to tune it for minimal backtracking (seeing as backtracking always implies wasted work).
For example you have several rules that are implemented as sor< one< A, B, ... > >
instead of the equivalent one< A, B, ... >
. Now this particular case is handled efficiently by the PEGTL and does not produce a run-time overhead, however every superfluous sub-rule in the grammar makes it more complicated to parse (no pun intended) and reason about as a human.
Similarly you frequently use star< seq< A, B, ... > >
instead of the equivalent, and shorter, star< A, B, ... >
(star
has an implicit seq
), unless of course you really need the seq
to attach actions to, or for parse_tree
nodes. There are other examples of this kind of potentially superfluous rules in your grammar.
Then there is the way you parse expressions, where the space
rule before the operators will typically be parsed multiple times before the right one, as determined by the operator after the space
, was found.
And another thing to look at is whether the sub-rules of an sor
are, if possible, ordered from most likely to least likely (as they will be attempted to match in the order listed in the grammar).
Then there is the possibly best chance of improving performance, bypassing the generic parse_tree
and directly parsing into a more "strongly typed" data structure that minimises heap allocations. I can't say more about this without spending significantly more time with your code.
from pegtl.
The PEGTL is highly modular and there are many things that can be done to improve parsing performance.
However the
parse_tree
included with the PEGTL is more of an educational example that shows how to parse into a certain kind of data structure built around standard (template) library facilities, and less of a highly optimised way to build a tree when speed is important.One important thing to keep an eye on if you care about performance that is independent of the data structure you generate while parsing is to keep the grammar as simple as possible, and in particular to tune it for minimal backtracking (seeing as backtracking always implies wasted work).
For example you have several rules that are implemented as
sor< one< A, B, ... > >
instead of the equivalentone< A, B, ... >
. Now this particular case is handled efficiently by the PEGTL and does not produce a run-time overhead, however every superfluous sub-rule in the grammar makes it more complicated to parse (no pun intended) and reason about as a human.Similarly you frequently use
star< seq< A, B, ... > >
instead of the equivalent, and shorter,star< A, B, ... >
(star
has an implicitseq
), unless of course you really need theseq
to attach actions to, or forparse_tree
nodes. There are other examples of this kind of potentially superfluous rules in your grammar.Then there is the way you parse expressions, where the
space
rule before the operators will typically be parsed multiple times before the right one, as determined by the operator after thespace
, was found.And another thing to look at is whether the sub-rules of an
sor
are, if possible, ordered from most likely to least likely (as they will be attempted to match in the order listed in the grammar).Then there is the possibly best chance of improving performance, bypassing the generic
parse_tree
and directly parsing into a more "strongly typed" data structure that minimises heap allocations. I can't say more about this without spending significantly more time with your code.
I have greatly optimized the upper logic code, but the speed has not been improved by an order of magnitude. It is only accelerated from 4.2s to 4.0s. It may be more necessary to optimize pegtl itself or more deeply optimize parse_ tree
from pegtl.
And if you use the cache method demonstrated by pegtl, it will be 1s slower than my cache code, I don't know how to design a more efficient "parse_tree", which seems to have reached the bottleneck stage.
from pegtl.
The idea is not to make the parse tree as fast as possible, but to create a more specific and optimised data structure while parsing, one that minimises memory allocations and wasted work...
from pegtl.
Related Issues (20)
- [Git main] Merge conflict markers in file "doc/Changelog.md" HOT 2
- [>=3.2.1] Compile error with GCC 9.2 — error: must '#include <typeinfo>' before using 'typeid' HOT 6
- How do I capture each substring at run time HOT 3
- MSVC: error C2338: static_assert failed: 'internal::dependent_true< T > && ( begin != std::string_view::npos ) HOT 3
- Example grammar proto3 does not accept enum fields starting from zero HOT 2
- data type of input? byte? character? HOT 3
- <ciso646> is removed in C++20 and should not be included HOT 1
- Feature Request: Add defines to exclude headers to improve compile time HOT 4
- Does "pegtl" support the operation of binary data serialization/deserialization? HOT 2
- Why parsing succeeds? HOT 3
- Why do I have an infinite loop? HOT 4
- parser_tree.cpp example not compiling in VS 2022, as of PEGTL 3.2.6 HOT 4
- Order independence of rules HOT 7
- list_tail<> invokes action for trailing separator twice? HOT 10
- Parsing Binary Data Encounters Left recursion Problem HOT 9
- Any consideration of ghc::filesystem HOT 8
- Can't get custom error messages to work. HOT 8
- Backreferences and grammar tracing/analyzing. HOT 5
- vs2022 编译错误 x64-windows-static\include\tao\pegtl\parse.hpp(45,38): error C2062: 意外的类型“unknown-type”
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pegtl.