Comments (15)
Added the arithmetic expression parser.
https://github.com/GregRos/parjs/blob/master/examples/math.ts
Also moved to reference lodash on a per-function basis. Also removed dependency on chai.
A minified, uglified webpack bundle of the library is now around 170kb. Making the Unicode data modular and opt-in will probably bring the bundle size down to 90kb or so. Gzip will probably reduce those figures by a factor of 2-3.
from parjs.
I was actually just playing around with parser combinator libraries trying to learn about them when I ran into this library.
Since you're asking for direction, I would vote for (3) then (1) then (2). Also maybe loosen dependencies.
Here is why: I don't know much about parser theory and I've never written a parser combinator. Most libraries have an example to parse json. So usually I copy the example and run it on a json blob using benchmarkjs. I think by doing (3) (even if it's just json), it will make it easier to do (1) (even if it's not a very rigorous benchmark).
Also I noticed you have lodash as a dependency, and it doesn't always seem to be used? I looked into this because I tried to create a minimal umd bundle of a parser and it was ~1mb. A similar bundle with parsimmon was about 20kb.
from parjs.
Thanks for the input 😄
That sounds like a good idea. I'll do an arithmetic expression parser, JSON parser, and hopefully a Markdown parser.
Thanks for the tip about the bundle size. It's a pretty big issue. I'll try to fix it, maybe create a lighter version. I can't really compete with Parsimmon in that respect because that library is just one javascript file with around 1000 lines. This library is much bigger than that and has many more features. But 1mb is just way too big for anything.
Lodash is used quite often, but if it's a problem maybe I'll reference individual lodash functions instead of the whole library. I don't use all of it.
from parjs.
No problem! It looks very promising, keep up the great work. Let me know if there is anything in particular you would like help with.
I also think the arithmetic parser would be great, that happens to be very close to my use-case.
from parjs.
Ditto for the JSON parser.
I reduced the size of the library as I said above and the results are pretty close to what I expected.
Note that these figures are when compiling the TypeScript to ES6 (which I've done in the library). Compiling to ES5 may increase size by a factor of 2. ES6 is supported by all modern browsers.
from parjs.
Nice! I'm really glad the bundle info is on the readme as well. Great work!
from parjs.
I was reading https://tomassetti.me/parsing-in-javascript/#apg. It introduces many JS parser combinator libraries, and parjs is listed last, but I would choose parjs since it is the most modular one. Using parjs is like using pyparsing.
But pyparsing has a downside: it can't easily deal with Unicode. Parsing Chinese files is hard, and examples are lacking. I expect parjs to be more comfortable to use with Chinese.
from parjs.
Thanks! It's nice seeing my work being discussed, and I'm happy you appreciate it.
parjs works on JavaScript characters (UTF-16 code units), which work fine for anything in the BMP (this includes almost all CJK characters). However, you might run into trouble if you want to parse symbols composed of multiple characters. A Latin example is something like â̵̳, which is a sequence of around 4 characters but appears as one "letter". Parjs will recognize it as a string of length 4 and will treat it as such.
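To make the point about string length concrete, here is a small plain-TypeScript sketch. It does not use parjs at all; the sample strings and the `countCodePoints` helper are made up for illustration, and the astral character is an extra case not mentioned above.

```typescript
// "a" followed by three combining marks; it renders as one visible letter.
const combined = "a\u0302\u0335\u0333";
// A CJK character inside the BMP: a single UTF-16 code unit.
const bmpChar = "股";
// A CJK character outside the BMP (U+20000), written as its surrogate pair.
const astralChar = "\uD840\uDC00";

// String length counts UTF-16 code units, not visible letters.
const combinedLength = combined.length; // 4
const bmpLength = bmpChar.length; // 1
const astralLength = astralChar.length; // 2

// Hypothetical helper: count code points by treating a surrogate pair as one.
function countCodePoints(s: string): number {
    let count = 0;
    for (let i = 0; i < s.length; i++) {
        const unit = s.charCodeAt(i);
        if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
            const next = s.charCodeAt(i + 1);
            // Skip the low surrogate so the pair is counted once.
            if (next >= 0xdc00 && next <= 0xdfff) i++;
        }
        count++;
    }
    return count;
}
```

Counting code points fixes the surrogate-pair case but not the combining-mark case: `countCodePoints(combined)` is still 4, because each combining mark is its own code point.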
I don't know much about Chinese, but I know Korean characters have both combining and non-combining variants. Can you give me an example of what you want to parse?
Also, you can use the char-info package to detect characters of a certain script (like Chinese).
from parjs.
Would this be of interest to you?
- General Docs: http://sap.github.io/chevrotain/website/Deep_Dive/custom_apis.html
- This is a toy API example of what can be done: https://github.com/SAP/chevrotain/blob/master/examples/custom_apis/combinator/combinator_grammar.js
Basically I've exposed the Chevrotain parsing engine so it can be used to implement other parsing libraries including combinators.
Naturally, this has limitations:
- Relies on automatic output of a ParseTree (CST) and visiting it, no embedded actions support yet.
- Relies on "new Function()" so won't work where content security policy is enabled without first generating code.
- Distinct lexing phase unlike a PEG grammar.
But also has advantages due to using a mature engine:
- Very High Performance.
- Debuggable (you can step into the generated code that implements the combinator).
from parjs.
Overly long parser definitions have always been a problem. I'm looking for a way to reduce the amount of code I need to define a parser.
I'm working on a Python project, so I'm using pyparsing. There is so much boilerplate; it looks like this in Python:
...
stock_parser = oneOf(stock_types).setResultsName('method') + Optional(oneOf('共 合')) + Optional('计') + stock_amount_parser ^ \
(stock_amount_parser + oneOf(stock_types).setResultsName('method'))
...
my_parser = OneOrMore(SkipTo(price_parser + SkipTo(stock_parser, include=True), include=True)) ^ OneOrMore(SkipTo(stock_parser + SkipTo(price_parser, include=True), include=True)) ^ SkipTo(stock_parser, include=True)
It's not very clear what I'm writing at first glance, since there is too much boilerplate.
My colleagues are using plain regexps, and they can write much shorter code than me since a regexp is just a simple string. But regexps are too "hard coded", and they are hard to refactor.
So I thought, why not make something like styled-components: use template literals to write regexps, and carefully extend regexps with some of the expressive power of context-free grammars.
I'd like to build a prototype in JavaScript, on top of parjs.
Maybe something like:
const stockParser = parser`
(?<method>${stockTypes})[共合]?计?${stockAmountParser}|
${stockAmountParser}(?<method>${stockTypes})
`;
const myParser = parser`
(${priceParser}.*${stockParser})+|
(${stockParser}.*${priceParser})+|
${stockParser}
`;
myParser(`同意以3.22元/股的价格回购部分激励对象顾博、侯祥、祝里仁、丁树俊共4人已授予未解锁的合计86万股限制性股票`);
// {'amount': '86万', 'method': '限制性股票', 'price': '3.22', 'priceUnit': '元/股'}
It looks shorter and clearer.
What do you think?
from parjs.
@linonetwo I totally agree with your point. Being concise is important.
Embedding parsers inside tagged templates is a great idea! I haven't thought of that.
However, I don't agree with expressing combinators as commands in the string. I really, really dislike embedding code inside strings, especially when it's your own custom DSL. It's very easy to make mistakes and others will find it difficult to read what you wrote. Anyone wanting to use the feature would have to learn the language, with no IDE support. There is also the matter of designing and parsing the language.
But I like the syntax:
`${a}, ${b}, ${c}`
Even though fully expressing the type of the parser in TypeScript would probably be impossible, I'll add it.
Other stuff
I recently added a feature that lets string and regex literals automatically be converted to parsers that parse them. For example, Parjs.seq("a", "b", "c") would parse "abc". In previous versions you would've had to write Parjs.seq(Parjs.string("a"), Parjs.string("b"), Parjs.string("c")) or something similar.
I'm also working on a combinator that will take some sort of object of the form:
{
a : Parjs.string("1"),
b : Parjs.string("2"),
c : Parjs.string("3")
}
And return a parser that parses ${a}${b}${c} and yields {a : "1", b : "2", c : "3"}.
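As a rough illustration of how such an object combinator could behave, here is a self-contained sketch. It deliberately avoids the parjs API: the `Parser` type, `str`, and `seqObject` are hypothetical names invented for this example, and the real combinator may look quite different.

```typescript
// Minimal stand-in parser type for illustration; this is NOT the parjs API.
type Result<T> = { ok: true; value: T; next: number } | { ok: false };
type Parser<T> = (input: string, pos: number) => Result<T>;

// Parser that matches a fixed string and yields it.
function str(text: string): Parser<string> {
    return (input, pos) =>
        input.slice(pos, pos + text.length) === text
            ? { ok: true, value: text, next: pos + text.length }
            : { ok: false };
}

// Hypothetical combinator: run each property's parser in key order and
// collect the results under the same keys. Note that Object.keys lists
// integer-like keys first, so plain names such as a, b, c are assumed.
function seqObject<T extends Record<string, unknown>>(
    parsers: { [K in keyof T]: Parser<T[K]> }
): Parser<T> {
    return (input, pos) => {
        const out: Record<string, unknown> = {};
        let cur = pos;
        for (const key of Object.keys(parsers)) {
            const parser = parsers[key as keyof T] as Parser<unknown>;
            const r = parser(input, cur);
            if (!r.ok) return { ok: false };
            out[key] = r.value;
            cur = r.next;
        }
        return { ok: true, value: out as T, next: cur };
    };
}

const abc = seqObject({ a: str("1"), b: str("2"), c: str("3") });
const result = abc("123", 0);
// result.value is { a: "1", b: "2", c: "3" } when parsing succeeds
```

The mapped type lets TypeScript infer the shape of the yielded object from the shape of the parser object, which is the main appeal of this form over a positional sequence.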
I have a few more ideas that should make defining parsers more concise. I'm also eager to hear more of yours.
from parjs.
You are right, without IDE support, custom syntax inside a template literal is difficult to debug. I ran into that when I was using codegen.macro recently.
But I don't quite understand what you mean by ↓
`${a}, ${b}, ${c}`
// and
`${a}${b}${c}`
Do you mean they can be the shorthand for ↓?
Parjs.seq("a", "b", "c")
// and
Parjs.xxx({
a : Parjs.string("1"),
b : Parjs.string("2"),
c : Parjs.string("3")
})
from parjs.
The reason why I was considering a DSL:
My job is data extraction from reports. We are using pyparsing and regexps, but both turn out to be long and difficult to read. (We initially used regexps and introduced pyparsing to solve this problem. Though pyparsing code is refactorable, it is not readable when it gets too long, or when one sentence is split into so many parts that we no longer know what the pattern is describing!)
Neither pyparsing nor regexps can be refactorable and readable at the same time. (Maybe I'm using them wrong. Is there a design pattern for using combinators?)
That's why I was considering a DSL solution that looks like a regex, but lets you embed combinators inside. Here are my thoughts: https://github.com/linonetwo/fragmented-regex
I would also prefer not to use a DSL, because I need IDE hints and type checking. But I don't know whether there is another way to be both concise and precise when dealing with complex text like natural language and web pages.
In data extraction, there will be lots of patterns to write, and every pattern is long.
- lots of patterns to write: so we should factor out the shared parts, which is easy to achieve with combinators
- every pattern is long: every pattern is derived from an example text and then adjusted using more example texts (real-world texts contain many synonyms). To maintain such a pattern, you need to remember the initial example text, so the pattern should look similar to it. This is easier to achieve with regexps
There may be a tradeoff between being refactorable and being readable. The expressive power needed lies between regexps and CFGs.
from parjs.
I have another idea: editing parsers from a GUI.
Do you think parjs code would be easy to generate?
from parjs.
The string template idea
I thought of translating this:
`${p1} = ${p2}; ${p3}`
Into this:
Parjs.seq(p1, " = ", p2, "; ", p3);
Still haven't had time to work on it.
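Until that exists, the translation can be sketched in a few lines. The following is a self-contained illustration with a stand-in parser type, not the parjs API; `template`, `seq`, and `str` are hypothetical names made up for this example.

```typescript
// Minimal stand-in parser type for illustration; this is NOT the parjs API.
type Result<T> = { ok: true; value: T; next: number } | { ok: false };
type Parser<T> = (input: string, pos: number) => Result<T>;

// Parser that matches a fixed string and yields it.
function str(text: string): Parser<string> {
    return (input, pos) =>
        input.slice(pos, pos + text.length) === text
            ? { ok: true, value: text, next: pos + text.length }
            : { ok: false };
}

// Run parsers one after another, collecting their results in an array.
function seq(...parsers: Parser<unknown>[]): Parser<unknown[]> {
    return (input, pos) => {
        const values: unknown[] = [];
        let cur = pos;
        for (const p of parsers) {
            const r = p(input, cur);
            if (!r.ok) return { ok: false };
            values.push(r.value);
            cur = r.next;
        }
        return { ok: true, value: values, next: cur };
    };
}

// Hypothetical tag function: non-empty literal text between placeholders
// becomes a string parser, and each interpolated parser keeps its position.
function template(
    strings: TemplateStringsArray,
    ...parsers: Parser<unknown>[]
): Parser<unknown[]> {
    const parts: Parser<unknown>[] = [];
    strings.forEach((text, i) => {
        if (text.length > 0) parts.push(str(text));
        if (i < parsers.length) parts.push(parsers[i]);
    });
    return seq(...parts);
}

const p1 = str("x");
const p2 = str("1");
const p3 = str("end");
// Equivalent to seq(p1, str(" = "), p2, str("; "), p3)
const assignment = template`${p1} = ${p2}; ${p3}`;
const parsed = assignment("x = 1; end", 0);
```

Because the tag only splices parsers and literal text together, there is no custom language to learn; whitespace in the template is matched literally, which may or may not be what a real implementation would want.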
Your specific problem
Thanks for telling me your use case. It sounds really interesting. I haven't really done that sort of thing before. I've mainly done parsing of programming languages/DSLs.
In general, my advice for working with parser combinators is to build the parser tree incrementally, giving some component parsers names, and trying to reuse them as much as possible. Also, create and store parsers for things like common number formats, styles of headings, and so on.
Also, don't be afraid to define new combinators (as functions) for specific use-cases. For example, it might make sense to define:
let betweenParens = p => p.between("(", ")")
To parse some expression between parentheses, if different expressions can appear in parentheses.
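For readers without parjs at hand, the same pattern can be shown with a tiny stand-in. This is only a sketch: the `Parser` type, `str`, `between`, and `digits` below are hypothetical helpers written for this example, not parjs functions.

```typescript
// Minimal stand-in parser type for illustration; this is NOT the parjs API.
type Result<T> = { ok: true; value: T; next: number } | { ok: false };
type Parser<T> = (input: string, pos: number) => Result<T>;

// Parser that matches a fixed string and yields it.
function str(text: string): Parser<string> {
    return (input, pos) =>
        input.slice(pos, pos + text.length) === text
            ? { ok: true, value: text, next: pos + text.length }
            : { ok: false };
}

// Parse `open`, then `inner`, then `close`, and yield only the inner result.
function between<T>(open: string, inner: Parser<T>, close: string): Parser<T> {
    return (input, pos) => {
        const o = str(open)(input, pos);
        if (!o.ok) return { ok: false };
        const v = inner(input, o.next);
        if (!v.ok) return { ok: false };
        const c = str(close)(input, v.next);
        if (!c.ok) return { ok: false };
        return { ok: true, value: v.value, next: c.next };
    };
}

// A named, reusable combinator defined as an ordinary function.
const betweenParens = <T>(p: Parser<T>): Parser<T> => between("(", p, ")");

// Small regex-backed parser for unsigned integers.
const digits: Parser<number> = (input, pos) => {
    const m = /^\d+/.exec(input.slice(pos));
    return m
        ? { ok: true, value: Number(m[0]), next: pos + m[0].length }
        : { ok: false };
};

const parenNumber = betweenParens(digits);
const parenResult = parenNumber("(42)", 0);
```

Any other parser can be passed to `betweenParens` in place of `digits`, which is what makes a small named helper like this worth defining once and reusing.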
About using a GUI
If you want to generate a parser from a GUI, I recommend that you use a parser generator library. Some mature libraries (no JS library that I'm aware of) have front-end applications. There won't be any benefit from using a parser-combinator library if you generate the code anyway.
GUIs usually offer only a small subset of the capabilities of a library though.
from parjs.