GregRos commented on June 19, 2024

Added the arithmetic expression parser.
https://github.com/GregRos/parjs/blob/master/examples/math.ts

I also switched to importing lodash on a per-function basis and removed the dependency on chai.

A minified, uglified webpack bundle of the library is now around 170 KB. Modularizing the Unicode data so that it's opt-in will probably bring the bundle size down to 90 KB or so. Gzip will probably reduce those figures by a factor of 2-3.

from parjs.

tscholl2 commented on June 19, 2024

I was actually just playing around with parser combinator libraries trying to learn about them when I ran into this library.

Since you're asking for direction, I would vote for (3) then (1) then (2). Also maybe loosen dependencies.

Here is why: I don't know much about parser theory and I've never written a parser combinator. Most libraries have an example that parses JSON, so I usually copy the example and run it on a JSON blob using Benchmark.js. I think doing (3) first (even if it's just JSON) will make it easier to do (1) (even if it's not a very rigorous benchmark).

I also noticed that you have lodash as a dependency, and it doesn't always seem to be used. I looked into this because I tried to create a minimal UMD bundle of a parser and it was ~1 MB. A similar bundle with Parsimmon was about 20 KB.

GregRos commented on June 19, 2024

Thanks for the input 😄

That sounds like a good idea. I'll do an arithmetic expression parser, JSON parser, and hopefully a Markdown parser.

Thanks for the tip about the bundle size. It's a pretty big issue. I'll try to fix it, maybe create a lighter version. I can't really compete with Parsimmon in that respect because that library is just one JavaScript file with around 1000 lines. This library is much bigger than that and has many more features. But 1 MB is just way too big for anything.

Lodash is used quite often, but if it's a problem maybe I'll reference individual lodash functions instead of the whole thing. I don't use the entire thing.

tscholl2 commented on June 19, 2024

No problem! It looks very promising, keep up the great work. Let me know if there is anything in particular you would like help with.

I also think the arithmetic parser would be great, that happens to be very close to my use-case.

GregRos commented on June 19, 2024

Ditto for the JSON parser.

I reduced the size of the library as I said above and the results are pretty close to what I expected.

Note that these figures are when compiling the TypeScript to ES6 (which I've done in the library). Compiling to ES5 may increase size by a factor of 2. ES6 is supported by all modern browsers.

tscholl2 commented on June 19, 2024

Nice! I'm really glad the bundle info is on the readme as well. Great work!

linonetwo commented on June 19, 2024

I was reading https://tomassetti.me/parsing-in-javascript/#apg. It introduces many JS parser combinator libraries, and parjs comes last, but I would choose parjs since it is the most modular one. Using parjs is like using pyparsing.

But pyparsing has a downside: it can't easily deal with Unicode. Parsing Chinese files is hard, and there is a lack of examples. I expect parjs to be more comfortable when dealing with Chinese.

GregRos commented on June 19, 2024

Thanks! It's nice seeing my work being discussed, and I'm happy you appreciate it.

parjs works on Javascript characters, which work fine for anything in the BMP (this includes almost all CJK characters). However, you might run into trouble if you want to parse symbols composed of multiple characters. A Latin example is something like â̵̳, which is a sequence of around 4 characters but appears as one "letter". Parjs will recognize it as a string of length 4 and will treat it as such.
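
A minimal sketch of the length issue, in plain TypeScript with no parjs involved. The sample strings and the `countCodePoints` helper are illustrative assumptions, not part of the library.

```typescript
// One visible "letter": a base character plus three combining marks.
const combined = "a\u0302\u0335\u0333";
// A CJK character outside the BMP (U+20000), stored as a surrogate pair.
const astral = "\uD840\uDC00";
// A common CJK character inside the BMP (U+4E2D).
const bmp = "\u4E2D";

// Count Unicode code points by treating each surrogate pair as one unit.
function countCodePoints(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms a single code point.
    if (code >= 0xd800 && code <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) i++;
    }
    count++;
  }
  return count;
}

console.log(combined.length);           // 4 code units
console.log(countCodePoints(combined)); // 4 code points, still one visible letter
console.log(astral.length);             // 2 code units
console.log(countCodePoints(astral));   // 1 code point
console.log(bmp.length);                // 1 code unit
```

A parser that works on JavaScript string indices sees the first string as four separate characters and the second as two, even though each renders as a single symbol.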

I don't know much about Chinese, but I know Korean characters have both combining and non-combining variants. Can you give me an example of what you want to parse?

Also, you can use the char-info package to detect characters of a certain script (like Chinese).

bd82 commented on June 19, 2024

Would this be of interest to you?

Basically I've exposed the Chevrotain parsing engine so it can be used to implement other parsing libraries including combinators.

Naturally this has limitations:

  • Relies on automatic output of a ParseTree (CST) and visiting it, no embedded actions support yet.
  • Relies on "new Function()" so won't work where content security policy is enabled without first generating code.
  • Distinct lexing phase unlike a PEG grammar.

But also has advantages due to using a mature engine:

  • Very High Performance.
  • Debuggable (you can step into the generated code that implements the combinator).

linonetwo commented on June 19, 2024

Overly long parser definitions have always been a problem. I'm looking for a way to reduce the amount of code I need to define a parser.

I'm working on a Python project, so I'm using pyparsing. There is so much boilerplate; it looks like this in Python:

...

stock_parser = oneOf(stock_types).setResultsName('method') + Optional(oneOf('共 合')) + Optional('计') + stock_amount_parser ^ \
    (stock_amount_parser + oneOf(stock_types).setResultsName('method'))

...

my_parser = OneOrMore(SkipTo(price_parser + SkipTo(stock_parser, include=True), include=True)) ^ OneOrMore(SkipTo(stock_parser + SkipTo(price_parser, include=True), include=True)) ^ SkipTo(stock_parser, include=True)

It's not very clear what I'm writing at first glance, since there is too much boilerplate.
My colleagues use plain regexps, and they can write much shorter code than me since a regexp is just a simple string. But regexps are too "hard-coded", so they are hard to refactor.

So I thought, why not make something like styled-components: use template literals to write regexps, and carefully extend them with some of the expressive power of context-free grammars.
I'd like to build a prototype in JavaScript, on top of parjs.

Maybe something like:

const stockParser = parser`
  (?<method>${stockTypes})[共合]?计?${stockAmountParser}|
  ${stockAmountParser}(?<method>${stockTypes})
`;
const myParser = parser`
  (${priceParser}.*${stockParser})+|
  (${stockParser}.*${priceParser})+|
  ${stockParser}
`;

myParser(`同意以3.22元/股的价格回购部分激励对象顾博、侯祥、祝里仁、丁树俊共4人已授予未解锁的合计86万股限制性股票`);
// {'amount': '86万', 'method': '限制性股票', 'price': '3.22', 'priceUnit': '元/股'}

It looks shorter and clearer.

What do you think?

GregRos commented on June 19, 2024

@linonetwo I totally agree with your point. Being concise is important.

Embedding parsers inside tagged templates is a great idea! I haven't thought of that.

However, I don't agree with expressing combinators as commands in the string. I really, really dislike embedding code inside strings, especially when it's your own custom DSL. It's very easy to make mistakes and others will find it difficult to read what you wrote. Anyone wanting to use the feature would have to learn the language, with no IDE support. There is also the matter of designing and parsing the language.

But I like the syntax:

`${a}, ${b}, ${c}`

Even though fully expressing the type of the parser in TypeScript would probably be impossible, I'll add it.

Other stuff

I recently added a feature that lets string and regex literals automatically be converted to parsers that parse them. For example, Parjs.seq("a", "b", "c") would parse "abc". In previous versions you would've had to write Parjs.seq(Parjs.string("a"), Parjs.string("b"), Parjs.string("c")) or something similar.
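
As a rough sketch of how this kind of implicit conversion can work, the toy combinators below accept either a plain string or a parser. This is not how parjs itself is implemented; `Parser`, `str`, `lift`, and `seq` are hypothetical names defined only for this example, and regex literals are left out.

```typescript
// A parser takes the input and a position and returns a value plus the
// new position, or null on failure.
type Result<T> = { value: T; pos: number } | null;
type Parser<T> = (input: string, pos: number) => Result<T>;

// Parser that matches one exact string.
const str = (text: string): Parser<string> => (input, pos) =>
  input.slice(pos, pos + text.length) === text
    ? { value: text, pos: pos + text.length }
    : null;

// Lift a plain string into a parser; leave existing parsers untouched.
const lift = (p: string | Parser<string>): Parser<string> =>
  typeof p === "string" ? str(p) : p;

// Run the parts one after another and collect their values.
const seq = (...parts: Array<string | Parser<string>>): Parser<string[]> =>
  (input, pos) => {
    const values: string[] = [];
    for (const part of parts) {
      const r = lift(part)(input, pos);
      if (r === null) return null;
      values.push(r.value);
      pos = r.pos;
    }
    return { value: values, pos };
  };

const concise = seq("a", "b", "c");
const verbose = seq(str("a"), str("b"), str("c"));

console.log(concise("abc", 0)); // { value: ["a", "b", "c"], pos: 3 }
console.log(verbose("abc", 0)); // same result
```

The only extra work is the `lift` step at the point where a combinator receives its arguments, so both spellings end up running the same parsers.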

I'm also working on a combinator that will take some sort of object of the form:

{
    a : Parjs.string("1"),
    b : Parjs.string("2"),
    c : Parjs.string("3")
}

And return a parser that parses ${a}${b}${c} and yields {a : "1", b : "2", c : "3"}.
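
A minimal sketch of what such a combinator could look like, using a toy parser type rather than parjs. The name `objectOf` and the types here are assumptions made for illustration; the actual combinator may differ.

```typescript
type Result<T> = { value: T; pos: number } | null;
type Parser<T> = (input: string, pos: number) => Result<T>;

// Parser that matches one exact string.
const str = (text: string): Parser<string> => (input, pos) =>
  input.slice(pos, pos + text.length) === text
    ? { value: text, pos: pos + text.length }
    : null;

// Hypothetical combinator: run each property's parser in key order and
// build an object with the same keys holding the parsed values.
function objectOf(
  spec: { [key: string]: Parser<string> }
): Parser<{ [key: string]: string }> {
  return (input, pos) => {
    const out: { [key: string]: string } = {};
    for (const key of Object.keys(spec)) {
      const r = spec[key](input, pos);
      if (r === null) return null;
      out[key] = r.value;
      pos = r.pos;
    }
    return { value: out, pos };
  };
}

const abc = objectOf({
  a: str("1"),
  b: str("2"),
  c: str("3"),
});

console.log(abc("123", 0)); // { value: { a: "1", b: "2", c: "3" }, pos: 3 }
```

This version relies on `Object.keys` returning non-numeric string keys in insertion order, so the property order in the object literal decides the parse order.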

I have a few more ideas that should make defining parsers more concise. I'm also eager to hear more of yours.

linonetwo commented on June 19, 2024

You are right: without IDE support, custom syntax inside a template literal is difficult to debug. I ran into that recently when I was using codegen.macro.

But I don't quite understand what you mean by ↓

`${a}, ${b}, ${c}`
// and
`${a}${b}${c}`

Do you mean they can be the shorthand for ↓?

Parjs.seq("a", "b", "c")
// and
Parjs.xxx({
    a : Parjs.string("1"),
    b : Parjs.string("2"),
    c : Parjs.string("3")
})

linonetwo commented on June 19, 2024

The reason why I was considering a DSL:

My job is data extraction from reports. We are using pyparsing and regexps, but both of them turn out to be long and difficult to read. (We initially used regexps and introduced pyparsing to solve this problem. Although pyparsing code is refactorable, it is not readable when it gets too long, or when one sentence is split into so many parts that we no longer know what the pattern is describing!)

Neither pyparsing nor regexps can be refactorable and readable at the same time. (Maybe my usage is wrong. Is there a design pattern for using combinators?)

That's why I was considering a DSL solution that looks like regex but lets you embed combinators inside. Here are my thoughts: https://github.com/linonetwo/fragmented-regex

I'd prefer not to use a DSL either, because I need IDE hints and type checking. But I don't know whether there is another way to be both concise and precise when dealing with complex text like natural language and web pages.

In data extraction, there are lots of patterns to write, and every pattern is long.

  • lots of patterns to write: so we should factor out the shared parts, which is easy to achieve with combinators
  • every pattern is long: every pattern is derived from an example text and then adjusted using more example texts (real-world texts contain lots of synonyms). To maintain such a pattern, you need to remember the initial example text, so the pattern should look similar to it. This is easier to achieve with regexps

There may be a middle ground that is both refactorable and readable, with expressive power somewhere between regexps and CFGs.

linonetwo commented on June 19, 2024

I have another idea: editing a parser from a GUI.
Do you think parjs code would be easy to generate?

GregRos commented on June 19, 2024

@linonetwo

The string template idea

I thought of translating this:

`${p1} = ${p2}; ${p3}`

Into this:

Parjs.seq(p1, " = ", p2, "; ", p3);

Still haven't had time to work on it.
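
A possible starting point, sketched under assumptions: a tag function receives the literal chunks and the embedded values separately, so interleaving them in source order gives the argument list for a sequence combinator. `seqParts` is a hypothetical name, and the `p1`..`p3` objects are stand-ins for real parsers so the ordering is easy to see.

```typescript
// Hypothetical tag function: interleave the literal chunks and the embedded
// values in source order, skipping empty literal chunks.
function seqParts<T>(
  strings: TemplateStringsArray,
  ...values: T[]
): Array<string | T> {
  const out: Array<string | T> = [];
  strings.forEach((chunk, i) => {
    if (chunk !== "") out.push(chunk);
    if (i < values.length) out.push(values[i]);
  });
  return out;
}

// Stand-ins for real parsers.
const p1 = { name: "p1" };
const p2 = { name: "p2" };
const p3 = { name: "p3" };

const parts = seqParts`${p1} = ${p2}; ${p3}`;
console.log(parts);
// [ { name: "p1" }, " = ", { name: "p2" }, "; ", { name: "p3" } ]
```

The resulting list could then be passed on to a sequence combinator that accepts both parsers and string literals.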

Your specific problem

Thanks for telling me your use case. It sounds really interesting. I haven't really done that sort of thing before. I've mainly done parsing of programming languages/DSLs.

In general, my advice for working with parser combinators is to build the parser tree incrementally, giving some component parsers names, and trying to reuse them as much as possible. Also, create and store parsers for things like common number formats, styles of headings, and so on.

Also, don't be afraid to define new combinators (as functions) for specific use-cases. For example, it might make sense to define:

let betweenParens = p => p.between("(", ")")

To parse some expression between parentheses, if different expressions can appear in parentheses.
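
To show the same idea end to end without depending on any library, here is a self-contained sketch with a toy parser type. `Parser`, `str`, `between`, and `digits` are hypothetical helpers written for this example; only the shape of `betweenParens` mirrors the line above.

```typescript
type Result<T> = { value: T; pos: number } | null;
type Parser<T> = (input: string, pos: number) => Result<T>;

// Parser that matches one exact string.
const str = (text: string): Parser<string> => (input, pos) =>
  input.slice(pos, pos + text.length) === text
    ? { value: text, pos: pos + text.length }
    : null;

// Generic helper: parse `open`, then `inner`, then `close`,
// and keep only the inner value.
const between = <T>(open: string, inner: Parser<T>, close: string): Parser<T> =>
  (input, pos) => {
    const o = str(open)(input, pos);
    if (o === null) return null;
    const v = inner(input, o.pos);
    if (v === null) return null;
    const c = str(close)(input, v.pos);
    return c === null ? null : { value: v.value, pos: c.pos };
  };

// Domain-specific combinator defined as an ordinary function.
const betweenParens = <T>(p: Parser<T>): Parser<T> => between("(", p, ")");

// Parse one or more ASCII digits as a number.
const digits: Parser<number> = (input, pos) => {
  let end = pos;
  while (end < input.length && input.charAt(end) >= "0" && input.charAt(end) <= "9") {
    end++;
  }
  return end > pos ? { value: Number(input.slice(pos, end)), pos: end } : null;
};

console.log(betweenParens(digits)("(42)", 0)); // { value: 42, pos: 4 }
```

Because `betweenParens` is just a function from parser to parser, it composes with itself and with any other parser, for example `betweenParens(betweenParens(digits))` for doubly nested input.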

About using a GUI

If you want to generate a parser from a GUI, I recommend that you use a parser generator library. Some mature libraries (no JS library that I'm aware of) have front-end applications. There won't be any benefit from using a parser-combinator library if you generate the code anyway.

GUIs usually offer only a small subset of the capabilities of a library though.
