norskeld / sigma

TypeScript parser combinator library for building fast and convenient parsers.

Home Page: https://sigma.nrsk.dev

License: MIT License

TypeScript 100.00%

typescript parser-combinators parser combinators parsec hacktoberfest

sigma's Introduction

`𝝨` sigma

Features

Capable of parsing LL grammars using recursive descent with backtracking.
Ergonomic API with excellent TypeScript support.
Zero dependencies. Supports tree shaking.
Performant enough to beat similar parser combinator libraries.

All in all, Sigma is easy to use and extend, reasonably fast, and convenient, but somewhat limited in the types of grammars it can parse.

Docs

You can find the documentation here. If you want to contribute, feel free to check out the source code.

Installation

Node

Just use your favorite package manager.

npm i @nrsk/sigma

Deno

You can import the library via Skypack (note the ?dts query parameter, which pulls in the types):

import { ... } from 'https://cdn.skypack.dev/@nrsk/sigma?dts'
import { ... } from 'https://cdn.skypack.dev/@nrsk/sigma/parsers?dts'
import { ... } from 'https://cdn.skypack.dev/@nrsk/sigma/combinators?dts'

Example

Below is an example of parsing nested tuples like (1, 2, (3, 4)) into an AST.

import { choice, map, optional, sepBy, sequence, takeMid } from '@nrsk/sigma/combinators'
import { defer, integer, run, string, whitespace } from '@nrsk/sigma/parsers'
import type { Span } from '@nrsk/sigma'

/* AST. */

interface NumberNode {
  type: 'number'
  span: Span
  value: number
}

interface ListNode {
  type: 'list'
  span: Span
  value: Array<NumberNode | ListNode>
}

/* Mapping functions to turn parsed string values into AST nodes. */

function toNumber(value: number, span: Span): NumberNode {
  return {
    type: 'number',
    span,
    value
  }
}

function toList(value: Array<NumberNode | ListNode>, span: Span): ListNode {
  return {
    type: 'list',
    span,
    value
  }
}

/* Parsers. */

const OpenParen = string('(')
const CloseParen = string(')')
const Space = optional(whitespace())
const Comma = sequence(Space, string(','), Space)

const TupleNumber = defer<NumberNode>()
const TupleList = defer<ListNode>()

TupleNumber.with(
  map(
    integer(),
    toNumber
  )
)

TupleList.with(
  map(
    takeMid(
      OpenParen,
      sepBy(choice(TupleList, TupleNumber), Comma),
      CloseParen
    ),
    toList
  )
)

Then we simply run the root parser, feeding it with text:

run(TupleList).with('(1, 2, (3, 4))')

And in the end we get the following output with the AST, which can then be manipulated if needed:

{
  isOk: true,
  span: [ 0, 14 ],
  pos: 14,
  value: {
    type: 'list',
    span: [ 0, 14 ],
    value: [
      { type: 'number', span: [ 1, 2 ], value: 1 },
      { type: 'number', span: [ 4, 5 ], value: 2 },
      {
        type: 'list',
        span: [ 7, 13 ],
        value: [
          { type: 'number', span: [ 8, 9 ], value: 3 },
          { type: 'number', span: [ 11, 12 ], value: 4 }
        ]
      }
    ]
  }
}

Development

Fork, clone, then instead of npm install run:

npm run install:all

Note

This installs dependencies for the package itself as well as for the docs and benchmarks packages. The current repository setup requires this to avoid problems with eslint, which runs in a pre-commit hook.

This project follows the conventional commits spec and uses a slightly modified commitlint preset to lint commits automatically and generate the changelog.

License

MIT.

sigma's Issues

feat: error recovery

Provide means to recover from errors.

feat: make regexp parser throw an error if the global flag is missing

In this issue I learned that the g flag must be set for the regexp parser to work correctly.

Should the regexp parser check for the g flag itself? It would be a minor improvement, but it would prevent many mistakes. There is always a chance of forgetting the flag, even if you know about the rule.

Maybe the paragraph in the documentation that explains this requirement should be emphasized to make it more obvious. But I don't think documentation alone is enough; the rule should be validated programmatically.

Adding this check somewhere would be a great addition:

if (!re.global) throw new Error('regexp parser must use g flag to process input correctly');

sigma/src/parsers/regexp.ts

Line 17 in 567b6f9

export function regexp(re: RegExp, expected: string): Parser<string> {

I can submit a PR myself if you agree with me on this point!
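To illustrate where such a guard could live, here is a minimal self-contained sketch. The Parser and Result types are local stand-ins modelled on the result shape shown in the README example, and this regexp function is a hypothetical re-implementation, not the library's actual source. The guard runs once, when the parser is created, so the mistake surfaces before any input is parsed.

```typescript
type Span = [number, number]
type Success<T> = { isOk: true; span: Span; pos: number; value: T }
type Failure = { isOk: false; span: Span; pos: number; expected: string }
type Result<T> = Success<T> | Failure

interface Parser<T> {
  parse(input: string, pos: number): Result<T>
}

// Hypothetical stand-in for the library's `regexp` parser with the proposed guard.
function regexp(re: RegExp, expected: string): Parser<string> {
  // Validate once, at construction time, instead of failing silently while parsing.
  if (!re.global) {
    throw new Error('regexp parser must use g flag to process input correctly')
  }

  return {
    parse(input, pos) {
      // With the g flag set, `exec` starts searching at `lastIndex`.
      re.lastIndex = pos
      const match = re.exec(input)

      // `exec` may find a match further ahead, so require it to start exactly at `pos`.
      if (match !== null && match.index === pos) {
        const end = pos + match[0].length
        return { isOk: true, span: [pos, end], pos: end, value: match[0] }
      }

      return { isOk: false, span: [pos, pos], pos, expected }
    }
  }
}
```

Throwing rather than returning a failure is a design choice in this sketch: a missing flag is a programming error, not a property of the input.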

docs: tsdoc annotations

TSDoc annotations for better DX.

Dependents

What problem does this feature solve?

sigma is so well written, feature packed, and to a high level of perfection.
How come that no one uses it?
https://www.npmjs.com/package/@nrsk/sigma

feat: find a match in free text, get consumed input and leftovers

I'm currently migrating a parser from a solution based on regular expressions. I ran into an issue: I can't just run a parser over free text, I have to specify everything around it somehow. On top of that, one of my requirements is to get the matched string out of the input (it is removed from the free text later).

I'm currently using something like this to get the desired output:

import * as s from '@nrsk/sigma';

interface ExtractedResult<T> {
    consumed: string;
    rest: string;
    value: T;
}

export function extract<T>(input: string, targetParser: s.Parser<T>): ExtractedResult<T> {
    const wrappedParser = s.map(
        s.sequence(
            s.takeUntil(s.any(), targetParser),
            s.rest(),
        ),
        ([[before, value], after]) => ([
            before.join(''),
            value,
            after,
        ] as const)
    );

    const result = s.run(wrappedParser).with(input);
    if (result.isOk) {
        const [before, value, after] = result.value;
        // Slice by length instead of using String.prototype.replace, which removes
        // the first occurrence and can cut the wrong place when the text repeats.
        const consumed = input.slice(before.length, input.length - after.length);
        const rest = before + after;

        return {
            value,
            consumed,
            rest,
        };
    } else {
        throw new Error(result.expected);
    }
}

What can be done instead?

While consumed and rest could be covered by the span feature, there is still a need to make a parser work with surrounding input the way regular expressions do:

const parser = freeText(string('hello world'));
const re = /hello world/;

run(parser).with('RANDOM TEXT hello world RANDOM TEXT') // result.isOk = true, value = 'hello world'
re.exec('RANDOM TEXT hello world RANDOM TEXT') // matches: ['hello world']

run(parser).with('hello world') // result.isOk = true, value = 'hello world'
re.exec('hello world') // matches: ['hello world']
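One way to get this behaviour is to try the target parser at every offset and use the resulting span to split the input. The sketch below is self-contained: the types are local stand-ins modelled on the README's result shape, literal stands in for the library's string parser, and findFirst is a hypothetical helper name, not an existing API.

```typescript
type Span = [number, number]
type Success<T> = { isOk: true; span: Span; pos: number; value: T }
type Failure = { isOk: false; span: Span; pos: number; expected: string }
type Result<T> = Success<T> | Failure

interface Parser<T> {
  parse(input: string, pos: number): Result<T>
}

// Stand-in for the library's `string` parser.
function literal(match: string): Parser<string> {
  return {
    parse(input, pos) {
      const end = pos + match.length
      return input.startsWith(match, pos)
        ? { isOk: true, span: [pos, end], pos: end, value: match }
        : { isOk: false, span: [pos, pos], pos, expected: match }
    }
  }
}

interface Extracted<T> {
  value: T
  consumed: string
  rest: string
}

// Hypothetical helper: try `target` at each offset, like an unanchored regular expression.
function findFirst<T>(input: string, target: Parser<T>): Extracted<T> | null {
  for (let pos = 0; pos <= input.length; pos++) {
    const result = target.parse(input, pos)

    if (result.isOk) {
      const [start, end] = result.span
      return {
        value: result.value,
        // The span tells us exactly which slice was matched.
        consumed: input.slice(start, end),
        rest: input.slice(0, start) + input.slice(end)
      }
    }
  }

  return null
}
```

Scanning every offset is quadratic in the worst case, so this is only a starting point for short inputs.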

feat: poc

Proof of concept implementation with basic combinators and capabilities.

refactor: stabilize public API & restructure sources

Get rid of aliases for combinators and parsers.
Rename some combinators to match parsec (or parser-combinators), and purescript-parsing conventions.
Restructure sources and lift everything up from the internal directory.
Adjust exports and re-export everything in the package's entry point (index.ts).

docs(vitepress): explore automatic code snippet imports for signatures

Right now every signature for every parser/combinator is copy-pasted from the source code. This could be "automated" to some extent by leveraging code snippet imports in VitePress.

feat: negative/positive look-ahead/behind, non-consuming surrounding context parsers, negation parsers

There are multiple things described in this issue, but I think it's better to keep them close.

I'm doing some free text parsing while migrating from a solution based on regular expressions. I ran into an issue: I couldn't fully specify the surrounding context. when is the only combinator that can work with context, but the documentation doesn't say whether it consumes input (I assume it does, which raises the question of how it differs from sequence). Also, when only allows specifying the preceding context, and I'm not sure what to use to specify the context after the target parser.

It would be nice to have a non-consuming parser (context) and a negation parser (not), so it would be possible to specify the surrounding context, e.g.:

// context(before, target, optional after)

const helloWorld = string('hello world');
const strictHelloWorld = context(
    not(letter()),
    helloWorld,
    not(whole())
);

const parser1 = sequence(any(), strictHelloWorld, any());
const parser2 = sequence(any(), helloWorld, any());

run(parser1).with('1hello worldA'); // result.isOk = true, value = ['1', 'hello world', 'A'];
run(parser1).with('Ahello world1'); // result.isOk = false

run(parser2).with('1hello worldA'); // result.isOk = true, value = ['1', 'hello world', 'A'];
run(parser2).with('Ahello world1'); // result.isOk = true, value = ['A', 'hello world', '1'];

It would also be nice to have some default non-consuming parsers such as wordBoundary. This would require polyfilling the JS \b assertion, since it is not Unicode-friendly.

More examples can be found in the codesandbox.
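A rough sketch of how a non-consuming negation could work is shown below. It covers only negative look-ahead, not the full context(before, target, after) proposal. Everything here is an assumption made for illustration: the types are local stand-ins, literal and letter mimic the library's string and letter parsers, and not and followedBy are hypothetical names.

```typescript
type Span = [number, number]
type Success<T> = { isOk: true; span: Span; pos: number; value: T }
type Failure = { isOk: false; span: Span; pos: number; expected: string }
type Result<T> = Success<T> | Failure

interface Parser<T> {
  parse(input: string, pos: number): Result<T>
}

// Stand-in for the library's `string` parser.
function literal(match: string): Parser<string> {
  return {
    parse(input, pos) {
      const end = pos + match.length
      return input.startsWith(match, pos)
        ? { isOk: true, span: [pos, end], pos: end, value: match }
        : { isOk: false, span: [pos, pos], pos, expected: match }
    }
  }
}

// Stand-in for the library's `letter` parser, limited to ASCII letters here.
function letter(): Parser<string> {
  return {
    parse(input, pos) {
      const char = input.charAt(pos)
      return /^[a-z]$/i.test(char)
        ? { isOk: true, span: [pos, pos + 1], pos: pos + 1, value: char }
        : { isOk: false, span: [pos, pos], pos, expected: 'letter' }
    }
  }
}

// Hypothetical `not`: succeeds only when `parser` fails, and never moves the position.
function not<T>(parser: Parser<T>): Parser<null> {
  return {
    parse(input, pos) {
      const result = parser.parse(input, pos)
      return result.isOk
        ? { isOk: false, span: [pos, result.pos], pos, expected: 'no match' }
        : { isOk: true, span: [pos, pos], pos, value: null }
    }
  }
}

// Hypothetical `followedBy`: runs `target`, then checks `after` at the new position.
// Because `not` consumes nothing, the result of `target` is returned unchanged.
function followedBy<T, U>(target: Parser<T>, after: Parser<U>): Parser<T> {
  return {
    parse(input, pos) {
      const result = target.parse(input, pos)
      if (!result.isOk) return result

      const check = after.parse(input, result.pos)
      return check.isOk ? result : check
    }
  }
}

// 'hello world' must not be directly followed by a letter.
const strictHelloWorld = followedBy(literal('hello world'), not(letter()))
```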

feat: add esm bundle

Reject ancient tech, embrace modern stuff.

perf: explore further performance improvements

Explore memoizing techniques.
Explore LL edge cases.

feat: spans

After working with some Rust parser combinator libraries like chumsky, I feel it would be really handy to either produce spans by default or allow mapping with spans.

A span is simply a pair of numbers, a tuple like [start: number, end: number], which points to some range in the source code we are parsing or have parsed. That is actually a must for quality error reporting and diagnostics.
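As a small illustration of why spans help diagnostics, the sketch below recovers the matched text and a line/column position from a span. The helper names spanText and toLineColumn are made up for this example, and the span is treated as a half-open range [start, end), which matches the offsets in the README output.

```typescript
type Span = [number, number]

// Return the slice of the source that the span points to.
function spanText(source: string, [start, end]: Span): string {
  return source.slice(start, end)
}

// Convert a zero-based offset into a one-based line and column.
function toLineColumn(source: string, offset: number): { line: number; column: number } {
  let line = 1
  let column = 1

  for (let i = 0; i < offset && i < source.length; i++) {
    if (source.charAt(i) === '\n') {
      line++
      column = 1
    } else {
      column++
    }
  }

  return { line, column }
}
```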

docs(vitepress): automate sidebar construction

Right now, every time we want to add a new parser/combinator and write a documentation, we need to manually add an entry to the sidebar:

sigma/docs/content/.vitepress/config.ts

Lines 135 to 186 in 9dc2b1c

    
return [
  Sidebar.group('Introduction', '/introduction', [
    Sidebar.item('Getting started', '/getting-started')
  ]),
  Sidebar.group('Guides', '/guides', [
    Sidebar.item('Primitives and composites', '/primitives-and-composites')
  ]),
  Sidebar.group('Combinators', '/combinators', [
    Sidebar.item('chainl', '/chainl'),
    Sidebar.item('choice', '/choice'),
    Sidebar.item('error', '/error'),
    Sidebar.item('many', '/many'),
    Sidebar.item('many1', '/many1'),
    Sidebar.item('map', '/map'),
    Sidebar.item('mapTo', '/mapTo'),
    Sidebar.item('optional', '/optional'),
    Sidebar.item('sepBy', '/sepBy'),
    Sidebar.item('sepBy1', '/sepBy1'),
    Sidebar.item('sequence', '/sequence'),
    Sidebar.item('skipUntil', '/skipUntil'),
    Sidebar.item('takeLeft', '/takeLeft'),
    Sidebar.item('takeMid', '/takeMid'),
    Sidebar.item('takeRight', '/takeRight'),
    Sidebar.item('takeSides', '/takeSides'),
    Sidebar.item('takeUntil', '/takeUntil'),
    Sidebar.item('when', '/when')
  ]),
  Sidebar.group('Parsers', '/parsers', [
    Sidebar.item('any', '/any'),
    Sidebar.item('binary', '/binary'),
    Sidebar.item('defer', '/defer'),
    Sidebar.item('eof', '/eof'),
    Sidebar.item('eol', '/eol'),
    Sidebar.item('float', '/float'),
    Sidebar.item('hex', '/hex'),
    Sidebar.item('integer', '/integer'),
    Sidebar.item('letter', '/letter'),
    Sidebar.item('letters', '/letters'),
    Sidebar.item('noneOf', '/noneOf'),
    Sidebar.item('nothing', '/nothing'),
    Sidebar.item('octal', '/octal'),
    Sidebar.item('oneOf', '/oneOf'),
    Sidebar.item('regexp', '/regexp'),
    Sidebar.item('rest', '/rest'),
    Sidebar.item('run', '/run'),
    Sidebar.item('string', '/string'),
    Sidebar.item('tryRun', '/tryRun'),
    Sidebar.item('ustring', '/ustring'),
    Sidebar.item('whitespace', '/whitespace'),
    Sidebar.item('whole', '/whole')
  ])
]

This is at the very least inconvenient. Scanning specific directories and extracting titles from .md files' frontmatter would be enough, since the structure is pretty flat and simple.
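The title extraction could look roughly like the sketch below. It works on an in-memory map of file names to contents, so reading the directory (for example with Node's fs module) is left out. The SidebarItem shape and the function names are assumptions for illustration, not the project's actual Sidebar helpers.

```typescript
interface SidebarItem {
  text: string
  link: string
}

// Pull `title: ...` out of a leading `---` front-matter block, or return null if there is none.
function extractTitle(markdown: string): string | null {
  const frontmatter = /^---\r?\n([\s\S]*?)\r?\n---/.exec(markdown)
  if (frontmatter === null) return null

  const title = /^title:\s*(.+)$/m.exec(frontmatter[1])
  if (title === null) return null

  // Drop surrounding quotes, if any.
  return title[1].trim().replace(/^['"]|['"]$/g, '')
}

// Build sorted sidebar items, falling back to the file name when no title is found.
function buildItems(files: Record<string, string>): SidebarItem[] {
  return Object.keys(files)
    .sort()
    .map((name) => {
      const slug = name.replace(/\.md$/, '')
      return { text: extractTitle(files[name]) ?? slug, link: `/${slug}` }
    })
}
```

Since the content structure is flat, one such call per directory (combinators, parsers) would be enough to produce each group.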

feat: error recovery, handling and mapping

Right now there are no "errors" per se: all sigma gives users is text messages and the ability to re-map those messages to something custom. This is a shame and should be improved.

Implementation of spans in #34 should help a bit, but we will also need to extend the Parser<T> signature with a second generic parameter E, so that parsers can carry error type information. All parsers and combinators will have to change accordingly, and I'm pretty sure there will be hurdles here and there.

Additionally, two combinators should be added:

mapErr(parser, fn) - this combinator will map error from E to some other type using given fn.
mapOrElse(parser, okFn, errFn) - this combinator will conditionally apply okFn or errFn depending on the parser's result.

It also makes sense to implement error recovery along with the stuff above. Hopefully, it will be enough to provide a single combinator:

recovery(parser, fn) - this combinator takes a parser and a recovery function fn that should produce another parser; similar to when combinator, but it acts only on failures.

Making parsers named (adding name property to Parser<T>) wouldn't hurt as well.
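To make the proposal more concrete, here is a sketch of what a two-parameter parser type with mapErr and recovery might look like. None of this exists in the library as described here: the Result and Parser shapes are simplified local assumptions (spans are omitted for brevity), and the error payload is stored in a hypothetical error field.

```typescript
type Result<T, E> =
  | { isOk: true; pos: number; value: T }
  | { isOk: false; pos: number; error: E }

interface Parser<T, E> {
  parse(input: string, pos: number): Result<T, E>
}

// Map the error from E to F, leaving successful results untouched.
function mapErr<T, E, F>(parser: Parser<T, E>, fn: (error: E) => F): Parser<T, F> {
  return {
    parse(input, pos) {
      const result = parser.parse(input, pos)
      return result.isOk ? result : { isOk: false, pos: result.pos, error: fn(result.error) }
    }
  }
}

// On failure, ask `fn` for a fallback parser and run it from the same position.
function recovery<T, E>(parser: Parser<T, E>, fn: (error: E) => Parser<T, E>): Parser<T, E> {
  return {
    parse(input, pos) {
      const result = parser.parse(input, pos)
      return result.isOk ? result : fn(result.error).parse(input, pos)
    }
  }
}

// A tiny parser with a plain string error, used to exercise the combinators.
const digit: Parser<number, string> = {
  parse(input, pos) {
    const char = input.charAt(pos)
    return char >= '0' && char <= '9'
      ? { isOk: true, pos: pos + 1, value: Number(char) }
      : { isOk: false, pos, error: 'digit' }
  }
}

// A fallback that always succeeds with 0 and consumes nothing.
const zero: Parser<number, string> = {
  parse(_input, pos) {
    return { isOk: true, pos, value: 0 }
  }
}

const structured = mapErr(digit, (expected) => ({ code: 'E_EXPECTED', expected }))
const lenient = recovery(digit, () => zero)
```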

docs: rewrite using VitePress

VitePress is still in alpha, but from what I've seen and tried, it's already okay and provides all the essential and useful stuff out of the box.

feat: add deno bundle

Maybe. Probably.

build: investigate adopting `unbuild`

unbuild looks very similar to what I have in my custom build/compile script and the setup relying on Rollup in general, so it should be easy to add it to the project and get everything up and running.

Would be great if I could do the same with release script...

Feature: error printer

What do you think about including a formatError function?

I wrote a function that takes the input string and a Failure instance, and prints out the surrounding +/- n lines with line-numbers etc.

I don't know if it's wise to include this in the library, since parser errors aren't the only type of errors you'd want to print - though, on the other hand, if the function was just something like formatError(input: string, offset: number, expectation: string), then this could be used to inject whatever expectation you want, to print out semantic errors and so on.

Not 100% sure if this is a good idea, as error formats can be quite different in different projects - but on the other hand, it might be nice to have something you can use just to get your project started and get something useful on the screen without doing a lot of ground work?

Up to you, but I do have something I could PR, if you'd like. 🙂
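For reference, a function with the formatError(input, offset, expectation) signature mentioned above could be sketched as follows. This is not the implementation offered in the issue, only an illustration; the fourth context parameter and the output layout are assumptions.

```typescript
// Print the line containing `offset`, plus `context` lines around it, with a caret marker.
function formatError(input: string, offset: number, expectation: string, context = 1): string {
  const lines = input.split('\n')

  // Walk the lines until we reach the one that contains `offset`.
  let line = 0
  let lineStart = 0
  while (line < lines.length - 1 && lineStart + lines[line].length < offset) {
    lineStart += lines[line].length + 1 // +1 for the newline character
    line++
  }

  const column = offset - lineStart
  const first = Math.max(0, line - context)
  const last = Math.min(lines.length - 1, line + context)

  // Right-align line numbers to the width of the largest one shown.
  const width = String(last + 1).length
  const gutter = (label: string) => ' '.repeat(width - label.length) + label + ' | '

  const output: string[] = []
  for (let i = first; i <= last; i++) {
    output.push(gutter(String(i + 1)) + lines[i])
    if (i === line) {
      output.push(gutter('') + ' '.repeat(column) + '^ expected ' + expectation)
    }
  }

  return output.join('\n')
}
```

Because the function only needs an offset and a message, it can print semantic errors as well as parser failures.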

sepBy mutates position even on no match

Describe the bug

When sepBy fails to match at all, it still updates the cursor position. This causes subsequent parsers to fail due to skipped input.

In contrast, many – which also always succeeds – does not update cursor position when it fails to match at all.

Reproduction

I can't figure out how to get past the pre-commit hooks, so here's a diff of the test and fix instead of a PR:

The test & fix

diff --git a/src/__tests__/combinators/sepBy.spec.ts b/src/__tests__/combinators/sepBy.spec.ts
index ca348b9..0cda6a1 100644
--- a/src/__tests__/combinators/sepBy.spec.ts
+++ b/src/__tests__/combinators/sepBy.spec.ts
@@ -1,4 +1,4 @@
-import { sepBy, sepBy1 } from '@combinators'
+import { sepBy, sepBy1, sequence } from '@combinators'
 import { string } from '@parsers'
 import { run, result, should, describe, it } from '@testing'

@@ -26,6 +26,17 @@ describe('sepBy', () => {

     should.matchState(actual, expected)
   })
+
+  it('should successfully continue if nothing matched', () => {
+    const parser = sequence(
+      sepBy(string('hello'), string('?')),
+      sepBy(string('bye'), string('?')),
+    )
+    const actual = run(parser, 'bye?bye?')
+    const expected = result(true, [[], ['bye', 'bye']])
+
+    should.matchState(actual, expected)
+  })
 })

 describe('sepBy1', () => {
diff --git a/src/combinators/sepBy.ts b/src/combinators/sepBy.ts
index 51b81b1..f42bc32 100644
--- a/src/combinators/sepBy.ts
+++ b/src/combinators/sepBy.ts
@@ -14,6 +14,7 @@ import type { Parser } from '@types'
 export function sepBy<T, S>(parser: Parser<T>, sep: Parser<S>): Parser<Array<T>> {
   return {
     parse(input, pos) {
       // Run the parser once to get the first value.
       const resultP = parser.parse(input, pos)

@@ -37,8 +38,8 @@ export function sepBy<T, S>(parser: Parser<T>, sep: Parser<S>): Parser<Array<T>>

       return {
         isOk: true,
-        span: [pos, resultP.pos],
-        pos: resultP.pos,
+        span: [pos, pos],
+        pos,
         value: []
       }
     }

And here's the test again for visibility, which fails on the current version (3.6.2):

  it('should successfully continue if nothing matched', () => {
    const parser = sequence(
      sepBy(string('hello'), string('?')),
      sepBy(string('bye'), string('?')),
    )
    const actual = run(parser, 'bye?bye?')
    const expected = result(true, [[], ['bye', 'bye']])

    should.matchState(actual, expected)
  })

test: rewrite and refactor to use uvu test runner

Rewrite tests to use uvu test runner instead of jest.
The provided uvu/assert assertions should be enough for our purposes, but it's worth looking for alternatives.

NB. Avoid pointless duplication. Use helpers, these can be used as a base for most cases:

sigma/tests/@helpers/index.ts

Lines 39 to 55 in e76249b

    
export function testFailure<P extends () => Parser<unknown>>(input: string, parser: P) {
  const actual = run(parser(), input)
  const expected = result('failure', actual.kind === 'failure' ? actual.expected : actual.value)

  should.matchState(actual, expected)
}

export function testSuccess<T, P extends () => Parser<unknown>>(
  input: string,
  value: T,
  parser: P
) {
  const actual = run(parser(), input)
  const expected = result('success', value)

  should.matchState(actual, expected)
}

feat: add common parsers

To avoid reinventing the wheel every time, several vital parsers could be implemented and provided out-of-the-box, such as:

whitespace and optional whitespace.
letter (single) and letters (multiple), probably should be added along with #3.
char could be an ASCII char code or Unicode code point probably.
float and int (both signed and unsigned) with support for scientific form.
...

bug: `choice` incorrectly infers a type if given a spreaded array of parsers

When choice is given a spreaded array, e.g.:

choice(...['one', 'two'].map(string))

it incorrectly infers Parser<never> instead of Parser<string>. This is the problem with ToUnion type helper:

// Ok
type U1 = [Parser<string>, Parser<number>, Parser<boolean>]
type R1 = ToUnion<U1> // type R1 = string | number | boolean

// Wrong
type U2 = Array<Parser<string>>
type R2 = ToUnion<U2> // type R2 = never

This is why we need #48...
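One possible direction, shown here only as a sketch, is to index the tuple or array type with number before inferring, which handles both cases. The ToUnion below is a hypothetical replacement written for this example, not the library's helper, and the Parser types and the literal and choice functions are local stand-ins.

```typescript
type Span = [number, number]
type Success<T> = { isOk: true; span: Span; pos: number; value: T }
type Failure = { isOk: false; span: Span; pos: number; expected: string }
type Result<T> = Success<T> | Failure

interface Parser<T> {
  parse(input: string, pos: number): Result<T>
}

// Stand-in for the library's `string` parser.
function literal(match: string): Parser<string> {
  return {
    parse(input, pos) {
      const end = pos + match.length
      return input.startsWith(match, pos)
        ? { isOk: true, span: [pos, end], pos: end, value: match }
        : { isOk: false, span: [pos, pos], pos, expected: match }
    }
  }
}

// Index the array or tuple with `number` first, then infer the parsers' value type.
type ToUnion<T extends ReadonlyArray<Parser<unknown>>> =
  T[number] extends Parser<infer U> ? U : never

type R1 = ToUnion<[Parser<string>, Parser<number>, Parser<boolean>]> // string | number | boolean
type R2 = ToUnion<Array<Parser<string>>> // string

// Simplified `choice`: return the first successful result.
function choice<T extends ReadonlyArray<Parser<unknown>>>(...parsers: T): Parser<ToUnion<T>> {
  return {
    parse(input, pos) {
      for (const parser of parsers) {
        const result = parser.parse(input, pos)
        if (result.isOk) return result as Result<ToUnion<T>>
      }

      return { isOk: false, span: [pos, pos], pos, expected: 'one of the choices' }
    }
  }
}

// A spread array now yields Parser<string> in this sketch.
const word = choice(...['one', 'two'].map((text) => literal(text)))
```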

feat: optional spaces and whitespaces

What problem does this feature solve?

Currently the whitespace parser is strict and requires at least one character to match. There are many cases where I need to wrap it in optional, e.g. when the spaces before and after the argument parentheses are optional.

function main() {...}
function main (){...}

Describe the solution

I think aligning this parser with other API would be a little bit more convenient by matching the current pairs such as many and many1, sepBy and sepBy1:

whitespaces1 - requires a single or multiple characters, works as the current whitespace
whitespaces - requires zero or more whitespaces to match, works as optional(whitespace)
(optional) wrapWhitespaces and wrapWhitespaces1 - that should prevent writing the common pattern of surrounding whitespaces over and over again:

sequence(
    functionKeyword,
    takeMid(
        optional(whitespace()),
        functionName,
        optional(whitespace())
    )
)

/* could be turned into */

sequence(
    functionKeyword,
    wrapWhitespaces(functionName)
)
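A minimal sketch of the proposed helpers is given below. The names whitespaces and wrapWhitespaces come from the proposal above, while the implementation, the local Parser types, and the literal stand-in are assumptions. In this sketch the wrapped result keeps the inner parser's span and only the position moves past the trailing whitespace.

```typescript
type Span = [number, number]
type Success<T> = { isOk: true; span: Span; pos: number; value: T }
type Failure = { isOk: false; span: Span; pos: number; expected: string }
type Result<T> = Success<T> | Failure

interface Parser<T> {
  parse(input: string, pos: number): Result<T>
}

// Stand-in for the library's `string` parser.
function literal(match: string): Parser<string> {
  return {
    parse(input, pos) {
      const end = pos + match.length
      return input.startsWith(match, pos)
        ? { isOk: true, span: [pos, end], pos: end, value: match }
        : { isOk: false, span: [pos, pos], pos, expected: match }
    }
  }
}

// Zero or more whitespace characters; always succeeds.
function whitespaces(): Parser<string> {
  return {
    parse(input, pos) {
      let end = pos
      while (end < input.length && /\s/.test(input.charAt(end))) end++

      return { isOk: true, span: [pos, end], pos: end, value: input.slice(pos, end) }
    }
  }
}

// Skip optional whitespace on both sides of `parser` and keep only its value.
function wrapWhitespaces<T>(parser: Parser<T>): Parser<T> {
  const space = whitespaces()

  return {
    parse(input, pos) {
      const before = space.parse(input, pos)
      const result = parser.parse(input, before.pos)
      if (!result.isOk) return result

      const after = space.parse(input, result.pos)
      return { ...result, pos: after.pos }
    }
  }
}
```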

test: investigate adopting `vitest` instead of `uvu`

Postinstall fails on 3.6.3

Describe the bug

npm install
npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE   package: '@nrsk/sigma@3.6.3',
npm WARN EBADENGINE   required: { node: '>=18.16.0 <=20' },
npm WARN EBADENGINE   current: { node: 'v18.15.0', npm: '9.5.0' }
npm WARN EBADENGINE }
npm ERR! code 1
npm ERR! path /Users/redacted/project/node_modules/@nrsk/sigma
npm ERR! command failed
npm ERR! command sh -c npm run install:benchmarks && npm run install:docs
npm ERR! > @nrsk/sigma@3.6.3 install:benchmarks
npm ERR! > cd benchmarks && npm i
npm ERR! sh: line 0: cd: benchmarks: No such file or directory

Repro steps

Install 3.6.3 locally. I originally discovered the bug when deploying to Render, which picked up the latest version because my package.json specifies ^3.6.2 for sigma.

System information

macOS Ventura 13.2.1
Shell: zsh 5.8.1

feat: throwing runner

Add tryRun helper, that will, unlike run, throw in case of failure.
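The difference between the two runners can be sketched as below. The run(parser).with(input) call shape follows the README example; the types are local stand-ins, and the error message format thrown by this tryRun is an assumption.

```typescript
type Span = [number, number]
type Success<T> = { isOk: true; span: Span; pos: number; value: T }
type Failure = { isOk: false; span: Span; pos: number; expected: string }
type Result<T> = Success<T> | Failure

interface Parser<T> {
  parse(input: string, pos: number): Result<T>
}

// Returns the result as-is; the caller has to check `isOk`.
function run<T>(parser: Parser<T>) {
  return {
    with(input: string): Result<T> {
      return parser.parse(input, 0)
    }
  }
}

// Throws on failure, so the return type is always a success.
function tryRun<T>(parser: Parser<T>) {
  return {
    with(input: string): Success<T> {
      const result = parser.parse(input, 0)

      if (!result.isOk) {
        throw new Error(`Expected ${result.expected} at position ${result.pos}`)
      }

      return result
    }
  }
}
```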

docs: explain how to write custom parsers and combinators

Explain how to write custom parsers and combinators.

Docs: `run` and `tryRun` aren't parsers

I had trouble finding the run and tryRun functions, which are located in the "Parsers" section.

I would suggest adding a "Core" section at the top of the menu, and document the core functions of the library API there.

Probably moving the code itself into a "core" folder would make sense as well?

Let me know if you'd like a PR.

ustring() docs/questions

What is the ustring parser for? Assuming you load a valid unicode text file (which in this day and age is every text file) wouldn't it just match everything?

Best guess, this is for something like validating correct binary encoding of JSON files? But is it actually possible to load a non-valid unicode text file into a string in JavaScript?

I was expecting I'd use this for, say, keywords.

But then the "success" example in the documentation says:

Note that the index is 12, which is correct, since every hieroglyph here takes 3 bytes.

String operations in JS generally operate in ranges of code points:

So these numbers aren't useful for error reporting, or any subsequent string operation in JS really.

Text editors usually measure positions in code points as well:

The documentation itself explains:

This parser is very similar to the string parser, except it takes a bit hacky (though performant) approach, that is based on counting length of the given match string in bytes. It then subslices and compares string slice with that match string.

"hacky though performant", but it seems like this is doing a lot of unnecessary work to figure out a string position that isn't useful for most common use cases, like just matching a keyword or symbol, isn't it?

~~What I was expecting was a simpler parser that would use String.prototype.includes, which ought to be the fastest native way to check for a specific string at a specific offset, I think?~~

EDIT: oh, whoops, now I get it! I avoided string because it specifically says it will match "ASCII", which is incorrect. It would in fact match whatever Unicode characters you put in the string. Looks like a documentation problem.

~~But I don't see any other parser for simple strings - and nothing relevant in the codebase calling includes.~~

EDIT: looks like maybe there is room for a small optimization here to avoid copying.

~~It's also difficult to think of a name for such a parser, now that string is taken.~~ 😅

(I know I'm submitting a lot of feedback! I am already somewhat invested in this lovely library, and I do want to help out - if you want me to submit PRs for anything, let me know.)

EDIT: let me know if you'd like me to correct the documentation and/or try the minor optimization/simplification with includes in the string parser.
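The gap between byte counts and JavaScript string positions can be shown with a few lines. The sample text below is made up for this illustration and is not the string from the documentation. Note that string indices in JavaScript count UTF-16 code units, which differ from both UTF-8 bytes and code points. The matchesAt helper is hypothetical and uses String.prototype.startsWith with a position argument, which compares at an offset without creating a substring.

```typescript
const sample = '語言處理'

// Each of these characters takes 3 bytes in UTF-8 but 1 UTF-16 code unit.
const utf8Bytes = new TextEncoder().encode(sample).length
const codeUnits = sample.length

// An emoji outside the Basic Multilingual Plane: 4 bytes, 2 code units, 1 code point.
const emoji = '😅'
const emojiBytes = new TextEncoder().encode(emoji).length
const emojiCodeUnits = emoji.length
const emojiCodePoints = Array.from(emoji).length

// Hypothetical helper: check for `match` at `pos` without slicing the input.
function matchesAt(input: string, match: string, pos: number): boolean {
  return input.startsWith(match, pos)
}

console.log({ utf8Bytes, codeUnits, emojiBytes, emojiCodeUnits, emojiCodePoints })
```

A position of 12 for the sample would therefore point past the end of the string for slice, charAt, and similar operations, which work in code units.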

feat(combinators/not): add `not` combinator

docs: add `consuming` and `non-consuming` labels/badges

Something similar to primitive and composite badges. These should be added to tsdocs as well.

docs(types): thoroughly document user-facing types

Library types:

Failure and Success, as mentioned in #82;
Parser<T> union variants.

Check utility types' docs.

docs: automagically build sidebar from the `content` directory structure

Right now sidebar's implementation is kinda clunky and annoying in that it is hardcoded. It would be much better to build the sidebar from the content directory, mirroring its structure and leveraging entries' front-matter.

Result model: span vs pos

Hey,

I was looking over these types:

sigma/src/types/library.ts

Lines 22 to 36 in 261f2f8

    
/** Represents failed execution. */
export type Failure = {
  readonly isOk: false
  readonly span: Span
  readonly pos: number
  readonly expected: string
}

/** Represents successful execution. */
export type Success<T> = {
  readonly isOk: true
  readonly span: Span
  readonly pos: number
  readonly value: T
}

I see that spans were added on later.

Maybe I'm missing something, but I was wondering:

In a Success, isn't pos always going to be identical to span[1]?
In a Failure, isn't span[0] always going to be identical to span[1] as well as to pos?

A failure always happens at one specific position, does it not? When would the span mean anything?

And a success always has both a start and an end - even if these are identical, that would signify a zero-length match, so both values are always meaningful, right?

Any particular reason you wouldn't just deprecate or remove pos?

Or just have plain start and end properties for Success, and pos for Failure? Is there any practical advantage to having those tuples? maybe for source maps or something? Does it matter if those properties have the same names/types?

Just wondering.

This library looks amazing btw. 😄

Feature: grammar helper

What problem does this feature solve?

This is mainly for convenience, but it does solve the problem with the existing defer function, which relies on initialization being separate from creation. My proposed grammar helper would be statically type-checked and wouldn't need an error handler.

Describe the solution

If we look at the example for defer:

sigma/docs/docs/content/parsers/defer.md

Lines 33 to 64 in 0d96737

    
interface NumberNode {
  type: 'number'
  span: Span
  value: number
}

interface ListNode {
  type: 'list'
  span: Span
  value: Array<NumberNode | ListNode>
}

const TupleList = defer<ListNode>()
const TupleNumber = defer<NumberNode>()

TupleNumber.with(
  map(
    integer(),
    (value, span) => ({ type: 'number', span, value })
  )
)

TupleList.with(
  map(
    takeMid(
      string('('),
      sepBy(choice(TupleList, TupleNumber), string(',')),
      string(')')
    ),
    (value, span) => ({ type: 'list', span, value })
  )
)

Here is that example implemented with the grammar helper:

  const tupleGrammar = grammar({
    tupleNumber(): Parser<NumberNode> {
      return map(
        integer(),
        (value, span) => ({ type: 'number', span, value })
      )
    },
    tupleList(): Parser<ListNode> {
      return map(
        takeMid(
          string('('),
          sepBy(choice(this.tupleList, this.tupleNumber), string(',')),
          string(')')
        ),
        (value, span) => ({ type: 'list', span, value })
      )
    },
  });

Here is a test demonstrating how to use the resulting grammar:

  is.equal(
    run(tupleGrammar.tupleList).with('(1,2,(3,(4,5)))'),
    {
      isOk: true,
      span: [ 0, 15 ],
      pos: 15,
      value: {
        type: 'list',
        span: [ 0, 15 ],
        value: [
          { type: 'number', span: [ 1, 2 ], value: 1 },
          { type: 'number', span: [ 3, 4 ], value: 2 },
          {
            type: 'list',
            span: [ 5, 14 ],
            value: [
              { type: 'number', span: [ 6, 7 ], value: 3 },
              {
                type: 'list',
                span: [ 8, 13 ],
                value: [
                  { type: 'number', span: [ 9, 10 ], value: 4 },
                  { type: 'number', span: [ 11, 12 ], value: 5 }
                ]
              }
            ]
          }
        ]
      }
    }
  );

And here is my preliminary implementation:

import { Parser } from "@nrsk/sigma";

type Grammar<T> = {
  [P in keyof T]: T[P] extends () => any ? ReturnType<T[P]> : never;
};

type GrammarInit<T> = T & ThisType<Grammar<T>>;

type GrammarType = {
  [name: string]: () => Parser<any>;
};

export function grammar<T extends GrammarType>(init: GrammarInit<T>): Grammar<T> {
  const grammar = {} as { [key: string]: Parser<any> };

  const initialized = {} as { [key: string]: true };
  
  for (const key in init) {
    grammar[key] = {
      parse(input, pos) {
        if (! initialized[key]) {
          initialized[key] = true;

          grammar[key] = (init[key] as any).apply(grammar);
        }
        
        return grammar[key].parse(input, pos);
      },
    } as Parser<any>;
  }

  return grammar as Grammar<T>;
}

Here is a screenshot demonstrating IDE support:

As you can see, this works with the circular references, which is possible with the magical ThisType in TS.

It's doing basically the same thing as defer for each member, so of course this works with circular references at run-time as well.

I didn't benchmark it against defer, and it might need some optimization, and the types could probably use a little work.

But what do you think, would you welcome a PR for this feature? 🙂
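For readers who want to try the idea without installing anything, here is a self-contained sketch of the same lazy-initialization technique. The `Result` and `Parser` types, the `char` helper, and the `nested` rule are hypothetical stand-ins written for this illustration; they are not Sigma's actual exports. The rule methods carry explicit return types so that type inference does not depend on the circular references.

```typescript
// Minimal stand-ins for the library types (assumption: not Sigma's real definitions).
type Result<T> =
  | { isOk: true; pos: number; value: T }
  | { isOk: false; pos: number; expected: string }

interface Parser<T> {
  parse(input: string, pos: number): Result<T>
}

type Rules = { [name: string]: () => Parser<any> }
type Grammar<T> = { [P in keyof T]: T[P] extends () => any ? ReturnType<T[P]> : never }

// Same approach as the proposal: every rule starts as a wrapper that builds
// the real parser on first use, so rules can refer to each other via `this`.
function grammar<T extends Rules>(init: T & ThisType<Grammar<T>>): Grammar<T> {
  const rules: { [key: string]: Parser<any> } = {}
  const initialized: { [key: string]: true } = {}

  for (const key in init) {
    rules[key] = {
      parse(input: string, pos: number): Result<any> {
        if (!initialized[key]) {
          initialized[key] = true
          rules[key] = (init[key] as any).apply(rules)
        }
        return rules[key].parse(input, pos)
      }
    }
  }

  return rules as unknown as Grammar<T>
}

// Hypothetical helper: matches exactly one character.
const char = (c: string): Parser<string> => ({
  parse: (input: string, pos: number): Result<string> =>
    input.charAt(pos) === c
      ? { isOk: true, pos: pos + 1, value: c }
      : { isOk: false, pos, expected: c }
})

const g = grammar({
  open(): Parser<string> { return char('(') },
  close(): Parser<string> { return char(')') },
  // Self-referencing rule: '(' nested ')' or nothing; the value is the nesting depth.
  nested(): Parser<number> {
    const { open, close, nested } = this
    return {
      parse(input: string, pos: number): Result<number> {
        const o = open.parse(input, pos)
        if (!o.isOk) return { isOk: true, pos, value: 0 }
        const inner = nested.parse(input, o.pos)
        if (!inner.isOk) return inner
        const c = close.parse(input, inner.pos)
        if (!c.isOk) return c
        return { isOk: true, pos: c.pos, value: inner.value + 1 }
      }
    }
  }
})
```

Calling `g.nested.parse('((()))', 0)` walks through the wrappers exactly once per rule before the real parsers take over, which is the behaviour the proposal relies on.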

docs: standalone documentation

Standalone documentation built with any SSG that will be deployed with Vercel.
Create a subdomain: https://sigma.vm.codes

feat: add unicode support

Pretty self-explanatory.

Optimizations

I was curious to see how Sigma would stack up against other parsers - the biggest benchmark I know is Chevrotain's, so I added Sigma's JSON parser example to it:

mindplay-dk/chevrotain@d8fd236

Although it is 4 times slower than Chevrotain, Sigma is definitely in the lead 🙂

Chevrotain is the fastest JS parser library I know of, so it would probably be difficult to beat.

And of course, this is without making any attempts to optimize Sigma's implementation of the JSON parser at all.

I did a quick profile, and sequence looks like the biggest bottleneck at the moment:

It might be worth optimizing and benchmarking further. Chevrotain might be worth referencing for optimizations as well; I know the author put a lot of work into that.

As previously mentioned, performance is not the main reason I picked this library, but I do think it's important, and if there are any "easy wins", it might be worthwhile to investigate this a bit further.

I might take a closer look at some point - just leaving this here for now. 🙂

defer() should error?

Just wondering, I noticed this:

sigma/src/parsers/defer.ts

Lines 72 to 77 in 0d96737

    
    return {
      isOk: false,
      span: [pos, pos],
      pos,
      expected: `Deferred parser wasn't initialized.`
    }

I'm not sure this makes sense as a parser error?

It's not that parsing failed - it's that there is something wrong with your code.

So I think maybe it would be more appropriate to throw an error here?

You might have had some sort of reason for this - it looks a little off to me, so I figured I'd ask. 🙂
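To make the suggestion concrete, here is a minimal sketch of a deferred parser that throws when it is used before being initialized. The `Result`, `Parser`, and `Deferred` types and the name `deferStrict` are hypothetical and written only for this illustration; this is not the code in sigma/src/parsers/defer.ts.

```typescript
// Stand-in types (assumption: simplified compared to the library's own).
type Result<T> =
  | { isOk: true; pos: number; value: T }
  | { isOk: false; pos: number; expected: string }

interface Parser<T> {
  parse(input: string, pos: number): Result<T>
}

interface Deferred<T> extends Parser<T> {
  with(parser: Parser<T>): void
}

// Hypothetical variant of defer: a missing initialization is treated as a
// programming error and raised as an exception, not reported as a parse failure.
function deferStrict<T>(): Deferred<T> {
  let inner: Parser<T> | undefined

  return {
    with(parser: Parser<T>): void {
      inner = parser
    },
    parse(input: string, pos: number): Result<T> {
      if (inner === undefined) {
        throw new Error("Deferred parser wasn't initialized.")
      }
      return inner.parse(input, pos)
    }
  }
}
```

With this shape, forgetting to call `with` surfaces immediately with a stack trace, while ordinary input mismatches still come back as `isOk: false` results.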

What should be done

Add a new parser: hexadecimal. Should parse a positive whole number in the hexadecimal system. The number should be prefixed with 0x or 0X. Ex: 0xDF, 0X1F.
Add a new parser: octal. Should parse a positive whole number in the octal system. The number should be prefixed with 0o or 0O. Ex: 0o230, 0O11.
Add a new parser: binary. Should parse a positive whole number in the binary system. The number should be prefixed with 0b or 0B. Ex: 0b1111, 0B1111.
Add a new parser: whole. Should parse a positive whole number in the decimal system. Ex: 0, 1, 2.
Rename int to integer. The same as whole, except it can be prefixed with a minus sign (-).

Notes

All parsers return a parsed string, parsing into actual numbers should be done in the userland on demand.
If possible, do not use regular expressions for parsing numbers, do benchmarks. Regular expressions are probably faster than comparing chars one-by-one.
Do not use factory functions like here to reduce the number of calls.
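As an illustration of the char-by-char option mentioned in the notes, here is a rough sketch of a hexadecimal matcher that avoids regular expressions and returns the matched string. The function names, the `Result` shape, the error message, and the choice to keep the 0x prefix in the returned value are all assumptions made for this example; whether this approach beats a regular expression would still need to be benchmarked.

```typescript
// Stand-in result type modelled on the isOk/span/pos shape (assumption).
type Result =
  | { isOk: true; span: [number, number]; pos: number; value: string }
  | { isOk: false; span: [number, number]; pos: number; expected: string }

// True for the char codes of 0-9, A-F and a-f.
function isHexDigit(code: number): boolean {
  return (
    (code >= 48 && code <= 57) ||
    (code >= 65 && code <= 70) ||
    (code >= 97 && code <= 102)
  )
}

// Hypothetical matcher: expects a 0x or 0X prefix followed by at least one hex digit.
function parseHexadecimal(input: string, pos: number): Result {
  const failure: Result = {
    isOk: false,
    span: [pos, pos],
    pos,
    expected: 'hexadecimal number'
  }

  // charCodeAt returns NaN past the end of the input, so these checks also cover short inputs.
  if (input.charCodeAt(pos) !== 48) return failure // '0'

  const prefix = input.charCodeAt(pos + 1)
  if (prefix !== 120 && prefix !== 88) return failure // 'x' or 'X'

  let end = pos + 2
  while (end < input.length && isHexDigit(input.charCodeAt(end))) {
    end++
  }

  // A bare prefix with no digits is not a number.
  if (end === pos + 2) return failure

  // The value stays a string; converting it to a number is left to userland.
  return { isOk: true, span: [pos, end], pos: end, value: input.slice(pos, end) }
}
```

The octal and binary parsers could follow the same pattern with a different prefix check and digit predicate.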

return [
  Sidebar.group('Introduction', '/introduction', [
    Sidebar.item('Getting started', '/getting-started')
  ]),
  Sidebar.group('Guides', '/guides', [
    Sidebar.item('Primitives and composites', '/primitives-and-composites')
  ]),
  Sidebar.group('Combinators', '/combinators', [
    Sidebar.item('chainl', '/chainl'),
    Sidebar.item('choice', '/choice'),
    Sidebar.item('error', '/error'),
    Sidebar.item('many', '/many'),
    Sidebar.item('many1', '/many1'),
    Sidebar.item('map', '/map'),
    Sidebar.item('mapTo', '/mapTo'),
    Sidebar.item('optional', '/optional'),
    Sidebar.item('sepBy', '/sepBy'),
    Sidebar.item('sepBy1', '/sepBy1'),
    Sidebar.item('sequence', '/sequence'),
    Sidebar.item('skipUntil', '/skipUntil'),
    Sidebar.item('takeLeft', '/takeLeft'),
    Sidebar.item('takeMid', '/takeMid'),
    Sidebar.item('takeRight', '/takeRight'),
    Sidebar.item('takeSides', '/takeSides'),
    Sidebar.item('takeUntil', '/takeUntil'),
    Sidebar.item('when', '/when')
  ]),
  Sidebar.group('Parsers', '/parsers', [
    Sidebar.item('any', '/any'),
    Sidebar.item('binary', '/binary'),
    Sidebar.item('defer', '/defer'),
    Sidebar.item('eof', '/eof'),
    Sidebar.item('eol', '/eol'),
    Sidebar.item('float', '/float'),
    Sidebar.item('hex', '/hex'),
    Sidebar.item('integer', '/integer'),
    Sidebar.item('letter', '/letter'),
    Sidebar.item('letters', '/letters'),
    Sidebar.item('noneOf', '/noneOf'),
    Sidebar.item('nothing', '/nothing'),
    Sidebar.item('octal', '/octal'),
    Sidebar.item('oneOf', '/oneOf'),
    Sidebar.item('regexp', '/regexp'),
    Sidebar.item('rest', '/rest'),
    Sidebar.item('run', '/run'),
    Sidebar.item('string', '/string'),
    Sidebar.item('tryRun', '/tryRun'),
    Sidebar.item('ustring', '/ustring'),
    Sidebar.item('whitespace', '/whitespace'),
    Sidebar.item('whole', '/whole')
  ])
]

export function testFailure<P extends () => Parser<unknown>>(input: string, parser: P) {
  const actual = run(parser(), input)
  const expected = result('failure', actual.kind === 'failure' ? actual.expected : actual.value)

  should.matchState(actual, expected)
}

export function testSuccess<T, P extends () => Parser<unknown>>(
  input: string,
  value: T,
  parser: P
) {
  const actual = run(parser(), input)
  const expected = result('success', value)

  should.matchState(actual, expected)
}

/** Represents failed execution. */
export type Failure = {
  readonly isOk: false
  readonly span: Span
  readonly pos: number
  readonly expected: string
}

/** Represents successful execution. */
export type Success<T> = {
  readonly isOk: true
  readonly span: Span
  readonly pos: number
  readonly value: T
}

interface NumberNode {
  type: 'number'
  span: Span
  value: number
}

interface ListNode {
  type: 'list'
  span: Span
  value: Array<NumberNode | ListNode>
}

const TupleList = defer<ListNode>()
const TupleNumber = defer<NumberNode>()

TupleNumber.with(
  map(
    integer(),
    (value, span) => ({ type: 'number', span, value })
  )
)

TupleList.with(
  map(
    takeMid(
      string('('),
      sepBy(choice(TupleList, TupleNumber), string(',')),
      string(')')
    ),
    (value, span) => ({ type: 'list', span, value })
  )
)

