norskeld / sigma Goto Github PK
View Code? Open in Web Editor NEWTypeScript parser combinator library for building fast and convenient parsers.
Home Page: https://sigma.nrsk.dev
License: MIT License
TypeScript parser combinator library for building fast and convenient parsers.
Home Page: https://sigma.nrsk.dev
License: MIT License
unbuild looks very similar to what I have in my custom build/compile script and the setup relying on Rollup in general, so it should be easy to add it to the project and get everything up and running.
Would be great if I could do the same with release script...
Right now sidebar's implementation is kinda clunky and annoying in that it is hardcoded. It would be much better to build the sidebar from the content
directory, mirroring its structure and leveraging entries' front-matter.
Provide means for creating context-aware (chained) parsers.
When sepBy
fails to match at all, it still updates the cursor position. This causes subsequent parsers to fail due to skipped input.
In contrast, many
โ which also always succeeds โ does not update cursor position when it fails to match at all.
I can't figure out how to get past the pre-commit hooks, so here's a diff of the test and fix instead of a PR:
diff --git a/src/__tests__/combinators/sepBy.spec.ts b/src/__tests__/combinators/sepBy.spec.ts
index ca348b9..0cda6a1 100644
--- a/src/__tests__/combinators/sepBy.spec.ts
+++ b/src/__tests__/combinators/sepBy.spec.ts
@@ -1,4 +1,4 @@
-import { sepBy, sepBy1 } from '@combinators'
+import { sepBy, sepBy1, sequence } from '@combinators'
import { string } from '@parsers'
import { run, result, should, describe, it } from '@testing'
@@ -26,6 +26,17 @@ describe('sepBy', () => {
should.matchState(actual, expected)
})
+
+ it('should successfully continue if nothing matched', () => {
+ const parser = sequence(
+ sepBy(string('hello'), string('?')),
+ sepBy(string('bye'), string('?')),
+ )
+ const actual = run(parser, 'bye?bye?')
+ const expected = result(true, [[], ['bye', 'bye']])
+
+ should.matchState(actual, expected)
+ })
})
describe('sepBy1', () => {
diff --git a/src/combinators/sepBy.ts b/src/combinators/sepBy.ts
index 51b81b1..f42bc32 100644
--- a/src/combinators/sepBy.ts
+++ b/src/combinators/sepBy.ts
@@ -14,6 +14,7 @@ import type { Parser } from '@types'
export function sepBy<T, S>(parser: Parser<T>, sep: Parser<S>): Parser<Array<T>> {
return {
parse(input, pos) {
// Run the parser once to get the first value.
const resultP = parser.parse(input, pos)
@@ -37,8 +38,8 @@ export function sepBy<T, S>(parser: Parser<T>, sep: Parser<S>): Parser<Array<T>>
return {
isOk: true,
- span: [pos, resultP.pos],
- pos: resultP.pos,
+ span: [pos, pos],
+ pos,
value: []
}
}
And here's the test again for visibility, which fails on the current version (3.6.2):
it('should successfully continue if nothing matched', () => {
const parser = sequence(
sepBy(string('hello'), string('?')),
sepBy(string('bye'), string('?')),
)
const actual = run(parser, 'bye?bye?')
const expected = result(true, [[], ['bye', 'bye']])
should.matchState(actual, expected)
})
Test utility and other types with expect-type, because we have tricky ones, with recursion and conditionals. It's (almost) never a bad thing to write a couple more tests.
Explain how to write custom parsers and combinators.
Provide means to recover from errors.
Maybe. Probably.
After working with some rust parser combinator libraries like chumsky, I feel like it would be really handy to provide capabilities to either produce spans by default, or allow to map with spans.
A span is simply a pair of numbers, a tuple like [start: number, end: number]
, which points to some range in the source code we are parsing or parsed. That is actually a must for quality error reporting and diagnostics.
I had trouble finding the run
and tryRun
functions, which are located in the "Parsers" section.
I would suggest adding a "Core" section at the top of the menu, and document the core functions of the library API there.
Probably moving the code itself into a "core" folder would make sense as well?
Let me know if you'd like a PR.
VitePress is still in alpha, but from what I've seen and actually played around, it's already okay and provides all essential and usefull stuff out-of-the-box.
This is mainly for convenience - but it does solve the problem with the existing defer
function, which relies initialization separate from creation. My proposed grammar
helper would be statically type-checked, and wouldn't need an error-handler.
If we look at the example for defer
:
sigma/docs/docs/content/parsers/defer.md
Lines 33 to 64 in 0d96737
Here is that example implemented with the grammar
helper:
const tupleGrammar = grammar({
tupleNumber(): Parser<NumberNode> {
return map(
integer(),
(value, span) => ({ type: 'number', span, value })
)
},
tupleList(): Parser<ListNode> {
return map(
takeMid(
string('('),
sepBy(choice(this.tupleList, this.tupleNumber), string(',')),
string(')')
),
(value, span) => ({ type: 'list', span, value })
)
},
});
Here is a test demonstrating how to use the resulting grammar:
is.equal(
run(tupleGrammar.tupleList).with('(1,2,(3,(4,5)))'),
{
isOk: true,
span: [ 0, 15 ],
pos: 15,
value: {
type: 'list',
span: [ 0, 15 ],
value: [
{ type: 'number', span: [ 1, 2 ], value: 1 },
{ type: 'number', span: [ 3, 4 ], value: 2 },
{
type: 'list',
span: [ 5, 14 ],
value: [
{ type: 'number', span: [ 6, 7 ], value: 3 },
{
type: 'list',
span: [ 8, 13 ],
value: [
{ type: 'number', span: [ 9, 10 ], value: 4 },
{ type: 'number', span: [ 11, 12 ], value: 5 }
]
}
]
}
]
}
}
);
And here is my preliminary implementation:
import { Parser } from "@nrsk/sigma";
type Grammar<T> = {
[P in keyof T]: T[P] extends () => any ? ReturnType<T[P]> : never;
};
type GrammarInit<T> = T & ThisType<Grammar<T>>;
type GrammarType = {
[name: string]: () => Parser<any>;
};
export function grammar<T extends GrammarType>(init: GrammarInit<T>): Grammar<T> {
const grammar = {} as { [key: string]: Parser<any> };
const initialized = {} as { [key: string]: true };
for (const key in init) {
grammar[key] = {
parse(input, pos) {
if (! initialized[key]) {
initialized[key] = true;
grammar[key] = (init[key] as any).apply(grammar);
}
return grammar[key].parse(input, pos);
},
} as Parser<any>;
}
return grammar as Grammar<T>;
}
Here is a screenshot demonstrating IDE support:
As you can see, this works with the circular references, which is possible with the magical ThisType
in TS.
It's doing basically the same thing as defer
for each member, so of course this works with circular references at run-time as well.
I didn't benchmark it against defer
, and it might need some optimization, and the types could probably use a little work.
But what do you think, would you welcome a PR for this feature? ๐
sigma
is so well written, feature packed, and to a high level of perfection.
How come that no one uses it?
https://www.npmjs.com/package/@nrsk/sigma
Currently the whitespace
parser is strict and requires at least one character to be matched. There are many cases where I need to wrap it into optional
parser e.g. the spaces before and after argument braces are optional.
function main() {...}
function main (){...}
I think aligning this parser with other API would be a little bit more convenient by matching the current pairs such as many
and many1
, sepBy
and sepBy1
:
whitespaces1
- requires a single or multiple characters, works as the current whitespace
whitespaces
- requires zero or more whitespaces to match, works as optional(whitespace)
wrapWhitespaces
and wrapWhitespaces1
- that should prevent writing the common pattern of surrounding whitespaces over and over again:sequence(
functionKeyword,
takeMid(
optional(whitespace),
functionName,
optional(whitespace)
)
)
/* could be turned into */
sequence(
functionKeyword,
wrapWhitespaces(functionName)
)
Right now every signature for every parser/combinator is copy-pasted from the source code. This could be "automated" to some extent by leveraging code snippet imports in VitePress.
I was curious to see how Sigma would stack up against other parsers - the biggest benchmark I know is Chevrotain's, so I added Sigma's JSON parser example to it:
mindplay-dk/chevrotain@d8fd236
although it is 4 times slower than Chevrotain, Sigma is definitely in the lead ๐
Chevrotain is the fastest JS parser library I know of, so it would probably be difficult to beat.
and of course, this is without making any attempts to optimize Sigma's implementation of the JSON parser at all.
I did a quick profile, and sequence
looks like the biggest bottleneck at the moment:
it might be worth optimizing and benchmarking further - Chevrotain might be worth referencing for optimizations as well, I know the author put a lot of work into that.
as previously mentioned, performance is not the main reason I picked this library - but I do think it's important, and if there are any "easy wins", it might be worth while investigating this a bit further.
I might take a closer look at some point - just leaving this here for now. ๐
To avoid reinventing the wheel every time, several vital parsers could be implemented and provided out-of-the-box, such as:
whitespace
and optional whitespace
.letter
(single) and letters
(multiple), probably should be added along with #3.char
could be an ASCII char code or Unicode code point probably.float
and int
(both signed and unsigned) with support for scientific form.There are multiple things described in this issue, but I think it's better to keep them close.
I'm doing some free text parsing migrating from the regular expressions based solution. I faced an issue that I couldn't fully specify the surrounding context. when
is the only parser that could work with context but documentation doesn't say anything whether it consumes input or not (I assume it does which raises a question how it is different from sequence
). Also when
allows to specify preceding context only and I'm not sure what should I use to specify the context after the target parser.
It would be nice to have non-consuming parser (context
) and negation parser (not
) so it would be possible to specify surrounding context e.g.
// context(before, target, optional after)
const helloWorld = string('hello world');
const strictHelloWorld = context(
not(letter()),
helloWorld,
not(whole())
);
const parser1 = sequence(any(), strictHelloWorld, any());
const parser2 = sequence(any(), helloWorld, any());
run(parser1).with('1hello worldA'); // result.isOk = true, value = ['1', 'hello world', 'A'];
run(parser1).with('Ahello world1'); // result.isOk = false
run(parser2).with('1hello worldA'); // result.isOk = true, value = ['1', 'hello world', 'A'];
run(parser2).with('Ahello world1'); // result.isOk = true, value = ['A', 'hello world', '1'];
Also it would be nice to have some default non-consuming parsers such as wordBoundary
. What is more this will require polyfilling JS \b
since it's not unicode friendly.
More examples can be found in the codesandbox.
Pretty self-explanatory.
I'm currently migrating parser from regular expressions based solution. I faced an issue that I couldn't just run parser over free text and I need to specify everything around this parser somehow. What is more one of the requirement that I have is to get the matched string out of the input (it is used to remove it from the free text later).
I'm currently using something like that to get the desired output:
import * as s from '@nrsk/sigma';
interface ExtractedResult<T> {
consumed: string;
rest: string;
value: T;
}
export function extract<T>(input: string, targetParser: s.Parser<T>): ExtractedResult<T> {
const wrappedParser = s.map(
s.sequence(
s.takeUntil(s.any(), targetParser),
s.rest(),
),
([[before, value], after]) => ([
before.join(''),
value,
after,
] as const)
);
const result = s.run(wrappedParser).with(input);
if (result.isOk) {
const [before, value, after] = result.value;
const consumed = input.replace(before, '').replace(after, '');
const rest = input.replace(consumed, '');
return {
value,
consumed,
rest,
};
} else {
throw new Error(result.expected);
}
};
While consumed
and rest
could be covered by span
feature, there is stil a need to make parser work with additional input like regular expressions do:
const parser = freeText(string('hello world'));
const re = /hello world/;
run(parser).with('RANDOM TEXT hello world RANDOM TEXT') // result.isOk = true, value = 'hello world'
re.exec('RANDOM TEXT hello world RANDOM TEXT') // matches: ['hello world']
run(parser).with('hello world') // result.isOk = true, value = 'hello world'
re.exec('hello world') // matches: ['hello world']
Right now, every time we want to add a new parser/combinator and write a documentation, we need to manually add an entry to the sidebar:
sigma/docs/content/.vitepress/config.ts
Lines 135 to 186 in 9dc2b1c
This is at the very least inconvenient. Scanning specific directories and extracting title
s from .md
files' frontmatter would be enough, since the structure is pretty flat and simple.
TSDoc annotations for better DX.
When choice
is given a spreaded array, e.g.:
choice(...['one', 'two'].map(string))
it incorrectly infers Parser<never>
instead of Parser<string>
. This is the problem with ToUnion
type helper:
// Ok
type U1 = [Parser<string>, Parser<number>, Parser<boolean>]
type R1 = ToUnion<U1> // type R1 = string | number | boolean
// Wrong
type U2 = Array<Parser<string>>
type R2 = ToUnion<U2> // type R2 = never
This is why we need #48...
Hey,
I was looking over these types:
Lines 22 to 36 in 261f2f8
I see that spans were added on later.
Maybe I'm missing something, but I was wondering:
Success
, isn't pos
always going to be identical to span[0]
?Failure
, isn't span[0]
always going to be identical to span[1]
as well as to pos
?A failure always happens at one specific position, does it not? When would the span
mean anything?
And a success always has both a start and an end - even if these are identical, that would signify a zero-length match, so both values are always meaningful, right?
Any particular reason you wouldn't just deprecate or remove pos
?
Or just have plain start
and end
properties for Success
, and pos
for Failure
? Is there any practical advantage to having those tuples? maybe for source maps or something? Does it matter if those properties have the same names/types?
Just wondering.
This library looks amazing btw. ๐
What is the ustring
parser for? Assuming you load a valid unicode text file (which in this day and age is every text file) wouldn't it just match everything?
Best guess, this is for something like validating correct binary encoding of JSON files? But is it actually possible to load a non-valid unicode text file into a string in JavaScript?
I was expecting I'd use this for, say, keywords.
But then the "success" example in the documentation says:
Note that the index is 12, which is correct, since every hieroglyph here takes 3 bytes.
String operations in JS generally operate in ranges of code points:
So these numbers aren't useful for error reporting, or any subsequent string operation in JS really.
Text editors usually measure positions in code points as well:
The documentation itself explains:
This parser is very similar to the string parser, except it takes a bit hacky (though performant) approach, that is based on counting length of the given match string in bytes. It then subslices and compares string slice with that match string.
"hacky though performant", but it seems like this is doing a lot of unnecessary work to figure out a string position that isn't useful for most common use cases, like just matching a keyword or symbol, isn't it?
What I was expecting was a simpler parser that would use String.prototype.includes
, which ought to be the fastest native way to check for a specific string at a specific offset, I think?
EDIT: oh, whoops, now I get it! I avoided string
, because it it specifically says this will match "ASCII", which is incorrect. It would in fact match whatever Unicode characters you put in the string. Looks like a documentation problem.
But I don't see any other parser for simple strings - and nothing relevant in the codebase calling includes
.
EDIT: looks like maybe there is room for a small optimization here to avoid copying.
It's also difficult to think of a name for such a parser, now that ๐
string
is taken.
(I know I'm submitting a lot of feedback! I am already somewhat invested in this lovely library, and I do want to help out - if you want me to submit PRs for anything, let me know.)
EDIT: let me know if you'd like me to correct the documentation and/or try the minor optimization/simplification with includes
in the string
parser.
Right now there're no "errors" per se, i.e. all sigma provides users with is text messages and ability to re-map those messages to something custom. This is a shame and should be improved.
Implementation of spans in #34 should help a bit, but we will also need to extend Parser<T>
signature with a second generic parameter E
, so parsers could bear error type information. All parsers and combinators will be changed accordingly, although I'm pretty sure there'll be hurdles here and there.
Additionally, there should be added two combinators:
mapErr(parser, fn)
- this combinator will map error from E
to some other type using given fn
.mapOrElse(parser, okFn, errFn)
- this combinator will conditionally apply okFn
or errFn
depending on the parser's result.It also makes sense to implement error recovery along with the stuff above. Hopefully, it will be enough to provide a single combinator:
recovery(parser, fn)
- this combinator takes a parser
and a recovery function fn
that should produce another parser; similar to when
combinator, but it acts only on failures.Making parsers named (adding name
property to Parser<T>
) wouldn't hurt as well.
internal
directory.index.ts
).In this issue I've learnt that the g
flag must be used in order to make regexp
parser work correctly.
Should regexp
parser check itself whether the g
flag was used? It would be a minor improvement but it will prevent many mistakes. I think there is always a chance to forget adding this flag even if you know about this rule.
Maybe the paragraph in the documentation explaining this requirement should be emphasized to make this moment more obvious. But I don't think that documentation itself would be enough and this rule should be validated programmatically.
Adding this check somewhere would be a great addition:
if (!re.global) throw new Error('regexp parser must use g flag to process input correctly');
Line 17 in 567b6f9
I can submit a PR myself if you agree with me on this point!
npm install
npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE package: '@nrsk/[email protected]',
npm WARN EBADENGINE required: { node: '>=18.16.0 <=20' },
npm WARN EBADENGINE current: { node: 'v18.15.0', npm: '9.5.0' }
npm WARN EBADENGINE }
npm ERR! code 1
npm ERR! path /Users/redacted/project/node_modules/@nrsk/sigma
npm ERR! command failed
npm ERR! command sh -c npm run install:benchmarks && npm run install:docs
npm ERR! > @nrsk/[email protected] install:benchmarks
npm ERR! > cd benchmarks && npm i
npm ERR! sh: line 0: cd: benchmarks: No such file or directory
Install 3.6.3 locally. I originally discovered the bug on trying to deploy to render since it caught the latest version due to my package.json having ^3.6.2
for sigma.
Library types:
Failure
and Success
, as mentioned in #82;Parser<T>
union variants.Check utility types' docs.
Something similar to primitive
and composite
badges. These should be added to tsdocs as well.
Add benchmarks and comparison with other similar libraries.
Reject ancient tech, embrace modern stuff.
Proof of concept implementation with basic combinators and capabilities.
Add tryRun
helper, that will, unlike run
, throw in case of failure.
Just wondering, I noticed this:
Lines 72 to 77 in 0d96737
I'm not sure this makes sense as a parser error?
It's not that parsing failed - it's that there is something wrong with your code.
So I think maybe it would be more appropriate to throw an error here?
You might have had some sort of reason for this - it looks a little off to me, so I figured I'd ask. ๐
What do you think about including a formatError
function?
I wrote a function that takes the input string and a Failure
instance, and prints out the surrounding +/- n lines with line-numbers etc.
I don't know if it's wise to include this in the library, since parser errors aren't the only type of errors you'd want to print - though, on the other hand, if the function was just something like formatError(input: string, offset: number, expectation: string)
, then this could be used to inject whatever expectation
you want, to print out semantic errors and so on.
Not 100% sure if this is a good idea, as error formats can be quite different in different projects - but on the other hand, it might be nice to have something you can use just to get your project started and get something useful on the screen without doing a lot of ground work?
Up to you, but I do have something I could PR, if you'd like. ๐
Add a new parser: hexadecimal. Should parse a positive whole number in the hexadecimal system. The number should be prefixed with 0x
or 0X
. Ex: 0xDF
, 0X1F
.
Add a new parser: octal. Should parse a positive whole number in the octal system. The number should be prefixed with 0o
or 0O
. Ex: 0o230
, 0O11
.
Add a new parser: binary. Should parse a positive whole number in the binary system. The number should be prefixed with 0b
or 0B
. Ex: 0b1111
, 0B1111
.
Add a new parser: whole. Should parse a positive whole number in the decimal system. Ex: 0
, 1
, 2
.
Rename int to integer. The same as whole, except it can be prefixed with a minus -
sign.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.