Comments (18)

nikomatsakis commented on May 22, 2024

It's time to make progress here. I have a concrete proposal I plan to experiment with. It does not yet handle multiple lexer states, but I'm interested in inferring those automatically based on the grammar instead.

My idea is roughly this: you can define a match block that describes your tokenizer. It consists of multiple sections corresponding to priorities. The regular expressions or strings within a section all have equal priority and hence must not match the same things:

match {
    r"[a-z]\w+", // anything beginning with lowercase letter
    "Foo" // Foo specifically
} else {
    r"[A-Z]\w+", // anything beginning with uppercase letter
    _, // include anything else mentioned in the grammar that doesn't already appear here
};

In addition, these entries can have actions. Right now, these actions can only take two forms:

  • skip (=> ()) -- when we find a match, ignore it entirely
  • rename (=> S where S is some symbol) -- produce this terminal that is used elsewhere in the grammar

The default action we saw above is then equivalent to an identity rename; e.g., "Foo" is equivalent to "Foo" => "Foo". The _ just means "scrape the grammar for literals and regexes that don't already appear here and insert them" (e.g., maybe the grammar has r"[0-9]+" for numbers; that would get inserted there).
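
As a small combined sketch of the two actions plus _ (the whitespace and NUMBER entries here are invented, purely for illustration):

match {
    r"\s+" => (),        // skip: the matched text produces no terminal at all
    r"[0-9]+" => NUMBER, // rename: the grammar refers to this regex as NUMBER
    "Foo",               // identity rename, i.e. "Foo" => "Foo"
    _, // plus whatever other literals/regexes the grammar mentions
};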

So, to accommodate Pascal, where the reserved words are case insensitive, I might do:

match {
    r"(?i)begin" => "BEGIN",
    r"(?i)end" => "END",
} else {
    r"[a-zA-Z_][a-zA-Z0-9_]*" => IDENTIFIER,
};

This would declare that:

  1. begin/end have higher priority, and
  2. I will refer to them in my grammar as "BEGIN" and "END";
  3. identifiers have lower priority;
  4. and I will refer to them as IDENTIFIER.

Note that you can map to any symbol name, with or without quotes etc.

So then I can write my grammar using those "cleaned-up" names:

BLOCK = "BEGIN" ... "END";
EXPRESSION = IDENTIFIER ...;

I feel good about this.

anchpop commented on May 22, 2024

I would be happy to work on an implementation of something like what's described in this whitepaper, as it would dramatically simplify some of my code. But I would need some guidance from someone experienced with the library on what they think the best way to implement it would be. In addition, being able to configure the lexer to emit ENDLINE or STARTLINE tokens would allow some languages to be described without much more complexity on LALRPOP's side, I think.

nikomatsakis commented on May 22, 2024

We currently have some simple tokenizer generation support -- it has no states, and you can't write any custom action code at all. There are some immediate needs for extension described in #14, but it'd be nice to tie this all together in a grand, unified design.

nikomatsakis commented on May 22, 2024

My current thought is something like this. You can add match sections into the grammar, with an optional priority annotation. These can contain either literal strings or regular expressions.

#[priority=2]
match {
    "foo" => action, // literal text
    r"bar" => action, // regular expression
    "foo" =>? action, // fallible action
}

The action will be to produce something that supports IntoIterator<Item=TheToken>. Thus you can return None to just skip the token, Some(foo) to produce one token, or vec![...] to produce many tokens (or other such things). If the rule is fallible, you should produce Result<T,E> where E is a ParseError and T is something supporting IntoIterator, as above.
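
To make the shape of those actions concrete, a purely hypothetical sketch (none of this is implemented; Tok, its variants, and some_fallible_action are invented names):

match {
    r"\s+" => None,                            // IntoIterator yielding no tokens: skip
    r"[0-9]+" => Some(Tok::Number),            // yields exactly one token
    r">>>" => vec![Tok::Gt, Tok::Gt, Tok::Gt], // yields several tokens from one match
    r"\\x[0-9a-f]+" =>? some_fallible_action,  // Result<T, ParseError> where T: IntoIterator
}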

I had imagined that the internal token form would "desugar" to this. One catch is that if you are hand-writing code, you would presumably want to specify your own token type. The current internal tokenizer just generates tuples like (u32, &'input str), where the integer identifies which token it is. This is pretty opaque and not something you could feed by hand (OTOH, the main use for this form is to specify the whitespace to skip, in which case you'd just return None).

The main use case for producing multiple tokens is something like making >>> a single match that then generates 3 tokens, one for each >. (Here, the first two would be a special sort of > that indicates that another > follows immediately afterwards.) This could also be accommodated by incorporating lookahead into the regular expressions, I guess. I'd have to think about how best to do that (or read some papers).

While resolving these annoying questions, I could JUST support this form, to allow skipping comments:

match {
    "//[^\n]*" => None, 
}

(However, we also want to consider states, so we can support nested /* comments. For that I imagined match in state, and returning something like Push(state, tokens), but that also requires some thought about how the states are reflected into rust code, etc. Maybe it should be special syntax. Bah.)
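
A very rough guess at what that might look like, purely to make the idea concrete (match in, Push, and Pop are not real syntax; Pop is my extrapolation by analogy with Push, and COMMENT and DEFAULT are invented state names):

match in DEFAULT {
    r"/\*" => Push(COMMENT, None), // enter the comment state, producing no tokens
}

match in COMMENT {
    r"/\*" => Push(COMMENT, None), // nested open: push another comment level
    r"\*/" => Pop(None),           // close one level
} else {
    r"(?s)." => None,              // skip everything else inside the comment
}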

nikomatsakis commented on May 22, 2024

I've been looking at the code a bit. I'm going to take a few notes on a possible implementation plan, though I've run out of "side project" time for this morning:

  • right now, name resolution takes place in two places:
    • for non-terminals, that is resolve/mod.rs. It cares about terminals only insofar as it needs to ignore them. This works by first identifying the set of non-quoted identifiers (e.g., IDENTIFIER) that are actually terminals. Currently, that can only occur if you have an external tokenizer, so it finds an extern section for that and extracts out the LHS of those pattern declarations.
    • later, in token_check, we walk the terminals. If an external tokenizer is defined, we check that the terminal is in that set. Otherwise we collect the literals ("foo" or r"foo").
  • token-check then constructs an InternTok structure, presuming no external tokenizer has been defined. This basically means we construct the DFA.

To make this system I am describing here work, we have to make changes in a few places.

First and foremost, we have to parse the match declaration above (which I will call a "tokenizer declaration"). When we lower from the parse-tree to the repr, I would then make the invariant that we have either an external tokenizer or an internal tokenizer. If the user did not add an extern or a match section, that is equivalent to an internal tokenizer like match { _ }.

Next, we can now have "bare" terminals even with an internal tokenizer. So, name resolution needs to find the internal tokenizer "match" declaration and check the right-hand side of each =>. So e.g. r"[a-zA-Z_][a-zA-Z0-9_]*" => IDENTIFIER would indicate that IDENTIFIER is actually a terminal. That's easy enough.

In token-check, when we validate a literal like "bar" that appears in the grammar, we will want to check against the tokenizer declaration: if "bar" appears on the RHS of some entry, we're done. But otherwise, if there is a _, we want to collect it in a vector. If there is no _, we want to report an error: this is an unrecognized symbol.
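
To make that check concrete, consider a hypothetical declaration and a few terminals the grammar might use (the entries are invented examples):

match {
    r"[a-z]+" => IDENTIFIER,
    "bar",
};

// "bar"      -- ok: it appears on the RHS of an entry (bare "bar" is "bar" => "bar")
// IDENTIFIER -- ok: it is the target of a rename
// "baz"      -- with no `_` this is an unrecognized-symbol error; with a `_` it would
//               be collected and added to the generated tokenizer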

Finally, when we build the DFA, we would build it up using the LHS of each A => B mapping. I think as long as we are careful, no further changes are needed to other parts of LALRPOP.

wagenet commented on May 22, 2024

I like this API. I'll try to make enough sense of the lalrpop code to see if I can make any progress.

nikomatsakis commented on May 22, 2024

So @wagenet made awesome progress here. There is still a bit more work to go before I'm ready to close this issue. In particular, I want to add support for tokenizers that produce zero or more tokens in response to a single match. I'm envisioning something like this:

match {
    r"&&" => ("&[&]", "&[]"), // produce two tokens in response to 1 regex
} else {
    r"\s+" => (), // skip whitespace
    r"//.*" => (), // skip EOL comments
    r"/\*.*\*/" => (), // skip (non-nested) `/* ... */` comments
}

This would basically let us synthesize any regular (as in regular expressions) lexer. To go beyond regular languages we'd have to add states -- I'm not opposed to it, but want to think on it a bit more.

One thing that I don't know: right now, we implicitly skip whitespace for you. I'd like to give users control of that, but I'm not sure when to disable the current behavior. For example, should we disable the whitespace default behavior if there are any tokens that map to ()? Or should we have people add a #[no_skip_whitespace] annotation?

Somehow the "no skip whitespace" thing feels related to _ to me -- i.e., the former opts out of default behavior, the latter opts back in -- perhaps they could be combined? i.e., we'd have _ by default unless you wrote #[no_default] or something on the match?

Thoughts?

8573 commented on May 22, 2024

For example, should we disable the whitespace default behavior if there are any tokens that map to ()? Or should we have people add a #[no_skip_whitespace] annotation?

In the former case, how would one write a lexer that skips nothing?

nikomatsakis commented on May 22, 2024

@8573 yeah, good question. I was thinking that the better "magic" answer would be to maybe look to see if any regex patterns may match whitespace, but it feels...kind of magic.

hollinwilkins commented on May 22, 2024

Opting for the non-magic option seems like the more predictable choice, and it wouldn't preclude implementing magic later as a fallback for when the option isn't set.

nikomatsakis commented on May 22, 2024

Opting for the non-magic option seems like the more predictable choice, and it wouldn't preclude implementing magic later as a fallback for when the option isn't set.

Yes. I think there are two viable options (contrasted in a small sketch after this list):

  • If there is no _, then disable skipping whitespace and require an explicit line.
  • Some form of annotation like #[no_skip_whitespace]
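
Roughly, the two might look like this (both hypothetical; IDENT is an invented terminal and the annotation does not exist yet):

// Option 1: with no `_`, nothing is implicit, so whitespace must be skipped explicitly
match {
    r"\s+" => (),
    r"[a-z]+" => IDENT,
}

// Option 2: whitespace skipping stays on by default; an annotation opts out
#[no_skip_whitespace]
match {
    r"[a-z]+" => IDENT,
}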

Right now, I lean towards the former. The annotation just feels awkward, and using _ as a stand-in for "insert default rules" feels ok. Basically _ would be short for the following (sketched concretely after the list):

  • Insert a "foo" => "foo" rule for every terminal that appears in the grammar which is not explicitly listed.
  • Insert r"\s+" => ()

The next question is: what notation should we use for "skip" rules? I am thinking a bit forward here. I eventually expect to support:

  • custom code rules where you can write things like => { /* this will execute */ }
  • probably the ability to transition between states, like "/*" => goto COMMENT
    • I had eventually hoped to avoid this but I'm changing my opinion
  • the ability to generate multiple tokens from one regex match (as I already mentioned)
    • but this may want custom code, since I think the number of tokens to generate may sometimes depend on the input
    • and maybe this should just be a custom tokenizer

Considering all these things I am leaning towards either:

  • => (), as originally proposed, or
  • => continue, just use a keyword.

I sort of lean towards the second one right now, but really either is fine.

pyfisch commented on May 22, 2024

Has there been any progress with "skip" rules? I want to use them for line comments and inline comments without nesting.

What needs to be changed to support "skip" rules?

ahmedcharles commented on May 22, 2024

How does this relate to #14? They both seem related to tokenizers and the match functionality. Should one be a duplicate of the other? It's also not obvious what functionality would be sufficient to result in either issue being closed.

nikomatsakis commented on May 22, 2024

@ahmedcharles yeah I think they may be basically the same thing. =)

ahmedcharles commented on May 22, 2024

Should one of them be closed?

anchpop commented on May 22, 2024

What is the status for parsing indentation-sensitive languages? This (usually) requires a non-regular lexer to handle features such as Python ignoring indentation within lists. Is there any documentation on how one can create their own lexer?

Marwes commented on May 22, 2024

@anchpop http://lalrpop.github.io/lalrpop/lexer_tutorial/index.html explains how to write and integrate a simple lexer.

anchpop commented on May 22, 2024

Oh, I didn't notice that, thanks
