maleeni's Introduction

maleeni

maleeni is a lexer generator for Go. It also provides a command that performs lexical analysis directly, so you can easily debug your lexical specification.

Installation

Compiler:

$ go install github.com/nihei9/maleeni/cmd/maleeni@latest

Code Generator:

$ go install github.com/nihei9/maleeni/cmd/maleeni-go@latest

Usage

1. Define your lexical specification

First, define your lexical specification in JSON format. As an example, let's write the definitions of whitespace, words, and punctuation.

{
    "name": "statement",
    "entries": [
        {
            "kind": "whitespace",
            "pattern": "[\\u{0009}\\u{000A}\\u{000D}\\u{0020}]+"
        },
        {
            "kind": "word",
            "pattern": "[0-9A-Za-z]+"
        },
        {
            "kind": "punctuation",
            "pattern": "[.,:;]"
        }
    ]
}

Save the above specification to a file. In this explanation, the file name is statement.json.

⚠️ The input file must be encoded in UTF-8.

2. Compile the lexical specification

Next, generate a DFA from the lexical specification using the maleeni compile command.

$ maleeni compile statement.json -o statementc.json

3. Debug (Optional)

If you want to make sure that the lexical specification behaves as expected, you can use the maleeni lex command to try lexical analysis without generating a lexer. The maleeni lex command outputs tokens in JSON format. For simplicity, the example below prints only the significant fields of the tokens in CSV format using the jq command.

⚠️ maleeni lex and the driver can handle only UTF-8 input.

$ echo -n 'The truth is out there.' | maleeni lex statementc.json | jq -r '[.kind_name, .lexeme, .eof] | @csv'
"word","The",false
"whitespace"," ",false
"word","truth",false
"whitespace"," ",false
"word","is",false
"whitespace"," ",false
"word","out",false
"whitespace"," ",false
"word","there",false
"punctuation",".",false
"","",true

The JSON format of tokens that maleeni lex command prints is as follows:

Field         Type               Description
mode_id       integer            An ID of a lex mode.
mode_name     string             A name of a lex mode.
kind_id       integer            An ID of a kind. This is unique among all modes.
mode_kind_id  integer            An ID of a lexical kind. This is unique only within a mode. Note that you need to use the kind_id field if you want to identify a kind across all modes.
kind_name     string             A name of a lexical kind.
row           integer            A row number where a lexeme appears.
col           integer            A column number where a lexeme appears. Note that col is counted in code points, not bytes.
lexeme        array of integers  A byte sequence of a lexeme.
eof           bool               When this field is true, the token is the EOF token.
invalid       bool               When this field is true, the token is an error token.
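
If you consume the output of maleeni lex from a program instead of jq, the fields listed above could be mapped onto a Go struct. The following is a minimal sketch and not part of maleeni: the names Token and decodeToken are hypothetical, the sample line is hand-written, and lexeme is kept as raw JSON so that the sketch makes no assumption about how it is encoded.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Token mirrors the fields of one JSON token printed by `maleeni lex`.
// The type name is hypothetical and chosen only for this sketch.
type Token struct {
	ModeID     int             `json:"mode_id"`
	ModeName   string          `json:"mode_name"`
	KindID     int             `json:"kind_id"`
	ModeKindID int             `json:"mode_kind_id"`
	KindName   string          `json:"kind_name"`
	Row        int             `json:"row"`
	Col        int             `json:"col"`
	Lexeme     json.RawMessage `json:"lexeme"` // left undecoded on purpose
	EOF        bool            `json:"eof"`
	Invalid    bool            `json:"invalid"`
}

// decodeToken parses a single line of JSON output into a Token.
func decodeToken(line []byte) (Token, error) {
	var tok Token
	err := json.Unmarshal(line, &tok)
	return tok, err
}

func main() {
	// A hand-written sample line, not captured from a real run.
	line := []byte(`{"mode_id":1,"mode_name":"default","kind_id":2,"mode_kind_id":2,"kind_name":"word","row":0,"col":4,"lexeme":[116,114,117,116,104],"eof":false,"invalid":false}`)

	tok, err := decodeToken(line)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("mode=%s kind=%s row=%d col=%d eof=%v invalid=%v\n",
		tok.ModeName, tok.KindName, tok.Row, tok.Col, tok.EOF, tok.Invalid)
}
```

To identify a kind across modes, compare KindID, not ModeKindID, as the table notes.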

4. Generate the lexer

Using the maleeni-go command, you can generate the source code of a lexer that recognizes your lexical specification.

$ maleeni-go statementc.json

The above command generates the lexer and saves it to the statement_lexer.go file. By default, the file name is {spec name}_lexer.go. To use the lexer, call the NewLexer function defined in statement_lexer.go. The following code is a simple example in which the lexer reads source code from stdin and writes the resulting tokens to stdout.

package main

import (
    "fmt"
    "os"
)

func main() {
    lex, err := NewLexer(NewLexSpec(), os.Stdin)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    for {
        tok, err := lex.Next()
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        if tok.EOF {
            break
        }
        if tok.Invalid {
            fmt.Printf("invalid: %#v\n", string(tok.Lexeme))
        } else {
            fmt.Printf("valid: %v: %#v\n", KindIDToName(tok.KindID), string(tok.Lexeme))
        }
    }
}

Please save the above source code to main.go and create a directory structure like the one below.

/project_root
├── statement_lexer.go ... Lexer generated from the compiled lexical specification (the result of `maleeni-go`).
└── main.go .............. Caller of the lexer.

Now, you can perform the lexical analysis.

$ echo -n 'I want to believe.' | go run main.go statement_lexer.go
valid: word: "I"
valid: whitespace: " "
valid: word: "want"
valid: whitespace: " "
valid: word: "to"
valid: whitespace: " "
valid: word: "believe"
valid: punctuation: "."

More Practical Usage

See also this example.

Lexical Specification Format

The lexical specification format to be passed to maleeni compile command is as follows:

top level object:

Field    Type                    Domain  Nullable  Description
name     string                  id      false     A specification name.
entries  array of entry objects  N/A     false     An array of entries sorted by priority. The first element has the highest priority, and the last has the lowest priority.

entry object:

Field     Type              Domain  Nullable  Description
kind      string            id      false     A name of a token kind. The name must be unique, but duplicate names between fragments and non-fragments are allowed.
pattern   string            regexp  false     A pattern written as a regular expression.
modes     array of strings  N/A     true      Mode names that an entry is enabled in (default: "default").
push      string            id      true      A mode name that the lexer pushes onto its own mode stack when a token matching the pattern appears.
pop       bool              N/A     true      When pop is true, the lexer pops a mode from its own mode stack.
fragment  bool              N/A     true      When fragment is true, the entry is a fragment.

See Identifier and Regular Expression for more details on id domain and regexp domain.
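
If you generate specifications from a program, the two objects above could be modeled as Go structs and serialized with encoding/json. This is only a sketch: the names lexSpec, specEntry, and encodeSpec are hypothetical and are not maleeni's own types. Nullable fields are omitted from the output when they are unset.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// specEntry models the entry object. Nullable fields use omitempty so that
// unset values are left out of the generated JSON.
type specEntry struct {
	Kind     string   `json:"kind"`
	Pattern  string   `json:"pattern"`
	Modes    []string `json:"modes,omitempty"`
	Push     string   `json:"push,omitempty"`
	Pop      bool     `json:"pop,omitempty"`
	Fragment bool     `json:"fragment,omitempty"`
}

// lexSpec models the top level object. Entries are listed from the highest
// priority to the lowest.
type lexSpec struct {
	Name    string      `json:"name"`
	Entries []specEntry `json:"entries"`
}

// encodeSpec serializes a specification to compact JSON.
func encodeSpec(spec lexSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	spec := lexSpec{
		Name: "statement",
		Entries: []specEntry{
			{Kind: "word", Pattern: `[0-9A-Za-z]+`},
			{Kind: "punctuation", Pattern: `[.,:;]`},
		},
	}
	out, err := encodeSpec(spec)
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(out)
}
```

Writing patterns as Go raw strings and letting the encoder produce the JSON avoids escaping backslashes by hand.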

Identifier

id represents an identifier and must follow the rules below:

  • id must be in lower snake case. It can contain only a to z, 0 to 9, and _.
  • The first and last characters must be one of a to z.
  • _ cannot appear consecutively.
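
The three rules above can be checked mechanically. The following sketch implements them exactly as written. The function name isValidID is hypothetical, and this is not maleeni's own validation code.

```go
package main

import "fmt"

// isValidID reports whether id follows the identifier rules:
// only a-z, 0-9, and _ are allowed; the first and last characters must be
// a-z; and _ must not appear consecutively.
func isValidID(id string) bool {
	if id == "" {
		return false
	}
	for i := 0; i < len(id); i++ {
		c := id[i]
		isLower := c >= 'a' && c <= 'z'
		isDigit := c >= '0' && c <= '9'
		switch {
		case isLower:
			// Lowercase letters are allowed anywhere.
		case isDigit || c == '_':
			// Digits and underscores cannot be first or last.
			if i == 0 || i == len(id)-1 {
				return false
			}
			// Underscores cannot appear consecutively.
			if c == '_' && id[i-1] == '_' {
				return false
			}
		default:
			return false
		}
	}
	return true
}

func main() {
	for _, id := range []string{"string_open", "utf8_text", "_word", "word_", "a__b", "Word"} {
		fmt.Printf("%-12s %v\n", id, isValidID(id))
	}
}
```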

Regular Expression

regexp represents a regular expression. Its syntax is below:

⚠️ In JSON, you need to write \ as \\.

⚠️ maleeni doesn't allow you to use some code points. See Unavailable Code Points.

Composites

Concatenation and alternation allow you to combine multiple characters or multiple patterns into one pattern.

Pattern Matches
abc abc
abc|def abc or def

Single Characters

In addition to using ordinary characters, there are other ways to represent a single character:

  • dot expression
  • bracket expressions
  • code point expressions
  • character property expressions
  • escape sequences

Dot Expression

The dot expression matches any one character.

Pattern Matches
. any one character

Bracket Expressions

The bracket expressions are represented by enclosing characters in [ ] or [^ ]. [^ ] is negation of [ ]. For instance, [ab] matches one of a or b, and [^ab] matches any one character except a and b.

Pattern Matches
[abc] a, b, or c
[^abc] any one character except a, b, and c
[a-z] one in the range of a to z
[a-] a or -
[-z] - or z
[-] -
[^a-z] any one character except the range of a to z
[a^] a or ^

Code Point Expressions

The code point expressions match a character that has a specified code point. A code point is written as a four- or six-digit hex string.

Pattern     Matches
\u{000A}    U+000A (LF)
\u{3042}    U+3042 (hiragana あ)
\u{01F63A}  U+1F63A (grinning cat 😺)
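
To see what the hex string inside \u{...} denotes, the following sketch converts it to a Go rune. It is an illustration under the rules stated in this document (four or six hex digits, no surrogate code points). The function name codePointToRune is hypothetical and the code is not taken from maleeni.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode/utf8"
)

// codePointToRune converts the hex digits of a code point expression
// (the part between the braces of \u{...}) into a rune.
func codePointToRune(hex string) (rune, error) {
	if len(hex) != 4 && len(hex) != 6 {
		return 0, fmt.Errorf("code point must have 4 or 6 hex digits: %q", hex)
	}
	n, err := strconv.ParseUint(hex, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid hex string %q: %w", hex, err)
	}
	r := rune(n)
	// ValidRune rejects surrogates (U+D800..U+DFFF) and values above U+10FFFF.
	if !utf8.ValidRune(r) {
		return 0, fmt.Errorf("U+%04X cannot be encoded in UTF-8", n)
	}
	return r, nil
}

func main() {
	for _, hex := range []string{"000A", "3042", "01F63A", "D800"} {
		r, err := codePointToRune(hex)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("U+%04X %q\n", r, r)
	}
}
```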

Character Property Expressions

The character property expressions match a character that has a specified Unicode character property. Currently, maleeni supports General_Category, Script, Alphabetic, Lowercase, Uppercase, and White_Space. When you omit the equal symbol and the right-side value, maleeni interprets the symbol in \p{...} as a General_Category value.

Pattern Matches
\p{General_Category=Letter} any one character whose General_Category is Letter
\p{gc=Letter} the same as \p{General_Category=Letter}
\p{Letter} the same as \p{General_Category=Letter}
\p{l} the same as \p{General_Category=Letter}
\p{Script=Latin} any one character whose Script is Latin
\p{Alphabetic=yes} any one character whose Alphabetic is yes
\p{Lowercase=yes} any one character whose Lowercase is yes
\p{Uppercase=yes} any one character whose Uppercase is yes
\p{White_Space=yes} any one character whose White_Space is yes
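
To get a feel for which characters these properties cover, you can query the tables in Go's standard unicode package. This sketch is only an approximation: the tables follow the Unicode version bundled with your Go release, which may differ from the version maleeni references, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// inCategory reports whether r belongs to the General_Category with the
// given short name, such as "L" (Letter) or "Nd" (Decimal_Number).
func inCategory(name string, r rune) bool {
	table, ok := unicode.Categories[name]
	return ok && unicode.Is(table, r)
}

// inScript reports whether r belongs to the named script, such as "Latin".
func inScript(name string, r rune) bool {
	table, ok := unicode.Scripts[name]
	return ok && unicode.Is(table, r)
}

// isWhiteSpace reports whether r has the White_Space property.
func isWhiteSpace(r rune) bool {
	return unicode.Is(unicode.White_Space, r)
}

func main() {
	for _, r := range []rune{'a', 'あ', '1', ' '} {
		fmt.Printf("%q letter=%v latin=%v white_space=%v\n",
			r, inCategory("L", r), inScript("Latin", r), isWhiteSpace(r))
	}
}
```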

Escape Sequences

By escaping a special character with \, you can write a rule that matches the special character itself. The following escape sequences are available outside of bracket expressions.

Pattern  Matches
\.       .
\?       ?
\*       *
\+       +
\(       (
\)       )
\[       [
\|       |
\\       \

The following escape sequences are available inside bracket expressions.

Pattern  Matches
\^       ^
\-       -
\]       ]

Repetitions

The repetitions match a string that repeats the previous single character or group.

Pattern Matches
a* zero or more a
a+ one or more a
a? zero or one a

Grouping

( and ) group patterns.

Pattern Matches
a(bc)*d ad, abcd, abcbcd, and so on
(ab|cd)+ ab, cd, abcd, cdab, abcdab, and so on

Fragment

The fragment is a feature that allows you to define a part of a pattern. This feature is useful for decomposing complex patterns into simple patterns and for defining common parts between patterns. A fragment entry is defined by an entry whose fragment field is true, and is referenced by a fragment expression (\f{...}). Fragment patterns can be nested, but they are not allowed to contain circular references.

For instance, you can define an identifier of golang as follows:

{
    "name": "id",
    "entries": [
        {
            "fragment": true,
            "kind": "unicode_letter",
            "pattern": "\\p{Letter}"
        },
        {
            "fragment": true,
            "kind": "unicode_digit",
            "pattern": "\\p{Number}"
        },
        {
            "fragment": true,
            "kind": "letter",
            "pattern": "\\f{unicode_letter}|_"
        },
        {
            "kind": "identifier",
            "pattern": "\\f{letter}(\\f{letter}|\\f{unicode_digit})*"
        }
    ]
}
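
One way to picture fragment references is as textual substitution: each \f{name} is replaced by the parenthesized pattern of that fragment, and a reference back to a fragment that is still being expanded is a circular reference. The sketch below follows that idea. It is not how maleeni is implemented, the names fragmentRef and expand are hypothetical, and for brevity it does not handle an escaped backslash directly in front of f{...}.

```go
package main

import (
	"fmt"
	"regexp"
)

// fragmentRef matches a fragment expression such as \f{letter}.
var fragmentRef = regexp.MustCompile(`\\f\{([a-z0-9_]+)\}`)

// expand replaces every fragment expression in pattern with the
// parenthesized, fully expanded pattern of the referenced fragment.
// visiting holds the fragments on the current expansion path.
func expand(pattern string, fragments map[string]string, visiting map[string]bool) (string, error) {
	var firstErr error
	out := fragmentRef.ReplaceAllStringFunc(pattern, func(ref string) string {
		if firstErr != nil {
			return ref
		}
		name := fragmentRef.FindStringSubmatch(ref)[1]
		body, ok := fragments[name]
		if !ok {
			firstErr = fmt.Errorf("undefined fragment: %s", name)
			return ref
		}
		if visiting[name] {
			firstErr = fmt.Errorf("circular reference: %s", name)
			return ref
		}
		visiting[name] = true
		expanded, err := expand(body, fragments, visiting)
		delete(visiting, name)
		if err != nil {
			firstErr = err
			return ref
		}
		return "(" + expanded + ")"
	})
	if firstErr != nil {
		return "", firstErr
	}
	return out, nil
}

func main() {
	fragments := map[string]string{
		"unicode_letter": `\p{Letter}`,
		"unicode_digit":  `\p{Number}`,
		"letter":         `\f{unicode_letter}|_`,
	}
	pattern := `\f{letter}(\f{letter}|\f{unicode_digit})*`
	out, err := expand(pattern, fragments, map[string]bool{})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```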

Unavailable Code Points

Lexical specifications and source files to be analyzed cannot contain the following code points.

  • surrogate code points: U+D800..U+DFFF

When you write a pattern that implicitly contains the unavailable code points, maleeni automatically generates a pattern that doesn't contain them and uses it in place of the original pattern. However, when you explicitly use the unavailable code points (like \u{D800} or \p{General_Category=Cs}), maleeni reports an error.
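
The automatic replacement described in this section can be pictured as splitting a code point range around the surrogate block. The following sketch shows that idea for a single range. It assumes from <= to, uses hypothetical names (codePointRange, excludeSurrogates), and is not maleeni's actual implementation.

```go
package main

import "fmt"

const (
	surrogateMin = 0xD800
	surrogateMax = 0xDFFF
)

// codePointRange is an inclusive range of code points.
type codePointRange struct {
	from, to rune
}

// excludeSurrogates returns the parts of r that lie outside U+D800..U+DFFF.
// The result has zero, one, or two ranges.
func excludeSurrogates(r codePointRange) []codePointRange {
	// No overlap with the surrogate block: keep the range as it is.
	if r.to < surrogateMin || r.from > surrogateMax {
		return []codePointRange{r}
	}
	var out []codePointRange
	if r.from < surrogateMin {
		out = append(out, codePointRange{r.from, surrogateMin - 1})
	}
	if r.to > surrogateMax {
		out = append(out, codePointRange{surrogateMax + 1, r.to})
	}
	return out
}

func main() {
	// The whole code space, as a dot expression would cover it.
	for _, r := range excludeSurrogates(codePointRange{0x0000, 0x10FFFF}) {
		fmt.Printf("U+%04X..U+%04X\n", r.from, r.to)
	}
}
```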

Lex Mode

Lex Mode is a feature that allows you to separate a DFA transition table for each mode.

The modes field of an entry in a lexical specification indicates in which modes the entry is enabled. If the modes field is empty, the entry is enabled only in the default mode. The compiler groups the entries by mode and generates a DFA for each mode, so the driver can switch the transition table by switching modes. Mode switching follows the push or pop field of each entry.

For instance, you can define a subset of the string literal of golang as follows:

{
    "name": "string",
    "entries": [
        {
            "kind": "string_open",
            "pattern": "\"",
            "push": "string"
        },
        {
            "modes": ["string"],
            "kind": "char_seq",
            "pattern": "[^\\u{000A}\"\\\\]+"
        },
        {
            "modes": ["string"],
            "kind": "escaped_char",
            "pattern": "\\\\[abfnrtv\\\\'\"]"
        },
        {
            "modes": ["string"],
            "kind": "escape_symbol",
            "pattern": "\\\\"
        },
        {
            "modes": ["string"],
            "kind": "newline",
            "pattern": "\\u{000A}"
        },
        {
            "modes": ["string"],
            "kind": "string_close",
            "pattern": "\"",
            "pop": true
        },
        {
            "kind": "identifier",
            "pattern": "[A-Za-z_][0-9A-Za-z_]*"
        }
    ]
}

In the above specification, when the " mark appears in default mode (it's the initial mode), the driver transitions to the string mode and interprets character sequences (char_seq) and escape sequences (escaped_char). When the " mark appears the next time, the driver returns to the default mode.

$ echo -n '"foo\nbar"foo' | maleeni lex stringc.json | jq -r '[.mode_name, .kind_name, .lexeme, .eof] | @csv'
"default","string_open","""",false
"string","char_seq","foo",false
"string","escaped_char","\n",false
"string","char_seq","bar",false
"string","string_close","""",false
"default","identifier","foo",false
"default","","",true

The input string enclosed in " marks (foo\nbar) is interpreted as char_seq and escaped_char tokens, while the outer string (foo) is interpreted as an identifier. The same string foo is interpreted as different kinds because it is read in different modes.
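
The mode switching in this example can be modeled as a small stack. The sketch below replays the kinds from the output above and prints the mode in which each token is matched. It is an illustration only: the type and function names are hypothetical, and rejecting a pop of the initial mode is an assumption of this sketch, not documented maleeni behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// entry holds only the fields that affect mode switching.
type entry struct {
	kind string
	push string // mode to push; empty means no push
	pop  bool
}

// modeStack tracks the active lex mode; the top of the stack is current.
type modeStack struct {
	modes []string
}

func newModeStack() *modeStack {
	return &modeStack{modes: []string{"default"}}
}

func (s *modeStack) current() string {
	return s.modes[len(s.modes)-1]
}

// apply updates the stack after a token of the given entry has been matched.
func (s *modeStack) apply(e entry) error {
	if e.pop {
		if len(s.modes) == 1 {
			// Assumption of this sketch: the initial mode is never popped.
			return errors.New("cannot pop the initial mode")
		}
		s.modes = s.modes[:len(s.modes)-1]
	}
	if e.push != "" {
		s.modes = append(s.modes, e.push)
	}
	return nil
}

func main() {
	matched := []entry{
		{kind: "string_open", push: "string"},
		{kind: "char_seq"},
		{kind: "escaped_char"},
		{kind: "char_seq"},
		{kind: "string_close", pop: true},
		{kind: "identifier"},
	}
	s := newModeStack()
	for _, e := range matched {
		fmt.Printf("%s: %s\n", s.current(), e.kind)
		if err := s.apply(e); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
}
```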

Unicode Version

maleeni references Unicode 13.0.0.

maleeni's Issues

Minimize DFA

Minimize the DFA that the maleeni compile command outputs.

Prohibit nullable patterns

A pattern that can match the empty string, such as a? or a* (referred to below as a nullable pattern), is valid as a regular expression, but maleeni won't recognize the empty string as a lexeme. If a user expects maleeni to recognize a lexeme matching the empty string, that expectation is wrong, and maleeni should report it. Therefore, we prohibit nullable patterns.
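
As a rough illustration of what nullable means, the following sketch tests whether a pattern matches the empty string. It relies on Go's regexp package, so it only applies to patterns whose syntax Go and maleeni share (such as the repetitions and groups shown in this document). The function name isNullable is hypothetical, and this is not the check maleeni itself performs.

```go
package main

import (
	"fmt"
	"regexp"
)

// isNullable reports whether pattern can match the empty string.
// The pattern is anchored so that only a full match of "" counts.
func isNullable(pattern string) (bool, error) {
	re, err := regexp.Compile(`^(?:` + pattern + `)$`)
	if err != nil {
		return false, err
	}
	return re.MatchString(""), nil
}

func main() {
	for _, p := range []string{"a?", "a*", "a+", "(ab|cd)+", "a?|b"} {
		nullable, err := isNullable(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-10s nullable=%v\n", p, nullable)
	}
}
```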

Cannot compile `\p{gc=Unassigned}`

maleeni v0.6.0 cannot compile \p{gc=Unassigned}. Specifically, I made the following file:

{
    "name": "test",
    "entries": [
        {
            "kind": "unassigned",
            "pattern": "\\p{gc=Unassigned}"
        }
    ]
}

Then I got the following error:

$ maleeni compile test.json
failed to compile in default mode: surrogate code points U+D800..U+DFFF are not allowed in UTF-8: U+D801..U+DB7E

Regardless of whether gc=Unassigned properties are practical or not, the expectation is that no errors will occur.

Panic on a mode having no patterns

When a mode contains no non-fragment patterns, maleeni v0.6.0 panics.

I made the following file:

{
	"name": "test",
	"entries": [
		{
			"modes": ["other_mode"],
			"kind": "foo",
			"pattern": "foo"
		}
	]
}

Then I got the following error:

$ maleeni compile test.json
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x5b2510]

goroutine 1 [running]:
github.com/nihei9/maleeni/compiler/dfa.GenDFA(0x0, 0x0, 0xc000118dd0, 0xc000118dd0)
        /home/nihei9/src/github.com/nihei9/maleeni/compiler/dfa/dfa.go:51 +0x50
github.com/nihei9/maleeni/compiler.compile(0x832280, 0x0, 0x0, 0xc00011b350, 0xc00011b3b0, 0xc0001a4f78, 0xc00014c400, 0x3, 0x4, 0xc00011b350, ...)
        /home/nihei9/src/github.com/nihei9/maleeni/compiler/compiler.go:323 +0x1b7d
github.com/nihei9/maleeni/compiler.Compile(0xc00011b050, 0xc0001add18, 0x1, 0x1, 0x0, 0xc0001b8100, 0xc0001add10, 0xc0001add38, 0xc0001add38, 0x5cd065)
        /home/nihei9/src/github.com/nihei9/maleeni/compiler/compiler.go:57 +0x9a7
main.runCompile(0xc000154f00, 0xc000118ad0, 0x1, 0x1, 0x0, 0x0)
        /home/nihei9/src/github.com/nihei9/maleeni/cmd/maleeni/compile.go:49 +0xe5
github.com/spf13/cobra.(*Command).execute(0xc000154f00, 0xc000118ab0, 0x1, 0x1, 0xc000154f00, 0xc000118ab0)
        /home/nihei9/go/pkg/mod/github.com/spf13/[email protected]/command.go:852 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0x800100, 0x4407dc, 0xc000000180, 0x300000002)
        /home/nihei9/go/pkg/mod/github.com/spf13/[email protected]/command.go:960 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
        /home/nihei9/go/pkg/mod/github.com/spf13/[email protected]/command.go:897
main.Execute(0x7e9d88, 0xc00010e058)
        /home/nihei9/src/github.com/nihei9/maleeni/cmd/maleeni/root.go:22 +0x31
main.main()
        /home/nihei9/src/github.com/nihei9/maleeni/cmd/maleeni/main.go:8 +0x25

The following spec causes the same error.

{
	"name": "test",
	"entries": [
		{
			"kind": "foo",
			"fragment": true,
			"pattern": "foo"
		}
	]
}

Cannot report an error about spelling inconsistencies of kind names

When kind names contain spelling inconsistencies, maleeni v0.6.0 panics.

I made the following file:

{
	"name": "test",
	"entries": [
		{
			"kind": "foo_1",
			"pattern": "foo_1"
		},
		{
			"kind": "foo1",
			"pattern": "foo1"
		}
	]
}

Then I got the following error:

$ maleeni compile test.json
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5262bb]

goroutine 1 [running]:
github.com/nihei9/maleeni/spec.findSpellingInconsistenciesErrors(0xc0001b83c0, 0x3, 0x4, 0x0, 0x3, 0xc0001b83c0, 0x2)
        /home/nihei9/src/github.com/nihei9/maleeni/spec/spec.go:259 +0xdb
github.com/nihei9/maleeni/spec.(*LexSpec).Validate(0xc0001830e0, 0x0, 0x200)
        /home/nihei9/src/github.com/nihei9/maleeni/spec/spec.go:225 +0xb28
github.com/nihei9/maleeni/compiler.Compile(0xc0001830e0, 0xc000217d18, 0x1, 0x1, 0x0, 0xc000226100, 0xc000217d10, 0xc000217d38, 0xc000217d38, 0x5cd045)
        /home/nihei9/src/github.com/nihei9/maleeni/compiler/compiler.go:37 +0x45
main.runCompile(0xc0001bef00, 0xc000183050, 0x1, 0x3, 0x0, 0x0)
        /home/nihei9/src/github.com/nihei9/maleeni/cmd/maleeni/compile.go:49 +0xe5
github.com/spf13/cobra.(*Command).execute(0xc0001bef00, 0xc000183020, 0x3, 0x3, 0xc0001bef00, 0xc000183020)
        /home/nihei9/go/pkg/mod/github.com/spf13/[email protected]/command.go:852 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0x800100, 0x4626e0, 0xc0000526ef, 0xc000028800)
        /home/nihei9/go/pkg/mod/github.com/spf13/[email protected]/command.go:960 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
        /home/nihei9/go/pkg/mod/github.com/spf13/[email protected]/command.go:897
main.Execute(0x7e9d88, 0xc00010e058)
        /home/nihei9/src/github.com/nihei9/maleeni/cmd/maleeni/root.go:22 +0x31
main.main()
        /home/nihei9/src/github.com/nihei9/maleeni/cmd/maleeni/main.go:8 +0x25

The expectation is that maleeni compile command prints an error message without panic.

Inverse bracket expression is broken

I made the following file:

{
	"name": "test",
	"entries": [
		{
			"kind": "not_paren",
			"pattern": "[^()]+"
		}
	]
}

Then I got the following result:

$ echo -n '()' | maleeni lex test.json
{"mode_id":1,"mode_name":"default","kind_id":0,"mode_kind_id":0,"kind_name":"","row":0,"col":0,"lexeme":"(","eof":false,"invalid":true}
{"mode_id":1,"mode_name":"default","kind_id":1,"mode_kind_id":1,"kind_name":"not_paren","row":0,"col":1,"lexeme":")","eof":false,"invalid":false}
{"mode_id":1,"mode_name":"default","kind_id":0,"mode_kind_id":0,"kind_name":"","row":0,"col":0,"lexeme":"","eof":true,"invalid":false}

Even though the inverse expression contains ), maleeni v0.6.0 recognizes ) as a valid token. The expectation is that ) will be a part of an invalid token.

Lexeme positions

Include in each token the position at which its lexeme appears in the source code.
