
uc's Introduction

µC Compiler


A compiler for the µC programming language.

Installation

  1. Install Go.
  2. Install Gocc: go get github.com/goccmack/gocc.
$ go get -u github.com/mewmew/uc
$ cd ${GOPATH}/src/github.com/mewmew/uc/gocc
$ make gen
$ go get github.com/mewmew/uc/...
$ go install github.com/mewmew/uc/cmd/ulex
$ go install github.com/mewmew/uc/cmd/uparse
$ go install github.com/mewmew/uc/cmd/uclang
$ go install github.com/mewmew/uc/cmd/3rdpartycompile

Usage

  • ulex: a lexer for the µC language which pretty-prints tokens to standard output.
  • uparse: a parser for the µC language which pretty-prints abstract syntax trees to standard output.
  • usem: a static semantic checker for the µC language which validates the input and reports errors to standard error.
  • uclang: a compiler for the µC language which validates the input, and prints corresponding LLVM IR assembly to standard output.
  • 3rdpartycompile: a compiler for the µC language which validates the input through the uclang tool chain, compiles it to LLVM IR, links it against the supplied uc.c library using the third-party compiler Clang, and outputs the corresponding binary.

Public domain

The source code and any original content of this repository are hereby released into the public domain.

uc's People

Contributors

mewmew, sangisos


uc's Issues

parser: "invalid function arguments type; expected []ast.Expr, got <nil>"

u@x1 ~/D/g/s/g/m/uc> uparse testdata/quiet/mips/m01.c 
Parsing "testdata/quiet/mips/m01.c"

…

S44 ;(3,;) reduce:57(Expr15 : int_lit   << astx.NewBasicLit(X[0], token.IntLit) >>)
S61 ;(3,;) reduce:54(Expr14 : Expr15    <<  >>)
S60 ;(3,;) reduce:51(Expr13L : Expr14   <<  >>)
S58 ;(3,;) reduce:48(Expr12L : Expr13L  <<  >>)
S57 ;(3,;) reduce:43(Expr10L : Expr12L  <<  >>)
S56 ;(3,;) reduce:40(Expr9L : Expr10L   <<  >>)
S55 ;(3,;) reduce:38(Expr5L : Expr9L    <<  >>)
S166 ;(3,;) reduce:37(Expr2R : Expr2R "=" Expr5L    << astx.NewBinaryExpr(X[0], token.Assign, X[2]) >>)
S53 ;(3,;) reduce:35(Expr : Expr2R  <<  >>)
S48 ;(3,;) shift:84
pos: 132
typ: 4
lit: jal

S84 ident(4,jal) reduce:25(Stmt : Expr ";"  << astx.NewExprStmt(X[0]) >>)
S47 ident(4,jal) shift:42
pos: 135
typ: 5
lit: (

S42 ((5,() shift:64
pos: 136
typ: 6
lit: )

S64 )(6,)) reduce:63(Args : empty   << nil, nil >>)
S120 )(6,)) shift:195
pos: 137
typ: 3
lit: ;

S195 ;(3,;) reduce:60(Expr15 : ident "(" Args ")"   << astx.NewCallExpr(X[0], X[2]) >>)
2016/04/10 00:41:35 main.parseFile (uparse.go:71): error: Error in S47: ;(3,;), Pos(offset=137, line=29, column=1)github.com/mewmew/uc/ast/astx.NewCallExpr (astx.go:313): error: invalid function arguments type; expected []ast.Expr, got <nil>

lexer: Trailing newline in test case incorrect/lexer/good.c

u@x1 ~/D/g/s/g/m/u/u/testdata> tail incorrect/lexer/good.c | hexdump -C
00000000  0a 20 20 20 20 20 20 20  20 32 64 69 65 34 55 0a  |.        2die4U.|
00000010  0a 2f 2f 7c 7c 20 54 68  65 20 66 6f 6c 6c 6f 77  |.//|| The follow|
00000020  69 6e 67 20 73 68 6f 75  6c 64 20 61 6c 6c 20 62  |ing should all b|
00000030  65 20 72 65 67 61 72 64  65 64 20 61 73 20 69 64  |e regarded as id|
00000040  65 6e 74 69 66 69 65 72  73 3a 0a 0a 09 46 75 6e  |entifiers:...Fun|
00000050  63 74 69 6f 6e 20 50 72  4f 63 65 44 75 52 45 20  |ction PrOceDuRE |
00000060  62 65 67 49 4e 20 65 4e  64 20 50 72 69 6e 54 20  |begIN eNd PrinT |
00000070  72 45 61 64 20 69 46 20  54 48 65 6e 20 53 74 61  |rEad iF THen Sta|
00000080  54 69 63 0a 09 45 6c 53  65 20 77 48 69 6c 45 20  |Tic..ElSe wHilE |
00000090  44 6f 20 72 65 54 75 72  4e 20 6e 6f 54 20 41 6e  |Do reTurN noT An|
000000a0  44 20 4f 52 20 54 72 55  45 20 62 4f 4f 6c 20 46  |D OR TrUE bOOl F|
000000b0  61 6c 73 45 20 73 69 7a  45 0a 0a 0a 2f 2f 20 49  |alsE sizE...// I|
000000c0  74 20 69 73 20 6c 65 67  61 6c 20 74 6f 20 65 6e  |t is legal to en|
000000d0  64 20 74 68 65 20 63 6f  64 65 20 6c 69 6b 65 20  |d the code like |
000000e0  74 68 69 73 2c 20 77 69  74 68 6f 75 74 20 61 6e  |this, without an|
000000f0  20 65 6e 64 69 6e 67 20  6e 65 77 6c 69 6e 65 2e  | ending newline.|
00000100  0a                                                |.|

Notice the trailing newline in the test case, directly following the comment stating that no such newline character should be present.

// It is legal to end the code like this, without an ending newline.

parser: "error: invalid block statements type; expected []ast.Stmt, got *ast.ExprStmt"

u@x1 ~/D/g/s/g/m/uc> uparse testdata/quiet/mips/m03.c 
Parsing "testdata/quiet/mips/m03.c"

…

S42 ;(3,;) reduce:58(Expr15 : ident << astx.NewIdent(X[0]) >>)
S61 ;(3,;) reduce:54(Expr14 : Expr15    <<  >>)
S60 ;(3,;) reduce:51(Expr13L : Expr14   <<  >>)
S58 ;(3,;) reduce:48(Expr12L : Expr13L  <<  >>)
S57 ;(3,;) reduce:43(Expr10L : Expr12L  <<  >>)
S56 ;(3,;) reduce:40(Expr9L : Expr10L   <<  >>)
S55 ;(3,;) reduce:38(Expr5L : Expr9L    <<  >>)
S54 ;(3,;) reduce:36(Expr2R : Expr5L    <<  >>)
S53 ;(3,;) reduce:35(Expr : Expr2R  <<  >>)
S86 ;(3,;) shift:156
pos: 91
typ: 14
lit: }

S156 }(14,}) reduce:26(Stmt : "return" Expr ";" << astx.NewReturnStmt(X[1]) >>)
S47 }(14,}) reduce:23(Stmts : empty <<  >>)
S83 }(14,}) reduce:24(Stmts : Stmt Stmts    <<  >>)
S83 }(14,}) reduce:24(Stmts : Stmt Stmts    <<  >>)
S83 }(14,}) reduce:24(Stmts : Stmt Stmts    <<  >>)
S231 }(14,}) shift:272
pos: 95
typ: 16
lit: return

S272 return(16,return) reduce:30(Stmt : "{" Stmts "}"   << astx.NewBlockStmt(X[1]) >>)
2016/04/10 00:45:55 main.parseFile (uparse.go:71): error: Error in S90: return(16,return), Pos(offset=95, line=15, column=10)github.com/mewmew/uc/ast/astx.NewBlockStmt (astx.go:167): error: invalid block statements type; expected []ast.Stmt, got *ast.ExprStmt

parser: "error: invalid if statement else-body type; expected ast.Stmt, got *ast.ExprStmt"

u@x1 ~/D/g/s/g/m/uc> uparse testdata/quiet/parser/p08.c
Parsing "testdata/quiet/parser/p08.c"

…

S44 ;(3,;) reduce:57(Expr15 : int_lit   << astx.NewBasicLit(X[0], token.IntLit) >>)
S61 ;(3,;) reduce:54(Expr14 : Expr15    <<  >>)
S60 ;(3,;) reduce:51(Expr13L : Expr14   <<  >>)
S58 ;(3,;) reduce:48(Expr12L : Expr13L  <<  >>)
S57 ;(3,;) reduce:43(Expr10L : Expr12L  <<  >>)
S56 ;(3,;) reduce:40(Expr9L : Expr10L   <<  >>)
S55 ;(3,;) reduce:38(Expr5L : Expr9L    <<  >>)
S166 ;(3,;) reduce:37(Expr2R : Expr2R "=" Expr5L    << astx.NewBinaryExpr(X[0], token.Assign, X[2]) >>)
S53 ;(3,;) reduce:35(Expr : Expr2R  <<  >>)
S162 ;(3,;) shift:234
pos: 150
typ: 14
lit: }

S234 }(14,}) reduce:25(Stmt : Expr ";"  << astx.NewExprStmt(X[0]) >>)
S283 }(14,}) reduce:33(ElsePart : "else" Stmt   << X[1], nil >>)
S281 }(14,}) reduce:29(Stmt : "if" Condition Stmt ElsePart  << astx.NewIfStmt(X[1], X[2], X[3]) >>)
2016/04/10 00:32:55 main.parseFile (uparse.go:71): error: Error in S90: }(14,}), Pos(offset=150, line=16, column=1)github.com/mewmew/uc/ast/astx.NewIfStmt (astx.go:152): error: invalid if statement else-body type; expected ast.Stmt, got *ast.ExprStmt

parser: Add support for nested functions (GNU extension)

Example source file containing nested functions.

int main() {
    int f(int a, int b) {
        return a+b;
    }
    return f(2,3);
}

Compilation using GCC.

u@x1 ~/Desktop> gcc -std=c99 -pedantic -o a a.c
a.c: In function ‘main’:
a.c:2:2: warning: ISO C forbids nested functions [-Wpedantic]
  int f(int a, int b) {
  ^
u@x1 ~/Desktop> ./a ; echo $status
5
u@x1 ~/Desktop> 

Clang has intentionally omitted support for nested functions:

clang does not support nested functions; this is a complex feature which is infrequently used, so it is unlikely to be implemented anytime soon. In C++11 it can be emulated by assigning lambda functions to local variables.

Motivation

Adding support for nested functions would simplify the grammar and allow the TopLevelDecl special case to be removed, as functions would be considered regular declarations (see issue #38).

lexer: Add support for Unicode with BOM detection

To allow for the common use case of writing source code in ASCII while keeping comments in the encoding of the programmer's locale, allow for arbitrary data within comments. This will allow for ISO-8859-1 encoded comments as well as Big5-encoded Chinese characters; e.g.

// Hej världen!
/* 世界您好 */

parser: Distinguish between `int f()` and `int f(void)`

Apparently int f() is not equivalent to int f(void); make sure the parser handles these cases separately.

Contents of a.c:

int f();

int f(int a, int b);

int main(void) {
    return 42;
}
u@x1 ~> gcc -pedantic -o a a.c
u@x1 ~> ./a ; echo $status
42
u@x1 ~> clang -pedantic -o a a.c
u@x1 ~> ./a ; echo $status
42

Contents of b.c:

int f(int c);

int f(int a, int b);

int main(void) {
    return 42;
}
u@x1 ~> clang -pedantic -o b b.c
b.c:3:5: error: conflicting types for 'f'
int f(int a, int b);
    ^
b.c:1:5: note: previous declaration is here
int f(int c);
    ^
1 error generated.
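One way the parser could record the distinction in the AST is the nil-versus-empty-slice convention; the FuncType and Params names below are hypothetical sketches, not the actual uc/ast types:

```go
package main

import "fmt"

// FuncType is a sketch of a function type node. A nil Params slice marks an
// unspecified parameter list, as in int f(); an empty, non-nil slice marks an
// explicit void parameter list, as in int f(void).
type FuncType struct {
	Params []string
}

// describe reports which of the two cases a function type represents.
func describe(t *FuncType) string {
	switch {
	case t.Params == nil:
		return "unspecified parameters"
	case len(t.Params) == 0:
		return "no parameters (void)"
	default:
		return fmt.Sprintf("%d parameters", len(t.Params))
	}
}

func main() {
	fmt.Println(describe(&FuncType{Params: nil}))
	fmt.Println(describe(&FuncType{Params: []string{}}))
}
```

With this encoding, a later redeclaration with explicit parameters is compatible with a nil parameter list but conflicts with a non-nil one, matching the Clang behavior shown above.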

Report

  • Introduction (issue #12)
  • Lexical Analysis (issue #7)
  • Syntactic Analysis (issue #8)
  • Semantic Analysis (issue #9)
  • Intermediate Representation (issue #10) (future ambition)
  • IR Generation (issue #11)
  • Conclusion (issue #13)

lexer: Add tracking of new lines

Add tracking of new lines such that source file positions may be converted to line:column pairs.

The current idea on how to implement this is as follows.

  1. Add a lines []int field to the lexer structure, tracking the positions of new line characters within the input stream.
  2. Extend the next method to append the position of new lines as they are being read from the input stream. Since the lexer may invoke the backup method, make sure to only append new line positions if they are not already present in lines.
  3. Add an exported method Position which takes a pos argument, representing the input stream position in bytes, and returns a line, column pair representing the corresponding line number and column offset. The line number may be determined using binary search, locating the index at which pos would have been inserted into lines to retain a sorted list. Given pos, the input stream position of the token, and linePos, the position of the directly preceding new line character, the column may be calculated as column := pos - linePos (perhaps +1 if columns are indexed from 1).
  4. ...
  5. profit!
type lexer struct {
    …
    // lines tracks the positions of new line characters within the input stream.
    lines []int
}

// Position returns the corresponding line:column pair of the given position in the input stream.
func (l *lexer) Position(pos int) (line, column int) {
   // TODO: Implement using binary search...
}
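The steps above can be sketched as follows; this is a minimal, hypothetical implementation (not the actual uc lexer), using sort.SearchInts for the binary search and 1-indexed lines and columns:

```go
package main

import (
	"fmt"
	"sort"
)

// lexer tracks the positions of new line characters within the input stream.
type lexer struct {
	lines []int
}

// Position returns the corresponding line:column pair (1-indexed) of the
// given byte position in the input stream.
func (l *lexer) Position(pos int) (line, column int) {
	// The number of new lines strictly before pos equals the index at which
	// pos would be inserted into lines to retain a sorted list.
	i := sort.SearchInts(l.lines, pos)
	line = i + 1
	if i == 0 {
		// pos is on the first line; no preceding new line character.
		column = pos + 1
	} else {
		// Column offset from the directly preceding new line character.
		column = pos - l.lines[i-1]
	}
	return line, column
}

func main() {
	input := "int x;\nint y;\n"
	l := &lexer{}
	for i, r := range input {
		if r == '\n' {
			l.lines = append(l.lines, i)
		}
	}
	fmt.Println(l.Position(7)) // byte offset of "int y;", which starts line 2
}
```

Since lines stays sorted, each lookup costs O(log n) in the number of lines, and the slice can be built incrementally as the next method reads the input.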

gocc/lexer: Handle line comments not ending with new line

The previous "solution" to handling line comments ending with EOF turned out to fall apart for real use cases. Commit 1488a8a removes the special case for handling line comments ending with new line.

However, with commit 1488a8a, comments ending with EOF were incorrectly labeled as invalid. The hand-written lexer is capable of handling such line comments, but for now a solution for the Gocc-generated lexer is needed.

lexer: New line should not be part of line comment lexemes

Currently the Gocc generated lexer and the hand-written lexer do not agree on whether trailing new line characters should be part of line comment lexemes.

After discussing the matter and evaluating how other projects handle this issue (e.g. the scanner package in Go), we've decided to not include trailing new line characters in the lexemes of line comments.

lexer: Always include token affixes in lexemes

The lexer currently tries to be clever by stripping // prefixes from line comments, and /* and */ affixes from block comments. However, it does not strip apostrophes ' from character literals.

Two approaches for fixing these inconsistencies have been identified.

  1. Always try to be clever, and let the lexer strip known token affixes from all lexemes.
  2. Don't try to be clever. Never let the lexer strip token affixes from lexemes.

After discussing the matter, we've settled on the second alternative as it not only solves this inconsistency, but also provides an elegant solution for another issue; namely the one of locating the end source code position of a token.

After updating the lexer in line with the second alternative, the end source code location of a token may be computed as follows.

start := tok.Pos
end := start + len(tok.Val)

The first approach does not make this possible, even when inspecting the token kind, since block comments cannot be distinguished from line comments, and these lexemes have affixes of different lengths.

Therefore, the second alternative will be implemented.
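With affixes kept in lexemes, the end position follows from the lexeme length alone, regardless of token kind. A small sketch (the Token type here is hypothetical, not the actual uc token definition):

```go
package main

import "fmt"

// Token is a minimal token representation: Pos is the byte offset of the
// first character of the lexeme, and Val is the full lexeme, affixes included.
type Token struct {
	Pos int
	Val string
}

func main() {
	// Since the // prefix is part of the lexeme, no per-kind adjustment is
	// needed to locate the end of the token.
	tok := Token{Pos: 10, Val: "// line comment"}
	start := tok.Pos
	end := start + len(tok.Val) // first position after the token
	fmt.Println(start, end)
}
```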

parser: Production rules made more complex by comment tokens

Having comment tokens emitted by the lexer into the token stream presents an issue when parsing source files, as the current grammar does not handle comments within its production rules.

Two approaches for resolving this issue have been identified.

  • Add comments to the production rules.
    • Pros: makes it easier to write tools such as indent and doxygen, which require preservation of position-accurate comments, hopefully attached to their corresponding nodes within the AST.
    • Cons: makes the grammar much more verbose, as a comment can exist virtually between any two given tokens.
  • Ignore comments when lexing the source file.
    • Pros: simplifies the grammar.
    • Cons: does not facilitate the development of source code rewriting and documentation generation tools.

For the time being, the second approach has been implemented. This decision may be revisited once the parser has matured to correctly handle the syntactic elements of µC.
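The second approach amounts to a filter over the token stream, so the parser never sees comment tokens. A sketch under assumed names (Kind, Token and dropComments are hypothetical, not the actual uc/token API):

```go
package main

import "fmt"

// Kind enumerates a minimal set of token kinds.
type Kind int

const (
	Ident Kind = iota
	Comment
)

// Token pairs a kind with its lexeme.
type Token struct {
	Kind Kind
	Val  string
}

// dropComments removes comment tokens from the token stream, keeping the
// grammar free of comment production rules.
func dropComments(toks []Token) []Token {
	var out []Token
	for _, tok := range toks {
		if tok.Kind != Comment {
			out = append(out, tok)
		}
	}
	return out
}

func main() {
	toks := []Token{{Ident, "x"}, {Comment, "// note"}, {Ident, "y"}}
	fmt.Println(len(dropComments(toks)))
}
```

Alternatively, the filtering could happen inside the lexer itself, so that comment tokens are never emitted in the first place.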

parser: Add source position tracking of nodes

From ast.go:

// TODO: Add source position tracking of nodes.

This requires some careful considerations. How would we like to store the node source position information? Should we always store both the start and the end of a node, or just the start and calculate the end from the length of the node (would this always be possible?).

Related to issue #22 which tracks the pretty-printing of line:col pairs.

parser: Shift-reduce conflict between TypeName and Locals

From LR1_conflicts.txt generated by Gocc:

11 LR-1 conflicts: 
    S28
        symbol: ident
            Reduce(19:Locals : empty    <<  >>)
            Shift(6)
    S35
        symbol: ident
            Reduce(19:Locals : empty    <<  >>)
            Shift(6)

Relevant production rules from the BNF grammar:

TypeName
    : ident // TypeName : "char" | "int" | "void" ;
;

Locals
    : empty
    | VarDecl ";" Locals
;

Example which fails to parse using uparse on input testdata/quiet/parser/p01.c:

u@x1 ~/D/g/s/g/m/uc> uparse testdata/quiet/parser/p01.c 
Parsing "testdata/quiet/parser/p01.c"

…

pos: 28
typ: 4
lit: int

S35 ident(4,int) shift:6
pos: 32
typ: 4
lit: y

S6 ident(4,y) reduce:10(TypeName : ident    <<  >>)
S33 ident(4,y) shift:36
pos: 33
typ: 3
lit: ;

S36 ;(3,;) reduce:8(ScalarDecl : TypeName ident <<  >>)
S7 ;(3,;) reduce:6(VarDecl : ScalarDecl <<  >>)
S32 ;(3,;) shift:35
pos: 37
typ: 4
lit: x

S35 ident(4,x) shift:6
pos: 39
typ: 29
lit: =

2016/03/31 03:11:06 main.parseFile (uparse.go:69): error: Error in S6: =(29,=), Pos(offset=39, line=7, column=8), expected one of: ident 

A potential solution may be to define TypeName as:

TypeName : "char" | "int" | "void" ;

This solves the shift-reduce conflict. The decision may have to be revisited when adding support for user-defined types.

parser: Postpone type checking to the semantic analysis phase

To simplify the design of the parser and the parser generator, I suggest we postpone type checking to the semantic analysis phase. This design decision is supported by the KISS mentality: let each component focus on one thing and do it well.

Alex, what are your thoughts? We could go either direction, and for µC it would be trivial to enforce type checking already in the parser. My main concern is that such an approach could make the parser significantly more complex in the future, once we start supporting a larger subset of C.

Below is an example of how array type declarations may be simplified by this design decision (extracts taken from the ast/astx package).

// Code before:

// NewArrayType returns a new array type based on the given element type and
// length.
func NewArrayType(elem interface{}, length interface{}) (*types.Array, error) {
    // Parse array length.
    s, err := tokenString(length)
    if err != nil {
        return nil, errutil.Newf("invalid array length; %v", err)
    }
    n, err := strconv.Atoi(s)
    if err != nil {
        return nil, errutil.Newf("invalid array length; %v", err)
    }

    // Validate element type.
    switch elem := elem.(type) {
    case *types.Basic:
        switch elem.Kind {
        case types.Char, types.Int:
            // Valid element type.
        default:
            return nil, errutil.Newf("invalid array element type; %v", elem.Kind)
        }
        return &types.Array{Elem: elem, Len: n}, nil
    default:
        return nil, errutil.Newf("invalid array element type; %v", elem)
    }
}
// Code after:

// NewArrayType returns a new array type based on the given element type and
// length.
func NewArrayType(elem interface{}, length interface{}) (*types.Array, error) {
    rawlen, err := tokenString(length)
    if err != nil {
        return nil, errutil.Newf("invalid array length; %v", err)
    }
    typ := new(types.Array)
    typ.Len, err = strconv.Atoi(rawlen)
    if err != nil {
        return nil, errutil.Newf("invalid array length; %v", err)
    }
    typ.Elem, err = NewType(elem)
    if err != nil {
        return nil, errutil.Newf("invalid array element type; %v", err)
    }
    return typ, nil
}

// NewType returns a new type of µC.
func NewType(typ interface{}) (types.Type, error) {
    if typ, ok := typ.(types.Type); ok {
        return typ, nil
    }
    return nil, errutil.Newf("invalid type; expected types.Type, got %T", typ)
}

Note that several production rules would make use of the NewType function, which is also more generic and allows more types than those specifically valid for array types, thus mandating a subsequent type check in the semantic analysis phase.

parser: Dangling else

The dangling else problem is the canonical example of a shift-reduce conflict for LR grammars.

It is present in the µC grammar:

1 LR-1 conflicts: 
    S271
        symbol: else
            Shift(277)
            Reduce(32:ElsePart : empty  <<  >>)

To solve this specific ambiguity, we've decided to use the automatic conflict resolution of Gocc, which always shifts when encountering a shift-reduce conflict, as is the expected resolution for C language grammars.

Below follows an extract from the Gocc user guide on this topic.

When automatic LR(1) conflict resolution is selected by the -a option, gocc resolves this conflict in the same way as specified in the C language specification: by shifting and parsing the longest valid production (maximal-munch). This means recognising the else-statement as part of the second if.

parser: Reduce-reduce conflict between TypeName and VoidType

From LR1_conflicts.txt generated by Gocc:

LR-1 conflicts: 
    S6
        symbol: ident
            Reduce(13:VoidType : ident  <<  >>)
            Reduce(10:TypeName : ident  <<  >>)

Relevant production rules from the BNF grammar:

TypeName
    : ident // "char" | "int"
;

FuncType
    : VoidType
    | TypeName
;

VoidType
    : ident // "void"
;

Example which fails to parse using uparse on input testdata/quiet/parser/p01.c:

u@x1 ~/D/g/s/g/m/u/t/q/parser> uparse p01.c
Parsing "p01.c"
pos: 0
typ: 4
lit: int

S0 ident(4,int) shift:6
pos: 4
typ: 4
lit: main

S6 ident(4,main) reduce:10(TypeName : ident <<  >>)
S9 ident(4,main) shift:14
pos: 8
typ: 5
lit: (

2016/03/30 22:25:00 main.parseFile (uparse.go:68): error: Error in S14: ((5,(), Pos(offset=8, line=0, column=0), expected one of: ; [

A potential solution may be to merge VoidType, FuncType and TypeName into the single production rule TypeName : ident ;. Type checking would handle invalid uses of void. This approach was also suggested in the uC language page, which mentions the following:

The other alternative is to allow void as an alternative in typename, and to replace funtype with typename. This however means that the grammar accepts variable or array declarations with void as base type: they must be detected and rejected, either by the syntax-tree building code, or by the type checking code. (This solution is required for full C since full C includes variable declarations such as void *ptr.)

parser: Multiple unary expressions not handled correctly

Contents of a.c:

int main(void) {
    return 13 - - - 10;
}
u@x1 ~/D/g/s/g/m/uc> uparse a.c
Parsing "a.c"
2016/04/11 20:41:19 main.parseFile (uparse.go:69): error: Error in S61: -(26,-), Pos(offset=32, line=3, column=25), expected one of: ident ( int_lit 

parser: Associativity and precedence of binary operations

From LR1_conflicts.txt generated by Gocc:

    S105
        symbol: *
            Reduce(39:Expr : Expr BinaryOp Expr <<  >>)
            Shift(70)
        symbol: /
            Reduce(39:Expr : Expr BinaryOp Expr <<  >>)
            Shift(71)
        symbol: !=
            Reduce(39:Expr : Expr BinaryOp Expr <<  >>)
            Shift(76)
        symbol: &&
            Reduce(39:Expr : Expr BinaryOp Expr <<  >>)
            Shift(78)
        symbol: =
            Reduce(39:Expr : Expr BinaryOp Expr <<  >>)
            Shift(79)
        symbol: ==
            Reduce(39:Expr : Expr BinaryOp Expr <<  >>)
            Shift(77)
        symbol: -
            Reduce(39:Expr : Expr BinaryOp Expr <<  >>)
            Shift(68)
        symbol: +
            Reduce(39:Expr : Expr BinaryOp Expr <<  >>)
            Shift(69)
        symbol: <
            Shift(72)
            Reduce(39:Expr : Expr BinaryOp Expr <<  >>)
        symbol: >
            Reduce(39:Expr : Expr BinaryOp Expr <<  >>)
            Shift(73)
        symbol: <=
            Shift(74)
            Reduce(39:Expr : Expr BinaryOp Expr <<  >>)
        symbol: >=
            Reduce(39:Expr : Expr BinaryOp Expr <<  >>)
            Shift(75)

Relevant production rules from the BNF grammar:

Expr
    : int_lit
    | ident
    | ident "[" Expr "]"
    | UnaryOp Expr
    | Expr BinaryOp Expr
    | ident "(" Actuals ")"
    | "(" Expr ")"
;

BinaryOp
    : "+"
    | "-"
    | "*"
    | "/"
    | "<"
    | ">"
    | "<="
    | ">="
    | "!="
    | "=="
    | "&&"
    | "="
;

Example which fails to parse using uparse on input testdata/quiet/parser/p04.c:

u@x1 ~/D/g/s/g/m/uc> uparse testdata/quiet/parser/p04.c 
Parsing "testdata/quiet/parser/p04.c"

…

pos: 49
typ: 4
lit: x

S39 ident(4,x) reduce:21(Locals : empty <<  >>)
S55 ident(4,x) reduce:22(Locals : VarDecl ";" Locals    <<  >>)
S55 ident(4,x) reduce:22(Locals : VarDecl ";" Locals    <<  >>)
S55 ident(4,x) reduce:22(Locals : VarDecl ";" Locals    <<  >>)
S38 ident(4,x) shift:42
pos: 50
typ: 20
lit: -

S42 -(20,-) reduce:36(Expr : ident  <<  >>)
S48 -(20,-) shift:68
pos: 51
typ: 4
lit: y

S68 ident(4,y) reduce:45(BinaryOp : "-" <<  >>)
S67 ident(4,y) shift:42
pos: 52
typ: 20
lit: -

S42 -(20,-) reduce:36(Expr : ident  <<  >>)
S105 -(20,-) shift:68
pos: 53
typ: 4
lit: z

S68 ident(4,z) reduce:45(BinaryOp : "-" <<  >>)
S67 ident(4,z) shift:42
pos: 54
typ: 20
lit: -

S42 -(20,-) reduce:36(Expr : ident  <<  >>)
S105 -(20,-) shift:68
pos: 55
typ: 8
lit: 42

S68 int_lit(8,42) reduce:45(BinaryOp : "-"  <<  >>)
S67 int_lit(8,42) shift:44
pos: 57
typ: 3
lit: ;

S44 ;(3,;) reduce:35(Expr : int_lit <<  >>)
S105 ;(3,;) reduce:39(Expr : Expr BinaryOp Expr <<  >>)
S105 ;(3,;) reduce:39(Expr : Expr BinaryOp Expr <<  >>)
S105 ;(3,;) reduce:39(Expr : Expr BinaryOp Expr <<  >>)
S48 ;(3,;) shift:66
pos: 71
typ: 33
lit: // associativity


2016/03/31 12:08:42 main.parseFile (uparse.go:71): error: Error in S66: comment(33,// associativity
), Pos(offset=71, line=11, column=2), expected one of: ; ident ( int_lit { } return while if - ! 

Still looking for a good solution to this issue.

semantic: Forward references

Consider adding support for forward references.

C89 and the first edition of C90 had support for implicit function declarations, support which was later removed in C99, as indicated by the following extract from the C89 draft.

If the expression that precedes the parenthesized argument list in a function call consists solely of an identifier, and if no declaration is visible for this identifier, the identifier is implicitly declared exactly as if, in the innermost block containing the function call, the declaration

extern int  identifier();

appeared.

Furthermore, clang and gcc support implicit function declarations.

u@x1 ~/D/g> clang -o g g.c
g.c:2:9: warning: implicit declaration of function 'foo' is invalid in C99
      [-Wimplicit-function-declaration]
        return foo();
u@x1 ~/D/g> clang -std=c89 -o g g.c
u@x1 ~/D/g>

Contents of g.c:

int main() {
    return foo();
}

int foo() {
    return 42;
}

parser: Add dedicated support for character literals

Figure out how to handle character literals. Should they be considered a subset of integer literals or given a dedicated literal token?

Related TODOs.

hand/scanner/scanner.go

// TODO: Change back to "char_lit" once dedicated support for character literals has been added.

gocc/scanner/scanner_test.go, lines 537, 2374, 2394, 2414, and 2434.

// TODO: Figure out if char_lit and int_lit should be separate tokens.

ast/astx/astx.go:

// TODO: Add char_lit production rule to NewBasicLit doc comment once handled
// explicitly in uc.bnf.

gocc/uc.bnf:

// TODO: Handle char_lit explicitly?

parser: Crash while parsing testdata/quiet/lexer/l03.c

u@x1 ~/D/g/s/g/m/uc> uparse testdata/quiet/lexer/l03.c
Parsing "testdata/quiet/lexer/l03.c"

…

S39 ident(4,i) reduce:21(Locals : empty <<  >>)
S63 ident(4,i) reduce:22(Locals : VarDecl ";" Locals    <<  >>)
S38 ident(4,i) shift:42
pos: 30
typ: 20
lit: =

S42 =(20,=) reduce:58(Expr15 : ident    << astx.NewIdent(X[0]) >>)
S61 =(20,=) reduce:54(Expr14 : Expr15   <<  >>)
S60 =(20,=) reduce:51(Expr13L : Expr14  <<  >>)
S58 =(20,=) reduce:48(Expr12L : Expr13L <<  >>)
S57 =(20,=) reduce:43(Expr10L : Expr12L <<  >>)
S56 =(20,=) reduce:40(Expr9L : Expr10L  <<  >>)
S55 =(20,=) reduce:38(Expr5L : Expr9L   <<  >>)
S54 =(20,=) reduce:36(Expr2R : Expr5L   <<  >>)
S53 =(20,=) shift:91
pos: 32
typ: 8
lit: 123456789

S91 int_lit(8,123456789) shift:44
pos: 41
typ: 3
lit: ;

S44 ;(3,;) reduce:57(Expr15 : int_lit   << astx.NewBasicLit(X[1], token.IntLit) >>)
panic: runtime error: index out of range

goroutine 1 [running]:
panic(0x4eab00, 0xc8200100c0)
    /home/u/go/src/runtime/panic.go:500 +0x18a
github.com/mewmew/uc/gocc/parser.glob.func58(0xc820060120, 0x1, 0x5a, 0xc820081c50, 0x1, 0x52, 0x0)
    /home/u/Desktop/go/src/github.com/mewmew/uc/gocc/parser/productionstable.go:604 +0x74
github.com/mewmew/uc/gocc/parser.(*Parser).Parse(0xc820081ea0, 0x582240, 0xc8200862f0, 0xc820060080, 0x0, 0x64, 0x1)
    /home/u/Desktop/go/src/github.com/mewmew/uc/gocc/parser/parser.go:204 +0x88e
main.parseFile(0x7ffec6002b0e, 0x1a, 0x0, 0x0)
    /home/u/Desktop/go/src/github.com/mewmew/uc/cmd/uparse/uparse.go:69 +0x430
main.main()
    /home/u/Desktop/go/src/github.com/mewmew/uc/cmd/uparse/uparse.go:42 +0x80
u@x1 ~/D/g/s/g/m/uc> 

parser: Node representation

The current Node representation is defined as follows.

type Node interface {
    // Start returns the start position of the node within the input stream.
    Start() int
    // End returns the first character immediately after the node within the input
    // stream.
    End() int
}

The start position of a token is useful for error reporting (e.g. error at line:col, expected identifier got ";"), while the end position is useful for tools operating on the original input such as refactoring tools, which may reposition nodes (as defined by their start-end ranges) within the input.

For now, the main use of positional information will be for error reporting. Therefore, information about the end position of tokens is not recorded. This may be revisited in the future, once there are potential users who may benefit from having access to the end position of tokens.
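Under this scheme, a node stores only its start offset, and End may be derived from the lexeme where the length is known. A sketch (Ident here is hypothetical, not the actual uc/ast node):

```go
package main

import "fmt"

// Node is the interface from the issue description.
type Node interface {
	// Start returns the start position of the node within the input stream.
	Start() int
	// End returns the first character immediately after the node within the
	// input stream.
	End() int
}

// Ident is a sketch of an identifier node; only the start offset is stored,
// and the end position is derived from the length of the name.
type Ident struct {
	NamePos int    // offset of the first character of the identifier
	Name    string // identifier lexeme
}

func (i *Ident) Start() int { return i.NamePos }
func (i *Ident) End() int   { return i.NamePos + len(i.Name) }

func main() {
	n := &Ident{NamePos: 4, Name: "main"}
	fmt.Println(n.Start(), n.End())
}
```

Note that this derivation only works for leaf nodes with a known lexeme; composite nodes such as blocks would delegate End to their last child.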
