Code Monkey home page Code Monkey logo

nballerina's Introduction

nBallerina

Goals

The long-term goal of the nBallerina project is to create a new compiler for the Ballerina language that is written in Ballerina and can generate native code using LLVM. It will implement the Ballerina language as defined by the 2021R1 version of the language specification; we will aim to track any agreed changes to the spec.

Eventually we expect nBallerina to replace the existing jBallerina compiler, which is written in Java. The long-term vision for Ballerina is to support execution both natively and on top of the JVM, which means it will eventually need to be possible also to generate JVM bytecode using nBallerina.

At a high level, nBallerina and jBallerina share a common architecture: they are divided into a frontend and a backend, which communicate using an intermediate representation called BIR. Since we expect it to be several years until nBallerina can replace jBallerina, in the meantime we want to set things up so that we can combine the jBallerina frontend with the nBallerina backend. Eventually we may also want to use the jBallerina backend to allow nBallerina to produce JVM bytecode.

One of the most fundamental features of the Ballerina language is that subtyping is semantic. This means that a type corresponds to a set of values, and the subtype relationship between types is defined in terms of the subset relationship between the corresponding sets of values: a type S is a subtype of T if and only if the set of values denoted by S is a subset of the set of values denoted by T. This is a very simple and powerful idea, but unfortunately is rather challenging to implement.

The implementation of jBallerina was begun many years ago, before typing was fully worked out. Its most significant current limitation is that its implementation of subtyping is not semantic. Instead it implements a syntactic approximation to semantic subtyping. Although this is good enough for many uses, it is a high priority to fix this. Semantic subtyping needs to be consistently applied in the frontend, the backend and the runtime.

The initial goals for nBallerina are the things that jBallerina does not yet do:

  • generating LLVM IR from BIR
  • semantic subtyping.

There is an earlier project that was called nBallerina. It took a different approach: a backend written in C++, which read serialized BIR from the jBallerina frontend. This is in the nballerina-cpp repository and is no longer being developed.

Structure

The nBallerina compiler, which is organized as a Ballerina project in the compiler directory, is structured into the following components written in Ballerina:

  • Semantic subtyping implementation (in the types module). This provides a normalized representation of Ballerina types, and operations on that normalized representation.
  • BIR (in the bir module). This is a definition of BIR as a Ballerina type, together with some utility functions. BIR (as used in nBallerina) represents types using the normalized representation provided by the semantic subtyping implementation. This also includes a verifier that uses the semantic subtyper to verify that the BIR is well-typed. This depends on the types module.
  • Front-end (in the front module). This generates BIR from the source code of a Ballerina module using the front.syntax module. This depends on the bir module and the front.syntax module.
  • Parser (in the front.syntax module). This produces an abstract syntax tree from the source code for a single file.
  • LLVM API (in the print.llvm module). This provides a Ballerina API to LLVM. The implementation of this API builds a textual representation of the LLVM IR as LLVM assembly language, which can be written to an .ll file and then compiled with LLVM's clang command. This is designed to be very similar to the LLVM C API. This does not depend on any of the other modules.
  • LLVM API on top of JNI (in the jni.llvm module). This implements the same API as the print.llvm module on top of the LLVM C API via JNI. (So this is Ballerina, on top of Java, on top of C via JNI, on top of C++.)
  • Native backend (in the nback module). This builds the LLVM IR representation of a Ballerina module from the BIR representation of a Ballerina module. It depends on the bir and the nback.llvm modules (but not the front module).
  • A compiler driver (in default module). This calls the frontend to generate the BIR and then calls the backend to generate LLVM. It depends on the bir, front and nback modules.

The starting point for the types and front modules was the semtype project.

As well as a compiler, nBallerina needs a runtime, which is in the runtime directory. This is currently in C and fairly minimal. The major components will include

  • runtime type checker (we hope to write this mostly in Ballerina);
  • implementation of the langlib;
  • garbage collector (current plan is eventually for this to be in Rust and built on top of MMTK);
  • scheduler.

Implementation plan

The implementation strategy is start by implementing a tiny subset of the language, and then implement progressively larger subsets. The plan is that each subset will be implemented correctly. This implies that

  • if the compiler accepts the program, then the result when executed will behave as the language spec requires it to behave;
  • the compiler will not accept a program that the language spec says is not a valid program.

In accordance with the initial goals of the project, we will not initially devote much engineering effort to usability aspects of the front end, such as high-quality error messages and error recovery. The initial priority for the front-end is correctness; we will evolve it into something more usable later.

In designing the sequence of subsets, we want to

  • maintain correctness
  • be able to write the appropriate parts of the runtime needed for each subset in Ballerina
  • define subsets syntactically as much as possible (i.e. the subset is all programs that have this grammar)

and then work towards implementing a subset that is sufficient for each of the following milestones

  1. Supporting all the BIR instructions needed to self-host the compiler. This will allow us to run the entire nBallerina compiler natively, by compiling it using the jBallerina frontend in conjunction with the nBallerina backend.
  2. A full implementation of semantic typing. This provides a foundation for jBallerina to switch over to doing semantic subtyping.
  3. Self-hosting the compiler.
  4. Implementing useful Ballerina clients.

Note that as regards the third of the above milestones, we will allow the nBallerina to make use all the relevant language features implemented by jBallerina, since a secondary goal of the nBallerina project is to help with testing jBallerina. This should not affect how long it takes us to get to the first of the above milestones, but will mean it takes us longer to get to the third. We will, however, initially restrict nBallerina's use of concurrency: we don't want to have to implement Ballerina's concurrency features before we can self-host.

Usage

The compiler has not yet got to a stage where it is useful. But if you want to play with it or help with development, this is the way:

  1. Clone the nBallerina repository.
  2. Download and install the latest Ballerina distribution (Swan Lake not 1.2.x)
  3. You can build the compiler by using the command bal build in the compiler directory; this will generate a file target/bin/nballerina.jar. This should work on any system that Ballerina works on.
  4. You can use java -jar nballerina.jar example.bal to compile a Ballerina module into an LLVM assembly file example.ll (note that only a tiny subset of the language is currently implemented, as described in the Status section).
  5. If you want to be able to turn the LLVM assembly file into something you can execute, there are additional requirements: Linux or OS X, LLVM 16 and GNU make. With these, you can build the runtime and run the tests by running make test in the top-level directory. This compiles and executes all the test cases and checks that they produce the right outputs. You can use e.g. make -j8 to make it run tests in parallel.

If you want to turn the LLVM assembly into an executable, you can use the test/run.sh command.

You can try out the semantic subtyping implementation using

java -jar ballerina.jar --showTypes example.bal

where example.bal is a Ballerina module containing only type definitions and const definitions. It will print out the subtype relationships between all the defined types.

To output the Ballerina intermediate representation (BIR) for a module, use

java -jar ballerina.jar --bir example.bal

Testing

The compiler is tested using the test cases in the compiler/testSuite directory. The bal build command performs a first level of testing on these: it checks that the test cases that should get compile errors do, and that the test cases that should not get compile errors do not. This should work on any platform on which the Ballerina distribution works.

For those test cases that are valid Ballerina programs, the Makefile in the test directory is used to further test that the generated LLVM assembly files can be compiled with LLVM and give the correct output when executed. This Makefile has the following targets:

  • test (the default target) forces a rebuild of all the .ll files if the compiler jar has changed, and then does compile and testll
  • compile builds any out of date .ll files (but does not consider the compiler jar when determining whether a .ll is out of date)
  • testll builds an executable for each test case, executes it and checks its output

Status

We have completed subset 14 and are working on subset 15.

The semantic subtyping implementation is further along than the backend. It implements the subset of Ballerina's type syntax described by this grammar.

nballerina's People

Contributors

anupama-pathirage avatar gimantha avatar github-actions[bot] avatar hasithaa avatar heshanpadmasiri avatar jclark avatar kavinduzoysa avatar lochana-chathura avatar maheshika avatar manuranga avatar poorna2152 avatar rdhananjaya avatar rdulmina avatar sameerajayasoma avatar sanjiva avatar ushirask avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

nballerina's Issues

Implement `trap` expression

Ballerina allows panics to be caught using trap expressions. These are used only for abnormal conditions, such as program bugs, not for normal error handling, so use of trap should be infrequent.

This follows on from #44.

I see three possible implementation strategies:

  1. Use something similar to what C++ exceptions does (see #28)
  2. Use setjmp/longjmp (this is viable for Ballerina, because, at least for now, we do not have destructors)
  3. Use function return values: where a function returns a value with LLVM representation R, the actual return type would be a structure { P, R }, where P can represent a error value or a nil value N, representing the absence of an error; then a normal return R is handled by returning { N, R }, and a panic with associated error E is handled by returning { E, undef }.

Extend nback.llvm to support function attributes

C API has

  • LLVMAttributeIndex (what the attribute applies to: function, return value or specific parameter)
  • LLVMAddAttributeAtIndex
  • LLVMCreateStringAttribute
  • LLVMCreateEnumAttribute (looks like val argument is typically 0), which needs
    • LLVMGetEnumAttributeKindForName

Run tests from annotated ballerina files

Description:

Write tests such that it is execute without writing any extra code. Expectation and input should be mentioned in the file itself.
Slimier to LLVM LIT system errors should be mentioned inline.

Implement nback.llvm support for conditionals

Add support for:

  • br (we need both conditional and unconditional variants) [LLVMBuildBr, LLVMBuildCondBr]
  • icmp (strictly speaking we may not need all the possible predicates but I think we can support all similar to how we do binaryInt) [LLVMBuildICmp]
  • unreachable (I think this is optional but better if we can add this after calling abort and should be trivial to implement)[LLVMBuildUnreachable]

Avoid undefined behavior for LLVM "sdiv" with RHS of 0 w/o explicit checks

LLVM defines its sdiv instruction to have undefined behaviour when the right operand is zero. Similarly, when the left operand is INT_MIN and the right operand is -1.

One way to solve this is to generate code that explicitly tests for these cases before using the sdiv instruction. This issue is to investigate whether there is a better way.

At the assembly code level, at least for x86, we would ideally produce a x86 IDIV instruction. This would result in a CPU exception which would the OS would turn into a signal, which we can then catch. Is there a way to make LLVM generate this assembly, that is portable and supported?

Rust has the same semantics for integer division that we do, and uses LLVM, so we should look at how they do it.

We should also look at how Swift does integer division.

There are similar problems with remainder also.

Take a snapshot of the stack that can be used to generate a backtrace

In order to implement the error value #167 and to implement #45, we need not just to print a backtrace as in issue #6, but take a snapshot of the stack, which we can store in the error value and be used subsequently to generate a backtrace.

We want to be as lazy as possible here, i.e. do as little work as possible when taking the snapshot, because the backtrace may never be needed.

Investigate using C++ exception handling ABI to catch panics

What this would mean would be that

  • stack unwinding can use the C++ runtime's implementation
  • we generate code for Ballerina trap similar to what C++ generates for C++ catch/throw
  • Ballerina panics can interop with code in C++ that uses exception handling

Reasons why this might be a good approach:

  • one of the implementation strategies that Rust uses for panics is on top of C++ exception/handling
  • Ballerina panic/trap semantics are a subset of C++ catch/throw
  • C++ exception handling has a well-defined, stable ABI
  • LLVM's exception handling is C++ centric: LLVM IR to a large extent just models C++ exception handling semantics
  • the C++ ABI's concept of personality function provides a degree of language-specific functionality
  • if we use C++ for the native parts of langlib, then our native langlib functions can use C++ throw when they need to panic (see #29)

Call _bal_panic after ret instruction

Description:
If bal file contains a function call, it is considered as potentially panicking instruction and call _bal_panic after return

[INSN_CALL]: true

Steps to reproduce:
ball file:

public function main() {
    int x = foo();
}

function foo() returns int {
    return 2;
}

generated llvm for the main function:

define void @_B_main () {
  %R0 = alloca i64, align 8
  %R1 = alloca i64, align 8
  %R2 = alloca i64, align 8
  %R3 = call i64 @_B_foo ()
  store i64 %R3, i64* %R0, align 8
  %R4 = load i64, i64* %R0, align 8
  store i64 %R4, i64* %R1, align 8
  ret void ; below thress line are not executed
  %R5 = load i64, i64* %R2, align 8 
  call void @_bal_panic (i64 %R5)
  unreachable
}```

Serialize SemTypes to a pseudo-Ballerina like text format

We should serialize SymType to a pseudo-Ballerina text format with possible addition of not operator.
Since SemTypes have references to TypeCheckContext we can't trivially serialize it. Some type references may be needed.

Prerequisite for #15.

Test failures in master

Description:
Running the command bal test --tests "testCompile*" ../compiler says that two tests are failing.

		[fail] testCompileError:

		    wrong error position /home/kavindu/WSO2-GIT/nbal/nballerina/tests/../tests/Eassign01.bal
		    	
		    	expected: <int> '4'
		    	actual	: <()> ''

		    	at ballerina.test.0_8_0:createBallerinaError(assert.bal:43)
		    	ballerina.test.0_8_0:assertEquals(assert.bal:75)
		    	wso2.nballerina.0_1_0.tests.integration:testCompileError(tests/integration.bal:33)

		[fail] testCompileError:

		    wrong error position /home/kavindu/WSO2-GIT/nbal/nballerina/tests/../tests/Eassign00.bal
		    	
		    	expected: <int> '4'
		    	actual	: <()> ''

		    	at ballerina.test.0_8_0:createBallerinaError(assert.bal:43)
		    	ballerina.test.0_8_0:assertEquals(assert.bal:75)
		    	wso2.nballerina.0_1_0.tests.integration:testCompileError(tests/integration.bal:33)

Also running script (sh testll.sh) says,

Aborted (core dumped)
Aborted (core dumped)
Aborted (core dumped)
Aborted (core dumped)

Steps to reproduce:

bal test --tests "testCompile*" ../compiler
sh testll.sh

Implement nback.llvm support for struct types

Add support for:

  • Define structure types (minimally {i64, i1}) and use them as a type of a value / return type of function
  • Use extractvalue so we can get values out of them by index

Backend support for integer arithmetic panics

The front-end generates BIR that allows for integer arithmetic instructions to panic on overflow or division by zero. We need to support this in the backend.

There are two parts to the design of this:

  1. how do we detect overflow/division by zero?
  2. what do we do when we have detected this?

For part 1, there are two subcases:

  • add, sub, mul: in these cases, we use the intrinsics provided by LLVM
  • div, rem: in this cases, we explicitly test whether the operands would overflow or divide by zero (this is the conclusion of issue #38)

For part 2, I think the best solution for subset 1 is to call a runtime function passing it the source location that caused the panic and a code for the type of panic (overflow vs division by zero); this runtime function will print out a message with the location and terminate the program.

Implement front-end codegen support for break and continue

The parser being developed in #23 includes support for break and continue. The flow control implemented in BIR is sufficient for these. But codeGen does not yet handle it. We need somehow to pass down into statement codeGen the labels to which break and continue should jump.

Serialize BIR format

Description:
Serialize BIR format to a text format and incorporate that into the testing.

Reorganize repo to accommodate runtime

Currently the root of the repo is the ballerina package containing the compiler.

But nballerina will need other parts, in particular a runtime, that are tightly integrated with the compiler.

My current thinking is:

  • make the current root directory become a subdirectory compiler
  • rename the package module from nballerina to nballerina.compiler
  • add runtime directory for the runtime
  • use CMake to build everything: CMake will run bal to build the compiler package
  • add tests directory to test the compiler together with runtime (when a program is compiled and executed, does it produce the right result)
  • add docs directory for project documentation

Need to handle mangling/demangling of global symbols

Global identifiers in Ballerina are structured values with at least organization/module/name (not sure about version). We need to encode this into a symbol that can go in an object file.

The backtrace code #6 needs to demangle.

Implement nback.llvm support for calling intrinsic functions

We can call intrinsics using a call instruction like other functions. But we also need to create a Function that refers to them. The nback.llvm module should take care of emitting the appropriate declaration.

LLVM C API has:

  • LLVMGetIntrinsicDeclaration
  • LLVMLookupIntrinsicID

We may need two types of Function.

This is needed in order to be able to use #24 to support arithmetic with overflow.

Call inst contains extra llvm instructions

Description:
$subject. It seems there are two extra load and store instruction which are not needed.

Steps to reproduce:
bal file:

public function main() {
    int x = foo();
}

function foo() returns int {
    return 2;
}

llvm ir for the main function:

define void @_B_main () {
  %R0 = alloca i64, align 8
  %R1 = alloca i64, align 8
  %R2 = call i64 @_B_foo ()
  store i64 %R2, i64* %R0, align 8

  ; these two instructions are not needed.
  %R3 = load i64, i64* %R0, align 8
  store i64 %R3, i64* %R1, align 8

  ret void
}
define internal i64 @_B_foo () {
  ret i64 2
}

This can be verified by generating a llvm ir for a similar C code.

Make io:println work

Hardcode a partial implementation of io:println (just accepting a single int argument), so we can do #12

Consider using C++ for the langlib runtime

This builds on #28.

The idea would be to use a carefully chosen, very small subset of C++. In particular, all Ballerina values would be represented by what C++11 calls standard-layout classes: these are basically POD structs that have the same layout as equivalent C structs (so no virtual functions, no destructors).

Compared to C, this gives us a number of useful features.

First, we have inheritance. I can say

struct Header {
  int gc_header;
};

struct String : Header {
   int length;
  // bytes go here
};

Second, we can use C++ throw (building on #28).

Third, we can use C++ templates to provide separate per-representation implementations of Ballerina structures, particularly arrays.

Fourth, we can use C++ classes to implement a wrapper around pointers into the garbage-collected heap; the goal of the wrapper would be to ensure that the collector can find all the roots that are in C++ stack frames.

Compared to C, there is one important thing that we loose: variable length structures, e.g.

struct String {
   int length;
   char[] data;
};

Print a backtrace when a signal is caught

Ballerina requires that the / operator result a panic if its right operand is 0. We could handle this by explicitly testing for 0, before performing the division, but it would be much more efficient if instead we relied on the CPU's ability to generate an exception in this case. In POSIX terms, the CPU exception will turn into a signal (SIGFPE in this case), which the kernel will deliver to the program. Using sigaction we can catch this exception and get the address (in siginfo.si_addr) of the instruction that caused CPU exception. We should then be able to use this to print a backtrace.

Enhance runtime panic function to print backtrace before exiting

Eventually panics need to unwind until they are caught with trap, but until then our implementation of panic can simply print a backtrace before exiting.

This needs some debug info (#7), and is an enhancement to the strategy in #44, which explicitly passes the location of the offending instruction to the runtime panic function.

Generate module target triple

At the moment, compiling the .ll file generated by the compiler produces the warning:

warning: overriding the module target triple with x86_64-pc-linux-gnu [-Woverride-module]

We need to generate the appropriate target triple to deal with this.

This should be a configurable value. Eventually, it should be possible to override with command-line option.

BIR verifier stage 1

Implement enough of a verifier to catch semantics errors in subset 1. At least:

  • in function call, number of arguments must match signature of called function
  • in function call, type of arguments must match signature of called function
  • integer binary operations only on int not boolean
  • == and != must have non-empty intersection
  • integer minus only on int not boolean
  • logical ! only on boolean not int
  • condition of if/else/while is boolean
  • return statement must return value that matches return type of signature
  • types in assignment

Maybe more...

Make sure we have tests for

  • function with non-nil return value in call statement
  • function with nil return value in expression

Func call inside if block contains extra store instruction

Description:
$subject

Steps to reproduce:
bal file for main function:

public function main() {
    int x = 3;
    if (x == 3) {
        bar(100);
    } 
}

llvm ir for the main function:

define void @_B_main () {
  %R0 = alloca i64, align 8
  %R1 = alloca i1, align 8
  %R2 = alloca i1, align 8
  store i64 3, i64* %R0, align 8
  %R3 = load i64, i64* %R0, align 8
  %R4 = icmp eq i64 %R3, 3
  store i1 %R4, i1* %R1, align 8
  %R5 = load i1, i1* %R1, align 8
  br i1 %R5, label %L1, label %L2
L1:
  call void @_B_bar (i64 100)
  store i1 0, i1* %R2, align 8 ; This is an extra instruction
  br label %L2
L2:
  ret void
}

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.