Experimental JavaScript implementation in Rust
License: MIT
A major problem with the current GC is that the VM has no way of telling that a variable in a native function is holding a reference to a JavaScript object. If a GC cycle happens to trigger (e.g. due to calling a JS function), the object referenced by the variable is mistakenly deallocated.
Fixing this likely requires major changes to how values are represented all over the VM and how native code works with them.
One possible way to fix this is to have a separate vector of "external references" (i.e. objects referenced by bindings on the native stack). The current VM actually already has something similar, but it requires manually adding values to it. It would be nice if we could make this sort of pattern a compile error (pseudocode):
fn native_function() {
    let o = create_object();
    js_function(); // calls some JavaScript function; GC triggers in here, `o` is not marked and gets deallocated
    print(o); // use of a deallocated object
}
We could perhaps make some kind of macro that also adds the assigned value to the external refs vector, used like so:
fn native_function() {
    letroot!(o = create_object()); // allocate object AND add it to the vector of external refs
    js_function(); // GC triggers, but `o` is in the external refs vector and is not deallocated
    print(o); // ok
}
Currently, we cannot do a lot with JS values coming from WASM. VM.prototype.exec immediately inspects the value and returns it as a string, so the end user doesn't even get to decide what to do with the value. Instead of automatically inspecting values when executing a VM, we should return a Handle<Value> that users can manually inspect, or do something else with. The downside is that we can't really do RAII patterns in JS (FinalizationRegistry?), so users would have to manually free JS values.
Code:
import * as http from '@std/http';
let _t = 0;
http.listen(3000, async function (ctx) {
    _t++;
    ctx.respond({});
});
Output:
dash has unexpectedly panicked! this is a bug!
panicked at 'internal error: entered unreachable code', crates/dash_vm/src/dispatch.rs:1280:55
set RUST_BACKTRACE=1 to print a backtrace
dash has unexpectedly panicked! this is a bug!
panicked at 'called `Result::unwrap()` on an `Err` value: RecvError(())', crates/dash_rt_http/src/lib.rs:107:45
set RUST_BACKTRACE=1 to print a backtrace
dash has unexpectedly panicked! this is a bug!
panicked at 'assertion failed: `(left == right)`
left: `ThreadId(1)`,
right: `ThreadId(2)`', crates/dash_middle/src/util.rs:153:9
set RUST_BACKTRACE=1 to print a backtrace
thread panicked while panicking. aborting.
Aborted
Currently, type inference runs "at the same time" as const eval.
This leads to problems similar to #48: some optimizations only turn out to be invalid at a later point, after they have already been applied, because the type becomes more general. One such misoptimization:
let i = 0;
// `i` is inferred as number to this point,
// so this optimizes to: `() => 'number'`
let x = () => typeof i;
// Type is inferred as number | boolean only **here**
i = true;
// The closure should return 'boolean', but returns 'number'
console.log(x())
To fix this, type inference needs to run entirely before const eval, so that i is already known to be number | boolean, and not just number, by the time const eval runs.
... to be able to properly show errors past the parsing stage to the user (location)
; returns undefined (as there is no value), but is usable in expressions such as let x = (;); or while (;) {}
Following code segfaults (in release mode; panics in debug mode):
if (true) {} else {}
Seems like it wrongly pops the index off the stack that is required by a shortjmp to jump over the else branch when the condition is true.
(Originally found in a test262 case: annexB/language/global-code/if-decl-else-decl-a-global-existing-fn-update.js)
Stack overflows are not handled well. There is only an assert!() that makes sure we don't overflow (which would normally panic anyway). Stack::push should return an Option, and the call site in the VM should handle it by throwing an error.
Lines 69 to 74 in 949a588
We should attempt to walk the AST after parsing and apply certain optimizations by evaluating the value of constant expressions at compile time
Currently ; is a value-producing expression.
To speed up startup times, it may be useful to make it possible to snapshot the JavaScript heap and compiled bytecode.
This code causes a panic:
for (let i = 0; i < 2; i++) {
    (function () {
        i;
    });
}
dash has unexpectedly panicked! this is a bug!
panicked at 'internal error: entered unreachable code', crates/dash_vm/src/dispatch.rs:1338:55
set RUST_BACKTRACE=1 to print a backtrace
Normally it shouldn't emit ipostinclocal for external variables, but here it does, because the incrementing part of the loop does not know about the external-ness of the local variable i. This can probably be fixed by tracking external variables in the type infer pass (so i actually gets inferred as External(Number) instead of Number, and specialization can avoid optimizing it).
dash/crates/dash_vm/src/gc/handle.rs
Lines 123 to 125 in 530e010
This is wrong for object types with a higher alignment than usize / the max alignment of all other fields. It assumes that GcNode<()> and GcNode<ConcreteType> can be used interchangeably for computing the offset of the value field, which can be totally wrong due to padding inserted after the preceding field to align the T correctly.
std::any::TypeId will be (is, on nightly) 128 bits, to mitigate collisions (IIUC).
I'm not sure if this will have a perf impact (I've seen u128 be significantly slower than u64 in specific benchmarks, probably because it's emulated using two u64s), but we should move away from it if at all possible. It also doesn't fully fix the issue: collisions are still possible, just far less likely than they already were with a u64.
We currently use it a lot, for example to downcast an object to a specific object type to access "internal" object state (e.g. the underlying buffer of an ArrayBuffer).
We already have a BuiltinCapable trait to which we can add methods like as_arraybuffer, and that should allow us to get rid of most instances.
We probably can't fully get rid of it, because external crates can define their own object types and store custom state in them that they may want to cast JS objects to, so we need to keep it, but we should avoid using it in dash itself.
Assignments are currently incorrectly implemented, and there are quite a few bugs because of it, which are hard to solve with the current implementation. One of them is setting new properties. Assignment works by taking the last value on the stack and updating it. That's fine for normal variables that actually exist, but not for setting a new property on an object that didn't exist before: the property lookup will return undefined, and assigning to undefined is a no-op. For now, Object.defineProperty(O, K, P) can be used to get around this.
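The Object.defineProperty workaround from the paragraph above, shown in a runnable form. The behavior shown is that of a spec-compliant engine such as Node; in dash, the commented-out plain assignment would currently be a no-op:

```javascript
const obj = {};

// obj.answer = 42; // in dash, assignment to a not-yet-existing property is currently a no-op

// Workaround: define the property via its descriptor directly.
Object.defineProperty(obj, 'answer', {
  value: 42,
  writable: true,
  enumerable: true,
  configurable: true,
});

console.log(obj.answer); // 42
```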
This seems to be a feature that is not very easy to implement given the current way parameters are handled, but it should be worth it, especially considering that we already support rest parameters and that tsc desugars those to arguments when targeting <ES2015.
The core crate name collides with Rust's core crate. Instead, it should be named dash_core.
dash_sys. These bindings also make it possible to use dash in C code (it just needs an interface on the C end), so this is not strictly related to WebAssembly.
Code:
'\�'
Crash:
dash has unexpectedly panicked! this is a bug!
panicked at 'Invalid UTF8: Utf8Error { valid_up_to: 0, error_len: Some(1) }', crates/dash_middle/src/util.rs:72:28
set RUST_BACKTRACE=1 to print a backtrace
The first part where things start going wrong:
dash/crates/dash_lexer/src/lib.rs
Lines 242 to 246 in 059cc81
Declaration hoisting already partly works (you can refer to a var before it is initialized; it contains the undefined value if uninitialized). Function declarations work differently, however, in that they are immediately initialized with the function, and you can call them even before their declarations are "reached":
(function() {
    return b();
    var a = c();
    function b() {
        return c();
    }
    function c() {
        return 3;
    }
})()
Semantics should be equivalent to this:
(function() {
    // Function declarations
    var b;
    var c;
    // Variable declarations
    var a;
    // Function initialization
    b = function() {
        return c();
    }
    c = function() {
        return 3;
    }
    return b();
    // The assignment stays in place (unreachable here)
    a = c();
})();
When the other end disconnects, TcpStream will continue yielding empty ArrayBuffers. This is probably wrong.
Code:
// foo.js
function foo(a) {
    if (a === 0) throw new Error("bad");
    const b = {}; // <-- leaks. memory is never freed
    return foo(a - 1);
}
foo(1);
Running it through valgrind:
$ valgrind dashjs foo.js
# --- snip ---
==4513== HEAP SUMMARY:
==4513== in use at exit: 312 bytes in 3 blocks
==4513== total heap usage: 452 allocs, 449 frees, 106,823 bytes allocated
==4513==
==4513== LEAK SUMMARY:
==4513== definitely lost: 104 bytes in 1 blocks
==4513== indirectly lost: 208 bytes in 2 blocks
==4513== possibly lost: 0 bytes in 0 blocks
==4513== still reachable: 0 bytes in 0 blocks
==4513== suppressed: 0 bytes in 0 blocks
Emit a fast path for static property access calls (e.g. x.push()) under certain conditions, such as when x is likely an instance of Array (possibly determined through type inference).
The builtin/intrinsic operation can be translated to an index into an intrinsic op table, similar to how we already translate a variable reference to a stack index.
If the prediction turns out to be true (i.e. x instanceof Array) and no builtins were mutated, we can skip the slow property lookup process and execute the intrinsic op right away. Otherwise, fall back to the slow/unlikely path.
We might also be able to skip prediction guards entirely if we can prove at compile time that a variable is always going to be of a certain type, for example here:
const x = [];
for (let i = 0; i < 10000; i++)
    x.push(i);
x is declared const, so its type won't change. Even if it weren't declared const, we can tell that the code never assigns to x, so its type will always be array.
In this case, it would be nice if we could skip all of the property dispatch involved in looking up a property and directly execute Array.prototype.push.
I believe this can make a significant difference in programs that make use of a lot of builtins.
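A concrete reason the "no builtins were mutated" guard is required: user code can replace an intrinsic at any time, and a fast path compiled against the original builtin must then fall back to the generic lookup. A minimal demonstration (run on any spec-compliant engine):

```javascript
// Replace the builtin; any Array.prototype.push fast path is now invalid.
const original = Array.prototype.push;
Array.prototype.push = function () { return 'patched'; };

const xs = [];
const result = xs.push(1); // must dispatch to the patched function
console.log(result);       // 'patched', not the new array length
console.log(xs.length);    // 0 -- the patched push never mutated the array

Array.prototype.push = original; // restore the builtin
```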
Inputs:
let x = 3; 3x
3 x
5y
Expected Outputs:
Syntax Error: Invalid or Unexpected token
Syntax Error: Unexpected identifier 'x'
Syntax Error: Invalid or Unexpected token
Actual Outputs:
3
3
ReferenceError: y is not defined
It seems that (integer)(identifier) is getting parsed as (integer), (identifier), effectively discarding the integer. None of these inputs should pass the lexing stage, except in the case where there is whitespace between the two tokens (both are valid tokens individually).
try {
    (() => {
        [1].map(() => { throw new Error('x') });
    })();
} catch (e) {}
message: panicked at 'assertion failed: self.1 > 0', core/src/vm/stack.rs:141:9
set `BACKTRACE=1` environment variable to display a backtrace
try {
    [1].map(() => { throw new Error('x') });
} catch (e) {}
message: panicked at 'internal error: entered unreachable code', core/src/vm/mod.rs:910:8
set `BACKTRACE=1` environment variable to display a backtrace
... when it should be treated as setting its internal prototype slot. E.g.
let b = { __proto__: null };
console.log(b.__proto__); // not null!!
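For reference, the expected (spec) behavior: __proto__ in an object literal sets the internal prototype slot and does not create an own property. On a compliant engine:

```javascript
const b = { __proto__: null };

console.log(Object.getPrototypeOf(b));      // null -- the prototype slot was set
console.log(Object.getOwnPropertyNames(b)); // [] -- no own '__proto__' property
console.log(b.__proto__);                   // undefined (with a null prototype, there is no chain to find it on)
```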
Something like console.logg();
(intentional typo) currently throws this error:
TypeError: undefined is not a function
at <anonymous>
That's pretty useless, because there's no way to find out where in the source code it happened. "console.logg is not a function" would be a better error.
let x = {
    get f() { console.log(this); }
};
x.f;
This should log the object itself, but instead it logs undefined.
let f;
function F() {
    this.a = 1;
    f = () => console.log(this);
}
new F();
f();
This should print F { a: 1 } but prints undefined.
WeakSet/WeakMap entries should only exist for as long as the key is still a GC root / still has strong references. In the current implementation, entries are never removed, even when a key gets GC'd (i.e. the last strong Rc is dropped). This will probably be easier to fix with a proper GC, once we no longer use Rc:
Repro
let x;
{
    let y = {};
    x = new WeakSet([y]);
}
console.log(x) // WeakSet { <1 items> }
(This wouldn't necessarily be observable with a real GC, where values don't get freed immediately once unreachable, but we are using Rc, and because Weak<T> is used as the key, the WeakSet count should be at 0.)
var x;
var A;
x = function () {
    A;
}
A = function () {
}
console.log(new A())
This logs undefined when it should log {}. Strangely enough, commenting out A; or making other seemingly unrelated changes in function x (which is totally unused) gives the correct result.
let z = {};
let y = {__proto__:z};
z.__proto__ = y;
y.a
Running this crashes dash with a stack overflow, because it gets stuck in a recursive "loop", trying to look up property 'a' in the parent __proto__ object whenever it can't find it.
Other JS engines have guards against this. V8 throws "Cyclic __proto__ value".
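The guard in action on a compliant engine: the prototype assignment that would close the cycle throws a TypeError, so a cyclic chain never comes into existence in the first place.

```javascript
const z = {};
const y = { __proto__: z };

let error;
try {
  Object.setPrototypeOf(z, y); // would make the __proto__ chain cyclic
} catch (e) {
  error = e;
}

console.log(error instanceof TypeError); // true -- the cycle is rejected
```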
The various CLI components should use the inspect function (defined in crates/dash_rt/js/inspect.js). Right now they all simply call .toString() on the return value. That's fine for primitive values such as numbers and strings, but for objects it just prints [object Object], which is not really useful. The inspect function would also print the object's entries.
Right now, instructions have data associated with them, which makes the enum big. Ideally, generated bytecode should be a Vec<u8>. We can do this by having a separate value pool and storing indices into the pool in the bytecode, rather than directly storing values, which may be up to sizeof(usize) bytes big.
Currently, property lookup for builtin types is basically a bunch of match statements that, depending on the value type, try to find a property. That's not correct. Property lookups should go through the prototype chain, roughly like this:
function findProperty(o, p) {
    while (o !== null) {
        if (Object.prototype.hasOwnProperty.call(o, p)) return o[p];
        o = o.__proto__;
    }
}
foo[bar] === findProperty(foo, bar)
foo.bar === findProperty(foo, "bar")
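The sketch can be checked against a real prototype chain (findProperty is repeated here so the snippet is self-contained):

```javascript
function findProperty(o, p) {
  while (o !== null) {
    if (Object.prototype.hasOwnProperty.call(o, p)) return o[p];
    o = o.__proto__;
  }
}

const proto = { greeting: 'hi' };
const obj = Object.create(proto); // obj's prototype is `proto`
obj.own = 1;

console.log(findProperty(obj, 'own'));      // 1 -- found as an own property
console.log(findProperty(obj, 'greeting')); // 'hi' -- found one step up the chain
console.log(findProperty(obj, 'missing'));  // undefined -- chain exhausted
```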
Invoking a function with new should set this in the function to a new object and return that object. The __proto__ of that object should point to fn.prototype, fn.prototype.__proto__ should be Object.prototype, and Object.prototype.__proto__ should be null.
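The prototype relationships described above, checked explicitly on a compliant engine:

```javascript
function Fn() { this.a = 1; }
const obj = new Fn();

console.log(obj.a);                                                    // 1 -- `this` was a fresh object
console.log(Object.getPrototypeOf(obj) === Fn.prototype);              // true
console.log(Object.getPrototypeOf(Fn.prototype) === Object.prototype); // true
console.log(Object.getPrototypeOf(Object.prototype));                  // null
```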
The majority of the codebase is undocumented. This makes it very hard to understand certain parts of the code for people who have never seen the code before. We can later enforce public API documentation with #![deny(missing_docs)]
console.log(/.+a/.test('bba'));
This should return true, but currently returns false, because .+ eagerly consumes the whole string. Instead, it should work its way backwards and realize that .+ should only match bb.
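Backtracking made visible with a capture group: the greedy .+ first grabs the whole input, then gives the final character back so the literal a can match.

```javascript
const m = /(.+)a/.exec('bba');

console.log(/.+a/.test('bba')); // true on a backtracking engine
console.log(m[0]); // 'bba' -- the full match
console.log(m[1]); // 'bb'  -- what .+ ended up matching after giving back the final 'a'
```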
Consider the following code:
while (true) {
for (let i = 0; i < 3; ++i) console.log(i);
}
The first iteration of the outer loop runs as expected. Numbers 0 to 2 are printed.
On all other iterations, it seems to completely skip the inner loop.
Changing it to this seemingly equivalent code makes it run as expected.
while (true) {
- for (let i = 0; i < 3; ++i) console.log(i);
+ for (let i = 0 + 0; i < 3; ++i) console.log(i);
}
Looks like the inner loop mutates a constant, which is then reused in the next iteration of the outer loop. This causes the loop condition to never be true again in all subsequent iterations.
0 + 0 is a hack that forces a new number to be created, so mutating it no longer mutates the constant.
The same behavior can be observed when calling functions that create locals:
function wtf() {
    let i = 0;
    i++;
    return i;
}
console.log(wtf(), wtf(), wtf()); // expected: 1 1 1, got: 3 3 3
Currently dash tries to apply some optimizations at compile time, e.g. given
for (let i = 0; i < 1000; i++);
The comparison and increment get special opcodes that are faster than the generic cmp/inc opcodes, as they skip the type check and are optimized specifically to work with integers.
However, doing this statically is very limited and leads to many missed optimizations.
It would be better if we applied these optimizations at runtime, since we have far more information available as the code executes.
We already have some of the infrastructure required for doing this in the JIT, namely the dash_typed_cfg crate for getting a CFG with type information from bytecode, which should already allow for some really nice optimizations that we couldn't do at compile time.
Opts like constant propagation should still run at compile time, since there isn't any benefit to not doing them right away.
Currently, every value is reference counted through Rc<RefCell<T>>. This is not necessary for primitives such as numbers & bools.
const O = { ['a']: 42 };
console.log(O[42]); // expected: undefined, got: 'a'
console.log(O.a); // expected: 42, got: undefined
The lexer needs to be made aware of escape sequences (e.g. \n). Right now, "\n" is parsed as the two literal characters \ and n, not as the linebreak character.
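What correct escape handling looks like on a compliant engine: "\n" is a single linefeed character, while "\\n" really is a backslash followed by n.

```javascript
console.log("\n".length);        // 1 -- a single linefeed character
console.log("\n".charCodeAt(0)); // 10 -- the LF code point
console.log("\\n".length);       // 2 -- a literal backslash followed by 'n'
```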
It might be worth it in a lot of cases, especially for function calls. I'm guessing that the overwhelming majority of function calls in any JS program take at most 3-4 arguments. We'll have to experiment and see which numbers make sense (or whether it makes sense at all), but if we can avoid many of these tiny allocations, I think that would already be a good improvement.
I know that there are quite a few places in dash where we do something like value.call(this, vec![arg]) (i.e. a one-value vec).