roc-lang / roc
A fast, friendly, functional language. Work in progress!
Home Page: https://roc-lang.org
License: Universal Permissive License v1.0
This should be valid Roc, according to the formatter
if foo bar then
    a b c
else
    d e f
I'd like to try having the formatter do blank lines in defs like this:
users =
    user1 = makeUser blah1
    user2 = makeUser blah2

    complexUser1 = userFromRecord {
        name: "Blah"
        email: "[email protected]"
    }

    complexUser2 = userFromRecord {
        name: "Blah2"
        email: "[email protected]"
    }

    user3 = makeUser blah3
    user4 = makeUser blah4

    [user1, user2, complexUser1, complexUser2, user3, user4]
The rules would be:
Type variables starting with as give an error, likely because they are confused with the as keyword used to create inline aliases. e.g.
sortWith : (asd, asd -> Int), (Foo asd) -> (Foo asd)
sortWith = \_, list -> Nil
The tests are called fn_int_list_len and loaded_int_list_len
This..
identity = \a
-> a
identity 42
.. formats to this ..
identity = \a
-> a
identity 42
.. which is not valid because -> a is indented 2 more than the identity = \a above it.
0x7fff_ffff_ffff_ffff
doesn't parse correctly.
Octal and Binary literals likely have a similar problem!
The main goal of this issue is to make sure we don't forget to take this into account when implementing Roc's standard hash maps! It can be closed when we have a hash flooding mitigation strategy in place in the standard library.
Hash flooding mitigation is a complicated issue, but I did a bunch of investigation and thinking about it in 2019, and I want to record some thoughts here to get them out of my head. This can all be revisited when we get to adding an actual hashmap implementation to Roc.
So hash flooding attacks are a thing. The basic idea is to DoS a web server by sending it HTTP payloads which are malicious in a very specific way: the attacker knows the server will parse those payloads into hash maps, and the payloads are designed to produce so many hash collisions that the hash map becomes effectively a linked list.
Lookups into linked lists are super slow compared to hash lookups, and there's apparently been a proof-of-concept of being able to lock up an entire server (Rails or PHP or something - I forget) with a single POST. This suggests an individual could theoretically DoS a vulnerable server without needing a whole botnet to turn DoS into DDoS.
Hash flooding attacks have been demonstrated in proof-of-concept laboratory attacks against real-world systems (e.g. Perl, Rails), but I have not been able to find any documented cases of a real production system actually being hit by one. Nevertheless, I'm convinced the concern is worth taking seriously.
In this talk the people behind some of the proofs-of-concept (and also someone who writes software for DNS servers, which are by default vulnerable to these attacks) talk about defending against hash flooding attacks.
The mitigation strategy they recommend is to use the SipHash hashing algorithm, which was published by researchers working on these hash flooding proofs-of-concept. SipHash has these characteristics:
I really want to avoid having random hashing in Roc. We can get a lot of benefits from everything in the language being a pure function - in terms of caching, reproducibility of test runs, etc. - and using a randomly seeded hasher would violate some of those assumptions. For example, if a different seed were chosen before every test run, then we'd have tests that are both "pure" and also potentially flaky.
Those flakes might reveal improper dependence on the hashing function on the part of the implementation, which would perhaps be worth the disruption of the flake, but they might also be net harmful if it's only the test that relies on the hash. Either way, it'd be pretty frustrating if clearing your cache caused one or more of your tests to either pass or fail, when they're all pure functions.
Rust uses SipHash by default.
The arguably more robust defense against hash flooding attacks is to do collision resolution using a tree rather than a linked list. This way, instead of the collisions causing the hash table to degrade to be effectively a linked list, they cause it to degrade to be effectively a tree - which of course does not take nearly as long to search.
In the talk, this strategy is acknowledged extremely briefly but dismissed as something that isn't really used in practice, because hash map implementors don't want to bother implementing tree-based collision resolution because trees are more complicated than linked lists.
Tree-based resolution sounds worth the effort for Roc. It sacrifices neither referential transparency nor performance - because we can put it behind a conditional based on number of collisions so far (e.g. don't switch from linear probing to a tree until after like 16 collisions, which suggests something fishy is happening) that can be correctly branch predicted basically every time outside of an actual hash flood attack.
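The threshold idea above can be sketched in Rust. This is a minimal illustration, not Roc's implementation; `Bucket`, `COLLISION_THRESHOLD`, and the linear-scan representation are all assumptions made for the example.

```rust
use std::collections::BTreeMap;

// Hypothetical per-bucket collision store that degrades to a tree after
// too many collisions. Names here are illustrative, not Roc internals.
const COLLISION_THRESHOLD: usize = 16;

enum Bucket<K: Ord, V> {
    // Normal case: a short run of colliding entries, scanned linearly.
    Linear(Vec<(K, V)>),
    // Flood case: collisions resolved via an ordered tree (needs K: Ord).
    Tree(BTreeMap<K, V>),
}

impl<K: Ord, V> Bucket<K, V> {
    fn new() -> Self {
        Bucket::Linear(Vec::new())
    }

    fn insert(&mut self, key: K, value: V) {
        match self {
            Bucket::Linear(entries) => {
                entries.push((key, value));
                // This branch is almost never taken outside an attack,
                // so it should be correctly predicted nearly every time.
                if entries.len() > COLLISION_THRESHOLD {
                    let tree: BTreeMap<K, V> = entries.drain(..).collect();
                    *self = Bucket::Tree(tree);
                }
            }
            Bucket::Tree(tree) => {
                tree.insert(key, value);
            }
        }
    }

    fn get(&self, key: &K) -> Option<&V> {
        match self {
            Bucket::Linear(entries) => {
                entries.iter().find(|(k, _)| k == key).map(|(_, v)| v)
            }
            Bucket::Tree(tree) => tree.get(key),
        }
    }
}
```

The point is that the extra cost in the non-attack case is a single length comparison per insert, which the branch predictor should get right essentially always.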
One reason Rust might not do hash flooding defense this way is that in order to have an efficient tree, you need an Ord implementation as well as a Hash implementation for the keys. Rust might not want to impose that restriction on all HashMap users, but in Roc we can automatically derive one with no extra work on the part of end users anyway.
Theoretically we could do that extra Ord derivation only in production builds, but in practice, let's be honest: nearly all hash keys are strings or integers anyway, so it's probably not going to save anyone any noticeable amount of code gen.
I looked into this a bit at one point, and it seemed that re-hashing could also be a way to mitigate hash flooding attacks. That wouldn't even require an ordering constraint. I haven't investigated it deeply though!
There are various server-specific ways to defend against hash flooding. For example, to defend against hash flooding attacks using HTTP headers specifically, you could cap the number and length of HTTP headers, such that after encountering excessively long or excessively many headers, the request gets rejected immediately. This sort of thing should probably be done regardless (along with timeouts), as other DoS attacks rely only on sending lots of data.
However, assuming tree-based conflict resolution can be done at the insignificant runtime cost of one always-correctly-predicted branch (plus some code size), that would benefit platform and application authors alike in that they would be "default safe" in case they overlooked a particular attack vector.
It seems like a reasonable idea to start with hashbrown - which is what Rust used to speed up its standard HashMap, and which is based on Google's highly optimized SwissTable implementation. From there, we'd add the Ord requirement and the extra logic to fall back on a tree after a certain number of collisions.
Alternatively, we could avoid the Ord requirement by using rehashing. The basic idea would be that after a suspiciously high number of collisions (perhaps eight, for example) we redo the hash calculation, passing a tuple of (val, salt) instead of just val, in hopes that this will no longer collide. (The salt could be - for example - the total number of collisions we'd encountered so far, or in a linear probing scheme, the most recently tried bucket index.) The idea is that it would be infeasible to find a payload which not only triggered the initial collisions, but which also somehow continued to collide after salted rehashing.
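A minimal sketch of the salted-rehash idea, using Rust's standard `DefaultHasher` as a stand-in hash function; the function names and the choice of collision count as salt are illustrative assumptions, not Roc's actual scheme.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Plain hash of a value.
fn hash_one<T: Hash>(val: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    val.hash(&mut hasher);
    hasher.finish()
}

// Salted rehash: after a suspicious number of collisions, hash the
// (value, salt) pair instead of the value alone. Here the salt is the
// collision count, as suggested above.
fn salted_rehash<T: Hash>(val: &T, salt: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (val, salt).hash(&mut hasher);
    hasher.finish()
}
```

Because the hashing stays deterministic (the salt is derived from the table's own state, not from a random seed), this keeps lookups referentially transparent while still escaping a crafted set of colliding keys.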
Currently when loading modules, we combine all Problems together into one Vec. Instead, we should have one per ModuleId, so we can report which module the problem was in - so you can tell, for example, what the line numbers actually refer to!
parse::Expr has a PrecedenceConflict case, for situations like a == b == c. However, nothing leads to it, so things like a == b == c are parsed to BinOp(a, BinOp(b, c)).
This should be the correct format for multi line definitions
nums =
    [
        27,
        28,
        29
    ]

nums
Specifically, polymorphic numbers like 42 : Num *.
The Int literals get a specific Reason::IntLiteral (similar for Float) in their expectation. That is what the variable in their Expr constructor is for: making this extra expectation.
NumLiteral doesn't currently do this. Perhaps that's fine? We should check when error reporting works.
I tried these formatter tests, and the parser failed:
expr_formats_same(indoc!(
    r#"
    x =
        5
    42
    "#
));
expr_formats_same(indoc!(
    r#"
    x =
        # This is 5
        5
    42
    "#
));
Currently every Subs has a massive quantity of entries like this:
1700 => p: 1700, c: Structure(EmptyTagUnion), r: 1, m: none c: None,
1754 => p: 1754, c: Structure(EmptyTagUnion), r: 1, m: none c: None,
1763 => p: 1763, c: Structure(EmptyTagUnion), r: 1, m: none c: None,
If we were to reserve hardcoded Variable values for EmptyTagUnion and EmptyRecord, we could presumably often use that hardcoded variable instead of creating fresh Subs entries - for example, in canonicalization when defining a record literal (meaning it's definitely a closed record) or pattern matching on a tag union (meaning it's definitely a closed union), or when canonicalizing closed type annotations.
Rigid variables that are only used once in an annotation get lost when importing that definition. That's because, after type inference, the rigid variable is actually a flex variable.
I think this is because of an asymmetry in type checking (flex ~ rigid != rigid ~ flex). So when we define
withDefault : Result a x, a -> a
and then import it
foo = withDefault
the type of foo is Result a *, a -> a.
This is probably fine, but might be worth looking into to improve error message quality.
It's no longer used across threads, so we don't need to pay the price of atomics anymore.
We should keep it around though!
Comments at the end of a module get removed by the formatter.
x = 5
# Hello
y = 10
42 # gets removed
I made a branch with a failing test; it doesn't matter which test we add a trailing comment to - they all remove it.
See failing branch => 8f4e3d2
I haven't written a test to reproduce this yet, but I believe the following is true:
If I write two different type annotations, one of which should have only * and the other of which should have a named variable like a, the act of generating the a could accidentally mutate the other annotation's * and turn it into an a in Subs.
Seems like we still haven't fixed it completely. Likely a matter of some variable(s) not getting into the flex_vars field (not added to a Vec, missing exists, who knows).
That means that
numIdentity : Num.Num a -> Num.Num a
numIdentity = \x -> x
x = numIdentity 42
y = numIdentity 3.14
{ x, y, numIdentity }
Currently gives
{ numIdentity : <type mismatch> -> Num a, x : Num a, y : Num a }
In the load tests, top-level values that are not exposed but are used in an exposed definition are marked as unused. e.g. in the following
exposing [ foo ]
foo = 3 + bar 2
bar = \_ -> 32
the bar function would be marked as unused and trigger an error, even though it is used by an exposed function.
The formatter should recognize and format multi line when conditions
when
    v
    |> f
    |> g
is
    Ok _ -> ..
In the case of
when x is
    3 | 4 if condition -> 2
    _ -> 1
I think we should format this to
when x is
    (3 | 4) if condition -> 2
    _ -> 1
to make it clear that the guard is for the full pattern, not just the 4 pattern.
I get infinite recursion with this snippet
interface AStar
    exposes [ findPath ]
    imports []

# a port of https://github.com/krisajenkins/elm-astar/blob/2.1.3/src/AStar/Generalised.elm

Model position :
    { evaluated : Set position
    , openSet : Set position
    , costs : Map.Map position Float
    , cameFrom : Map.Map position position
    }

initialModel : position -> Model a
initialModel = \start ->
    { evaluated : Set.empty
    , openSet : Set.singleton start
    , costs : Map.singleton start 0
    , cameFrom : Map.empty
    }

cheapestOpen : (position -> Float), Model a -> Result position [ IndexOutOfBounds ]*
cheapestOpen = (\costFunction, model ->
    folder = \position, smallestSoFar ->
        when Map.get position model.costs is
            Err e ->
                Err e

            Ok cost ->
                positionCost = costFunction position

                if positionCost + cost < smallestSoFar.cost then
                    { position, cost: cost + positionCost }
                else
                    smallestSoFar

    Set.foldl model.openSet folder (Err IndexOutOfBounds)
        |> Result.map (\x -> x.position)
)

reconstructPath : Map position position, position -> List position
reconstructPath = \cameFrom, goal ->
    when Map.get goal cameFrom is
        Err KeyNotFound ->
            []

        Ok next ->
            List.push goal (reconstructPath cameFrom next)

updateCost : position, position, Model position -> Model position
updateCost = \current, neighbour, model ->
    newCameFrom = Map.insert neighbour current model.cameFrom

    newCosts = Map.insert neighbour distanceTo model.costs

    distanceTo = reconstructPath newCameFrom neighbour
        |> List.length
        # |> toFloat

    newModel = { model & costs : newCosts , cameFrom : newCameFrom }

    when Map.get neighbour model.costs is
        Err KeyNotFound ->
            newModel

        Ok previousDistance ->
            if distanceTo < previousDistance then
                newModel
            else
                model

findPath : { costFunction : (position, position -> Float), moveFunction : (position -> Set.Set (position)), start : position, end : position } -> Result (List position) error
findPath = \{ costFunction, moveFunction, start, end } ->
    astar costFunction moveFunction end (initialModel start)

astar :
    (position, position -> Float)
    , (position -> Set position)
    , position
    , Model position
    -> Result (List position) *
astar = \costFn, moveFn, goal, model ->
    when cheapestOpen (costFn goal) model is
        Err e ->
            Err e

        Ok current ->
            if current == goal then
                Ok (reconstructPath model.cameFrom goal)
            else
                modelPopped = { model & openSet : Set.remove current model.openSet , evaluated : Set.insert current model.evaluated }

                neighbours = moveFn current

                newNeighbours = Set.diff neighbours modelPopped.evaluated

                modelWithNeighbours = { modelPopped & openSet : Set.union modelPopped.openSet newNeighbours }

                modelWithCosts = Set.foldl (updateCost current) modelWithNeighbours newNeighbours

                astar costFn moveFn goal modelWithCosts
via
#51 roc::unify::unify_alias (subs=0x7ffff521d840, pool=0x7ffff51d4b08, ctx=0x7ffff5094bd0, symbol=..., args=..., real_var=...) at src/unify.rs:123
#52 roc::unify::unify_context (subs=0x7ffff521d840, pool=0x7ffff51d4b08, ctx=...) at src/unify.rs:93
#53 0x00005555559d8de5 in roc::unify::unify_pool (subs=0x7ffff521d840, pool=0x7ffff51d4b08, var1=..., var2=...) at src/unify.rs:81
#54 roc::unify::unify_alias (subs=0x7ffff521d840, pool=0x7ffff51d4b08, ctx=0x7ffff509ada0, symbol=..., args=..., real_var=...) at src/unify.rs:123
#55 roc::unify::unify_context (subs=0x7ffff521d840, pool=0x7ffff51d4b08, ctx=...) at src/unify.rs:93
#56 0x00005555559d8de5 in roc::unify::unify_pool (subs=0x7ffff521d840, pool=0x7ffff51d4b08, var1=..., var2=...) at src/unify.rs:81
must verify later that this is fixed (but I think the underlying cause is some different issue)
To reproduce, change builtins/Default.roc to import a module that doesn't exist, and then run the tests.
Presumably this is because the "file not found" error doesn't lead to the "number of modules processed" counter being incremented, which in turn means the "check to see if we're done loading everything" check never passes.
If so, should be easy to fix.
The author of wyhash (which we're currently using as the basis for MutMap etc.) released a new, even faster hashing algorithm called o1hash (search for "o1hash" in the wyhash README), which the author describes as "fastest in hashmap but not secure." I'm not sure what that means (wyhash is also not cryptographically secure; does he mean something beyond that?), but he also says o1hash is now being used in the V programming language.
It's probably worth benchmarking, at least, to see how using it would affect compile time.
Currently you can do a standalone type annotation inside a Defs; this is shorthand for "go ahead and compile this as if the body were Debug.todo".
This doesn't yet work for top-level declarations inside a module, but the design is that it should work there too! Writing this down so we don't forget it.
We currently have a canonical Expr::If but we don't use it, because we still desugar If to When!
When we know we are branching between two hardcoded constant values, it's faster to emit a select instruction than a branch instruction.
instruction than a branch instruction.
This should happen after any constant propagation.
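For illustration, here is what a branchless select looks like when written out by hand in Rust. An actual backend would emit LLVM's `select` instruction directly; this hand-rolled mask version is just a sketch of the idea, and the function name is made up for the example.

```rust
// Branchless select between two values: build an all-ones or all-zeros
// mask from the condition and combine the two values with it, so no
// branch instruction is needed.
fn select_i32(cond: bool, if_true: i32, if_false: i32) -> i32 {
    let mask = -(cond as i32); // true -> -1 (all ones), false -> 0
    (if_true & mask) | (if_false & !mask)
}
```

Unlike a branch, this costs the same regardless of how predictable the condition is, which is why it wins when both arms are hardcoded constants.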
We now track lookups more reliably in Env and in Scope (with .introduced and .referenced), so we should start using those instead of References.lookups. We may still need References.calls, though.
Currently the first type variable name will always be a, even if there's already a rigid for a in scope. Instead, it should move on to try b, etc.
There should always be exactly one blank line before the return value of a def-expression
This..
x = { x: 4
, y: 42
}
cannot be parsed.
We should report a Problem when a value (or type, or ability) is exposed but then never used by any module (including the current one).
Unlike the other "unused" problems, this one can only be done after all modules have been canonicalized, not just the current one.
Note that modules which are exposed in packages should be exempt from this, as then you're exposing it to the outside world.
LLVM offers a number of features as C intrinsics, including stuff like trap (which we might use for cross-platform runtime errors), memcpy, etc. These will apparently be available in Inkwell's upcoming LLVM 9 bindings.
It appears Cranelift already has things like this implemented. So as a stopgap, if we need things like that urgently, we could implement them in Cranelift and leave them unsupported in LLVM until Inkwell's LLVM 9 bindings are ready.
Currently, Scope is an ImMap which gets cloned each time we enter a new scope. This is so that we know which things are "in scope" and which aren't.
An alternate approach to investigate would be to use a MutMap, and then when we exit a given scope (e.g. at the end of a Closure when its args go out of scope, or at the end of a Defs when its bindings go out of scope), remove all the entries we'd added to the Scope.
Since we don't allow shadowing, this should be equivalent to cloning the scope, but it would mean that instead of paying the cost of reference counting in ImMap, we would pay the cost of hash removals. Of course, we'd need to benchmark to see which is actually faster in practice.
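A rough Rust sketch of the MutMap approach, assuming a hypothetical `Scope` type: record exactly which names a scope introduced, and remove those (and only those) on exit. Because shadowing is disallowed, nothing we remove can have replaced an outer binding.

```rust
use std::collections::HashMap;

// Sketch only: one shared mutable map, plus a record of which names
// each scope added, so they can be removed when the scope ends.
struct Scope {
    bindings: HashMap<String, u32>,
}

impl Scope {
    fn new() -> Self {
        Scope { bindings: HashMap::new() }
    }

    // Run `body` with some extra bindings in scope, then remove exactly
    // the entries we added. With no shadowing, this is equivalent to
    // cloning the map on entry, minus the reference-counting cost.
    fn with_bindings<R>(
        &mut self,
        added: Vec<(String, u32)>,
        body: impl FnOnce(&mut Scope) -> R,
    ) -> R {
        let names: Vec<String> = added.iter().map(|(n, _)| n.clone()).collect();
        for (name, var) in added {
            self.bindings.insert(name, var);
        }
        let result = body(self);
        for name in &names {
            self.bindings.remove(name);
        }
        result
    }
}
```

The benchmark question then becomes: is N hash removals on scope exit cheaper than one persistent-map clone on scope entry for typical scope sizes?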
Every time we use an alias, it means twice as many Subs entries and the extra work of expanding it every time it's used.
This is unavoidable in userland code, but builtins get used a lot, so it might be worth "pre-expanding" their aliases so we could skip the expansion step; we know that 100% of them will be annotated anyway, and their implementations are hardcoded, meaning there's no innate reason they would need to be aliases.
This could mean (for example) having them go in directly as Type::Apply instead of as an Alias.
This should format to itself
if
    foo
    bar
then
    a b c
else
    d e f
A multi-line expression as the condition should format the whole if ... then into the multi-line format.
This should be processed as a type annotation, not an alias:
UserId x : [ UserId Int ]
UserId x = UserId 42
However, it currently thinks the first line is a type alias because UserId is capitalized. The rule should be that if the signature is immediately followed by a Def body with an equivalent pattern match, it's a type annotation, not a type alias.
Test reproducing this: https://github.com/rtfeldman/roc/pull/177/files#diff-8a8ff9b41a352e25cdb1f9103945600bR1771
What if you put comments after the then? What if you put a comma after the condition? What if you put a comment before the else? I am sure that this currently doesn't work.
This fails
foo = \bar ->
    baz =
        3

    42
while this succeeds
foo = \bar ->
    baz = 3

    42
Recursion in type aliases is tricky, and we don't handle it very well right now. In particular invalid recursion like
Foo : Foo
must be detected, while valid recursion
List a : [ Cons a (List a), Nil ]
must be corrected (make tag union into recursive tag union)
Then there is mutual recursion between aliases. The problem is definitions like this:
ListA a : [ ConsA a (ListB a) ]
ListB a : [ ConsB a (ListC a) ]
ListC a : [ ConsC a (ListA a), Nil ]
val : ListC {}
val = ConsC {} (ConsA {} (ConsB {} Nil))
If we don't do anything, type_to_var will blow the stack, trying to expand the alias.
We want to expand the type alias up to where it is self-recursive, but no further. In this case ListC's body should be expanded like so:
ListC a : [ ConsC a [ ConsA a [ ConsB a (ListC a) ] ], Nil ]
Which we can then turn into a recursive tag union. The question is how to do this expansion safely and efficiently, and when to do it.
I think we can do this during canonicalization, using Type::Alias to make sure error messages still use the correct names (e.g. ListB a instead of its rhs in the example above).
We can also do it at type checking time, but then we have less information about how the alias is recursive, and have to do the process for every occurrence of the alias in a signature, instead of just once.
This:
# This variable is for greeting
a = "Hello"
a
is parsing to this:
SpaceBefore(
Defs(
[
|L 5-5, C 0-1| Body(|L 5-5, C 0-1| Identifier("a"), |L 5-5, C 4-11| Str("Hello")),
],
|L 7-7, C 0-1| SpaceBefore(Var([], "a"), [Newline, Newline]),
),
[
LineComment(
" This variable is for greeting",
),
Newline,
Newline,
Newline,
Newline,
],
)
The parser is apparently parsing four Newlines, but it looks like it should be just one.
See this line: https://github.com/rtfeldman/roc/blob/58f780bb6047fdc252dab45e49d448636b607088/tests/fixtures/build/interface_with_deps/Primary.roc#L7
This shadowing error is not reported in module.problems, but it should be!
There is no abs instruction, but it's possible to get the right answer in a few instructions without branching (that is, without doing something like if x < 0 then -x else x):
https://stackoverflow.com/a/14194764
Should add a property-based test (plus testing extremes manually) to check that it always gets the same answer as Rust's negation.
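The trick from the linked answer, written out in Rust for a 64-bit integer; checking it against Rust's own `abs` over a range of values is a cheap stand-in for the property-based test mentioned above. The function name is made up for the example.

```rust
// Branchless absolute value: `x >> 63` (arithmetic shift) is all ones
// when x is negative and all zeros otherwise, so `(x + mask) ^ mask`
// negates exactly when needed. Note that i64::MIN has no positive
// counterpart, so it maps to itself, just as wrapping negation would.
fn branchless_abs(x: i64) -> i64 {
    let mask = x >> 63; // -1 if negative, 0 otherwise
    x.wrapping_add(mask) ^ mask
}
```

For x = -5: mask = -1, x + mask = -6, and -6 ^ -1 = 5, as expected.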
When compiler/builtins/bitcode/src/lib.rs changes, nothing immediately happens. You have to run cargo rustc with some particular args (see compiler/builtins/bitcode/README.md) and then check in the resulting .bc file.
Having the .bc file checked in seems good, but we should probably have a test which verifies it's in sync with the lib.rs file which generated it.
We could ensure this by having a test which recompiles the .bc file and then asserts that it has the same contents as the one that's checked in. That way, if they ever differ, we'll get a failing test which tells us to run the procedure to fix it.
If I write a function foo with an argument x, and x only ever gets passed to foo, then although x is technically "used" (because it's passed to foo), it's a function argument that is only ever passed to the function itself!
I think we should at least give a warning for an if guard without an expression. But perhaps the formatter should just drop it?
when x is
    2 | 3 -> 0
    3 | 4 if -> 2
    _ -> 1
The canonical representation is now
If {
    cond_var: var_store.fresh(),
    branch_var: var_store.fresh(),
    branches: vec![(loc_cond, loc_then)],
    final_else: Box::new(loc_else),
},
A vector of conditions and branches, and a final branch representing the last else of an expression like
if b1 then
    x
else if b2 then
    y
else
    z
But the parser doesn't parse such an expression as one.
I've seen this failure in CI twice, but I've been unable to reproduce it locally, even after running the test suite 100,000 times in a loop. I suspect it's easier to reproduce with fewer cores, since CI is presumably limited on cores.
Anyway, here's the trace from CI:
running 3 tests
thread 'tokio-runtime-worker' panicked at 'Failed to send Header message', src/load/mod.rs:449:45
test test_load::load_only_builtins ... FAILED
test test_load::interface_with_deps ... ok
test test_load::load_principal_types ... ok
failures:
---- test_load::load_only_builtins stdout ----
thread 'test_load::load_only_builtins' panicked at 'Could not find module ID in waiting_for_solve', src/libcore/option.rs:1185:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
failures:
test_load::load_only_builtins
I believe the problem is that inside load_filename, we do two tokio::spawn calls with more processing in between. The first call sends a Header message, which the main thread uses to add an entry to waiting_for_solve. Then, later, the second tokio::spawn call sends a Constrained message which depends on waiting_for_solve already having that entry in it.
There's no blocking in between the tokio::spawn calls, so there's no guarantee that the Header message will get received and processed before the Constrained message. If that were happening here, it would explain the symptom of the Constrained branch sometimes not yet having an entry that should have been inserted by the previous Header message.
This should be okay:
x = [
    1,
    2
]
According to rust-lang/rust#67186, LLVM only permits generating arrays (or perhaps indexing into them) of isize::MAX elements, as opposed to the theoretical maximum of usize::MAX imposed by the size of a pointer. (Apparently the reason for the restriction is related to doing pointer arithmetic potentially involving negative numbers, which a language like C would permit in userspace.)
Is this still true? Inkwell's API doesn't suggest or mention it, so it seems worth verifying empirically. What about for Cranelift? (If it's not true, it seems better to go with usize::MAX, since Roc doesn't support arbitrary pointer arithmetic.) Probably the easiest way to verify it is through a List.range test.
If it is true, how should we enforce it? Error if attempting to create a list that's over isize::MAX elements?
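If we do need to enforce the cap, the check itself is trivial; a sketch, with a hypothetical `ListError` type (Roc's actual error representation would differ):

```rust
// Sketch of enforcing an isize::MAX element cap when creating a list,
// per the LLVM restriction discussed above.
#[derive(Debug, PartialEq)]
enum ListError {
    TooLong,
}

fn check_list_len(requested: u64) -> Result<u64, ListError> {
    if requested > isize::MAX as u64 {
        Err(ListError::TooLong)
    } else {
        Ok(requested)
    }
}
```

On a 64-bit target the cap is 2^63 - 1 elements, so in practice only something like List.range (or a length computed from attacker-controlled input) could ever hit it.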