
nox's Introduction

Nox

Nox is the reference implementation of the Fluence peer. It is used as a Relay for all Clients and as a Host for all Workers.

Installation and Usage

nox is distributed as a docker image. To start a local instance of nox, run:

docker pull fluencelabs/nox:latest
docker run -d --name fluence -e RUST_LOG="info" -p 7777:7777 -p 9999:9999 fluencelabs/nox:latest --local

This will set up a single-node nox network that is not connected to any other network.

For more info about the docker image see the README.

Documentation

Comprehensive documentation on everything related to Fluence can be found here. Check also our YouTube channel.

Support

Please file an issue if you find a bug. You can also contact us on Discord or Telegram. We will do our best to resolve the issue ASAP.

Contributing

Any interested person is welcome to contribute to the project. Please, make sure you read and follow some basic rules. The Contributor License Agreement can be found here.

License

All software code is copyright (c) Fluence Labs, Inc. under the Apache-2.0 license.


nox's Issues

Session/ordering attack protection

Motivation

In addition to the lack of a session data cleanup mechanism (#168), the current State machine session implementation is vulnerable to various attacks, for example:

  • client generating too many sessions
  • client sending too many transactions
  • client causing queuing of too many transactions within a single session

These attacks require little effort from the attacker and would cause temporary or permanent Data engine unavailability, which makes prevention measures obligatory.

Proposed change

First, we need to identify major and secondary attack vectors to the sessions mechanism, in addition to those described in the motivation paragraph.

Then, we need to specify the conditions and numbers that should be limited to prevent the identified attacks. This will likely include limits on the number of sessions per client, concurrent sessions per client, transactions per session, queued transactions per session, etc.

Finally, we need to implement these protection procedures. They should likely be tunable (allowing limits to be configured and switched on/off).
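
A minimal sketch of what such tunable limits could look like on the State machine side; all names and default values below are illustrative assumptions, not an agreed design:

case class SessionLimits(
  enabled: Boolean = true,                  // switch the protection on/off
  maxSessionsPerClient: Int = 128,          // total sessions a client may create
  maxConcurrentSessionsPerClient: Int = 4,  // sessions active at the same time
  maxTxsPerSession: Int = 10000,            // transactions accepted within one session
  maxQueuedTxsPerSession: Int = 64          // out-of-order transactions buffered per session
)

// The State machine would consult these limits before accepting a new session or transaction.
def withinLimits(limits: SessionLimits, activeSessions: Int, queuedTxs: Int): Boolean =
  !limits.enabled ||
    (activeSessions <= limits.maxConcurrentSessionsPerClient &&
      queuedTxs <= limits.maxQueuedTxsPerSession)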

Integrate Master process with Deployer contract

Motivation

Currently, Fluence clusters can be launched only locally (via bash scripts) or using a centralized solution
which combines solvers' keys and delivers the combined information (namely, genesis and p2p configuration) to solvers.

Prototype implementations of the Deployer Ethereum contract and the Master node process already exist.

In order to launch a decentralized Fluence network, we need to integrate the Deployer contract,
the Master process, and the solver Docker container.
We also need a scalable solution that allows launching several solvers on a single node from a single Master process.

Proposed change

Deployer contract changes

  • Contract provides the following state-modifying methods:
    • addCode(codeName, numberOfSolvers) to request a formation of one cluster served by the requested number of solvers running the specified code.
      • The code is not reused for different clusters.
      • It's possible to invoke it for the same code several times.
    • addNode(masterKey, address, startPort, endPort) to claim that the specified master is ready to launch solvers on ports from the specified range.
      • This claim cannot be updated (like by extending port range later).
      • Once specified via addNode, a master key cannot be reused anymore for subsequent addNode invocations.
  • A cluster is formed once there is an added code and enough registered nodes whose port ranges are not yet exhausted.
  • Contract provides information about formed clusters:
    • Via ClusterFormed event emitted as soon as the cluster is formed.
    • Via getClustersForNode(masterKey) contract view to retrieve previously formed clusters.
    • Via getCluster(clusterID) contract view to get all information about a particular formed cluster.
  • Ports are not reused; the contract is unaware of actual solver statuses: whether they are running, not running yet, or terminated.

Master process changes

  • Fluence Master process (FluenceNode.scala) becomes a non-CLI process.
  • It takes launching parameters:
    • Master key information (location of the file with public/private key pairs for Tendermint consensus and p2p).
    • Eth node URL.
    • Deployer contract address.
    • (possibly) Eth account to authorize contract method invocations.
  • It keeps the list of solvers to run up to date (see the sketch after this list):
    • By retrieving clusters that were formed before the Master started (via the getClustersForNode contract view).
    • By listening to ClusterFormed contract events after the Master has started.
  • It launches the actual solvers (if they were not launched previously).
  • In order to be launched properly, solvers are configured with:
    • Master key information.
    • Genesis/transport information from the view/event.
    • A reference to the WASM code to run on the solvers:
      • For now, this code is manually placed on the node and referenced by its local path.
      • In the future, this code will be downloaded via Swarm and referenced by its Swarm hash.
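
A rough sketch of the actualization flow described in the list above; every type and method name here is a placeholder for illustration, not the real Deployer bindings or solver launcher:

case class MasterKey(publicKey: String)
case class Cluster(id: String, genesis: String, ports: Seq[Int], codeRef: String)

trait Deployer {
  def getClustersForNode(key: MasterKey): Seq[Cluster]  // contract view: clusters formed before startup
  def onClusterFormed(handler: Cluster => Unit): Unit   // subscription to ClusterFormed events
}

trait SolverLauncher {
  def launchIfAbsent(cluster: Cluster, key: MasterKey): Unit // starts a solver unless it is already running
}

def actualizeSolvers(deployer: Deployer, launcher: SolverLauncher, key: MasterKey): Unit = {
  // Catch up with clusters formed before the Master process started...
  deployer.getClustersForNode(key).foreach(launcher.launchIfAbsent(_, key))
  // ...then keep launching solvers as new ClusterFormed events arrive.
  deployer.onClusterFormed(launcher.launchIfAbsent(_, key))
}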

Additional utilities

A dedicated utility/script to invoke addCode and addNode methods is required.

CORS error on workers

Workers cannot accept requests from outside nodes. The cors_allowed_origins = ["*"] parameter is present in config.toml, and it worked before.
First, check whether the generated config is actually used by the workers.

Session-based transactions happens-before relationship

At the current moment, we are planning to support a happens-before relationship for the client transactions made within a single client session.

That's a complex machinery to support on the state machine side: transactions coming out of order have to be stored in a temporary buffer first. Afterwards, a transaction will get applied once all the preconditions for it (such as transactions it depends upon) are satisfied. Not only the state machine has to track completed transactions, but it also has to periodically garbage collect abandoned sessions.

I think it would be great to understand whether the happens-before requirements can be fulfilled without complex state machine code, using just a few client library modifications. A few examples are listed below.

A bunch of transactions

If the client sends a bunch of transactions first and then a query, we generally want transactions to be completed one after another, and the query – only once all the transactions are completed:

tendermint.tx(tx1)
tendermint.tx(tx2)
tendermint.tx(tx3)
result = tendermint.query(...)

Expected happens-before relationship: tx1 –> tx2 –> tx3 –> query. Instead of using the session-based ordering of transactions, a client could send one supertransaction tx123 having tx1, tx2 and tx3 inside.
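
In the same pseudocode style, the client-side rewrite would look roughly as follows (supertx is a hypothetical helper that packs several transactions into one payload):

tx123 = supertx(tx1, tx2, tx3)   // tx1, tx2, tx3 are applied in order inside a single transaction
tendermint.tx(tx123)
result = tendermint.query(...)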

An ordering between transactions and queries would still be needed, but that's a different story. The query will have to wait until transactions are composed into a block by Tendermint and this block is processed by the state machine. I don't think there is a way to achieve this on the client-side only.

Conditional transaction execution

Sometimes a client might need to execute a transaction only if a specific condition was satisfied:

tendermint.tx(tx1)
if (tendermint.query(...) == 10) {
  tendermint.tx(tx2)
}

Expected happens-before relationship: tx1 –> query –> tx2. However, if an ordering between transactions and queries is already supported, there is no need for an explicit transaction ordering. This works because tx1 –> query and query –> tx2 together imply an ordering between the transactions.

Transaction execution based on an external condition

Sometimes a client might need to execute a transaction only if a specific external condition was satisfied.

tendermint.tx(tx1)
if (some external condition) {
  tendermint.tx(tx2)
}

Expected happens-before relationship: tx1 –> tx2. In this particular example we can't combine tx1 and tx2 into one supertransaction, and nothing else establishes a happens-before relationship. However, it's possible to rewrite this code as:

if (<some external condition>) {
  tendermint.tx(tx1)
  tendermint.tx(tx2)
} else {
  tendermint.tx(tx1)
}

Intuitively, it feels like such a rewrite doesn't switch to a weaker consistency model if the external condition is independent of the state machine. I'm not sure this can be applied to arbitrary client code though: in some cases derived supertransactions might become too big, in others a combinatorial explosion might happen. Consider this snippet for example:

while (true) {
  sleep(1000)
  if (<some external condition>) {
    tendermint.tx(tx_i)
  }
}

In this example, it's obvious that we want tx_k –> tx_(k+1), but it's not really possible to unroll the loop and apply the technique described above. It would be nice to revisit this once we have a better understanding of the potential use cases.

Client session termination management

Motivation

Currently, there is no concept of a closed session: clients come and go, create sessions, send transactions, and forget about them. This concept is important; without it we cannot:

  • Understand that the session data can be cleaned up (#168).
  • Inform the client that a session was aborted during the execution of one of its transactions.

Proposition

First, we need to introduce the session state: it might be active or terminated. Then, we define three ways a session can be terminated:

  • Explicit termination by the client (normal case): the client needs to do it after it finishes sending transactions.
  • Aborted (or failed) termination: if an unrecoverable error happened during transaction processing.
  • Expiration: the session is terminated automatically after some 'period' (measured in time or number of transactions or blocks) without transactions from the client.

Regardless of the termination type, if the client sends any transaction to a terminated session, the transaction fails and the client is notified. This should also cause all in-flight requests to fail, so if the client is still waiting for a response (e.g. on a promise or awaitable) or retrieves it later, it receives an error.

This proposal implies a mechanism for cleaning up closed sessions' data; it is to be implemented in #168.
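
A minimal sketch of how the session state could be modeled on the State machine side (names are illustrative assumptions):

sealed trait SessionStatus
case object Active extends SessionStatus

// Why a session stopped accepting transactions; any transaction sent afterwards must fail.
sealed trait Terminated extends SessionStatus
case object ExplicitlyClosed extends Terminated // closed by the client (normal case)
case object Failed extends Terminated           // aborted due to an unrecoverable error
case object Expired extends Terminated          // no transactions within the configured period

case class Session(id: String, status: SessionStatus, lastActivity: Long)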

Information on a master process via API

Motivation

Right now developers and node owners lack information on the state and internal workings of the master node process. Also, it's not possible to watch for changes in clusters or in the system.

This information could be provided by some API and be beneficial for:

  • Node owners. To monitor and control the lifecycle of their nodes.
  • Developers. Debugging and exploring the system's behavior through an external API is much easier than digging into the app's internals.
  • Testnet monitoring and debugging. Getting information about network and user behavior, planning network upgrades, etc.
  • Bug reporting. Bugs are really hard to fix without status/debug information.

Proposition

Create an HTTP API on the master node that accepts requests and responds with information on the node's state (a possible response shape is sketched after the list):

  • information about solver state (number of solvers, cluster IDs, code IDs, uptime, last block, last app hash, etc.)
  • information about the master node (IP, list of ports, uptime)
  • debug information (configs, parameters)
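
A sketch of a possible response shape for such an API, modeled as Scala case classes (field names and structure are assumptions, not a finalized format):

case class SolverInfo(
  clusterId: String,
  codeId: String,
  uptimeSeconds: Long,
  lastBlock: Long,
  lastAppHash: String
)

case class MasterStatus(
  ip: String,
  ports: Seq[Int],
  uptimeSeconds: Long,
  solvers: Seq[SolverInfo] // one entry per launched solver
)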

Browser client for interaction with the real-time cluster

Motivation:

Users cannot interact with a real-time cluster and call deployed code's methods from a browser without a JavaScript client implementation. A JS client would also make it possible to work with the cluster from the console using Node.js.

Proposed Changes:

The list of client's features to implement:

  • Session support. The State Machine works with the client through sessions. In every session, the client should control the order of requests with a counter, which lets the state machine execute commands in the right order.
  • Submit commands. The client has the ability to send custom commands for execution.
  • Await responses. State Machine responds asynchronously. The client should wait and check a response.

Alternatives:

Look at js-tendermint.
It might be better to build the client logic on top of this library.

We can use the counter on WASM as a pre-demo.

State machine session data cleanup

Motivation

Currently, the State machine stores all the session metadata and results without any limits on the size of this data, leading to high storage usage – this may cause performance degradation.

Proposition

Implement automatic cleanup procedures in the State machine, triggered by client session termination (see #167).

Also, this triggering should ideally be tunable, making it possible to define some 'period' (measured in time or number of transactions or blocks) after which the cleanup is actually performed, possibly depending on how the termination occurred (whether the session was normally closed, failed, or expired).

Data that should be cleaned up includes:

  • queued payloads
  • statuses
  • results of successful invocation and error messages
  • client/session/tx hierarchy keys

Vm error strong hierarchy

Motivation

The Virtual Machine (VM) produces many different error kinds. Right now, VM method signatures provide no knowledge about the errors they produce. Each VM method returns a general error type, and there is no way to know, without reading the source code, which errors might be produced and which will never appear. This is an error-prone and unfriendly API: when a new error type is added, the compiler gives no hint about it, and the new error might go unhandled.

Proposition

  • The error hierarchy should allow the compiler to warn us when we forget to handle some error kind.
  • Method signatures must reflect possible errors. For example, Wasm.apply should produce InitializationError, TrapError, or InternalVmError, and we should be able to learn this from the signature of the apply method.

Possible solution

Create a sealed trait for each VM method: InvokationError for invoke, GetVmStateError for getVmState, and so on. The InvokationError sealed trait becomes the parent of every error that might be produced by the invoke method. This allows the compiler to detect unhandled errors in pattern matching through exhaustiveness checking.
For example:
sealed trait InternalVmError extends ApplyError with InvokeError with GetVmStateError means that InternalVmError can be produced by the methods apply, invoke, and getVmState.
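
For illustration, a possible fuller shape of such a hierarchy (error names beyond those mentioned above are assumptions):

sealed trait ApplyError
sealed trait InvokeError
sealed trait GetVmStateError

// Can be produced by apply, invoke and getVmState alike.
final case class InternalVmError(message: String)
  extends ApplyError with InvokeError with GetVmStateError

// Only some methods can produce these.
final case class InitializationError(message: String) extends ApplyError
final case class TrapError(message: String) extends ApplyError with InvokeError

// Signatures then reflect the possible errors, e.g.:
//   def apply(module: WasmModule): Either[ApplyError, Vm]
//   def invoke(fn: String, args: Seq[Value]): Either[InvokeError, Value]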

If there are more suitable solutions with less boilerplate code, we should consider them too.

Plans about a standard response format

I met Evgeny at a meetup in Warsaw and we talked about the need for a standard response format, so that you could pull data from multiple blockchains and expect similarly shaped results for ease of development.

Are there any plans for that?

Fluence Rust sdk - version 1

Motivation

Passing complex data to WebAssembly requires allocating it in the VM's memory; a pointer to that data is then passed to the called function instead. The same holds for the reverse case: the VM returns a pointer, and we read the bytes from memory. This makes it necessary to convert complex types (strings, byte arrays) into integers (pointers) before passing them to the VM, and to read data back from memory afterwards. This logic is very general and is the same for many Rust applications. Our proposal is to bundle it together and provide it as a dependency in an SDK. This will also simplify using the Fluence network for developers.

Proposition

Create the first version of Fluence SDK for Rust that consists of:

  • Allocate and deallocate methods for external memory management.
  • Methods for reading from and writing to raw memory for UTF strings.

All of these functions will be used frequently and must be well tested.

Also, the SDK has to be published to crates.io.

This task is a continuation of (#163).

Tendermint high CPU usage

Tendermint processes in workers consume a lot of CPU power without any load. This limits the number of workers on one node.

Decoupling VM from State machine

Motivation

Currently, the State machine and VM projects are tightly coupled:

  • Sbt projects: VM is a dependency of the State machine.
  • Configuration: wast module filenames are hardcoded in the State machine's Main class. This setting should be tunable and provided via configuration.

As a consequence, these two projects are not isolated: the State machine cannot be used without an initialized WASM instance, and its behavior cannot be tested separately from the VM. Also, the State machine code would likely need to change in case of even minor changes in the VM processing logic.

Proposed change

First, we need to move the integration logic, with its tests, into a dedicated Sbt project. As a result, the State machine should no longer depend on Asmble-specific logic: it could even be run and tested with a stub instead of a real VM instance. Some new, Asmble-agnostic tests should be implemented in addition to the current, mostly integration-level, State machine tests.
We also need a way to flexibly configure the setup of the currently used VM implementation, including the list of WASM modules used and Asmble-specific properties.

Simplify the work with mutable resources in the Swarm client

Motivation:

The current implementation of the Swarm client forces users to write a lot of boilerplate code to work with mutable resources (MRs),
because it is necessary to pass the MR identifier to each create and update call. This degrades code readability.

val id = MutableResourceIdentifier(
    name = Some("name of resource"),
    frequency = 300 seconds,
    startTime = System.currentTimeMillis() millis,
    ownerAddr = ethAddress)
    
api.updateMutableResource(id, dataVersion1, false, period1, version1, signer) //call1
api.updateMutableResource(id, dataVersion2, false, period2, version2, signer) //call2
...
api.updateMutableResource(id, dataVersion10, false, period10, version10, signer) //call10

Proposed Changes:

Create a class that holds the identifier and the signer; this makes it possible to call an update method with only the mutable data.
Period and version calculation should be hidden inside the methods.
A period is calculated such that currentTime <= startTime + frequency * period.
A version is an incremental counter within one period.

val mutableResource1 = MutableResource(id, signer)

mutableResource1.update(data1)
mutableResource1.update(data2, multiHash = true)
...
mutableResource1.update(data10)

Python client enhancements

Motivation

Current Data engine Python client has some drawbacks:

  • It is implemented in Python 2.7. Although 2.7 is still maintained, its adoption is decreasing and it can be treated as outdated.
  • There are some style issues, like long lines, inconsistent naming, etc.
  • There are some vague and messy function examples.
  • License headers are not generated for Python files.

Those drawbacks affect project quality and obstruct further maintenance and enhancements.

Proposed change

  • Rewrite client to Python 3.
  • Resolve style issues (long lines, naming etc.)
  • Resolve some vague and messy implementation issues.
  • Generate absent license headers.

State machine documentation enhancement

Motivation

The current documentation is sometimes vague and inconsistent:

  • Some important procedures are not described.
  • Some class descriptions contain forward references.
  • When some State machine functionality is moved to the VM, documentation describing that move should be added.
  • The README is partly outdated and is missing descriptions of crucial parts of the state machine.

Proposed change

  • Remove forward references; segregate independent classes' scaladocs from examples of their usage.
  • Bring ReadMe document up to date.
  • Describe missing procedures, e. g. how to launch the local cluster and the client.

TrueBit forced errors approach – batch validation application

This issue first gives some context on how the forced error mechanism works in TrueBit (as described in the paper A scalable verification solution for blockchains and the wiki), then speculates about how forced errors could work in Fluence, and finally poses the currently unsolved problem of hiding the correct result hash from verifiers.

Forced errors behavior in TrueBit

Initial assumption

It is assumed that Solvers and Verifiers are rational, so they take an action only if it is profitable for them. It should be noted that certain actions might be unprofitable within the system, but profitable in a greater arrangement. For example, if transactions transferring millions of dollars are forged, TrueBit might notice and punish malicious Solvers, but it would still be profitable for them to supply a bogus solution.

Solver registration

In order to receive tasks, each Solver should register itself in the TrueBit contract. On registration, the Solver supplies a deposit and a hash of private random bits r. The private random bits r are generated by the Solver before registration and are later used to decide whether the forced error state should be enabled.

Algorithm

In short, the algorithm can be described as follows:

  1. The Task Giver is the one who wants its task to be solved for a reward. It posts a task to the TrueBit contract specifying the minimum deposit, reward and the timeout for solving/verifying the task.

  2. After that, any Solver can bid on the task. A single suitable Solver is then chosen by lottery and assigned to the task, with the assignment recorded in the TrueBit contract.

    Note that the lottery is performed naturally when miners select one of the concurrent transactions sent to the TrueBit contract. A sender of the chosen transaction is selected as a Solver. It is still an open question however – see How is the Solver selected exactly? in the TrueBit's Open Problems list.

  3. If the Solver has solved the task within the specified timeout, it submits an array of two hashes (which is called soln_hashes) to the contract: hash(correct solution) and hash(incorrect solution) where correct and incorrect result might go in any order.

  4. Once the transaction with the submission is mined and included in some block, Solver knows if the forced error is in effect and calls designate method in the TrueBit contract passing hash(index || r) as an input. The index variable can take 0 or 1 value and is the position in soln_hashes array. This way the Solver is committing itself on which hash in the soln_hashes array should be treated as correct one. The Solver, however, is not revealing which hash was designated at this moment.

    Note that the designated hash might not always be the correct one. If the forced error is in effect, the Solver will designate the hash of the incorrect solution to be treated as a correct one. If there is no forced error in effect, the Solver should designate the correct solution.

  5. Then any Verifier who has found one or two incorrect solutions in soln_hashes should publish a hash of a certain number v to the TrueBit contract within the specified timeout. v designates which hash in soln_hashes is deemed incorrect by the Verifier. If v mod 3 == 0 it means the first hash is deemed incorrect and v mod 3 == 1 – the second one. If v mod 3 == 2 it means that both hashes are marked incorrect by the Verifier.

  6. If the forced error is in effect, challenged Solver reveals private random bits r to prove that it specifically designated an incorrect solution because of the forced error. In this case the Verifier can easily verify that another solution in the soln_hashes array is actually correct.

Forced error detection

The forced error is in effect if the random bits hash(r || block hash) are less than the forced error rate constant defined in the TrueBit contract. These random bits should not be confused with the private random bits r, which are the Solver's private bits generated when it registered in the contract.

It's important to note that neither Solver nor Verifiers know if the forced error is in effect in the beginning:

  1. The Solver knows if the forced error is in effect only after the transaction with the submitted solutions is mined and placed into some block. At that point, the Solver knows the block hash so it is able to calculate random bits hash(r || block_hash) learning if the forced error is in effect.
  2. Verifiers learn if the forced error is in effect only after the challenged Solver reveals its r in order to prove that an incorrect solution was selected because of the forced error. This is done by the Solver to avoid being penalized.
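
A minimal sketch of this check, assuming SHA-256 and interpreting the digest as an unsigned big-endian integer (the actual TrueBit encoding may differ):

import java.security.MessageDigest

// The forced error is in effect when hash(r || blockHash) < forcedErrorRate.
def forcedErrorInEffect(r: Array[Byte], blockHash: Array[Byte], forcedErrorRate: BigInt): Boolean = {
  val digest = MessageDigest.getInstance("SHA-256").digest(r ++ blockHash)
  BigInt(1, digest) < forcedErrorRate
}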

TrueBit paper states this in 4.4 Generating forced errors:

In order to motivate verification of all tasks and to guard the jackpot repository against swindle, forced errors must appear unpredictably. < ... > The system derives its unpredictability via the following properties.

  1. The Task Giver does not know the random bits at the time that it announces a task.
  2. The Solver does not know the random bits until after he has committed his solution.
  3. Verifiers do not know the random bits until after they decide whether to challenge.

Jackpot repository

Once a Verifier successfully challenges a task with the forced error in effect, it receives a large payout called a jackpot payout. This payout comes from a jackpot repository – i.e. a separate TrueBit contract that holds a sufficiently large amount of money and pays some of it to Verifiers in the case of a forced error.

That contract is funded with money collected through taxes from Task Givers, who pay these taxes along with the reward for a submitted task, and also from forfeited deposits of misbehaving actors. For example, a Solver producing incorrect results will lose its deposit. In this case half of the deposit goes to the jackpot repository, while up to the whole other half is paid to the Verifier who won the challenge against that Solver.

Possible attacks

The Solver might share with Verifiers whether the forced error is in effect

Attack

This might be done in different ways, for example:

  • if the Solver and the Verifier are controlled by the same entity (e.g. we have a Sybil attack)
  • if the Solver could tell the Verifier if the forced error is in effect through a side channel
  • if the Solver always submits hash(0) of the incorrect result – in this case if hash(0) is designated as a solution it's obvious that the forced error is in effect
  • if the Solver always places the correct solution first and the incorrect one – second in the soln_hashes array

My understanding of TrueBit defence

To deal with this attack, TrueBit seems to propose that the Solver's rationality comes into play. Since the Solver knows whether there is a forced error before any of the Verifiers, it could (and would) challenge itself in the case of a forced error and receive the jackpot payout. Moreover, TrueBit assumes that the Solver will always challenge itself and receive the jackpot payout in the case of a forced error, so there is no regular reward in that case.

Now, to understand the Solver's reasoning, let's look at how the jackpot payout is calculated in TrueBit. For each task there is a possible jackpot payout J, which is bounded by 1/3 of the total jackpot repository and scaled by the task difficulty to equally incentivize solving complex and simple tasks. While J is the upper bound on the payout, the actual payout depends on the total number of challenges submitted for the task. With each new challenge the payout decreases exponentially and is defined as J / 2 ^ (k - 1), where k is the number of submitted challenges.

Since the jackpot payout decreases exponentially with each new challenger the Solver doesn't have any economic reason to reveal the forced error state to anyone and is heavily incentivized to keep it in secret.
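
For illustration only (the numbers are made up): with J = 80 and k = 4 submitted challenges, each challenger would receive J / 2^(k - 1) = 80 / 8 = 10, so every additional challenger halves the payout.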

[NOT SOLVED] The Task Giver might set an unrealistically low timeout

Attack

It might happen that the Task Giver specifies a timeout which is too short. It might even make the timeout so short that the Solver won't ever be able to complete the computation correctly within that time. In this case the Solver's deposit will be slashed – so the Task Giver could slash innocent Solvers while losing nothing but transaction fees. Ideally, the Solver could assess the task difficulty in advance, but that's not really possible for non-trivial tasks without actually executing them.

[UNCLEAR] The Solver might try to push bogus solutions through by publishing forced error state

Attack

This attack is similar to Scaring off Verifiers attack described in 5.2 The trifecta in the TrueBit paper.

The Solver might publish whether the forced error is in effect. In this case it will forfeit the jackpot but still receive rewards. Validators might learn about the forced error state and challenge only tasks with the forced error in effect ignoring all other tasks solved by this Solver. This way the Solver might submit an incorrect solution when there is no forced error and no Verifier will check it.

Potential defence

To my understanding, this sounds unrealistic, because there is no reason to expect that Verifiers would listen to some public information and obey it. TrueBit also states that it's profitable for Verifiers to catch non-forced errors too, slashing the malicious Solver's deposit and receiving at most half of it. So even if some Verifiers followed a "challenge only on forced errors" strategy, a single Verifier challenging that Solver would be enough.

Reflection

Intuitively, it's unclear why we need the forced errors mechanic if Verifiers will perform verification even when forced errors and jackpot payouts are effectively shut down for a malicious Solver's solutions. Arbitrum: Scalable, private smart contracts seems to have a more formal analysis of this matter.

Applying forced errors in Fluence network

Differences and similarities between Fluence and TrueBit

In the Fluence network the Client acts as a Task Giver. The difference is that in TrueBit the Task Giver submits batch tasks through an Ethereum contract, while in Fluence the Client sends a number of requests directly to the real-time cluster, demanding low (less than a few seconds) response time.

The Solver role is almost the same, but it is played not by a single machine, but by a Cluster of Solvers using Tendermint to reach BFT consensus. Every request that modifies the state causes a Tendermint transaction, which is eventually included in a Tendermint block.

Every time the block is processed by the Cluster, it changes the Cluster state. The new state can actually be considered the result of the computation similar to the batch computation result obtained in TrueBit.

Because the Cluster of Solvers is driven by a consensus mechanism and every Solver in it reaches the same decision, we will refer to it simply as the Cluster, as if it were a single entity. Additionally, the Cluster also stores the computation results in a decentralized deep storage like Swarm or IPFS. This allows batch validators to check computations later on without the need to interact with the real-time Cluster.

Validators in Fluence network behave pretty much the same as Verifiers in TrueBit.

Algorithm

  1. Once the VM has processed all requests coming from the Client, it will have a state S, which is considered the result of the computation. Without loss of generality, we can state that the Cluster responds to each Client request with a result R.

  2. For each block, the cluster generates private random bits r. The Cluster stores a pair of hashes H_c = hash(S) and H_w = hash(some random number), which are the correct and incorrect hashes respectively, in the deep storage. These hashes might be stored in any order, similar to how this is done in TrueBit. The Cluster stores hash(r) along with H_c and H_w.

    Caveat: here we also assume there is no way to derive H_c or H_w from R – see below for the Correct state hash disclosure problem.

  3. The Cluster commits hashes H_c, H_w and hash(r) to the Contract by sending a checkpoint transaction.

  4. The Contract stores the hashes alongside the current Ethereum block hash at the moment the checkpoint transaction arrives.

  5. Once the checkpoint transaction is mined, the Cluster checks the block hash and calculates if the forced error should be in effect in the same way as it's done in TrueBit: hash(r || Ethereum block hash) < forced error rate.

  6. The Cluster writes the new state to the deep storage in which it designates either H_c or H_w as a solution.

  7. Validators are notified by the Contract that there is the new work to check.

  8. Chosen Validator downloads a block and waits for the designation.

  9. If the Validator wishes to challenge, it publishes hash(v) where v mod 3 == 0 means the first hash is incorrect, v mod 3 == 1 – second, v mod 3 == 2 – both similar to how it's done in TrueBit.

  10. The Cluster reveals r to prove that forced error was or was not in effect for the block. This is also similar to TrueBit.

  11. The Validator should now have:

  • the block hash, verifiable by Contract since it has stored it at the step 3

  • the proof that Cluster committed to specific r which would be S_r from the step 9

  • the proof that Validator has committed validation results

    That should be enough for the Contract to decide whether the Validator really checked for a forced error and should be paid the amplified reward. So the Validator submits these values to the Contract in order to claim the forced error payout.

  12. The Contract checks the values' correctness and sends money to the Validator.

Open problems

Correct state hash disclosure

Since the Client receives the correct response R almost immediately, it can share it with Validators, so the latter might gather enough information from the shared response to know whether the forced error is in effect and stop validating all the other computations. This would render forced errors useless.

For example, let's assume that the Cluster returns the entire state S to the client in response: R == S. In this case the Client/Validator pair could compute the hash(R), compare it to H_c == hash(S) and H_w == hash(some random number) and instantly know which result is the correct one.

So, the question is how to obfuscate H_c and H_w so that a validator knowing R would not be able to guess the correct hash.

Potentially this could be done by computing H_c as not just hash(S) but as hash(S || computation execution fingerprint), where the computation execution fingerprint is not revealed to the Client. For example, we could randomly sample instructions from the execution similar to how it's done in Proof of Independent Execution. In this case the Validator would have to actually repeat the computation to be able to find the incorrect hash from H_c/H_w pair.

Integration tests for Fluence Publisher CLI

Motivation

If the contract is updated, unhandled errors may occur in the Fluence Publisher CLI. Releases will be inconsistent, and users will notice it only during runtime. Solving this issue will allow us to avoid unexpected problems in the future.

Proposition

Create an integration test for interaction between the Fluence CLI, the smart contract running on Ethereum blockchain and the Swarm. In particular, cover the publishing functionality of CLI.

[duplicate] Tendermint bootstrapper

Motivation

In order to be started from within the bootstrapper contract, a Tendermint cluster needs to be configured and started programmatically. Otherwise, crucial node functionality remains unimplemented.

Current state

Currently, all setup is performed through several bash scripts that were developed and used for testing purposes only.

Proposed change

Implement a separate management module that will set up and start a Tendermint cluster on receiving a command from, e.g., the contract.

The module should prepare the environment and the cluster:

  • Create TM home directory with required files
  • Setup TM configuration
  • Launch TM Core process and start ABCI State machine with required parameters

The module should also set the following in the Tendermint cluster:

  • Local private/public keys locations (both for TM transport and consensus)
  • Genesis info (TM chain id + initial peers' public keys list + initial voting powers + initial state/hash)
  • TM-specific configuration (timeouts, retries, cache sizes, etc.)

Add Shadow stack

Motivation
In the current version, Asmble translates one WASM instruction into one or more JVM instructions. Since WASM is a stack-based virtual machine, most of its instructions manipulate the operand stack, and Asmble translates such instructions into corresponding JVM instructions that manipulate the internal JVM operand stack. However, for the verification game, our stepper needs access to the whole memory state, which includes memory, tables, functions, globals, and the stack.

Proposed Changes
We need to explicitly implement a shadow stack that operates like the stack in a common WASM interpreter/compiler, but is used only for finding the difference between two WASM code snippets.

Rejected Alternatives
We considered alternatives using the native JVM operand stack: JNI-based, JDI-based, and some others. But there is no reliable and documented way to get the raw JVM operand stack using JVM bytecode alone, without heuristics. Moreover, even if such a method existed, it could not be used for our stepper, because we can't rely on the internal JVM operand stack representation, since it could change in future versions.

Safe state switching in State machine

Motivation

Currently, there are race condition situations related to providing states to ABCI handlers.

In particular, there might be a situation when State machine provides an unverifiable state for Query requests:

  • The State machine Committer just finished Commit processing.
  • But further processing of the response slows down or fails – in jTendermint, during delivery to Tendermint, or even inside Tendermint cluster.
  • However, as soon as the processing in Committer finishes, this previous committed state becomes available to QueryProcessor as Query state.
  • Then the client cannot perform the whole verification because it cannot retrieve the header of this Query state block from Tendermint API.

Also, the current documentation lacks some important details regarding concurrency.

Proposed change

The described and similar situation must be prevented:

  • The straightforward way to do it is to provide such state to the Query request that is surely verifiable by the client – that is the state preceding the currently provided one.
  • An alternative approach is to initially provide the same state as now, but make it possible to fall back to the surely verifiable state in case the client cannot verify the initially provided one.

The implemented concurrency guarantees should be as clear as possible; the documentation should exhaustively describe any unclear concepts used.

Tendermint / State machine Process manager

Motivation

The Fluence network is to be composed of many clusters. The key feature we currently lack is enabling nodes to take part in different clusters. To support this feature, process management is required: it includes cluster nodes' creation, monitoring, and shutdown. The Process manager would allow launching new Tendermint / State machine instances and tracking their lifecycle.

Current state

Currently, only a single local cluster setup is supported; it is performed through several bash scripts that were developed and used for testing purposes only.

Proposition

A Solver worker means a single TM Core instance + Fluence State machine combination. These processes should be run in Docker containers, with all required settings configurable.

The Process manager must provide the following functionality:

  • to launch Solver workers by command
  • to monitor launched Solver workers: track the processes' health, access their metrics, and expose the monitoring data outside
  • to shut Solver worker down by command

Launching the Solver worker implies starting a single Docker image that should:

  • take the prearranged configuration settings/files and VM module files passed as instance(s)' parameters
  • create necessary home directories with the prearranged files put there
  • start TM Core process on a requested port
  • start ABCI State machine process with required parameters

Investigate ways of C++ code translation to WASM

Motivation
We need to investigate how our users could use public toolsets to translate their C++ code into WASM that could, in turn, be run in our ecosystem. Some preliminary investigation revealed the following free tools:

binaryen can compile asm.js to WASM,
emscripten can directly translate C++ code to WASM,
clang with some WASM-related targets, which has only experimental support for the wasm backend.
However, none of them could translate this simple C++ snippet:

#include <vector>

int main() {
  std::vector<int> vec;
  vec.push_back(1);
  return 0;
}

to "static" WASM module out of the box.

Proposed research
We need to carefully investigate the options supported by these tools and find a way to translate some subset of C++11/14/17 into a WASM module without function imports.

Solver registration for an arbitrary number of RT Clusters

Motivation

Each physical machine could participate in an arbitrary number of RT Clusters. Currently, each Cluster requires a separate registration in the Deployer.sol contract. These registrations are made through Ethereum transactions to the contract and thus cost money. Cheaper is better here, and a one-time registration for an arbitrary number of RT Clusters is cheaper than multiple registrations.

Proposition

Update the current addSolver method to support registration of a single physical machine for an arbitrary number of RT Clusters. Specifically, allow machines to specify a port range on which they will accept connections from other Cluster members. For each new RT Cluster a machine participates in, a port in the range should be marked as reserved/used, so that no physical machine uses the same port for different Clusters.

Created Clusters should be listed in such a way that their ports and addresses are discoverable by reading the contract's data, so that anyone who wishes to connect to a Cluster can find them.

Solver-initiated disputes

Motivation

Tendermint provides strong and precise safety and liveness guarantees that allow building the Fluence real-time cluster on top of it efficiently. However, those guarantees hold only while more than 2/3 of the cluster nodes are non-Byzantine (for safety) and live (for liveness). To treat this requirement as practically steady and permanent, a significant number of cluster nodes is needed, which is likely very expensive and results in worse performance than for smaller clusters.

To keep real-time clusters safe enough in the presence of a small number of Tendermint nodes, some additional mechanism is needed. It should be able to initiate recovery of cluster safety and liveness even if most of the cluster nodes are inactive or Byzantine – as long as at least one honest node is active.

At any point in time, an honest solver node might observe exactly one of 3 cases:

  • Case A. The cluster makes progress by creating (committing) new blocks thus reaching a consensus, specifically a small amount of time elapsed since the latest block committed (this amount is bound by some cluster-wide timeout constant); at the same time, our honest solver agrees with the blocks committed by the cluster. This case is normal and covered by Tendermint consensus.
  • Case B. The cluster makes progress, but the honest solver disagrees with the committed block(s). The honest solver treats this situation as cluster take-over.
  • Case C. The cluster can't make progress: a timeout has elapsed since the latest non-empty block. Note the 'non-empty' word here: a Tendermint cluster must commit some block a short (configurable) time after every non-empty block – either another non-empty block (if there are transactions to be included) or an empty block (if no transactions exist at the commit time).

The last two cases indicate problems with the cluster, affecting liveness, safety, or both. Note that only the last case might be considered 'unbiased' for an observer outside the cluster; in general, the distinction between the first two cases can only be made by an in-cluster node.

While the whole cluster recovery procedure looks complicated and requires intervention by external validators in the invalid cluster, this issue is mostly dedicated to having cluster nodes initiate this recovery in a timely manner. The situation with no active honest nodes in the cluster is also beyond the scope of this issue.

Proposition

Monitoring

First of all, the solver node must periodically monitor cluster liveness and safety (revealing Cases B and C) by:

  • Measuring time since the last Commit ABCI request.
  • Comparing its own state with the peer nodes' states: obtained by requesting their Tendermint RPC (<address>:<port>/block?height=<latest_height>) and summarized into the app hashes.

On detecting a problem (Case B or C), the solver node becomes a claimant. The objectives of the claimant are to prove its own correctness and to restore normal cluster behavior.

Claimant-defined block

Then, the claimant must construct a Claimant-defined block (CDB) – its own version of the 'disputed' block:

  • either the alternative (to the quorum decision) block for Case B
  • or the next block (for which consensus is not reached) for Case C

CDB structure should be similar to Tendermint block structure. CDB notable components are:

  • The app hash of state after the previous block – which is likely the main reason for the dispute.
  • The block ID of the previous block – this is a block ID of the last undisputed committed block.
  • Transactions of the current block (if the claimant already received some of them).
  • Other essential block fields, like height, time, etc.
  • The signature of the other CDB fields signed with the claimant's Tendermint private key.

The presence of signatures in Tendermint blocks and CDBs makes it possible to confirm the corresponding solvers' authorization.
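
A possible CDB shape, sketched as a Scala case class; the field names are illustrative, and the real structure should mirror the Tendermint block format:

case class ClaimantDefinedBlock(
  height: Long,             // height of the disputed block
  time: Long,               // block time proposed by the claimant
  previousBlockId: String,  // block ID of the last undisputed committed block
  previousAppHash: String,  // app hash of the state after the previous block
  txs: Seq[Array[Byte]],    // transactions of the current block, if the claimant received any
  signature: Array[Byte]    // all other fields signed with the claimant's Tendermint private key
)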

External interactions

While the previous actions are confined to interaction inside the Tendermint cluster, the last component of this proposal implies some outside actions:

  • Publishing the CDB in some reliable storage (likely, Fluence Cold storage).
  • Signalling to the validators that a dispute has been initiated.

After that, the claimant should be ready to enter the Verification game. In the Verification game CDB would be opposed to:

  • for Case B: the block committed by the cluster
  • for Case C: CDBs created by the other solvers

In any case, it is also possible that the claimant has no counterpart in the Verification game – if other nodes are offline or ignore it.

Adaptive frequency MRUs and other Swarm API updates

A PR with a lot of changes to the Swarm API and a new update lookup algorithm has already been opened against the master branch of the main Ethereum client: ethersphere/swarm#881

Motivation

  • First, we need to understand the new algorithm and features.
  • Second, new functions will be developed in the project after the PR is integrated into the Ethereum client.

Staging net

New updates commonly have bugs, and deploying them to our devnet can break it. Create a staging net for testing new updates under production-like conditions.

Logger from WASM code

Motivation

When something goes wrong inside the VM, there is no way to figure out what actually went wrong. Code that runs in the Fluence WASM VM can either return a success result or a 'Trap' error without any explanation. This, combined with the lack of a proper debugging tool, creates barriers for developers using Fluence.

Proposition

Make it possible to write to a log from the VM for debugging purposes. It might be an SDK function that allows writing to stdout when the VM runs in debug mode. Logging in a production environment is out of the scope of this task, but it should be kept in mind.

Access geth in light mode from Scala code

Motivation

Geth as an independent process is used both for launching Swarm process and directly by Fluence Client and Node:

  • Client needs to read the cluster contract in order to find out cluster nodes public keys
  • Node needs to check the token balances, make transactions
  • Node needs to listen for Ethereum events of changing Fluence clusters contract, e.g. to form a cluster
  • In the future, we want to implement an index over Ethereum substate

That's why we need to introduce an approach to work with geth from within our Scala code.

Geth must be launched in light mode, as we cannot force Clients, or even Nodes, to store the entire Ethereum blockchain.

Proposition

Try to work with geth in light mode via https://github.com/web3j/web3j. If it doesn't fit, fall back to raw JSON-RPC. A sketch of what this could look like follows the list below.

What to try:

  • Read data from a contract (e.g. token balance)
  • Subscribe to events
  • Get log message from the transaction receipt
  • Send a transaction to a contract
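
A rough sketch of the first two items with web3j from Scala; the endpoint and addresses are placeholders, contract-specific reads such as a token balance would additionally need generated bindings or manual ABI encoding, and older web3j versions expose ethLogObservable instead of ethLogFlowable:

import org.web3j.protocol.Web3j
import org.web3j.protocol.http.HttpService
import org.web3j.protocol.core.DefaultBlockParameterName
import org.web3j.protocol.core.methods.request.EthFilter

val web3j = Web3j.build(new HttpService("http://127.0.0.1:8545")) // local geth running in light mode

// Read data: the plain ETH balance of an account.
val balance = web3j
  .ethGetBalance("0x0000000000000000000000000000000000000000", DefaultBlockParameterName.LATEST)
  .send()
  .getBalance

// Subscribe to events emitted by the Fluence clusters contract.
val filter = new EthFilter(
  DefaultBlockParameterName.EARLIEST,
  DefaultBlockParameterName.LATEST,
  "0x0000000000000000000000000000000000000000" // contract address placeholder
)
web3j.ethLogFlowable(filter).subscribe(log => println(log))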

Solver registration with CLI

Motivation

A node that wants to be a member of a computation cluster should register manually with geth. This process is time-consuming and not user-friendly.

Proposition

Add logic to the Fluence CLI that takes the node's public key as input, goes through the registration process automatically, and returns the registration transaction's hash if successful.

Avoid or refactor JDK check in build.sbt

Motivation

There is a JDK version check in the root build.sbt that prevents running sbt under JDK version other than 1.8.

@folex: the message "Unsupported JDK: java.specification.version 1.9 != 1.8" does not make it clear which version is required and which is the current one; it should be reworded into something clearer.

Note that this check is added because the current implementation of the underlying VM cannot work properly with JDK 1.9 and later versions.

Proposition

  • At the very least, a clearer message is needed here (a possible shape is sketched below).
  • Ideally, we should avoid the limitation on the JDK version altogether.
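
A minimal sketch of a clearer check in build.sbt (illustrative only; the existing check could simply be reworded along these lines):

// Runs once when the build is loaded and fails fast with an explicit message.
initialize := {
  val _ = initialize.value
  val required = "1.8"
  val current = sys.props("java.specification.version")
  require(
    current == required,
    s"Unsupported JDK: found java.specification.version $current, but the underlying VM requires $required"
  )
}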

Download code for a new cluster from the Swarm

Motivation

Currently, nodes can only use code that has already been downloaded locally to that node. To simplify and automate new code deployment, a node should be able to use code from external sources.

Proposition

Use the code field from the ClusterFormed event in the Fluence smart contract as a Swarm address. Download the code from Swarm when a solver is being deployed, and store it in a local volume.

State machine <-> Client: interaction specification

Motivation

The interaction protocol between the State machine and the client is primarily based on the Tendermint protocol, which is fairly clear and well documented. At the same time, the Tendermint architecture is rather flexible and doesn't prescribe concrete solutions like RPC message formats, encoding, error handling, timeouts, etc.
Fluence also introduces additional layers of interaction logic: transaction authentication, deduplication, and ordering.

These aspects make the current protocol rather complicated. Currently, there is no single documentation/specification source describing this protocol: there is only the implementation code. Also, some interaction cases are not even implemented yet. This leaves the protocol unclear and tangled, obstructs understanding of the whole interaction picture (which is quite important), and slows down future development of new interaction features and alternative client implementations.

Proposition

Implement the interaction protocol specification. It would describe the interaction from different points of view and at different levels:

  • The client:
    • Language-specific client high-level API: how to use this API, what to expect, and what guarantees it provides. Currently such an API is implemented in Python.
    • Tendermint RPC message formats. This is primarily a specification for transaction (broadcast_tx_*) and query (abci_query) RPCs: what the request format is and what the response fields (Info, Result, Proof, etc.) mean.
  • The state machine ABCI methods implementation specification: how they act depending on the current state and given arguments.

The specification must describe both normal behavior and various corner cases, including client/node failures and Byzantine behavior. The specification must explicitly inform what features are not actually implemented.

It's supposed that the required specification would be implemented as Markdown document(s) and placed in Fluence main repo.

SBT 1.1.5+: lightbend config loading fails under `sbt run`

Motivation

Since SBT 1.1.5, lightbend-style config loading (used in lightbend config and pureconfig) doesn't load configs from the default Scala resource locations (application.conf and reference.conf) when the application is launched via sbt run. It works, however, under IDEA Run or under SBT 1.1.4.

As a temporary solution to allow normal config loading, the SBT version is pinned to 1.1.4. However, this is not a good long-term solution, since it blocks access to new features and fixes in newer SBT versions.

Known details

Scala-agnostic class loading and resource discovery (via ClassLoader.getResource) work for the described locations under SBT 1.1.5+. The problem starts in lightbend's ConfigFactory.load: those locations are not scanned during config loading.

It's known that SBT 1.1.5 and later versions use different temporary run directories and a different base ClassLoader compared with version 1.1.4.

Proposition

Fix the problem or find a workaround that allows both:

  • flexible config loading from resource files
  • running the project on the latest SBT versions

Efficient merkelized tree for State machine

Motivation

Currently, a simple unbalanced tree with lazy merkelization is used. Such an implementation might be enough for a prototype, but it is not production-ready: the tree is not self-balancing and the number of a node's children is not limited, so its performance may degrade easily.
We need to choose and implement an appropriate data structure.

Proposed change

  • Time/memory/size characteristics of typical operations (tree modification, traversal, Merkle proofs) should not degrade under different workloads; in particular, the structure should be resistant to typical attacks against performance.
  • Specifically, it should have O(log N) average-case complexity, and the number of a node's children should be bounded by a constant (to avoid huge Merkle proofs).
  • A good reference is the Patricia trie used in Ethereum: a radix trie structure with logarithmic asymptotics for basic tree operations.

Rewrite MerkleTreeNode merging rule

Motivation

Future work on a better merkelized storage is planned in #156, which is a relatively large effort. Before we get to it, we still face problems with the existing implementation of Merkle hash merging:

  • Two merging rules are provided, one of which exists mostly for debugging purposes.
  • The current implementation is potentially vulnerable to a 2nd preimage attack: different combinations of merged hashes can produce the same digest function input. (This is currently impossible in practice because the lengths of the merged hashes are always the same, but these lengths are not validated.)

These problems would not require huge effort to resolve, so they might be fixed prior to #156.

Proposed change

  • To keep only one merging rule.
  • To rewrite hash merging so that it either validates the lengths or prepends each hash with its length, thus preventing the most straightforward 2nd preimage attack scenario (see the sketch below).
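
A minimal sketch of the second option, assuming SHA-256 and hypothetical helper names (this is not the existing MerkleTreeNode code): each child hash is length-prefixed before digesting, so different splits of the concatenated bytes cannot produce the same digest input.

import java.security.MessageDigest

def mergeHashes(left: Array[Byte], right: Array[Byte]): Array[Byte] = {
  // Length prefix (a single byte is enough while hashes are shorter than 256 bytes).
  def lengthPrefixed(h: Array[Byte]): Array[Byte] = h.length.toByte +: h

  MessageDigest.getInstance("SHA-256")
    .digest(lengthPrefixed(left) ++ lengthPrefixed(right))
}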

Proof of independent execution – batch validation application

In our network, batch validators are going to verify computations performed by the real-time nodes and other batch validators. As was shown earlier, this approach might be prone to the verifier's dilemma, where a validator has to choose between actually performing the validation and skipping it while stating correctness and receiving a reward. In the latter case, a validator might simply agree with whatever computation result was produced before.

Different approaches were proposed to mitigate this issue. One of them is the forced errors approach introduced by TrueBit. It makes the original solver introduce unpredictable mistakes in an effort to catch a validator that confirms every computation without verification. Instead of a single computation result, two results are returned – the original one and a fake one. The validator doesn't know which one is actually in effect, which forces it to perform the computation independently.

However, it's unclear whether we can use this approach in the Fluence network, because it seems difficult to hide from the batch validator which result is the original one. The reason is that in the Fluence network real-time nodes return results to the client as they are computed, and then store them in Swarm for further validation. Without digging further into whether this does or doesn't prevent the use of forced errors, it might be fruitful to take a look at another approach named "Proof of Independent Execution", originally described in Ethereum Research by JustinDrake.

Here I'll try to add a bit of interpretation to the topic as it was discussed with @michailvoronov and @folex, throw in a few ideas, and describe its potential application in our environment. The credit for the original idea should go to JustinDrake / the ethresearch community – mostly I've just expanded the parts that were not so easy for me to understand.

Introduction

Basic idea

Let's say we have a source code which might be as simple as:

00  int i = 0;
01  while (i < 10) {
02    i++;
03  }

This code could get translated into the following (pseudo-)WASM program:

00  block $0
01    loop $1
02      get_local $0
03      i32.const 9
04      i32.gt_s
05      br_if $0        ;; jumps to #12 if i > 9 – exits the loop
06      get_local $0
07      i32.const 1
08      i32.add
09      set_local $0    
10      br $1           ;; unconditionally jumps to #02 – repeats the loop iteration
11    end label $1
12  end label $0

Before starting the program execution, an executor E generates a secret salt and computes a secret S = hash(salt || address_E). Now it can launch the program execution, which will essentially produce a stream of instructions executed by the CPU. Let's call this stream the execution trace Tr. The trace Tr can be split into chunks, each chunk is hashed with S as a salt, and the obtained hashes are combined into a Merkle root MR.

;; ...
;; omitted the beginning of the execution until i == 9


;; <chunk k> ==> H_k = hash(S || chunk_k)
;;
02    get_local $0    ;; i == 9
03    i32.const 9
04    i32.gt_s


;; <chunk k + 1> ==> H_k+1 = hash(S || chunk_k+1) 
;;
05    br_if $0        ;; don't exit the loop yet
06    get_local $0
07    i32.const 1


;; <chunk k + 2> ==> H_k+2 = hash(S || chunk_k+2) 
;;
08    i32.add
09    set_local $0    ;; i := 10
10    br $1


;; <chunk k + 3> ==> ...
;;
02    get_local $0
03    i32.const 9
04    i32.gt_s


;; <chunk k + 4> ==> ...
;;
05    br_if $0        ;; exit the loop

To compute the hash of an executed instructions chunk, we can treat the chunk as data. For example:

get_local $0
i32.const 9
i32.gt_s

after being converted into the WASM binary representation is equivalent to:

0x20 0x00
0x41 0x09
0x4A

in which case the chunk hash would be calculated as H_k = hash(S || 0x200041094A). Now, we can merkelize the obtained hashes and compute the Merkle root: MR = merkelize(H_0, ..., H_n) where n is the number of chunks in Tr.
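
A minimal Scala sketch of these two derivations, assuming SHA-256 (hypothetical helpers, not the actual implementation):

import java.security.MessageDigest

def sha256(bytes: Array[Byte]): Array[Byte] =
  MessageDigest.getInstance("SHA-256").digest(bytes)

// S = hash(salt || address_E)
def secret(salt: Array[Byte], executorAddress: Array[Byte]): Array[Byte] =
  sha256(salt ++ executorAddress)

// H_k = hash(S || chunk_k), where chunk_k is the WASM binary encoding of the
// executed instructions, e.g. 0x20 0x00 0x41 0x09 0x4A in the example above.
def chunkHash(s: Array[Byte], chunkBytes: Array[Byte]): Array[Byte] =
  sha256(s ++ chunkBytes)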

Streaming Merkle root computation

Note that it's possible to compute the Merkle root in a streaming fashion without storing the entire execution trace, which might be gigantic for certain computations. To do that, we need to keep no more than ceil(log2(n)) intermediate Merkle tree hashes.

Here is an example of an incomplete Merkle tree built for 7 executed trace chunks. Note that we need to store only three intermediate hashes instead of the entire tree:

  • one second level hash (H2:0-3)
  • one first level hash (H1:4-5)
  • one chunk hash (H_6)
chunk_0 chunk_1    chunk_2 chunk_3       chunk_4 chunk_5    chunk_6
   |       |          |       |             |       |          |
  H_0     H_1        H_2     H_3           H_4     H_5      >>H_6<<
    \      /           \      /             \       /
     H1:0-1             H1:2-3              >>H1:4-5<<
           \          /
            >>H2:0-3<<

Once the eighth chunk appears, its hash H_7 can be combined with H_6, producing the first-level hash H1:6-7. Now, H1:6-7 can be combined with H1:4-5, producing the second-level hash H2:4-7, which in its turn, combined with H2:0-3, produces the Merkle root H3:0-7.

chunk_0 chunk_1    chunk_2 chunk_3       chunk_4 chunk_5    chunk_6 chunk_7
   |       |          |       |             |       |          |       |
  H_0     H_1        H_2     H_3           H_4     H_5        H_6     H_7
    \      /           \      /             \       /          \       /
     H1:0-1             H1:2-3                H1:4-5             H1:6-7
           \          /                             \          /
              H2:0-3                                   H2:4-7
                    \                                 /
                     \                               /
                       -------- >>H3:0-7<< ---------
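
A minimal Scala sketch of this streaming computation (illustrative only, not the actual implementation): at most one pending hash per tree level is kept, so the memory footprint stays within ~ceil(log2(n)) hashes. Folding an incomplete tree the way the `root` method does is one possible convention; the exact padding rule is an implementation choice.

import scala.collection.mutable.ArrayBuffer

final class StreamingMerkle(merge: (Array[Byte], Array[Byte]) => Array[Byte]) {
  // levels(i) holds the pending hash of a complete subtree covering 2^i chunks, if any.
  private val levels = ArrayBuffer.empty[Option[Array[Byte]]]

  def addChunkHash(h: Array[Byte]): Unit = {
    var carry = h
    var i = 0
    var placed = false
    while (!placed) {
      if (i == levels.length) levels += None
      levels(i) match {
        case None =>
          levels(i) = Some(carry)
          placed = true
        case Some(left) =>
          carry = merge(left, carry) // two complete subtrees form one at the next level
          levels(i) = None
          i += 1
      }
    }
  }

  // Fold the remaining pending subtrees (lowest level first) into the final root.
  def root: Option[Array[Byte]] =
    levels.foldLeft(Option.empty[Array[Byte]]) {
      case (acc, None)           => acc
      case (None, pending)       => pending
      case (Some(r), Some(left)) => Some(merge(left, r))
    }
}

After 7 chunk hashes the internal state is exactly the three hashes from the diagram above (H_6, H1:4-5, H2:0-3); feeding the eighth hash makes the carry propagate through all three levels, producing H3:0-7.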

Computational overhead

There is a potential issue in the described approach: how much effort it takes to execute the program (cost_execute_trace) compared to the effort of hashing the execution trace (cost_hash_trace). For a really simple, non-cryptographically-strong hash function these costs seem approximately equal: cost_execute_trace ≈ cost_hash_trace.

For example, let's use XOR as a "hash function". To execute a single instruction, it has to be loaded from memory (or the processor cache) into the CPU. After that, the instruction will update a few registers / memory cells. To use this instruction in the XOR "hash", it will also have to be loaded from memory, and then the XOR computation will update a few registers / memory cells. This means that even a simple XOR "hash" requires about as much effort as executing an instruction.

If we need to use a cryptographically strong hash function, it seems that cost_execute_chunk << cost_hash_chunk because of the numerous hashing rounds. Leaving open the question of whether we actually need a cryptographically strong hash function, this means that we won't be able to hash the entire execution trace while keeping reasonable processing performance.

Instead, it seems we need to sample the execution trace and hash only the sampled chunks, in which case we can achieve cost_hash_trace < cost_execute_trace even while using complex hash functions. But before we go further into how the trace sampling could work, let's take a look at how the proposed approach deals with the verifier's dilemma.

Validation

Let's imagine the executor A has already performed the computation and produced the execution Merkle root MR_A. Now, the executor B is going to repeat the computation and produce another Merkle root MR_B. Because MR_A depends on the secret S_A = hash(salt_A || address_A) and MR_B depends on the secret S_B = hash(salt_B || address_B), we can expect that no matter what salt B chooses, it won't be able to make MR_B equal to MR_A.

This means that B has to generate the computation Merkle root itself, and it can do so either by repeating the computation from scratch or by fetching execution trace chunks from somewhere (allegedly, the executor A). If we want B to always perform the computation independently, one way would be to make chunk sharing between different machines economically infeasible.

However, there are two possible obstacles to achieving that:

  • if the execution trace sampling is really sparse, it could be easy for colluding machines to exchange chunks because of the small download size;
  • if two executors share the same physical hardware, there won't be much transfer overhead.

To get back to the validation: B doesn't only compute MR_B, it also computes MR_A and checks that it is equal to the corresponding Merkle root computed by A. However, if computing MR_A has too much overhead compared to the actual program execution, B might opt to just agree with whatever Merkle root the executor A has produced. This means we are still facing the verifier's dilemma, which we could fight by reducing the overhead of computing the Merkle root to a minimum using aggressive sampling.

Execution trace sampling

Hash modulo approach

We could use the following construction to sample the execution trace:

  • each chunk has a predefined constant size p;
  • the gap between the previous chunk and the next chunk is selected by computing H_k % m, where H_k is the hash of the previous chunk and m is a predefined constant.

In this case the sample trace will look like (note the variable gap size):

;; p = 3, m = 5

;; chunk_0 ==> H_0 % 5 = 1
;;
op #01
op #02
op #03

;; <gap> - size 1
op #04

;; chunk_1 ==> H_1 % 5 = 4
;;
op #05
op #06
op #07

;; <gap> - size 4
op #08
op #09
op #10
op #11

;; chunk_2 ==> ...
;;
op #12
op #13
op #14

Assuming a uniform distribution of H_k % m, the expected reduction in the number of instructions that need to be hashed is a factor of roughly 1 + m/(2p): each hashed chunk of p instructions is followed by an average gap of about m/2 unhashed instructions.
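
A minimal sketch of this sampling rule (illustrative; a real implementation would work on the instruction stream rather than an in-memory sequence, and deriving the gap from the last byte of H_k is an assumption made here for simplicity):

def sampleChunkHashes(trace: IndexedSeq[Array[Byte]],   // binary-encoded instructions
                      p: Int, m: Int,
                      hash: Array[Byte] => Array[Byte]): Vector[Array[Byte]] = {
  val hashes = Vector.newBuilder[Array[Byte]]
  var i = 0
  while (i < trace.length) {
    val chunk = trace.slice(i, i + p)                     // next chunk of p instructions
    val h = hash(chunk.foldLeft(Array.emptyByteArray)(_ ++ _))
    hashes += h
    val gap = java.lang.Byte.toUnsignedInt(h.last) % m    // gap derived from the chunk hash
    i += p + gap                                          // skip the gap before the next chunk
  }
  hashes.result()
}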

Sampling takeover attack

Intuitively, we also don't want an executor to have control over, or be able to predict, which chunks will be selected for sampling without actually executing the program. Consider the following trivial program:

00  i++
01  goto 00

If we execute it with sampling constants p = 2 and m = 4, and if hash(S || "i++" || "goto 00") % 4 == 2, then this loop will unroll into:

;;  chunk_0 ==> H_0 % 4 = 2
;;
00  i++
01  goto 00


;;  <gap> - size 2
;;
00  i++
01  goto 00


;;  chunk_1 ==> H_1 % 4 = 2
;;
00  i++
01  goto 00


;;  <gap> - size 2
;;
00  i++
01  goto 00


;;  chunk_2 ==> H_2 % 4 = 2
;;
00  i++
01  goto 00

...

Note that chunk_0, chunk_1 and chunk_2 are all identical and will have the same hashes. Remember that an executor has control over the salt – and hence some control over the secret S = hash(salt || address_E). So, for simple enough programs and specific sampling constants, there could exist a salt that allows producing the Merkle root without executing the program. Such a salt could be found by an executor through trial and error.

Sampling randomization

It seems there are a few potential ways to combat this issue.

One is to make the secret S depend not only on the salt, but also on some externally chosen random prefix: S = hash(rnd_prefix || salt || address_E).

Another is to compute each chunk hash using the hash of the previous chunk and the (Merkle) hash of the memory state at the end of the previous chunk's execution: chunk_hash_k+1 = hash(S || chunk_hash_k || memory_k || chunk_k+1). Because for non-trivial, fully optimized programs it's impossible to predict the memory contents at the end of execution without actually executing the program (~halting problem), an executor won't be able to predict or control the sampling.

Execution trace fetching attack mitigation

As already described in § Validation, two colluding executors A and B might agree that only A performs the execution while B fetches the required execution trace chunks from A. Because with sampling only a tiny fraction of chunks is required to produce the Merkle root, this practice makes sense economically.

To make it economically infeasible, we could require that the computation performed by B is always launched after the computation performed by A. To do so, we could require the secret S_B to be computed as S_B = hash(salt_B || address_B || MR_A). This has a nice property: during the execution of the program, A doesn't know which chunks B will need. So, to be able to serve chunk requests, A would need to store the entire execution trace.

Now, it looks like for non-trivial programs the cost of storing the execution trace (cost_store_trace) is more than the cost of executing it from scratch (cost_execute_trace).

An upper estimate of the cost to execute the program is ||Tr|| * cost_execute_slowest_instruction, where ||Tr|| is the number of instructions in the trace. A lower estimate of the cost to store the trace is ||Tr|| * cost_store_smallest_instruction * time_execute_trace / 2, assuming the first instruction in the trace needs to be stored for the entire time the program is running while the last instruction doesn't need to be stored at all.

Assuming cost_execute_slowest_instruction < cost_store_smallest_instruction * time_execute_trace / 2 for non-trivial programs, we can derive cost_execute_trace < cost_store_trace. This means we can protect against the trace chunk fetching attack, but only if we require sequential program execution by different executors.

Open questions

  1. It's unclear how to prevent collusion if multiple executors are running the computation at the same time. One of them might do the actual computation and serve the chunks to the rest. With execution trace sampling this might be economically reasonable for the executors.

  2. It's unclear how to actually force the validator B to verify the Merkle root produced by A. By sampling the execution we reduce the overhead of computing MR_A, which means it's not a big deal for B to also check it. However, nothing actually forces B to run the check except the possibility of losing a security deposit if the computation turns out to be incorrect. But to guarantee this we might need to introduce forced errors into the system, which is exactly what we were trying to avoid...

Application in the Fluence network

I haven't thought about this deeply enough yet, but I think we could use Proof of Independent Execution in the following way in our network:

  • Real-time nodes of the same cluster could share the same (public) salt and secret and produce the same execution trace Merkle root. This seems different from the originally proposed idea where an untimely secret publication would be a reason for slashing the executor's deposit. However, having the same salt would allow the real-time cluster to reach consensus on the execution trace Merkle root.
  • Batch validation of the history written by the real-time nodes will always be sequential, as described in § Execution trace fetching attack mitigation. First, this would prevent batch validators from fetching trace chunks from the real-time nodes. Second, for batch validation it seems we don't really need the parallelism anyway – the time-sensitive stuff happens in the real-time cluster.

Implement the client for interaction with Swarm API

Motivation:
External storage exists as a component in Fluence's architecture for storing the computation cluster's snapshots and logs so that validators are able to check them. After some research we decided to use Swarm.

Proposed Changes:
For a start, Fluence should be able to use the main Swarm functionality: uploading/downloading files and directories, and Mutable Resource Updates (MRU).

We will create a thin client wrapper that manages connections and prepares and sends requests to Swarm. This client will interact with Swarm nodes through the existing API. After some research we concluded that it is better to use the HTTP API rather than JSON-RPC or anything else: it is more common and better documented.
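
A minimal sketch of what such an upload call could look like over HTTP. Assumptions to be verified against the targeted Swarm version: a local Swarm node exposes its HTTP gateway on port 8500 and supports the bzz-raw: scheme, returning the content hash in the response body.

import java.net.{HttpURLConnection, URL}

def uploadRaw(content: Array[Byte], gateway: String = "http://localhost:8500"): String = {
  val conn = new URL(s"$gateway/bzz-raw:/").openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  conn.setRequestProperty("Content-Type", "application/octet-stream")
  conn.getOutputStream.write(content)
  conn.getOutputStream.close()
  // Swarm is expected to answer with the hash of the uploaded content
  val hash = scala.io.Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
  conn.disconnect()
  hash
}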

Rejected Alternatives:
There are a lot of decentralized external storages, e.g. IPFS, Sia, etc.
Swarm is mature, well documented, and has its economics in active development, hence we decided to use it.

Rewrite State machine local cluster commands

Motivation

There are local cluster management scripts to perform actions with the local cluster such as starting, stopping, etc.

The current implementation doesn't follow process management best practices. In particular, in delete.sh:

  • kill -9 is used instead of a more graceful process termination.
  • Processes are killed by grep pattern, which is dangerous because it may kill more processes than required.

Proposed change

  • It would be better to store the PIDs of the running processes in dedicated file(s) and retrieve them from there, so that only those processes are terminated.
  • Look for similar issues in other scripts and fix them.

Tendermint / State machine Configurator

Motivation

We are planning to implement 2 important components of the Fluence network:

  • Bootstrapper contract that registers Fluence nodes and assembles them together in clusters.
  • Process manager (#195) that launches nodes' processes and maintains them.

We also need to translate the Bootstrapper contract's cluster configuration (which is relatively high-level and focused mostly on what nodes the clusters consist of) into Tendermint- and VM-specific configuration (which is finer-grained and depends on the actual Tendermint / State machine / VM implementation).

Proposition

To implement the Configurator. It takes the high-level parameters and produces the configuration files based on them.

The Configurator is essentially parameterized by:

  • cluster-wide parameters
    • WASM code to run on the cluster's VMs
    • genesis info
      • Tendermint chain ID
      • peers' public keys + voting powers
      • initial state/hash (probably we may omit it in the first implementation)
  • node-specific parameters
    • Tendermint p2p port (to connect the local TM Core to its peers)
    • Tendermint JSON RPC port (to listen to requests from clients, retrieving blockchain data, etc.)

The Configurator itself configures additional parameters:

  • auxiliary ports used by the State machine
    • State machine ABCI port (to listen to ABCI requests from TM Core)
    • monitoring port (when monitoring will be implemented)
  • local private/public keys locations (both for TM transport and consensus)
  • TM-specific configuration (timeouts, retries, cache sizes, etc.)

The configuration files are composed of the settings listed above. They follow the formats used by the target application (Tendermint Core or the State machine + VM process), so the Process manager can be invoked without being aware of their particular contents.

Separate Asmble VM and real-time worker / batch validator process

Currently, we have the following design: docker { jvm { node <----> Asmble {untrusted WebAssembly code} } }. However, because real-time workers participate in Tendermint consensus and batch validators participate in verification signing, this has potential security issues.

In particular, if a malicious piece of code breaks out of the Asmble VM, it might be able to double-vote in Tendermint consensus or confirm an incorrect batch – both punishable activities in the network.

I think that to improve security the following design should be adopted instead:
dockerA { jvmA { node } } <----> dockerB { jvmB { Asmble {untrusted WebAssembly code} } }. This way we have more security layers, which should significantly decrease the probability of malicious code harming the miner.

Floating point determinism

Motivation
There are some known floating-point determinism issues with the IEEE 754 standard. Some examples are described here and here.

Proposed research
To guarantee floating-point operation determinism, the following should be investigated:

  • all possible sources of floating-point non-determinism during WASM code execution;
  • parsing of floating-point function arguments (it could be based on strtod);
  • printing the result of function execution if it returns a floating-point number (it could be based on dtoa).

Error hierarchy for State machine

Motivation

Currently, a significant amount of State machine code uses String as the left part of Either/EitherT for handling various errors and exceptional cases. This includes both creating new error messages for edge cases and wrapping caught exceptions. It makes the current code insufficiently expressive and not clean.

Proposition

A carefully designed error class/object hierarchy is required. StateMachineError has already been applied as the root class for transaction invocation and other State machine parts. The VM project's error hierarchy can be used as a reference.
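
A possible shape of such a hierarchy (the leaf names here are illustrative; only StateMachineError itself is mentioned in the code today):

sealed trait StateMachineError {
  def code: String
  def message: String
  def cause: Option[Throwable]
}

// Illustrative leaf cases; the real set would follow the State machine's actual failure modes.
final case class PayloadParseError(message: String, cause: Option[Throwable] = None)
    extends StateMachineError { val code = "PayloadParseError" }

final case class VmInvocationError(message: String, cause: Option[Throwable] = None)
    extends StateMachineError { val code = "VmInvocationError" }

// Caught exceptions are wrapped instead of being converted to bare Strings, e.g.:
// result.leftMap(e => VmInvocationError(e.getMessage, Some(e)))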

Support complex function signatures in WebAssembly VM

Motivation
WebAssembly supports only 4 argument types: i32, i64, f32, f64. However, many languages that can be compiled to wasm support more complex types (Array, String, Map, HashMap, Set, ...). Also, it's impossible to recover the original function signature from the wasm code alone. So we need a way to support passing arbitrary types and some kind of source mapping to pass arguments to a wasm function call correctly.

Proposed Research
In the context of this issue it is proposed to investigate the best way of:

  • passing arrays and strings to a function;
  • passing more complex types like map, hashmap, set, ... to a function;
  • source-mapping function arguments to call a wasm function correctly.

Request manager for JS client

Motivation

The real-time cluster API is asynchronous by nature. A client has to use a request counter to keep happens-before semantics and the ordering of requests. The client can send multiple requests in parallel, but it has to wait for a new block for each request's result, because results appear in new blocks. On the other side, if the VM returns an unexpected error, the client has to recreate the session, because after that none of the subsequent requests would have results. Request management shouldn't impose additional overhead, since Fluence aims to provide sub-second latency. It also has to be stable and predictable for users, so they don't have to deal with unexpected behavior. A detailed explanation is given in #194.

Proposed Changes

Add a thin layer between the user's calls in the code and transport requests to optimize the request queue. The user shouldn't have to wait for every result before issuing the next request. The client can send all requests at once and then collect their results. If an unexpected error occurs, the request manager will fail all other requests. In this way, a developer writing a client for their application deployed on the VM decides for themselves whether to call functions in a sync or an async way. The user doesn't need to manage the counter or the order of requests themselves.

So, there are four tasks:

  • add a queue of requests;
  • update the counter for each request in the order that the client developer sets;
  • fail all requests if an unexpected VM error occurs;
  • process all requests at once and check results in parallel.

Implement gas counter

Motivation
Since we allow users to execute arbitrary WASM code in our ecosystem, we should provide a solution to the halting problem, so we need an alternative to Ethereum's gas. As in Ethereum, gas is a unit that measures the amount of computational effort it takes to execute certain WASM instructions, and we need a way to count it during wasm code execution.

Proposed changes
There are two possible ways to implement the gas counter:

  • add gas counter increment instructions after each group of generated JVM instructions that corresponds to one WASM instruction;
  • add gas counter increment instructions after each JVM basic block.
    The second variant is more complicated to implement but faster; a sketch of the counter that the instrumented code would call is given below.
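
Either way, the instrumented code ends up calling into a counter that tracks consumed gas and aborts execution once the limit is exceeded. A minimal sketch of such a counter (illustrative only, not the proposed Asmble instrumentation itself):

final class OutOfGasException(msg: String) extends RuntimeException(msg)

final class GasCounter(limit: Long) {
  private var used: Long = 0L

  // Called by the injected instructions: once per WASM instruction (variant 1)
  // or once per JVM basic block with the block's total cost (variant 2).
  def charge(amount: Long): Unit = {
    used += amount
    if (used > limit) throw new OutOfGasException(s"gas limit $limit exceeded after $used units")
  }

  def gasUsed: Long = used
}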

Real-time cluster's transactions history discovery

Motivation:
The real-time cluster saves the transaction history of computations in an external storage, e.g. in Swarm. This history will be used by verifiers to validate those computations.
Verifiers cannot trust a transaction history in external storage without a verifiable association of the history's address with a real-time cluster. Without such a trusted association it cannot be guaranteed that verifiers will be able to verify the sent requests and performed computations.

Proposed changes:

  • The computation cluster will store its address in trusted storage on registration in the Fluence system.
  • Verifiers and clients should be able to get snapshots and logs from external storage knowing only public information about the cluster. They could get addresses from trusted storage or derive them from public information, e.g. the generated cluster identifier stored in the blockchain.

ENS smart-contract is used for this task in Ethereum: https://swarm-guide.readthedocs.io/en/latest/usage.html#using-ens-names. We can use our own contract where the cluster will register.

Simply put, ENS is a hash map in an Ethereum smart contract, where the key is a name (something.eth) and the value is information about the owner plus other custom information, e.g. an address in Swarm.

A console utility for deploying code to cluster

Motivation

The current code deployment process to the Fluence network looks like this:

  • Upload code to Swarm. Swarm will return the code hash.
  • Publish the received hash to a default “deployer” contract.

If done manually this process is cumbersome, time-consuming, and error-prone.

Proposition

Create an automation utility that takes the code as input, goes through the publishing process, and, if successful, returns the hash of the publishing transaction to the developer.

Make Deployer.sol pausable

Motivation

Ethereum smart contracts deployed on the main chain may have unknown bugs. These bugs might later be found by an adversary and exploited in a way that destroys the Fluence network or the Fluence token economy, or does harm in another unpredictable way.

In this case, the network operator has to replace the smart contract with a new one, migrate cluster contracts to it, and propagate the new contract's address throughout the network. This takes time.

While the bug is being fixed, it is worth blocking all operations on the smart contract to prevent or limit the harm. Usually this is done with the "haltable", "pausable", or "circuit-breaker" pattern.

Proposition

Make our cluster-forming smart contract "haltable".

When the contract is marked as halted (by halt call), it rejects all the calls to it, except the call to unhalt.

The halt and unhalt calls are whitelisted so that only moderators may call these functions. There are different approaches to managing the moderators, e.g. the Ownable pattern; this should be investigated and discussed.
