
narwhal's Introduction

Narwhal development now happens at https://github.com/MystenLabs/sui/tree/main/narwhal

For Narwhal (and Bullshark / Tusk) users: Narwhal (together with Bullshark and Tusk) still exists as an independent project. We will be periodically publishing Narwhal packages to crates.io.

For Narwhal (and Bullshark / Tusk) contributors: to consolidate infrastructure and iterate faster, the latest development of Narwhal will happen in the Sui repo. Please open your PRs and issues over there. We appreciate the contributions!

Narwhal


This repo provides an implementation of Narwhal, Tusk and partially synchronous Bullshark, a DAG-based mempool and efficient BFT consensus. The codebase has been designed to be small, efficient, and easy to benchmark and modify.

This repo uses fastcrypto as its cryptography library.

Quick Start

The core protocols are written in Rust, but all benchmarking scripts are written in Python and run with Fabric. To deploy and benchmark a testbed of four nodes on your local machine, clone the repo and install the python dependencies:

$ git clone https://github.com/mystenlabs/narwhal.git
$ cd narwhal/benchmark
$ pip install -r requirements.txt

You also need to install Clang (required by RocksDB) and tmux (which runs all nodes and clients in the background). Finally, run a local benchmark using Fabric:

$ fab local

This command may take a long time the first time you run it (compiling Rust code in release mode can be slow), and you can customize a number of benchmark parameters in fabfile.py. When the benchmark terminates, it displays a summary of the execution similar to the one below.

-----------------------------------------
 SUMMARY:
-----------------------------------------
 + CONFIG:
 Faults: 0 node(s)
 Committee size: 4 node(s)
 Worker(s) per node: 1 worker(s)
 Collocate primary and workers: True
 Input rate: 50,000 tx/s
 Transaction size: 512 B
 Execution time: 19 s

 Header size: 1,000 B
 Max header delay: 100 ms
 GC depth: 50 round(s)
 Sync retry delay: 10,000 ms
 Sync retry nodes: 3 node(s)
 Batch size: 500,000 B
 Max batch delay: 100 ms

 + RESULTS:
 Consensus TPS: 46,478 tx/s
 Consensus BPS: 23,796,531 B/s
 Consensus latency: 464 ms

 End-to-end TPS: 46,149 tx/s
 End-to-end BPS: 23,628,541 B/s
 End-to-end latency: 557 ms
-----------------------------------------

Next Steps

The next step is to read the paper Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus. It is then recommended to have a look at the README files of the worker and primary crates. An additional resource to better understand the Tusk consensus protocol is the paper All You Need is DAG as it describes a similar protocol.

The README file of the benchmark folder explains how to benchmark the codebase and read benchmarks' results. It also provides a step-by-step tutorial to run benchmarks on Amazon Web Services (AWS) across multiple data centers (WAN).

License

This software is licensed as Apache 2.0.

narwhal's People

Contributors

aakoshh, akichidis, allan-bailey, andll, arun-koshy, aschran, asonnino, bmwill, brad-mysten, clay-mysten, dependabot[bot], gdanezis, huitseeker, joyqvq, kchalkias, lavindir, lxfind, mwtian, mystenmark, punwai, sadhansood, velvia


narwhal's Issues

Unify consensus testing facilities

Description

The consensus crate is growing and there are now a number of testing facilities spread across the unit test files (e.g., consensus_tests.rs, subscriber_tests.rs, etc.). These testing facilities are used multiple times across various test files (not only in the one where they are defined). It would be great to group all of them into a single file, common.rs.

[crypto] Check canonicity of our base64 encodings

We encode and decode key material in base 64:

pub trait EncodeDecodeBase64: Sized {
    fn encode_base64(&self) -> String;
    fn decode_base64(value: &str) -> Result<Self, eyre::Report>;
}

// The Base64ct is not strictly necessary for (PubKey|Signature), but this simplifies things a lot
impl<T: ToFromBytes> EncodeDecodeBase64 for T {
    fn encode_base64(&self) -> String {
        base64ct::Base64::encode_string(self.as_bytes())
    }

    fn decode_base64(value: &str) -> Result<Self, eyre::Report> {
        let bytes = base64ct::Base64::decode_vec(value).map_err(|e| eyre!("{}", e.to_string()))?;
        <T as ToFromBytes>::from_bytes(&bytes).map_err(|e| e.into())
    }
}

Using base64ct:
https://docs.rs/base64ct/latest/base64ct/f

We should check that base64ct uses (or can be configured to use) canonical encodings, to avoid information leakage through padding.
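
A minimal sketch of one way to enforce canonicity, assuming we re-encode the decoded bytes and compare them with the input (the helper name is illustrative):

use base64ct::{Base64, Encoding};

fn decode_base64_canonical(value: &str) -> Result<Vec<u8>, base64ct::Error> {
    let bytes = Base64::decode_vec(value)?;
    // Re-encode and compare: a non-canonical input (e.g. stray padding bits)
    // will not round-trip to the exact same string.
    if Base64::encode_string(&bytes) != value {
        return Err(base64ct::Error::InvalidEncoding);
    }
    Ok(bytes)
}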

[get_collections]: primaries handle must-block requests

Story

Parent story: #67

Description

As part of this task, the logic should be implemented on the primary nodes to handle requests to fetch missing certificates. Recall that the first step of the protocol is for a primary node to ask its peers for a list of certificate ids (block ids). The peers should then respond with the certificates themselves (or an empty response for the ones that are not found).

[store] Feedback from use in fastnft

We have now used the DBKey and DBBatch to implement persistence in fastnft. Some feedback:

  • For a couple of operations we need atomic test-and-write, i.e., to process orders we need to check lock values (none or the same) and then set them; for certificates we test that objects / locks exist and then do the write. The current abstractions do not provide a native (efficient) way of doing this, so we need a mutex across this logic (see the sketch after this list).
  • Ser and deser are really nice and functional. BUT right now they are performed synchronously on the main thread, which needs to hold a mut lock on the database. The abstraction (incidentally) promotes doing ser / deser in the critical region. It would probably be better to allow threads concurrent access to the DBMap, and internally protect operations with a mutex or RwLock.
  • It's probably worth supporting a typed multi_get variant; it is used often enough.
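
A minimal sketch of the test-and-write pattern under an external mutex, using a stand-in store rather than the actual typed-store API (all names here are illustrative):

use std::collections::HashMap;
use std::sync::Mutex;

// Stand-in for the typed store; the real DBMap is backed by RocksDB.
struct KvStore {
    inner: Mutex<HashMap<Vec<u8>, Vec<u8>>>,
}

impl KvStore {
    // Atomic test-and-write: only set the lock if it is unset or already
    // holds the same value, mirroring the order-processing check above.
    fn set_lock_if_free(&self, key: &[u8], value: Vec<u8>) -> bool {
        let mut map = self.inner.lock().unwrap();
        match map.get(key) {
            None => {
                map.insert(key.to_vec(), value);
                true
            }
            Some(existing) if *existing == value => true,
            Some(_) => false,
        }
    }
}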

[consensus] Error dialogue with light clients

Story

This issue is part of the light client story #95

Description

The subscriber core can handle a fixed (maximum) number of light clients. Right now, when this maximum is reached, additional light clients' connections are simply refused. It would be nice to instead send them back an error message.

[Narwhal mempool integration] Draft API for consensus

Our customers want to integrate with Narwhal as a mempool first, which requires their consensus modules, co-located with a primary, to be able to query the mempool and update it with the outcome of consensus. Note: in this context, we call Narwhal blocks "collections" (because the consensus has a lock on the word block).

We have identified this will require at least the following API, exposed by Narwhal to consensus (a sketch in Rust follows the list):

  • read_causal( bd: Digest ) -> Vec<Digest> takes a collection digest and returns the collection digests encountered in a BFS of the graph, up to a commit boundary (see also MystenLabs/sui#5187 for the current implementation of this walk).
  • get_collection(Digest) -> Vec<Transaction> takes a collection digest and returns the transactions contained in the collection.
  • drop_causal_read( bd: Digest ) takes a collection digest and registers it to update the "commit boundary" of the aforementioned BFS, thereby making sure the collections in question are no longer considered valid for proposal. This should be called by the co-located validator's consensus module when voting for such a proposal.
  • drop_collections( bd: Vec<Digest> ) takes a list of collection digests and drops the contents of those collections from storage. This should be called by the co-located consensus module when witnessing a commit, thereby enshrining the content of a block (proposed beforehand through collections) into some ledger storage.
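
A hypothetical Rust sketch of this surface; the trait name and placeholder types are illustrative, not the actual Narwhal API:

// Placeholder types standing in for the real Narwhal digest / transaction types.
type Digest = [u8; 32];
type Transaction = Vec<u8>;

// Sketch of the surface Narwhal would expose to a co-located consensus
// module; the method signatures mirror the list above.
trait NarwhalMempool {
    // BFS from a collection digest, up to the current commit boundary.
    fn read_causal(&self, bd: Digest) -> Vec<Digest>;
    // The transactions contained in the given collection.
    fn get_collection(&self, bd: Digest) -> Vec<Transaction>;
    // Advance the commit boundary so this collection is no longer proposable.
    fn drop_causal_read(&mut self, bd: Digest);
    // Drop the contents of the given collections from storage.
    fn drop_collections(&mut self, bds: Vec<Digest>);
}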

The present umbrella epic is about building the API in question. It involves three broad areas of work:

  1. exposing the functionality through an IPC front-end with RPC,
  2. accessing and managing collection data (whose transactions are batched and stored at workers) from the primary (see also MystenLabs/sui#5184),
  3. building a cache functionality on the primary to make the cost of external requests manageable.

More details here:
https://docs.google.com/document/d/1BuD7JaZxes8UtWOJZd8D2ZJkcj7tqYSj4601ffuceTg/edit?usp=sharing

Be able to deploy Narwhal in a "staging" environment

Description

Currently we have a way to deploy Narwhal to AWS and run a benchmark, which is definitely great and helpful. However, we lack a constantly running network of Narwhal nodes, the way a "staging" (or development) environment works in a more traditional enterprise setup. Ideally, we would like to deploy a few Narwhal nodes in a primary-worker setup on a set of AWS instances and let them run uninterrupted. Ideally we would also have a nice CI/CD way to deploy after merging to a staging branch - or, for simplicity, we could just deploy after merging to the main branch.

Why is this important?

  • Currently we have no way to tell whether Narwhal works correctly when left running for long periods (e.g., weeks)
  • Reveal issues/problems that only appear when the network is up and running
  • Have a deployment that developers can use to test the nodes in a real environment (not just via our integration tests)

Approach
In its simplest form, we would like to be able to trigger a (re)deployment after a CI/CD run on the main or a staging branch. I am aware that things can get quite complex, and we are certainly not looking for SRE-level work here, but something simple and robust enough to let us deploy Narwhal on AWS. I would suggest looking in the following direction (some thoughts):

  • Use Terraform to ensure that whatever we build is reproducible and changes are tracked. Highly recommended
  • Build Docker images for our Narwhal node that allow it to run as either a primary or a worker
  • Integrate the image-building process into our CI/CD to push images to AWS ECR
  • Choose something like ECS to deploy the nodes. Alternatively, if we want to fully abstract the deployment and node management process, we can use something like AWS Fargate
  • Start with a simple primary --> workers setup. We can expand to a multi-primary network as a follow-up
  • Being able to see logs in CloudWatch would be a bonus

Issue list:

[virtual DAG] Leaf removal

#85 introduces a DAG but special-cases leaves as not removable / compressible.
This hinders actual removal of nodes at the end of a parent-pointer chain (suffix removals) from the DAG.

[get_collections] re-assess the storage model for the worker batch store

Story: #67

Following the comments on PR #66, we should consider whether to stick with the current storage model, i.e., storing the WorkerMessage::Batch(Batch) serialised in the database, or to store the Batch directly so we can retrieve exactly the quantity we need.

Points that should be taken into account:

  • potential performance penalties: storing the serialised WorkerMessage::Batch(Batch) as-is was a deliberate choice to avoid the serialisation cost

[remove_collections]: introduce endpoint to remove the provided collections from Narwhal

Epic: #44

In order to ensure that:

Our customers will only receive collections that haven't already been consumed and proposed (or discarded)
We reclaim space in Narwhal nodes

we are looking to introduce an endpoint that will allow a validator to remove the specified collections from both the internal DAG and the data store. At the moment more details can be found in the doc here.

The endpoint at a high level will look like:

remove_collections(collection_refs: Vec<CollectionDigest>) -> Result<(), Error>

where the collection_refs is a vector with the collection digests that should be removed.

[store] use the reopen! macro to reopen multiple column families

A new macro has been introduced as part of the typed-store to allow us to reopen multiple column families at once and avoid code duplication.

As part of this issue, the macro reopen! should be used instead of handling the reopen of column families one by one.

For example, a place where this could be used:

narwhal/node/src/main.rs

Lines 123 to 141 in 6b08995

let header_store = Store::new(
    rocks::DBMap::<Digest, Header<Ed25519PublicKey>>::reopen(&rocksdb, Some("headers"))
        .expect("Failed keying headers database"),
);
let certificate_store = Store::new(
    rocks::DBMap::<Digest, Certificate<Ed25519PublicKey>>::reopen(
        &rocksdb,
        Some("certificates"),
    )
    .expect("Failed keying certificates database"),
);
let payload_store = Store::new(
    rocks::DBMap::<(Digest, WorkerId), PayloadToken>::reopen(&rocksdb, Some("payload"))
        .expect("Failed keying payload database"),
);
let batch_store = Store::new(
    rocks::DBMap::<Digest, Vec<u8>>::reopen(&rocksdb, Some("batches"))
        .expect("Failed keying batch message database"),
);
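
A sketch of what the macro-based version could look like, assuming reopen! takes the database handle followed by column-family name / key-value type pairs (the exact invocation syntax should be checked against the typed-store crate):

// Hypothetical invocation; verify the reopen! syntax in typed-store.
let (header_map, certificate_map, payload_map, batch_map) = reopen!(&rocksdb,
    "headers";<Digest, Header<Ed25519PublicKey>>,
    "certificates";<Digest, Certificate<Ed25519PublicKey>>,
    "payload";<(Digest, WorkerId), PayloadToken>,
    "batches";<Digest, Vec<u8>>
);

let header_store = Store::new(header_map);
let certificate_store = Store::new(certificate_map);
let payload_store = Store::new(payload_map);
let batch_store = Store::new(batch_map);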

Draft API for execution layer

Expose an API for Narwhal to communicate with an external execution layer. This API should be general enough to support a wide variety of execution layers, but two known clients are:

  • Cosmos, which uses ABCI (the Application Blockchain Interface). ABCI is probably a good starting point for Narwhal to mimic.
  • Celo, whose relevant interfaces are here

One important factor is the interaction with XO vs. OX models of execution (see https://arxiv.org/abs/1906.11229 for the terminology)
See Google Doc

[collection] worker nodes process RequestBatch messages and respond

Following the work on the PR, we should introduce the necessary functionality for worker nodes to process the received RequestBatch messages and respond to the primary with the results.

The RequestBatch message (received over the TCP communication) basically instructs a worker node to retrieve a batch from its internal store by the provided id. The worker should then respond to the primary node with a RequestedBatch piggybacked with the transactions fetched from the retrieved batch.

If the requested batch is not found (although this is not really expected to happen), the worker should respond with a batch-not-found error.

[get_collections] introduce endpoint to retrieve the collections

Epic: #44

In order to allow our customers to integrate with Narwhal, one of the endpoints we would like to introduce is get_collections. As part of this story we are looking to expose an endpoint that allows our clients to retrieve the transactions for a list of given headers / certificates. We'll call such a unit a collection.

Differentiate Digest type based on hashed type

We store many digests as hashes deduced from their contents, where the deduction is driven by an impl<T> Hash for T.

This applies to votes, certificates, headers, ... It would be good to be able to differentiate between a certificate digest and a vote digest. Using phantom types, we should be able to do this at no runtime cost (see the sketch below).
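
A minimal sketch of the phantom-type approach (type names are illustrative):

use std::marker::PhantomData;

// A 32-byte digest tagged at the type level with what it hashes.
// PhantomData is zero-sized, so the tag has no runtime cost.
pub struct TypedDigest<T> {
    bytes: [u8; 32],
    _marker: PhantomData<T>,
}

// Marker types naming the hashed content.
pub struct CertificateMarker;
pub struct VoteMarker;

pub type CertificateDigest = TypedDigest<CertificateMarker>;
pub type VoteDigest = TypedDigest<VoteMarker>;

// A function can now demand a certificate digest specifically;
// passing a VoteDigest is a compile-time error.
pub fn process_certificate(_digest: CertificateDigest) { /* ... */ }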

[DB API] Make a cross CF BatchWrite

The async typed storage API is very nice and we can use it in other projects too. It would be nice to provide WriteBatch functionality that allows multiple writes / deletes, potentially across many column families, to be executed atomically (see the sketch below). This is low priority, since for the moment we do not need this feature in Narwhal / Tusk.
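
For reference, the underlying rust-rocksdb crate already exposes an atomic write batch that can span column families, so a typed wrapper could build on something like the following sketch (not the typed-store API; column-family names are illustrative):

use rocksdb::{WriteBatch, DB};

fn atomic_cross_cf_write(db: &DB) -> Result<(), rocksdb::Error> {
    // Hypothetical column families; in Narwhal these could be, e.g.,
    // "headers" and "certificates".
    let headers_cf = db.cf_handle("headers").expect("missing column family");
    let certificates_cf = db.cf_handle("certificates").expect("missing column family");

    let mut batch = WriteBatch::default();
    batch.put_cf(headers_cf, b"header-key", b"header-bytes");
    batch.delete_cf(certificates_cf, b"stale-certificate-key");

    // Everything in the batch commits atomically, across both column families.
    db.write(batch)
}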

[get_collections] fetch missing collections - implement must-block protocol for certificates

Story

Parent story: #67

Description

When a get_collections request is received, it is possible for some of the requested collections to be missing from the validator node that serves the request. In the first iteration we handled this by emitting an error message saying the collection is missing. However, this is not acceptable, as it can block the consensus protocol. Legitimate and valid certificates that are missing from a validator node should instead be fetched and populated internally before responding to the get_collections request. If there is no way to retrieve them, an error will be returned.

In this design doc, 4 different options are presented; we are going to implement option 1 for now:
https://docs.google.com/document/d/1DpwOPgI-YSIlt2SrQaT3ioF6G-rlNJoYtHls3eZogkw/edit#heading=h.rgoyj9yf0kb8

As part of this task we should focus on implementing the must-block protocol for fetching the missing certificates. Syncing the missing batches is part of another issue. The block_waiter should be extended to:

  • detect the missing blocks
  • start the must-block protocol to fetch the missing certificates
  • validate the fetched certificates. As duplicates are expected to be received, we should evaluate and process the first valid one and discard the rest.
  • keep and store the valid (unique) certificates (store in certificate_store)
  • put a placeholder to start the process of syncing the batches

It has to be noted that we should allow fetching the batches for the certificates that have already been found, to ensure we can work in parallel while fetching the missing ones.

Narwhal + Tusk: block proposals under a size limit

Most blockchains have a mempool size limit which they use as backpressure.

This is necessary because, as a point of decentralization, most blockchains' blocks are bounded by a size limit, and a block going over the limit is considered invalid.

Tusk commits, through

  1. the graph walk they consist of,
  2. the lack of a limit on individual block sizes,

can consist of a large number of blocks and, transitively, transactions. Even if we assumed we had a way to price those transactions in gas ahead of time (which often requires an execution attempt "in real life"), we do not have a way to present the contents of the mempool under a certain size limit. Would it be possible to design one?

/cc @gdanezis

[remove_collections] create module to orchestrate and remove collections on primary

Story

Parent story: #70

Description

To remove collections from a Narwhal node, the following should happen on the primary for each collection (ignore ordering for now):

  • delete the certificate from the certificates_store
  • delete the header from the headers_store
  • delete the batch references from the payload_store
  • trigger requests to downstream worker nodes to delete the batches from their internal batch_store

to achieve the above, a mechanism / module should be built to orchestrate the deletion of all those resources. As part of this task we should focus mainly on orchestrating the deletion of batches from the worker nodes. At a high level the primary should:

  • retrieve the collections one by one from its internal store
  • identify the batches included into the header
  • send messages to delete the batches to the corresponding worker nodes
  • gather the replies and identify a successful deletion per collection

then the results can be used in further processes to perform deletions on the primary node's storage itself.

Dealing with the actual deletion of the batches from the worker nodes should be addressed in a different issue.

[Narwhal as a mempool] Expose block / collection information to the primary

This is a sub-part of #44 .
We should expose the ability for the primary to read the contents of a collection (a Narwhal block) locally.
The transactions for a block:

  • arrive at workers of a primary,
  • are grouped there as batches,
  • are grouped at the primary - and proposed there - as a block / collection.

Reading the collection should therefore involve mapping all of this in reverse:

  • which workers have related batches (from the block's header),
  • connecting to those and re-constituting the data locally.

An extension of this issue will include a cache at the primary, see #44.

Upstream issue:
facebookresearch/narwhal#13
Related issue:
MystenLabs/sui#5184
Note:
Worker fault tolerance will require data-tracking at the primary, and C&C for replication of workers' data at other workers. It would be nice if it could extend the rails built here.

[get_collections] fetch missing collections - implement batch sync

Story

Parent story: #67

Description

Following the implementation of fetching the missing certificates from peer nodes #97 , as part of this task we should focus on syncing the missing batches. The approach that should be followed is the one described in Option 1 on the design doc.

We should reuse the existing mechanism to sync batches, with the difference that we'll extend it to dictate the peer from which we want to fetch the batches, since in the current solution a batch is fetched via the original author of the header. As part of this ticket we should also extend the block_waiter to wait for the batches to be synced before requesting them to eventually fulfil the request.

Block remover test failures on DB creation

Description

The block remover tests may execute concurrently and create databases that conflict on the same path.

One solution would be to make the tests operate in different directories using https://crates.io/crates/tempfile, or whatever we already use in the code base.

Steps to reproduce

Run the block remover tests


running 2 tests
test src/block_remover.rs - block_remover::BlockRemover (line 94) ... FAILED
test src/block_waiter.rs - block_waiter::BlockWaiter (line 122) ... ok

failures:

---- src/block_remover.rs - block_remover::BlockRemover (line 94) stdout ----
Test executable failed (exit code 101).

stderr:
thread 'main' panicked at 'Failed creating database: RocksDBError("IO error: While lock file: /tmp/LOCK: Resource temporarily unavailable")', src/block_remover.rs:25:10
stack backtrace:
   0: rust_begin_unwind
             at /rustc/ee915c34e2f33a07856a9e39be7e35e648bfbd5d/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/ee915c34e2f33a07856a9e39be7e35e648bfbd5d/library/core/src/panicking.rs:143:14
   2: core::result::unwrap_failed
             at /rustc/ee915c34e2f33a07856a9e39be7e35e648bfbd5d/library/core/src/result.rs:1785:5
   3: core::result::Result<T,E>::expect
   4: rust_out::main::{{closure}}
   5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   6: <core::pin::Pin<P> as core::future::future::Future>::poll
   7: tokio::runtime::basic_scheduler::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
   8: tokio::coop::with_budget::{{closure}}
   9: std::thread::local::LocalKey<T>::try_with
  10: std::thread::local::LocalKey<T>::with
  11: tokio::runtime::basic_scheduler::CoreGuard::block_on::{{closure}}::{{closure}}
  12: tokio::runtime::basic_scheduler::Context::enter
  13: tokio::runtime::basic_scheduler::CoreGuard::block_on::{{closure}}
  14: tokio::runtime::basic_scheduler::CoreGuard::enter::{{closure}}
  15: tokio::macros::scoped_tls::ScopedKey<T>::set
  16: tokio::runtime::basic_scheduler::CoreGuard::enter
  17: tokio::runtime::basic_scheduler::CoreGuard::block_on
  18: tokio::runtime::basic_scheduler::BasicScheduler::block_on
  19: tokio::runtime::Runtime::block_on
  20: rust_out::main
  21: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.



failures:
    src/block_remover.rs - block_remover::BlockRemover (line 94)

test result: FAILED. 1 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 17.37s

error: test failed, to rerun pass '--doc'

[remove_collections] receive and handle messages to delete batches on workers

Story

Parent story: #70

Description

Following the work on #80, as part of this task a mechanism should be built into worker nodes to actually delete, from their internal batch_store, the batches dictated by the received message. The worker node should reply to the primary with the success or error outcome of the operation.

It has to be noted that the operations on the storage should be atomic, to guarantee that either all deletions succeed or none do (see multi_remove).

[read_causal] introduce endpoint to retrieve collection digests

Epic: #44

In order to allow our customers to interface with our mempool we need to introduce a series of endpoints. One of those is read_causal, which takes a collection digest and returns the collection digests encountered in a BFS of the graph, up to a commit boundary. At a high level this endpoint will look something like:

read_causal( collection_boundary: Digest ) -> Vec<Digest>

The behaviour of this endpoint at a high level (see the sketch after this list):

  • Start from the last "checkpoint" that was previously read
  • Fetch collection digests by performing the BFS traversal (see MystenLabs/sui#5187), probably up to a limit
  • Return the results
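
A minimal sketch of the walk in Rust, assuming a simple adjacency-map representation of the DAG (all types here are placeholders):

use std::collections::{HashMap, HashSet, VecDeque};

type Digest = [u8; 32];

// Sketch of the BFS walk: from `start`, follow parent links and collect every
// digest reachable before hitting the commit boundary. `parents` is a
// stand-in for the DAG's adjacency information.
fn read_causal(
    parents: &HashMap<Digest, Vec<Digest>>,
    start: Digest,
    commit_boundary: &HashSet<Digest>,
) -> Vec<Digest> {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([start]);
    let mut result = Vec::new();

    while let Some(digest) = queue.pop_front() {
        if !seen.insert(digest) || commit_boundary.contains(&digest) {
            continue;
        }
        result.push(digest);
        if let Some(ps) = parents.get(&digest) {
            queue.extend(ps.iter().copied());
        }
    }
    result
}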

Assume batches can be stored on workers with ids other than the author's

In the current implementation, when a worker forms a new batch it broadcasts it to the workers with the same id at the other validators. That assumption is propagated across the codebase, meaning that irrespective of which primary node we query, we assume a specific batch will be found at the worker with the same id (which is basically dictated by the propagated header).

The above assumption makes scaling quite restrictive, and eventually we want it removed. In our network, the primary node should itself keep the mapping of which local worker a batch has been stored at, basically decoupling the storing/query semantics from the ones the author produces (worker id). It has to be noted that a batch is identified by two parameters: the batch digest and the worker_id (author). We still want to be able to identify batches by both properties, but our primary node should be able to translate that to the local worker where the batch is stored.

Cryptographic agility

The Celo implementation supports BLS12-377 for validator signatures only : https://docs.celo.org/validator-guide/summary

Sommelier uses Tendermint Core, which supports either of Ed25519 signatures or Secp256k1 ECDSA signatures: https://docs.tendermint.com/v0.35/spec/core/encoding.html#public-key-cryptography

Our implementation currently considers only one variant: Ed25519 signatures. As a consequence, the code base will need enough crypto agility to support signing with multiple types of keypairs.

This umbrella issue aims at referring further issues on the topic.

  • a first baseline is a generic API for signing and verification,
  • we may also want to support batch verification for those signature schemes that do have it,
  • we may also want to support aggregated / accountable signature verification for those signature schemes that do have it (hi Celo),

Prior APIs that allow this:
https://github.com/diem/diem/blob/main/crypto/crypto/src/traits.rs (though YMMV)

[tusk] disable tusk via command line argument

Story: #74

As part of this task we should be able to disable Tusk and just run a primary node without any consensus algorithm, and run a test to see how the node responds. A command line argument should be introduced to enable/disable Tusk (consensus). When disabled, the produced certificates should still be consumed but basically not processed.
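
A sketch of what the flag could look like using the clap derive API (the flag name and wiring are illustrative):

use clap::Parser;

// Hypothetical node options; only the consensus toggle is shown.
#[derive(Parser)]
struct NodeOpts {
    /// Run the primary without the Tusk consensus layer; certificates are
    /// still formed into a DAG but not ordered.
    #[clap(long)]
    consensus_disabled: bool,
}

fn main() {
    let opts = NodeOpts::parse();
    if opts.consensus_disabled {
        // Boot the "lightweight", DAG-only path.
    } else {
        // Boot the full Tusk consensus.
    }
}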

[construct identifiers] form an identifier by using arbitrary data

Description

In various places in our codebase we need to construct request / lookup keys based on some other arbitrary (but deterministically defined) data. One example is the construct_request_key method in the block_remover component, which forms a key from the provided block (certificate) ids. Those ids are used to form a unique request id which is further used in map lookups. This is currently done in a somewhat naive way, by just concatenating the id bytes, which is not ideal since:

  • it leads to variable length keys
  • it can lead to quite big keys

As part of this task we want to define a way of creating more concise keys (a sketch follows below).

Note: Ideally we would like to implement a proper request/response pattern so we don't have to handle this logic on the core application level.
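
One possible direction is to hash the concatenated ids into a fixed-length key, e.g. with SHA-256 via the sha2 crate (a sketch; the codebase may prefer its own fastcrypto hash types):

use sha2::{Digest, Sha256};

// Hypothetical replacement for the naive concatenation: hash the ids into a
// fixed-length 32-byte key. Length-prefixing each id keeps the encoding
// unambiguous; sorting makes the key independent of input order.
fn construct_request_key(ids: &[Vec<u8>]) -> [u8; 32] {
    let mut sorted: Vec<&Vec<u8>> = ids.iter().collect();
    sorted.sort();

    let mut hasher = Sha256::new();
    for id in sorted {
        hasher.update((id.len() as u64).to_le_bytes());
        hasher.update(id);
    }
    hasher.finalize().into()
}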

[virtual dag] Preserve the utility of the `read_causal` after a graph cut

Epic

We would like to make sure the customer can implement a priority-based block proposal.

This is described in detail in this talk: https://drive.google.com/drive/u/1/folders/1xf8wJdQTsN_nvlMMOOGztIDdg81EAatJ

Description

  • we would like read_causal to make complete graph walks in between two remove_collections calls (the latter signals committed data that should move off the mempool),
  • the remove_collections call therefore needs to remove suffixes of the graph, to avoid graph cuts,
  • this forces block proposals to be FIFO in the first place.

Solution

  • we should have a remove_collections call that removes the storage of transactions but keeps the headers,
  • this may make header storage in the DAG grow arbitrarily,
  • we would like to walk the DAG more efficiently, with a "path compression" feature: any path that goes "through" a removed collection should "jump" it and go straight to the next node.

[get_collections] block_waiter to call block_synchronizer for fetching missing blocks

Story

Parent story: #67

Description

Currently the block_waiter returns an error when a certificate is not found. As part of this task the block_waiter should be modified to leverage the block_synchronizer to fetch the missing blocks. The process will be best effort, since we might still fail to fetch the missing blocks. To avoid unnecessary latency, the block_waiter should continue normal processing for the blocks it has found, and pick up the missing blocks from the block_synchronizer once they become available.

[remove_collections] introduce endpoint to remove collections

Epic: #44

In order to ensure that:

  1. Our customers will only receive collections that haven't already been consumed and proposed (or discarded)
  2. We reclaim space in Narwhal nodes

we are looking to introduce an endpoint that will allow a validator to remove the specified collections from both the internal DAG and the data store. At the moment more details can be found in the doc here.

The endpoint at a high level will look like:

remove_collections(collection_refs: Vec<CollectionDigest>) -> Result<(), Error>

where the collection_refs is a vector with the collection digests that should be removed.

[tusk] allow Narwhal run without Tusk

Currently Tusk runs by default in the repository, with no option to disable it. All the information is kept in its internal DAG. However, not all of our customers will want to run Narwhal with Tusk - in fact, at the moment no-one will immediately want to. For that reason we should refactor the codebase to:

  1. Give the ability to run a node with or without Tusk
  2. When running without Tusk, the DAG should still be available so that our customers can perform read_causal requests #69 and remove_collections #70. Those 2 issues are blocked until this work is finished

On a high level we should aim to:

  • allow the node to be run with a command line option to disable Tusk
  • extract the common DAG parts of Tusk and create a "lightweight" Tusk component that lets us form the DAG and advance rounds
  • be able to boot the node with the above component

[get_collections]: improve the endpoint to receive a list of digests

Story: #67

In the first iteration we aimed to retrieve only one collection, for simplicity. As part of this task we want to improve the functionality and make the endpoint receive a list of digests (collection ids) and return the corresponding collections. That will save roundtrips and improve overall performance.

[remove_collections] introduce retry logic for failed worker requests

Story

#70

Description

In order to improve the block_remover's robustness, we want to retry a few times to fetch a batch from a worker in case the request has failed or no response has been received in time.

As part of this task this logic should be implemented on the block_remover. The number of retries should be configurable.

[network] be able to send error messages across network

Story: #67

At the moment there is no way to pass error messages across the network between nodes (primary <-> primary, worker <-> worker, or primary <-> worker). So far there hasn't been any need to propagate errors between nodes. However, as we progress with features that demand request-response communication, this becomes necessary.

As part of this issue, a common way to wrap and send errors across the network should be provided for any node communication. The way Sui handles errors, with its corresponding SuiError type, is provided as a reference.

Missing crypto "lego blocks"

We already have identified two missing bits of our cryptographic inventory story:

  1. we need to abstract signature and verification in a scheme-agnostic manner (#7)¹
  2. we need to have a leader election mechanism for Tusk, which looks like a random beacon, as far as the paper is concerned (#10)

This umbrella issue is to map out the inventory of the cryptographic building blocks we need as a minimalistic MVP of Narwhal:

  • a slack discussion hinted that a VRF might be sufficient for leader election rather than a full-on random beacon,
  • what should we do w.r.t aggregation and / or batch verification, to make sure we have "pipes" (code paths) that exploit those primitives where available, (low prio, may be seen as an extension of #7 )
  • how does that interact with quorum changes?

The lighter our "footprint" the better: an anti-example of this design task would be to require a random beacon fed by a Joint-Feldman DKG, which:

  • is a synchronous algorithm that's implemented on an asynchronous network, with large delays "simulating" the synchronicity,
  • this comes at a huge operations cost in dealing with a public bulletin board, emulating time delays with the chain, etc.,
  • if the customer doesn't run this already, they may not be happy being tied in hours of baby-sitting the DKG in Discord channels w/ their node runners
  • JF is only secure for feeding re-keyable signature schemes, which may not mesh well with any customer that cannot implement a pairings-based scheme.

Footnotes

  1. this in fact maps to the "forever" constraint that, insofar as we integrate Narwhal/Tusk into other chains / code bases, we'll need to operate within their validators' crypto tooling environment.

Implement Consensus fault-tolerance through memory-mapped files

Along with the certificate DB:

narwhal/primary/src/core.rs

Lines 281 to 284 in 46f2457

// Store the certificate.
let bytes = bincode::serialize(&certificate).expect("Failed to serialize certificate");
self.store.write(certificate.digest().to_vec(), bytes).await;

The only state that the consensus needs to recover is the last_voted_round table:
https://github.com/MystenLabs/narwhal/blob/main/consensus/src/lib.rs#L32

This fixed-sized element can be recovered from a memory-mapped file.
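
A sketch of the idea using the memmap2 crate, assuming the table is serialized into a fixed-size region (file name, size, and layout are illustrative):

use memmap2::MmapMut;
use std::fs::OpenOptions;

const TABLE_SIZE: u64 = 4096; // hypothetical fixed-size region for the table

fn open_state_file(path: &str) -> std::io::Result<MmapMut> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;
    file.set_len(TABLE_SIZE)?;
    // Safety: we assume this process is the only one mapping the file.
    unsafe { MmapMut::map_mut(&file) }
}

// Synchronous consensus state write; flush forces the page to disk.
fn persist_round(mmap: &mut MmapMut, round: u64) -> std::io::Result<()> {
    mmap[..8].copy_from_slice(&round.to_le_bytes());
    mmap.flush()
}

// Recovery flow: read the fixed-size state back after a restart.
fn recover_round(mmap: &MmapMut) -> u64 {
    u64::from_le_bytes(mmap[..8].try_into().unwrap())
}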

This PR should implement:

  • the synchronous consensus state writes,
  • the recovery flow

[safety] Reject high rounds in block headers

The sanitize_header function processes all headers received by a primary:

narwhal/primary/src/core.rs

Lines 328 to 340 in 03e91f8

fn sanitize_header(&mut self, header: &Header<PublicKey>) -> DagResult<()> {
    ensure!(
        self.gc_round < header.round,
        DagError::TooOld(header.id.clone(), header.round)
    );
    // Verify the header's signature.
    header.verify(&self.committee)?;
    // TODO [issue #3]: Prevent bad nodes from sending junk headers with high round numbers.
    Ok(())
}

It should reject blocks with a claimed round much higher than the current block round.

This block round is managed using an AtomicU64 in the core's fields:

consensus_round: Arc<AtomicU64>,

A simple but large integer interval between a received header's claimed round and the current round should be enough to prevent spam (the main goal of this issue). Determining what "large enough" looks like will require analyzing how fast the block round evolves over time. The idea is that a small gap in round numbers is acceptable: though such a header will not be voted on, it is not invalid, and after verification it should contribute to an increase of the node's own round number.

The error produced here should clearly indicate the cause of the rejection, and produce a new variant in DagError:

pub enum DagError {

Indeed, at worst this could indicate that the local authority is an out-of-sync node that hasn't caught up with the network.

It should also emit logs, ideally with context allowed since the switch to tracing (#26).
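
A sketch of the check, assuming a configurable gap and a hypothetical DagError::TooNew variant:

use std::sync::atomic::Ordering;

// Hypothetical bound; "large enough" needs tuning against observed round growth.
const MAX_ROUND_GAP: u64 = 1_000;

// Inside sanitize_header, after the signature check:
let current_round = self.consensus_round.load(Ordering::Relaxed);
ensure!(
    header.round <= current_round + MAX_ROUND_GAP,
    // Hypothetical new DagError variant for far-future rounds.
    DagError::TooNew(header.id.clone(), header.round)
);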

[non-synthetic client transactions]

At the moment, Narwhal only verifies its own transactions and votes.

This is problematic because in production the payload of those messages is itself a set of transactions (whether received through a mempool or directly by workers), and those should, at a minimum, have their signatures verified before hitting the proposer.

We should support the following as user signatures:

But that also includes subtleties like EIP-155

(see also CIP-35)

Footnotes

  1. reminder: go has a single type, ECDSA, to represent public keys on both secp256k1 and secp256r1, and dynamically injects the correct verification impl at runtime.

[virtual dag] Node addition

On the coattails of #85, implement:

  • keeping pointers to the heads of the DAG,
  • adding Nodes and maintaining the head pointers,
  • maintaining an association table from Collection Digests to nodes.

Smarter subscriber

Story

This task is part of the light client story #95

Description

The way the subscriber works increases commit latency, since we wait for consensus output before starting to download batches at the executor. Can't we use the DAG to start earlier than consensus and ideally be ready to execute the moment consensus outputs? From @LefKok
