Tribbler

The goal of this project is to refactor Tribbler, a web service backed by a key-value storage engine designed to run on a single machine, into a scalable and fault-tolerant distributed service.

Scalability

Making Tribbler scalable involves partitioning the entire space of key-value pairs across multiple backend servers. We solve this problem using consistent hashing: each backend server is hashed onto a circular space so that every server occupies a unique position, and with high probability the servers are spread roughly evenly around the circle.
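The following is a minimal sketch of this ring lookup in Go. The Ring type, hashString, and BackendFor are illustrative names chosen here, not the project's actual API, and the real service may hash addresses differently.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring maps hashed backend positions to backend addresses.
type Ring struct {
	points   []uint64          // sorted hash positions of the backends
	backends map[uint64]string // hash position -> backend address
}

func hashString(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func NewRing(addrs []string) *Ring {
	r := &Ring{backends: make(map[uint64]string)}
	for _, a := range addrs {
		p := hashString(a)
		r.points = append(r.points, p)
		r.backends[p] = a
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// BackendFor returns the backend immediately succeeding the bin on the ring.
func (r *Ring) BackendFor(bin string) string {
	h := hashString(bin)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the circular space
	}
	return r.backends[r.points[i]]
}

func main() {
	ring := NewRing([]string{"backend-0:9000", "backend-1:9000", "backend-2:9000"})
	fmt.Println(ring.BackendFor("alice")) // the bin "alice" maps to its successor backend
}
```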

Every incoming request is first directed to a virtual key-value store called a 'bin', which is hashed onto the same circular space as the backend servers and assigned to the server immediately succeeding it on this space. Since multiple bins may share a physical backend, we must differentiate between the key-value pairs of each bin. To do this, we prepend each key with the name of the bin performing the operation, implemented as a wrapper around the storage client.
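A sketch of such a wrapper is shown below, assuming a simple get/set storage interface. The Storage interface, the binClient type, and the "::" separator are illustrative assumptions rather than the project's actual API.

```go
package trib

// Storage is a stand-in for the backend's key-value RPC interface.
type Storage interface {
	Get(key string) (string, error)
	Set(key, value string) error
}

// binClient wraps a backend client and namespaces every key with its bin name,
// so bins that share a physical backend never collide.
type binClient struct {
	bin     string
	backend Storage
}

func (b *binClient) Get(key string) (string, error) {
	return b.backend.Get(b.bin + "::" + key)
}

func (b *binClient) Set(key, value string) error {
	return b.backend.Set(b.bin+"::"+key, value)
}
```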

Consistency

In a scenario where a single user is concurrently using multiple front ends, we must ensure that simultaneous update operations do not result in race conditions that leave data in an inconsistent state. For example, suppose user 'A' currently follows user 'B' and then tries to unfollow user 'B' concurrently on two different front ends. The system must ensure that at most one of these calls succeeds.

To implement these semantics, we use a sequentially consistent log to which all operations are appended. Before returning to the user, we check the log to see whether our operation succeeded or a concurrent client's operation did. The example scenario above is resolved as follows: each front end generates a unique logical clock number and attaches it to its unfollow operation before appending the operation to the log. Once the append completes, each front end reads back the log contents to see whether it won the data race; only the front end that wins the race returns a success status to the client.
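The sketch below illustrates the "did I win the race" check, assuming the log has already been replayed into a slice. LogEntry and wonRace are hypothetical names, not the project's actual types.

```go
package trib

// LogEntry is a hypothetical record appended by a front end.
type LogEntry struct {
	Clock uint64 // unique logical clock value attached by the front end
	Op    string // e.g. "unfollow"
	Who   string // the user performing the operation
	Whom  string // the user being (un)followed
}

// wonRace reports whether the entry tagged with myClock is the first matching
// operation in the replayed log; only that front end returns success.
func wonRace(log []LogEntry, op, who, whom string, myClock uint64) bool {
	for _, e := range log {
		if e.Op == op && e.Who == who && e.Whom == whom {
			return e.Clock == myClock // the earliest matching entry wins the race
		}
	}
	return false
}
```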

Ordering

The system requirements dictate an ordering among tribs (i.e. tweets) posted by users. To enforce this ordering, we assign each trib a 'logical timestamp', a monotonically increasing number maintained by every backend server. All front ends sharing a single backend are therefore guaranteed to see monotonically increasing values every time they generate a logical timestamp. However, these clocks must also be kept synchronized across all backend servers. A dedicated keeper process performs this synchronization every second by polling all backends for the most recent clock value each of them has seen and then updating all backends with the largest such value. When the logical timestamp alone is insufficient to order tribs, we fall back to other fields such as physical timestamps, user names, and message contents.
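A rough sketch of this synchronization step is below. The ClockService interface, with a Clock(atLeast) call that never goes backwards, is assumed here for illustration and may not match the backends' real RPC interface.

```go
package trib

// ClockService is an illustrative stand-in for a backend's clock RPC.
type ClockService interface {
	// Clock returns a value that is at least atLeast and no smaller than any
	// value this backend has returned before.
	Clock(atLeast uint64) (uint64, error)
}

// syncClocks is run by the keeper roughly once per second: it learns the
// highest clock value any backend has seen, then advances every backend to it.
func syncClocks(backends []ClockService) error {
	var max uint64
	for _, b := range backends {
		c, err := b.Clock(0)
		if err != nil {
			return err
		}
		if c > max {
			max = c
		}
	}
	for _, b := range backends {
		if _, err := b.Clock(max); err != nil {
			return err
		}
	}
	return nil
}
```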

Bin Storage Fault Tolerance

The system requirements dictate that no information may be lost as long as at least three backend servers remain alive. Therefore, we ensure that every operation performed by a bin is replicated on at least three backend servers before returning to the front end servicing the request. Replication may still fail because of a crash that occurs just before it is attempted; to mitigate this, we ensure that the keeper process responsible for the failed backend eventually completes the replication before a subsequent failure occurs. This is possible because the system requirements guarantee a minimum time interval between successive failures. For a given bin, the first three live backends immediately succeeding it on the circular space are designated as its replicas.
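Choosing the replica set could look like the sketch below, which reuses the sorted ring from the consistent-hashing sketch above; start is the index of the backend immediately succeeding the bin's hash position, and isAlive is a hypothetical liveness oracle.

```go
package trib

// replicasFor returns the first three live backends at or after the bin's
// successor position, walking clockwise around the ring.
func replicasFor(start int, ring []string, isAlive func(string) bool) []string {
	var replicas []string
	for i := 0; i < len(ring) && len(replicas) < 3; i++ {
		addr := ring[(start+i)%len(ring)]
		if isAlive(addr) {
			replicas = append(replicas, addr)
		}
	}
	return replicas
}
```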

Also, since write requests must be propagated to multiple backends for durability, we must handle the scenario where these backends see the operations inserted in different orders because of concurrent updates by other bins. Instead of trying to enforce a consistent order at write time, we reconstruct the order of operations whenever they are retrieved. To do this, we attach a unique, monotonically increasing sequence number to every operation before replicating it, such that the operation carries the same sequence number on every replica. This sequence number is obtained by taking the maximum of the logical clocks on all backends that the operation will be replicated on and combining it with the physical timestamp. This ensures that the log can be reconstructed from any of the replicas, and that all of them yield the same consistent view of the operations in the log.
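The sketch below shows one way such a sequence number could be built and used to re-sort a replica's log on reads. SeqID, Op, newSeqID, and reconstruct are illustrative names, and the exact way the logical clock and physical timestamp are combined is an assumption.

```go
package trib

import (
	"sort"
	"time"
)

// SeqID is the replica-independent sequence number attached to an operation
// before it is replicated.
type SeqID struct {
	Logical  uint64 // maximum logical clock across the replicas at write time
	Physical int64  // writer's physical timestamp, used as a tie-breaker
}

// Op is one replicated operation in a bin's log.
type Op struct {
	Seq     SeqID
	Payload string
}

// newSeqID combines the largest logical clock reported by the replicas with
// the local physical timestamp; every replica then stores the same SeqID.
func newSeqID(replicaClocks []uint64) SeqID {
	var maxClock uint64
	for _, c := range replicaClocks {
		if c > maxClock {
			maxClock = c
		}
	}
	return SeqID{Logical: maxClock, Physical: time.Now().UnixNano()}
}

// reconstruct sorts a single replica's log into the globally agreed order.
func reconstruct(log []Op) {
	sort.Slice(log, func(i, j int) bool {
		if log[i].Seq.Logical != log[j].Seq.Logical {
			return log[i].Seq.Logical < log[j].Seq.Logical
		}
		return log[i].Seq.Physical < log[j].Seq.Physical
	})
}
```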

Backend Fault Tolerance

Next, we must ensure that if a backend fails, its key-value pairs are successfully replicated onto a different backend. First, we must be able to detect backend failures. Failure detection is done by the keeper processes: each keeper is responsible for detecting failures of the backends preceding it on the circular space, which is possible because keeper addresses are hashed onto the same circular space as bins and backends. When a keeper detects a failure on one of its backends, it migrates the failed backend's data to three different backends, using the following approach. If the backend at position 'n' on the circular space fails, we know that all of its data is available on the next two live backends ('n+1' and 'n+2'). However, the backends preceding 'n' on the circular space ('n-1' and 'n-2') also have some of their data replicated on 'n', so that data must be copied over to the next two live backends as well.

Since it is important to keep at least three copies of the data available at all times, the keeper migrates backend n's keys from 'n+1' to 'n+3'. However, we must take care not to migrate any extra keys from 'n+1', since it has its own set of exclusive bins. To do this, we exploit the fact that every key on every backend is prefixed with its bin name: we only need to migrate keys belonging to bins that were mapped to the failed backend. This mapping information, which gives us the list of bins for every backend, is durably stored on a different set of backends by bin storage.
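A sketch of this key selection is below, following the "::" prefix convention assumed in the earlier wrapper sketch. keysToMigrate and the shape of the bin-to-backend mapping are illustrative, not the project's actual API.

```go
package trib

import "strings"

// keysToMigrate filters backend n+1's keys down to those belonging to bins
// that were mapped to the failed backend, so only those keys are copied to n+3.
func keysToMigrate(allKeys []string, failedBins map[string]bool) []string {
	var out []string
	for _, k := range allKeys {
		// Keys are stored as "<bin>::<key>"; recover the bin name from the prefix.
		if i := strings.Index(k, "::"); i >= 0 && failedBins[k[:i]] {
			out = append(out, k)
		}
	}
	return out
}
```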

Keeper Fault Tolerance

Finally, we must handle keeper failures. Keepers track each other's liveness: each keeper periodically polls its immediate live successor on the circular space over RPC. When a keeper detects another keeper's failure, it must check whether the failed keeper was in the process of migrating a backend's data to a different backend. If so, and if that backend now belongs to the keeper that detected the failure, it must take on the responsibility of ensuring that the replication operation is successfully completed.

To handle this correctly, we ensure that before a keeper proceeds with any replication operation, it appends a unique entry describing the operation to a log; once the replication has completed successfully, it writes a matching completion entry. A keeper that detects a failure can then scan this log for operations that may not have been completed and take them over if the backends involved are now its responsibility.
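A minimal sketch of this write-ahead pattern is below; MigrationRecord, MigrationLog, and unfinished are hypothetical names used only for illustration.

```go
package trib

// MigrationRecord describes one replication operation a keeper intends to
// perform (Done == false) or has finished (Done == true).
type MigrationRecord struct {
	ID   string // unique identifier for this migration
	From string // failed backend whose data is being moved
	To   string // destination backend
	Done bool   // false for the "about to perform" entry, true for the completion entry
}

// MigrationLog is a stand-in for the durable log the keepers share.
type MigrationLog interface {
	Append(rec MigrationRecord) error
	Entries() ([]MigrationRecord, error)
}

// unfinished returns migrations that were started but never marked done, so a
// surviving keeper can redo them if those backends are now its responsibility.
func unfinished(entries []MigrationRecord) []MigrationRecord {
	done := make(map[string]bool)
	for _, e := range entries {
		if e.Done {
			done[e.ID] = true
		}
	}
	var out []MigrationRecord
	for _, e := range entries {
		if !e.Done && !done[e.ID] {
			out = append(out, e)
		}
	}
	return out
}
```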

