
raft's Introduction


Overview

Raft is a protocol with which a cluster of nodes can maintain a replicated state machine. The state machine is kept in sync through the use of a replicated log. The details of the Raft protocol are outside the scope of this document; for more details on Raft, see In Search of an Understandable Consensus Algorithm.

Why another library?

The Raft algorithm was designed to be an understandable consensus algorithm. Unfortunately, most of the Go libraries out there require deep knowledge of their implementation and APIs.

This raft library was born to align with Raft's understandability principle; its sole purpose is to provide consensus through a minimalistic, simple, clean, and idiomatic API.

Etcd Raft is the most widely used Raft library in production. However, it follows a minimalistic design philosophy and implements only the core Raft algorithm, which leaves gaps and ambiguities.

So, instead of reinventing the wheel, this library uses etcd raft as its core.

That way you benefit from the power and stability of etcd raft through an understandable API, keeping your focus on building awesome software.

Finally, this raft library was originally intended for internal use, but it is worth exposing to the public.

Features

This raft implementation is a full-featured implementation of the Raft protocol. Features include:

  • Manage Multi-Raft
  • Coalesced heartbeats to reduce the overhead of heartbeats when there are a large number of raft groups
  • Leader election
  • Log replication
  • Log compaction
  • Pre-Vote Protocol
  • Membership changes
    • add member
    • remove member
    • update member
    • promote member
    • demote member
  • Leadership transfer extension
  • Efficient linearizable read-only queries served by both the leader and followers
    • the leader checks with a quorum and bypasses the Raft log before processing read-only queries
    • followers ask the leader for a safe read index before processing read-only queries
  • More efficient lease-based linearizable read-only queries served by both the leader and followers
    • the leader bypasses the Raft log and processes read-only queries locally
    • followers ask the leader for a safe read index before processing read-only queries
    • this approach relies on the clocks of all the machines in the raft group
  • Snapshots
    • automatic snapshots when the log store reaches a certain length
    • API to force new snapshot
  • Read-Only Members
    • learner member
    • staging member
  • Segmented WAL to provide durability and ensure data integrity
  • Garbage collector to control how many WAL and snapshot files are retained
  • Network transport to communicate with Raft on remote machines (see the sketch after this list)
    • gRPC (recommended)
    • http
  • gRPC chunked transfer encoding
  • Network Pipelining
  • Restore cluster quorum and data
    • force new cluster from the existing WAL and snapshot
    • restore the cluster from snapshot
  • Optimistic pipelining to reduce log replication latency
  • Flow control for log replication
  • Batching Raft messages to reduce synchronized network I/O calls
  • Batching log entries to reduce synchronized disk I/O calls
  • Writing to leader's disk in parallel
  • Internal proposal redirection from followers to leader
  • Automatic stepping down when the leader loses quorum
  • Protection against unbounded log growth when quorum is lost
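
The network transport is pluggable; the gRPC transport (recommended) is registered once per process, and the node's Raft handler is served by a standard gRPC server. The sketch below draws on the gRPC setup that appears in the issues section further down; the raftgrpc and transport package paths are assumptions:

// Assumed imports: net, log, google.golang.org/grpc,
// github.com/shaj13/raft/transport and .../transport/raftgrpc.
raftgrpc.Register(
	raftgrpc.WithDialOptions(grpc.WithInsecure()),
)
node := raft.NewNode(<FSM>, transport.GRPC)

// Expose the node's Raft handler over gRPC so other members can reach it.
srv := grpc.NewServer()
raftgrpc.RegisterHandler(srv, node.Handler())
lis, err := net.Listen("tcp", ":8081")
if err != nil {
	log.Fatal(err)
}
go srv.Serve(lis)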

WALs and snapshots

There are two sets of files on disk that provide persistent state for Raft. There is a set of WAL (write-ahead log) files. These store a series of log entries and Raft metadata, such as the current term, index, and committed index. WAL files are automatically rotated when they reach a certain size.

To avoid having to retain every entry in the history of the log, snapshots serialize a view of the state at a particular point in time. After a snapshot gets taken, logs that predate the snapshot are no longer necessary, because the snapshot captures all the information that's needed from the log up to that point. The number of old snapshots and WALs to retain is configurable.
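
For illustration, snapshotting and retention are typically tuned through node options at startup. The option names below are hypothetical placeholders used only to sketch the idea; check the go docs for the exact options:

// Hypothetical option names, shown only to illustrate configurable snapshotting
// and retention; consult the go docs for the real API.
node := raft.NewNode(<FSM>, <Transport>,
	raft.WithStateDIR("/var/lib/app/raft"), // where WAL and snapshot files are kept (assumed option)
	raft.WithSnapshotInterval(1000),        // snapshot roughly every 1000 applied entries (assumed option)
)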

WALs mostly contain protobuf-serialized user data store modifications. A log entry can contain a batch of creations, updates, and deletions of objects from the user data store. Some log entries contain other kinds of metadata, like node additions or removals. Snapshots contain a complete dump of the store, as well as any metadata from the log entries that needs to be preserved. The saved metadata includes the Raft term and index, a list of nodes in the cluster, and a list of nodes that have been removed from the cluster.

Raft IDs

The library uses integers to identify Raft nodes. A Raft ID may be assigned dynamically when a node joins the Raft consensus group, or it can be defined manually by the user.

It's important to note that a Raft ID can't be reused after the node that was using it leaves the consensus group. The Raft IDs of nodes that are no longer part of the cluster are saved (persisted on disk) as part of the node pool membership to make sure they aren't reused. If a node with a removed Raft ID tries to use Raft RPCs, other nodes won't honor these requests.

The removed nodes' IDs are retained to prevent those nodes from communicating and affecting the cluster state, and to avoid ambiguity.
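
Removing a member therefore permanently retires its ID. A sketch, assuming a RemoveMember-style API (RemoveMember is mentioned in the issues below, but the exact signature used here is an assumption):

// Remove member 3 from the cluster; its Raft ID stays in the member pool
// marked as removed and can't be reused. The signature is an assumption.
// Assumed imports: context, log, time.
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if err := node.RemoveMember(ctx, 3); err != nil {
	log.Fatal(err)
}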

Initializing a Raft cluster

The first member of a cluster assigns itself a random Raft ID unless one is predefined. It creates a new WAL with its own Raft identity stored in the metadata field. The metadata field is the only part of the WAL that differs between nodes. By storing information such as the local Raft ID, it's easy to restore this node-specific information after a restart. In principle it could be stored in a separate file, but embedding it inside the WAL is most convenient.

The node then starts the Raft state machine. From this point, it's a fully functional single-node Raft instance. Writes to the data store actually go through Raft, though this is a trivial case because reaching consensus doesn't involve communicating with any other nodes.
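
Proposing a write looks the same on a single-node cluster as on a larger one; only the quorum size differs. A minimal sketch, assuming the node exposes a Replicate-style proposal method (the method name and signature are assumptions; check the go docs):

// Propose a state-machine change; it is considered applied once a quorum
// (here, only this node) has committed it. Replicate is an assumed method name.
// Assumed imports: context, log, time.
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if err := node.Replicate(ctx, []byte("x=1")); err != nil {
	log.Fatal(err)
}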

Joining a Raft cluster

New nodes can join an existing Raft consensus group by invoking the Join RPC on any Raft member if proposal forwarding is enabled; otherwise, the Join RPC must be invoked on the leader. If successful, Join returns a Raft ID for the new node and a list of the other members of the consensus group.

On the leader side, Join tries to append a configuration change entry to the Raft log, and waits until that entry becomes committed.

A new node creates an empty Raft log with its own node information in the metadata field. Then it starts the state machine. By running the Raft consensus protocol, the leader will discover that the new node doesn't have any entries in its log, and will synchronize these entries to the new node through some combination of sending snapshots and log entries. It can take a little while for a new node to become a functional member of the consensus group, because it needs to receive this data first.

The new node can join the cluster as a Voter, Learner, or Staging member.

Initializing a predefined Raft cluster

The library also provides a mechanism to initialize and boot the Raft cluster from a predefined member configuration. This is done by applying the same configuration to all members, even members that join later.

Each member of the cluster uses its predefined ID and creates a new WAL with its own Raft identity stored in the metadata field. The member node then starts the Raft state machine.

Once all member nodes are started, the election process is initiated, and when all members agree on the same leader the cluster becomes fully functional. Writes to the data store actually go through Raft and are considered complete after reaching a majority.

Usage

The primary object in raft is a Node. Either start a Node from scratch using raft.WithInitCluster() or raft.WithJoin(), or start a Node from some previous state using raft.WithRestart().
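
The <FSM> placeholder in the examples below is the application state machine. As a rough sketch, assuming the state machine interface consists of Apply, Snapshot, and Restore methods (the exact interface is an assumption; check the go docs), a minimal in-memory key/value FSM might look like this:

// kvFSM is a hypothetical minimal state machine; the interface shape
// (Apply/Snapshot/Restore) is assumed here, see the go docs for the exact one.
// Assumed imports: bytes, encoding/json, io, strings, sync.
type kvFSM struct {
	mu   sync.Mutex
	data map[string]string
}

func newKVFSM() *kvFSM {
	return &kvFSM{data: map[string]string{}}
}

// Apply mutates the state with a committed log entry, e.g. "key=value".
func (f *kvFSM) Apply(entry []byte) {
	f.mu.Lock()
	defer f.mu.Unlock()
	parts := strings.SplitN(string(entry), "=", 2)
	if len(parts) == 2 {
		f.data[parts[0]] = parts[1]
	}
}

// Snapshot serializes the current state for log compaction.
func (f *kvFSM) Snapshot() (io.ReadCloser, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	buf, err := json.Marshal(f.data)
	if err != nil {
		return nil, err
	}
	return io.NopCloser(bytes.NewReader(buf)), nil
}

// Restore loads state from a snapshot, e.g. on a fresh or lagging node.
func (f *kvFSM) Restore(rc io.ReadCloser) error {
	defer rc.Close()
	f.mu.Lock()
	defer f.mu.Unlock()
	return json.NewDecoder(rc).Decode(&f.data)
}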

To start a three-node cluster from predefined configuration:

Node A

m1 := raft.RawMember{ID: 1, Address: ":8081"}
m2 := raft.RawMember{ID: 2, Address: ":8082"}
m3 := raft.RawMember{ID: 3, Address: ":8083"}
node := raft.NewNode(<FSM>, <Transport>, <Opts>)
// The first member should reference the current effective member (the local node).
node.Start(raft.WithInitCluster(), raft.WithMembers(m1, m2, m3))

Node B

m1 := raft.RawMember{ID: 1, Address: ":8081"}
m2 := raft.RawMember{ID: 2, Address: ":8082"}
m3 := raft.RawMember{ID: 3, Address: ":8083"}
node := raft.NewNode(<FSM>, <Transport>, <Opts>)
// The first member should reference the current effective member (the local node).
node.Start(raft.WithInitCluster(), raft.WithMembers(m2, m1, m3))

Node C

m1 := raft.RawMember{ID: 1, Address: ":8081"}
m2 := raft.RawMember{ID: 2, Address: ":8082"}
m3 := raft.RawMember{ID: 3, Address: ":8083"}
node := raft.NewNode(<FSM>, <Transport>, <Opts>)
// The first member should reference the current effective member (the local node).
node.Start(raft.WithInitCluster(), raft.WithMembers(m3, m1, m2))

Start a single node cluster, like so:

m := raft.RawMember{ID: 1, Address: ":8081"}
opt := raft.WithMembers(m)
// or 
opt = raft.WithAddress(":8081")
node := raft.NewNode(<FSM>, <Transport>, <Opts>)
node.Start(raft.WithInitCluster(), opt)

To allow a new node to join a cluster, like so:

m := raft.RawMember{ID: 2, Address: ":8082"}
opt := raft.WithMembers(m)
// or 
opt = raft.WithAddress(":8082")
node := raft.NewNode(<FSM>, <Transport>, <Opts>)
node.Start(raft.WithJoin(":8081", time.Second), opt)

To restart a node from previous state:

node := raft.NewNode(<FSM>, <Transport>, <Opts>)
node.Start(raft.WithRestart())

To force new cluster:

node := raft.NewNode(<FSM>, <Transport>, <Opts>)
// This will use the latest WAL and snapshot.
node.Start(raft.WithForceNewCluster())

To restore from snapshot:

node := raft.NewNode(<FSM>, <Transport>, <Opts>)
node.Start(raft.WithRestore("<path to snapshot file>"))

Examples and docs

  • More detailed development documentation can be found in the go docs.
  • Fully working single-node and multi-raft cluster examples can be found in the Examples folder.

Contributing to this project

We welcome contributions. If you find any bugs, potential flaws and edge cases, improvements, new feature suggestions or discussions, please submit issues or pull requests.

raft's People

Contributors

aderouineau · dependabot[bot] · fredpetersen · shaj13

raft's Issues

Production fitness

Hello @shaj13

Thanks for this great library. It has helped a lot in understanding etcd raft. However, I would like to know whether this library has already been used in production.

Extend grpc api

Hi @shaj13 ,
Looks like the current gRPC protos do not expose cluster management APIs, for example removing a failed node from the cluster.
Are there any plans on doing so?
Thank you

Remove dead node

Is it possible to remove a dead node from the cluster?
RemoveMember seems to just mark the node as removed.
Thanks

example cannot work under http protocol

I found a critical error in transport.Dialer when using transport.HTTP instead of transport.GRPC. The address parameter looks like 127.0.0.1:8080. rafthttp.Dialer passes this address to rafthttp.client, so problems occur when the leader tries to communicate with followers. rafthttp.client.Message processes the client request via http.NewRequestWithContext in the rafthttp.client.requestProto method, and the request fails because the URL scheme is missing from the address we passed (127.0.0.1:8080) to the dialer.

For more detail, see net/url.Parse; a URL scheme is required.

Got rpc error: code = Unimplemented desc = unexpected HTTP status code received from server: 404 (Not Found)

Hello @shaj13

I am getting the following error:

ERROR: raft.membership: sending message to member 24669d2775c8b223: rpc error: code = Unimplemented desc = unexpected HTTP status code received from server: 404 (Not Found); transport: received unexpected content-type "text/plain; charset=utf-8"

I am running the application on Kubernetes with 2 replicas.

These are the logs from the two Pods:

Pod-1:

{"level":"debug","ts":1683374970.9820142,"msg":"raft node URL=:3100"}
{"level":"debug","ts":1683374972.0948684,"msg":"kubernetes has discovered 1 nodes"}
2023/05/06 12:09:32 INFO: 61cca80dec1edadd switched to configuration voters=()
2023/05/06 12:09:32 INFO: 61cca80dec1edadd became follower at term 0
2023/05/06 12:09:32 INFO: newRaft 61cca80dec1edadd [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2023/05/06 12:09:32 INFO: 61cca80dec1edadd became follower at term 1
2023/05/06 12:09:32 INFO: 61cca80dec1edadd switched to configuration voters=(7047192294677469917)
2023/05/06 12:09:32 INFO: 61cca80dec1edadd switched to configuration voters=(7047192294677469917)
2023/05/06 12:09:33 INFO: 61cca80dec1edadd is starting a new election at term 1
2023/05/06 12:09:33 INFO: 61cca80dec1edadd became candidate at term 2
2023/05/06 12:09:33 INFO: 61cca80dec1edadd received MsgVoteResp from 61cca80dec1edadd at term 2
2023/05/06 12:09:33 INFO: 61cca80dec1edadd became leader at term 2
2023/05/06 12:09:33 INFO: raft.node: 61cca80dec1edadd elected leader 61cca80dec1edadd at term 2
2023/05/06 12:10:00 INFO: 61cca80dec1edadd switched to configuration voters=(2622956625795265059 7047192294677469917)
2023/05/06 12:10:01 ERROR: raft.membership: sending message to member 24669d2775c8b223: rpc error: code = Unimplemented desc = unexpected HTTP status code received from server: 404 (Not Found); transport: received unexpected content-type "text/plain; charset=utf-8"

Pod-2:

{"level":"debug","ts":1683374999.3820262,"msg":"raft node URL=:3100"}
{"level":"debug","ts":1683375000.8786376,"msg":"kubernetes has discovered 2 nodes"}
2023/05/06 12:10:01 INFO: 24669d2775c8b223 switched to configuration voters=()
2023/05/06 12:10:01 INFO: 24669d2775c8b223 became follower at term 0
2023/05/06 12:10:01 INFO: newRaft 24669d2775c8b223 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2023/05/06 12:10:01 INFO: 24669d2775c8b223 became follower at term 1
2023/05/06 12:10:01 INFO: 24669d2775c8b223 switched to configuration voters=(2622956625795265059)
2023/05/06 12:10:01 INFO: 24669d2775c8b223 switched to configuration voters=(2622956625795265059 7047192294677469917)
2023/05/06 12:10:01 INFO: 24669d2775c8b223 switched to configuration voters=(2622956625795265059 7047192294677469917)
2023/05/06 12:10:01 INFO: 24669d2775c8b223 switched to configuration voters=()
2023/05/06 12:10:01 INFO: 24669d2775c8b223 became follower at term 0
2023/05/06 12:10:01 INFO: newRaft 24669d2775c8b223 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]

I am sure I am doing something wrong

TODO

TODO

  • add tcp rpc
  • check the conf change v2
  • Repair WAL
  • add more logs.

How do I get notified of the change of leader?

I want to implement a distributed KV system with data sharding based on consistent hashing, which requires re-sharding and transferring cluster data after the cluster leader changes. I see that etcd does not provide a notification mechanism, but hashicorp/raft does provide LeaderCh and NotifyCh to actively notify the application layer that the leader has changed.

In this system, in addition to the above functionality, a pre-vote mechanism is required, but it is not implemented in hashicorp/raft, so eventually I have to resort to etcd/raft. I'd like to ask whether your raft implementation has such a notification mechanism, or whether there are other, better suggestions.

Log is being applied on restart

Hello,
Is this expected behavior?
After a node is restarted, I see it fetch all the data from the leader.
I'd expect it to get only the diff.
Thanks

Nodes try to self connect when given predefined config

Hi there and thanks for sharing this project. I'm keen to introduce Raft into one of my projects in which there is a predefined cluster. During boot I'm getting the following error on two of the three nodes:

panic: raft.membership: attempted to send msg to local member; should never happen

Here is the relevant code

func raftInit() {
	raftgrpc.Register(
		raftgrpc.WithDialOptions(grpc.WithInsecure()),
	)
	fsm = newstateMachine()
	node = raft.NewNode(fsm, transport.GRPC)
	raftServer = grpc.NewServer()
	raftgrpc.RegisterHandler(raftServer, node.Handler())

	m1 := raft.RawMember{ID: 1, Address: "host1:8081"}
	m2 := raft.RawMember{ID: 2, Address: "host2:8081"}
	m3 := raft.RawMember{ID: 3, Address: "host3:8081"}

	go func() {
		lis, err := net.Listen("tcp", ":8081")
		if err != nil {
			log.Fatal(err)
		}

		err = raftServer.Serve(lis)
		if err != nil {
			log.Fatal(err)
		}
	}()

	go func() {
		err := node.Start(raft.WithInitCluster(), raft.WithMembers(m1, m2, m3))
		if err != nil {
			log.Fatal(err)
		}
	}()
}

Not sure if this is misconfigured on my side or if there's an issue here. Would greatly appreciate your assistance.
