
dsync's Introduction

This project has moved to https://github.com/minio/minio/tree/master/pkg/dsync

dsync

A distributed locking and syncing package for Go.

Introduction

dsync is a package for performing distributed locking over a network of n nodes. It is designed with simplicity in mind and hence offers limited scalability (n <= 32). Each node is connected to all other nodes, and lock requests from any node are broadcast to all connected nodes. A node succeeds in getting the lock if n/2 + 1 nodes (which may or may not include itself) respond positively; for example, with n = 8 a lock is granted once 5 nodes respond positively. Once acquired, the lock can be held for as long as the client desires and must be released afterwards. The release is then broadcast to all nodes, after which the lock becomes available again.

Motivation

This package was developed for the distributed server version of Minio Object Storage. For this we needed a distributed locking mechanism for up to 32 servers that each would be running minio server. The locking mechanism itself should be a reader/writer mutual exclusion lock meaning that it can be held by a single writer or an arbitrary number of readers.

For minio the distributed version is started as follows (for a 6-server system):

$ minio server http://server1/disk http://server2/disk http://server3/disk http://server4/disk http://server5/disk http://server6/disk

(note that the identical command should be run on each of the servers server1 through server6)

Design goals

  • Simple design: by keeping the design simple, many tricky edge cases can be avoided.
  • No master node: there is no concept of a master node; with a master, locking would come to a complete stop whenever the master went down. (Unless you add a slave node, but that adds yet more complexity.)
  • Resilient: if one or more nodes go down, the other nodes should not be affected and can continue to acquire locks (provided not more than n/2 - 1 nodes are down).
  • Drop-in replacement for sync.RWMutex and supports sync.Locker interface.
  • Automatically reconnect to (restarted) nodes.

Restrictions

  • Limited scalability: up to 32 nodes.
  • Fixed configuration: changes in the number of nodes and/or their network names/IP addresses require a restart of all nodes in order to take effect.
  • If a down node comes up, it will not try to (re)acquire any locks that it may have held.
  • Not designed for high performance applications such as key/value stores.

Performance

  • Supports a total of up to 7,500 locks/second for 16 nodes (at 10% CPU usage per server) on moderately powerful server hardware.
  • Successful lock requests should not take longer than 1 ms (given a decent network connection of 1 Gbit/s or more between the nodes).

The tables below show detailed performance numbers.

Performance with varying number of nodes

This table shows test performance on the same (EC2) instance type but with a varying number of nodes:

EC2 Instance Type   Nodes   Locks/server/sec        Total Locks/sec   CPU Usage
c3.2xlarge          4       (min=3110, max=3376)    12972             25%
c3.2xlarge          8       (min=1884, max=2096)    15920             25%
c3.2xlarge          12      (min=1239, max=1558)    16782             25%
c3.2xlarge          16      (min=996, max=1391)     19096             25%

The min and max locks/server/sec figures gradually decline, but due to the larger number of nodes the overall total number of locks rises steadily (at the same CPU usage level).

Performance with different instance types

This table shows test performance for a fixed number of 8 nodes on different EC2 instance types:

EC2 Instance Type      Nodes   Locks/server/sec        Total Locks/sec   CPU Usage
c3.large (2 vCPU)      8       (min=823, max=896)      6876              75%
c3.2xlarge (8 vCPU)    8       (min=1884, max=2096)    15920             25%
c3.8xlarge (32 vCPU)   8       (min=2601, max=2898)    21996             10%

With the rise in the number of cores the CPU load decreases and overall performance increases.

Stress test

Stress test on a c3.8xlarge (32 vCPU) instance type:

EC2 Instance Type   Nodes   Locks/server/sec        Total Locks/sec   CPU Usage
c3.8xlarge          8       (min=2601, max=2898)    21996             10%
c3.8xlarge          8       (min=4756, max=5227)    39932             20%
c3.8xlarge          8       (min=7979, max=8517)    65984             40%
c3.8xlarge          8       (min=9267, max=9469)    74944             50%

The system can be pushed to 75K locks/sec at 50% CPU load.

Usage

NOTE: dsync was previously initialized with dsync.Init([]NetLocker, nodeIndex); this has been changed to dsync.New([]NetLocker, nodeIndex), which returns a *Dsync object to be used in every instance of NewDRWMutex("test", *Dsync).
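
For illustration, initialization could look roughly like the sketch below. This is only a sketch: lockClients and thisNodeIndex are placeholders for your own setup, and the assumption that New also returns an error should be verified against the version of the package you use.

// lockClients is a []dsync.NetLocker with one entry per participating node,
// and thisNodeIndex is the index of the local node; both are placeholders.
ds, err := dsync.New(lockClients, thisNodeIndex)
if err != nil {
	log.Fatal(err)
}

// The same *Dsync is then handed to every distributed mutex in this process.
dm := dsync.NewDRWMutex(context.Background(), "test", ds)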

Exclusive lock

Here is a simple example showing how to protect a single resource (drop-in replacement for sync.Mutex):

import (
	"context"
	"log"
	"time"

	"github.com/minio/dsync/v3"
)

func lockSameResource() {

	// Create distributed mutex to protect resource 'test'
	// (ds is the *Dsync instance obtained from dsync.New, see the NOTE above)
	dm := dsync.NewDRWMutex(context.Background(), "test", ds)

	dm.Lock("lock-1", "example.go:505:lockSameResource()")
	log.Println("first lock granted")

	// Release 1st lock after 5 seconds
	go func() {
		time.Sleep(5 * time.Second)
		log.Println("first lock unlocked")
		dm.Unlock()
	}()

	// Try to acquire lock again, will block until initial lock is released
	log.Println("about to lock same resource again...")
	dm.Lock("lock-1", "example.go:515:lockSameResource()")
	log.Println("second lock granted")

	time.Sleep(2 * time.Second)
	dm.Unlock()
}

which gives the following output:

2016/09/02 14:50:00 first lock granted
2016/09/02 14:50:00 about to lock same resource again...
2016/09/02 14:50:05 first lock unlocked
2016/09/02 14:50:05 second lock granted

Read locks

DRWMutex also supports multiple simultaneous read locks as shown below (analogous to sync.RWMutex)

func twoReadLocksAndSingleWriteLock() {

	drwm := dsync.NewDRWMutex(context.Background(), "resource", ds)

	drwm.RLock("RLock-1", "example.go:416:twoReadLocksAndSingleWriteLock()")
	log.Println("1st read lock acquired, waiting...")

	drwm.RLock("RLock-2", "example.go:420:twoReadLocksAndSingleWriteLock()")
	log.Println("2nd read lock acquired, waiting...")

	go func() {
		time.Sleep(1 * time.Second)
		drwm.RUnlock()
		log.Println("1st read lock released, waiting...")
	}()

	go func() {
		time.Sleep(2 * time.Second)
		drwm.RUnlock()
		log.Println("2nd read lock released, waiting...")
	}()

	log.Println("Trying to acquire write lock, waiting...")
	drwm.Lock("Lock-1", "example.go:445:twoReadLocksAndSingleWriteLock()")
	log.Println("Write lock acquired, waiting...")

	time.Sleep(3 * time.Second)

	drwm.Unlock()
}

which gives the following output:

2016/09/02 15:05:20 1st read lock acquired, waiting...
2016/09/02 15:05:20 2nd read lock acquired, waiting...
2016/09/02 15:05:20 Trying to acquire write lock, waiting...
2016/09/02 15:05:22 1st read lock released, waiting...
2016/09/02 15:05:24 2nd read lock released, waiting...
2016/09/02 15:05:24 Write lock acquired, waiting...

Basic architecture

Lock process

The basic steps in the lock process are as follows:

  • broadcast lock message to all n nodes
  • collect all responses within certain time-out window
    • if quorum met (minimally n/2 + 1 responded positively) then grant lock
    • otherwise release all underlying locks and try again after a (semi-)random delay
  • release any locks that (still) came in after the time-out window
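
The following sketch illustrates this loop. The lockNode/unlockNode helpers and the delay are placeholders for the actual RPC calls and timing in drwmutex.go, not the package's real code.

import (
	"math/rand"
	"time"
)

// lockNode and unlockNode stand in for the RPC Lock/Unlock calls to one node.
func lockNode(node, resource string) bool { return false }
func unlockNode(node, resource string)    {}

func acquireLock(nodes []string, resource string) {
	quorum := len(nodes)/2 + 1
	for {
		// Broadcast the lock request to all n nodes and collect the positive
		// responses (the real code bounds this with a time-out window).
		granted := make([]string, 0, len(nodes))
		for _, node := range nodes {
			if lockNode(node, resource) {
				granted = append(granted, node)
			}
		}
		// Quorum met: the lock is granted.
		if len(granted) >= quorum {
			return
		}
		// Otherwise release the underlying locks that were granted and try
		// again after a (semi-)random delay.
		for _, node := range granted {
			unlockNode(node, resource)
		}
		time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
	}
}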

Unlock process

The unlock process is really simple:

  • broadcast unlock message to all nodes that granted lock
  • if a destination is not available, retry with a gradually longer back-off window so the message is still delivered
  • ignore the 'result' (this covers the case where the destination node has gone down and come back up)
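
A sketch of this retry behaviour is shown below; sendUnlock and the back-off constants are placeholders, not the package's actual implementation.

import "time"

// sendUnlock stands in for the RPC Unlock call to one node; it returns false
// when the destination could not be reached.
func sendUnlock(node, resource string) bool { return true }

func releaseAll(granted []string, resource string) {
	// Broadcast the unlock message to every node that granted the lock.
	for _, node := range granted {
		go func(node string) {
			backoff := 50 * time.Millisecond
			for !sendUnlock(node, resource) {
				// Destination not available: retry with a gradually longer
				// back-off window; the result itself is ignored.
				time.Sleep(backoff)
				if backoff < 5*time.Second {
					backoff *= 2
				}
			}
		}(node)
	}
}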

Dealing with Stale Locks

A 'stale' lock is a lock that is left behind at a node when the client that originally acquired the lock either:

  • never released the lock (due to eg a crash) or
  • is disconnected from the network and henceforth not able to deliver the unlock message.

Too many stale locks can prevent a new lock on a resource from being acquired, that is, if the sum of the stale locks and the number of down nodes is greater than n/2 - 1. In dsync a recovery mechanism is implemented to remove stale locks (see here for the details).

Known deficiencies

Known deficiencies can be divided into two categories, namely a) more than one write lock being granted and b) a lock not becoming available anymore.

More than one write lock

So far we have identified one case in which this can happen (example for an 8-node system):

  • 3 nodes are down (say 6, 7, and 8)
  • node 1 acquires a lock on "test" (nodes 1 through to 5 giving quorum)
  • node 4 and 5 crash (dropping the lock)
  • nodes 4 through to 8 restart
  • node 4 acquires a lock on "test" (nodes 4 through to 8 giving quorum)

Now we have two concurrent locks on the same resource name, which violates the core requirement. Note that if just a single server out of nodes 4 and 5 crashes, we are still fine because the second lock cannot acquire quorum.

This table summarizes the conditions for different configurations during which this can happen:

Nodes   Down nodes   Crashed nodes   Total down + crashed
4       1            2               3
8       3            2               5
12      5            2               7
16      7            2               9

(for more info see testMultipleServersOverQuorumDownDuringLockKnownError in chaos.go)

Lock not available anymore

This would be due to too many stale locks and/or too many servers being down (totalling more than n/2 - 1). The following table shows the maximum tolerable number of stale locks plus down nodes for different node counts:

Nodes   Max tolerable
4       1
8       3
12      5
16      7

If you see any other shortcomings, we would be interested in hearing about them.

Tackled issues

  • When two nodes want to acquire the same lock at precisely the same time, it is possible for both to acquire just n/2 locks each, so that there is no majority winner. Both will report failure back to their clients and will retry later after a semi-randomized delay.

Server side logic

On the server side just the following logic needs to be added (barring some extra error checking):

const WriteLock = -1

type lockServer struct {
	mutex   sync.Mutex
	lockMap map[string]int64 // Map of locks, with negative value indicating (exclusive) write lock
	                         // and positive values indicating number of read locks
}

func (l *lockServer) Lock(args *LockArgs, reply *bool) error {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	if _, *reply = l.lockMap[args.Name]; !*reply {
		l.lockMap[args.Name] = WriteLock // No locks held on the given name, so claim write lock
	}
	*reply = !*reply // Negate *reply to return true when lock is granted or false otherwise
	return nil
}

func (l *lockServer) Unlock(args *LockArgs, reply *bool) error {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	var locksHeld int64
	if locksHeld, *reply = l.lockMap[args.Name]; !*reply { // No lock is held on the given name
		return fmt.Errorf("Unlock attempted on an unlocked entity: %s", args.Name)
	}
	if *reply = locksHeld == WriteLock; !*reply { // Unless it is a write lock
		return fmt.Errorf("Unlock attempted on a read locked entity: %s (%d read locks active)", args.Name, locksHeld)
	}
	delete(l.lockMap, args.Name) // Remove the write lock
	return nil
}

If you also want RLock()/RUnlock() functionality, then add this as well:

const ReadLock = 1

func (l *lockServer) RLock(args *LockArgs, reply *bool) error {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	var locksHeld int64
	if locksHeld, *reply = l.lockMap[args.Name]; !*reply {
		l.lockMap[args.Name] = ReadLock // No locks held on the given name, so claim (first) read lock
		*reply = true
	} else {
		if *reply = locksHeld != WriteLock; *reply { // Unless there is a write lock
			l.lockMap[args.Name] = locksHeld + ReadLock // Grant another read lock
		}
	}
	return nil
}

func (l *lockServer) RUnlock(args *LockArgs, reply *bool) error {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	var locksHeld int64
	if locksHeld, *reply = l.lockMap[args.Name]; !*reply { // No lock is held on the given name
		return fmt.Errorf("RUnlock attempted on an unlocked entity: %s", args.Name)
	}
	if *reply = locksHeld != WriteLock; !*reply { // A write-lock is held, cannot release a read lock
		return fmt.Errorf("RUnlock attempted on a write locked entity: %s", args.Name)
	}
	if locksHeld > ReadLock {
		l.lockMap[args.Name] = locksHeld - ReadLock // Remove one of the read locks held
	} else {
		delete(l.lockMap, args.Name) // Remove the (last) read lock
	}
	return nil
}
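
To make the lock server reachable by the other nodes, it could be exposed over net/rpc roughly as follows. This is a sketch only: the service name, HTTP paths and port are assumptions, see dsync-server_test.go for the actual wiring.

import (
	"log"
	"net/http"
	"net/rpc"
)

func startLockServer() {
	locker := &lockServer{lockMap: make(map[string]int64)}

	// The service name, paths and port below are illustrative placeholders.
	server := rpc.NewServer()
	if err := server.RegisterName("Dsync", locker); err != nil {
		log.Fatal(err)
	}
	server.HandleHTTP("/dsync", "/dsync-debug")
	log.Fatal(http.ListenAndServe(":12345", nil))
}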

See dsync-server_test.go for a full implementation.

Sub projects

  • See performance directory for performance measurements
  • See chaos directory for some edge cases

Testing

The full test code (including benchmarks) from sync/rwmutex_test.go is used for testing purposes.

Extensions / Other use cases

Robustness vs Performance

It is possible to trade some level of robustness for overall performance by not contacting each node for every Lock()/Unlock() cycle. In the normal case (example for n = 16 nodes) a total of 32 RPC messages is sent, and the lock is granted if at least a quorum of n/2 + 1 nodes respond positively. When all nodes are functioning normally this means n = 16 positive responses, which is n/2 - 1 = 7 responses over the (minimum) quorum of n/2 + 1 = 9. So you could say that this is somewhat overkill: even if 6 nodes are down you still have one node more than the quorum.

For this case it is possible to reduce the number of nodes to be contacted to, for example, 12. Instead of 32 RPC messages, now 24 messages will be sent, which is 25% less. As the performance mostly depends on the number of RPC messages sent, the total locks/second handled by all nodes would increase by 33% (given the same CPU load).

You do however want to make sure that you have some sort of 'random' selection of which 12 out of the 16 nodes will participate in every lock. See here for some sample code that could help with this.
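
As an illustration only (not the mechanism used by dsync), such a deterministic 'random' selection could be derived from the resource name, so that every node independently arrives at the same subset:

import (
	"hash/crc32"
	"math/rand"
)

// selectNodes deterministically picks k of the given nodes for a resource by
// seeding a PRNG with a hash of the resource name.
func selectNodes(nodes []string, resource string, k int) []string {
	seed := int64(crc32.ChecksumIEEE([]byte(resource)))
	r := rand.New(rand.NewSource(seed))
	selected := make([]string, 0, k)
	for _, idx := range r.Perm(len(nodes))[:k] {
		selected = append(selected, nodes[idx])
	}
	return selected
}

The same idea carries over to selecting, say, 28 out of 64 nodes as discussed in the next section.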

Scale beyond 32 nodes?

Building on the previous example and depending on how resilient you want to be for outages of nodes, you can also go the other way, namely to increase the total number of nodes while keeping the number of nodes contacted per lock the same.

For instance you could imagine a system of 64 nodes where only a quorum majority of 17 would be needed out of 28 nodes. Again this requires some sort of pseudo-random 'deterministic' selection of 28 nodes out of the total of 64 servers (same example as above).

Other techniques

We are well aware that there are more sophisticated systems such as ZooKeeper, Raft, etc. However, we found that for our limited use case they added too much complexity. So if dsync does not meet your requirements, then you are probably better off using one of those systems.

Other links that you may find interesting:

Performance of net/rpc vs grpc

We did an analysis of the performance of net/rpc vs grpc, see here, so we'll stick with net/rpc for now.

License

Released under the Apache License v2.0. You can find the complete text in the file LICENSE.

Contributing

Contributions are welcome, please send PRs for any enhancements.

dsync's People

Contributors

balamurugana, fwessels, harshavardhana, kannappanr, krishnasrinivas, krisis, quasilyte, smola, tkfftk


dsync's Issues

tests: go test -race reports races.

WARNING: DATA RACE
Write by goroutine 102:
  runtime.slicecopy()
      /home/harsha/.gimme/versions/go1.6.2.linux.amd64/src/runtime/slice.go:113 +0x0
  github.com/minio/dsync.(*DRWMutex).Lock()
      /home/harsha/mygo/src/github.com/minio/dsync/drwmutex.go:132 +0x37e
  github.com/minio/dsync.TestThreeSimultaneousLocksForSameResource.func2()
      /home/harsha/mygo/src/github.com/minio/dsync/dmutex_test.go:253 +0x8d

Previous write by goroutine 188:
  github.com/minio/dsync.(*DRWMutex).Unlock()
      /home/harsha/mygo/src/github.com/minio/dsync/drwmutex.go:315 +0x2d5
  github.com/minio/dsync.TestThreeSimultaneousLocksForSameResource.func3.1()
      /home/harsha/mygo/src/github.com/minio/dsync/dmutex_test.go:271 +0x110

Goroutine 102 (running) created at:
  github.com/minio/dsync.TestThreeSimultaneousLocksForSameResource()
      /home/harsha/mygo/src/github.com/minio/dsync/dmutex_test.go:259 +0x6ed
  testing.tRunner()
      /home/harsha/.gimme/versions/go1.6.2.linux.amd64/src/testing/testing.go:473 +0xdc

Goroutine 188 (finished) created at:
  github.com/minio/dsync.TestThreeSimultaneousLocksForSameResource.func3()
      /home/harsha/mygo/src/github.com/minio/dsync/dmutex_test.go:272 +0x7f

Sending unlock messages in a go-routine makes ``DRWMutex.[R]Lock()`` racy

DRWMutex's [R]Lock() method sends unlock messages (see sendRelease) to the nodes that granted the lock when a quorum of positive responses wasn't received. This is executed concurrently with subsequent lock retries until quorum is met. The following order of events could result in more RPC messages being exchanged amongst the nodes synchronizing via dsync.DRWMutex than is optimal.

Consider a set of 4 nodes (i.e, 4 lock servers).

  1. Lock("test") -> [true, false, false, true]; Quorum not met.
  2. Unlock before locking is retried -> [true, _, _, rpc.ErrShutdown], where "_" represents "don't care"
  3. Attempting lock on "test" again -> [true, _, _, blocked]; i.e., this lock message raced the unlock message from Step 2.
    This is not protected by the underlying TCP's message ordering since connectLazy replaces rpcClient.rpc with a newer socket after every failed Call (or failed rpc.DialHTTP).

Fix all golint issues on dsync

$ golint ./...
drwmutex.go:50:6: exported type Granted should have comment or be unexported
drwmutex.go:52:2: struct field lockUid should be lockUID
drwmutex.go:63:6: exported type LockArgs should have comment or be unexported
drwmutex.go:72:1: exported method LockArgs.SetToken should have comment or be unexported
drwmutex.go:76:1: exported method LockArgs.SetTimestamp should have comment or be unexported
drwmutex.go:80:1: exported function NewDRWMutex should have comment or be unexported
drwmutex.go:167:4: var bytesUid should be bytesUID
drwmutex.go:281:9: if block ends with a return statement, so drop this else and outdent its block
dsync.go:36:1: comment on exported function Init should be of the form "Init ..."
chaos/chaos-server.go:31:1: comment on exported const LockMaintenanceLoop should be of the form "LockMaintenanceLoop ..."
chaos/chaos-server.go:38:7: exported const LockCheckValidityInterval should have comment or be unexported
chaos/chaos.go:510:6: exported type RWLocker should have comment or be unexported
chaos/chaos.go:517:6: exported type DRWMutexNoWriterStarvation should have comment or be unexported
chaos/chaos.go:522:1: exported function NewDRWMutexNoWriterStarvation should have comment or be unexported
chaos/chaos.go:529:1: exported method DRWMutexNoWriterStarvation.Lock should have comment or be unexported
chaos/chaos.go:536:1: exported method DRWMutexNoWriterStarvation.Unlock should have comment or be unexported
chaos/chaos.go:540:1: exported method DRWMutexNoWriterStarvation.RLock should have comment or be unexported
chaos/chaos.go:547:1: exported method DRWMutexNoWriterStarvation.RUnlock should have comment or be unexported
chaos/lock-rpc-server.go:30:38: error strings should not be capitalized or end with punctuation or a newline
chaos/net-rpc-client.go:131:1: exported method ReconnectRPCClient.ServerAddr should have comment or be unexported
chaos/net-rpc-client.go:135:1: exported method ReconnectRPCClient.Resource should have comment or be unexported
performance/net-rpc-client.go:131:1: exported method RPCClient.ServerAddr should have comment or be unexported
performance/net-rpc-client.go:135:1: exported method RPCClient.Resource should have comment or be unexported
performance/performance-server.go:28:38: error strings should not be capitalized or end with punctuation or a newline
performance/performance-server.go:30:7: exported const WriteLock should have comment or be unexported
performance/performance-server.go:77:7: exported const ReadLock should have comment or be unexported

Scalability (curiosity)

Would you mind documenting the shortcomings that limit scalability to 16 nodes? ⚡️
Also, how was this number figured out?

Go module: github.com/minio/dsync/v2@v2.0.0: go.mod has non-.../v2 module path "github.com/minio/dsync" (and .../v2/go.mod does not exist) at revision v2.0.0

I run GO111MODULE=on go mod tidy or GO111MODULE=on go get github.com/minio/dsync/v2@v2.0.0 and get an error message:

go: github.com/minio/dsync/v2@v2.0.0: go.mod has non-.../v2 module path "github.com/minio/dsync" (and .../v2/go.mod does not exist) at revision v2.0.0
go: error loading module requirements

here is a blog: https://blog.golang.org/using-go-modules.

dsync: chaos test fails with "Timed out"

[12345] 17:01:57.648577 
[12345] 17:01:57.648581 **STARTING** testClientThatHasLockCrashes
[12348] 17:01:58.155173 RPC server listening at port 12348 under /dsync-12348
[12348] 17:01:58.257384 Acquired write lock: test-stale (never to be released)
[12345] 17:02:01.148991 Client that has lock crashes; leaving stale locks at other servers
[12345] 17:02:11.149241 Crashed server restarted
[12345] 17:02:11.149278 Trying to get the lock again
[12348] 17:02:11.155769 RPC server listening at port 12348 under /dsync-12348

[12345] 17:03:11.149439 Timed out -- SHOULD NOT HAPPEN
