Code Monkey home page Code Monkey logo

go-caskdb's Introduction

logo

CaskDB - Disk based Log Structured Hash Table Store

made-with-go build codecov GitHub License twitter@iavins

architecture

CaskDB is a disk-based, embedded, persistent, key-value store based on the Riak's bitcask paper, written in Go. It is more focused on the educational capabilities than using it in production. The file format is platform, machine, and programming language independent. Say, the database file created from Go on macOS should be compatible with Rust on Windows.

This project aims to help anyone, even a beginner in databases, build a persistent database in a few hours. There are no external dependencies; only the Go standard library is enough.

If you are interested in writing the database yourself, head to the workshop section.

Features

  • Low latency for reads and writes
  • High throughput
  • Easy to back up / restore
  • Simple and easy to understand
  • Store data much larger than the RAM

Limitations

Most of the following limitations are of CaskDB. However, there are some due to design constraints by the Bitcask paper.

  • Single file stores all data, and deleted keys still take up the space
  • CaskDB does not offer range scans
  • CaskDB requires keeping all the keys in the internal memory. With a lot of keys, RAM usage will be high
  • Slow startup time since it needs to load all the keys in memory

Community

CaskDB Discord

Consider joining the Discord community to build and learn KV Store with peers.

Dependencies

CaskDB does not require any external libraries to run. Go standard library is enough.

Installation

go get github.com/avinassh/go-caskdb

Usage

store, _ := NewDiskStore("books.db")
store.Set("othello", "shakespeare")
author := store.Get("othello")

Cask DB (Python)

This project is a Go version of the same project in Python.

Prerequisites

The workshop is for intermediate-advanced programmers. Knowing basics of Go helps, and you can build the database in any language you wish.

Not sure where you stand? You are ready if you have done the following in any language:

  • If you have used a dictionary or hash table data structure
  • Converting an object (class, struct, or dict) to JSON and converting JSON back to the things
  • Open a file to write or read anything. A common task is dumping a dictionary contents to disk and reading back

Workshop

NOTE: I don't have any workshops scheduled shortly. Follow me on Twitter for updates. Drop me an email if you wish to arrange a workshop for your team/company.

CaskDB comes with a full test suite and a wide range of tools to help you write a database quickly. A Github action is present with an automated tests runner. Fork the repo, push the code, and pass the tests!

Throughout the workshop, you will implement the following:

  • Serialiser methods take a bunch of objects and serialise them into bytes. Also, the procedures take a bunch of bytes and deserialise them back to the things.
  • Come up with a data format with a header and data to store the bytes on the disk. The header would contain metadata like timestamp, key size, and value.
  • Store and retrieve data from the disk
  • Read an existing CaskDB file to load all keys

Tasks

  1. Read the paper. Fork this repo and checkout the start-here branch
  2. Implement the fixed-sized header, which can encode timestamp (uint, 4 bytes), key size (uint, 4 bytes), value size (uint, 4 bytes) together
  3. Implement the key, value serialisers, and pass the tests from format_test.go
  4. Figure out how to store the data on disk and the row pointer in the memory. Implement the get/set operations. Tests for the same are in disk_store_test.go
  5. Code from the task #2 and #3 should be enough to read an existing CaskDB file and load the keys into memory

Run make test to run the tests locally. Push the code to Github, and tests will run on different OS: ubuntu, mac, and windows.

Not sure how to proceed? Then check the hints file which contains more details on the tasks and hints.

Hints

  • Not sure how to come up with a file format? Read the comment in the format file

What next?

I often get questions about what is next after the basic implementation. Here are some challenges (with different levels of difficulties)

Level 1:

  • Crash safety: the bitcask paper stores CRC in the row, and while fetching the row back, it verifies the data
  • Key deletion: CaskDB does not have a delete API. Read the paper and implement it
  • Instead of using a hash table, use a data structure like the red-black tree to support range scans
  • CaskDB accepts only strings as keys and values. Make it generic and take other data structures like int or bytes.

Level 2:

  • Hint file to improve the startup time. The paper has more details on it
  • Implement an internal cache which stores some of the key-value pairs. You may explore and experiment with different cache eviction strategies like LRU, LFU, FIFO etc.
  • Split the data into multiple files when the files hit a specific capacity

Level 3:

  • Support for multiple processes
  • Garbage collector: keys which got updated and deleted remain in the file and take up space. Write a garbage collector to remove such stale data
  • Add SQL query engine layer
  • Store JSON in values and explore making CaskDB as a document database like Mongo
  • Make CaskDB distributed by exploring algorithms like raft, paxos, or consistent hashing

Line Count

$ tokei -f format.go disk_store.go
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Go                      2          320          133          168           19
-------------------------------------------------------------------------------
 format.go                          111           35           67            9
 disk_store.go                      209           98          101           10
===============================================================================
 Total                   2          320          133          168           19
===============================================================================

Contributing

All contributions are welcome. Please check CONTRIBUTING.md for more details.

Community Contributions

Author Feature PR
PaulisMatrix Delete Op #6
PaulisMatrix Checksum #7

License

The MIT license. Please check LICENSE for more details.

go-caskdb's People

Contributors

avinassh avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

go-caskdb's Issues

Implementing Level1 tasks.

  1. Support for deletion: Since its an append only log file, you don't actually go and update the value for the key but rather write a record with the same key but a sort of tombstone value. Once you load this in the KeyDir, the key will updated with the offset of the recent write(your tombstone value) and you can just check for this value.
    Related PR: https://github.com/avinassh/go-caskdb/pull/6/files

  2. CRC: Checksum is calculated based on the value that is being encoded and stored in the header. On requesting a kv record, we just check if the checksum matches i.e value is not corrupted.

  3. Support for expiry: Have also added SetX() call which takes in an expiry time (time.Duration) after which the key will expire. We just note the expiry time in the header and check if they key has expired after X mins/secs.

  4. Multi Data types for values: You can store different data types as values besides strings. This type information is stored in a separate header field called ValueType and check accordingly while decoding the value.
    Note: Faced few issues while encoding a rune and int32 since both are of same type( rune is just an alias for int32) so have to define a custom type UglyRune. So if users want to store rune they need to use this custom type instead.

At this point, our total headersize is 24bytes: CRC(4B) + ExpiryTime(4B) + TimeStamp(4B) + KeySize(4B) + ValueSize(4B) + ValueType(4B)

Refactored bits:
Separate Record struct for kv pair: https://github.com/avinassh/go-caskdb/pull/3/files#diff-4e36647f41c8fcc4fe2791fec1c379ac314053900486e3eb80e116b1c20ea731R101

Different Operations:

type Store interface {
	Get(key string) string
	Set(key string, value interface{}) error
	SetX(key string, value interface{}, expiry time.Duration) error
	Delete(key string) error
	ListKeys(string) <-chan Record
	Close() bool
} 

DiskStore{} implements this interface.

Range scans is not implemented yet since it requires using a whole different ds(balanced trees like red black or avl trees) for storing the kv. Will do it later once I get a whole clarity on it.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.