Code Monkey home page Code Monkey logo

b-plus-tree's Introduction

B+Tree implementation in Kotlin

Overview

This project is my learning project for B+Tree (also my first Kotlin experience). It is heavily inspired by InnoDB. So far the list below is implemented in total 963 lines of code:

  • Basic get/put/delete/scan
  • Split page at the middle
  • Read page from memory if it exists, otherwise read from the file
  • Write page to the file synchronously

I stopped here since I had learned a lot. The list below is lots of TODO:

  • Merge pages
  • Logical delete
  • Split by different strategy
  • Asynchronous write + WAL
  • MVCC
  • etc.

Resources

Schema

This B+Tree implementation is mostly generic/schema free. You can use any byte array representation for key and value, but I use Apache Avro for demo purpose since it is sortable even in byte array form (i.e. without decoding)

Demo

First, initialize the db file:

$ ./runsample init foo.db

Then, you can get/put/delete/scan the B+Tree (The default schema is under /sample directory):

$ ./runsample open foo.db
COMMAND: put
KEY: {"name": "foo", "timestamp": 1}
VALUE: {"value": "bbb"}

COMMAND: get
KEY: {"name": "foo", "timestamp": 1}
{"name": "foo", "timestamp": 1} {"value": "bbb"}

COMMAND: put
KEY: {"name": "foo", "timestamp": 10}
VALUE: {"value": "bbb"}

COMMAND: scan
START KEY: {"name": "foo", "timestamp": 0}
END KEY: {"name": "foo", "timestamp": 100}
{"name": "foo", "timestamp": 1} {"value": "bbb"}
{"name": "foo", "timestamp": 10} {"value": "bbb"}

COMMAND: put
KEY: {"name": "bar", "timestamp": 10}
VALUE: {"value": "bbb"}

COMMAND: put
KEY: {"name": "bar", "timestamp": 1}
VALUE: {"value": "bbb"}

COMMAND: scan
START KEY: {"name": "foo", "timestamp": 0}
END KEY: {"name": "foo", "timestamp": 100}
{"name": "foo", "timestamp": 1} {"value": "bbb"}
{"name": "foo", "timestamp": 10} {"value": "bbb"}

COMMAND: scan
START KEY: {"name": "bar", "timestamp": 0}
END KEY: {"name": "foo", "timestamp": 100}
{"name": "bar", "timestamp": 1} {"value": "bbb"}
{"name": "bar", "timestamp": 10} {"value": "bbb"}
{"name": "foo", "timestamp": 1} {"value": "bbb"}
{"name": "foo", "timestamp": 10} {"value": "bbb"}

Internal

Tree

This is a class for B+Tree itself. It supports the operations below:

  • get(key)
    1. Find appropriate LeafNode
    2. Get KeyValue from the LeafNode
  • put(key, value)
    1. Find appropriate LeafNode
    2. Put KeyValue into the LeafNode and sync with the file
    3. If the Page of the LeafNode is full, split it
      • Add the new LeafNode to the same parent
    4. If any InternalNode is also full, split it
      • Add the new InternalNode to the same parent
      • Sync changed pages with the file
    5. If RootNode is full, split it
      • Add new Nodes as children
        • Left node's minimum key is logical minimum key (0 bytes)
      • Sync changed pages with the file
  • delete(key)
    1. Find appropriate LeafNode
    2. Delete KeyValue from the LeafNode and sync with the file
  • scan(startKey, endKey)
    1. Find appropriate LeafNode for the startKey and endKey
    2. Filter KeyValues inside the LeafNode between startKey and endKey
    3. Traverse linked list of LeafNodes until the LeafNode for endKey

Node

This is a (set of) class for Node of B+Tree.

LeafNode

This class stores all KeyValue on backend Page.

  • get(key)
    1. Loop KeyValue list on Page
    2. Find exactly matching KeyValue
  • put(key, value)
    1. Loop KeyValue list on Page
    2. If there is exactly matching KeyValue, update it
    3. If not, insert it with sorted order
  • delete(key)
    1. Loop KeyValue list on Page
    2. Delete exactly matching KeyValue

InternalNode (extends LeafNode)

This class stores minimum key of each child and its Page id as KeyValue.

  • findChildPageId(key)
    1. Loop KeyValue list on Page
    2. Find a child for the key
      • If k1 <= key < k2, return the value of k1
      • If key < k-left-edge, return the value of k-left-edge
  • addChildNode(node)
    1. Fetch minimum key of the given node
    2. Put the minimum key and node id as KeyValue

RootNode (extends InternalNode)

This class is for the RootNode. RootNode is LeafNode for the first time, then it becomes InternalNode after the first split. So, it inherits both.

  • splitRoot()
    1. Create a new Page for left LeafNode and move all KeyValue there
    2. Split the left LeafNode i.e. a new right LeafNode is created

Page

This is a class for Page of B+Tree. Page is backend data structure for all Node classes. PageData is Avro record:

{
  "name": "PageData",
  "type": "record",
  "fields": [
    {"name": "id", "type": "PageId"},
    {"name": "nodeType", "type": "NodeType"},
    {"name": "previousId", "type": "PageId"},
    {"name": "nextId", "type": "PageId"},
    {"name": "records", "type": {"type": "array", "items": "KeyValue"} }
  ]
}

So, the Page is encoded/decoded as Avro's binary format (single-object encoding).

It supports mutational operations below and calculates encoded byte size to check whether the Page is full:

  • insert(index, keyvalue)
  • update(index, keyvalue)
  • delete(index)

FileManager

This class is for managing file read and write. The file has fixed size (128 bytes) metadata at the beginning, then each Page (256 bytes) is stored ordered by Page id.

  • allocate()
    1. Fetch next free Page id and update it
      • So far, this is always the next of last Page. If merge is implemented, merged pages are reclaimed.
    2. Create a new Page for the id on memory
  • read(id)
    1. Seek to the first byte for the Page id
    2. Read all bytes for the Page (1KB) and return decoded Page object
  • write(page)
    1. Seek to the first byte for the Page id
    2. Write encoded Page

PageManager

This class is for managing Page lifecycle. It has memory pool for pages (currently just a Map).

  • get(id)
    1. If it is in the pool, return it
    2. If not, read from FileManager and also cache it
  • allocate()
    1. Allocate a new Page from FileManager and cache it
  • split(page)
    1. Find split point (currently middle of size)
    2. Delete right side KeyValue list
    3. Allocate a new Page with deleted KeyValue list
    4. Connect Linked list between Nodes
    5. Write to the file

Not optimized split

Currently, there is only one split strategy: middle of size. This is the best approach for random insert since a new KeyValue comes random place and half split pages could be filled later.

However, if the insert is in order (both ascending and descending), it is not optimized. Because a new KeyValue is always inserted to one node, rest of nodes are left with half usage.

InnoDB has an optimization to solve this problem (insertion order). See this great post.

Random insert

RootNode(id:0 size=93 records=14)
    key=
    LeafNode(id:1 size=163 keys=[00:01, 00:02, 00:03])
    key=00:04
    LeafNode(id:11 size=163 keys=[00:04, 00:05, 00:06])
    key=00:07
    LeafNode(id:7 size=209 keys=[00:07, 00:08, 00:09, 00:0a])
    key=00:0b
    LeafNode(id:9 size=163 keys=[00:0b, 00:0c, 00:0d])
    key=00:0e
    LeafNode(id:5 size=209 keys=[00:0e, 00:0f, 00:10, 00:11])
    key=00:12
    LeafNode(id:10 size=163 keys=[00:12, 00:13, 00:14])
    key=00:15
    LeafNode(id:3 size=163 keys=[00:15, 00:16, 00:17])
    key=00:18
    LeafNode(id:14 size=163 keys=[00:18, 00:19, 00:1a])
    key=00:1b
    LeafNode(id:2 size=255 keys=[00:1b, 00:1c, 00:1d, 00:1e, 00:1f])
    key=00:20
    LeafNode(id:12 size=209 keys=[00:20, 00:21, 00:22, 00:23])
    key=00:24
    LeafNode(id:4 size=163 keys=[00:24, 00:25, 00:26])
    key=00:27
    LeafNode(id:13 size=163 keys=[00:27, 00:28, 00:29])
    key=00:2a
    LeafNode(id:6 size=209 keys=[00:2a, 00:2b, 00:2c, 00:2d])
    key=00:2e
    LeafNode(id:8 size=255 keys=[00:2e, 00:2f, 00:30, 00:31, 00:32])

Ascending ordered insert

RootNode(id:0 size=103 records=16)
    key=
    LeafNode(id:1 size=163 keys=[00:01, 00:02, 00:03])
    key=00:04
    LeafNode(id:2 size=163 keys=[00:04, 00:05, 00:06])
    key=00:07
    LeafNode(id:3 size=163 keys=[00:07, 00:08, 00:09])
    key=00:0a
    LeafNode(id:4 size=163 keys=[00:0a, 00:0b, 00:0c])
    key=00:0d
    LeafNode(id:5 size=163 keys=[00:0d, 00:0e, 00:0f])
    key=00:10
    LeafNode(id:6 size=163 keys=[00:10, 00:11, 00:12])
    key=00:13
    LeafNode(id:7 size=163 keys=[00:13, 00:14, 00:15])
    key=00:16
    LeafNode(id:8 size=163 keys=[00:16, 00:17, 00:18])
    key=00:19
    LeafNode(id:9 size=163 keys=[00:19, 00:1a, 00:1b])
    key=00:1c
    LeafNode(id:10 size=163 keys=[00:1c, 00:1d, 00:1e])
    key=00:1f
    LeafNode(id:11 size=163 keys=[00:1f, 00:20, 00:21])
    key=00:22
    LeafNode(id:12 size=163 keys=[00:22, 00:23, 00:24])
    key=00:25
    LeafNode(id:13 size=163 keys=[00:25, 00:26, 00:27])
    key=00:28
    LeafNode(id:14 size=163 keys=[00:28, 00:29, 00:2a])
    key=00:2b
    LeafNode(id:15 size=163 keys=[00:2b, 00:2c, 00:2d])
    key=00:2e
    LeafNode(id:16 size=255 keys=[00:2e, 00:2f, 00:30, 00:31, 00:32])

Descending ordered insert

RootNode(id:0 size=103 records=16)
    key=
    LeafNode(id:1 size=255 keys=[00:01, 00:02, 00:03, 00:04, 00:05])
    key=00:06
    LeafNode(id:16 size=163 keys=[00:06, 00:07, 00:08])
    key=00:09
    LeafNode(id:15 size=163 keys=[00:09, 00:0a, 00:0b])
    key=00:0c
    LeafNode(id:14 size=163 keys=[00:0c, 00:0d, 00:0e])
    key=00:0f
    LeafNode(id:13 size=163 keys=[00:0f, 00:10, 00:11])
    key=00:12
    LeafNode(id:12 size=163 keys=[00:12, 00:13, 00:14])
    key=00:15
    LeafNode(id:11 size=163 keys=[00:15, 00:16, 00:17])
    key=00:18
    LeafNode(id:10 size=163 keys=[00:18, 00:19, 00:1a])
    key=00:1b
    LeafNode(id:9 size=163 keys=[00:1b, 00:1c, 00:1d])
    key=00:1e
    LeafNode(id:8 size=163 keys=[00:1e, 00:1f, 00:20])
    key=00:21
    LeafNode(id:7 size=163 keys=[00:21, 00:22, 00:23])
    key=00:24
    LeafNode(id:6 size=163 keys=[00:24, 00:25, 00:26])
    key=00:27
    LeafNode(id:5 size=163 keys=[00:27, 00:28, 00:29])
    key=00:2a
    LeafNode(id:4 size=163 keys=[00:2a, 00:2b, 00:2c])
    key=00:2d
    LeafNode(id:3 size=163 keys=[00:2d, 00:2e, 00:2f])
    key=00:30
    LeafNode(id:2 size=163 keys=[00:30, 00:31, 00:32])

b-plus-tree's People

Contributors

riywo avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.