borgstad / btree Goto Github PK

View Code? Open in Web Editor NEW

6.0 6.0 0.0 105 KB

Makefile 2.73% C 97.27%

hacktoberfest

btree's People

Contributors

Stargazers

Watchers

btree's Issues

Binary search instead of linear

To minimize the worst case running time of internal node search, use binary search which runs in O(log n), instead of linear search which runs in O(n).

Adjust nodes total in-memory size to OS pagesize

For a performance optimization set the in-momory node size to the pagesize of the underlying OS. This should ensure only one read per node and therefore the fewest toal number of reads.

Hashmap collision resolution

In order to keep track of inserted elements, a simple hashmap is used.
This hashmap is by default initialized to having 10,000,000 entries.
Given the probablity of a hash collision: $\frac{k^2}{2N} = p$ , when k = 1000, then p < 0.05. At k=2000, p < 0.2.

Thus, it is nescessary to use a hashing scheme that deals with collisions.

Rename test files to have prefix `test` instead of postfix

The C files are names *_test.c, and should be named test_*.c

Add documentation for functions

Add docstrings to all functions

Btree does not need to rely on hashtable for lookup

Instead of using a generated id for each node in the tree and look this up in a hashtable, simply use the offset on disk (perhaps name this blockNumber), as the id of a node:

diskRead does not need a lookup in a hashtable, as it can use the id directly.
diskWrite needs to ask the storage object for the next blockNumber

Also, a storage structure should keep track of nodes that has been freed due to deletion.

Ensure free memory when btree operations return

When the "public functions" (btreeInsert, btreeSearch, etc.) returns, make sure that used memory has been freed.

Add unit tests

Add tests, preferably with fixtures.

The tests now relies on C assert. Look into Unity for C.

Restructure build process

The build process (Make) should create a bin dir, containing test files, that are all run.

The object files should land in bin/.obj and bin/.obj/tests for application object files and tests respectively.

Insertion is way too slow

The following code inserts 1000 elements, and measures clocks used for computatoins, and wall clock time.

  time_t start = time(NULL);
  clock_t begin = clock();
  for (int i = 0; i < nInsertions; i++)
  {
    bt = btreeInsert(bt, i);
    bt.root = diskRead(bt.id, maxDegree);
  }
  printf("%f\n", (double)((time(NULL) - start)));
  clock_t end = clock();
  double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
  printf("%f\n", time_spent);
  // inorderTraversal(bt.root, maxDegree);
  for (int i = 0; i < maxDegree + 25; i++)
  {
    // printf("%i\n", i);
    assert(btreeSearch(bt.root, i).ok);
  }
}

Wall clock time: 15 sec.
Clock time: 1.546875 sec.

It takes a whooping 15 sec to insert 1000 elements! Anything over 1 sec is useless.

Perhaps experiment with different reading/writing functions in DiskRead. The OS should cache data before flushing to disk, how to take advantage of that?

Add buffer pool functionality

Whenever a Node is read through diskRead, memory is being allocated. Some structure must keep track of what is in memory.

The in-memory content could be represented by a hashmap structure, mapping Node Ids to their offset on disk. The hashmap should also have a list containing Nodes that has been deleted, and therefore their offset freed. The hashing functionality should be controlled through hash.c, and have the following API:

createHashTable: Creating a hash table which amounts to initalizing the struct defined below.
hashPut: Hash a Node Id, with the value being the current offset + 1.
hashGet: Get offset value from Node Id.

HashTable struct to initialized by createHashTable

typedef struct HashTable HashTable;

struct HashTable
{
  int *table;
  int tableSize;
  int curOffset;
  int unallocatedSize;
  int *unallocated;
  int unallocatedN;
};

Likewise, a structure, possibly the DiskNode struct, should maintain a list of Node Ids currently in memory. When the number of Ids exceed some amount, the Node should be written to disk, and removed from memory. This list should be maintained in btree.c, and totally transparent to any calls to diskRead or diskWrite.

Test inserts of random values

The tests only reflects the case where values are inserted in sequential order.

Create test(s) where random values are inserted.

General refactor: Put .c in src/ and .h in include/ and rename functions

Functions should be renamed, such that a function name is always prefixed with the abstraction it represents:
example: initializeCache -> cacheInitialize

Implement cache with LRU for optimizing read/write

Implement a structure like:

typedef struct 
{
    Node *cache;
    HashTable *nodeMemStatus;
    LinkedList lru;
    int nodesInMem;
} Cache

The cache should be initialized to be of a variable size (also depending on the size of the node, see #10). This also implies that calls to diskRead and diskWrite should look in the cache before reading from disk.

Whenever the nodesInMem exceeds a user-defined threshold, nodes should be written to disk.

Idea 1: It would be interesting to keep simple statistics of how many writes/reads are done in a btree, and optimize for that.
Idea 2: Would it make sense to have a thread which functions as a garbage collector, managing nodes in memory?

borgstad / btree Goto Github PK

btree's People

Contributors

Stargazers

Watchers

btree's Issues

Recommend Projects

Recommend Topics

Recommend Org