btree's People
btree's Issues
Binary search instead of linear
To minimize the worst case running time of internal node search, use binary search which runs in O(log n), instead of linear search which runs in O(n).
Adjust nodes total in-memory size to OS pagesize
For a performance optimization set the in-momory node size to the pagesize of the underlying OS. This should ensure only one read per node and therefore the fewest toal number of reads.
Hashmap collision resolution
In order to keep track of inserted elements, a simple hashmap is used.
This hashmap is by default initialized to having 10,000,000 entries.
Given the probablity of a hash collision: , when k = 1000, then p < 0.05. At k=2000, p < 0.2.
Thus, it is nescessary to use a hashing scheme that deals with collisions.
Rename test files to have prefix `test` instead of postfix
The C files are names *_test.c
, and should be named test_*.c
Add documentation for functions
Add docstrings to all functions
Btree does not need to rely on hashtable for lookup
Instead of using a generated id for each node in the tree and look this up in a hashtable, simply use the offset on disk (perhaps name this blockNumber
), as the id of a node:
diskRead
does not need a lookup in a hashtable, as it can use the id directly.diskWrite
needs to ask the storage object for the nextblockNumber
Also, a storage structure should keep track of nodes that has been freed due to deletion.
Ensure free memory when btree operations return
When the "public functions" (btreeInsert
, btreeSearch
, etc.) returns, make sure that used memory has been freed.
Add unit tests
Add tests, preferably with fixtures.
The tests now relies on C assert. Look into Unity for C.
Restructure build process
The build process (Make) should create a bin
dir, containing test files, that are all run.
The object files should land in bin/.obj
and bin/.obj/tests
for application object files and tests respectively.
Insertion is way too slow
The following code inserts 1000 elements, and measures clocks used for computatoins, and wall clock time.
time_t start = time(NULL);
clock_t begin = clock();
for (int i = 0; i < nInsertions; i++)
{
bt = btreeInsert(bt, i);
bt.root = diskRead(bt.id, maxDegree);
}
printf("%f\n", (double)((time(NULL) - start)));
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("%f\n", time_spent);
// inorderTraversal(bt.root, maxDegree);
for (int i = 0; i < maxDegree + 25; i++)
{
// printf("%i\n", i);
assert(btreeSearch(bt.root, i).ok);
}
}
Wall clock time: 15 sec.
Clock time: 1.546875 sec.
It takes a whooping 15 sec to insert 1000 elements! Anything over 1 sec is useless.
Perhaps experiment with different reading/writing functions in DiskRead
. The OS should cache data before flushing to disk, how to take advantage of that?
Add buffer pool functionality
Whenever a Node
is read through diskRead
, memory is being allocated. Some structure must keep track of what is in memory.
The in-memory content could be represented by a hashmap structure, mapping Node Ids to their offset on disk. The hashmap should also have a list containing Nodes that has been deleted, and therefore their offset freed. The hashing functionality should be controlled through hash.c
, and have the following API:
createHashTable
: Creating a hash table which amounts to initalizing the struct defined below.hashPut
: Hash a Node Id, with the value being the current offset + 1.hashGet
: Get offset value from Node Id.
HashTable struct to initialized by createHashTable
typedef struct HashTable HashTable;
struct HashTable
{
int *table;
int tableSize;
int curOffset;
int unallocatedSize;
int *unallocated;
int unallocatedN;
};
Likewise, a structure, possibly the DiskNode
struct, should maintain a list of Node Ids currently in memory. When the number of Ids exceed some amount, the Node should be written to disk, and removed from memory. This list should be maintained in btree.c
, and totally transparent to any calls to diskRead
or diskWrite
.
Test inserts of random values
The tests only reflects the case where values are inserted in sequential order.
Create test(s) where random values are inserted.
General refactor: Put .c in src/ and .h in include/ and rename functions
Functions should be renamed, such that a function name is always prefixed with the abstraction it represents:
example: initializeCache
-> cacheInitialize
Implement cache with LRU for optimizing read/write
Implement a structure like:
typedef struct
{
Node *cache;
HashTable *nodeMemStatus;
LinkedList lru;
int nodesInMem;
} Cache
The cache should be initialized to be of a variable size (also depending on the size of the node, see #10). This also implies that calls to diskRead
and diskWrite
should look in the cache before reading from disk.
Whenever the nodesInMem
exceeds a user-defined threshold, nodes should be written to disk.
Idea 1: It would be interesting to keep simple statistics of how many writes/reads are done in a btree, and optimize for that.
Idea 2: Would it make sense to have a thread which functions as a garbage collector, managing nodes in memory?
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. ๐๐๐
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google โค๏ธ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.