datacrypt-project / hitchhiker-tree

Functional, persistent, off-heap, high performance data structure

License: Eclipse Public License 1.0

Clojure 99.94% JavaScript 0.06%

hitchhiker-tree's Issues

Unable to load this code

I cloned the project and started a REPL, but I can't even load the code. Are there some external dependencies not described in either project.clj or the README?

user> (require '[hitchhiker.outboard :as ob])
nil
Exception in thread "redis rc refcounting expirer" 
clojure.lang.ExceptionInfo: Carmine connection error {}
	at clojure.core$ex_info.invokeStatic(core.clj:4617)
	at clojure.core$ex_info.invoke(core.clj:4617)
	at taoensso.carmine.connections$pooled_conn.invokeStatic(connections.clj:201)
	at taoensso.carmine.connections$pooled_conn.invoke(connections.clj:191)
	at hitchhiker.redis$start_expiry_thread_BANG_$fn__37582.invoke(redis.clj:108)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: connect
	at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
	at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at taoensso.carmine.connections$make_new_connection.invokeStatic(connections.clj:73)
	at taoensso.carmine.connections$make_new_connection.invoke(connections.clj:55)
	at taoensso.carmine.connections$make_connection_factory$reify__31869.makeObject(connections.clj:106)
	at org.apache.commons.pool2.impl.GenericKeyedObjectPool.create(GenericKeyedObjectPool.java:1041)
	at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:357)
	at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:279)
	at taoensso.carmine.connections.ConnectionPool.get_conn(connections.clj:47)
	at taoensso.carmine.connections$pooled_conn.invokeStatic(connections.clj:196)
	... 4 more
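
The trace suggests that loading the namespace starts a background thread ("redis rc refcounting expirer") which immediately tries to connect to Redis, so a reachable Redis server appears to be an undeclared runtime requirement. A minimal sketch of one way to avoid side effects at load time is to wrap the startup in a `delay`. The function `start-expiry-thread!` below is a hypothetical stand-in that only counts its invocations; it is not the library's implementation.

```clojure
;; Hypothetical stand-in for a function that would spawn a background thread.
;; It records that it ran instead of opening a real connection.
(def started (atom 0))

(defn start-expiry-thread! []
  (swap! started inc)
  :expiry-thread)

;; `delay` defers the side effect until the value is first dereferenced,
;; so merely loading this code does not start anything.
(def expiry-thread (delay (start-expiry-thread!)))

(defn ensure-expiry-thread!
  "Starts the thread on first call; later calls reuse the cached result."
  []
  @expiry-thread)
```

With this shape, the first caller that actually needs Redis pays the connection cost, and a missing server surfaces as an error at that call site rather than at `require` time.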

I think that we could implement schema by starting every DB with the base schema, and then doing 100% of the validation before committing a transaction. Maybe we don't even need schema for the alpha (although it makes the DB easier to program).

Get rid of all global mutable state, allow clean reloading

I see a couple of defonce forms here and there, i.e. global mutable state that should imho be encapsulated and allowed to be freed.
Encapsulate all the state in a session-like "object". A user should be able to handle many of them if necessary (think different trees in different Redis DBs and/or servers), and should have control over resources (such as the refcount-expiry thread, caches, etc.).

This should be an argument to some functions imho:

(let [session (create-session! {:backend {:type :redis :port 6379 :host "redis" :db 1}})
      my-outboard (ob/create session "first-outboard-tree")]
  ;; ... do stuff
  (shutdown-session! session))

This could also be integrated with with-open (by implementing java.io.Closeable).
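
To illustrate the with-open idea, here is a minimal sketch assuming a session is a record that implements `java.io.Closeable`. The names `Session`, `create-session!` and `session-open?` are hypothetical, and the `alive` flag merely stands in for real resources such as the expiry thread or caches.

```clojure
(import 'java.io.Closeable)

;; Hypothetical session type: holds the backend config plus a flag
;; standing in for owned resources (expiry thread, caches, ...).
(defrecord Session [config alive]
  Closeable
  (close [_] (reset! alive false)))

(defn create-session!
  "Hypothetical constructor; a real one would open connections here."
  [config]
  (->Session config (atom true)))

(defn session-open? [session]
  @(:alive session))

(def session
  (create-session! {:backend {:type :redis :port 6379 :host "redis" :db 1}}))

;; with-open calls .close when the body exits, even if it throws.
(with-open [s session]
  ;; ... do stuff with s
  (session-open? s))
```

Because each session owns its own state, several of them can coexist, and reloading the namespace does not leak a previous session's resources as long as it was closed.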

Some pointers on what/where:

https://github.com/dgrnbrg/hitchhiker-tree/blob/298a0660a44aa86b3bd40b5ef45f7ea35c97154b/src/hitchhiker/redis.clj#L129-L131
https://github.com/dgrnbrg/hitchhiker-tree/blob/298a0660a44aa86b3bd40b5ef45f7ea35c97154b/src/hitchhiker/outboard.clj#L28-L30

Implement local disk storage

We could build a fast local KV storage for the hitchhiker tree, so that it can run as an embedded persistent Datomic-like DB for local applications.

Support for FoundationDB

Anyone wanna see this thing run on FoundationDB?

It's a bit of a bear to build the language bindings at the moment, so I doubt we'd want that complexity in here. I wonder if the core messaging would be useful on its own as a library.

Create the write ahead log & index manager

A datomic database has the log of transactions plus 4 indices: :eavt, :aevt, :avet, and :vaet. We need to implement the module which takes a transaction (i.e. set of datoms) and does the following:

adds them to the log
adds them to the indices

In the background, the module needs to flush the indices when enough data is accumulated for the batch. We'll need to store the # of total transactions in the log, as well as the # of transactions durably added to each index, so that we can implement crash recovery.

The log should support pluggable implementations, so that today we could use a file or simple DB scheme, but in the future use Kafka or Redis.
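
The bookkeeping described above could be sketched roughly as follows. This is an in-memory illustration only, assuming datoms are `[e a v t]` vectors with mutually comparable values; all function names are hypothetical, and real flushing and a pluggable log are left out.

```clojure
;; Each index sorts the same [e a v t] datoms by a different key order.
(def index-orders
  {:eavt [0 1 2 3]
   :aevt [1 0 2 3]
   :avet [1 2 0 3]
   :vaet [2 1 0 3]})

(defn- index-comparator [order]
  (fn [x y]
    ;; Reorder both datoms, then compare them element by element.
    (compare (mapv x order) (mapv y order))))

(defn empty-db []
  {:log     []   ; vector of transactions, each a vector of datoms
   :indices (into {} (for [[k order] index-orders]
                       [k (sorted-set-by (index-comparator order))]))
   ;; number of logged transactions durably flushed, per index
   :flushed (zipmap (keys index-orders) (repeat 0))})

(defn transact
  "Appends the transaction to the log, then adds its datoms to every index."
  [db datoms]
  (-> db
      (update :log conj (vec datoms))
      (update :indices
              (fn [indices]
                (into {} (for [[k idx] indices]
                           [k (into idx datoms)]))))))

(defn mark-flushed
  "Records that index k is durable up to the current end of the log."
  [db k]
  (assoc-in db [:flushed k] (count (:log db))))

(defn unflushed
  "Transactions that would need replaying into index k after a crash."
  [db k]
  (subvec (:log db) (get-in db [:flushed k])))
```

Crash recovery then amounts to replaying `(unflushed db k)` into each index `k`, which is why both the total log length and the per-index flushed counts need to be stored.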

Transaction manager

We need to build the system that handles transaction functions and map-shaped transactions, and that broadcasts transactions to all peers.

p.s. if we broadcast to peers with websockets, we could stream novelty directly to the browser.

Doc issue: No rebalance?

It looks like your docs never mention rebalancing the tree. Does that mean it is never rebalanced? If so, doesn't that mean that when keys are inserted in sorted order, access to the tree becomes linear?

garbage collection?

You don't describe how you handle garbage collection. Can you give any hints on how you do it?

Readying for production use

I watched the Strange Loop talk and played around with the Outboard API over the weekend. I am very impressed with the work being done here. Being able to treat the database as a persistent data structure, just like any built-in Clojure data structure, brings a lot of benefits to the table. Even more compelling than the performance benefits are, for me at least, the advances in semantics. It makes writing production applications simpler. Eliminating the need for database interfaces, ORMs, query DSLs, and ad-hoc synchronization mechanisms, all still commonplace in real-world Clojure, removes a lot of complexity. In other words, there is great value here.

I would like to test this library in real-world applications as soon as possible, and help others do the same. I am planning to write a component for it for inclusion in the system library.

I have two immediate concerns that you might be able to address.

Make all operations safe. API calls will be made in threads, typical of web applications, so race conditions should be excluded. Here, for example.
Package the library and make it accessible on Clojars so that we can incorporate it in our build tools.

Again, I view the work done here as a fundamental piece completing the programming model promised by Clojure with regards to the functional paradigm. I see many reasons to prefer this project over Datomic, especially for small production applications. The API is brilliant and joyful to use. I would be thrilled to witness its usage spread out.

Thank you so much indeed.

Tracing GC for data segments

We need a tracing GC for data segments, so that we can know which blocks are no longer active and can be garbage collected. We'll build this as a storage middleware.

This will ensure the DB doesn't grow to use unbounded backend storage space.
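
As a rough illustration of the tracing step, here is a minimal mark-and-sweep sketch. It assumes the middleware can enumerate every stored block address together with the addresses that block points at; the `children` map and both function names are hypothetical.

```clojure
(require '[clojure.set :as set])

(defn reachable
  "Returns the set of block addresses reachable from roots.
   `children` maps an address to the addresses it points at."
  [children roots]
  (loop [seen #{}
         pending (vec roots)]
    (if (empty? pending)
      seen
      (let [addr (peek pending)
            pending (pop pending)]
        (if (contains? seen addr)
          (recur seen pending)
          (recur (conj seen addr)
                 (into pending (get children addr))))))))

(defn garbage
  "Addresses present in storage that no live root can reach."
  [children roots]
  (set/difference (set (keys children))
                  (reachable children roots)))

;; Two tree versions sharing block :a; only the second is still live.
(def blocks
  {:root-v1 [:a :b]
   :root-v2 [:a :c]
   :a []
   :b []
   :c []})

(garbage blocks [:root-v2])
```

Because persistent trees share structure between versions, a block can only be collected once no live root reaches it, which is why tracing from the roots is needed rather than deleting a whole old version outright.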

Define the datascript integration points

In Datomic, it seems like seek-datoms is the only fundamental API that needs to be exposed to the query engine in order to calculate its answers. We need to decide on the API through which DataScript will interact with our code.

Once this is done, we'll be able to implement some kind of transaction & query API based on the index manager module (see #11).

Define the storage middleware pattern

We need to have standard examples of a storage middleware, which we'll use to add things like caching, encryption, and compression.

We'll define a simple cache middleware as a reference for more sophisticated ones.
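
A minimal sketch of what such a cache middleware might look like, assuming the storage interface boils down to a put and a get. The `Store` protocol, the `fetch` name, and the in-memory backend are hypothetical placeholders, and the cache here is unbounded for brevity.

```clojure
(defprotocol Store
  (put! [store k v])
  (fetch [store k]))

;; Stand-in backend that counts reads, so the cache's effect is visible.
(defrecord MemoryStore [data reads]
  Store
  (put! [_ k v] (swap! data assoc k v) nil)
  (fetch [_ k] (swap! reads inc) (get @data k)))

;; Middleware: implements the same protocol and wraps another store.
(defrecord CachingStore [inner cache]
  Store
  (put! [_ k v]
    (swap! cache assoc k v)   ; write-through: update cache and backend
    (put! inner k v))
  (fetch [_ k]
    (if (contains? @cache k)
      (get @cache k)
      (let [v (fetch inner k)]
        (swap! cache assoc k v)
        v))))

(defn memory-store [] (->MemoryStore (atom {}) (atom 0)))

(defn with-cache [inner] (->CachingStore inner (atom {})))
```

Since the wrapper satisfies the same protocol as the store it wraps, middlewares compose by nesting, e.g. a hypothetical `(with-cache (with-compression backend))`.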

DataHike integrates Hitchhiker trees and DataScript

From the README: "I would love to see [Hitchhiker] integration with DataScript for a fully open source Datomic."

Check out https://github.com/replikativ/datahike ;-)

Make forward-iterator async

Instead of using a lazy-seq, use go-routines and provide a buffered return channel as an argument so the caller can determine the prefetching size.

Choose a KV store to use for alpha

Integrating with the KV store will be trivial, since it only needs to support put and get with hinted ordered writes. Ordered writes mean that if a client writes to the same key several times in sequence, all readers see those writes in that same sequence; no write can be observed out of order (n.b. Riak and Cassandra don't do this by default). Hinting means that we don't need every write to have that ordering property, only certain ones (the metadata update operations).

Ideally, we should choose a system easy to run locally or deploy scalably in the cloud.

Here are some choices we can consider:

Riak
Cassandra
Redis
RethinkDB
Any JDBC SQL DB
Accumulo
DynamoDB
CockroachDB