laforge49 / aatree
Immutable AA Tree
License: Eclipse Public License 1.0
By convention, the options map is the last argument to a function, so that the first argument can drive type polymorphism. The exception is functions which return an updated options map--in these cases the options map should be the first argument, to facilitate threading.
This change impacts the API, so it needs to wait for release 0.6.0.
As we dynamically aggregate various mixins, we do need to track any required close functions. A closer mixin can do this with a stack.
For the closer mixin, we need two methods,
(on-close f opts)
(do-close opts)
The on-close method creates the closer mixin, if not already present, and adds f to the stack of close functions. This method is called when another mixin has a close function.
The do-close method then unwinds the stack, calling the various functions (or closures) in reverse order.
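A minimal sketch of the two methods, assuming the stack lives in the opts map under an illustrative `:closer-stack` key (not the actual aatree API):

```clojure
;; Sketch only: :closer-stack is an assumed key. on-close pushes a
;; close function; do-close calls them in reverse order of
;; registration, then removes the stack.
(defn on-close [f opts]
  ;; conj on a list pushes to the front, so the most recently
  ;; registered function ends up first.
  (update opts :closer-stack (fnil conj '()) f))

(defn do-close [opts]
  (doseq [f (:closer-stack opts)]
    (f))
  (dissoc opts :closer-stack))
```

Because a list is used, iterating the stack front-to-back already yields the close functions in reverse order of registration.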
While I suspect you can use dynamic variables with Clojure core.async, I doubt that dynamic variables would work well in a go block. So even though I do not see a problem with using yearling with an async channel component, I think it is time to get rid of those dynamic variables. Volatiles in the options map should be much easier to use, as well as being a tad less worrisome.
A database needs to be able to manage its available disk space before it can manage more data than can fit in memory.
The Yearling database will be the Calf database but with disk space management capabilities. The result will be rather dysfunctional, but it is a step in the right direction.
The sheer quantity of data updated by a single transaction may exceed memory. So we may want to proactively write blocks to disk before the update completes. The catch is that these blocks may be replaced even before the update completes, potentially giving rise to a block leak.
We can however track all the blocks that have been allocated in the current update and when the update completes, review the updated blocks and un-allocate the unused blocks. Expensive, but unlikely to be used often.
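The tracking described above might be sketched like this; `find-free-block` and `free-block` are hypothetical helpers standing in for the real aatree internals:

```clojure
;; Record every block position allocated during the current update.
(defn allocate-block [state]
  (let [pos (find-free-block state)]        ; hypothetical helper
    (update state :allocated (fnil conj #{}) pos)))

;; When the update completes, un-allocate any tracked block that did
;; not end up referenced by the updated structure.
(defn release-unused [state used?]
  (reduce free-block                        ; hypothetical helper
          (dissoc state :allocated)
          (remove used? (:allocated state))))
```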
By adding uber map to calf, we begin to unify the internals of calf and yearling.
In yearling, the uber map is held by the db state and in turn holds only pending and app map.
In calf, uber map will hold only the app map.
Process pending should be called at the start of each update, with the age and transaction count taken from opts, for which default values are provided.
Let's rename the resources parameter to opts, to match the opts parameter on read-string.
And let's document the API as well.
When pulling data from a file, it is best to work with large blocks. But deserializing a large block is a real bottleneck. And all too often, you only need to make a small change, then reserialize the whole thing and write it back to disk. This explains why it is so difficult to write a fast database in Java or Clojure.
Enter partial deserialization/reserialization. If we can deserialize just the root node of a tree, then we can access any node in a tree by deserializing log n nodes. This is a substantial improvement when only a small part of a block needs to be accessed/updated.
This is easily achieved by adding a wrapper for each node which has an atomic reference to its serialized and deserialized content. But before implementing this, we need to rework the existing code base (again). For one thing, we can move the map comparator out of the nodes and into AAMap, which allows us to unify map and vector nodes. For another, we need accessors for the content of each node--accessing the content of a wrapper then forces the content to be deserialized as needed.
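As a sketch, the wrapper might hold an atom whose value carries the serialized bytes and, once needed, the deserialized node; `deserialize` stands in for the real codec and the names are assumptions:

```clojure
;; Sketch: each node is wrapped with an atom holding both its
;; serialized and (lazily) deserialized content.
(defrecord NodeWrapper [content])  ; content: (atom {:bytes b, :node nil})

(defn node-of [wrapper deserialize]
  (let [{:keys [node bytes]} @(:content wrapper)]
    (or node
        (let [n (deserialize bytes)]
          ;; Cache the deserialized node; keep an existing one if
          ;; another thread won the race.
          (swap! (:content wrapper)
                 (fn [c] (if (:node c) c (assoc c :node n))))
          n))))
```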
Migrate this project to the aatree community: https://github.com/aatree
Currently nodes are not small at all, as they often have references to blocks. Once we have a block cache, this needs to be changed. And this change needs to be done BEFORE making node references soft, or risk having out-of-memory errors.
Without caching or soft references, you will likely be re-reading the same disk blocks over and over. Soft references are not a solution here, as blocks are intentionally huge and using soft block references will likely result in out-of-memory exceptions. So we should add a block cache.
AASet should be provided for completeness, for both basic and lazy sets.
Add support for simple file load and save. Include examples with and without checksums.
It is time we added logging to aatree.
Copying data structures from one database to another is a potential source of errors, as it is difficult to convert all the nested data structures tied to one database into data structures tied to the other. Better to do this transparently. Similarly, we can convert standard Clojure data structures as they are added to the database.
Updates to calf and yearling should work with the db state, not the app map. Further, the API for queries should work with db state rather than the opts map. This does two things:
Time to introduce Copy-On-Write.
We already have basic immutable aatree-based structures, as well as lazy structures which provide ridiculously fast incremental deserialization and reserialization. The next step is virtual structures, which operate as ordinary sorted maps, vectors and sorted sets but which can be larger than will fit in memory.
These virtual structures will make use of the disk space management support provided by the Yearling database.
One problem with virtual nodes is that you can end up deserializing more than fits in memory. We can use weak references in the virtual node to reference its content, but for this to work we need a cache. And for an LRU cache to work, we need to assign a durable id to each value of each node. I.E. The id should only change when a node is replaced by another with a different value.
VirtualNodes will now get garbage collected. Normally not a problem, as their serialized form is always accessible. The exceptions are the new and updated nodes which have not yet been serialized. VirtualNodes then must keep hard references to their content until they are serialized.
Things would be a lot more convenient in transcribe if we did not distinguish between hash and sorted sets and maps. Note that EDN does not make this distinction.
The options page in the wiki has gotten large. And we will be adding even more.
The use of options is the key to our composition mechanism. For now, we can break it down by function, but keep everything in a single map. If we go to separate maps, then we would want to go with Stuart Sierra's Component library, or something that builds on it. But let's start with just a documentation change for now.
A block is released with an invalid position. This happens after several updates creating 100K entries each.
The CountedSequence.first method has a bug:
(swap! v #(if (= s %) (.next (iter s))))
The swap! function takes a function, in this case #(if (= s %) (.next (iter s))), which must be free of side-effects, and it is not.
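The general fix is to perform the side effect once, outside the function passed to swap!. A self-contained illustration of the principle (not the actual CountedSequence fix):

```clojure
(def it (.iterator ^Iterable [1 2 3]))
(def v (atom nil))

;; WRONG: swap! may retry its function under contention, so .next
;; could advance the iterator more than once.
;; (swap! v (fn [_] (.next it)))

;; RIGHT: advance the iterator exactly once, then publish the result.
(let [x (.next it)]
  (reset! v x))
```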
When a transaction fails, things just stop working. Instead, an exception should be sent to the requestor and the database rolled back.
One complication here is BitSet, which is mutable. So the array used to create the BitSet at the beginning of the transaction needs to be used to rebuild the allocation bitset when a transaction fails.
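Since java.util.BitSet is mutable, rebuilding it from the long array captured at the start of the transaction might look like this sketch:

```clojure
(import 'java.util.BitSet)

;; Rebuild the allocation bitset from the snapshot taken when the
;; transaction began; the live BitSet may have been mutated by the
;; failed transaction.
(defn restore-allocations ^BitSet [^longs snapshot]
  (BitSet/valueOf snapshot))
```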
Rather than pass block size and db size as parameters to the open of yearling, these values should be taken from opts with a default value provided.
The AASet -nth doesn't work with the not-found option. I think the fix would be more like the AAMap implementation. First example below works as expected; the second fails mysteriously.
(nth (conj (aa/new-sorted-set (aa/basic-opts)) 1 2 3) 2 :not-found)
;; => 3
(nth (conj (aa/new-sorted-set (aa/basic-opts)) 1 2 3) 4 :not-found)
;; => ClassCastException [trace missing]
I've always used a modified long timestamp, (+ unique (<< timestamp 10)), as a transaction id, but was seduced by Datomic's simple counter and a separate timestamp. But now I am implementing a pipeline so that transactions can be logged and sync'd before being applied. And I want the transaction id in the log. This is difficult unless I make the timestamp generator durable. Yuck!
So it is back to the tried and tested modified timestamp for a transaction id. As long as there are not multiple restarts in the same millisecond, the transaction ids are assured to be unique. (Use an atom to generate that unique value, resetting the count every time the millisecond changes.)
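A sketch of the modified-timestamp generator, assuming the 10 low-order bits hold the per-millisecond counter (so at most 1024 ids per millisecond):

```clojure
(def ^:private ts-state (atom [0 0]))  ; [millis counter]

(defn next-transaction-id []
  (let [now (System/currentTimeMillis)
        [ms n] (swap! ts-state
                      (fn [[ms n]]
                        (if (= ms now)
                          [ms (inc n)]   ; same millisecond: bump counter
                          [now 0])))]    ; new millisecond: reset counter
    (+ n (bit-shift-left ms 10))))
```

The function passed to swap! is free of side effects, so retries under contention are harmless.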
Yearling needs to provide a durably unique id on request. This is to allow virtual nodes to be cached.
The speed now is so bad I consider it a bug. I noticed that things really got slow when I added core.cache, though part of the blame lies with how I added core.cache.
Time to switch to Guava cache and hard-code it as well.
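A hard-coded Guava block cache might be sketched as follows; the capacity and the `read-block` loader are assumptions, not the actual aatree code:

```clojure
(import '(com.google.common.cache CacheBuilder Cache))

(def ^Cache block-cache
  (-> (CacheBuilder/newBuilder)
      (.maximumSize 1024)          ; assumed capacity, in blocks
      (.build)))

;; Look up a block, reading it from disk on a miss; read-block is a
;; hypothetical loader. Clojure fns implement Callable, so they can
;; serve as the loader directly:
;; (.get block-cache block-position (fn [] (read-block block-position)))
```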
Traits will now be converted to records, allowing them to be extended with protocols.
After printing a seq, count returns 0:
(println (count s1)); -> 4
(println s1); -> (0 2 20 3)
(println s1); -> (0 2 20 3)
(println (count s1)); -> 0
Lazy vectors and lazy maps should nest recursively and retain lazy deserialization/reserialization.
Going forward, the deprecated code will no longer work. So it is time to drop it.
The API for agents/atoms/etc. directly supports monolithic updates. To achieve separation of concerns in large transactions running in a single thread, you need a reference to state that is used to do multiple small updates. You cannot nest such updates easily, so it is best to check for nesting and just throw an exception.
The idea then is to have a function like update-db-state which can be used a bit like get-db-state, but be restricted to running inside a transaction scope. Then we can get rid of those dynamics in Yearling.
So we can put a volatile for db-state in the options map. Normal value would be nil, which would prevent using update-db-state. The value would only be non-nil when inside a transaction scope and not inside the scope of a call to update-db-state.
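A sketch of that guard, with assumed names (a `:db-state` volatile in opts):

```clojure
(defn update-db-state
  "Apply f to the current db-state. Throws when called outside a
   transaction scope, or when nested inside another update-db-state.
   Sketch only: the :db-state key is an assumption."
  [opts f & args]
  (let [vstate (:db-state opts)  ; a volatile!, nil outside a transaction
        state  @vstate]
    (when (nil? state)
      (throw (IllegalStateException.
              "update-db-state: no transaction in scope, or nested call")))
    (vreset! vstate nil)         ; nil while f runs, blocking nested calls
    (vreset! vstate (apply f state args))))
```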
All updates appear to be lost when an exception is thrown. Instead, the file needs to be closed properly on exception.
Application code should not mix basic and lazy aatree structures, so some decoupling is needed. This will also open the door for additional variations.
And yes, this is an API change. So we can do it in stages. Initially, we will just deprecate the old API.
Calf and Yearling both use an agent for processing updates. Heifer will use a channel so that logging can be done in a separate thread and before the transaction is updated. But why not make agent and channel mixins?
The idea then is to develop a very small instance mixin for agent and then update calf and yearling to use it. The next step then would be to create an equivalent mixin for channel.
Yearling test (now disabled) dies during deserialization. Could this be a problem with lazy? Or just with yearling?
Implement aavector by building on the existing code. Of course, this means reworking the existing code first so that it is more reusable!
Heifer is a Copy-On-Write database like Yearling, but with the addition of a transaction log. Updates then must be serializable. It is also time to add a recovery mechanism so that processing can continue after an update throws an exception without having to reopen the database or reread the root data. Updates then can be batched for greater efficiency.
Seems to me there is some common code that could be extracted which handles file I/O. No point in repeating this code again when we write Heifer!
Now that the structure of db-state is the same for calf and yearling, having db-agent expose it via a get-db-state function in core starts to get interesting.
This is a first step in unifying queries within and external to a transaction.