laforge49 / aatree
Immutable AA Tree
License: Eclipse Public License 1.0
By convention, the options map is the last argument to a function, so that the first argument can drive type polymorphism. The exception is functions which return an updated options map--in these cases the options map should be the first argument, to facilitate threading.
This change impacts the API, so it needs to wait for release 0.6.0.
As we dynamically aggregate various mixins, we do need to track any required close functions. A closer mixin can do this with a stack.
For the closer mixin, we need two methods,
(on-close f opts)
(do-close opts)
The on-close method creates the closer mixin, if not already present, and adds f to the stack of close functions. This method is called when another mixin has a close function.
The do-close method then unwinds the stack, calling the various functions (or closures) in reverse order.
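A minimal sketch of the two methods, assuming the stack lives in the opts map under an illustrative `:closer-stack` key (not the actual aatree API):

```clojure
;; Sketch only: :closer-stack is an assumed key. on-close pushes a
;; close function; do-close calls them in reverse order of
;; registration, then removes the stack.
(defn on-close [f opts]
  ;; conj on a list pushes to the front, so the most recently
  ;; registered function ends up first.
  (update opts :closer-stack (fnil conj '()) f))

(defn do-close [opts]
  (doseq [f (:closer-stack opts)]
    (f))
  (dissoc opts :closer-stack))
```

Because a list is used, iterating the stack front-to-back already yields the close functions in reverse order of registration.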
While I suspect you can use dynamic variables with Clojure core.async, I doubt that dynamic variables would work well in a go block. So even though I do not see a problem with using yearling with an async channel component, I think it is time to get rid of those dynamic variables. Volatiles in the options map should be much easier to use, as well as being a tad less worrisome.
A database needs to be able to manage its available disk space before it can manage more data than can fit in memory.
The Yearling database will be the Calf database but with disk space management capabilities. The result will be rather dysfunctional, but it is a step in the right direction.
The sheer quantity of data updated by a single transaction may exceed memory. So we may want to proactively write blocks to disk before the update completes. The catch is that these blocks may be replaced even before the update completes, potentially giving rise to a block leak.
We can however track all the blocks that have been allocated in the current update and when the update completes, review the updated blocks and un-allocate the unused blocks. Expensive, but unlikely to be used often.
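The tracking described above might be sketched like this; `find-free-block` and `free-block` are hypothetical helpers standing in for the real aatree internals:

```clojure
;; Record every block position allocated during the current update.
(defn allocate-block [state]
  (let [pos (find-free-block state)]        ; hypothetical helper
    (update state :allocated (fnil conj #{}) pos)))

;; When the update completes, un-allocate any tracked block that did
;; not end up referenced by the updated structure.
(defn release-unused [state used?]
  (reduce free-block                        ; hypothetical helper
          (dissoc state :allocated)
          (remove used? (:allocated state))))
```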
By adding uber map to calf, we begin to unify the internals of calf and yearling.
In yearling, the uber map is held by the db state and in turn holds only pending and app map.
In calf, uber map will hold only the app map.
Process pending should be called at the start of each update, with the age and transaction count taken from opts, for which default values are provided.
Let's rename the resources parameter to opts, to match the opts parameter on read-string.
And let's document the API as well.
When pulling data from a file, it is best to work with large blocks. But deserializing a large block is a real bottleneck. And all too often, you only need to make a small change, then reserialize the whole thing and write it back to disk. This explains why it is so difficult to write a fast database in Java or Clojure.
Enter partial deserialization/reserialization. If we can deserialize just the root node of a tree, then we can access any node in a tree by deserializing log n nodes. This is a substantial improvement when only a small part of a block needs to be accessed/updated.
This is easily achieved by adding a wrapper for each node which has an atomic reference to its serialized and deserialized content. But before implementing this, we need to rework the existing code base (again). For one thing, we can move the map comparator out of the nodes and into AAMap, which allows us to unify map and vector nodes. For another, we need accessors for the content of each node--accessing the content of a wrapper then forces the content to be deserialized as needed.
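As a sketch, the wrapper might hold an atom whose value carries the serialized bytes and, once needed, the deserialized node; `deserialize` stands in for the real codec and the names are assumptions:

```clojure
;; Sketch: each node is wrapped with an atom holding both its
;; serialized and (lazily) deserialized content.
(defrecord NodeWrapper [content])  ; content: (atom {:bytes b, :node nil})

(defn node-of [wrapper deserialize]
  (let [{:keys [node bytes]} @(:content wrapper)]
    (or node
        (let [n (deserialize bytes)]
          ;; Cache the deserialized node; keep an existing one if
          ;; another thread won the race.
          (swap! (:content wrapper)
                 (fn [c] (if (:node c) c (assoc c :node n))))
          n))))
```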
Migrate this project to the aatree community: https://github.com/aatree
Currently nodes are not small at all, as they often have references to blocks. Once we have a block cache, this needs to be changed. And this change needs to be done BEFORE making node references soft, or risk having out-of-memory errors.
Without caching or soft references, you will likely be re-reading the same disk blocks over and over. Soft references are not a solution here, as blocks are intentionally huge and using soft block references will likely result in out-of-memory exceptions. So we should add a block cache.
AASet should be provided for completeness, for both basic and lazy sets.
Add support for simple file load and save. Include examples with and without checksums.
It is time we added logging to aatree.
Copying data structures from one database to another is a potential source of errors, as it is difficult to convert all the nested data structures tied to one database into data structures tied to the other. Better to do this transparently. Similarly, we can convert standard Clojure data structures as they are added to the database.
Updates to calf and yearling should work with the db state, not the app map. Further, the API for queries should work with db state rather than the opts map. This does two things:
Time to introduce Copy-On-Write.
We already have basic immutable aatree-based structures, as well as lazy structures which provide ridiculously fast incremental deserialization and reserialization. The next step is virtual structures, which operate as ordinary sorted maps, vectors and sorted sets but which can be larger than will fit in memory.
These virtual structures will make use of the disk space management support provided by the Yearling database.
One problem with virtual nodes is that you can end up deserializing more than fits in memory. We can use weak references in the virtual node to reference its content, but for this to work we need a cache. And for an LRU cache to work, we need to assign a durable id to each value of each node. I.E. The id should only change when a node is replaced by another with a different value.
VirtualNodes will now get garbage collected. Normally not a problem, as their serialized form is always accessible. The exceptions are the new and updated nodes which have not yet been serialized. VirtualNodes then must keep hard references to their content until they are serialized.
Things would be a lot more convenient in transcribe if we did not distinguish between hash and sorted sets and maps. Note that EDN does not make this distinction.
The options page in the wiki has gotten large. And we will be adding even more.
The use of options is the key to our composition mechanism. For now, we can break it down by function, but keep everything in a single map. If we go to separate maps, then we would want to go with Stuart Sierra's Component library, or something that builds on it. But let's start with just a documentation change for now.
A block is released with an invalid position. This happens after several updates creating 100K entries each.
The CountedSequence.first method has a bug:
(swap! v #(if (= s %) (.next (iter s))))
The swap! function takes a function, in this case #(if (= s %) (.next (iter s))), which must be free of side-effects, and it is not.
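The general fix is to perform the side effect once, outside the function passed to swap!. A self-contained illustration of the principle (not the actual CountedSequence fix):

```clojure
(def it (.iterator ^Iterable [1 2 3]))
(def v (atom nil))

;; WRONG: swap! may retry its function under contention, so .next
;; could advance the iterator more than once.
;; (swap! v (fn [_] (.next it)))

;; RIGHT: advance the iterator exactly once, then publish the result.
(let [x (.next it)]
  (reset! v x))
```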
When a transaction fails, things just stop working. Instead, an exception should be sent to the requestor and the database rolled back.
One complication here is BitSet, which is mutable. So the array used to create the BitSet at the beginning of the transaction needs to be used to rebuild the allocation bitset when a transaction fails.
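Since java.util.BitSet is mutable, rebuilding it from the long array captured at the start of the transaction might look like this sketch:

```clojure
(import 'java.util.BitSet)

;; Rebuild the allocation bitset from the snapshot taken when the
;; transaction began; the live BitSet may have been mutated by the
;; failed transaction.
(defn restore-allocations ^BitSet [^longs snapshot]
  (BitSet/valueOf snapshot))
```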
Rather than pass block size and db size as parameters to the open of yearling, these values should be taken from opts with a default value provided.
The AASet -nth doesn't work with the not-found option. I think the fix would be more like the AAMap implementation. First example below works as expected; the second fails mysteriously.
(nth (conj (aa/new-sorted-set (aa/basic-opts)) 1 2 3) 2 :not-found)
;; => 3
(nth (conj (aa/new-sorted-set (aa/basic-opts)) 1 2 3) 4 :not-found)
;; => ClassCastException [trace missing]
I've always used a modified long timestamp, (+ unique (<< timestamp 10)), as a transaction id, but was seduced by Datomic's simple counter and a separate timestamp. But now I am implementing a pipeline so that transactions can be logged and sync'd before being applied. And I want the transaction id in the log. This is difficult unless I make the timestamp generator durable. Yuck!
So it is back to the tried and tested modified timestamp for a transaction id. As long as there are not multiple restarts in the same millisecond, the transaction ids are assured to be unique. (Use an atom to generate that unique value, resetting the count every time the millisecond changes.)
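A sketch of the modified-timestamp generator, assuming the 10 low-order bits hold the per-millisecond counter (so at most 1024 ids per millisecond):

```clojure
(def ^:private ts-state (atom [0 0]))  ; [millis counter]

(defn next-transaction-id []
  (let [now (System/currentTimeMillis)
        [ms n] (swap! ts-state
                      (fn [[ms n]]
                        (if (= ms now)
                          [ms (inc n)]   ; same millisecond: bump counter
                          [now 0])))]    ; new millisecond: reset counter
    (+ n (bit-shift-left ms 10))))
```

The function passed to swap! is free of side effects, so retries under contention are harmless.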
Yearling needs to provide a durably unique id on request. This is to allow virtual nodes to be cached.
The speed now is so bad I consider it a bug. I noticed that things really got slow when I added core.cache, though part of the blame lies with how I added core.cache.
Time to switch to Guava cache and hard-code it as well.
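A hard-coded Guava block cache might be sketched as follows; the capacity and the `read-block` loader are assumptions, not the actual aatree code:

```clojure
(import '(com.google.common.cache CacheBuilder Cache))

(def ^Cache block-cache
  (-> (CacheBuilder/newBuilder)
      (.maximumSize 1024)          ; assumed capacity, in blocks
      (.build)))

;; Look up a block, reading it from disk on a miss; read-block is a
;; hypothetical loader. Clojure fns implement Callable, so they can
;; serve as the loader directly:
;; (.get block-cache block-position (fn [] (read-block block-position)))
```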
Traits will now be converted to records, allowing them to be extended with protocols.
After printing a seq, count returns 0:
(println (count s1)); -> 4
(println s1); -> (0 2 20 3)
(println s1); -> (0 2 20 3)
(println (count s1)); -> 0
Lazy vectors and lazy maps should nest recursively and retain lazy deserialization/reserialization.
Going forward, the deprecated code will no longer work. So it is time to drop it.
The API for agents/atoms/etc. directly supports monolithic updates. To achieve separation of concerns in large transactions running in a single thread, you need a reference to state that is used to do multiple small updates. You cannot nest such updates easily, so it is best to check for nesting and just throw an exception.
The idea then is to have a function like update-db-state which can be used a bit like get-db-state, but be restricted to running inside a transaction scope. Then we can get rid of those dynamics in Yearling.
So we can put a volatile for db-state in the options map. Normal value would be nil, which would prevent using update-db-state. The value would only be non-nil when inside a transaction scope and not inside the scope of a call to update-db-state.
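A sketch of that guard, with assumed names (a `:db-state` volatile in opts):

```clojure
(defn update-db-state
  "Apply f to the current db-state. Throws when called outside a
   transaction scope, or when nested inside another update-db-state.
   Sketch only: the :db-state key is an assumption."
  [opts f & args]
  (let [vstate (:db-state opts)  ; a volatile!, nil outside a transaction
        state  @vstate]
    (when (nil? state)
      (throw (IllegalStateException.
              "update-db-state: no transaction in scope, or nested call")))
    (vreset! vstate nil)         ; nil while f runs, blocking nested calls
    (vreset! vstate (apply f state args))))
```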
All updates appear to be lost when an exception is thrown. Instead, the file needs to be closed properly on exception.
Application code should not mix basic and lazy aatree structures, so some decoupling is needed. This will also open the door for additional variations.
And yes, this is an API change. So we can do it in stages. Initially, we will just deprecate the old API.
Calf and Yearling both use an agent for processing updates. Heifer will use a channel so that logging can be done in a separate thread and before the transaction is updated. But why not make agent and channel mixins?
The idea then is to develop a very small instance mixin for agent and then update calf and yearling to use it. The next step then would be to create an equivalent mixin for channel.
Yearling test (now disabled) dies during deserialization. Could this be a problem with lazy? Or just with yearling?
Implement aavector by building on the existing code. Of course, this means reworking the existing code first so that it is more reusable!
Heifer is a Copy-On-Write database like Yearling, but with the addition of a transaction log. Updates then must be serializable. It is also time to add a recovery mechanism so that processing can continue after an update throws an exception without having to reopen the database or reread the root data. Updates then can be batched for greater efficiency.
Seems to me there is some common code that could be extracted which handles file I/O. No point in repeating this code again when we write Heifer!
Now that the structure of db-state is the same for calf and yearling, having db-agent expose it via a get-db-state function in core starts to get interesting.
This is a first step in unifying queries within and external to a transaction.