pumpkindb / pumpkindb

Immutable Ordered Key-Value Database Engine
Home Page: http://pumpkindb.org
License: Mozilla Public License 2.0
Here's an example. When we do a storage transaction, it'd be nice to have something like this:
["Key" "Value" ASSOC COMMIT] WRITE
However, ASSOC and COMMIT need to have a reference to their transaction. Even though we can make this transaction serializable to a byte array and push it onto the stack before evaluating the closure, my concern is that this will unnecessarily mess up the stack and make reading it really difficult.
Hence, there needs to be a mechanism in script::VM
to provide context to the code that's being evaluated.
Error::Code(-30799, 'MDB_KEYEXIST: Key/data pair already exists')
Since PumpkinDB doesn't allow overriding keys, one has to write to a new key every time something is recorded. We have at least one primitive for that right now, and that is HLC. Since it is guaranteed to generate unique and monotonically growing values, they can be used to sequence any collection of values. All one has to do is compose the key this way:
["key" HLC CONCAT "value" ASSOC COMMIT] WRITE
Similarly, things like journalling events can be done in the same way:
["journal" HLC CONCAT "id" "value" 2 WRAP ASSOC COMMIT] WRITE
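The key-composition trick above can be sketched in Rust (the host language). This is a hypothetical helper, not PumpkinDB's actual code, and it assumes the timestamp is encoded big-endian so that lexicographic byte order matches numeric order:

```rust
// Compose a unique, ordered key from a base key and a monotonically
// growing timestamp (a stand-in for HLC; illustrative only).
fn compose_key(base: &[u8], hlc: u64) -> Vec<u8> {
    let mut key = base.to_vec();
    // big-endian so later timestamps sort after earlier ones byte-wise
    key.extend_from_slice(&hlc.to_be_bytes());
    key
}

fn main() {
    let k1 = compose_key(b"journal", 1);
    let k2 = compose_key(b"journal", 2);
    // later timestamps sort after earlier ones under the same base key
    assert!(k1 < k2);
    assert!(k1.starts_with(b"journal"));
}
```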
However, how do we quickly find what's the last element in the collection?
Proposed solution: introduce a number of words to work with the concept. For now, let's refer to it as an "ordered collection" (ORDCOLL), but we might want to find a better name.
I can think of some of the primitives:
ORDCOLL/PAIR : ROT ROT SWAP EVAL CONCAT SWAP
This is used to produce key value pairs for ordered collections:
>[[HLC] "testkey" 1 ORDCOLL/PAIR ASSOC COMMIT] WRITE
>[[HLC] "testkey" 2 ORDCOLL/PAIR ASSOC COMMIT] WRITE
>[[HLC] "testkey" 3 ORDCOLL/PAIR ASSOC COMMIT] WRITE
>[[HLC] "testkey" "Hello" ORDCOLL/PAIR ASSOC COMMIT] WRITE
The reason it takes the value in is to make it more future-proof (what if this or other collections will do something with the value, too?)
ORDCOLL/LAST : 2DUP CONCAT SWAP DROP CURSOR DUP ROT CURSOR/SEEK? [DUP CURSOR/CUR UNWRAP DROP ROT DUP LENGTH SWAP ROT ROT 0 SWAP SLICE EQUAL? [CURSOR/CUR] [CURSOR/PREV] IFELSE ] [CURSOR/LAST SWAP DROP] IFELSE UNWRAP SWAP DROP
This one returns the last element in the collection:
> ["testkey" HLC/MAX ORDCOLL/LAST] READ
"Hello"
I can also see ORDCOLL/FIRST added for completeness (although it should be relatively trivial).
P.S. Keep in mind that the above implementations haven't been thoroughly tested!
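For intuition, here is roughly what ORDCOLL/LAST computes, sketched in Rust over an ordered map rather than an LMDB cursor. It is illustrative only: the real word seeks the cursor and steps back as shown above, while this sketch simply takes the last key matching the prefix.

```rust
use std::collections::BTreeMap;

// Return the value stored under the lexicographically greatest key
// that starts with `prefix` (what ORDCOLL/LAST does with a cursor).
fn ordcoll_last(db: &BTreeMap<Vec<u8>, &'static str>, prefix: &[u8]) -> Option<&'static str> {
    db.iter()
        .filter(|(k, _)| k.starts_with(prefix))
        .last()
        .map(|(_, v)| *v)
}

fn main() {
    let mut db = BTreeMap::new();
    for (ts, val) in [(1u8, "1"), (2, "2"), (3, "Hello")] {
        let mut key = b"testkey".to_vec();
        key.push(ts); // stand-in for an HLC timestamp
        db.insert(key, val);
    }
    assert_eq!(ordcoll_last(&db, b"testkey"), Some("Hello"));
}
```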
It is not a problem right now, but soon enough multi-VM setups will be a thing (use all the cores!), and they will need to synchronize their write access, since LMDB only permits one active writer.
I've run into this issue while working on a simple algorithm for finding the last item in an ordered collection. Creating a binary of N bytes (0xFF) wastes cycles and causes a lot of allocation.
Proposed solution: a size byte ALLOC word that would allocate size bytes with the value of byte in one go.
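On the Rust side such a word would boil down to a single filled allocation. A minimal sketch (the helper name is hypothetical):

```rust
// What `size byte ALLOC` would do internally: allocate `size` bytes
// filled with `byte` in one go, instead of CONCAT-ing bytes in a loop.
fn alloc_filled(size: usize, byte: u8) -> Vec<u8> {
    vec![byte; size] // one allocation, one fill
}

fn main() {
    let max_key_suffix = alloc_filled(4, 0xFF);
    assert_eq!(max_key_suffix, vec![0xFF; 4]);
}
```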
One has to make sure that they are hexadecimal and little-endian.
Proposed solution: prohibit words from starting with - and 0..9, and allow such tokens to be interpreted as bigints/biguints.
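A sketch of the proposed tokenizer rule (the helper name is hypothetical; the real parser would go on to decode the digits):

```rust
// Proposed rule: a token starting with `-` or a digit is not a word
// but a numeric literal (bigint/biguint).
fn is_numeric_literal(token: &str) -> bool {
    token
        .chars()
        .next()
        .map(|c| c == '-' || c.is_ascii_digit())
        .unwrap_or(false)
}

fn main() {
    assert!(is_numeric_literal("42"));
    assert!(is_numeric_literal("-7"));
    assert!(!is_numeric_literal("SWAP"));
}
```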
Right now, both keys and values are persisted with their size prefixes (according to the binary form data representation rules). The motivation for this was to avoid allocating memory to put a prefix in front of them upon retrieval.
However, when traversing a range of keys (CURSOR/SEEK and then /NEXT), it'll trip up the cursor when a "composite key" size changes, changing the very beginning of it.
Proposed solution: make the Env stack store references to data without their size prefixes (slices already carry their length), and hence we can move to persisting data as-is in the database. Having the stack work that way is not a big deal because we can write size prefixes on demand when we are sending data back, avoiding allocations there.
Proposed solution:
This means it'll try to reschedule all Envs until something is received. Ouch!
Proposed solution: tag Receivers with EnvId
Most of the time I end up just copying examples as is into the examples section.
Proposed solution: write actual doctests as a set of programs (builtins format) that should return 0 or 1 at the top of the stack
Something like
test_swap : 1 2 SWAP 2 WRAP 2 1 2 WRAP EQUAL?.
Eventually this can be used to run these automatically
This is very inconvenient when you want to write something large or copy-paste a multi-line script.
Proposed solution: make pumpkindb-term read lines until it sees a period at the very end. But this should work in a way that allows modifying the lines above and below within that one input (similar to how a shell behaves when you type \, or even better).
I believe a 64-bit frame size is just too much, as 32 bits would have sufficed. Is there any reason to send more than 4GB in a single frame?
Proposed solution: switch to 32-bit sizes.
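A sketch of what a 32-bit length-prefixed frame would look like. Big-endian byte order is an assumption here; the actual protocol spec would pin that down:

```rust
use std::convert::TryInto;

// Encode a payload as a frame with a 32-bit big-endian size prefix
// (4 GiB max per frame, as proposed).
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut frame = (payload.len() as u32).to_be_bytes().to_vec();
    frame.extend_from_slice(payload);
    frame
}

// Decode one frame, returning None on a short or truncated buffer.
fn decode_frame(frame: &[u8]) -> Option<&[u8]> {
    let len = u32::from_be_bytes(frame.get(..4)?.try_into().ok()?) as usize;
    frame.get(4..4 + len)
}

fn main() {
    let frame = encode_frame(b"PING");
    assert_eq!(frame.len(), 4 + 4); // 32-bit prefix + payload
    assert_eq!(decode_frame(&frame), Some(&b"PING"[..]));
}
```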
Instead of collecting "query results" data on the stack, data needs to be sent off elsewhere. The first thing this will be useful for is sending the resulting stack over (see #29).

Proposed solution: implement a pubsub mechanism where consumers can subscribe to certain keys and scripts can send data over to those keys.
Something like
<data> <key> SEND
It leaves very little reserved space (only 124..127, because words start at 128), and the choice makes little sense: why 120?
Proposed solution: decrease this range to 0..99; let 100..110 represent the values 0..10 themselves, respectively; move the 121-123 prefixes down to 111-113; and let 114-127 become the new reserved pool. This is also related to #1.
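A decoder sketch for the proposed bands. The ranges come straight from the proposal above; everything else (names, enum shape) is illustrative:

```rust
// First-byte classification under the *proposed* scheme:
// 0..=99    a binary of this length follows
// 100..=110 the single-byte value 0..=10 itself, no payload
// 111..=113 multi-byte length prefixes (formerly 121-123)
// 114..=127 reserved
// 128..=255 words
enum Tag {
    Data(usize),
    SmallInt(u8),
    LengthPrefix(u8),
    Reserved,
    Word,
}

fn classify(first_byte: u8) -> Tag {
    match first_byte {
        0..=99 => Tag::Data(first_byte as usize),
        100..=110 => Tag::SmallInt(first_byte - 100),
        111..=113 => Tag::LengthPrefix(first_byte),
        114..=127 => Tag::Reserved,
        _ => Tag::Word,
    }
}

fn main() {
    assert!(matches!(classify(100), Tag::SmallInt(0))); // [0] in one byte
    assert!(matches!(classify(101), Tag::SmallInt(1))); // [1] in one byte
    assert!(matches!(classify(128), Tag::Word));
}
```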
It is not clear how best to market the project these days and going forward. How broad or how narrow should we get? How do we craft a clear, clicking message that attracts the right kind of people?
They don't work well on numeric values
Proposed solution: introduce UINT/LT? and UINT/GT? words for unsigned integers.
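The intended semantics can be sketched as arbitrary-length unsigned comparison over big-endian byte strings. The encoding (big-endian, possibly with leading zeros) is an assumption here:

```rust
use std::cmp::Ordering;

fn strip_leading_zeros(x: &[u8]) -> &[u8] {
    let start = x.iter().position(|&b| b != 0).unwrap_or(x.len());
    &x[start..]
}

// Compare two unsigned integers given as big-endian byte strings:
// after stripping leading zeros, a longer number is bigger, and
// equal-length numbers compare byte-wise.
fn uint_cmp(a: &[u8], b: &[u8]) -> Ordering {
    let a = strip_leading_zeros(a);
    let b = strip_leading_zeros(b);
    a.len().cmp(&b.len()).then_with(|| a.cmp(b))
}

fn main() {
    // 2 < 256, even though [2] is lexicographically greater than [1, 0]
    assert_eq!(uint_cmp(&[2], &[1, 0]), Ordering::Less);
    assert_eq!(uint_cmp(&[0, 5], &[5]), Ordering::Equal);
}
```

UINT/LT? would push [1] when this comparison yields Less, and UINT/GT? when it yields Greater.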
Optional values are [multiple] values wrapped into a closure. The SOME? and NONE? words refer to this concept.
We do, however, have the CURSOR/* family of words, which don't indicate what they return. Unless you've read the documentation, you can't really know.
Proposed solution: introduce a naming convention for words that return an optional value.
For example, ? as a prefix. This way, TRY can become ?EVAL:
> [DUP] ?EVAL
"Soft" cursor words:
> ?CURSOR/NEXT
This can be pronounced as "maybe-EVAL", "maybe-CURSOR/NEXT". The boolean counterparts (such as CURSOR/NEXT?, which return a boolean) can be pronounced as "is-CURSOR/NEXT".
Sometimes it's nice to annotate your code.
Proposed solution: use Forth's ( ... ) notation. Other ideas are also welcome.
Proposed solution: extract them into a single macro
The most annoying case is the upcoming usage of "boolean" values — [0] and [1]. Right now they are encoded as [1, 0] and [1,1], respectively.
Proposed solution: allocate a band for very small numbers, at the very least 0 and 1 (as these are the most expected small numbers). Under the current scheme, anything in [124u8...128u8] can be used.
This is important when (for example) comparing key prefixes.
Proposed solution: a SLICE word that would take a byte array, a start index, and an end index, and push back a slice of that array. Also, in conjunction, a LENGTH word would be very useful so that it is easy to implement something like SLICE_FROM, which would always slice to the end.
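A sketch of the proposed words' semantics. The inclusive-start/exclusive-end convention is an assumption, since the issue doesn't pin it down:

```rust
// SLICE: byte array, start index (inclusive), end index (exclusive);
// returns None on out-of-bounds instead of panicking.
fn slice(data: &[u8], start: usize, end: usize) -> Option<&[u8]> {
    data.get(start..end)
}

// With LENGTH available, SLICE_FROM is trivially composed from SLICE.
fn slice_from(data: &[u8], start: usize) -> Option<&[u8]> {
    slice(data, start, data.len())
}

fn main() {
    let key = b"journal\x01\x02";
    // compare a key prefix by slicing it out first
    assert_eq!(slice(key, 0, 7), Some(&b"journal"[..]));
    assert_eq!(slice_from(key, 7), Some(&[1u8, 2][..]));
}
```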
Introduced through #67
Problem: implementing words without subwords is difficult
Technically speaking, loaded words can define their own words, but those will leak into the remainder of the program, which is less than ideal.
Using the stack alone requires a significant amount of juggling that makes writing code incredibly frustrating. I personally believe that the value of stack-based programming languages is in their concatenative abilities (composition via the stack), not just being able to do everything by juggling items on the stack.
Solution: make all SETs and DEFs done within any closure local to that closure.
This means that if we eventually need to do code injection that's not injecting what effectively amounts to closures, we'll need a separate pass result type that can indicate that.
CURSOR/CUR will return [key value]
Proposed solution: introduce CURSOR/CUR! that will produce two values on the stack
Proposed solution: an EQUAL? word that pops two items from the stack; if they are equal, it pushes [1] on top of the stack, otherwise it pushes [0].
While following Forth to the letter isn't our goal (many stack operations can be implemented using the very basic ones), not relying on its experience isn't particularly wise.
Proposed solution: implement the following words:
I'm personally less happy about PICK and ROLL, as they make one feel like the stack is an array, but are there good arguments to implement them?
Right now it's impossible to execute code depending on whether some condition is true or false.
Proposed solution: IF and IF_ELSE words.
IF would take cond code and eval code if cond is [1].
IF_ELSE would take cond code code_else and eval code if cond is [1], or eval code_else otherwise.
Exit if false:
... NOT [EXIT] IF
Duplicate if true, drop if false
... [DUP] [DROP] IF_ELSE
It doesn't allow for streaming or compact communication.
Proposed solution: use a binary-frames protocol with binary form encoding and write a CLI tool (pumpkindb-term or whatever it should be called) to be the client REPL.
Also, it might make sense to switch from Tokio to Mio, as I think it might fit our patterns better. But that's to be decided.
Related to #29
There is a bug in the parser which will cause memory exhaustion:
$ cargo run --bin pumpkindb
Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
Running `target/debug/pumpkindb`
Available disk space is approx. 392Gb, setting database map size to it
Listening on 0.0.0.0:9981
fatal runtime error: out of memory
In another terminal
$ telnet localhost 9981
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
["script
Connection closed by foreign host.
Proposed solution: implement DEPTH word
Right now, the returned pointer is not checked for being a null pointer (which is an indication of failure) and the returned slice will point to unallocated memory, crashing the entire process once accessed.
Proposed solution: make Env.alloc() and the Env initializers return a Result, returning an Err if [re]allocation failed, so the error can be caught and properly handled.
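A sketch of what the fallible API could look like. Env and alloc come from the issue; the internals (a single fixed chunk, the error type) are made up for illustration:

```rust
// A toy Env with one fixed-size chunk; the real Env grows its memory.
struct Env {
    chunk: Vec<u8>,
    used: usize,
}

#[derive(Debug, PartialEq)]
struct AllocError;

impl Env {
    // Return Err on exhaustion instead of handing out a slice into
    // unallocated memory (the null-pointer case in the issue).
    fn alloc(&mut self, size: usize) -> Result<&mut [u8], AllocError> {
        if self.used + size > self.chunk.len() {
            return Err(AllocError);
        }
        let start = self.used;
        self.used += size;
        Ok(&mut self.chunk[start..self.used])
    }
}

fn main() {
    let mut env = Env { chunk: vec![0; 8], used: 0 };
    assert!(env.alloc(8).is_ok());
    // the failure is now a catchable value, not a crash on access
    assert_eq!(env.alloc(1), Err(AllocError));
}
```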
This is an upcoming problem. My current thinking is that the server should only be a binary packet streaming server, and those who want to play with PumpkinDB using the text form should use a command-line PumpkinDB client (we can call it pumpkindb-term) that will compile code to binary form and send it over.
The idea is that when a binary form is received by the server and evaluated, the resulting stack is simply discarded, as there's no way to know just how much of the stack the requestor actually wants.
But the stack will sometimes need to be captured. The only way to communicate back from the script to the requestor (or other parties, for that matter) is to use the pubsub capabilities of the server (not in yet, just something I am thinking about, but I strongly believe this is the way to go). So let's imagine publishing to a pubsub channel would be something like
<data> <channel> SEND
So, what if we need to send the entire stack? How do we capture it?
Proposed solution: introduce a STACK word that will capture the stack and put it as a binary on top of the stack, so that capturing the stack and sending it over is as simple as
STACK <channel> SEND
parse("") will return a DecodingError
In a lot of cases, events are serialized to JSON. We can't really process them for indexing or other needs right now.
Proposed solution: start a JSON collection of words. Here's the starting proposal:
null
)"Hello"
-> Hello
)TBC
Currently, the Service for server::PlainServer resorts to using a channel recv to get a result from the VM. While it's currently "fast enough", this is not great.
Proposed solution: switch it to futures.
In order to avoid storing the whole piece of data in the key (which is often not going to work because of key size limitation), equality indexing should write keys like:
[index][value hash] => [key]
Proposed solution: find out what hashing algorithms are typically used by databases for their HASH index types, and implement the most reasonable one.
(Previously, I used SHA-1 there, but I am not sure if it is the best candidate)
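The layout can be sketched like this. std's DefaultHasher (SipHash) merely stands in for whichever hash the issue ends up picking, and a real index would also need to handle hash collisions:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Build one equality-index entry: [index][value hash] => [key].
// Hashing the value keeps the index key small and fixed-size,
// regardless of how large the indexed value is.
fn index_entry(index: &[u8], value: &[u8], key: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    let mut indexed_key = index.to_vec();
    indexed_key.extend_from_slice(&hasher.finish().to_be_bytes());
    (indexed_key, key.to_vec())
}

fn main() {
    let (k1, _) = index_entry(b"by-email", b"a@b.c", b"user:1");
    let (k2, _) = index_entry(b"by-email", b"a@b.c", b"user:2");
    // equal values land under the same index key, whatever they point to
    assert_eq!(k1, k2);
    assert!(k1.starts_with(b"by-email"));
}
```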
self => handle_set,
// storage
self.storage => handle_write,
self.storage => handle_read,
...
Because handle_set checks Env's dictionary for unknown words, all words checked below the handle_set entry can be overridden.
Proposed solution: move it to the bottom of the list and add a comment saying it shouldn't be moved.
Right now, all words that evaluate given closures, parse them to ensure they are valid before injecting them into the scheduler. The reason for that is, once scheduled, it's impossible to extract them out due to possible misinterpretations and generally unknown size of the extension at the evaluation time. However, this carries a significant performance penalty.
Proposed solution: let the scheduler operate not on a single Vector but, perhaps, on a vector of slices, enabling ejection of the invalid slice (or, if we don't want to allocate code on the stack, which we haven't done yet in passes over code, then a Vec of Vecs). This way the error will still occur, but as a decoding_error rather than an invalid_error, and the code will be ejected.
Right now, because of how pubsub::PublisherAccessor.send is implemented, it'll wait until the message has fully gone through, which will block the scheduler until then.
This is, of course, unacceptable!
Proposed solution: have a send_async that returns a receiver; if the receiver's try_recv on the next round is not yielding anything, keep rescheduling.
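A sketch of the proposed shape, using a plain mpsc channel as a stand-in for the pubsub plumbing. send_async and the polling loop are hypothetical:

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};
use std::thread;

// Kick off delivery in the background and hand back a receiver the
// scheduler can poll without blocking.
fn send_async(msg: Vec<u8>) -> Receiver<()> {
    let (done_tx, done_rx) = channel();
    thread::spawn(move || {
        // stand-in for actually pushing the message through pubsub
        drop(msg);
        let _ = done_tx.send(());
    });
    done_rx
}

fn main() {
    let rx = send_async(b"hello".to_vec());
    // scheduler rounds: poll instead of blocking, reschedule while Empty
    loop {
        match rx.try_recv() {
            Ok(()) => break,                      // delivery confirmed
            Err(TryRecvError::Empty) => continue, // reschedule the Env
            Err(TryRecvError::Disconnected) => panic!("publisher gone"),
        }
    }
}
```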
It requires a little bit of ceremony.
Proposed solution: introduce helper words:
CURSOR/DOWHILE : ['iterator SET 'closure SET 'c SET
[c closure EVAL [c iterator EVAL] [0] IFELSE] DOWHILE] EVAL/SCOPED.
?CURSOR/DOWHILE : ['iterator SET 'closure_ SET 'c SET
c [?CURSOR/CUR closure_ EVAL] iterator CURSOR/DOWHILE] EVAL/SCOPED.
?CURSOR/DOWHILE-PREFIXED : ['closure__ SET
'prefix SET
CURSOR 'c SET
c prefix CURSOR/SEEK?
[c [UNWRAP OVER 0 prefix LENGTH SLICE prefix EQUAL?
closure__ [0] IFELSE
] 'CURSOR/NEXT? ?CURSOR/DOWHILE]
IF] EVAL/SCOPED.
I am pretty sure there should be a more efficient way, but even if there isn't, it still should be wrapped in a macro.
This issue is mentioned in ad6da52
There was a realization that Env<'a> wants all referenced elements on the stack to have the same lifetime 'a. However, the lifetime of the values extracted through lmdb is limited by the lifetime of the transaction.
Potential solution:
This means that right now we have to resort to copying values. However, I suspect there's still a chance we can have an optimization here. We might be able to have a temporary "transaction context" stack that has a different lifetime. The feasibility and consequences of such an idea are still to be researched. But if defined well, this should definitely improve the overall quality and performance of PumpkinDB.
Connected to localhost.
Escape character is '^]'.
["script
Connection closed by foreign host.
On the console:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Incomplete', /Users/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-mac/build/src/libcore/result.rs:868
Proposed solution: don't simply unwrap the result of script parsing.
Basically, it means that after a restart, values returned by HLC will be less than those generated before.
Proposed solution: let timestamp module persist last generated HLC.
Storing it in the same database is rather intricate: there might be no write txn, the write txn might never succeed, etc. So I propose that PumpkinDB maintain a separate "meta" database within the environment, with its own transactions.
Is there any better solution (one that wouldn't involve an I/O penalty)?
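A sketch of the restore logic this implies. The plain u64 stands in for the proposed "meta" database, and a real HLC also folds the wall clock into every tick:

```rust
struct Hlc {
    last: u64,
}

impl Hlc {
    // On startup, seed the clock with the persisted last timestamp so
    // post-restart values never go backwards, even if the wall clock did.
    fn restore(persisted_last: u64, wall_clock: u64) -> Hlc {
        Hlc { last: persisted_last.max(wall_clock) }
    }

    fn next(&mut self) -> u64 {
        // simplified: a real HLC also consults the wall clock here
        self.last += 1;
        self.last
    }
}

fn main() {
    // the clock jumped back across the restart; the persisted value wins
    let mut hlc = Hlc::restore(1000, 900);
    assert!(hlc.next() > 1000);
}
```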
First of all, they require error structure maintenance and it is not clear how one should send them back to the client.
A solution that fits errors into PumpkinScript native form of some kind is needed.
This is the key functionality required to implement timestamped data, indexing, etc.
Proposed solution: implement a set of CURSOR-related words
[...code...] CURSOR
Cursor-words:
When it tries to allocate, if the size doesn't fit into the remainder of the chunk, it allocates a new chunk and allocates there. Next time around, though, it only looks at the last chunk and therefore wastes all the space left in previous chunks.
Proposed solution: let allocator scan through chunks to see if any of them have enough space left. This way we can saturate chunks a lot better.
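A first-fit sketch of the proposed scan. The chunk representation is made up, but it shows how leftover space in earlier chunks gets reused:

```rust
// Each entry is (chunk, bytes used). A real allocator would hand out
// pointers; here we return (chunk index, offset) for illustration.
struct ChunkAlloc {
    chunks: Vec<(Vec<u8>, usize)>,
    chunk_size: usize,
}

impl ChunkAlloc {
    fn alloc(&mut self, size: usize) -> (usize, usize) {
        // scan *every* chunk, not just the last one, so leftover
        // space in earlier chunks gets saturated
        for (i, (chunk, used)) in self.chunks.iter_mut().enumerate() {
            if *used + size <= chunk.len() {
                let offset = *used;
                *used += size;
                return (i, offset);
            }
        }
        // no chunk has room: allocate a fresh one
        self.chunks.push((vec![0; self.chunk_size.max(size)], size));
        (self.chunks.len() - 1, 0)
    }
}

fn main() {
    let mut a = ChunkAlloc { chunks: vec![], chunk_size: 8 };
    a.alloc(6); // chunk 0: 6/8 used
    a.alloc(4); // doesn't fit in chunk 0, goes to chunk 1
    let (chunk, _) = a.alloc(2);
    assert_eq!(chunk, 0); // leftover space in chunk 0 gets used
}
```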
We can't really wait for the stack to be completed to send the results back (nor would that be efficient in any way), so we need to figure out a way to do this.
Proposed solution:
Implement a YIELD word (val -- ), the results of which can be collected at runtime, for example:
[ [ [... YIELD] CURSOR] READ] "queue_name" ->QUEUE
or
[ [ [... YIELD] CURSOR] READ] "queue_name" ->STACK
(the latter would collect values and put them into the stack at once).
This can be implemented as a field in Env (for example) holding a stack of YieldConsumers, or something like this. The top one would be a simple discarding consumer.