diskv's People

Contributors

daniel-nichter, floren, gprggr, iredmail, jonboulle, jpeletier, lennylinux, peterbourgon, theothertomelliott, tmthrgd, urandom2


diskv's Issues

Suggestion: Improve performance by maintaining structs

Currently diskv maintains in-memory values in []byte format. It would be excellent from a performance standpoint if arbitrary structs could be maintained instead, as this would eliminate the overhead of deserialization. I suspect the improvement would be 1-2 orders of magnitude, depending on the application.
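For illustration, here is a minimal caller-side sketch of the idea (not a diskv feature): decode each value at most once and keep the decoded struct in a small map, so repeated lookups skip deserialization entirely. The wrapper type and the decode callback are my own assumptions.

package structcache

import (
	"sync"

	"github.com/peterbourgon/diskv"
)

// StructStore keeps decoded values in memory so deserialization happens at
// most once per key. Callers must treat returned values as read-only.
type StructStore struct {
	d     *diskv.Diskv
	mu    sync.RWMutex
	cache map[string]interface{}
}

func NewStructStore(d *diskv.Diskv) *StructStore {
	return &StructStore{d: d, cache: map[string]interface{}{}}
}

// Get returns the cached struct if present; otherwise it reads the raw bytes
// from diskv, decodes them with the supplied function, and caches the result.
func (s *StructStore) Get(key string, decode func([]byte) (interface{}, error)) (interface{}, error) {
	s.mu.RLock()
	v, ok := s.cache[key]
	s.mu.RUnlock()
	if ok {
		return v, nil // cache hit: no deserialization
	}
	raw, err := s.d.Read(key)
	if err != nil {
		return nil, err
	}
	v, err = decode(raw)
	if err != nil {
		return nil, err
	}
	s.mu.Lock()
	s.cache[key] = v
	s.mu.Unlock()
	return v, nil
}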

ensureCacheSpaceWithLock panics with variable sized keys

If variable-sized keys are being used, then ensureCacheSpaceWithLock() can panic during the cache cleanup routine, which loops through safe(). This occurs if the key being inserted is larger than the last key removed.

It would be better if either this key were inserted anyway (thus exceeding the cache threshold) or the routine were modified so that it cleared some percentage of the total cache size before proceeding (sketched below).
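A standalone sketch of the second option, written against a plain map rather than diskv's actual internals; the 75% target and all names here are purely illustrative.

package cachesketch

const evictTarget = 0.75 // keep the cache at or below 75% of its budget before inserting; arbitrary

type cache struct {
	maxBytes  uint64
	usedBytes uint64
	entries   map[string][]byte
}

// ensureSpace evicts entries until a fixed fraction of the budget is free,
// instead of evicting just enough for the incoming entry.
func (c *cache) ensureSpace(incoming uint64) {
	target := uint64(float64(c.maxBytes) * evictTarget)
	for key, val := range c.entries {
		if c.usedBytes+incoming <= c.maxBytes && c.usedBytes <= target {
			break
		}
		c.usedBytes -= uint64(len(val))
		delete(c.entries, key)
	}
	// If the incoming entry alone exceeds maxBytes, skip caching it rather than panic.
}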

how large can the db be?

If I want to access a key from a db, does it read the whole file into memory or just that particular key's section? Just wondering if I need to have lots of memory available for this. Is a db size of 100 GB advisable, etc.?

ReadStream copies values into memory too soon

As currently implemented, Diskv.ReadStream is not a purely streaming reader, because it buffers an entire copy of the value (in the siphon) before attempting to put the value into the cache. Unfortunately with large values this is a recipe for memory exhaustion.

Ideally we would stream the value directly into the cache as the io.ReadCloser that ReadStream returns is consumed, checking the cache size as we go. I started down this path, but it creates another race condition, because writing into the cache is then not atomic: we cannot know when the ReadCloser will be finished consuming the entry, and it's very possible for others to begin reading the same key-value pair while we're still writing it. So the next step down that road is for readers to actually take a lock per cache entry (which would then be released once the caller Closes the ReadCloser). This quickly became a web of mutexes and synchronisation hacks which felt very unidiomatic Go.

Various simple "solutions" exist (e.g. prechecking the size of the file against the cache size before starting to siphon; see the sketch below), but they are all inherently racy and could still lead to memory exhaustion under stressful conditions. (We could also just take a global write lock during reads, but that wouldn't be very nice to other readers, would it?)

@peterbourgon @philips thoughts?
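For reference, a minimal sketch of the "precheck" idea mentioned above: stat the file and skip the siphon when the value cannot fit in the cache anyway. As noted, this is inherently racy (the file can change size between Stat and Open), and the names are illustrative, not diskv's internals.

package readsketch

import (
	"io"
	"os"
)

// openForRead asks the caller to siphon (copy into the cache) only when the
// value could plausibly fit; otherwise it hands back a plain file handle.
func openForRead(filename string, cacheSizeMax uint64) (rc io.ReadCloser, siphon bool, err error) {
	fi, err := os.Stat(filename)
	if err != nil {
		return nil, false, err
	}
	f, err := os.Open(filename)
	if err != nil {
		return nil, false, err
	}
	if uint64(fi.Size()) > cacheSizeMax {
		return f, false, nil // too large: stream straight from disk, never buffer for the cache
	}
	return f, true, nil
}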

CacheSizeMax ignored in WriteStream function

The WriteStream function, or to be precise the writeStreamWithLock function, does not check the CacheSizeMax option, which results in values larger than CacheSizeMax being stored in the in-memory cache. Is this intentional because it's a stream? I'm using the github.com/gregjones/httpcache library, which uses the WriteStream function. Anything that can be done here? :)

completeFilename causes invalid memory address or nil pointer dereference

Hi

We're running a component in Kubernetes that uses diskv under the hood. The problem is that the process occasionally crashes when it attempts to remove the key from the store. Here is the relevant stack trace:

github.com/org/vendor/github.com/peterbourgon/diskv.(*Diskv).Erase(0xc42041c2d0, 0x0, 0x1b, 0x0, 0x0) /home/rabbit/org/vendor/github.com/peterbourgon/diskv/diskv.go:409 +0xe7
github.com/org/vendor/github.com/peterbourgon/diskv.(*Diskv).completeFilename(0xc42041c2d0, 0x0, 0x1b, 0x1b, 0x27fdf00) /home/rabbit/org/vendor/github.com/peterbourgon/diskv/diskv.go:525 +0x98
path/filepath.Join(0xc42110b970, 0x2, 0x2, 0xc4208e18c0, 0x11) /usr/lib/go/src/path/filepath/path.go:210
path/filepath.join(0xc42110b970, 0x2, 0x2, 0x0, 0x0) /usr/lib/go/src/path/filepath/path_unix.go:45 +0x96
strings.Join(0xc42110b970, 0x2, 0x2, 0x18d6236, 0x1, 0xc42110b918, 0x2) /usr/lib/go/src/strings/strings.go:424

The data directory is mounted as a regular host path (/opt/spm/agent), and file names are ksuid-compatible identifiers.

diskv is initialized with the following configuration:

d := diskv.New(diskv.Options{
	BasePath:     c.Dir,
	Transform:    func(s string) []string { return []string{} },
	CacheSizeMax: 1024 * 1024,
})

Do you have any pointers or ideas why this would happen?

CAS feature request/offer?

I was looking for a CAS[1].

Seems like it could be a pretty small change to diskv. You would need a cryptographic hash function, say Skein or SHA-256. Then something like:

d := diskv.New(diskv.Options{
        BasePath:     "my-data-dir",
        Transform:    flatTransform,
        CryptoHash:   sha256, // proposed new option
        CacheSizeMax: 1024 * 1024,
})
key, err := d.put([]byte{'1', '2', '3'})
...
value, err := d.get(key)

I use put/get because they imply (to me, anyway) atomic operations, which read/write do not.

Opinions? Alternatives?

If I did something like the above and sent a pull request, would you consider it?

[1] http://en.wikipedia.org/wiki/Content-addressable_storage

diskv not compiling on go1.0.3 or go1.1rc3

Even after installing the package "github.com/petar/GoLLRB/llrb", the command go get "github.com/peterbourgon/diskv" generates these errors. Any suggestions?

gocode/src/github.com/peterbourgon/diskv/index.go:28: undefined: llrb.Tree
gocode/src/github.com/peterbourgon/diskv/index.go:29: undefined: llrb.LessFunc
gocode/src/github.com/peterbourgon/diskv/index.go:36: defer requires function call, not conversion
gocode/src/github.com/peterbourgon/diskv/index.go:46: defer requires function call, not conversion
gocode/src/github.com/peterbourgon/diskv/index.go:56: defer requires function call, not conversion
gocode/src/github.com/peterbourgon/diskv/index.go:71: defer requires function call, not conversion
gocode/src/github.com/peterbourgon/diskv/index.go:121: undefined: llrb.LessFunc
gocode/src/github.com/peterbourgon/diskv/index.go:135: undefined: llrb.Tree
gocode/src/github.com/peterbourgon/diskv/index.go:135: undefined: llrb.LessFunc
gocode/src/github.com/peterbourgon/diskv/index.go:136: too many arguments in call to llrb.New
gocode/src/github.com/peterbourgon/diskv/index.go:136: too many errors

Add Cache Layer .. where?

Hi, the root docs say "Add a Cache layer", but none of the code and examples I could access have any implementation of the mentioned concept.

Sorry, this isn't really an issue, more of a question, but I could not find any other forum to ask it. Thanks!

Canceling a Keys() walk

We need a done channel on the Keys and KeysPrefix iterators. That would be a breaking API change to the library, however.

I am happy to send a PR but want to get agreement on the API-breakage question first. Based on discussion from rkt/rkt#209 (comment).
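For concreteness, here is one possible shape for a cancellable key walk, sketched purely as a proposal; the Keys signature below (taking a done/cancel channel) is an assumption for discussion, not the library's current contract.

package keysdemo

import (
	"fmt"

	"github.com/peterbourgon/diskv"
)

// printFirstKeys walks keys and stops early; closing the cancel channel is
// the signal for the walker goroutine to stop sending and exit.
func printFirstKeys(d *diskv.Diskv, n int) {
	cancel := make(chan struct{})
	defer close(cancel)

	count := 0
	for key := range d.Keys(cancel) { // hypothetical: Keys takes a done/cancel channel
		fmt.Println(key)
		count++
		if count >= n {
			break // the deferred close(cancel) lets the walk terminate
		}
	}
}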

License!

Hi Peter, could you please choose a license for this project? I'm considering using this in a service used internally at my company, but of course that decision depends on the license.

Question about many tiny files vs block size

This design creates a single file for every entry. If there are a lot of small entries, this wastes disk space, as the default block size is often 4096 bytes or more. I was curious whether there would be a way to combine the tiny writes into a single index and only break them out into their own files when they grow big enough (user input here?) to be effectively spread out over the file system.

Cache expiration

I've used this awesome library in a couple of projects now, and I think it is a great tool. One thing I'm wondering about is a way to set a cache expiration policy. The policy would dictate when cache keys become stale and should expire, and perhaps be erased. Have you considered any additional features for this library, or perhaps strategies to employ, regarding this topic?

Thanks!

concurrent access

I have a use case where two applications read from and write to the disk cache. When reading from the disk cache, an application will first read from the in-memory cache, which in some scenarios will be dirty. Is it possible to always read from the physical disk first?
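One workaround sketch, based on my reading of the cache behaviour (an assumption, not a documented guarantee): open the store with CacheSizeMax set to 0, so the in-memory cache is never populated and every read hits the disk.

package cachefree

import "github.com/peterbourgon/diskv"

// newDiskOnlyStore builds a handle whose reads always come from the file
// system. Assumption: with CacheSizeMax set to 0 diskv never populates its
// in-memory cache, so stale in-memory values cannot be served.
func newDiskOnlyStore(dir string) *diskv.Diskv {
	return diskv.New(diskv.Options{
		BasePath:     dir,
		Transform:    func(s string) []string { return []string{} },
		CacheSizeMax: 0,
	})
}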

please tag formal releases

Please consider assigning version numbers and tagging releases. Tags/releases are useful for downstream package maintainers (in Debian and other distributions) to export source tarballs, automatically track new releases, and declare dependencies between packages. Read more in the Debian Upstream Guide.

Versioning also encourages vendoring of a particular (e.g. latest stable) release rather than random unreleased snapshots.

Versioning provides a safety margin for situations like "oops, we made a mistake but reverted the problematic commit in master". Presumably a tagged version/release is better tested and reviewed than a random snapshot of the master branch.

Thank you.


While walking, InverseTransform is invoked for directories - is this a bug?

Hey @peterbourgon - loving this library and thank you for your stewardship!

Would love to have you take a look at the following lines:

diskv/diskv.go, lines 597 to 601 at commit 2566386:

key := d.InverseTransform(pathKey)
if info.IsDir() || !strings.HasPrefix(key, prefix) {
return nil // "pass"
}

By my reading, it would be preferable not to call InverseTransform for directories. In my case, I'm using the AdvancedTransform and its inverse in the README, and this is tickling the "panic" line because directories do not carry the expected extension.

What about switching the logic to this? In other words, pass on the directory BEFORE calling InverseTransform.

if info.IsDir() {
    return nil // "pass"
}

key := d.InverseTransform(pathKey)

if !strings.HasPrefix(key, prefix) {
    return nil // "pass"
}

Write() errors with "no such file or directory" if key contains slashes

To reproduce, simply attempt to store a value with a key of a URL and a transform function similar to:

func cacheDirTransform(key string) []string {
    fields, err := ParseURL(key)
    if err != nil {
        return []string{"_other", key}
    }
    return []string{fields.CategorySlug, fields.ArticleID}
}

edit: This also happens with no key transform and a key of fmt.Sprintf("%s/%s", categorySlug, articleID)
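As a caller-side workaround sketch (my own, not an official fix): the key itself becomes the final file name on disk, so a key containing "/" points into a directory that was never created. Escaping slashes out of the key before writing avoids the error; url.QueryEscape is one convenient choice (remember to unescape when listing keys).

package urlkeys

import (
	"net/url"

	"github.com/peterbourgon/diskv"
)

// writeURLKeyed escapes the key so it contains no path separators before
// handing it to diskv; "a/b" becomes "a%2Fb", which is a valid file name.
func writeURLKeyed(d *diskv.Diskv, rawKey string, val []byte) error {
	return d.Write(url.QueryEscape(rawKey), val)
}

// readURLKeyed is the matching read helper.
func readURLKeyed(d *diskv.Diskv, rawKey string) ([]byte, error) {
	return d.Read(url.QueryEscape(rawKey))
}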

diskv: add a KeysPrefix()

In Rocket we are looking at using SHA-512. The problem is that SHA-512 hashes are quite large, so we will give users the option to truncate the hashes to 256 bits instead, which means we have to find the matching key by prefix.

To that end, it would be great if we could add a prefix iterator to diskv, so that if I am given a hash shorter than 512 bits, I at least only have to iterate through keys that have matching directory prefixes.

Thoughts?
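A usage sketch of what such a prefix iterator might look like; the KeysPrefix signature below (a prefix plus a cancel channel, returning a channel of keys) is an assumption for discussion, mirroring the existing Keys walk.

package prefixdemo

import "github.com/peterbourgon/diskv"

// findByTruncatedHash collects every key that begins with the (possibly
// truncated) hash prefix, using the hypothetical KeysPrefix iterator.
func findByTruncatedHash(d *diskv.Diskv, prefix string) []string {
	cancel := make(chan struct{})
	defer close(cancel) // stop the walk early if we return before it finishes

	var matches []string
	for key := range d.KeysPrefix(prefix, cancel) { // hypothetical signature
		matches = append(matches, key)
	}
	return matches
}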

"Adding Caching"

An in-memory caching layer is provided by combining the BasicStore functionality with a simple map structure, and keeping it up-to-date as appropriate. Since the map structure in Go is not threadsafe, it's combined with a RWMutex to provide safe concurrent access.


It is unclear from that statement whether this is activated by default, and if it isn't, how it is to be implemented.
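A minimal sketch of the layer that statement describes: a plain map guarded by a sync.RWMutex in front of the on-disk store. (As far as I can tell, in the current code this behaviour is built in and driven by the CacheSizeMax option rather than exposed as a separate type; the wrapper below is only an illustration of the concept.)

package cachedemo

import (
	"sync"

	"github.com/peterbourgon/diskv"
)

// cachedStore is the layer described above: a map guarded by an RWMutex in
// front of the on-disk store, kept up to date on reads and writes.
type cachedStore struct {
	d     *diskv.Diskv
	mu    sync.RWMutex
	cache map[string][]byte
}

func newCachedStore(d *diskv.Diskv) *cachedStore {
	return &cachedStore{d: d, cache: map[string][]byte{}}
}

func (c *cachedStore) Read(key string) ([]byte, error) {
	c.mu.RLock()
	val, ok := c.cache[key]
	c.mu.RUnlock()
	if ok {
		return val, nil // served from memory
	}
	val, err := c.d.Read(key)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.cache[key] = val // keep the cache up to date after a disk read
	c.mu.Unlock()
	return val, nil
}

func (c *cachedStore) Write(key string, val []byte) error {
	if err := c.d.Write(key, val); err != nil {
		return err
	}
	c.mu.Lock()
	c.cache[key] = val
	c.mu.Unlock()
	return nil
}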

Handling large volumes of writes

Hi there,

A limitation of the current implementation of diskv is that it doesn't work well for highly frequent updates (i.e. hundreds or thousands per second). This can be problematic not only from a performance perspective, but it can also seriously degrade the integrity of an SD card due to the sheer volume of writes.

I've made some changes in my fork that allow a diskv client to write keys to memory only. The client can then call Persist() to flush these changes to disk. This would typically be done on a timed basis (i.e. every 30-60 seconds) to minimize disk activity. The client is responsible for safely shutting down when using these routines, because keys in memory are not guaranteed to be backed by permanent storage.

I'd be interested in hearing some feedback about this. I'm not certain these changes should go directly into diskv (though I haven't figured out how a client could manage it otherwise, or whether that would even make sense, because it exposes the internals).

Anyway, we have been using this for months now in a production system and it has proven to be very solid.
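To make the intended usage concrete, here is a rough sketch of how a client might drive the fork's Persist() on a timer. Only Persist() is named above; the small interface below is my own shim so the sketch stays independent of the fork's other details.

package flushdemo

import "time"

// Persister is a shim for the fork's API: Persist() flushes memory-only keys
// to disk. Upstream diskv does not have this method.
type Persister interface {
	Persist() error
}

// flushPeriodically calls Persist on a timer (e.g. every 30-60 seconds) and
// does a final flush on shutdown so in-memory keys are not lost.
func flushPeriodically(p Persister, stop <-chan struct{}) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			p.Persist()
		case <-stop:
			p.Persist()
			return
		}
	}
}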

Still maintained/working?

Just curious as to the status of this project. I like the implementation and things seem to work well. I am working this into a micro file system for a new distributed, decentralized internet project and want to make sure I am not using something that is effectively dead. Thanks!

Best way to create multiple keys for same value?

@peterbourgon what's the best way to accomplish value 'aliases' or 'links', using multiple keys pointing to the same value? For example, I have a JSON file to store; it has a canonical name (the 'key'). Then I might have several other aliases or keys, e.g. tags, that I would like to use to look up the same value (meaning I could use the primary key or any number of secondary keys).

I was thinking this could be implemented using file-system symlinks for the secondary keys, but wasn't sure how to support that in diskv. Thoughts?

This is a great library by the way, I love the idea of a KV store with a simple way to get access to the files. Excellent idea.
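One possible approach, sketched as a caller-side convention rather than a diskv feature: write the value once under its canonical key, and write each alias as a tiny entry whose value is just the canonical key. Looking up by alias is then a second Read, with no symlinks involved. The alias- prefix is an arbitrary namespace for illustration.

package aliasdemo

import "github.com/peterbourgon/diskv"

const aliasPrefix = "alias-" // arbitrary namespace for alias entries

// writeWithAliases stores the value once under its canonical key, then stores
// each alias as a tiny entry whose value is just the canonical key.
func writeWithAliases(d *diskv.Diskv, canonical string, value []byte, aliases ...string) error {
	if err := d.Write(canonical, value); err != nil {
		return err
	}
	for _, a := range aliases {
		if err := d.Write(aliasPrefix+a, []byte(canonical)); err != nil {
			return err
		}
	}
	return nil
}

// readByAlias resolves the alias to the canonical key, then reads the value.
func readByAlias(d *diskv.Diskv, alias string) ([]byte, error) {
	canonical, err := d.Read(aliasPrefix + alias)
	if err != nil {
		return nil, err
	}
	return d.Read(string(canonical))
}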

ReadStream with a very large value results in excessive memory use when cache is enabled

If the cache is enabled, readWithRLock always reads the file using a siphon.

The siphon code copies every byte it reads into a bytes.Buffer. When the full file has been read, that bytes.Buffer is used to update the cache.

However, if the underlying file is e.g. a gigabyte in size, the siphon will end up with a bytes.Buffer containing that entire gigabyte. Unless you've set your cache size to over a gigabyte, this gets thrown away as soon as the ReadStream is done.

The main reason we use ReadStream is so we can deal with very large items without having to stick the entire thing in memory at once. Having discovered this, we'll probably disable the cache, but there are cases where people may wish to have a cache enabled without blowing up their memory!

does diskv support range query?

I'm looking for an embedded db, and I feel diskv might be a choice, but I need to do some range queries.

My scenario is: I have about 50,000 files, each with properties like size, time, and tags stored in the db. I need to range-query them by time or by tags.

Does diskv support this? Or, in a key-value db, how should I achieve it?

Thank you.

http file stream

Hi all,

I just want to ask how you would serve a file saved in diskv via HTTP. I have seen http.ServeContent, but it needs an io.ReadSeeker; for http.ServeContent we need the file itself. Would you do an io.Copy?

Regards Chris
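A minimal sketch of one way to do it (my own example, not from diskv's docs): buffer the value with Read and wrap it in a bytes.Reader, which gives http.ServeContent the io.ReadSeeker it wants, so Range requests and Content-Length work. For very large values you could instead io.Copy from a ReadStream into the ResponseWriter, at the cost of seeking/Range support. The key wiring is illustrative.

package httpserve

import (
	"bytes"
	"net/http"
	"time"

	"github.com/peterbourgon/diskv"
)

// fileHandler serves the value stored under ?key=...
func fileHandler(d *diskv.Diskv) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		key := r.URL.Query().Get("key")

		// Read the whole value and wrap it in a bytes.Reader, which satisfies
		// the io.ReadSeeker that http.ServeContent expects. For values too
		// large to buffer, io.Copy from d.ReadStream would be the alternative.
		val, err := d.Read(key)
		if err != nil {
			http.NotFound(w, r)
			return
		}
		http.ServeContent(w, r, key, time.Time{}, bytes.NewReader(val))
	}
}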

Feature request: Set owner and group for created files in `diskv.Options`

Dear @peterbourgon,

I have a daemon service and a command-line tool (CLI), both of which use diskv. The daemon service runs as a non-privileged user (like the nobody user).

The problem is that a sysadmin may run the CLI tool as the root user; in that case, all files and directories created by the CLI are owned by the root user and group, and the daemon service cannot read the new files created via the CLI.

It would be very useful if we could add new attributes to diskv.Options to set the owner and group for newly created files and directories.
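To make the request concrete, here is a sketch of where the new knobs might live; the Owner/Group fields are purely hypothetical and do not exist in diskv today.

package ownerdemo

import "github.com/peterbourgon/diskv"

// newStore shows where the requested Owner/Group options might sit; the two
// commented fields are hypothetical and do not exist in diskv today.
func newStore() *diskv.Diskv {
	return diskv.New(diskv.Options{
		BasePath:     "/var/lib/myapp/data", // illustrative path
		CacheSizeMax: 1024 * 1024,
		// Owner: "nobody", // hypothetical: user that should own created files/dirs
		// Group: "nobody", // hypothetical: group that should own created files/dirs
	})
}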

[Question] Performance of writes

We use diskv in one of the services I help maintain at work. Attached is a screenshot of the write performance (in seconds) across a few instances. Have you ever seen this kind of behavior before?
[Screenshot: write times in seconds across a few instances (Screen Shot 2019-05-30 at 8.53.08 AM)]

Most of the time, the writes are extremely fast, but every once in a while, the write time blows up.

Please close the issue if this is not the right place; I'm just not sure where else to post.

Diskv concurrent access for read-modify-write cycle

Diskv looks like a nice store with good performance and a simple API; thanks for writing it!

I'm looking for a simple and fast way to update a value in the store. Say I want to read my object from the store, modify some of its properties, and write it back. I want this whole sequence to be goroutine-safe, so that only one goroutine can modify the object at a time. And ideally I'd prefer not to use a single global lock. What would you recommend?

As for the global lock: I could use multiple diskv instances (like a pool), find a suitable instance for each key, and then lock/unlock it. I see that the Diskv struct already has a mutex (which is public), and that mutex is used internally to synchronize reads and writes. Since Go mutexes are not reentrant, I think it's somewhat unsafe to expose this mutex. Am I right that, despite being public, I should not use that mutex to lock/unlock a Diskv instance?
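A caller-side sketch of one way to get this without a single global lock (and without touching the exported mutex): shard keys over a small pool of mutexes and hold the per-stripe lock for the whole read-modify-write cycle. The wrapper, the stripe count, and the missing-key handling are my own assumptions.

package rmwdemo

import (
	"hash/fnv"
	"sync"

	"github.com/peterbourgon/diskv"
)

// lockedStore shards keys over a fixed pool of mutexes ("striped locking"),
// so updates to different keys rarely contend and no global lock is needed.
type lockedStore struct {
	d     *diskv.Diskv
	locks [64]sync.Mutex // stripe count is arbitrary; tune to taste
}

func (s *lockedStore) lockFor(key string) *sync.Mutex {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &s.locks[h.Sum32()%uint32(len(s.locks))]
}

// Update reads the current value, applies fn, and writes the result back,
// holding the per-stripe lock for the whole read-modify-write cycle. All
// writers must go through Update for this to be safe.
func (s *lockedStore) Update(key string, fn func(old []byte) ([]byte, error)) error {
	mu := s.lockFor(key)
	mu.Lock()
	defer mu.Unlock()

	old, err := s.d.Read(key)
	if err != nil {
		old = nil // sketch: treat a missing key as empty; real code should inspect err
	}
	newVal, err := fn(old)
	if err != nil {
		return err
	}
	return s.d.Write(key, newVal)
}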

ReadStream/WriteStream can lead to data races

If someone calls ReadStream, then proceeds to read from it slowly, and someone else then calls WriteStream, the reader will start to get the new resource contents part-way through. This is because createKeyFileWithNoLock calls os.OpenFile with O_TRUNC set when updating a file, and the existing reader ends up pointing at the new data.
