diskv's People

Contributors

daniel-nichter, floren, gprggr, iredmail, jonboulle, jpeletier, lennylinux, peterbourgon, theothertomelliott, tmthrgd, urandom2


diskv's Issues

Suggestion: Improve performance by maintaining structs

Currently diskv maintains in-memory values in []byte format. It would be excellent from a performance standpoint if arbitrary structs could be maintained instead, as this would eliminate the overhead of deserialization. I suspect the improvement would be 1-2 orders of magnitude, depending on the application.
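For illustration, here is a minimal caller-side sketch of the idea (not a diskv feature): decode each value at most once and keep the decoded struct in a small map, so repeated lookups skip deserialization entirely. The wrapper type and the decode callback are my own assumptions.

package structcache

import (
	"sync"

	"github.com/peterbourgon/diskv"
)

// StructStore keeps decoded values in memory so deserialization happens at
// most once per key. Callers must treat returned values as read-only.
type StructStore struct {
	d     *diskv.Diskv
	mu    sync.RWMutex
	cache map[string]interface{}
}

func NewStructStore(d *diskv.Diskv) *StructStore {
	return &StructStore{d: d, cache: map[string]interface{}{}}
}

// Get returns the cached struct if present; otherwise it reads the raw bytes
// from diskv, decodes them with the supplied function, and caches the result.
func (s *StructStore) Get(key string, decode func([]byte) (interface{}, error)) (interface{}, error) {
	s.mu.RLock()
	v, ok := s.cache[key]
	s.mu.RUnlock()
	if ok {
		return v, nil // cache hit: no deserialization
	}
	raw, err := s.d.Read(key)
	if err != nil {
		return nil, err
	}
	v, err = decode(raw)
	if err != nil {
		return nil, err
	}
	s.mu.Lock()
	s.cache[key] = v
	s.mu.Unlock()
	return v, nil
}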

ensureCacheSpaceWithLock panics with variable sized keys

If variable-sized keys are being used, then ensureCacheSpaceWithLock() can panic during the cache cleanup routine, which loops through safe(). This occurs if the key being inserted is larger than the last key removed.

It would be better if either this key were inserted anyway (thus exceeding the cache threshold) or the routine were modified so that it cleared some percentage of the total cache size before proceeding (sketched below).
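A standalone sketch of the second option, written against a plain map rather than diskv's actual internals; the 75% target and all names here are purely illustrative.

package cachesketch

const evictTarget = 0.75 // keep the cache at or below 75% of its budget before inserting; arbitrary

type cache struct {
	maxBytes  uint64
	usedBytes uint64
	entries   map[string][]byte
}

// ensureSpace evicts entries until a fixed fraction of the budget is free,
// instead of evicting just enough for the incoming entry.
func (c *cache) ensureSpace(incoming uint64) {
	target := uint64(float64(c.maxBytes) * evictTarget)
	for key, val := range c.entries {
		if c.usedBytes+incoming <= c.maxBytes && c.usedBytes <= target {
			break
		}
		c.usedBytes -= uint64(len(val))
		delete(c.entries, key)
	}
	// If the incoming entry alone exceeds maxBytes, skip caching it rather than panic.
}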

how large can the db be?

If I want to access a key from a db, does it read the whole file into memory or just that particular key's section? Just wondering if I need to have lots of memory available for this. Is a db size of 100 GB advisable, etc.?

ReadStream copies values into memory too soon

As currently implemented, Diskv.ReadStream is not a purely streaming reader, because it buffers an entire copy of the value (in the siphon) before attempting to put the value into the cache. Unfortunately with large values this is a recipe for memory exhaustion.

Ideally we would stream the value directly into the cache as the io.ReadCloser that ReadStream returns is consumed, checking the cache size as we go. I started down this path, but it creates another race condition, because writing into the cache is then not atomic: we cannot know when the ReadCloser will be finished consuming the entry, and it's very possible for others to begin reading the same key-value pair while we're still writing it. So the next step down that road is for readers to actually take a lock per cache entry (which would then be released once the caller Closes the ReadCloser). This quickly became a web of mutexes and synchronisation hacks which felt very unidiomatic Go.

Various simple "solutions" exist (e.g. prechecking the size of the file against the cache size before starting to siphon; see the sketch below), but they are all inherently racy and could still lead to memory exhaustion under stressful conditions. (We could also just take a global write lock during reads, but that wouldn't be very nice to other readers, would it?)

@peterbourgon @philips thoughts?
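For reference, a minimal sketch of the "precheck" idea mentioned above: stat the file and skip the siphon when the value cannot fit in the cache anyway. As noted, this is inherently racy (the file can change size between Stat and Open), and the names are illustrative, not diskv's internals.

package readsketch

import (
	"io"
	"os"
)

// openForRead asks the caller to siphon (copy into the cache) only when the
// value could plausibly fit; otherwise it hands back a plain file handle.
func openForRead(filename string, cacheSizeMax uint64) (rc io.ReadCloser, siphon bool, err error) {
	fi, err := os.Stat(filename)
	if err != nil {
		return nil, false, err
	}
	f, err := os.Open(filename)
	if err != nil {
		return nil, false, err
	}
	if uint64(fi.Size()) > cacheSizeMax {
		return f, false, nil // too large: stream straight from disk, never buffer for the cache
	}
	return f, true, nil
}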

CacheSizeMax ignored in WriteStream function

The WriteStream function, or to be precise the writeStreamWithLock function, does not check the CacheSizeMax option, which results in values larger than CacheSizeMax being stored in the in-memory cache. Is this intentional because it's a stream? I'm using the github.com/gregjones/httpcache library, which uses the WriteStream function. Anything that can be done here? :)

completeFilename causes invalid memory address or nil pointer dereference

Hi

We're running a component in Kubernetes that uses diskv under the hood. The problem is that the process occasionally crashes when it attempts to remove the key from the store. Here is the relevant stack trace:

github.com/org/vendor/github.com/peterbourgon/diskv.(*Diskv).Erase(0xc42041c2d0, 0x0, 0x1b, 0x0, 0x0) /home/rabbit/org/vendor/github.com/peterbourgon/diskv/diskv.go:409 +0xe7
github.com/org/vendor/github.com/peterbourgon/diskv.(*Diskv).completeFilename(0xc42041c2d0, 0x0, 0x1b, 0x1b, 0x27fdf00) /home/rabbit/org/vendor/github.com/peterbourgon/diskv/diskv.go:525 +0x98
path/filepath.Join(0xc42110b970, 0x2, 0x2, 0xc4208e18c0, 0x11) /usr/lib/go/src/path/filepath/path.go:210
path/filepath.join(0xc42110b970, 0x2, 0x2, 0x0, 0x0) /usr/lib/go/src/path/filepath/path_unix.go:45 +0x96
strings.Join(0xc42110b970, 0x2, 0x2, 0x18d6236, 0x1, 0xc42110b918, 0x2) /usr/lib/go/src/strings/strings.go:424

The data directory is mounted as a regular host path (/opt/spm/agent), and file names are ksuid-compatible identifiers.

diskv is initialized with the following configuration:

d := diskv.New(diskv.Options{
	BasePath:     c.Dir,
	Transform:    func(s string) []string { return []string{} },
	CacheSizeMax: 1024 * 1024,
})

Do you have any pointers or ideas why this would happen?

CAS feature request/offer?

I was looking for a CAS[1].

Seems like it could be a pretty small change to diskv. You would need a cryptographic hash function, say Skein or SHA-256. Then something like:

d := diskv.New(diskv.Options{
        BasePath:     "my-data-dir",
        Transform:    flatTransform,
        CryptoHash:   sha256, // proposed new option
        CacheSizeMax: 1024 * 1024,
})
key, err := d.put([]byte{'1', '2', '3'})
...
value, err := d.get(key)

I use put/get because they imply (to me, anyway) atomic operations, which read/write do not.

Opinions? Alternatives?

If I did something like the above and sent a pull request, would you consider it?

[1] http://en.wikipedia.org/wiki/Content-addressable_storage

diskv not compiling on go1.0.3 or go1.1rc3

Even after installing the package "github.com/petar/GoLLRB/llrb", the command go get "github.com/peterbourgon/diskv" generates these errors. Any suggestions?

gocode/src/github.com/peterbourgon/diskv/index.go:28: undefined: llrb.Tree
gocode/src/github.com/peterbourgon/diskv/index.go:29: undefined: llrb.LessFunc
gocode/src/github.com/peterbourgon/diskv/index.go:36: defer requires function call, not conversion
gocode/src/github.com/peterbourgon/diskv/index.go:46: defer requires function call, not conversion
gocode/src/github.com/peterbourgon/diskv/index.go:56: defer requires function call, not conversion
gocode/src/github.com/peterbourgon/diskv/index.go:71: defer requires function call, not conversion
gocode/src/github.com/peterbourgon/diskv/index.go:121: undefined: llrb.LessFunc
gocode/src/github.com/peterbourgon/diskv/index.go:135: undefined: llrb.Tree
gocode/src/github.com/peterbourgon/diskv/index.go:135: undefined: llrb.LessFunc
gocode/src/github.com/peterbourgon/diskv/index.go:136: too many arguments in call to llrb.New
gocode/src/github.com/peterbourgon/diskv/index.go:136: too many errors

Add Cache Layer .. where?

Hi, the root docs say "Add a Cache layer", but none of the code and examples I could access have any implementation of the mentioned concept.

Sorry, this isn't really an issue, more of a question, but I could not find any other forum to ask it. Thanks!

Canceling a Keys() walk

We need a done channel on the Keys and KeysPrefix iterators. That would be a breaking API change to the library, however.

I am happy to send a PR but want to get agreement on the API-breakage question first. Based on discussion from rkt/rkt#209 (comment).
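For concreteness, here is one possible shape for a cancellable key walk, sketched purely as a proposal; the Keys signature below (taking a done/cancel channel) is an assumption for discussion, not the library's current contract.

package keysdemo

import (
	"fmt"

	"github.com/peterbourgon/diskv"
)

// printFirstKeys walks keys and stops early; closing the cancel channel is
// the signal for the walker goroutine to stop sending and exit.
func printFirstKeys(d *diskv.Diskv, n int) {
	cancel := make(chan struct{})
	defer close(cancel)

	count := 0
	for key := range d.Keys(cancel) { // hypothetical: Keys takes a done/cancel channel
		fmt.Println(key)
		count++
		if count >= n {
			break // the deferred close(cancel) lets the walk terminate
		}
	}
}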

License!

Hi Peter, could you please choose a license for this project? I'm considering using this in a service used internally at my company, but of course that decision depends on the license.

Question about many tiny files vs block size

This design creates a single file for every entry. If there are a lot of small entries, this wastes disk space, as the default block size is often 4096 bytes or more. I was curious whether there would be a way to combine the tiny writes into a single index and only break them out into their own files when they grow big enough (user input here?) to be effectively spread out over the file system.

Cache expiration

I've used this awesome library in a couple of projects now, and I think it is a great tool. One thing I'm wondering about is a way to set a cache expiration policy. The policy would dictate when cache keys become stale and should expire, and perhaps be erased. Have you considered any additional features for this library, or perhaps strategies to employ, regarding this topic?

Thanks!

concurrent access

I have a use case where two applications read from and write to the disk cache. When reading from the disk cache, an application will first read from the in-memory cache, which in some scenarios will be dirty. Is it possible to always read from the physical disk first?
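One workaround sketch, based on my reading of the cache behaviour (an assumption, not a documented guarantee): open the store with CacheSizeMax set to 0, so the in-memory cache is never populated and every read hits the disk.

package cachefree

import "github.com/peterbourgon/diskv"

// newDiskOnlyStore builds a handle whose reads always come from the file
// system. Assumption: with CacheSizeMax set to 0 diskv never populates its
// in-memory cache, so stale in-memory values cannot be served.
func newDiskOnlyStore(dir string) *diskv.Diskv {
	return diskv.New(diskv.Options{
		BasePath:     dir,
		Transform:    func(s string) []string { return []string{} },
		CacheSizeMax: 0,
	})
}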

please tag formal releases

Please consider assigning version numbers and tagging releases. Tags/releases are useful for downstream package maintainers (in Debian and other distributions) to export source tarballs, automatically track new releases, and declare dependencies between packages. Read more in the Debian Upstream Guide.

Versioning also encourages vendoring of a particular (e.g. latest stable) release rather than random unreleased snapshots.

Versioning provides a safety margin for situations like "oops, we made a mistake but reverted the problematic commit in master". Presumably a tagged version/release is better tested and reviewed than a random snapshot of the master branch.

Thank you.


While walking, InverseTransform is invoked for directories - is this a bug?

Hey @peterbourgon - loving this library and thank you for your stewardship!

Would love to have you take a look at the following lines:

diskv/diskv.go, lines 597 to 601 at commit 2566386:

key := d.InverseTransform(pathKey)
if info.IsDir() || !strings.HasPrefix(key, prefix) {
return nil // "pass"
}

By my reading, it would be preferable not to call InverseTransform for directories. In my case, I'm using the AdvancedTransform and its inverse in the README, and this is tickling the "panic" line because directories do not carry the expected extension.

What about switching the logic to this? In other words, pass on the directory BEFORE calling InverseTransform.

if info.IsDir() {
    return nil // "pass"
}

key := d.InverseTransform(pathKey)

if !strings.HasPrefix(key, prefix) {
    return nil // "pass"
}

Write() errors with "no such file or directory" if key contains slashes

To reproduce, simply attempt to store a value with a key of a URL and a transform function similar to:

func cacheDirTransform(key string) []string {
    fields, err := ParseURL(key)
    if err != nil {
        return []string{"_other", key}
    }
    return []string{fields.CategorySlug, fields.ArticleID}
}

edit: This also happens with no key transform and a key of fmt.Sprintf("%s/%s", categorySlug, articleID)
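As a caller-side workaround sketch (my own, not an official fix): the key itself becomes the final file name on disk, so a key containing "/" points into a directory that was never created. Escaping slashes out of the key before writing avoids the error; url.QueryEscape is one convenient choice (remember to unescape when listing keys).

package urlkeys

import (
	"net/url"

	"github.com/peterbourgon/diskv"
)

// writeURLKeyed escapes the key so it contains no path separators before
// handing it to diskv; "a/b" becomes "a%2Fb", which is a valid file name.
func writeURLKeyed(d *diskv.Diskv, rawKey string, val []byte) error {
	return d.Write(url.QueryEscape(rawKey), val)
}

// readURLKeyed is the matching read helper.
func readURLKeyed(d *diskv.Diskv, rawKey string) ([]byte, error) {
	return d.Read(url.QueryEscape(rawKey))
}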

diskv: add a KeysPrefix()

In Rocket we are looking at using SHA-512. The problem is that SHA-512 hashes are quite large, so we will give users the option to truncate the hashes to 256 bits instead, which means we have to find the matching key by prefix.

To that end, it would be great if we could add a prefix iterator to diskv, so that if I am given a hash shorter than 512 bits, I at least only have to iterate through keys that have matching directory prefixes.

Thoughts?
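A usage sketch of what such a prefix iterator might look like; the KeysPrefix signature below (a prefix plus a cancel channel, returning a channel of keys) is an assumption for discussion, mirroring the existing Keys walk.

package prefixdemo

import "github.com/peterbourgon/diskv"

// findByTruncatedHash collects every key that begins with the (possibly
// truncated) hash prefix, using the hypothetical KeysPrefix iterator.
func findByTruncatedHash(d *diskv.Diskv, prefix string) []string {
	cancel := make(chan struct{})
	defer close(cancel) // stop the walk early if we return before it finishes

	var matches []string
	for key := range d.KeysPrefix(prefix, cancel) { // hypothetical signature
		matches = append(matches, key)
	}
	return matches
}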

"Adding Caching"

An in-memory caching layer is provided by combining the BasicStore functionality with a simple map structure, and keeping it up-to-date as appropriate. Since the map structure in Go is not threadsafe, it's combined with a RWMutex to provide safe concurrent access.


It is unclear from that statement whether this is activated by default, and if it isn't, how it is to be implemented.
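A minimal sketch of the layer that statement describes: a plain map guarded by a sync.RWMutex in front of the on-disk store. (As far as I can tell, in the current code this behaviour is built in and driven by the CacheSizeMax option rather than exposed as a separate type; the wrapper below is only an illustration of the concept.)

package cachedemo

import (
	"sync"

	"github.com/peterbourgon/diskv"
)

// cachedStore is the layer described above: a map guarded by an RWMutex in
// front of the on-disk store, kept up to date on reads and writes.
type cachedStore struct {
	d     *diskv.Diskv
	mu    sync.RWMutex
	cache map[string][]byte
}

func newCachedStore(d *diskv.Diskv) *cachedStore {
	return &cachedStore{d: d, cache: map[string][]byte{}}
}

func (c *cachedStore) Read(key string) ([]byte, error) {
	c.mu.RLock()
	val, ok := c.cache[key]
	c.mu.RUnlock()
	if ok {
		return val, nil // served from memory
	}
	val, err := c.d.Read(key)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.cache[key] = val // keep the cache up to date after a disk read
	c.mu.Unlock()
	return val, nil
}

func (c *cachedStore) Write(key string, val []byte) error {
	if err := c.d.Write(key, val); err != nil {
		return err
	}
	c.mu.Lock()
	c.cache[key] = val
	c.mu.Unlock()
	return nil
}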

Handling large volumes of writes

Hi there,

A limitation of the current implementation of diskv is that it doesn't work well for highly frequent updates (i.e. hundreds or thousands per second). This can be problematic not only from a performance perspective, but it can also seriously degrade the integrity of an SD card due to the sheer volume of writes.

I've made some changes in my fork that allow a diskv client to write keys to memory only. The client can then call Persist() to flush these changes to disk. This would typically be done on a timed basis (i.e. every 30-60 seconds) to minimize disk activity. The client is responsible for safely shutting down when using these routines, because keys in memory are not guaranteed to be backed by permanent storage.

I'd be interested in hearing some feedback about this. I'm not certain these changes should go directly into diskv (though I haven't figured out how a client could manage it otherwise, or whether that would even make sense, because it exposes the internals).

Anyway, we have been using this for months now in a production system and it has proven to be very solid.
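To make the intended usage concrete, here is a rough sketch of how a client might drive the fork's Persist() on a timer. Only Persist() is named above; the small interface below is my own shim so the sketch stays independent of the fork's other details.

package flushdemo

import "time"

// Persister is a shim for the fork's API: Persist() flushes memory-only keys
// to disk. Upstream diskv does not have this method.
type Persister interface {
	Persist() error
}

// flushPeriodically calls Persist on a timer (e.g. every 30-60 seconds) and
// does a final flush on shutdown so in-memory keys are not lost.
func flushPeriodically(p Persister, stop <-chan struct{}) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			p.Persist()
		case <-stop:
			p.Persist()
			return
		}
	}
}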

Still maintained/working?

Just curious as to the status of this project. I like the implementation and things seem to work well. I am working this into a micro file system for a new distributed, decentralized internet project and want to make sure I am not using something that is effectively dead. Thanks!

Best way to create multiple keys for same value?

@peterbourgon what's the best way to accomplish value 'aliases' or 'links', using multiple keys pointing to the same value? For example, I have a JSON file to store; it has a canonical name (the 'key'). Then I might have several other aliases or keys, e.g. tags, that I would like to use to look up the same value (meaning I could use the primary key or any number of secondary keys).

I was thinking this could be implemented using file-system symlinks for the secondary keys, but wasn't sure how to support that in diskv. Thoughts?

This is a great library by the way, I love the idea of a KV store with a simple way to get access to the files. Excellent idea.
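One possible approach, sketched as a caller-side convention rather than a diskv feature: write the value once under its canonical key, and write each alias as a tiny entry whose value is just the canonical key. Looking up by alias is then a second Read, with no symlinks involved. The alias- prefix is an arbitrary namespace for illustration.

package aliasdemo

import "github.com/peterbourgon/diskv"

const aliasPrefix = "alias-" // arbitrary namespace for alias entries

// writeWithAliases stores the value once under its canonical key, then stores
// each alias as a tiny entry whose value is just the canonical key.
func writeWithAliases(d *diskv.Diskv, canonical string, value []byte, aliases ...string) error {
	if err := d.Write(canonical, value); err != nil {
		return err
	}
	for _, a := range aliases {
		if err := d.Write(aliasPrefix+a, []byte(canonical)); err != nil {
			return err
		}
	}
	return nil
}

// readByAlias resolves the alias to the canonical key, then reads the value.
func readByAlias(d *diskv.Diskv, alias string) ([]byte, error) {
	canonical, err := d.Read(aliasPrefix + alias)
	if err != nil {
		return nil, err
	}
	return d.Read(string(canonical))
}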

ReadStream with a very large value results in excessive memory use when cache is enabled

If the cache is enabled, readWithRLock always reads the file using a siphon.

The siphon code copies every byte it reads into a bytes.Buffer. When the full file has been read, that bytes.Buffer is used to update the cache.

However, if the underlying file is e.g. a gigabyte in size, the siphon will end up with a bytes.Buffer containing that entire gigabyte. Unless you've set your cache size to over a gigabyte, this gets thrown away as soon as the ReadStream is done.

The main reason we use ReadStream is so we can deal with very large items without having to stick the entire thing in memory at once. Having discovered this, we'll probably disable the cache, but there are cases where people may wish to have a cache enabled without blowing up their memory!

does diskv support range query?

I'm looking for an embedded db, and I feel diskv might be a choice, but I need to do some range queries.

My scenario is: I have about 50,000 files, each with properties like size, time, and tags stored in the db. I need to range-query them by time or by tags.

Does diskv support this? Or, in a key-value db, how should I achieve it?

Thank you.

http file stream

Hi all,

I just want to ask how you would serve a file saved in diskv via HTTP. I have seen http.ServeContent, but it needs an io.ReadSeeker; for http.ServeContent we need the file itself. Would you do an io.Copy?

Regards Chris
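A minimal sketch of one way to do it (my own example, not from diskv's docs): buffer the value with Read and wrap it in a bytes.Reader, which gives http.ServeContent the io.ReadSeeker it wants, so Range requests and Content-Length work. For very large values you could instead io.Copy from a ReadStream into the ResponseWriter, at the cost of seeking/Range support. The key wiring is illustrative.

package httpserve

import (
	"bytes"
	"net/http"
	"time"

	"github.com/peterbourgon/diskv"
)

// fileHandler serves the value stored under ?key=...
func fileHandler(d *diskv.Diskv) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		key := r.URL.Query().Get("key")

		// Read the whole value and wrap it in a bytes.Reader, which satisfies
		// the io.ReadSeeker that http.ServeContent expects. For values too
		// large to buffer, io.Copy from d.ReadStream would be the alternative.
		val, err := d.Read(key)
		if err != nil {
			http.NotFound(w, r)
			return
		}
		http.ServeContent(w, r, key, time.Time{}, bytes.NewReader(val))
	}
}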

Feature request: Set owner and group for created files in `diskv.Options`

Dear @peterbourgon,

I have a daemon service and a command-line tool (CLI), both of which use diskv. The daemon service runs as a non-privileged user (like the nobody user).

The problem is that a sysadmin may run the CLI tool as the root user; in that case, all files and directories created by the CLI are owned by the root user and group, and the daemon service cannot read the new files created via the CLI.

It would be very useful if we could add new attributes to diskv.Options to set the owner and group for newly created files and directories.
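To make the request concrete, here is a sketch of where the new knobs might live; the Owner/Group fields are purely hypothetical and do not exist in diskv today.

package ownerdemo

import "github.com/peterbourgon/diskv"

// newStore shows where the requested Owner/Group options might sit; the two
// commented fields are hypothetical and do not exist in diskv today.
func newStore() *diskv.Diskv {
	return diskv.New(diskv.Options{
		BasePath:     "/var/lib/myapp/data", // illustrative path
		CacheSizeMax: 1024 * 1024,
		// Owner: "nobody", // hypothetical: user that should own created files/dirs
		// Group: "nobody", // hypothetical: group that should own created files/dirs
	})
}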

[Question] Performance of writes

We use diskv in one of the services I help maintain at work. Attached is a screenshot of the write performance (in seconds) across a few instances. Have you ever seen this kind of behavior before?
[Screenshot: write times in seconds across a few instances (Screen Shot 2019-05-30 at 8.53.08 AM)]

Most of the time, the writes are extremely fast, but every once in a while, the write time blows up.

Please close the issue if this is not the right place; I'm just not sure where else to post.

Diskv concurrent access for read-modify-write cycle

Diskv looks like a nice store with good performance and a simple API; thanks for writing it!

I'm looking for a simple and fast way to update a value in the store. Say I want to read my object from the store, modify some of its properties, and write it back. I want this whole sequence to be goroutine-safe, so that only one goroutine can modify the object at a time. And ideally I'd prefer not to use a single global lock. What would you recommend?

As for the global lock: I could use multiple diskv instances (like a pool), find a suitable instance for each key, and then lock/unlock it. I see that the Diskv struct already has a mutex (which is public), and that mutex is used internally to synchronize reads and writes. Since Go mutexes are not reentrant, I think it's somewhat unsafe to expose this mutex. Am I right that, despite being public, I should not use that mutex to lock/unlock a Diskv instance?
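A caller-side sketch of one way to get this without a single global lock (and without touching the exported mutex): shard keys over a small pool of mutexes and hold the per-stripe lock for the whole read-modify-write cycle. The wrapper, the stripe count, and the missing-key handling are my own assumptions.

package rmwdemo

import (
	"hash/fnv"
	"sync"

	"github.com/peterbourgon/diskv"
)

// lockedStore shards keys over a fixed pool of mutexes ("striped locking"),
// so updates to different keys rarely contend and no global lock is needed.
type lockedStore struct {
	d     *diskv.Diskv
	locks [64]sync.Mutex // stripe count is arbitrary; tune to taste
}

func (s *lockedStore) lockFor(key string) *sync.Mutex {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &s.locks[h.Sum32()%uint32(len(s.locks))]
}

// Update reads the current value, applies fn, and writes the result back,
// holding the per-stripe lock for the whole read-modify-write cycle. All
// writers must go through Update for this to be safe.
func (s *lockedStore) Update(key string, fn func(old []byte) ([]byte, error)) error {
	mu := s.lockFor(key)
	mu.Lock()
	defer mu.Unlock()

	old, err := s.d.Read(key)
	if err != nil {
		old = nil // sketch: treat a missing key as empty; real code should inspect err
	}
	newVal, err := fn(old)
	if err != nil {
		return err
	}
	return s.d.Write(key, newVal)
}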

ReadStream/WriteStream can lead to data races

If someone calls ReadStream, then proceeds to read from it slowly, and someone else then calls WriteStream, the reader will start to get the new resource contents part-way through. This is because createKeyFileWithNoLock calls os.OpenFile with O_TRUNC set when updating a file, and the existing reader ends up pointing at the new data.
