Comments (6)

grantjenks commented on May 18, 2024

Another thought: could separating out read and write have the added benefit of supporting unicode and bytes files?

from python-diskcache.

grantjenks commented on May 18, 2024

I decided to add the read method but not the write. A quick survey showed that the different semantics of the read method are quite useful.

Committed at 204762d

darkk commented on May 18, 2024

I'm interested in feedback from others. If you use this functionality, please contribute your opinion.

My case is the following. I have ~100k blobs of the same type in the cache, and the entries vary widely in size because they are binary diffs between other blobs (see below). 85% of the entries are smaller than 32 KiB, but the corner case is an entry as large as ~30 MiB, so I decided to use the read=True feature to process the blobs in a streaming way. I found that read=True actually forces small objects to be stored as files and does not respect disk_min_file_size:

$ find /srv/proj/diskcache/?? -type f -ls | awk '{print $7}' | stats
min     204
25.0%   215
50.0%   566
75.0%   17520
90.0%   43320
95.0%   55581
max     24473760

Then I found this ticket and saw that you use reader.name in some code. So my questions are the following:

Would you consider a patch that respects disk_min_file_size when using read=True, by pre-reading min_size bytes from the stream and deciding whether the object should be stored internally or as an external file?

Should the patch add name and mode attributes to BytesIO in this case to preserve some kind of compatibility? These are the only two attributes that are present in gzip.GzipFile, open, and tempfile.TemporaryFile while being absent from BytesIO.
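For illustration, the second question could look roughly like the sketch below. NamedBytesIO and describe are hypothetical names, not part of diskcache, and the default name "<memory>" is an assumption. The name here does not point at a real file, so code that opens or serves reader.name from the file system would still not work with it.

```python
import io


class NamedBytesIO(io.BytesIO):
    """In-memory stream that also carries file-like ``name`` and ``mode``.

    Hypothetical helper for illustration only; not part of diskcache.
    """

    def __init__(self, initial_bytes=b"", name="<memory>", mode="rb"):
        super().__init__(initial_bytes)
        # Plain attributes; ``name`` is a label, not a path on disk.
        self.name = name
        self.mode = mode


def describe(stream):
    """Return (name, mode, size) for a stream exposing name and mode."""
    data = stream.read()
    return stream.name, stream.mode, len(data)
```

A caller that only inspects name and mode can then treat the in-memory stream like the objects returned by open or tempfile.TemporaryFile.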

grantjenks commented on May 18, 2024

I’m afraid patching BytesIO objects wouldn’t work for my use case. The motivation is described in the tutorial at the end of the DjangoCache section: http://www.grantjenks.com/docs/diskcache/tutorial.html#djangocache

When stored in files, the blobs can be served directly from the file system by a web server. The performance of serving megabyte-sized blobs is much improved. And I like the consistency. Could you maybe inherit from Cache to provide the semantics you prefer?

I don’t see what problem you’re trying to solve. What does it matter to your code whether the blobs are stored in the database or file system?

grantjenks commented on May 18, 2024

One more thought: have you looked at overriding the Disk object? That might be the right place to provide your optimization. You could read some chunk of the file handle and decide if it should be stored differently.
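The "read some chunk and decide" step can be sketched independently of diskcache. The function classify_stream, the constant CHUNK_SIZE, and the "inline"/"file" labels are all hypothetical, and the "shorter than the threshold goes inline" rule is an assumption. The sketch also assumes a buffered stream whose read(n) returns fewer than n bytes only at end of stream. How this would be wired into a Disk subclass is left open.

```python
import io
import itertools

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size for the remainder


def classify_stream(stream, min_file_size):
    """Pre-read up to min_file_size bytes and decide where the value belongs.

    Returns ("inline", data) when the whole stream is shorter than
    min_file_size, otherwise ("file", chunks), where chunks yields the
    already-read head followed by the rest of the stream.
    """
    head = stream.read(min_file_size)
    if len(head) < min_file_size:
        # The stream ended before reaching the threshold: small value.
        return "inline", head
    # Large (or exactly threshold-sized) value: replay the head, then
    # keep reading until read() returns an empty bytes object.
    rest = iter(lambda: stream.read(CHUNK_SIZE), b"")
    return "file", itertools.chain([head], rest)
```

The caller never needs to seek back, because the bytes consumed by the pre-read are handed on as the first chunk.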

darkk commented on May 18, 2024

Ah, thanks! I missed the DjangoCache section. Now read=True forcing a file-based store makes more sense to me. I'm still unsure whether mixing read=True and read=False for the same key is possible and gives the expected result, but that's not a big deal anyway.

The problem I am solving is the ability to handle all my cached blobs in the same consistent way. I lean towards streaming handling, as some of the blobs are quite big (~120 MiB decompressed).
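A minimal sketch of what "the same consistent way" could mean on the consumer side, assuming a value may arrive either as bytes or as a binary file-like object. The names iter_chunks and digest are hypothetical, and SHA-256 merely stands in for whatever per-blob processing is needed.

```python
import hashlib
import io


def iter_chunks(blob, chunk_size=64 * 1024):
    """Yield chunks from either a bytes value or a binary file-like object."""
    if isinstance(blob, (bytes, bytearray)):
        stream = io.BytesIO(blob)  # wrap in-memory values as a stream
    else:
        stream = blob
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def digest(blob):
    """Hash a blob without holding more than one chunk in memory at a time."""
    hasher = hashlib.sha256()
    for chunk in iter_chunks(blob):
        hasher.update(chunk)
    return hasher.hexdigest()
```

With this shape, the processing code does not need to know whether a given entry was small or large.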

However, most of the cached blobs are really small (75% under 20 KiB, 25% under 0.2 KiB). So I was surprised to see that the disk_min_file_size optimization is not respected for small files stored with read=True, and I wanted to understand whether it's a bug or a feature :-)

TBH, I don't think it matters much to me whether small files are stored as filesystem objects or SQLite BLOBs. I agree that overriding the Disk object looks like the right way to implement that under the intended read=True use case.

Thanks for the explanation!
