Comments (6)
Another thought: could separating out `read` and `write` have the added benefit of supporting unicode and bytes files?
from python-diskcache.
I decided to add the `read` method but not the `write` method. A quick survey showed that the different semantics of the `read` method are quite useful.
Committed at 204762d
I'm interested in feedback from others. If you use this functionality, please contribute your opinion.
My case is the following. I have ~100k blobs of the same type in the cache, and the entries have very different sizes because they are binary diffs between blobs (see below). 85% of the entries are smaller than 32 KiB, but a corner case is an entry as large as ~30 MiB, so I decided to use the `read=True` feature to process the blobs in a streaming way. I've found that `read=True` actually forces small objects to be stored as files and does not respect the nice `disk_min_file_size` setting:
$ find /srv/proj/diskcache/?? -type f -ls | awk '{print $7}' | stats
min 204
25.0% 215
50.0% 566
75.0% 17520
90.0% 43320
95.0% 55581
max 24473760
Then I saw this ticket and noticed that you use `reader.name` in some code. So my questions are the following:
Would you consider a patch that respects `disk_min_file_size` while using `read=True`, pre-reading `min_size` bytes from the stream and deciding whether the object should be stored internally or as an external file?
Should the patch add `name` and `mode` attributes to `BytesIO` in this case, to preserve some kind of compatibility? These two attributes are the only two that are present on `gzip.GzipFile`, on the objects returned by `open`, and on `tempfile.TemporaryFile` while being absent from `BytesIO`.
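For illustration, such compatibility could be provided with a thin `BytesIO` subclass; `NamedBytesIO` below is a hypothetical name, not an API of diskcache or the standard library.

```python
import io


class NamedBytesIO(io.BytesIO):
    """In-memory buffer carrying the `name` and `mode` attributes that
    real file objects (open, gzip.GzipFile, tempfile) expose.
    Hypothetical helper for illustration only."""

    def __init__(self, data=b'', name='<memory>', mode='rb'):
        super().__init__(data)
        self.name = name
        self.mode = mode


buf = NamedBytesIO(b'payload', name='blob-0001.bin')
```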
I'm afraid patching BytesIO objects wouldn't work for my use case. The motivation is described in the tutorial at the end of the DjangoCache section: http://www.grantjenks.com/docs/diskcache/tutorial.html#djangocache
When stored in files, the blobs can be served directly from the file system by a web server, which much improves the performance of serving megabyte-sized blobs. And I like the consistency. Could you maybe inherit from `Cache` to provide the semantics you prefer?
I don’t see what problem you’re trying to solve. What does it matter to your code whether the blobs are stored in the database or file system?
One more thought, have you looked at overriding the Disk object? That might be the right place to provide your optimization. You could read some chunk of the file handle and decide if it should be stored differently.
Ah, thanks! I missed the DjangoCache section. Now `read=True` forcing file-based storage makes more sense to me. I'm still unsure whether mixing `read=True` and `read=False` for the same key is possible and provides the expected result, but that's not a big deal anyway.
The problem I am solving is the ability to handle all my cached blobs in the same consistent way. I lean towards streaming handling because some of the blobs are quite big (~120 MiB decompressed).
However, most of the cached blobs are really small (75% under 20 KiB, 25% under 0.2 KiB). So I was surprised to see the `disk_min_file_size` optimization not being respected for small files stored with `read=True`, and I decided to find out whether it's a bug or a feature :-)
TBH, I don't think it matters much to me whether small files are stored as filesystem objects or SQLite BLOBs. I agree that overriding the Disk object looks like the right way to implement that under the intended `read=True` use case.
Thanks for the explanation!