
Comments (8)

FrancescAlted commented on August 25, 2024

Yes, for Blosc1 compressed chunks, I think this makes total sense. The proposal would be to add a new function for safe decompression, something like:

int blosc2_decompress_safe_ctx(blosc2_context* context, const void* src, size_t srcsize, void* dest, size_t destsize);

Also, I was thinking of optionally adding a (64-bit) hash at the end of frames (a new container that has recently been introduced in Blosc2). This could bring much higher safety for them. If a hash exists, functions like blosc2_frame_decompress_chunk() could double-check that the contents of the frame are OK and safely proceed if that is the case. What do you think?

from c-blosc2.

asomers commented on August 25, 2024

I think that if blosc2 is already making backwards-incompatible changes to the API, why bother introducing a new decompression function? It would be simpler just to make the normal decompression function safe by default. An alternate version would, however, be useful for blosc1, which doesn't want to break backwards compatibility.

I think that a built-in checksum might be useful for some applications, but it doesn't help with the safety situation. A malicious chunk, after all, could still have a valid checksum. Checksums, even cryptographic ones, only protect against accidental damage.

Regarding bundled checksums, there is one benefit that really intrigues me: cache efficiency. Using a separate checksum layer atop the compression layer requires an application to traverse a block of data twice. Later parts of the first pass could evict useful data from the CPU cache. But a combined hash+compress algorithm could do the whole thing in a single pass. I don't know how much of a performance impact that would have, but it sounds like it might be significant.


FrancescAlted commented on August 25, 2024

I have given this some serious thought, and I think that in scenarios where safety is important, the user can still call blosc_cbuffer_sizes() first to check whether the cbytes (i.e. the number of bytes to be read from the src buffer) is safe for them. I have added a note about this in commit 2c9d598.

One might argue that the destsize argument could be removed for the same reason, but IMO limiting the destination size is generally more useful, so I think it is fine to keep it.


FrancescAlted commented on August 25, 2024

For reference, I have just filed a new ticket for implementing checksums.


asomers commented on August 25, 2024

However, blosc_cbuffer_sizes() works by interrogating the provided buffer, so it still doesn't protect against a corrupted buffer. Blosc's compression and decompression routines should be safe even when handed a corrupted or garbage buffer. Even a maliciously crafted buffer shouldn't be able to cause an array overrun.


FrancescAlted commented on August 25, 2024

Uh, I don't completely understand the difference between passing the size of the source buffer and making a decision based on the blosc_cbuffer_sizes() feedback. Normally you have an educated guess for the size of the source buffer, and this can be compared with the cbytes returned by blosc_cbuffer_sizes(); if there is too much of a difference, you can distrust the buffer, right?

On the other hand, one cannot enforce a limit on the size of the source buffer as you suggest, because Blosc compressed buffers are not decompressed strictly sequentially: decompression requires accessing the index of the different blocks inside the buffer, which is stored at the end of the buffer, so we are forced to read at minimum cbytes bytes.

For the record, the reason the compressed buffers need the index at the end is that parallelization creates blocks of different sizes, and the decoder needs this info in order to pass the starting point of every block to each decompressing thread.


asomers commented on August 25, 2024

What I have in mind is an application that receives a blosc buffer from an untrusted source. The pseudocode would look like this:

let buffer = read_from_(net|disk|other);
let destlen = length_that_I_expect_based_on_metadata_stored_somewhere_else();
let dest = make_buffer(destlen);
let err = blosc2_decompress_safe_ctx(ctx, buffer.base_address, buffer.size, dest.base_address, destlen);

Even if the received buffer is malicious, then blosc2_decompress_safe_ctx won't overrun the source buffer, because it knows the source length. You seem to suggest that I should do the following instead:

let buffer = read_from_(net|disk|other);
let destlen = length_that_I_expect_based_on_metadata_stored_somewhere_else();
let dest = make_buffer(destlen);
let (nbytes, cbytes, blocksize) = blosc_cbuffer_sizes(buffer.base_address);
if (cbytes != buffer.size)
    return EINVAL;
if (nbytes != destlen)
    return EINVAL;
let err = blosc2_decompress_ctx(ctx, buffer.base_address, dest.base_address, destlen);

Not only does that strategy require additional error-prone code in every blosc consumer, but the output of blosc_cbuffer_sizes cannot be trusted if the buffer came from an untrusted source.

Why can't blosc2_decompress_safe_ctx enforce the length of the source buffer? It shouldn't be a problem that the buffer is read non-sequentially. All that is required is that it never read beyond buffer.base_address + buffer.size.


FrancescAlted commented on August 25, 2024

Yes, I am suggesting exactly that. I agree that you cannot trust the output of blosc_cbuffer_sizes(), but you can trust your buffer.size, and if they do not match, then you should not proceed with decompressing the buffer. I concede that this is a bit more work for users (not that much, just one more call), but this should be a fair price to pay for people who are worried about source trustworthiness.

