
Comments (14)

martindurant commented on August 26, 2024

I should have said: this isn't related to fastparquet, since parquet files are almost never globally compressed (no point!). More appropriate is, say, an astro FITS file, which contains "extensions" concatenated, with the whole thing compressed (usually gzip!). For my kind of catalog access, it's really interesting to be able to either "load extension 3" or load some specific C-array buffer contained therein, by knowing the byte range in the uncompressed data and the set of (uncompressed offset, compressed offset) points in the original file where it's OK to start reading. (I understand that you can never just start reading; you also need some header stuff...)

Seems pretty unlikely that we could persuade the data archives to start producing zstd or xz (block) output, but it is also independently interesting to demonstrate how you might write massive compressed data with near-random access and parallel/distributed extraction.


milesgranger commented on August 26, 2024

I don't really know. I know at least snappy's framed decoding would fail at any offset which didn't start with the header; for the raw implementation, I suppose the inverse, plus, as you say, starting at the wrong offset would again fail. As for the other implementations, I'm unsure, but I would assume that, as each lib has a slightly different API, it's unlikely all would have some poke-able interface for this. It will take a bit to look more into this as time permits. πŸ‘
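A minimal sketch of that failure mode, assuming cramjam's framed snappy API (the exact exception name is an assumption and may differ between versions):

import cramjam

# Framed snappy begins with a stream-identifier chunk, so decoding
# from the true start works...
framed = bytes(cramjam.snappy.compress(b"some data " * 1000))
cramjam.snappy.decompress(framed)

# ...but from an arbitrary offset the identifier is missing and the
# first "chunk type" byte is garbage, so the decoder errors out
# rather than resynchronizing.
try:
    cramjam.snappy.decompress(framed[7:])
except cramjam.DecompressionError as exc:  # assumed exception name
    print("failed as expected:", exc)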


milesgranger commented on August 26, 2024

Would the opposite work? As in, taking the chunk of uncompressed data, compressing it, then looking through the entire compressed buffer for a match? That'd give us the offset and length of the block, assuming that in the typical use case the matching iteration would be faster than attempting decompression at each offset.

Ideally, what were you thinking the API should look like, btw?


martindurant commented on August 26, 2024

I'm pretty sure that if you compress a block of bytes A, and compress a larger block B that contains A, then the compressed version of A will not, in general, appear in the compressed version of B.

What I am after

In the project fsspec-reference-maker, with ReferenceFileSystem for fsspec, it is possible to refer to some block of bytes at a URL as if it were a file, and so expose the internal C-buffers of a file format like HDF5. HDF5 has internal compression of its blocks, so we just know to un-gzip after loading. Other formats compress the whole file, but I want to be able to read as little of the compressed file as possible to get the C-buffer.

So if I scan the compressed file, and know that bytes 1000-1100 (uncompressed) is the output I want, it ought to be possible to say: scan to the nearest block before byte 1000 (uncompressed), and start the gzipfile there; read X bytes of junk, then the next 100 bytes will be what you are after. Yes, gzipfile would also need some sort of header.

With some formats like bzip2, there are block markers you can find. For others like snappy, each frame tells you how big it is, so you can quickly find them all. gzip is far more common, though...
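For the snappy case just mentioned, a sketch of walking the framing format without decompressing anything, relying on each chunk being a 1-byte type plus a 3-byte little-endian payload length per snappy's framing spec:

def snappy_chunk_offsets(data: bytes):
    """Return the byte offset of every chunk in a framed snappy stream."""
    offsets = []
    pos = 0
    while pos + 4 <= len(data):
        offsets.append(pos)
        # 1 byte chunk type, then 3-byte little-endian payload length
        length = int.from_bytes(data[pos + 1 : pos + 4], "little")
        pos += 4 + length
    return offsets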

In addition to the use case I've drawn, this would also allow parallel reading of massive compressed gzip CSV files (also very common). Right now you can't do this, because reading to a position in a gzip file means starting from the beginning and scanning through every time. If we could scan just once and know where the blocks are, or have a way to start from any byte and find the next block in the stream, we would be good.
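And a sketch of the access pattern described above, assuming we already have a list of (uncompressed offset, compressed offset) restart points, here taken to be gzip member boundaries (a hypothetical helper: plain single-member gzip provides no such points, as discussed below):

import bisect
import zlib

def read_uncompressed_range(path, start, length, checkpoints):
    # checkpoints: sorted (uncompressed_offset, compressed_offset) pairs
    # at which a fresh decompressor can be started, e.g. gzip member
    # boundaries found by a one-time scan; assumes checkpoints[0] covers
    # the start of the file.
    i = bisect.bisect_right([u for u, _ in checkpoints], start) - 1
    u_off, c_off = checkpoints[i]
    need = (start - u_off) + length    # output bytes we must produce
    out = bytearray()
    d = zlib.decompressobj(wbits=31)   # 31 = expect gzip header/trailer
    with open(path, "rb") as f:
        f.seek(c_off)
        while len(out) < need:
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            while chunk:
                if d.eof:              # previous member finished mid-buffer
                    d = zlib.decompressobj(wbits=31)
                out += d.decompress(chunk)
                chunk = d.unused_data  # bytes past the member's end
    # discard the "junk" before the requested range
    return bytes(out[start - u_off : start - u_off + length])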


milesgranger commented on August 26, 2024

Okay, I misunderstood then; I thought one might have, and want to find, the entirety of a block.

Anyway, thanks for the thorough explanation. I'll take a stab at it in the next few days.


martindurant commented on August 26, 2024

I'll take a stab at it in the next few days.

Wow, that would be great! I'm not even certain it's possible for gzip, so don't devote too much effort :). Here is a brief description for bzip2: https://en.wikipedia.org/wiki/Bzip2#File_format ("bzip2 files can be decompressed in parallel", although it isn't quite trivial!).
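A sketch of finding those (non-byte-aligned) bzip2 block markers by sliding a 48-bit window over the bitstream; pure Python and slow, and the magic can in principle also occur inside compressed payload, so hits would need verification:

BZ2_BLOCK_MAGIC = 0x314159265359  # 48-bit "pi" marker preceding each block

def bzip2_block_bit_offsets(data: bytes):
    """Return the bit offset of every block-magic occurrence in data."""
    offsets = []
    window = 0
    mask = (1 << 48) - 1
    bitpos = 0
    for byte in data:
        for shift in range(7, -1, -1):  # most-significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & mask
            bitpos += 1
            if bitpos >= 48 and window == BZ2_BLOCK_MAGIC:
                offsets.append(bitpos - 48)  # bit offset of the marker
    return offsets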


milesgranger commented on August 26, 2024

From what I can tell so far, if we are working with multiple gzip streams in a file (which it appears gzip will still treat the same during decompression), then it is pretty straightforward to accomplish this. GZIP has good header and footer values for finding the delimited streams within the entire data.

I am certain that's not what you want, though; rather, you want to find the delimited DEFLATE blocks between the header and footer that GZIP uses. That option seems not to be possible. Although DEFLATE has a header, it's only 3 bits, and every bit combination occurs anyway when looking at the bytes as bits, so every offset will look like the start of a new block. To properly delimit the blocks, we'd need to process/parse the Huffman tree used for the block's compression, and by that time we're probably losing sight of the overall objective.

I may poke a bit more, but natively it seems GZIP won't support this, as you suspected. The data would need to be compressed with multiple streams, or with some metadata regarding block placement, to achieve the efficiency we're looking for.


martindurant commented on August 26, 2024

Can you give examples of "stream" size versus "block" size for some data?

milesgranger commented on August 26, 2024

With two streams like this:

>>> import gzip
>>> data = gzip.compress(b'hello')
>>> data += gzip.compress(b', world')
# Note two streams, as indicated by the repeated header, specifically b'\x1f\x8b';
# the actual compressed 'blocks' are DEFLATE
>>> data
b'\x1f\x8b\x08\x00\x03\x85\x01a\x02\xff\xcbH\xcd\xc9\xc9\x07\x00\x86\xa6\x106\x05\x00\x00\x00\x1f\x8b\x08\x00\r\x85\x01a\x02\xff\xd3Q(\xcf/\xcaI\x01\x00\xfeo\x88n\x07\x00\x00\x00'
>>> gzip.decompress(data)
b'hello, world'

It would be fairly straightforward to find where each stream starts and ends, because the gzip header and footer contain enough information to find the offsets of the streams, namely the magic gzip-specific 2-byte signature indicating the start of a stream. And as you can see, if a file/dataset contains multiple streams, it decompresses the same as normal.
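A sketch of that scan using only the standard library: zlib in gzip mode stops at the end of one member, and whatever is left over lands in unused_data, which is where the next member begins:

import zlib

def gzip_member_offsets(data: bytes):
    # Returns (compressed_start, compressed_end, uncompressed_length)
    # for each gzip member. This is one full decompression pass, but
    # afterwards any member can be decompressed independently.
    members = []
    pos = 0
    while pos < len(data):
        d = zlib.decompressobj(wbits=31)  # 31 = gzip header/trailer mode
        out = d.decompress(data[pos:])
        out += d.flush()
        end = len(data) - len(d.unused_data)
        members.append((pos, end, len(out)))
        pos = end
    return members

Run on the two-stream example above, this yields two members whose start offsets line up with the repeated b'\x1f\x8b' headers.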

However, most files are likely a single stream, so I'm not sure how much this would actually help in your situation: the file(s) would need to have multiple streams, and when seeking, as in your example, you'd probably need to ensure you grabbed at least one entire stream.


milesgranger commented on August 26, 2024

examples of "stream" size verus "block" size for some data?

The stream size would be the entire length of the bytes if there is a single stream, or the length up to the next offset, as you probably guessed. Block size would need to be derived from the tree encoded in the DEFLATE block, which holds the symbol indicating the end of the block.


martindurant commented on August 26, 2024

Ah, so if a user just ran gzip on the command line for some arbitrary file, you always get just the one stream?

Block size would need to be derived from the tree encoded in the DEFLATE block, which holds the symbol indicating the end of the block.

So it can, in theory, be done, but it's hard!


milesgranger commented on August 26, 2024

Ah, so if a user just ran gzip on the command line for some arbitrary file, you always get just the one stream?

Correct.

So it can, in theory, be done, but it's hard!

Seems like it; it's interesting anyway. But ya, I think it's at least a separate project, as it's outside the scope of this one, though it could be integrated if it worked. Also, bzip2 seems at least equally tricky. πŸ˜…


martindurant commented on August 26, 2024

I happened to read up on Zstd's format, and its internal frames and blocks are easily the sanest and most readily processed so far.

  • Zstd: good support, could find all blocks and probably get decent random-access (see the sketch after this list)
  • xz (lzma): blocks format is optional at compression time, but easy to find if present
  • bzip2: has block markers which it should be possible to find, but they are not byte-aligned (i.e., 8 possible bit offsets)
  • snappy: consists of frames, which can be found, but perhaps only one frame per file
  • gz: very hard
  • brotli: not used for files?
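For the Zstd bullet, a sketch using the third-party zstandard package (cramjam could equally handle the per-frame de/compression): each chunk is compressed as its own frame, and frame starts are located via the 4-byte magic. A real implementation would parse frame headers instead of pattern-matching, since the magic could in principle also occur inside a payload:

import zstandard as zstd

cctx = zstd.ZstdCompressor()
chunks = [b"alpha" * 1000, b"beta" * 1000, b"gamma" * 1000]
# One frame per chunk; concatenated frames are still a valid zstd file.
blob = b"".join(cctx.compress(c) for c in chunks)

# Every frame starts with the little-endian magic 0xFD2FB528.
offsets = []
pos = blob.find(b"\x28\xb5\x2f\xfd")
while pos != -1:
    offsets.append(pos)
    pos = blob.find(b"\x28\xb5\x2f\xfd", pos + 4)
offsets.append(len(blob))

# Decompress just the middle frame, independently of its neighbours.
dctx = zstd.ZstdDecompressor()
assert dctx.decompress(blob[offsets[1]:offsets[2]]) == chunks[1]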

The more I read, the more I think that cramjam is probably not too interested in contemplating file-specific operations; better to do de/compression well and limit the scope.


milesgranger commented on August 26, 2024

Nice, we started with the hardest one. πŸ˜†
I agree: although I'm interested in the process itself, it does feel out of scope for this project's target objectives. It would be interesting to see if it could be integrated another way, though; I'll give it some thought, and feel free to make suggestions. Maybe it's isolated enough to just do it within fastparquet; I suppose one could still use cramjam for handling the individual blocks/frames once they're found?

