Code Monkey home page Code Monkey logo

Comments (31)

jbms avatar jbms commented on May 15, 2024 2

I do think "zip" support is a very useful addition to tensorstore, and I'm sure many other users would also make use of it.

But a key design principle of Tensorstore is to equally support both the local storage and cloud storage. The issue isn't about adding support for specific cloud storage protocols to the zip driver, like https, gcs, s3. Rather, issue is that the zip kvstore driver should be an adapter on top of a base kvstore, like the neuroglancer_uint64_sharded or ocdbt kvstores, and be agnostic to the underlying storage:

https://google.github.io/tensorstore/kvstore/neuroglancer_uint64_sharded/index.html
https://google.github.io/tensorstore/kvstore/ocdbt/index.html

For example, to access "path/in/zip" within a zip file at c:\my\zip\file.zip would use a spec of:

{"driver": "zip", "path": "path/in/zip", "base": {"driver":"file", "path": "c:\\my\\zip\\file.zip"}}

while to access gcs, it would be:

{"driver": "zip", "path": "path/in/zip", "base": {"driver":"gcs", "bucket": "my-bucket", "path": "path/to/file.zip"}}

The zip driver itself would merely use the kvstore::Driver interface and need not concern itself with what the underlying storage is at all.

Your current implementation is a great proof of concept for adding zip support to tensorstore, but broadly the issues with it currently are:

  • The JSON spec does not match other kvstore in tensorstore that serve as adapters on some underlying kvstore, and doesn't lend itself to support other underlying storage in the future.
  • The implementation is also specifically designed around accessing the local filesystem directly, and can't really be incrementally changed to do all I/O through the kvstore::Driver interface.

from tensorstore.

laramiel avatar laramiel commented on May 15, 2024 2

With commit 4556546 I added rudimentary reading of ZIP files; rudimentary because some compression formats not supported, nor did I implement PKAES support.

There don't appear to be ZIP libraries which support nice async interfaces, so most of this is bespoke.

from tensorstore.

jbms avatar jbms commented on May 15, 2024 1

This commit (not pushed to the main branch) shows how it can be done:

jbms@5d8ce40

This seems to work on Linux, I haven't tested on Windows but you can probably get it to work with some minor tweaks if necessary. It could use some refinement for the additional optional functionality but should be enough for you to get started.

from tensorstore.

jbms avatar jbms commented on May 15, 2024 1

I just pushed a new version to this branch https://github.com/jbms/tensorstore/tree/add-minizip-ng

I had forgotten to specify hdrs and strip_include_prefix.

from tensorstore.

jbms avatar jbms commented on May 15, 2024 1

The minizip_ng.BUILD.bazel file needed to be updated to add the bzip2 and lzma dependencies. I pushed an updated version.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

Can I do something to help this along?

from tensorstore.

jbms avatar jbms commented on May 15, 2024

You could certainly take a stab at implementing it. The closest existing driver is the neuroglancer_uiny64_sharded driver:

https://github.com/google/tensorstore/blob/master/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc

And we'd be happy to provide guidance.

Due to the need to operate asynchronously, most likely existing zip libraries would not be directly usable. Personally I would probably use the Python standard library zipfile.py implementation as a reference.

The simplest way to implement write support would be the same as for neuroglancer_uiny64_sharded: just rewrite the entire zip file for each update. Using tensorstore transactions you can ensure the zip file is only written once. Append support could be added later but would be an added complication.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

As I believe that typical use for zip files is to write once, "rewrite the entire zip file for each update" should be totally fine.

I will first deal with JSON metadata (image origin, spacing, etc.), before I can pay attention to ZIP support. I never dealt with Bazel build system before, so I know nothing about it. minizip-ng uses CMake, is that a problem for integrating into tensorstore? With extra compression methods disabled, it should not need further dependencies.

from tensorstore.

jbms avatar jbms commented on May 15, 2024

TensorStore already depends on zlib, bzip, and liblzma, so supporting those compression methods would not require any additional dependencies.

To integrate minizip-ng we would need a bazel build for it, see e.g. the build for zlib:
https://github.com/google/tensorstore/tree/master/third_party/net_zlib

There we have listed the source files explicitly, but in many cases you can just use the glob function (https://bazel.build/reference/be/functions#glob).

Then you can run https://github.com/google/tensorstore/blob/master/third_party/update_third_party_bzl.py.

For our cmake build we can either use the native cmake build provided by the package, if there is one, or just use the bazel one converted to cmake via our bazel_to_cmake tool. In most cases it works better to just use the bazel build even if there is a native cmake build, because most cmake builds aren't designed to work properly as subprojects.

I'm not sure how helpful minizip-ng will be, though, given that it appears to be based on a synchronous stream interface but for tensorstore we will need an asynchronous interface in order to be able to efficiently read multiple archive members in parallel. For writing, assuming we write the entire file at once, it could probably be used.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

The only other library I know of is libzip. It also uses CMake, so I expect it will be no easier to integrate. I doubt it is asynchronous either. Is there a more suitable option for tensorstore?

from tensorstore.

jbms avatar jbms commented on May 15, 2024

What I had in mind was to just implement the zip support "from scratch" (but still using zlib, etc. for the actual compression).

Basically for reading:

  • Locate and decode the central directory, which contains the list of all archive members. This would be cached using tensorstore's AsyncCache framework. There is some complexity in supporting zip64, etc. but overall I don't think it is too complicated to decode, especially when using e.g. the existing Python zipfile.py as a reference.
  • When there is a request to read a key, find it in the central directory, read the corresponding byte range, and then decode it if compressed.

For writing:

  • Using an in-memory buffer, write each key, either compressed or not.
  • Append the central directory and related footers at the end.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

Implementing zip "from scratch" seems like a lot of work. Would it make sense to add a basic, low-performance support first, and if there are resources later replace it by a better performing implementation?

zlib build setup looks simple. Libraries with dependencies look more complicated, e.g. libAOM or gRPC. A reasonable starting help for me would be for you to add minizip-ng into tensorstore's build system, and point me to places in existing code where I need to make changes to add zip support.

from tensorstore.

jbms avatar jbms commented on May 15, 2024

Yes, you could make a simpler version that always reads the entire zip file into memory. Then you could easily use an existing library like minizip-ng.

To add support to tensorstore, you could start by duplicating kvstore/neuroglancer_uint64_sharded to kvstore/zip and then modifying it to use the zip format. The non-simple version would be very similar indeed to neuroglancer_uint64_sharded --- the simple version would be simpler.

Agreed that adding a third-party dependency can be challenging --- not sure if we will have time to do that for minizip-ng though.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

You don't need to add all third party dependencies. Just configure-enable the ones which are already bundled (zlib, zstd etc) and correctly point minizip-ng to them.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

I would like to get started on this, as a way to accomplish InsightSoftwareConsortium/ITKIOOMEZarrNGFF#24. Any chance @jbms or @laramiel will add minizip-ng to tensorstore's build system? We don't need a lot of the stuff from it.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

I am giving it a try.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

This does not seem to work. I tried adding #include "mz.h" in neuroglancer_uint64_sharded.cc, which produces:

dzenan@corista:~/zarr/ts-deb$ ninja && ctest -j3
[1/11] Building CXX object CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_...nsorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o
FAILED: CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_sharded.dir/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o 
/usr/bin/c++ -DCHROMIUM_ZLIB_NO_CHROMECONF -I/home/dzenan/zarr/ts-deb -I/home/dzenan/zarr/tensorstore -I/home/dzenan/zarr/ts-deb/_deps/absl-src -I/home/dzenan/zarr/ts-deb/_deps/riegeli-build -I/home/dzenan/zarr/ts-deb/_deps/riegeli-src -I/home/dzenan/zarr/ts-deb/_deps/nlohmann_json-build -I/home/dzenan/zarr/ts-deb/_deps/nlohmann_json-build/include -I/home/dzenan/zarr/ts-deb/_deps/nlohmann_json-src -I/home/dzenan/zarr/ts-deb/_deps/nlohmann_json-src/include -I/home/dzenan/zarr/ts-deb/_deps/half-build -I/home/dzenan/zarr/ts-deb/_deps/half-build/include -I/home/dzenan/zarr/ts-deb/_deps/half-src -I/home/dzenan/zarr/ts-deb/_deps/half-src/include -I/home/dzenan/zarr/ts-deb/_deps/zlib-build -I/home/dzenan/zarr/ts-deb/_deps/zlib-build/. -I/home/dzenan/zarr/ts-deb/_deps/zlib-src -I/home/dzenan/zarr/ts-deb/_deps/zlib-src/. -I/home/dzenan/zarr/ts-deb/_deps/zlib-build/contrib/optimizations -I/home/dzenan/zarr/ts-deb/_deps/zlib-src/contrib/optimizations -g -fPIC -Wno-deprecated-declarations -Wno-sign-compare -Wno-unused-but-set-parameter -Wno-maybe-uninitialized -Wno-unknown-warning-option -fsized-deallocation -MD -MT CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_sharded.dir/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o -MF CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_sharded.dir/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o.d -o CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_sharded.dir/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o -c /home/dzenan/zarr/tensorstore/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc
/home/dzenan/zarr/tensorstore/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc:49:10: fatal error: mz.h: No such file or directory
   49 | #include "mz.h"
      |          ^~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.

Error message on Windows is similar.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

Also, there is no mention of minizip in target list or in the _deps sub-directory inside build directory.

from tensorstore.

jbms avatar jbms commented on May 15, 2024

You need to add "@minizip_ng//:minizip_ng" to the deps list of the specific target (see the BUILD file) in order for it to be available.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

Doing that gets me further. This add MINIZIP_minizip_ng target to the build system. But it doesn't build:

Build started...
1>------ Build started: Project: MINIZIP_minizip_ng, Configuration: Release x64 ------
1>mz_strm_bzip.c
1>C:\Misc\tensorstore-vs22\_deps\minizip-src\mz_strm_bzip.c(15,10): fatal  error C1083: Cannot open include file: 'bzlib.h': No such file or directory
1>mz_strm_lzma.c
1>C:\Misc\tensorstore-vs22\_deps\minizip-src\mz_strm_lzma.c(15,10): fatal  error C1083: Cannot open include file: 'lzma.h': No such file or directory
1>Generating Code...
1>Done building project "MINIZIP_minizip_ng.vcxproj" -- FAILED.
========== Build: 0 succeeded, 1 failed, 4 up-to-date, 0 skipped ==========
========== Elapsed 00:01.176 ==========

Adding

        "//tensorstore/internal/compression:bzip2_compressor",
        "//tensorstore/internal/compression:zlib_compressor",

to deps does not help. See my fork.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

I decided to base my zip kvstore drive off of memory. It is way simpler.

I now a have a version of code which can read a zarr file from a zip store. Now is a great time for feedback.

from tensorstore.

jbms avatar jbms commented on May 15, 2024

Great that you have an initial working version --- I know the tensorstore codebase is rather complicated.

A few high-level comments:

  • We would want a zip driver in tensorstore to work with any underlying kvstore (file, http, gcs, memory, etc.), for accessing the zip file, rather than just directly accessing the zip file using the local filesystem and only supporting that.
  • The memory driver stores its data in the Context because there is no other place to persist the data, and the whole point is to store the data in memory. However, I see that you are using a similar pattern for the zip driver: https://github.com/dzenanz/tensorstore/blob/81156bf313a8d26028420e76e5a3e237c097f5a0/tensorstore/kvstore/zip/zip_key_value_store.cc#L347 Was that intended to solve a specific problem you ran into, or did you just keep that because you based the driver off of the memory kvstore driver?

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

I do want to support in-memory zip file. But I don't know how to support all the kv store backends.

There are still many leftovers from the memory driver, but I intend to remove them over time (the values member of ZipEncapsulator and related methods). Judging by the comments in tensorstore, I thought that context/resource is a recommended pattern.

You are very welcome to make simplifications/improvements after I have a functional zip support. It would be even better if you took over this work now 😄

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

I have basic write support now, via dzenanz@9b51dee. It needs to be coupled with deleting an existing zip file to work properly.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

I now have an implementation in my fork which allows ITK zarr IO to function. It can read and write zips, both from disk and in memory. It does not support update and deletion of keys. I assume it also begs for simplification and refactoring, as I was trying to avoid any complication from tensorstore machinery (which I am clueless about).

Can someone take a look at it here: dzenanz@03143ad. How close or far away it is from being accepted as a contribution? What are the most urgent cleanups?

from tensorstore.

jbms avatar jbms commented on May 15, 2024

The key missing piece for us to be able to accept it as a contribution is for it to work with any base kvstore, rather than accessing the filesystem directly --- that would allow, for example, someone to use it with a zip file stored on an HTTP server, GCS, etc.

Ideally, it would only read the portions of the zip file that are actually required, e.g. the central directory and the specific entries that are requested. However, I expect that will be quite challenging to support with minizip-ng.

A simpler alternative would be to always read the entire zip file into memory, and then manipulate it with minizip-ng as an in-memory zip. That would mean that the interface allows an arbitrary base kvstore, and in the future we could add partial I/O support.

I suppose the key question is whether this would actually serve your needs, or if the partial I/O, as minzip-ng provides for the local filesystem, is important. Potentially we could use mmap with the file kvstore driver, to effectively support partial I/O while still treating the entire zip as in-memory.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

While reading an entire zip into memory would be fine for smaller files, it would totally defeat the point of supporting large, multi-resolution images. I don't know whether minizip-ng would support files larger than 2GB (in general). It would not support them for in-memory operations. So the point of large images might be moot.

I still haven't started looking into supporting cloud. What are the additional restrictions zip kvstore would need to have to support layering on top of GCS or HTTP server?

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

Incremental approach would work for us. Start with what we have (now it is in https://github.com/InsightSoftwareConsortium/tensorstore), and expand support for more kv stores (e.g. HTTP, GCS, S3) later.

A possible later improvement would include reading the central directory and json files into memory, and chunks on-demand via file offsets (assuming no zip compression is applied).

We already have cloud support on our TODO list, here.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

I would like to do a bit of cleanup and create a pull request. @jbms @laramiel Can you make some comments? Earlier questions up in the thread: #57 (comment). We already merged the changes in ITK's Zarr module.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

Thank you - now it is a lot clearer to me what is desired.

from tensorstore.

dzenanz avatar dzenanz commented on May 15, 2024

That sounds like a lot of additional effort. In the foreseeable future, we will keep these changes in our fork: https://github.com/InsightSoftwareConsortium/tensorstore. Thank you for support during this development!

from tensorstore.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.