Comments (31)
I do think "zip" support is a very useful addition to tensorstore, and I'm sure many other users would also make use of it.
But a key design principle of TensorStore is to support local storage and cloud storage equally. The issue isn't about adding support for specific cloud storage protocols to the zip driver, like https, gcs, s3. Rather, the issue is that the zip kvstore driver should be an adapter on top of a base kvstore, like the neuroglancer_uint64_sharded or ocdbt kvstores, and be agnostic to the underlying storage:
https://google.github.io/tensorstore/kvstore/neuroglancer_uint64_sharded/index.html
https://google.github.io/tensorstore/kvstore/ocdbt/index.html
For example, to access "path/in/zip" within a zip file at c:\my\zip\file.zip, you would use a spec of:
{"driver": "zip", "path": "path/in/zip", "base": {"driver":"file", "path": "c:\\my\\zip\\file.zip"}}
while to access gcs, it would be:
{"driver": "zip", "path": "path/in/zip", "base": {"driver":"gcs", "bucket": "my-bucket", "path": "path/to/file.zip"}}
The zip driver itself would merely use the kvstore::Driver interface and need not concern itself with what the underlying storage is at all.
Your current implementation is a great proof of concept for adding zip support to tensorstore, but broadly the issues with it currently are:
- The JSON spec does not match other kvstores in tensorstore that serve as adapters on some underlying kvstore, and doesn't lend itself to supporting other underlying storage in the future.
- The implementation is also specifically designed around accessing the local filesystem directly, and can't really be incrementally changed to do all I/O through the kvstore::Driver interface.
from tensorstore.
With commit 4556546 I added rudimentary reading of ZIP files; rudimentary because some compression formats are not supported, nor did I implement PKAES support.
There don't appear to be ZIP libraries which support nice async interfaces, so most of this is bespoke.
from tensorstore.
This commit (not pushed to the main branch) shows how it can be done:
This seems to work on Linux; I haven't tested on Windows, but you can probably get it to work with some minor tweaks if necessary. It could use some refinement for the additional optional functionality, but should be enough for you to get started.
from tensorstore.
I just pushed a new version to this branch https://github.com/jbms/tensorstore/tree/add-minizip-ng
I had forgotten to specify hdrs and strip_include_prefix.
from tensorstore.
The minizip_ng.BUILD.bazel file needed to be updated to add the bzip2 and lzma dependencies. I pushed an updated version.
from tensorstore.
Can I do something to help this along?
from tensorstore.
You could certainly take a stab at implementing it. The closest existing driver is the neuroglancer_uint64_sharded driver.
And we'd be happy to provide guidance.
Due to the need to operate asynchronously, most likely existing zip libraries would not be directly usable. Personally I would probably use the Python standard library zipfile.py implementation as a reference.
The simplest way to implement write support would be the same as for neuroglancer_uint64_sharded: just rewrite the entire zip file for each update. Using tensorstore transactions, you can ensure the zip file is only written once. Append support could be added later but would be an added complication.
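For the record, the rewrite-everything approach is small enough to sketch with Python's standard zipfile module (a minimal illustration with a hypothetical helper name, not the tensorstore implementation):

```python
import io
import zipfile

def update_zip(data, updates):
    """Apply `updates` (a dict of key -> bytes) by rewriting the whole archive.

    `data` is the current archive contents (possibly empty); returns the new
    archive bytes. A transaction would ensure this rewrite happens only once.
    """
    entries = {}
    if data:
        # Read every existing member into memory.
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for name in zf.namelist():
                entries[name] = zf.read(name)
    entries.update(updates)
    # Write a fresh archive containing the merged set of members.
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, value in entries.items():
            zf.writestr(name, value)
    return out.getvalue()
```

The obvious cost is that every update is O(archive size), which is why this only makes sense for write-once or write-rarely workloads.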
from tensorstore.
As I believe that the typical use of zip files is to write once, "rewrite the entire zip file for each update" should be totally fine.
I will first deal with JSON metadata (image origin, spacing, etc.) before I can pay attention to ZIP support. I have never dealt with the Bazel build system before, so I know nothing about it. minizip-ng uses CMake; is that a problem for integrating into tensorstore? With extra compression methods disabled, it should not need further dependencies.
from tensorstore.
TensorStore already depends on zlib, bzip2, and liblzma, so supporting those compression methods would not require any additional dependencies.
To integrate minizip-ng we would need a bazel build for it, see e.g. the build for zlib:
https://github.com/google/tensorstore/tree/master/third_party/net_zlib
There we have listed the source files explicitly, but in many cases you can just use the glob function (https://bazel.build/reference/be/functions#glob).
Then you can run https://github.com/google/tensorstore/blob/master/third_party/update_third_party_bzl.py.
For our cmake build we can either use the native cmake build provided by the package, if there is one, or just use the bazel one converted to cmake via our bazel_to_cmake tool. In most cases it works better to just use the bazel build even if there is a native cmake build, because most cmake builds aren't designed to work properly as subprojects.
I'm not sure how helpful minizip-ng will be, though, given that it appears to be based on a synchronous stream interface but for tensorstore we will need an asynchronous interface in order to be able to efficiently read multiple archive members in parallel. For writing, assuming we write the entire file at once, it could probably be used.
from tensorstore.
The only other library I know of is libzip. It also uses CMake, so I expect it will be no easier to integrate. I doubt it is asynchronous either. Is there a more suitable option for tensorstore?
from tensorstore.
What I had in mind was to just implement the zip support "from scratch" (but still using zlib, etc. for the actual compression).
Basically for reading:
- Locate and decode the central directory, which contains the list of all archive members. This would be cached using tensorstore's AsyncCache framework. There is some complexity in supporting zip64, etc. but overall I don't think it is too complicated to decode, especially when using e.g. the existing Python zipfile.py as a reference.
- When there is a request to read a key, find it in the central directory, read the corresponding byte range, and then decode it if compressed.
For writing:
- Using an in-memory buffer, write each key, either compressed or not.
- Append the central directory and related footers at the end.
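The first reading step (locating the central directory) is compact enough to sketch directly. This is a minimal illustration following zipfile.py's approach, with a hypothetical helper name; it ignores zip64 and operates on an in-memory bytes object where a real driver would issue a suffix byte-range read through the base kvstore:

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"  # end-of-central-directory record signature

def find_central_directory(data):
    """Return (cd_offset, cd_size, num_entries) for a plain (non-zip64) ZIP.

    Scans backwards over the optional trailing comment, as zipfile.py does.
    """
    # The EOCD record is 22 fixed bytes plus a comment of at most 64 KiB,
    # so only the tail of the file needs to be read.
    tail = data[-(22 + 65536):]
    pos = tail.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("EOCD record not found: not a ZIP file?")
    # Layout: sig, 4 uint16 (disk numbers / entry counts),
    # 2 uint32 (central directory size and offset), comment length.
    fields = struct.unpack("<4s4H2LH", tail[pos:pos + 22])
    num_entries, cd_size, cd_offset = fields[4], fields[5], fields[6]
    return cd_offset, cd_size, num_entries
```

Once the central directory's offset and size are known, a second byte-range read fetches it, and its decoded entries give the byte range of each member for step two.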
from tensorstore.
Implementing zip "from scratch" seems like a lot of work. Would it make sense to add basic, low-performance support first, and replace it with a better performing implementation later if there are resources?
The zlib build setup looks simple. Libraries with dependencies look more complicated, e.g. libAOM or gRPC. A reasonable way to help me get started would be for you to add minizip-ng to tensorstore's build system, and to point me to the places in existing code where I need to make changes to add zip support.
from tensorstore.
Yes, you could make a simpler version that always reads the entire zip file into memory. Then you could easily use an existing library like minizip-ng.
To add support to tensorstore, you could start by duplicating kvstore/neuroglancer_uint64_sharded to kvstore/zip and then modifying it to use the zip format. The non-simple version would be very similar indeed to neuroglancer_uint64_sharded --- the simple version would be simpler.
Agreed that adding a third-party dependency can be challenging --- not sure if we will have time to do that for minizip-ng though.
from tensorstore.
You don't need to add all the third-party dependencies. Just enable the ones which are already bundled (zlib, zstd, etc.) in the configuration and correctly point minizip-ng to them.
from tensorstore.
I would like to get started on this, as a way to accomplish InsightSoftwareConsortium/ITKIOOMEZarrNGFF#24. Any chance @jbms or @laramiel will add minizip-ng to tensorstore's build system? We don't need a lot of the stuff from it.
from tensorstore.
I am giving it a try.
from tensorstore.
This does not seem to work. I tried adding #include "mz.h" in neuroglancer_uint64_sharded.cc, which produces:
dzenan@corista:~/zarr/ts-deb$ ninja && ctest -j3
[1/11] Building CXX object CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_...nsorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o
FAILED: CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_sharded.dir/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o
/usr/bin/c++ -DCHROMIUM_ZLIB_NO_CHROMECONF -I/home/dzenan/zarr/ts-deb -I/home/dzenan/zarr/tensorstore -I/home/dzenan/zarr/ts-deb/_deps/absl-src -I/home/dzenan/zarr/ts-deb/_deps/riegeli-build -I/home/dzenan/zarr/ts-deb/_deps/riegeli-src -I/home/dzenan/zarr/ts-deb/_deps/nlohmann_json-build -I/home/dzenan/zarr/ts-deb/_deps/nlohmann_json-build/include -I/home/dzenan/zarr/ts-deb/_deps/nlohmann_json-src -I/home/dzenan/zarr/ts-deb/_deps/nlohmann_json-src/include -I/home/dzenan/zarr/ts-deb/_deps/half-build -I/home/dzenan/zarr/ts-deb/_deps/half-build/include -I/home/dzenan/zarr/ts-deb/_deps/half-src -I/home/dzenan/zarr/ts-deb/_deps/half-src/include -I/home/dzenan/zarr/ts-deb/_deps/zlib-build -I/home/dzenan/zarr/ts-deb/_deps/zlib-build/. -I/home/dzenan/zarr/ts-deb/_deps/zlib-src -I/home/dzenan/zarr/ts-deb/_deps/zlib-src/. -I/home/dzenan/zarr/ts-deb/_deps/zlib-build/contrib/optimizations -I/home/dzenan/zarr/ts-deb/_deps/zlib-src/contrib/optimizations -g -fPIC -Wno-deprecated-declarations -Wno-sign-compare -Wno-unused-but-set-parameter -Wno-maybe-uninitialized -Wno-unknown-warning-option -fsized-deallocation -MD -MT CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_sharded.dir/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o -MF CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_sharded.dir/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o.d -o CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_sharded.dir/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o -c /home/dzenan/zarr/tensorstore/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc
/home/dzenan/zarr/tensorstore/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc:49:10: fatal error: mz.h: No such file or directory
49 | #include "mz.h"
| ^~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Error message on Windows is similar.
from tensorstore.
Also, there is no mention of minizip in the target list or in the _deps sub-directory inside the build directory.
from tensorstore.
You need to add "@minizip_ng//:minizip_ng" to the deps list of the specific target (see the BUILD file) in order for it to be available.
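For illustration, assuming a hypothetical cc_library target for the zip driver (the target and file names here are made up, not from the tensorstore tree), the BUILD edit might look like:

```
cc_library(
    name = "zip",
    srcs = ["zip_key_value_store.cc"],  # hypothetical source file
    deps = [
        # makes mz.h visible on the include path
        "@minizip_ng//:minizip_ng",
    ],
)
```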
from tensorstore.
Doing that gets me further. This adds a MINIZIP_minizip_ng target to the build system. But it doesn't build:
Build started...
1>------ Build started: Project: MINIZIP_minizip_ng, Configuration: Release x64 ------
1>mz_strm_bzip.c
1>C:\Misc\tensorstore-vs22\_deps\minizip-src\mz_strm_bzip.c(15,10): fatal error C1083: Cannot open include file: 'bzlib.h': No such file or directory
1>mz_strm_lzma.c
1>C:\Misc\tensorstore-vs22\_deps\minizip-src\mz_strm_lzma.c(15,10): fatal error C1083: Cannot open include file: 'lzma.h': No such file or directory
1>Generating Code...
1>Done building project "MINIZIP_minizip_ng.vcxproj" -- FAILED.
========== Build: 0 succeeded, 1 failed, 4 up-to-date, 0 skipped ==========
========== Elapsed 00:01.176 ==========
Adding
"//tensorstore/internal/compression:bzip2_compressor",
"//tensorstore/internal/compression:zlib_compressor",
to deps does not help. See my fork.
from tensorstore.
I decided to base my zip kvstore driver off of memory. It is way simpler.
I now have a version of the code which can read a zarr file from a zip store. Now is a great time for feedback.
from tensorstore.
Great that you have an initial working version --- I know the tensorstore codebase is rather complicated.
A few high-level comments:
- We would want a zip driver in tensorstore to work with any underlying kvstore (file, http, gcs, memory, etc.), for accessing the zip file, rather than just directly accessing the zip file using the local filesystem and only supporting that.
- The memory driver stores its data in the Context because there is no other place to persist the data, and the whole point is to store the data in memory. However, I see that you are using a similar pattern for the zip driver: https://github.com/dzenanz/tensorstore/blob/81156bf313a8d26028420e76e5a3e237c097f5a0/tensorstore/kvstore/zip/zip_key_value_store.cc#L347 Was that intended to solve a specific problem you ran into, or did you just keep that because you based the driver off of the memory kvstore driver?
from tensorstore.
I do want to support in-memory zip files. But I don't know how to support all the kvstore backends.
There are still many leftovers from the memory driver, but I intend to remove them over time (the values member of ZipEncapsulator and related methods). Judging by the comments in tensorstore, I thought that context/resource is a recommended pattern.
You are very welcome to make simplifications/improvements after I have functional zip support. It would be even better if you took over this work now 😄
from tensorstore.
I have basic write support now, via dzenanz@9b51dee. It needs to be coupled with deleting an existing zip file to work properly.
from tensorstore.
I now have an implementation in my fork which allows ITK zarr IO to function. It can read and write zips, both from disk and in memory. It does not support updating or deleting keys. I assume it also begs for simplification and refactoring, as I was trying to avoid any complication from tensorstore machinery (which I am clueless about).
Can someone take a look at it here: dzenanz@03143ad? How close or far is it from being accepted as a contribution? What are the most urgent cleanups?
from tensorstore.
The key missing piece for us to be able to accept it as a contribution is for it to work with any base kvstore, rather than accessing the filesystem directly --- that would allow, for example, someone to use it with a zip file stored on an HTTP server, GCS, etc.
Ideally, it would only read the portions of the zip file that are actually required, e.g. the central directory and the specific entries that are requested. However, I expect that will be quite challenging to support with minizip-ng.
A simpler alternative would be to always read the entire zip file into memory, and then manipulate it with minizip-ng as an in-memory zip. That would mean that the interface allows an arbitrary base kvstore, and in the future we could add partial I/O support.
I suppose the key question is whether this would actually serve your needs, or if the partial I/O, as minizip-ng provides for the local filesystem, is important. Potentially we could use mmap with the file kvstore driver, to effectively support partial I/O while still treating the entire zip as in-memory.
from tensorstore.
While reading an entire zip into memory would be fine for smaller files, it would totally defeat the point of supporting large, multi-resolution images. I don't know whether minizip-ng would support files larger than 2GB (in general). It would not support them for in-memory operations. So the point of large images might be moot.
I still haven't started looking into supporting cloud storage. What additional restrictions would the zip kvstore need in order to support layering on top of GCS or an HTTP server?
from tensorstore.
An incremental approach would work for us. Start with what we have (now it is in https://github.com/InsightSoftwareConsortium/tensorstore), and expand support for more kvstores (e.g. HTTP, GCS, S3) later.
A possible later improvement would be to read the central directory and JSON files into memory, and to read chunks on demand via file offsets (assuming no zip compression is applied).
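That on-demand scheme can be illustrated for uncompressed (STORED) members. This sketch uses a hypothetical helper and Python's zipfile only to obtain the central-directory metadata, then fetches the member purely by byte offsets, the way a kvstore-backed driver would issue a range read:

```python
import io
import struct
import zipfile

def read_stored_entry(data, name):
    """Read one uncompressed member via byte offsets only.

    `data` stands in for the underlying store; a real driver would replace
    the slicing below with byte-range reads against the base kvstore.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        info = zf.getinfo(name)  # central-directory metadata only
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("this sketch handles uncompressed entries only")
    # A local file header (30 fixed bytes + name + extra field) precedes the
    # data; its name/extra lengths can differ from the central directory's,
    # so they must be re-read from the header itself.
    name_len, extra_len = struct.unpack(
        "<2H", data[info.header_offset + 26:info.header_offset + 30])
    start = info.header_offset + 30 + name_len + extra_len
    return data[start:start + info.file_size]
```

Compressed members work the same way, except the fetched byte range must then be fed through the matching decompressor.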
We already have cloud support on our TODO list, here.
from tensorstore.
I would like to do a bit of cleanup and create a pull request. @jbms @laramiel, can you make some comments? Earlier questions are up in the thread: #57 (comment). We have already merged the changes into ITK's Zarr module.
from tensorstore.
Thank you - now it is a lot clearer to me what is desired.
from tensorstore.
That sounds like a lot of additional effort. For the foreseeable future, we will keep these changes in our fork: https://github.com/InsightSoftwareConsortium/tensorstore. Thank you for your support during this development!
from tensorstore.