Comments (6)
When overwriting a shard, tensorstore assumes there is no existing shard (this saves one read request in the common case of no existing shard). It performs a write conditioned on the key not existing (which still requires a full upload of the data), gets back an error, reads the existing shard, merges the changes, and then rewrites it. Since that involves 2 uploads and 1 download of the entire shard, I would expect it to take roughly 3 times as long.
Your suggestion of detecting that the entire shard is being rewritten, and then performing the write unconditionally, sounds like the best solution. I will look into implementing that, hopefully in the next couple of days.
The first of the two uploads could be eliminated by instead checking if there is existing data first, but that would introduce one extra read operation in the common case of no existing data and would still be 2x the normal cost.
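To make the cost accounting above concrete, here is a minimal sketch that counts full-shard transfers for the three strategies discussed: the optimistic conditional write, checking for existing data first, and an unconditional write. `CountingStore` and the `write_*` functions are hypothetical names invented for this illustration; they are not tensorstore APIs, and the real writeback logic is more involved (for example, the final rewrite would normally be conditioned on the shard's generation).

```python
from dataclasses import dataclass, field


@dataclass
class CountingStore:
    """Hypothetical in-memory key-value store that counts shard transfers."""
    data: dict = field(default_factory=dict)
    uploads: int = 0        # full-shard uploads attempted
    downloads: int = 0      # full-shard downloads performed
    read_requests: int = 0  # read requests issued, including "not found"

    def read(self, key):
        self.read_requests += 1
        if key not in self.data:
            return None  # "not found" costs a request but no download
        self.downloads += 1
        return dict(self.data[key])

    def write(self, key, shard, if_not_exists=False):
        # The payload is uploaded in full even if the precondition then fails.
        self.uploads += 1
        if if_not_exists and key in self.data:
            return False
        self.data[key] = dict(shard)
        return True


def write_optimistic(store, key, chunks):
    """Assume no shard exists; on conflict, read, merge, and rewrite."""
    if store.write(key, chunks, if_not_exists=True):
        return
    merged = store.read(key)
    merged.update(chunks)
    store.write(key, merged)  # simplification: no generation check here


def write_check_first(store, key, chunks):
    """Read first, then write the merged result."""
    merged = store.read(key) or {}
    merged.update(chunks)
    store.write(key, merged)


def write_unconditional(store, key, chunks):
    """Valid only when `chunks` covers every chunk in the shard."""
    store.write(key, chunks)


def transfers(strategy, shard_exists):
    """Return (uploads + downloads, read_requests) for one shard write."""
    store = CountingStore()
    if shard_exists:
        store.data["shard"] = {0: b"old", 1: b"old"}
    strategy(store, "shard", {0: b"new", 1: b"new"})
    return store.uploads + store.downloads, store.read_requests
```

Under these assumptions, overwriting an existing shard costs three full transfers with the optimistic strategy, two with check-first, and one with the unconditional write. When no shard exists, check-first is the only strategy that issues a read request.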
There isn't an API at the moment for getting the shard path for a given chunk. In the future I do plan to add APIs for retrieving the preferred grid for reading and writing a volume, which would make it easier to perform shard-aligned writes.
from tensorstore.
As an update, I am still working on fixing this issue fully --- it turned out to be trickier than expected. I implemented the approach outlined in the prior comment of doing an unconditional write in the case that all chunks are being written (without any preconditions). An additional fix was needed to also handle the case where the volume was not an exact multiple of the chunk size --- previously, the resultant partial "edge" chunks were not eligible for unconditional writeback.
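The edge-chunk case can be illustrated with a small sketch. It assumes a regular chunk grid and half-open `(start, stop)` bounds per dimension; the function names are hypothetical and this is not tensorstore's actual implementation. The point is only that a chunk's bounds need to be clipped to the volume before testing whether a write fully covers it.

```python
def chunk_bounds(index, chunk_shape, volume_shape, clip=True):
    """Half-open bounds of the chunk at `index`, optionally clipped to the volume."""
    bounds = []
    for i, c, v in zip(index, chunk_shape, volume_shape):
        stop = (i + 1) * c
        bounds.append((i * c, min(stop, v) if clip else stop))
    return bounds


def fully_covered(index, write_box, chunk_shape, volume_shape, clip=True):
    """True if `write_box` overwrites every voxel of the chunk at `index`."""
    return all(
        w_lo <= lo and hi <= w_hi
        for (lo, hi), (w_lo, w_hi) in zip(
            chunk_bounds(index, chunk_shape, volume_shape, clip), write_box
        )
    )


def can_write_unconditionally(shard_chunks, write_box, chunk_shape, volume_shape):
    """A shard may skip the read-merge step only if all of its chunks are covered."""
    return all(
        fully_covered(index, write_box, chunk_shape, volume_shape)
        for index in shard_chunks
    )


# A 1-D volume of 10 voxels with chunks of 4: the last chunk spans only [8, 10).
volume_shape, chunk_shape = (10,), (4,)
whole_volume = [(0, 10)]
edge_chunk = (2,)
```

With clipping, the partial edge chunk counts as fully covered by a write of the whole volume. Without clipping, its nominal bounds `[8, 12)` extend past the write and it would be rejected, which matches the behaviour described before the additional fix.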
These fixes essentially resolved this specific issue. While testing them, however, I found a race condition: writeback may start too early, before all of the pending writes have been flushed from one cache to another, which leads to a similar inefficiency, and there isn't really any way to reliably avoid that race. To address this problem in a clean way, I'm working on implementing a transaction system that would allow deferring writeback and then atomically committing the writes to a shard.
Thanks for spending time on this ... it is definitely going to be immensely helpful for the workflow that I am building now.
Not sure if this will help, but one concept that has worked well for us in DVID ingestion is to have an 'unsafe' mode. Basically a flag that will disable a couple critical features for ensuring data consistency in favor of speed and with the assumption that only an expert would enable this flag.
Jeremy: does the current build have this half fix? (I was planning to do another ingestion run with my new EM ingestion service using your recent fix for the sharded/unsharded spec from another issue and wanted to know if I should expect side effects from this issue)
The current build does not have the half fix, though if you think it would be useful I can try to get that pushed out later today. I am making good progress on the full fix, though it is a larger change.
I don't think a week will matter too much if you will have it by then. Thanks again!!