
Comments (6)

laramiel commented on May 30, 2024

When writing, a mask of the values to overwrite is sometimes tracked alongside the data. This error indicates that such a mask is present but there is no data.
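In other words (a conceptual sketch only, not tensorstore's actual internal types): a partial write to a chunk is tracked as a mask of modified elements plus an array of new values, and writeback expects that whenever the mask records a partial modification, the data array is present.

#include <cstdint>
#include <memory>
#include <vector>

// Conceptual sketch of the state the assertion guards (hypothetical types).
struct PartialChunkWrite {
  std::vector<bool> mask;           // true where the write touched an element
  std::unique_ptr<uint8_t[]> data;  // the new values for masked elements
};

// Invariant expected during writeback: if any mask element is set (a partial
// modification), then data must be non-null. The reported error means this
// invariant was violated: a mask exists, but the data array is missing.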

This is only a fragment of the code. Can you reduce it to a minimal single function containing all of the code necessary to reproduce it?


LarsKoeppel commented on May 30, 2024

Hi,

here is an example where I tried to summarize what my code is doing.
It is split into two parts because the config generation happens on the main node before multiple nodes write the dataset in parallel. The boundaries of each node should always align with shard boundaries.

// generate multiscale neuroglancer config - done on main node
tensorstore::Open(
    {{"driver", "neuroglancer_precomputed"},
     {"kvstore", {{"driver", "file"},
                  {"path", "<path_to_new_dataset>"}}},
     {"multiscale_metadata", {{"data_type", "uint8_t"},
                              {"num_channels", numberChannels},
                              {"type", "image"}}},
     {"scale_metadata",
      {{"resolution", {resolution[0], resolution[1], resolution[2]}},
       {"encoding", encoding},
       {"chunk_size", {chunkSize[0], chunkSize[1], chunkSize[2]}},
       {"size", {sizeX, sizeY, sizeZ}},
       {"sharding", {{"@type", "neuroglancer_uint64_sharded_v1"},
                     {"preshift_bits", neuroBits.preShiftBits},
                     {"minishard_bits", neuroBits.miniShardBits},
                     {"shard_bits", neuroBits.shardBits},
                     {"minishard_index_encoding", "raw"},
                     {"hash", "identity"}}}}},
     {"scale_index", index}},
    context, tensorstore::OpenMode::create,
    tensorstore::RecheckCached{false},
    tensorstore::ReadWriteMode::write).value();
// reopen tensorstore - the following code runs in parallel on multiple nodes
auto store =
    tensorstore::Open(
        {{"driver", "neuroglancer_precomputed"},
         {"kvstore", {{"driver", "file"},
                      {"path", "<path_to_new_dataset>"}}},
         {"scale_index", index}},
        context, tensorstore::OpenMode::open,
        tensorstore::RecheckCached{true},
        tensorstore::ReadWriteMode::write).value();

// size and interval of the array should be aligned with shard boundaries
boost::multi_array<uint8_t, 4> dataArray(
    boost::extents[D4][endD3 - startD3][endD2 - startD2][endD1 - startD1]);
// load data into array - not tensorstore
// do stuff with array

// write data to the tensorstore
std::vector<int64_t> shape = {endD1 - startD1, endD2 - startD2,
                              endD3 - startD3, D4};
auto intervalD1 = tensorstore::Dims(0).HalfOpenInterval(startD1, endD1);
auto intervalD2 = tensorstore::Dims(1).HalfOpenInterval(startD2, endD2);
auto intervalD3 = tensorstore::Dims(2).HalfOpenInterval(startD3, endD3);
auto arr = tensorstore::Array(dataArray.data(), shape, tensorstore::fortran_order);
auto writeFuture = tensorstore::Write(tensorstore::UnownedToShared(arr),
                                      store | intervalD1 | intervalD2 | intervalD3);
writeFuture.commit_future.value();
auto result = writeFuture.result();
exceptional_assert(result.ok(), "Error while writing to disk");

Do you need more information?
Thank you for your help.


laramiel commented on May 30, 2024

Let's try and get a self-contained repro case; I've built a gtest incorporating your spec.
Is it possible for you to edit this self-contained test so that it fails?

#include <cstdint>
#include <vector>

#include <gtest/gtest.h>

#include "absl/status/status.h"
#include "tensorstore/array.h"
#include "tensorstore/context.h"
#include "tensorstore/contiguous_layout.h"
#include "tensorstore/index_space/dim_expression.h"
#include "tensorstore/open.h"
#include "tensorstore/open_mode.h"
#include "tensorstore/staleness_bound.h"
#include "tensorstore/tensorstore.h"
#include "tensorstore/util/status_testutil.h"

// Boost
#include "boost/multi_array.hpp"

static constexpr int D4 = 1;

absl::Status CreateTensorstore(tensorstore::Context context) {
  int chunkSize[] = {16, 16, 16};
  int size[] = {1024, 1024, 1024};

  int preShiftBits = 2;
  int miniShardBits = 4;
  int shardBits = 8;

  return tensorstore::Open(
             {
                 {"driver", "neuroglancer_precomputed"},
                 {"kvstore",
                  {
                      {"driver", "memory"},
                      {"path", "prefix/"},
                  }},
                 {"multiscale_metadata",
                  {
                      {"data_type", "uint8"},  // not uint8_t
                      {"num_channels", D4},
                      {"type", "image"},
                  }},
                 {"scale_metadata",
                  {
                      {"resolution", {1.0, 1.0, 1.0}},
                      {"encoding", "raw"},
                      {"chunk_size",
                       {chunkSize[0], chunkSize[1], chunkSize[2]}},
                      {"size", {size[0], size[1], size[2]}},
                      {"sharding",
                       {{"@type", "neuroglancer_uint64_sharded_v1"},
                        {"preshift_bits", preShiftBits},
                        {"minishard_bits", miniShardBits},
                        {"shard_bits", shardBits},
                        {"minishard_index_encoding", "raw"},
                        {"hash", "identity"}}},
                  }},
                 {"scale_index", 0},
             },
             context, tensorstore::OpenMode::create,
             tensorstore::RecheckCached{false},
             tensorstore::ReadWriteMode::write)
      .status();
}

TEST(Issue155, Repro) {
  tensorstore::Context context = tensorstore::Context::Default();

  TENSORSTORE_ASSERT_OK(CreateTensorstore(context));

  int startD1 = 0;
  int endD1 = 64;
  int startD2 = 0;
  int endD2 = 64;
  int startD3 = 0;
  int endD3 = 64;

  // reopen tensorstore - following code is done in parallel on multiple nodes
  TENSORSTORE_ASSERT_OK_AND_ASSIGN(
      auto store,
      tensorstore::Open(
          {
              {"driver", "neuroglancer_precomputed"},
              {"kvstore",
               {
                   {"driver", "memory"},
                   {"path", "prefix/"},
               }},
              {"scale_index", 0},
          },
          context, tensorstore::OpenMode::open,
          tensorstore::RecheckCached{true}, tensorstore::ReadWriteMode::write)
          .result());

  // size and interval of array should be aligned with shard
  boost::multi_array<uint8_t, 4> dataArray(
      boost::extents[D4][endD3 - startD3][endD2 - startD2][endD1 - startD1]);
  // load data into array - not tensorstore
  // do stuff with array

  // write data to
  std::vector<int64_t> shape = {endD1 - startD1, endD2 - startD2,
                                endD3 - startD3, D4};
  auto intervalD1 = tensorstore::Dims(0).HalfOpenInterval(startD1, endD1);
  auto intervalD2 = tensorstore::Dims(1).HalfOpenInterval(startD2, endD2);
  auto intervalD3 = tensorstore::Dims(2).HalfOpenInterval(startD3, endD3);
  auto arr =
      tensorstore::Array(dataArray.data(), shape, tensorstore::fortran_order);
  auto writeFuture =
      tensorstore::Write(tensorstore::UnownedToShared(arr),
                         store | intervalD1 | intervalD2 | intervalD3);
  writeFuture.commit_future.Wait();

  TENSORSTORE_ASSERT_OK(writeFuture.commit_future.result());
}

You will need to put together a proper BUILD rule for this.


LarsKoeppel commented on May 30, 2024

I don't think I can get this test case to crash.
The only time I have seen this error is when multiple workers run on multiple nodes at the same time and all open and write to the same dataset asynchronously. They should not interfere with each other, because each worker should be operating on its own shard.
But I have also observed that tensorstore appears to load some surrounding data, even when it is not needed for the intended operation.


laramiel commented on May 30, 2024

Do you attempt to create on every node? Or is the create completely independent?

We can make it a simple binary with a --create flag that takes start/end[1-3] as parameters, roughly as sketched below.
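Something along these lines (untested; the flag names and dataset path are placeholders). It reuses the spec from CreateTensorstore() above, but with a "file" kvstore so that separate processes share the dataset:

#include <cstdint>
#include <vector>

#include "absl/flags/flag.h"
#include "absl/flags/parse.h"
#include "tensorstore/array.h"
#include "tensorstore/context.h"
#include "tensorstore/contiguous_layout.h"
#include "tensorstore/index_space/dim_expression.h"
#include "tensorstore/open.h"
#include "tensorstore/open_mode.h"
#include "tensorstore/staleness_bound.h"
#include "tensorstore/tensorstore.h"

ABSL_FLAG(bool, create, false, "Create the dataset instead of writing.");
ABSL_FLAG(int64_t, start1, 0, "Inclusive start, dim 0.");
ABSL_FLAG(int64_t, end1, 64, "Exclusive end, dim 0.");
ABSL_FLAG(int64_t, start2, 0, "Inclusive start, dim 1.");
ABSL_FLAG(int64_t, end2, 64, "Exclusive end, dim 1.");
ABSL_FLAG(int64_t, start3, 0, "Inclusive start, dim 2.");
ABSL_FLAG(int64_t, end3, 64, "Exclusive end, dim 2.");

int main(int argc, char** argv) {
  absl::ParseCommandLine(argc, argv);
  auto context = tensorstore::Context::Default();

  if (absl::GetFlag(FLAGS_create)) {
    // Same spec as CreateTensorstore() above, but on disk.
    tensorstore::Open(
        {{"driver", "neuroglancer_precomputed"},
         {"kvstore", {{"driver", "file"}, {"path", "/tmp/issue155"}}},
         {"multiscale_metadata",
          {{"data_type", "uint8"}, {"num_channels", 1}, {"type", "image"}}},
         {"scale_metadata",
          {{"resolution", {1.0, 1.0, 1.0}},
           {"encoding", "raw"},
           {"chunk_size", {16, 16, 16}},
           {"size", {1024, 1024, 1024}},
           {"sharding",
            {{"@type", "neuroglancer_uint64_sharded_v1"},
             {"preshift_bits", 2},
             {"minishard_bits", 4},
             {"shard_bits", 8},
             {"minishard_index_encoding", "raw"},
             {"hash", "identity"}}}}},
         {"scale_index", 0}},
        context, tensorstore::OpenMode::create,
        tensorstore::ReadWriteMode::write)
        .value();
    return 0;
  }

  auto store =
      tensorstore::Open(
          {{"driver", "neuroglancer_precomputed"},
           {"kvstore", {{"driver", "file"}, {"path", "/tmp/issue155"}}},
           {"scale_index", 0}},
          context, tensorstore::OpenMode::open,
          tensorstore::RecheckCached{true},
          tensorstore::ReadWriteMode::write)
          .value();

  const int64_t s1 = absl::GetFlag(FLAGS_start1), e1 = absl::GetFlag(FLAGS_end1);
  const int64_t s2 = absl::GetFlag(FLAGS_start2), e2 = absl::GetFlag(FLAGS_end2);
  const int64_t s3 = absl::GetFlag(FLAGS_start3), e3 = absl::GetFlag(FLAGS_end3);

  // One channel of zeros covering [s1,e1) x [s2,e2) x [s3,e3).
  std::vector<uint8_t> data((e1 - s1) * (e2 - s2) * (e3 - s3), 0);
  std::vector<int64_t> shape = {e1 - s1, e2 - s2, e3 - s3, 1};
  auto arr = tensorstore::Array(data.data(), shape, tensorstore::fortran_order);
  auto writeFuture = tensorstore::Write(
      tensorstore::UnownedToShared(arr),
      store | tensorstore::Dims(0).HalfOpenInterval(s1, e1) |
          tensorstore::Dims(1).HalfOpenInterval(s2, e2) |
          tensorstore::Dims(2).HalfOpenInterval(s3, e3));
  return writeFuture.commit_future.result().ok() ? 0 : 1;
}

Then run one instance with --create, followed by several concurrent instances with disjoint bounds.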

You can try running your original code with verbose logging enabled. See https://github.com/google/tensorstore/blob/master/tensorstore/internal/log/verbose_flag.h

TENSORSTORE_VERBOSE_LOGGING=all


jbms commented on May 30, 2024

Thanks for reporting this.

I identified the bug, and have a fix that we can hopefully push out shortly.

I believe the specific case that would trigger this is:

  • Process 1: Write all zeros to just a portion of a chunk. Starts writeback of shard. Observes that the chunk is equal to the fill value (all zero) because the existing chunk is either not present, or all unmodified elements are zero. At this point, the data array is freed since it is equal to the fill value, but the mask remains as it was to indicate a partial modification.
  • Process 2: Concurrently modifies the shard.
  • Process 1: Writeback must be retried due to the concurrent modification. When integrating the new contents of the shard, the assertion is triggered by the unexpected combination of a partial modification with no data array. (A rough sketch of this sequence follows below.)
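To make the sequence concrete, here is an untested sketch that approximates the two processes with two independent Contexts (so they do not share a cache) over the same on-disk dataset. The path is a placeholder, and since the interleaving is timing-dependent this will not deterministically trip the assertion:

// Both "processes" open the same dataset (created as in CreateTensorstore()
// above, but with a "file" kvstore) through independent contexts.
auto OpenWriter = [](tensorstore::Context context) {
  return tensorstore::Open(
             {{"driver", "neuroglancer_precomputed"},
              {"kvstore", {{"driver", "file"}, {"path", "/tmp/issue155"}}},
              {"scale_index", 0}},
             context, tensorstore::OpenMode::open,
             tensorstore::ReadWriteMode::write)
      .value();
};
auto store1 = OpenWriter(tensorstore::Context::Default());
auto store2 = OpenWriter(tensorstore::Context::Default());

// "Process 1": write all zeros to just part of a 16^3 chunk, so the chunk
// equals the fill value and its data array can be freed during writeback.
std::vector<uint8_t> zeros(8 * 8 * 8, 0);
std::vector<int64_t> zshape = {8, 8, 8, 1};
auto zarr = tensorstore::Array(zeros.data(), zshape, tensorstore::fortran_order);
auto write1 = tensorstore::Write(
    tensorstore::UnownedToShared(zarr),
    store1 | tensorstore::Dims(0).HalfOpenInterval(0, 8) |
        tensorstore::Dims(1).HalfOpenInterval(0, 8) |
        tensorstore::Dims(2).HalfOpenInterval(0, 8));

// "Process 2": concurrently modify a different chunk in the same 64^3 shard,
// forcing process 1 to retry its shard writeback.
std::vector<uint8_t> ones(16 * 16 * 16, 1);
std::vector<int64_t> oshape = {16, 16, 16, 1};
auto oarr = tensorstore::Array(ones.data(), oshape, tensorstore::fortran_order);
auto write2 = tensorstore::Write(
    tensorstore::UnownedToShared(oarr),
    store2 | tensorstore::Dims(0).HalfOpenInterval(16, 32) |
        tensorstore::Dims(1).HalfOpenInterval(16, 32) |
        tensorstore::Dims(2).HalfOpenInterval(16, 32));

write1.commit_future.Wait();
write2.commit_future.Wait();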

There are two important things to note, though:

  1. If this sequence is indeed what triggered the bug, then your writes are in fact not shard-aligned as you thought they were. Even with the bug fixed, shard-aligned writes will be much more efficient (see the alignment-check sketch after this list).
  2. The assertion only triggers in debug builds (with NDEBUG not defined). For production use, disabling assertions may make it significantly faster. Usually NDEBUG is defined automatically in release builds, but you may want to confirm you are building with optimizations.
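For what it's worth, here is a rough way to sanity-check alignment for a spec like the one above. This is an assumption-laden sketch, not tensorstore API: it assumes the identity hash over compressed Morton chunk keys, preshift_bits + minishard_bits divisible by 3, and a chunk grid with power-of-two extents of at least 2^((preshift_bits + minishard_bits)/3) chunks per dimension.

#include <cassert>
#include <cstdint>

// Sketch: with {"hash": "identity"}, chunk keys are compressed Morton codes,
// so when preshift_bits + minishard_bits == 3k (and the chunk grid has
// power-of-two extents of at least 2^k chunks per dimension), each shard
// covers a cube of 2^k x 2^k x 2^k chunks. A write is shard-aligned iff its
// bounds along every dimension are multiples of that cube's edge in voxels.
bool IsShardAligned(int64_t start, int64_t end, int64_t chunk_size,
                    int preshift_bits, int minishard_bits) {
  const int bits = preshift_bits + minishard_bits;
  assert(bits % 3 == 0);  // this sketch only handles the balanced case
  const int64_t shard_edge = (int64_t{1} << (bits / 3)) * chunk_size;
  return start % shard_edge == 0 && end % shard_edge == 0;
}

With the repro's parameters (chunk_size 16, preshift_bits 2, minishard_bits 4), shard_edge is 4 * 16 = 64 voxels, so the test's [0, 64) write is aligned, while for example [0, 48) would not be.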

