Comments (4)

taiya commented on August 11, 2024

Open for discussion:

  1. Should we consolidate all the klevr files (klevr_worker, asset_source, dataset, zip, ...) into a single folder rather than scattering them around the module tree? (see #TODO: implement the logic for when a folder contains multiple kubric runs, i.e. subfolders)
  2. Where can we put the logic for copying resources to a centralized dataset bucket? gsutil cp -r local/path gs://remote/path (see the sketch after this list)
  3. If a dataset is generated from a collection of render passes, then you will have 1+ metadata.pkl files. Once you break the data into train/test sets, you will have to remember which metadata file a particular exemplar refers to. How to do this elegantly is TBD.
  4. What the training output of the dataset will be (e.g. how to get class labels, etc.) → Sara
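
A rough sketch for (2), assuming gsutil is available on the worker; the function name and paths are placeholders, not the final kubric layout:

import subprocess

def copy_to_bucket(local_dir: str, remote_dir: str) -> None:
  """Recursively copies a finished render to the central dataset bucket."""
  # Equivalent to: gsutil -m cp -r local/path gs://remote/path
  subprocess.run(['gsutil', '-m', 'cp', '-r', local_dir, remote_dir], check=True)

# e.g. copy_to_bucket('output/klevr_run', 'gs://kubric/tfds/klevr')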


taiya commented on August 11, 2024

Discussed with Etienne from TFDS how to address (4) above. Keeping the current loader as is (single folder) and then yielding examples from it seems to be the most effective solution (it keeps the code clean).

If high performance is needed, Beam seems to be the solution:

import apache_beam as beam
import tensorflow_datasets as tfds


class ProcessDataset(beam.DoFn):
  """Yields (key, example) pairs from an already-prepared TFDS directory."""

  def process(self, ds_path):
    builder = tfds.core.builder_from_directory(ds_path)
    for i, ex in enumerate(builder.as_dataset(split='train')):
      # Prefix keys with the directory name so they stay unique across datasets.
      yield f'{i}_{ds_path.name}', ex


def _generate_examples(self, all_dataset_paths):
  return (
      beam.Create(all_dataset_paths)
      | beam.ParDo(ProcessDataset())
  )

But considering that rendering takes a significant amount of time... not sure it's worth it.
So we discussed a simpler alternative: in the init phase, something like:

builders = []
for bucket_path in bucket_paths:
  # tfds.builder('MyDatasetBuilder', bucket_path=bucket_path)  # equivalent to the next line
  builder = MyDatasetBuilder(bucket_path=bucket_path)
  builders.append(builder)

And then, in the yield phase (note: shuffling is done by TFDS, so we can yield in order):

def _generate_examples(self, ...):
    for builder_idx, builder in enumerate(builders):
        ds = builder.as_dataset(split=...)
        for i, example in enumerate(ds):
            # Keys must be unique across all builders.
            yield f'{builder_idx}_{i}', example

Finally, Etienne mentioned that since TFDS 4.1.0 this code:

train_split = tfds.core.SplitGenerator(name=tfds.Split.TRAIN, gen_kwargs=dict(ids=ids_train))
test_split = tfds.core.SplitGenerator(name=tfds.Split.TEST, gen_kwargs=dict(ids=ids_test))
return [train_split, test_split]

is replaced by

return {
    'train': self._generate_examples(ids_train),
    'test': self._generate_examples(ids_test),
}
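
For context, a minimal sketch of where that dict-style return lives in a TFDS 4.1.0+ builder (class name, features, and the id list are illustrative placeholders, not the kubric API):

import tensorflow_datasets as tfds


class MasterKlevr(tfds.core.GeneratorBasedBuilder):
  """Hypothetical builder illustrating the dict-based split API."""

  VERSION = tfds.core.Version('1.0.0')

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({'id': tfds.features.Text()}),
    )

  def _split_generators(self, dl_manager):
    ids = [str(i) for i in range(100)]  # placeholder; would come from the bucket
    ids_train, ids_test = ids[:80], ids[80:]
    return {
        'train': self._generate_examples(ids_train),
        'test': self._generate_examples(ids_test),
    }

  def _generate_examples(self, ids):
    for i, example_id in enumerate(ids):
      yield i, {'id': example_id}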

And to pass arguments (e.g. the path) to the master dataset: tfds.load(..., as_dataset_kwargs={})
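
For reference, a sketch of that call (the dataset name and the empty kwargs dict are placeholders; as_dataset_kwargs is forwarded to builder.as_dataset):

import tensorflow_datasets as tfds

# Whatever the master dataset ends up accepting would go in as_dataset_kwargs.
ds = tfds.load('master_klevr', split='train', as_dataset_kwargs={})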

Link to other resources:


taiya commented on August 11, 2024

Alright, a first attempt at building a "dataset-of-datasets" is ready:
https://github.com/google-research/kubric/blob/tfds/kubric/datasets/klevr.py

In this example I have something like:

gsutil ls -l gs://kubric/tfds/klevr
    gs://kubric/tfds/klevr/7508d33/
    gs://kubric/tfds/klevr/e31bf6a/

Yet at runtime you get AssertionError: Two records share the same hashed key!... This is because, despite the path specializer creating two DIFFERENT datasets, the second one gets "reused" when it is loaded:

python3 /workspaces/kubric/kubric/datasets/klevr.py
INFO:absl:Load dataset info from /root/tensorflow_datasets/master_klevr/1.0.0
INFO:absl:Load dataset info from /root/tensorflow_datasets/klevr/1.0.0
INFO:root:Klevr(path=gs://kubric/tfds/klevr/7508d33)
INFO:absl:Reusing dataset klevr (/root/tensorflow_datasets/klevr/1.0.0)
INFO:absl:Load dataset info from /root/tensorflow_datasets/klevr/1.0.0
INFO:root:Klevr(path=gs://kubric/tfds/klevr/e31bf6a)
INFO:absl:Reusing dataset klevr (/root/tensorflow_datasets/klevr/1.0.0)
INFO:absl:Reusing dataset master_klevr (/root/tensorflow_datasets/master_klevr/1.0.0)

In other words, the reuse logic seems to be tied to name+version?
Please advise @Conchylicultor


taiya commented on August 11, 2024

@Conchylicultor acknowledged the problem and mentioned BuilderConfig as the workaround. Something like:

MyDataset(config=MyConfig())

Where the config name is the UUID used for caching the data on disk.
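
A minimal sketch of that workaround (class names are illustrative): the config name becomes part of the on-disk cache directory, so two different bucket paths no longer collide.

import hashlib

import tensorflow_datasets as tfds


class KlevrConfig(tfds.core.BuilderConfig):
  """Config whose name uniquely identifies one kubric output directory."""

  def __init__(self, *, path, **kwargs):
    # Use a digest of the bucket path as the config name, so each path is
    # cached under its own directory (e.g. .../klevr/<digest>/1.0.0).
    name = hashlib.md5(path.encode()).hexdigest()[:8]
    super().__init__(name=name, **kwargs)
    self.path = path


# Hypothetical usage:
# builder = Klevr(config=KlevrConfig(path='gs://kubric/tfds/klevr/7508d33'))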

