Comments (4)

taiya commented on August 11, 2024

Open for discussion:

  1. Should we consolidate all the klevr files (klevr_worker, asset_source, dataset, zip, ...) into a single folder rather than scattering them around the module tree? (see #TODO: implement the logic for when a folder contains multiple kubric runs, i.e. subfolders)
  2. Where can we put the logic for copying resources to a centralized dataset bucket? gsutil cp -r local/path gs://remote/path (see the sketch after this list)
  3. If a dataset is generated from a collection of render passes, then you will have 1+ metadata.pkl files. Once you break the data into train/test sets, you will have to remember which metadata file a particular exemplar refers to. How to do this elegantly is TBD.
  4. What the training output of the dataset will be (e.g. how to get class labels, etc.) → Sara
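
A rough sketch for (2), assuming gsutil is available on the worker; the function name and paths are placeholders, not the final kubric layout:

import subprocess

def copy_to_bucket(local_dir: str, remote_dir: str) -> None:
  """Recursively copies a finished render to the central dataset bucket."""
  # Equivalent to: gsutil -m cp -r local/path gs://remote/path
  subprocess.run(['gsutil', '-m', 'cp', '-r', local_dir, remote_dir], check=True)

# e.g. copy_to_bucket('output/klevr_run', 'gs://kubric/tfds/klevr')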


taiya commented on August 11, 2024

Discussed with Etienne from TFDS how to address (4) above. Keeping the current loader as is (single folder) and then yielding examples from it seems to be the most effective solution (it keeps the code clean).

If high performance is needed, Beam seems to be the solution:

import apache_beam as beam
import tensorflow_datasets as tfds


class ProcessDataset(beam.DoFn):
  """Yields (key, example) pairs from an already-prepared TFDS directory."""

  def process(self, ds_path):
    builder = tfds.core.builder_from_directory(ds_path)
    for i, ex in enumerate(builder.as_dataset(split='train')):
      # Prefix keys with the directory name so they stay unique across datasets.
      yield f'{i}_{ds_path.name}', ex


def _generate_examples(self, all_dataset_paths):
  return (
      beam.Create(all_dataset_paths)
      | beam.ParDo(ProcessDataset())
  )

But considering that rendering takes a significant amount of time... not sure it's worth it.
So we discussed a simpler alternative: in the init phase, something like:

builders = []
for bucket_path in bucket_paths:
  # tfds.builder('MyDatasetBuilder', bucket_path=bucket_path)  # equivalent to the next line
  builder = MyDatasetBuilder(bucket_path=bucket_path)
  builders.append(builder)

And then, in the yield phase (note: shuffling is done by TFDS, so we can yield in order):

def _generate_examples(self, ...):
    for builder_idx, builder in enumerate(builders):
        ds = builder.as_dataset(split=...)
        for i, example in enumerate(ds):
            # Keys must be unique across all builders.
            yield f'{builder_idx}_{i}', example

Finally, Etienne mentioned that since TFDS 4.1.0 this code:

train_split = tfds.core.SplitGenerator(name=tfds.Split.TRAIN, gen_kwargs=dict(ids=ids_train))
test_split = tfds.core.SplitGenerator(name=tfds.Split.TEST, gen_kwargs=dict(ids=ids_test))
return [train_split, test_split]

is replaced by

return {
    'train': self._generate_examples(ids_train),
    'test': self._generate_examples(ids_test),
}
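
For context, a minimal sketch of where that dict-style return lives in a TFDS 4.1.0+ builder (class name, features, and the id list are illustrative placeholders, not the kubric API):

import tensorflow_datasets as tfds


class MasterKlevr(tfds.core.GeneratorBasedBuilder):
  """Hypothetical builder illustrating the dict-based split API."""

  VERSION = tfds.core.Version('1.0.0')

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({'id': tfds.features.Text()}),
    )

  def _split_generators(self, dl_manager):
    ids = [str(i) for i in range(100)]  # placeholder; would come from the bucket
    ids_train, ids_test = ids[:80], ids[80:]
    return {
        'train': self._generate_examples(ids_train),
        'test': self._generate_examples(ids_test),
    }

  def _generate_examples(self, ids):
    for i, example_id in enumerate(ids):
      yield i, {'id': example_id}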

And to pass arguments (e.g. the path) to the master dataset: tfds.load(..., as_dataset_kwargs={})
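
For reference, a sketch of that call (the dataset name and the empty kwargs dict are placeholders; as_dataset_kwargs is forwarded to builder.as_dataset):

import tensorflow_datasets as tfds

# Whatever the master dataset ends up accepting would go in as_dataset_kwargs.
ds = tfds.load('master_klevr', split='train', as_dataset_kwargs={})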

Link to other resources:


taiya commented on August 11, 2024

Alright, a first attempt at building a "dataset-of-datasets" is ready:
https://github.com/google-research/kubric/blob/tfds/kubric/datasets/klevr.py

In this example I have something like:

gsutil ls -l gs://kubric/tfds/klevr
    gs://kubric/tfds/klevr/7508d33/
    gs://kubric/tfds/klevr/e31bf6a/

Yet at runtime you get AssertionError: Two records share the same hashed key!... This is because, despite the path specializer creating two DIFFERENT datasets, the second one gets "reused" when it is loaded:

python3 /workspaces/kubric/kubric/datasets/klevr.py
INFO:absl:Load dataset info from /root/tensorflow_datasets/master_klevr/1.0.0
INFO:absl:Load dataset info from /root/tensorflow_datasets/klevr/1.0.0
INFO:root:Klevr(path=gs://kubric/tfds/klevr/7508d33)
INFO:absl:Reusing dataset klevr (/root/tensorflow_datasets/klevr/1.0.0)
INFO:absl:Load dataset info from /root/tensorflow_datasets/klevr/1.0.0
INFO:root:Klevr(path=gs://kubric/tfds/klevr/e31bf6a)
INFO:absl:Reusing dataset klevr (/root/tensorflow_datasets/klevr/1.0.0)
INFO:absl:Reusing dataset master_klevr (/root/tensorflow_datasets/master_klevr/1.0.0)

In other words, the reuse logic seems to be tied to name+version?
Please advise @Conchylicultor


taiya commented on August 11, 2024

@Conchylicultor acknowledged the problem and mentioned BuilderConfig as the workaround. Something like:

MyDataset(config=MyConfig())

Where the config name is the UUID used for caching the data on disk.
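
A minimal sketch of that workaround (class names are illustrative): the config name becomes part of the on-disk cache directory, so two different bucket paths no longer collide.

import hashlib

import tensorflow_datasets as tfds


class KlevrConfig(tfds.core.BuilderConfig):
  """Config whose name uniquely identifies one kubric output directory."""

  def __init__(self, *, path, **kwargs):
    # Use a digest of the bucket path as the config name, so each path is
    # cached under its own directory (e.g. .../klevr/<digest>/1.0.0).
    name = hashlib.md5(path.encode()).hexdigest()[:8]
    super().__init__(name=name, **kwargs)
    self.path = path


# Hypothetical usage:
# builder = Klevr(config=KlevrConfig(path='gs://kubric/tfds/klevr/7508d33'))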

