ghcollin / tftables
HDF5 interface for Tensorflow.
License: MIT License
Awesome package!!
Is it possible to load/dequeue data samples from multiple datasets (which may be inside the same hdf5 file)? For example, let's say we have filename=/path/to/h5_file.h5, which contains two tables: /path/to/table/1 and /path/to/table/2. Both tables contain the columns data and labels, like in the main README example.
I can make a loader for any individual table as suggested in the README:
loader_dataset1 = tftables.load_dataset(filename='path/to/h5_file.h5',
                                        dataset_path='/path/to/table/1',
                                        input_transform=input_transform, ...)
But would I have to create an entirely different loader to handle the second table? Like this:
loader_dataset2 = tftables.load_dataset(filename='path/to/h5_file.h5',
                                        dataset_path='/path/to/table/2',
                                        input_transform=input_transform, ...)
Then I would have to load the batches from each table separately and alternate which one to use on every iteration of training:
truth_batch1, data_batch1 = loader_dataset1.dequeue()
truth_batch2, data_batch2 = loader_dataset2.dequeue()
Is there a better way of doing this? I could imagine concatenating both tables into a single table (and thus using a single loader). For clarity, it would make sense to keep the tables separate, but if this is the only solution, merging them together is certainly possible. Do you have any suggestions?
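One possible way to avoid merging the tables, sketched under the assumption that running two loaders side by side is acceptable (my_network, num_iterations and input_transform below are placeholders taken from the question, not part of the tftables API): build one result op per loader and alternate which op is run on each iteration.

import tensorflow as tf
import tftables

loader_dataset1 = tftables.load_dataset(filename='path/to/h5_file.h5',
                                        dataset_path='/path/to/table/1',
                                        input_transform=input_transform,
                                        batch_size=16)
loader_dataset2 = tftables.load_dataset(filename='path/to/h5_file.h5',
                                        dataset_path='/path/to/table/2',
                                        input_transform=input_transform,
                                        batch_size=16)

truth_batch1, data_batch1 = loader_dataset1.dequeue()
truth_batch2, data_batch2 = loader_dataset2.dequeue()

# One result op per table; whether the two calls share weights depends on how
# the (placeholder) my_network function is written.
result1 = my_network(data_batch1, truth_batch1)
result2 = my_network(data_batch2, truth_batch2)

with tf.Session() as sess:
    with loader_dataset1.begin(sess), loader_dataset2.begin(sess):
        for i in range(num_iterations):
            # Alternate tables between iterations; only the op that is run
            # pulls a batch from its queue.
            sess.run(result1 if i % 2 == 0 else result2)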
I'm reading multiple datasets from a single file and after a certain number of iterations the code hangs indefinitely (I let it go overnight just to be absolutely certain). I have to ctrl+C out of it and I get the following exception. Looks like a hang in multitables somewhere? Maybe from the queue not being populated quickly enough?
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 389, in _Streamer__read_process
    with sync.do(cbuf.put_direct(), i, (i+read_size) % len(ary)) as put_ary:
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 136, in __enter__
    with self.sync.barrier_in.wait(*self.index):
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 87, in __enter__
    self.sync.cvar.wait()
  File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 262, in wait
    return self._wait_semaphore.acquire(True, timeout)
Hi, when I ran the unit tests on my system, I got the following error:
_pickle.PicklingError: Can't pickle <function Streamer.__read_process at 0x000002012024BF28>: attribute lookup Streamer.__read_process on multitables failed
Please help me.
I have some questions...
def input_transform(tbl_batch):
    labels = tbl_batch['non_nodule']
    data = tbl_batch['nodule']
    return labels, data
loader = tftables.load_dataset(filename='path/to/h5_file.h5',
                               dataset_path='/internal/h5/path',
                               input_transform=input_transform,
                               batch_size=16)
The h5_file contains 'nodule' and 'non_nodule' as keys.
What does dataset_path mean?
If my hdf5 file is named taki.h5 and its full path is /home/data/lunit/taki.h5, should it be
filename = taki.h5
dataset_path = /home/data/lunit/
or
filename = /home/data/lunit/taki.h5
dataset_path = nodule
What is the correct answer?
def input_transform(tbl_batch):
    labels = tbl_batch['non_nodule']
    data = tbl_batch['nodule']
    return labels, data

loader = tftables.load_dataset(filename='path/to/h5_file.h5',
                               dataset_path='/internal/h5/path',
                               input_transform=input_transform,
                               batch_size=16)
result = my_network(self.x, self.y)
# self.x = tf.placeholder(tf.float32, [16,68,68,68,3])
# self.y = tf.placeholder(tf.float32, [16,2])

with tf.Session() as sess:
    with loader.begin(sess):
        for _ in range(num_iterations):
            truth_batch, data_batch = loader.dequeue()
            feed_dict = {self.x: data_batch, self.y: truth_batch}
            sess.run(result, feed_dict=feed_dict)
Is this possible? Please answer.
Thank you!
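For comparison, the usage suggested in the README (and referenced in the first question above) does not go through placeholders at all: dequeue() returns tensors, so they can be wired directly into the network and only the result op needs to be run. A minimal sketch of that style, with my_network and num_iterations as stand-ins from the question:

import tensorflow as tf
import tftables

loader = tftables.load_dataset(filename='path/to/h5_file.h5',
                               dataset_path='/internal/h5/path',
                               input_transform=input_transform,
                               batch_size=16)

# dequeue() yields tensors, not numpy arrays, so they are used as graph inputs
# rather than being fed through a feed_dict.
truth_batch, data_batch = loader.dequeue()
result = my_network(data_batch, truth_batch)

with tf.Session() as sess:
    with loader.begin(sess):
        for _ in range(num_iterations):
            sess.run(result)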
I get an error whenever I try to use the same FIFOLoader twice. The error, as far as I can tell, comes from these lines:
if self.monitor_thread is not None:
    raise Exception("This loader has already been started.")
and the fact that the monitor thread is only closed, but not set back to None, when loader.stop(sess) is called.
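A possible workaround, assuming that attribute really is the only guard involved (this is a guess based on the quoted lines, not a documented API): reset monitor_thread by hand after stopping, so the "already been started" check passes the next time the loader is started.

# Hypothetical workaround based on the quoted source, not a documented API.
loader.stop(sess)
loader.monitor_thread = None   # clear the guard so the loader can be started again

Whether the underlying queue survives being restarted this way is a separate question, so a fix inside the library (setting the attribute back to None in stop()) would be cleaner.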
My dataset is formatted as a few dozen h5 files instead of one h5 with internal directories. Is it possible to load them into one queue without merging them into one file?
Hi, thank you for this library. This is exactly what I was looking for. I have an hdf5 file of the PASCAL VOC dataset for object detection. In this file I have two datasets: one contains variable-size images and the other contains segmentation masks for each of the object categories present in the image. I wish to feed the data in the following way:
Read an image and its corresponding masks, resize and augment them using a custom warp function, and then feed them into the FIFO queue for training. You can find the full question here. Could you please help me with this problem?
From @DolanDack:
Hi, I am brand new to Python and I am trying to use your code to train with TensorFlow using datasets other than MNIST. When I follow your guides, there is this line:
array_batch_placeholder = reader.get_batch('/internal/h5_path/to/array')
I do not really get what this path refers to. TensorFlow is quite different from R and Matlab, which I have been using until now, and I cannot really check the variables by executing small pieces of the code. Is the /internal/h5_path/to/array something I should provide? I know that in:
reader = tftables.open_file(filename='path/to/h5_file', batch_size=20)
I use the h5 file path. I tried to understand by following the more applicable examples that you offer, but to be honest I feel quite lost.
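To illustrate the distinction: filename is a path on the filesystem, while the argument to get_batch is the path of a node inside that file. A small, hypothetical example that creates such an internal layout with PyTables (the array contents and names are placeholders):

import numpy as np
import tables

# Create an HDF5 file on disk whose *internal* layout contains the node
# '/internal/h5_path/to/array'.
with tables.open_file('path/to/h5_file', 'w') as f:
    f.create_array('/internal/h5_path/to', 'array',
                   np.random.rand(1000, 28, 28).astype(np.float32),
                   createparents=True)

With a file laid out like this, tftables.open_file(filename='path/to/h5_file', batch_size=20) points at the file itself, and reader.get_batch('/internal/h5_path/to/array') points at the array inside it.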
I get a ZeroDivisionError when I try to read a batch without specifying the block_size. My HDF5 file is created with h5py and is a simple table.
Here is my code and the error message:
In [2]: reader = tftables.open_file('/home/yngve/table.h5', batch_size=5)
In [3]: a = reader.get_batch('/train/images', n_procs=3)
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-3-05142b89f50c> in <module>()
----> 1 a = reader.get_batch('/train/images', n_procs=3)
~/anaconda3/envs/tf/lib/python3.6/site-packages/tftables.py in get_batch(self, path, **kw_args)
264 block_size = queue.block_size
265 # get an example for finding data types and row sizes.
--> 266 example = self.streamer.get_remainder(path, block_size)
267 batch_type = example.dtype
268 inner_shape = example.shape[1:]
~/anaconda3/envs/tf/lib/python3.6/site-packages/multitables.py in get_remainder(self, path, block_size)
459 :return: A copy of the remainder elements as a numpy array.
460 """
--> 461 return self.__get_batch(path, length=block_size, last=True)
462
463 class Queue:
~/anaconda3/envs/tf/lib/python3.6/site-packages/multitables.py in __get_batch(self, path, length, last)
444
445 if last:
--> 446 example = h5_node[length*(len(h5_node)//length):].copy()
447 else:
448 example = h5_node[:length].copy()
ZeroDivisionError: integer division or modulo by zero
The error is resolved if I specify block_size. The code
In [2]: reader = tftables.open_file('/home/yngve/table.h5', batch_size=5)
In [3]: a = reader.get_batch('/train/images', block_size=5, n_procs=3)
does not give any errors. I have tested this code with the FIFOQueue and it does indeed give the expected result.
System info:
OS: Ubuntu 16.04
Python version: 3.6
Tensorflow version: 1.5.0
Tftables version: 1.1.2 (latest from pip)
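A guess at the cause, based on the traceback above: multitables appears to pick a default block_size from the dataset's HDF5 chunk shape, and a dataset written contiguously (which is what h5py's create_dataset does when no chunks argument is given) has no chunk shape, which would leave the default at zero and trigger the division. If that reading is right, either pass block_size explicitly as above, or write the dataset chunked, for example:

import h5py
import numpy as np

with h5py.File('/home/yngve/table.h5', 'w') as f:
    # chunks=True lets h5py pick a chunk shape, so readers that derive their
    # default block size from the chunk shape have something to work with.
    f.create_dataset('/train/images',
                     data=np.zeros((1000, 32, 32), dtype=np.float32),
                     chunks=True)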
Hi,
I have implemented an HDF5 stream based on your documentation. I want to know what performance gain tftables gives in general, because I do not see much gain in my case; the speed of my example is the same as before.
Thanks
Do you see a possibility to shuffle the data while reading/cycling through it, either at the tftables or the multitables level?
As I don't see an option related to random access, I assume you store your training data already shuffled?
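If storing the data pre-shuffled is acceptable, a minimal sketch of that approach (placeholder names and data, written with PyTables since that is the backend the library reads): apply one random permutation before writing, because the rows are then streamed back in sequential blocks.

import numpy as np
import tables

data = np.random.rand(10000, 64).astype(np.float32)   # placeholder training data
perm = np.random.permutation(len(data))                # shuffle once, up front

with tables.open_file('shuffled.h5', 'w') as f:
    f.create_array('/', 'train', data[perm])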
Does tftables support hdf5 files created by h5py? I have such files that contain multiple datasets (that is, numpy arrays) in them.
It seems tftables is built on top of 'tables' (PyTables), which is a different library from h5py.
It -seems- to send the same data into the graph multiple times. I used a data set from here: http://download.nexusformat.org/sphinx/examples/h5py/#id4 and don't know for sure what is in it; however, putting print statements on the yield statements in tftables' get_batch()/read_batch() outputs the same data several times. I don't see why this would be expected.