ghcollin / tftables
HDF5 interface for Tensorflow.
License: MIT License
Awesome package!!
Is it possible to load/dequeue data samples from multiple datasets (which may be inside the same hdf5 file)? For example, let's say we have filename=/path/to/h5_file.h5, which contains two tables: /path/to/table/1 and /path/to/table/2. Both tables contain the columns data and labels, like in the main README example.
I can make a loader for any individual table as suggested in the README:
loader_dataset1 = tftables.load_dataset(filename='path/to/h5_file.h5',
                                        dataset_path='/path/to/table/1',
                                        input_transform=input_transform, ...)
But would I have to create an entirely different loader to handle the second table? Like this:
loader_dataset2 = tftables.load_dataset(filename='path/to/h5_file.h5',
                                        dataset_path='/path/to/table/2',
                                        input_transform=input_transform, ...)
Then I would have to load the batches from each table separately and alternate which one to use on every iteration of training:
truth_batch1, data_batch1 = loader_dataset1.dequeue()
truth_batch2, data_batch2 = loader_dataset2.dequeue()
Is there a better way of doing this? I could imagine concatenating both tables into a single table (and thus using a single loader). For clarity, it would make sense to keep the tables separate, but if this is the only solution, merging them together is certainly possible. Do you have any suggestions?
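One possible way to avoid merging the tables, sketched under the assumption that running two loaders side by side is acceptable (my_network, num_iterations and input_transform below are placeholders taken from the question, not part of the tftables API): build one result op per loader and alternate which op is run on each iteration.

import tensorflow as tf
import tftables

loader_dataset1 = tftables.load_dataset(filename='path/to/h5_file.h5',
                                        dataset_path='/path/to/table/1',
                                        input_transform=input_transform,
                                        batch_size=16)
loader_dataset2 = tftables.load_dataset(filename='path/to/h5_file.h5',
                                        dataset_path='/path/to/table/2',
                                        input_transform=input_transform,
                                        batch_size=16)

truth_batch1, data_batch1 = loader_dataset1.dequeue()
truth_batch2, data_batch2 = loader_dataset2.dequeue()

# One result op per table; whether the two calls share weights depends on how
# the (placeholder) my_network function is written.
result1 = my_network(data_batch1, truth_batch1)
result2 = my_network(data_batch2, truth_batch2)

with tf.Session() as sess:
    with loader_dataset1.begin(sess), loader_dataset2.begin(sess):
        for i in range(num_iterations):
            # Alternate tables between iterations; only the op that is run
            # pulls a batch from its queue.
            sess.run(result1 if i % 2 == 0 else result2)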
I'm reading multiple datasets from a single file and after a certain number of iterations the code hangs indefinitely (I let it go overnight just to be absolutely certain). I have to ctrl+C out of it and I get the following exception. Looks like a hang in multitables somewhere? Maybe from the queue not being populated quickly enough?
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 389, in _Streamer__read_process
    with sync.do(cbuf.put_direct(), i, (i+read_size) % len(ary)) as put_ary:
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 136, in __enter__
    with self.sync.barrier_in.wait(*self.index):
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 87, in __enter__
    self.sync.cvar.wait()
  File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 262, in wait
    return self._wait_semaphore.acquire(True, timeout)
Hi, when I ran the unit tests on my system, I got the following error:
_pickle.PicklingError: Can't pickle <function Streamer.__read_process at 0x000002012024BF28>: attribute lookup Streamer.__read_process on multitables failed
Please help me.
I have some questions...
def input_transform(tbl_batch):
    labels = tbl_batch['non_nodule']
    data = tbl_batch['nodule']
    return labels, data
loader = tftables.load_dataset(filename='path/to/h5_file.h5',
                               dataset_path='/internal/h5/path',
                               input_transform=input_transform,
                               batch_size=16)
The h5_file contains 'nodule' and 'non_nodule' as keys.
What does dataset_path mean?
If my hdf5 file is named taki.h5 and its full path is /home/data/lunit/taki.h5, should it be
filename = taki.h5
dataset_path = /home/data/lunit/
or
filename = /home/data/lunit/taki.h5
dataset_path = nodule
What is the correct answer?
def input_transform(tbl_batch):
    labels = tbl_batch['non_nodule']
    data = tbl_batch['nodule']
    return labels, data

loader = tftables.load_dataset(filename='path/to/h5_file.h5',
                               dataset_path='/internal/h5/path',
                               input_transform=input_transform,
                               batch_size=16)
result = my_network(self.x, self.y)
# self.x = tf.placeholder(tf.float32, [16,68,68,68,3])
# self.y = tf.placeholder(tf.float32, [16,2])

with tf.Session() as sess:
    with loader.begin(sess):
        for _ in range(num_iterations):
            truth_batch, data_batch = loader.dequeue()
            feed_dict = {self.x: data_batch, self.y: truth_batch}
            sess.run(result, feed_dict=feed_dict)
Is this possible? Please answer.
Thank you!
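For comparison, the usage suggested in the README (and referenced in the first question above) does not go through placeholders at all: dequeue() returns tensors, so they can be wired directly into the network and only the result op needs to be run. A minimal sketch of that style, with my_network and num_iterations as stand-ins from the question:

import tensorflow as tf
import tftables

loader = tftables.load_dataset(filename='path/to/h5_file.h5',
                               dataset_path='/internal/h5/path',
                               input_transform=input_transform,
                               batch_size=16)

# dequeue() yields tensors, not numpy arrays, so they are used as graph inputs
# rather than being fed through a feed_dict.
truth_batch, data_batch = loader.dequeue()
result = my_network(data_batch, truth_batch)

with tf.Session() as sess:
    with loader.begin(sess):
        for _ in range(num_iterations):
            sess.run(result)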
I get an error whenever I try to use the same FIFOLoader twice. The error, as far as I can tell, comes from these lines:
if self.monitor_thread is not None:
    raise Exception("This loader has already been started.")
and the fact that the monitor thread is only closed, but not set back to None, when loader.stop(sess) is called.
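A possible workaround, assuming that attribute really is the only guard involved (this is a guess based on the quoted lines, not a documented API): reset monitor_thread by hand after stopping, so the "already been started" check passes the next time the loader is started.

# Hypothetical workaround based on the quoted source, not a documented API.
loader.stop(sess)
loader.monitor_thread = None   # clear the guard so the loader can be started again

Whether the underlying queue survives being restarted this way is a separate question, so a fix inside the library (setting the attribute back to None in stop()) would be cleaner.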
My dataset is formatted as a few dozen h5 files instead of one h5 with internal directories. Is it possible to load them into one queue without merging them into one file?
Hi, thank you for this library. This is exactly what I was looking for. I have an hdf5 file of the PASCAL VOC dataset for object detection. In this file I have two datasets: one contains variable-size images and the other contains segmentation masks for each of the object categories present in the image. I wish to feed the data in the following way:
Read an image and its corresponding masks, resize and augment them using a custom warp function, and then feed them into the FIFO queue for training. You can find the full question here. Could you please help me with this problem?
From @DolanDack:
Hi, I am brand new to Python and I am trying to use your code to train with TensorFlow using datasets other than MNIST. When I follow your guides, there is this line:
array_batch_placeholder = reader.get_batch('/internal/h5_path/to/array')
I do not really get what this path refers to. TensorFlow is quite different from R and Matlab, which I have been using until now, and I cannot really check the variables by executing small pieces of the code. Is the /internal/h5_path/to/array something I should provide? I know that in:
reader = tftables.open_file(filename='path/to/h5_file', batch_size=20)
I use the h5 file path. I tried to understand by following the more applicable examples that you offer, but to be honest I feel quite lost.
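To illustrate the distinction: filename is a path on the filesystem, while the argument to get_batch is the path of a node inside that file. A small, hypothetical example that creates such an internal layout with PyTables (the array contents and names are placeholders):

import numpy as np
import tables

# Create an HDF5 file on disk whose *internal* layout contains the node
# '/internal/h5_path/to/array'.
with tables.open_file('path/to/h5_file', 'w') as f:
    f.create_array('/internal/h5_path/to', 'array',
                   np.random.rand(1000, 28, 28).astype(np.float32),
                   createparents=True)

With a file laid out like this, tftables.open_file(filename='path/to/h5_file', batch_size=20) points at the file itself, and reader.get_batch('/internal/h5_path/to/array') points at the array inside it.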
I get a ZeroDivisionError when I try to read a batch without specifying the block_size. My HDF5 file is created with h5py and is a simple table.
Here is my code and the error message:
In [2]: reader = tftables.open_file('/home/yngve/table.h5', batch_size=5)
In [3]: a = reader.get_batch('/train/images', n_procs=3)
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-3-05142b89f50c> in <module>()
----> 1 a = reader.get_batch('/train/images', n_procs=3)
~/anaconda3/envs/tf/lib/python3.6/site-packages/tftables.py in get_batch(self, path, **kw_args)
264 block_size = queue.block_size
265 # get an example for finding data types and row sizes.
--> 266 example = self.streamer.get_remainder(path, block_size)
267 batch_type = example.dtype
268 inner_shape = example.shape[1:]
~/anaconda3/envs/tf/lib/python3.6/site-packages/multitables.py in get_remainder(self, path, block_size)
459 :return: A copy of the remainder elements as a numpy array.
460 """
--> 461 return self.__get_batch(path, length=block_size, last=True)
462
463 class Queue:
~/anaconda3/envs/tf/lib/python3.6/site-packages/multitables.py in __get_batch(self, path, length, last)
444
445 if last:
--> 446 example = h5_node[length*(len(h5_node)//length):].copy()
447 else:
448 example = h5_node[:length].copy()
ZeroDivisionError: integer division or modulo by zero
The error is resolved if I specify block_size. The code
In [2]: reader = tftables.open_file('/home/yngve/table.h5', batch_size=5)
In [3]: a = reader.get_batch('/train/images', block_size=5, n_procs=3)
does not give any errors. I have tested this code with the FIFOQueue and it does indeed give the expected result.
System info:
OS: Ubuntu 16.04
Python version: 3.6
Tensorflow version: 1.5.0
Tftables version: 1.1.2 (latest from pip)
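A guess at the cause, based on the traceback above: multitables appears to pick a default block_size from the dataset's HDF5 chunk shape, and a dataset written contiguously (which is what h5py's create_dataset does when no chunks argument is given) has no chunk shape, which would leave the default at zero and trigger the division. If that reading is right, either pass block_size explicitly as above, or write the dataset chunked, for example:

import h5py
import numpy as np

with h5py.File('/home/yngve/table.h5', 'w') as f:
    # chunks=True lets h5py pick a chunk shape, so readers that derive their
    # default block size from the chunk shape have something to work with.
    f.create_dataset('/train/images',
                     data=np.zeros((1000, 32, 32), dtype=np.float32),
                     chunks=True)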
Hi,
I have implemented an HDF5 stream based on your documentation. I want to know what performance gain tftables gives in general, because I do not see much gain in my case; the speed of my example is the same as before.
Thanks
Do you see a possibility to shuffle the data while reading/cycling through it, either at the tftables or the multitables level?
As I don't see an option related to random access, I assume you store your training data already shuffled?
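If storing the data pre-shuffled is acceptable, a minimal sketch of that approach (placeholder names and data, written with PyTables since that is the backend the library reads): apply one random permutation before writing, because the rows are then streamed back in sequential blocks.

import numpy as np
import tables

data = np.random.rand(10000, 64).astype(np.float32)   # placeholder training data
perm = np.random.permutation(len(data))                # shuffle once, up front

with tables.open_file('shuffled.h5', 'w') as f:
    f.create_array('/', 'train', data[perm])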
Does tftables support hdf5 files created by h5py? I have such files that contain multiple datasets (that is, numpy arrays) in them.
It seems tftables is built on top of 'tables' (PyTables), which is a different library from h5py.
It -seems- to send the same data into the graph multiple times. I used a data set from here: http://download.nexusformat.org/sphinx/examples/h5py/#id4 and don't know for sure what is in it; however, putting print statements on the yield statements in tftables' get_batch()/read_batch() outputs the same data several times. I don't see why this would be expected.