edwardlib / observations
Tools for loading standard data sets in machine learning
License: Other
Hi Dustin,
I have written a Python script to generate observations-ready Python files for over 1100 datasets available in R and related packages. The project page is https://github.com/Arvinds-ds/datasets_r2py. Could you kindly verify/comment on the following: per the edward/observations requirements, the files are automatically generated in observations/rdata/, and you can look at these files. If you have changes, kindly let me know the modifications to the templated Python files init_template.py, template.py, or test_template.py. The generated files live in the observations/rdata folder, with tests in observations/rdata/tests.
In file Util.py at line 129 (version 0.1.4):
file_size = int(response.headers.get('content-range').split('/')[1])
I get this error:
AttributeError: 'NoneType' object has no attribute 'split'
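The error means the server's response carried no content-range header, so headers.get() returned None. One possible fix (a sketch, not the library's code; get_file_size is a hypothetical name) is to guard for that case and fall back to content-length:

```python
def get_file_size(headers):
    # Guard against a missing 'content-range' header: headers.get()
    # returns None there, and None.split() raised the AttributeError above.
    content_range = headers.get('content-range')
    if content_range is not None:
        # 'content-range' looks like 'bytes 0-999/12345'; the total
        # size follows the '/'.
        return int(content_range.split('/')[1])
    # Fall back to 'content-length', defaulting to 0 if that is absent too.
    return int(headers.get('content-length', 0))
```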
references
All functions are silent, whether loading, preprocessing, or saving the data. Currently, the only step that prints to stdout is downloading. These other steps can sometimes be very expensive, such as the preprocessing in small32_imagenet.py.
We should consider adding stdout messages to the loading, preprocessing, and saving steps, depending on how long each may take. And we should establish a standard that applies across all data sets.
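One candidate standard (a sketch only; log_step is a hypothetical name, not part of the library) is a small context manager that announces a step and reports its elapsed time:

```python
import sys
import time
from contextlib import contextmanager

@contextmanager
def log_step(message, verbose=True):
    # Hypothetical helper: print a start message, run the (possibly
    # expensive) step in the with-block, then report the elapsed time.
    if verbose:
        print(message + '...', file=sys.stdout)
    start = time.time()
    yield
    if verbose:
        print('%s done in %.1fs' % (message, time.time() - start))
```

Loaders could then wrap each stage, e.g. `with log_step('Preprocessing small32_imagenet'): ...`, and a single verbose flag would keep behavior uniform across data sets.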
From @Arvinds-ds (#21 (comment)):
The build is failing for the observations master repo due to random connection errors (probably due to the volume of data hitting the user's GitHub repo during Travis testing). Is there a way to host these files reliably? I can change the URLs in the master file and regenerate the files.
Change line 67 from
with open(extracted_filepath, 'w') as f:
to
with open(extracted_filepath, 'wb') as f:
for compatibility with Python 3, where writing bytes to a file opened in text mode raises a TypeError.
For data sets like multi-MNIST and small ImageNet, we preprocess the data and cache by writing to disk so that future calls can load it into memory. More generally, we need to save and load data when its function requires preprocessing and the data fits in memory to be loaded.
We should decide on a specific option such as pickle, np.savez, or hdf5.
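The cache-on-first-use pattern described above can be sketched as follows, using pickle as one of the candidate formats (np.savez and hdf5 being the alternatives); the function name and signature here are illustrative, not the library's API:

```python
import os
import pickle

def cached_load(cache_path, preprocess_fn):
    # preprocess_fn does the expensive work and returns the data to cache.
    if os.path.exists(cache_path):
        # Future calls skip preprocessing and load straight from disk.
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    data = preprocess_fn()
    with open(cache_path, 'wb') as f:
        pickle.dump(data, f)
    return data
```

Whichever format is chosen, pinning it down in one shared helper like this would keep the save/load behavior identical across multi-MNIST, small ImageNet, and future data sets.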
Searching for open( in the repo shows that we use Python's open() function sometimes with the 'rb' arg, sometimes with no arg, and sometimes with 'w'. We should make its usage consistent.
Following discussion in #19.
We're currently in (pre-)alpha. Writing the 4-page paper will (a) announce the library for an official public release; (b) formalize design principles; (c) formalize design details such as how we handle various data domains, data structures, and data sizes; (d) provide statistics on Observations' collection of data sets.
All contributors are authors.
A major goal of Observations is to support all standard ML data sets. We should make sure we cover all those listed by deeplearning.net (url).
I am heavily using the tf.contrib.data Dataset API for image-based tasks. With Observations' image loaders (LSUN/CelebA, etc.) being no more than downloaders for these datasets, would it be worthwhile to return a TensorFlow dataset, something along the lines of
lsun_bedroom_x_train = lsun('~/data', category='bedroom', set='training',
                            batch_size=32, shuffle=True)
training_data = lsun_bedroom_x_train.make_one_shot_iterator()
...
for i in range(inference.n_iter):
    x_batch = training_data.get_next()
    inference.update({x_ph: x_batch})
Remove CelebA's manual download function.
All functions check for the default filename in path. This doesn't allow the user to load from a renamed file. Enable an optional filename argument.
Some care is needed when the filename refers to a group of files or a directory. I don't know how to handle the arbitrary case.
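For the single-file case, the proposed optional argument could look like this sketch (names here are hypothetical, not the library's API):

```python
import os

def resolve_path(path, default_filename, filename=None):
    # Fall back to the data set's default filename when the user
    # hasn't supplied a renamed one.
    return os.path.join(os.path.expanduser(path),
                        filename or default_filename)
```

Each loader would call this once up front; the directory and multi-file cases still need a separate convention.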
Use hash to verify if content exists and is correct.
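A minimal sketch of hash-based verification, assuming SHA-256 digests are stored alongside the URLs (verify_file is an illustrative name):

```python
import hashlib

def verify_file(filepath, expected_sha256):
    # Compare the file's SHA-256 digest against a known value; a
    # mismatch would trigger a re-download.
    sha = hashlib.sha256()
    with open(filepath, 'rb') as f:
        # Read in chunks so large archives don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b''):
            sha.update(chunk)
    return sha.hexdigest() == expected_sha256
```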
The r subfolder does not get installed in site-packages.
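Assuming the project uses setuptools, a common cause is that setup.py's packages= list omits subpackages. setuptools.find_packages() returns every folder carrying an __init__.py, so passing packages=find_packages() to setup() would pick up the r subfolder automatically. A minimal check of that behavior on a throwaway tree mirroring the report:

```python
import os
import tempfile
from setuptools import find_packages

# Build a temporary package tree: a top-level package with an 'r'
# subfolder, both containing __init__.py files.
root = tempfile.mkdtemp()
for pkg in ('observations', os.path.join('observations', 'r')):
    os.makedirs(os.path.join(root, pkg))
    open(os.path.join(root, pkg, '__init__.py'), 'w').close()

# find_packages() sees the subpackage, so setup(packages=find_packages())
# would install it into site-packages as observations.r.
print(find_packages(where=root))
```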
Testing each loading function requires downloading data from the url(s) and verifying the returned data objects. How do we test without having to download many files for each Travis build? Is storing every file on a Travis server feasible? (no)
It will only extract the files. Basically, extend the function to include url=None.
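A sketch of that extension, assuming a tar archive and a hypothetical function name (the real signature and download step would come from the library):

```python
import os
import tarfile

def maybe_download_and_extract(path, url=None):
    # When url is None, skip the download entirely and only extract
    # the archive already present at path.
    if url is not None:
        # The real function would download url to path first; that
        # step is elided in this sketch.
        raise NotImplementedError('download step not sketched here')
    with tarfile.open(path) as tar:
        tar.extractall(os.path.dirname(path))
```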