edwardlib / observations
Tools for loading standard data sets in machine learning
License: Other
Hi Dustin,
I have written a Python script to generate observations-ready Python files for over 1100 datasets available in R and related packages. The project page is https://github.com/Arvinds-ds/datasets_r2py. Could you kindly verify/comment on the following: per the edward/observations requirements, the files are automatically generated in observations/rdata/, and you can look at these files. If you have changes, kindly let me know the modifications to the templated Python files init_template.py, template.py, or test_template.py. The generated files live in the observations/rdata folder, with tests in observations/rdata/tests.
In file Util.py at line 129 (version 0.1.4):
file_size = int(response.headers.get('content-range').split('/')[1])
I get this error:
AttributeError: 'NoneType' object has no attribute 'split'
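The error means the server's response carried no content-range header, so headers.get() returned None. One possible fix (a sketch, not the library's code; get_file_size is a hypothetical name) is to guard for that case and fall back to content-length:

```python
def get_file_size(headers):
    # Guard against a missing 'content-range' header: headers.get()
    # returns None there, and None.split() raised the AttributeError above.
    content_range = headers.get('content-range')
    if content_range is not None:
        # 'content-range' looks like 'bytes 0-999/12345'; the total
        # size follows the '/'.
        return int(content_range.split('/')[1])
    # Fall back to 'content-length', defaulting to 0 if that is absent too.
    return int(headers.get('content-length', 0))
```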
references
All functions are silent, whether loading, preprocessing, or saving the data. Currently, the only step that prints to stdout is downloading. These other steps can sometimes be very expensive, such as the preprocessing in small32_imagenet.py.
We should consider adding stdout messages to the loading, preprocessing, and saving steps, depending on how long each may take. And we should establish a standard that applies across all data sets.
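One candidate standard (a sketch only; log_step is a hypothetical name, not part of the library) is a small context manager that announces a step and reports its elapsed time:

```python
import sys
import time
from contextlib import contextmanager

@contextmanager
def log_step(message, verbose=True):
    # Hypothetical helper: print a start message, run the (possibly
    # expensive) step in the with-block, then report the elapsed time.
    if verbose:
        print(message + '...', file=sys.stdout)
    start = time.time()
    yield
    if verbose:
        print('%s done in %.1fs' % (message, time.time() - start))
```

Loaders could then wrap each stage, e.g. `with log_step('Preprocessing small32_imagenet'): ...`, and a single verbose flag would keep behavior uniform across data sets.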
From @Arvinds-ds (#21 (comment)):
The build is failing for the observations master repo due to random connection errors (probably due to the volume of data hitting the user's GitHub repo during Travis testing). Is there a way to host these files reliably? I can change the URLs in the master file and regenerate the files.
Change line 67 from
with open(extracted_filepath, 'w') as f:
to
with open(extracted_filepath, 'wb') as f:
for compatibility with Python 3, where writing bytes to a file opened in text mode raises a TypeError.
For data sets like multi-MNIST and small ImageNet, we preprocess the data and cache by writing to disk so that future calls can load it into memory. More generally, we need to save and load data when its function requires preprocessing and the data fits in memory to be loaded.
We should decide on a specific option such as pickle, np.savez, or hdf5.
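The cache-on-first-use pattern described above can be sketched as follows, using pickle as one of the candidate formats (np.savez and hdf5 being the alternatives); the function name and signature here are illustrative, not the library's API:

```python
import os
import pickle

def cached_load(cache_path, preprocess_fn):
    # preprocess_fn does the expensive work and returns the data to cache.
    if os.path.exists(cache_path):
        # Future calls skip preprocessing and load straight from disk.
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    data = preprocess_fn()
    with open(cache_path, 'wb') as f:
        pickle.dump(data, f)
    return data
```

Whichever format is chosen, pinning it down in one shared helper like this would keep the save/load behavior identical across multi-MNIST, small ImageNet, and future data sets.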
Searching for open( in the repo shows that we use Python's open() function sometimes with the 'rb' arg, sometimes with no arg, and sometimes with 'w'. We should make its usage consistent.
Following discussion in #19.
We're currently in (pre-)alpha. Writing the 4-page paper will (a) announce the library for an official public release; (b) formalize design principles; (c) formalize design details such as how we handle various data domains, data structures, and data sizes; (d) provide statistics on Observations' collection of data sets.
All contributors are authors.
A major goal of Observations is to support all standard ML data sets. We should make sure we cover all those listed by deeplearning.net (url).
I am heavily using the tf.contrib.data Dataset API for image-based tasks. With Observations' image loaders (LSUN/CelebA, etc.) being no more than downloaders for these datasets, would it be worthwhile to return a TensorFlow dataset, something along the lines of
lsun_bedroom_x_train = lsun('~/data', category='bedroom', set='training',
                            batch_size=32, shuffle=True)
training_data = lsun_bedroom_x_train.make_one_shot_iterator()
...
for i in range(inference.n_iter):
    x_batch = training_data.get_next()
    inference.update({x_ph: x_batch})
Remove CelebA's manual download function.
All functions check for the default filename in path. This doesn't allow the user to load from a renamed file. Enable an optional filename argument.
Some care is needed when the filename refers to a group of files or a directory. I don't know how to handle the arbitrary case.
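For the single-file case, the proposed optional argument could look like this sketch (names here are hypothetical, not the library's API):

```python
import os

def resolve_path(path, default_filename, filename=None):
    # Fall back to the data set's default filename when the user
    # hasn't supplied a renamed one.
    return os.path.join(os.path.expanduser(path),
                        filename or default_filename)
```

Each loader would call this once up front; the directory and multi-file cases still need a separate convention.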
Use hash to verify if content exists and is correct.
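A minimal sketch of hash-based verification, assuming SHA-256 digests are stored alongside the URLs (verify_file is an illustrative name):

```python
import hashlib

def verify_file(filepath, expected_sha256):
    # Compare the file's SHA-256 digest against a known value; a
    # mismatch would trigger a re-download.
    sha = hashlib.sha256()
    with open(filepath, 'rb') as f:
        # Read in chunks so large archives don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b''):
            sha.update(chunk)
    return sha.hexdigest() == expected_sha256
```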
The r subfolder does not get installed in site-packages.
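Assuming the project uses setuptools, a common cause is that setup.py's packages= list omits subpackages. setuptools.find_packages() returns every folder carrying an __init__.py, so passing packages=find_packages() to setup() would pick up the r subfolder automatically. A minimal check of that behavior on a throwaway tree mirroring the report:

```python
import os
import tempfile
from setuptools import find_packages

# Build a temporary package tree: a top-level package with an 'r'
# subfolder, both containing __init__.py files.
root = tempfile.mkdtemp()
for pkg in ('observations', os.path.join('observations', 'r')):
    os.makedirs(os.path.join(root, pkg))
    open(os.path.join(root, pkg, '__init__.py'), 'w').close()

# find_packages() sees the subpackage, so setup(packages=find_packages())
# would install it into site-packages as observations.r.
print(find_packages(where=root))
```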
Testing each loading function requires downloading data from the url(s) and verifying the returned data objects. How do we test without having to download many files for each Travis build? Is storing every file on a Travis server feasible? (no)
It will only extract the files. Basically, extend the function to include url=None.
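A sketch of that extension, assuming a tar archive and a hypothetical function name (the real signature and download step would come from the library):

```python
import os
import tarfile

def maybe_download_and_extract(path, url=None):
    # When url is None, skip the download entirely and only extract
    # the archive already present at path.
    if url is not None:
        # The real function would download url to path first; that
        # step is elided in this sketch.
        raise NotImplementedError('download step not sketched here')
    with tarfile.open(path) as tar:
        tar.extractall(os.path.dirname(path))
```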