dataget

dataget is a Bash and Python library that helps you download, organize and process popular machine learning datasets. Its goal is to make as many ML datasets as possible readily available to users of any language. Python users get the added benefit of an interface that exposes the training set and test set as pandas dataframes or numpy arrays, plus random batch generators over either form for large datasets.

Getting Started

While dataget is intended for users of any language, you need Python and pip to install dataget itself and the dependencies required by each dataset.

Installation

pip install dataget

Bash

mnist example

pip install $(dataget reqs mnist)
dataget get -c mnist dims=20x20 format=jpg
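
As a rough illustration of what the first line does: the shell substitutes the output of dataget reqs mnist, a whitespace-separated list of package names, into the pip install command. The Python sketch below mirrors that word splitting. The helper name pip_install_argv is hypothetical, and the requirement string is a placeholder rather than the real output for mnist.

```python
import shlex
import sys


def pip_install_argv(reqs_output):
    """Build the argument list for `pip install` from a requirements string.

    `reqs_output` stands in for whatever `dataget reqs <dataset>` prints,
    assumed here to be package names separated by whitespace.
    """
    packages = shlex.split(reqs_output)
    return [sys.executable, "-m", "pip", "install"] + packages


# Placeholder requirement string, not the actual output for any dataset.
argv = pip_install_argv("numpy pandas pillow")
print(argv[1:])
```

Passing the resulting list to subprocess.check_call would run the installation with the same interpreter that runs the script.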

commands

> dataget --help

Usage: dataget [OPTIONS] COMMAND [ARGS]...

Options:
  -p, --path TEXT
  -g, --global_
  --help           Show this message and exit.

Commands:
  rm
  download
  extract
  get
  ls
  process
  rm_compressed
  rm_raw
  reqs

get

> dataget get --help

Usage: dataget get [OPTIONS] DATASET [KWARGS]...

  performs the operations download, extract, rm_compressed, processes
  and rm_raw, in sequence. KWARGS must be in the form: key=value, and
  are fowarded to all opeartions.

Options:
  -c, --rm        removes the dataset's folder (if it exists) before
                     downloading
  --keep-compressed  keeps the compressed files: skips rm_compressed
  --dont-process     skips process
  --keep-raw         keeps the raw/unprocessed files: skips rm_raw
  --help             Show this message and exit.

This is the primary command you will use; it performs the common operations needed to get the data into a usable format. By default it creates a .dataget folder in the current directory, unless the global flag dataget -g is used. The data will live in .dataget/data/{dataset}. The following example

dataget get -c mnist dims=20x20 format=png

is roughly equivalent to

dataget download -c mnist
dataget extract mnist
dataget rm_compressed mnist
dataget process mnist dims=20x20 format=png
dataget rm_raw mnist
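
The help text for get can be read as a small decision procedure: a fixed sequence of operations, some of which are skipped by flags, plus key=value arguments. The sketch below models only that logic. The function names plan_get and parse_kwargs are hypothetical, and this is not dataget's actual implementation.

```python
def parse_kwargs(pairs):
    """Turn ["dims=20x20", "format=png"] into {"dims": "20x20", "format": "png"}."""
    kwargs = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got %r" % pair)
        kwargs[key] = value
    return kwargs


def plan_get(keep_compressed=False, dont_process=False, keep_raw=False):
    """Return the operations `get` would run, in order, for the given flags."""
    steps = ["download", "extract"]
    if not keep_compressed:   # --keep-compressed skips rm_compressed
        steps.append("rm_compressed")
    if not dont_process:      # --dont-process skips process
        steps.append("process")
    if not keep_raw:          # --keep-raw skips rm_raw
        steps.append("rm_raw")
    return steps


print(plan_get())
print(plan_get(keep_raw=True))
print(parse_kwargs(["dims=20x20", "format=png"]))
```

With no flags, the plan lists the same five operations as the manual sequence above.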

ls

List all installed datasets

dataget ls

List all dataget datasets available for download

dataget ls -a

-g

Use dataget -g to perform operations on the global .dataget folder rather than the one in the current directory

dataget -g get mnist
dataget -g ls
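
The layout described earlier, .dataget/data/{dataset}, can be sketched as a path computation. The function name dataset_path and the base_dir parameter are hypothetical. In particular, this README does not say where the global folder used by -g lives, so the home directory in the second call is only an assumption.

```python
import os


def dataset_path(dataset, base_dir):
    """Return <base_dir>/.dataget/data/<dataset>."""
    return os.path.join(base_dir, ".dataget", "data", dataset)


# Default behaviour described above: the folder lives in the current directory.
local_path = dataset_path("mnist", os.getcwd())

# ASSUMPTION: the global folder is taken to be under the home directory.
global_path = dataset_path("mnist", os.path.expanduser("~"))

print(local_path)
print(global_path)
```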

Python

Contributing

Template

from dataget.dataset import DataSet, SubSet
from dataget.utils import get_file
import os, urllib, zipfile, sys, gzip

class MyDataSet(DataSet):

    def __init__(self, *args, **kwargs):
        super(MyDataSet, self).__init__(*args, **kwargs)

        # self.path
        # self.training_set
        # self.training_set.path
        # self.training_set.make_dirs()
        # self.test_set
        # self.test_set.path
        # self.test_set.make_dirs()


    @property
    def training_set_class(self):
        return TrainingSetMyDataSet

    @property
    def test_set_class(self):
        return TestSetMyDataSet

    @property
    def help(self):
        return "" # information for the help command

    def reqs(self, **kwargs):
        return "" # e.g. "numpy pandas pillow"


    def _download(self, **kwargs):
        # download the data, probably in a compressed format
        self.training_set.make_dirs()
        self.test_set.make_dirs()

    def _extract(self, **kwargs):
        # extract the data
        pass

    def _rm_compressed(self, **kwargs):
        # remove the compressed files
        pass

    def _process(self, **kwargs):
        # process the data if needed
        pass

    def _rm_raw(self, **kwargs):
        # remove the raw data if needed
        pass


class MySetBase(SubSet):

    #self.path
    #self.make_dirs()

    def dataframe(self):
        # code
        return df


    def arrays(self):
        # code
        return features, labels


    def random_batch_dataframe_generator(self, batch_size):
        # code
        yield df


    def random_batch_arrays_generator(self, batch_size):
        # code
        yield features, labels


class TrainingSetMyDataSet(MySetBase):

    def __init__(self, dataset, **kwargs):
        super(TrainingSetMyDataSet, self).__init__(dataset, "training-set", **kwargs)
        #self.path
        #self.make_dirs()


class TestSetMyDataSet(MySetBase):

    def __init__(self, dataset, **kwargs):
        super(TestSetMyDataSet, self).__init__(dataset, "test-set", **kwargs)
        #self.path
        #self.make_dirs()
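
One way to back the two random batch generator methods in the template is to draw random row indices and let the caller index a dataframe or array with them. The sketch below uses only the standard library. The name random_batch_indices, the seed parameter, and the choice to yield batches forever are assumptions, not something the template prescribes.

```python
import random


def random_batch_indices(n_rows, batch_size, seed=None):
    """Yield lists of `batch_size` distinct row indices, indefinitely.

    Indices are distinct within a batch, but the same row may appear
    again in a later batch. Because this is a generator, the argument
    check below runs on the first `next()` call, not at creation time.
    """
    if not 0 < batch_size <= n_rows:
        raise ValueError("batch_size must be between 1 and n_rows")
    rng = random.Random(seed)
    while True:
        yield rng.sample(range(n_rows), batch_size)


# Draw three batches of 4 indices from a dataset with 10 rows.
batches = random_batch_indices(n_rows=10, batch_size=4, seed=0)
for _ in range(3):
    print(next(batches))
```

Inside a subset, a batch of indices idx could then be applied as df.iloc[idx] for a pandas dataframe or features[idx] for a numpy array.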

Example

Simple

Using bash

dataget load german-traffic-signs #download, extract and cleanup folder
dataget process german-traffic-signs #process (convert to 32x32 jpg)

Using python

from dataget import data
signs = data("german-traffic-signs") #download, extract and cleanup folder
signs.load().process() #process (convert to 32x32 jpg)

More control

Using bash

dataget download german-traffic-signs #download
dataget extract german-traffic-signs #extract
dataget remove-sources german-traffic-signs #cleanup folder
dataget process german-traffic-signs #process (convert to 32x32 jpg)

Using python

from dataget import data
signs = data("german-traffic-signs")
signs.download() #download
signs.extract() #extract
signs.remove_sources() #cleanup folder
signs.process() #process (convert to 32x32 jpg)

Params

Using bash

dataget process german-traffic-signs -p dims:40x40 #process (convert to 40x40 jpg)

Using python

from dataget import data
signs = data("german-traffic-signs") #download, extract and cleanup folder
signs.process(dims="40x40") #process (convert to 40x40 jpg)

Numpy and Pandas

Assuming you already downloaded the data, you can load the training set as a pandas dataframe:

from dataget import data
signs = data("german-traffic-signs") #download, extract and cleanup folder

df = signs.training_set.dataframe()

Contributors

cgarciae, charlielito
