
The Data Repository

This repository is meant to serve as a demo of a potential Data Repository. It contains meta-information about datasets, as well as DVC projects with which access to the actual data is granted. The data itself is stored on Google Cloud Storage managed by Visium; user access is managed at the Cloud Storage level.

The data repository not only lets you browse datasets, it also provides:

  • meta-information about datasets;
  • a data loader to provide a standardized way of dealing with data (even though the raw files may have different formats);
  • DVC projects to keep track of all versions of a dataset;
  • a way to manage ownership of a dataset, as well as an approval process for changing datasets (via PR).

Available Datasets

├── Chromatograms
│   └── Octane                            
│
├── Raman Spectra               
│
└── images          
    ├── my_mnist
    ├── my_fashion_mnist
    └── augmented_mnist                          

Information concerning how to access a specific dataset can be found in the corresponding folder in the data repository.

DVC Data Lake documentation

The purpose of this file is to provide documentation on how to use the data lake implementation with DVC.

Working with DVC

DVC is well documented on its own website; for general DVC workflows we refer the reader there.

How to authenticate

For the purpose of this demo and for simplicity, a Google Cloud bucket has been chosen for storing the raw data. This means that in order to play with the demo, you will need to have or create a Google account and install the Google Cloud command-line tool gcloud.

In order to install the gcloud command, please follow the official instructions.

Once the CLI is installed, you can authenticate using the following command:

gcloud auth application-default login

This should open a tab in your browser, asking you to log in with your Google account.

How are datasets stored on the bucket

DVC works as a simple versioning tool for data; by itself it stores a dataset only in its raw format. In order to create a real dataset that we can easily iterate on in Python / TensorFlow, we need to format our dataset in a particular way.

We choose to stick to the default way TensorFlow stores datasets: TensorFlow Datasets. This format ensures good integration with TensorFlow, as well as good performance and easy dataset reuse and sharing.

Storing datasets in this format does come at a price: there is some work to do in order to transform the raw data into a TensorFlow dataset. Nevertheless, doing so is well documented and explained in detail in the official documentation.

How to create/add a new dataset

In order to add a dataset to the main repository, it’s important to follow a predefined structure. First, you need to create a folder named after your dataset which contains:

  • __init__.py
    • File responsible for loading the dataset generator from the [dataset-name].py file.
  • README.md
    • Acts as documentation for the dataset, with the following mandatory sections:
      • Description
      • How to use it
        • The DVC command to integrate the dataset into a new project.
      • Accessing the data using Python
        • A code sample showing how to iterate over the dataset.
      • Data Maintenance
        • Name of the person responsible for the dataset.
  • [dataset-name].py
    • The Python file implementing a class that inherits from GeneratorBasedBuilder. It contains the code responsible for loading the raw data and transforming it into a format ready to be fed to the ML model.
    • The details on how to implement this class can be found in the official documentation.
  • data.dvc
    • Hash reference to the data on the bucket.
    • This file is generated by DVC when adding the data folder to the git repository:
      • dvc add data/; dvc push -r remote-bucket
      • WARNING: When pushing data to a bucket, make sure you are pushing to a bucket that only authorized people are allowed to access. Hint: list the remotes available in the project with dvc remote list.

Please have a look at the demo repository for a concrete example.

When working with a centralized repository shared by multiple people, it’s important to follow good practices. Therefore, when adding a new dataset, one should first open a PR on the dataset repository and obtain approval before the PR is merged.

How to update an existing dataset

Updating a dataset is quite similar to creating a new one: a new PR with the desired modifications needs to be created on the central git dataset repository.

For example, if someone wants to add more samples to the dataset, this can be done by combining the following commands:

dvc add data/new_sample1.mat
dvc add data/new_sample2.mat
dvc commit data
dvc push -r remote-bucket

How to create a processed version of a dataset

TBD

Python requirements

dvc[gs]
tensorflow_datasets

