The Data Repository

This repository serves as a demo of a potential Data Repository. It contains meta-information about datasets, as well as the DVC projects through which access to the actual data is granted. The data itself lives in a Google Cloud Storage bucket managed by Visium. Per-user access is managed at the Cloud Storage level.

Besides letting you browse datasets, the data repository also provides:

meta-information about datasets;
a data loader that offers a standardized way to work with the data (even though the raw files may have different formats);
DVC projects to keep track of all versions of a dataset;
a way to manage ownership of a dataset, as well as an approval process for changing datasets (via PR).

Available Datasets

├── Chromatograms
│   └── Octane                            
│
├── Raman Spectra               
│
└── images          
    ├── my_mnist
    ├── my_fashion_mnist
    └── augmented_mnist

Information concerning how to access a specific dataset can be found in the corresponding folder in the data repository.

DVC Data Lake documentation

The purpose of this file is to document how to use the data lake implementation with DVC.

Working with DVC

DVC is well documented on its own website. For general DVC workflows, we refer the reader to that documentation.

How to authenticate

For simplicity, this demo stores the raw data in a Google Cloud bucket. To try the demo, you therefore need a Google account (create one if necessary) and the Google Cloud command-line tool gcloud.

To install the gcloud command, please follow the official instructions.

Once the CLI is installed, you can authenticate with the following command:

gcloud auth application-default login

It should open a tab in your browser asking you to log in with your Google account.
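To confirm that the login step left credentials on disk, a small check such as the one below may help. It is a minimal sketch using only the standard library: it looks at the GOOGLE_APPLICATION_CREDENTIALS environment variable and at the default file that gcloud writes for Application Default Credentials. The function name `find_adc_file` is hypothetical, and a custom gcloud configuration directory (for example one set through CLOUDSDK_CONFIG) is not handled.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_adc_file(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the Application Default Credentials file if one exists, else None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    # An explicitly configured credentials file takes precedence.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit and Path(explicit).is_file():
        return Path(explicit)

    # Default location written by `gcloud auth application-default login`.
    if os.name == "nt":
        base = Path(env.get("APPDATA", str(home))) / "gcloud"
    else:
        base = home / ".config" / "gcloud"
    candidate = base / "application_default_credentials.json"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = find_adc_file()
    if found:
        print(f"Credentials found at {found}")
    else:
        print("No credentials found; run the gcloud login command first.")
```

If the function returns None, running the login command again is usually the first thing to try.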

How are datasets stored on the bucket

DVC is a simple versioning tool for data; by itself, it does not let us store a dataset in anything other than its raw format. To create a dataset that we can easily iterate over in Python / TensorFlow, we need to format it in a particular way.

We chose to stick to TensorFlow's default way of storing datasets: TensorFlow Datasets. This format ensures good integration with TensorFlow, good performance, and easy reuse and sharing of datasets.

Storing datasets in a convenient format does come at a price: some work is needed to transform our raw data into a TensorFlow dataset. Nevertheless, the process is well documented and explained in detail in the official documentation.
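Most of that transformation work amounts to writing a generator that walks the raw files and yields one (key, example) pair per sample, which is the contract expected from the `_generate_examples` method of a TensorFlow Datasets builder. The sketch below shows that contract in plain Python, without importing TensorFlow Datasets. The folder layout (`<data_dir>/<label>/<file>`) and the feature names `file` and `label` are assumptions made for illustration; in a real builder they would have to match the features declared by the builder.

```python
from pathlib import Path
from typing import Dict, Iterator, Tuple


def generate_examples(data_dir: Path) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (unique key, example dict) pairs, one per raw file.

    Assumes a hypothetical layout: <data_dir>/<label>/<file>.
    """
    for file_path in sorted(data_dir.glob("*/*")):
        if not file_path.is_file():
            continue
        label = file_path.parent.name
        # Keys must be unique across the dataset; label plus file name is enough here.
        key = f"{label}/{file_path.name}"
        yield key, {"file": str(file_path), "label": label}
```

Inside a builder, the same loop body would typically also load and decode each file instead of only returning its path.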

How to create/add a new dataset

To add a dataset to the main repository, it is important to follow a predefined structure. First, create a folder named after your dataset that contains:

__init__.py
- File responsible for loading the Dataset Generator from the [dataset-name].py file.
README.md
- Acts as the documentation for the dataset, with the following mandatory sections:
  - Description
  - How to use it
    - DVC command in order to integrate the dataset in a new project.
  - Accessing the data using python
    - Sample of code on how to iterate over the dataset.
  - Data Maintenance
    - Name of the responsible person for the dataset
[dataset-name].py
- The Python file that implements the class inheriting from GeneratorBasedBuilder. It contains the code responsible for loading the raw data and transforming it into a format ready to be fed to the ML model.
- Details on how to implement this class can be found in the official documentation.
data.dvc
- Hash reference to the data on the bucket.
- This file is generated by DVC when the data folder is added to the git repository with the following commands:
  - dvc add data/; dvc push -r remote-bucket;
  - WARNING: When pushing data to a bucket, make sure it is a bucket that only authorized people can access. Hint: list the remotes available in the project with dvc remote list.

Please have a look at the demo repository for a concrete example.
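Before opening a PR, the folder layout described above can be checked with a short script. The following is a minimal sketch, not part of the repository: the function name `check_dataset_folder` is hypothetical, it assumes that the loader file is the usual Python package file `__init__.py` and that the builder file is named exactly after the folder, and it detects README sections with a plain case-insensitive substring search.

```python
from pathlib import Path
from typing import List

# Mandatory README sections, as listed in this document.
REQUIRED_SECTIONS = [
    "Description",
    "How to use it",
    "Accessing the data using python",
    "Data Maintenance",
]


def check_dataset_folder(folder: Path) -> List[str]:
    """Return a list of problems; an empty list means the layout looks fine."""
    problems = []
    name = folder.name

    # Files every dataset folder is expected to contain.
    for required in ("__init__.py", "README.md", f"{name}.py", "data.dvc"):
        if not (folder / required).is_file():
            problems.append(f"missing file: {required}")

    # Section check is only possible when the README exists.
    readme = folder / "README.md"
    if readme.is_file():
        text = readme.read_text(encoding="utf-8").lower()
        for section in REQUIRED_SECTIONS:
            if section.lower() not in text:
                problems.append(f"README.md lacks section: {section}")
    return problems
```

For example, `check_dataset_folder(Path("images/my_mnist"))` would return an empty list for a folder that follows the structure.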

When working with a centralized repository shared by multiple people, it is important to follow good practices. Therefore, when adding a new dataset, first open a PR on the dataset repository and obtain approval before the PR is merged.

How to update an existing dataset

Updating a dataset is quite similar to creating a new one: open a new PR on the central git dataset repo with the desired modification.

For example, to add more samples to the dataset, combine the following commands:

dvc add data/new_sample1.mat
dvc add data/new_sample2.mat
dvc commit data
dvc push -r remote-bucket
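When several samples are added at once, the same sequence could be generated from Python. The sketch below only mirrors the commands listed above and does not add any DVC behavior of its own; the helper names `build_update_commands` and `run_update` are hypothetical, and the default remote name is the `remote-bucket` used in this document. By default nothing is executed, so the commands can be reviewed first.

```python
import subprocess
from typing import List, Sequence


def build_update_commands(
    new_files: Sequence[str],
    data_dir: str = "data",
    remote: str = "remote-bucket",
) -> List[List[str]]:
    """Build the DVC commands from this section, one `dvc add` per new file."""
    commands = [["dvc", "add", path] for path in new_files]
    commands.append(["dvc", "commit", data_dir])
    commands.append(["dvc", "push", "-r", remote])
    return commands


def run_update(new_files: Sequence[str], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for command in build_update_commands(new_files):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_update(["data/new_sample1.mat", "data/new_sample2.mat"])
```

Calling `run_update(..., dry_run=False)` requires DVC to be installed and the working directory to be the dataset folder.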

How to create a processed version of a dataset

TBD

Python requirements

dvc[gs]
tensorflow_datasets
