This repository serves as a demo of a potential Data Repository. It contains meta-information about datasets, as well as DVC projects through which access to the actual data is granted. The data itself is stored in a Google Cloud Storage bucket managed by Visium. User access is managed at the Cloud Storage level.
The data repository not only allows browsing datasets, it also provides:
- meta-information about datasets;
- data loaders that offer a standardized way to deal with data (even though the raw files may have different formats);
- DVC projects to keep track of all versions of a dataset;
- a way to manage ownership of a dataset as well as an approval process for changing datasets (via PR).
├── Chromatograms
│   └── Octane
│
├── Raman Spectra
│
└── images
    ├── my_mnist
    ├── my_fashion_mnist
    └── augmented_mnist
Information concerning how to access a specific dataset can be found in the corresponding folder in the data repository.
The purpose of this file is to document how to use the data lake implementation with DVC.
DVC is well documented on its own website. For general DVC workflows, we refer the reader there.
For the purpose of this demo and for simplicity, a Google Cloud bucket has been chosen to store the raw data. This means that in order to play with the demo, one needs to have or create a Google account and install the Google Cloud command-line tool, gcloud.
In order to install the gcloud command, please follow the official instructions.
Once the CLI is installed, you can authenticate yourself by using the following command:
gcloud auth application-default login
It should open a tab in your browser, asking you to log in with your Google account.
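As a quick sanity check after logging in, the minimal Python sketch below looks for the gcloud executable and for an Application Default Credentials file. The function names are hypothetical and not part of this repository. The file locations are the defaults used by gcloud, assuming the configuration directory has not been moved.

```python
import os
import shutil
from pathlib import Path

ADC_FILENAME = "application_default_credentials.json"


def default_adc_path(env=None, home=None, os_name=None):
    """Return the path where Application Default Credentials are expected."""
    env = os.environ if env is None else env
    os_name = os.name if os_name is None else os_name
    # An explicitly configured credentials file takes precedence.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        return Path(explicit)
    if os_name == "nt":
        # Default location on Windows.
        return Path(env.get("APPDATA", "")) / "gcloud" / ADC_FILENAME
    # Default location on Linux and macOS.
    home = Path.home() if home is None else Path(home)
    return home / ".config" / "gcloud" / ADC_FILENAME


def check_gcloud_setup(env=None, home=None):
    """Report whether gcloud is on PATH and whether a credentials file exists."""
    return {
        "gcloud_on_path": shutil.which("gcloud") is not None,
        "credentials_found": default_adc_path(env, home).is_file(),
    }


if __name__ == "__main__":
    for name, ok in check_gcloud_setup().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

If `credentials_found` is reported as `no`, running the login command above again is the first thing to try.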
DVC works as a simple versioning tool for data; by itself, it does not let us store the dataset in any format other than the raw format. In order to create a real dataset that we can easily iterate over in Python / TensorFlow, we need to format our dataset in a particular way.
We choose to stick to the default way TensorFlow stores datasets: TensorFlow Datasets. Choosing this format ensures good integration with TensorFlow, as well as good performance and easy reuse and sharing of datasets.
Storing datasets in a convenient format does come at a price: some work is needed to transform our raw data into a TensorFlow dataset. Nevertheless, the process is explained in detail in the official documentation.
In order to add a dataset to the main repository, it’s important to follow a predefined structure. First, you need to create a folder named after your dataset, which contains:
- __init__.py
  - File responsible for loading the Dataset Generator from the [dataset-name].py file.
- README.md
  - Acts as documentation for the dataset, with the following mandatory sections:
    - Description
    - How to use it
      - DVC command needed to integrate the dataset into a new project.
    - Accessing the data using python
      - Sample code showing how to iterate over the dataset.
    - Data Maintenance
      - Name of the person responsible for the dataset
- [dataset-name].py
  - The python file which implements the python class inheriting from GeneratorBasedBuilder. It contains the code responsible for loading the raw data and transforming it into a format ready to be fed to the ML model.
  - The details on how to implement this class can be found in the official documentation.
- data.dvc
  - Hash reference to the data on the bucket.
  - This file is generated by DVC when the data folder is added to the git repository with the following commands:
    - dvc add data/; dvc push -r remote-bucket;
  - WARNING: When pushing data to a bucket, make sure that only authorized people are allowed to access that bucket. Hint: list the remotes available in the project with
    dvc remote list
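The structure above can be checked before opening a PR. The following is a minimal sketch, not part of this repository: the function names are hypothetical, `init.py` is assumed to refer to Python's `__init__.py` package file, and the README check is a plain case-insensitive text search for the section titles rather than a Markdown parser.

```python
import sys
from pathlib import Path

# Files every dataset folder should contain, taken from the list above.
# The builder file "[dataset-name].py" is derived from the folder name.
REQUIRED_FILES = ("__init__.py", "README.md", "data.dvc")

# Mandatory README sections, taken from the list above.
README_SECTIONS = (
    "Description",
    "How to use it",
    "Accessing the data using python",
    "Data Maintenance",
)


def missing_dataset_files(dataset_dir):
    """Return the expected file names that are absent from dataset_dir."""
    dataset_dir = Path(dataset_dir).resolve()
    expected = list(REQUIRED_FILES) + [f"{dataset_dir.name}.py"]
    return [name for name in expected if not (dataset_dir / name).is_file()]


def missing_readme_sections(dataset_dir):
    """Return the mandatory section titles not mentioned in README.md."""
    readme = Path(dataset_dir) / "README.md"
    if not readme.is_file():
        return list(README_SECTIONS)
    text = readme.read_text(encoding="utf-8").lower()
    return [title for title in README_SECTIONS if title.lower() not in text]


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_dataset.py images/my_mnist
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print("missing files:", missing_dataset_files(target))
    print("missing README sections:", missing_readme_sections(target))
```

Two empty lists suggest the folder follows the expected layout. The script does not verify that the data has actually been pushed to the remote.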
Please have a look at the demo repository for a concrete example.
When working with a centralized repository shared by multiple people, it’s important to follow good practices. Therefore, when adding a new dataset, one should first create a PR on the dataset repository and obtain approval before the PR is merged.
Updating a dataset is quite similar to creating a new one. A new PR with the desired modification needs to be created on the central git dataset repo.
For example, if someone wants to add more samples to the dataset, this can be done by combining the following commands:
dvc add data/new_sample1.mat
dvc add data/new_sample2.mat
dvc commit data
dvc push -r remote-bucket
TBD
dvc[gs]
tensorflow_datasets