Kedro Docker

Run a Kedro project in a Docker environment

Prerequisites

  • Docker
  • Kedro 0.16.6
  • Kedro-Docker 0.2.1
  • scikit-learn 0.23.0
  • pickle 0.0.11

Workflows

  • Read data from CSV and Excel files, preprocess it, and save the results as CSV files
  • Split the data and save the splits as pickle files
  • Read the pickle files, train the model, and save the regression model (pickle format)
  • Load the pickled regression model and run predictions with it
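
The four steps above can be sketched end to end with the standard library alone. This is a minimal illustration, not the project's code: the toy data, the function names (`split_data`, `train_model`, `predict`, `r_squared`), and the one-variable least-squares fit standing in for the scikit-learn regressor are all assumptions, and the model is kept in memory as pickle bytes rather than written to a file.

```python
import pickle
import random


def split_data(rows, test_ratio=0.25, seed=42):
    """Shuffle (x, y) rows and split them into train and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_ratio))
    return rows[n_test:], rows[:n_test]


def train_model(train_rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(train_rows)
    mean_x = sum(x for x, _ in train_rows) / n
    mean_y = sum(y for _, y in train_rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train_rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train_rows)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, xs):
    """Apply the fitted line to a list of x values."""
    return [model["slope"] * x + model["intercept"] for x in xs]


def r_squared(y_true, y_pred):
    """Coefficient of determination of the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data (assumption): points that lie exactly on y = 3x + 2.
rows = [(float(x), 3.0 * x + 2.0) for x in range(20)]
train_rows, test_rows = split_data(rows)

# Save the trained model in pickle format, then load it back.
model_bytes = pickle.dumps(train_model(train_rows))
model = pickle.loads(model_bytes)

x_test = [x for x, _ in test_rows]
y_test = [y for _, y in test_rows]
score = r_squared(y_test, predict(model, x_test))
```

The model is stored as a plain dictionary so that it can be pickled and unpickled without the loader needing access to a custom class definition.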

Get data from S3

  • Set the AWS credentials in conf/*/credentials.yml
    dev_s3:
        client_kwargs:
            aws_access_key_id: token
            aws_secret_access_key: key
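
A catalog entry usually refers to these credentials by name (`dev_s3` here). The sketch below illustrates that lookup with plain dictionaries. It is not Kedro's implementation: the `resolve_credentials` helper, the bucket name, and the catalog entry are hypothetical and exist only for illustration.

```python
import copy

# Mirrors conf/*/credentials.yml; "token" and "key" are placeholders.
credentials = {
    "dev_s3": {
        "client_kwargs": {
            "aws_access_key_id": "token",
            "aws_secret_access_key": "key",
        }
    }
}

# Hypothetical catalog entry; the bucket and path are made up.
catalog_entry = {
    "type": "pandas.CSVDataSet",
    "filepath": "s3://example-bucket/01_raw/companies.csv",
    "credentials": "dev_s3",
}


def resolve_credentials(entry, all_credentials):
    """Return a copy of entry with its credentials name replaced by the mapping."""
    resolved = copy.deepcopy(entry)
    name = resolved.get("credentials")
    if isinstance(name, str):
        if name not in all_credentials:
            raise KeyError(f"Credentials '{name}' not found in credentials.yml")
        resolved["credentials"] = copy.deepcopy(all_credentials[name])
    return resolved


resolved_entry = resolve_credentials(catalog_entry, credentials)
```

Failing early on an unknown name makes a typo in the catalog visible before any request is sent to S3.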
    

Build

  • Set up the Kedro environment

  • Install Kedro Docker

    pip install kedro-docker==0.2.1
    
  • Generate a Dockerfile

    example$ kedro docker init
    
  • Build a Docker image

    example$ kedro docker build
    

    This creates a Docker image named example:latest.

  • Run a Kedro project in a Docker Environment

    example$ kedro docker run
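
To script the three `kedro docker` commands above, for example in CI, a small wrapper along these lines could be used. It is a sketch under assumptions: the `run_steps` helper and its `dry_run` flag are hypothetical, and a real run must start from the project root (`example` here) with Kedro and Kedro-Docker already installed.

```python
import shlex
import subprocess

# The commands from the steps above, in order.
STEPS = [
    "kedro docker init",
    "kedro docker build",
    "kedro docker run",
]


def run_steps(steps, dry_run=True):
    """Run each command in order, stopping at the first failure.

    With dry_run=True the commands are only parsed and returned, not executed.
    """
    executed = []
    for step in steps:
        argv = shlex.split(step)
        if not dry_run:
            # check=True raises CalledProcessError if the command fails.
            subprocess.run(argv, check=True)
        executed.append(argv)
    return executed


planned = run_steps(STEPS)  # dry run: nothing is executed
```

Calling `run_steps(STEPS, dry_run=False)` would execute the commands for real.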
    

Usage (in Docker) with code from GitHub (no local Kedro environment required)

  • Download the code from GitHub

  • Build a Docker image

    $ cd example
    $ docker build --tag=kedro-docker .
    
  • Run the Docker image

    $ docker run -it kedro-docker bash
    
  • Run the Kedro project inside the container

    $ kedro run
    

Result

```
kedro@b792338c69b3:~$ kedro run
2020-12-28 07:27:43,250 - root - INFO - ** Kedro project kedro
#### Pipeline execution order ####
Inputs: companies, parameters, reviews, shuttles

preprocessing_companies
preprocessing_shuttles
master_table
split_data
train_model
predict

Outputs: None
##################################
fatal: not a git repository (or any of the parent directories): .git
2020-12-28 07:27:43,259 - kedro.versioning.journal -
WARNING - Unable to git describe /home/kedro
/usr/local/lib/python3.7/site-packages/fsspec/implementations/local.py:33: FutureWarning:
The default value of auto_mkdir=True has been deprecated and will be changed to
auto_mkdir=False by default in a future release.
FutureWarning,
2020-12-28 07:27:43,280 - kedro.io.data_catalog -
INFO - Loading data from `companies` (CSVDataSet)...
2020-12-28 07:27:43,338 - kedro.pipeline.node -
INFO - Running node: preprocessing_companies: preprocess_companies([companies]) -> [preprocessed_companies]
2020-12-28 07:27:43,406 - kedro.io.data_catalog -
INFO - Saving data to `preprocessed_companies` (CSVDataSet)...
2020-12-28 07:27:43,650 - kedro.runner.sequential_runner -
INFO - Completed 1 out of 6 tasks
2020-12-28 07:27:43,651 - kedro.io.data_catalog -
INFO - Loading data from `shuttles` (ExcelDataSet)...
2020-12-28 07:27:55,006 - kedro.pipeline.node -
INFO - Running node: preprocessing_shuttles: preprocess_shuttles([shuttles]) -> [preprocessed_shuttles]
2020-12-28 07:27:55,095 - kedro.io.data_catalog -
INFO - Saving data to `preprocessed_shuttles` (CSVDataSet)...
2020-12-28 07:27:55,487 - kedro.runner.sequential_runner -
INFO - Completed 2 out of 6 tasks
2020-12-28 07:27:55,487 - kedro.io.data_catalog -
INFO - Loading data from `preprocessed_shuttles` (CSVDataSet)...
2020-12-28 07:27:55,567 - kedro.io.data_catalog -
INFO - Loading data from `preprocessed_companies` (CSVDataSet)...
2020-12-28 07:27:55,603 - kedro.io.data_catalog -
INFO - Loading data from `reviews` (CSVDataSet)...
2020-12-28 07:27:55,682 - kedro.pipeline.node -
INFO - Running node: master_table:
create_master_table([preprocessed_companies,preprocessed_shuttles,reviews]) -> [master_table]
2020-12-28 07:27:59,069 - kedro.io.data_catalog -
INFO - Saving data to `master_table` (CSVDataSet)...
2020-12-28 07:28:07,982 - kedro.runner.sequential_runner - INFO - Completed 3 out of 6 tasks
2020-12-28 07:28:07,983 - kedro.io.data_catalog -
INFO - Loading data from `master_table` (CSVDataSet)...
2020-12-28 07:28:09,799 - kedro.io.data_catalog -
INFO - Loading data from `parameters` (MemoryDataSet)...
2020-12-28 07:28:09,800 - kedro.pipeline.node -
INFO - Running node: split_data: split_data([master_table,parameters]) -> [Xtest,Xtrain,Ytest,Ytrain]
2020-12-28 07:28:10,562 - kedro.io.data_catalog - INFO - Saving data to `Xtrain` (MemoryDataSet)...
2020-12-28 07:28:10,635 - kedro.io.data_catalog - INFO - Saving data to `Xtest` (MemoryDataSet)...
2020-12-28 07:28:10,645 - kedro.io.data_catalog - INFO - Saving data to `Ytrain` (MemoryDataSet)...
2020-12-28 07:28:10,646 - kedro.io.data_catalog - INFO - Saving data to `Ytest` (MemoryDataSet)...
2020-12-28 07:28:10,711 - kedro.runner.sequential_runner - INFO - Completed 4 out of 6 tasks
2020-12-28 07:28:10,711 - kedro.io.data_catalog -
INFO - Loading data from `Xtrain` (MemoryDataSet)...
2020-12-28 07:28:10,765 - kedro.io.data_catalog - INFO - Loading data from `Ytrain` (MemoryDataSet)...
2020-12-28 07:28:10,767 - kedro.pipeline.node -
INFO - Running node: train_model: train_model([Xtrain,Ytrain]) -> [regression_model]
2020-12-28 07:28:11,173 - kedro.io.data_catalog -
INFO - Saving data to `regression_model` (MemoryDataSet)...
2020-12-28 07:28:11,290 - kedro.runner.sequential_runner - INFO - Completed 5 out of 6 tasks
2020-12-28 07:28:11,290 - kedro.io.data_catalog - INFO - Loading data from `Xtest` (MemoryDataSet)...
2020-12-28 07:28:11,302 - kedro.io.data_catalog - INFO - Loading data from `Ytest` (MemoryDataSet)...
2020-12-28 07:28:11,303 - kedro.io.data_catalog -
INFO - Loading data from `regression_model` (MemoryDataSet)...
2020-12-28 07:28:11,304 - kedro.pipeline.node -
INFO - Running node: predict: predict([Xtest,Ytest,regression_model]) -> None
2020-12-28 07:28:11,367 - example.pipelines.data_science.nodes -
INFO - Model has a coefficient R^2 of 0.456.
2020-12-28 07:28:11,411 - kedro.runner.sequential_runner -
INFO - Completed 6 out of 6 tasks
```
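
When the run is automated, a log like the one above can be checked programmatically. The snippet below is a hypothetical helper (`summarize_run` is not part of Kedro) that pulls the task counter and the R^2 score out of log text. The regular expressions are assumptions based on this output only.

```python
import re

# A few lines copied from the log above, including its line wrapping.
SAMPLE_LOG = """\
2020-12-28 07:28:11,290 - kedro.runner.sequential_runner - INFO - Completed 5 out of 6 tasks
2020-12-28 07:28:11,367 - example.pipelines.data_science.nodes -
INFO - Model has a coefficient R^2 of 0.456.
2020-12-28 07:28:11,411 - kedro.runner.sequential_runner -
INFO - Completed 6 out of 6 tasks
"""

TASKS_RE = re.compile(r"Completed (\d+) out of (\d+) tasks")
R2_RE = re.compile(r"R\^2 of (\d+(?:\.\d+)?)")


def summarize_run(log_text):
    """Return the last task counter and the reported R^2 (None if absent)."""
    tasks = TASKS_RE.findall(log_text)
    done, total = (int(n) for n in tasks[-1]) if tasks else (0, 0)
    r2_match = R2_RE.search(log_text)
    return {
        "done": done,
        "total": total,
        "finished": total > 0 and done == total,
        "r2": float(r2_match.group(1)) if r2_match else None,
    }


summary = summarize_run(SAMPLE_LOG)
```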

Issues

  • The Excel data set could not be loaded

    kedro.io.core.DataSetError: Failed while loading data from data set ExcelDataSet
    (filepath=/home/kedro/data/01_raw/shuttles.xlsx,
    load_args={'engine': xlrd}, protocol=file, save_args={'index': False},
    writer_args={'engine': xlsxwriter}).
    Excel xlsx file; not supported
    

    Fix: pin xlrd==1.2.0 (xlrd 2.0 and later no longer read .xlsx files).

  • Loading data from AWS S3 failed

    server_1  |     ds_name, ds_config, load_versions.get(ds_name), save_version
    webserver_1  |   File "/usr/local/lib/python3.7/site-packages/kedro/io/core.py",
    line 185, in from_config
    webserver_1  |     ) from err
    webserver_1  | kedro.io.core.DataSetError:
    webserver_1  | get_session() got an unexpected keyword argument 'aws_access_key_id'.
    webserver_1  | DataSet 'companies' must only contain arguments valid for the co
    
    File "/usr/local/lib/python3.7/site-packages/pluggy/callers.py", line 187, in _multicall
        res = hook_impl.function(*args)
    File "/home/kedro/src/example/hooks.py", line 78, in register_catalog
        catalog, credentials, load_versions, save_version, journal
    File "/usr/local/lib/python3.7/site-packages/kedro/io/data_catalog.py", line 328, in from_config
        ds_name, ds_config, load_versions.get(ds_name), save_version
    File "/usr/local/lib/python3.7/site-packages/kedro/io/core.py", line 185, in from_config
        ) from err
    kedro.io.core.DataSetError:
    create_client() got multiple values for keyword argument 'aws_access_key_id'.
    DataSet 'companies' must only contain arguments valid for
    the constructor of `kedro.extras.datasets.pandas.csv_dataset.CSVDataSet`.
    

    Fix: install s3fs==0.4.0 and update credentials.yml as follows (see https://discourse.kedro.community/t/how-do-i-pass-s3-credentials-to-my-datasets/156):

    dev_s3:
        client_kwargs:
            aws_access_key_id: access_key
            aws_secret_access_key: secret_key
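
Both S3 error messages are ordinary Python `TypeError`s, raised when keyword arguments do not match a function's signature. The sketch below reproduces them with stand-in functions. `get_session` and `create_client` here are simplified hypothetical stubs, not the real library code, and are meant only to show why the placement of the keys in credentials.yml matters.

```python
def get_session(profile=None):
    """Stub that does not accept AWS key arguments."""
    return {"profile": profile}


def create_client(service, aws_access_key_id=None, aws_secret_access_key=None):
    """Stub that accepts the AWS key arguments."""
    return {"service": service, "key_id": aws_access_key_id}


creds = {"aws_access_key_id": "access_key", "aws_secret_access_key": "secret_key"}

errors = []

# Case 1: keys forwarded to a function that does not know them.
try:
    get_session(**creds)
except TypeError as err:
    errors.append(str(err))

# Case 2: the same key supplied twice, once explicitly and once via the mapping.
try:
    create_client("s3", aws_access_key_id="access_key", **creds)
except TypeError as err:
    errors.append(str(err))

# Working case: keys supplied exactly once to a function that accepts them.
client = create_client("s3", **creds)
```

The first case yields an "unexpected keyword argument" message and the second a "multiple values for keyword argument" message, matching the two tracebacks above.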
    
