
Kale


Kale is a Python package that aims to automatically deploy a general-purpose Jupyter Notebook as a running Kubeflow Pipelines instance, without requiring the use of the KFP DSL.

The general idea of Kale is to automatically arrange the cells of a notebook and transform them into a unified, KFP-compliant pipeline. To do so, the user only has to decide which cells correspond to which pipeline step, by means of tags. In this way, a researcher can focus on building and testing code locally, and then scale it up in a simple, organized and controlled way.

Getting started

Install Kale from PyPI and run it over one of the provided examples.

# install kale
pip install kubeflow-kale

# download a tagged example notebook
wget https://raw.githubusercontent.com/kubeflow-kale/examples/master/titanic-ml-dataset/titanic_dataset_ml.ipynb
# convert the notebook to a python script that defines a kfp pipeline
kale --nb titanic_dataset_ml.ipynb

This will generate kaggle-titanic.kfp.py, containing a runnable pipeline defined using the KFP Python DSL. Have a look at the code to get a feel for the magic Kale performs under the hood.
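In broad strokes, the generated script is a set of per-step functions wired together with the KFP DSL. A minimal sketch of that shape, using the KFP v1 SDK (illustrative names, not Kale's literal output):

# Sketch of the kind of code Kale generates (names are illustrative)
import kfp
import kfp.dsl as dsl
import kfp.components as comp

def loaddata():
    # body assembled from the cells tagged block:loaddata
    print("loading the titanic dataset")

def featureengineering():
    # body assembled from the cells tagged block:featureengineering
    print("engineering features")

# each step function is wrapped into a containerized pipeline operation
loaddata_op = comp.func_to_container_op(loaddata)
featureengineering_op = comp.func_to_container_op(featureengineering)

@dsl.pipeline(name="kaggle-titanic", description="Titanic ML pipeline")
def auto_generated_pipeline():
    step1 = loaddata_op()
    step2 = featureengineering_op()
    step2.after(step1)  # ordering derived from the prev: tags

if __name__ == "__main__":
    # compile to a package that can be uploaded to the KFP UI
    kfp.compiler.Compiler().compile(auto_generated_pipeline, "pipeline.yaml")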

Deploy the pipeline

If you are running Kale in a Kubeflow Notebook Server, you can add the --run_pipeline flag to convert and run the pipeline automatically:

kale --nb titanic_dataset_ml.ipynb --run_pipeline

This will convert the notebook and start a new run. Switch over to the KFP UI under the Experiments tab to see the running pipeline.
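If you prefer to submit a compiled pipeline yourself, the KFP SDK offers an equivalent programmatic route. A sketch, assuming a reachable KFP endpoint and a compiled pipeline package (the path and names below are illustrative):

import kfp

client = kfp.Client()  # inside a Notebook Server this usually needs no arguments
experiment = client.create_experiment("Titanic Experiment")
run = client.run_pipeline(
    experiment_id=experiment.id,
    job_name="titanic-run",
    pipeline_package_path="pipeline.yaml",  # produced by compiling the generated script
)
print("started run %s" % run.id)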

Jupyter UI

The best way to exploit the potential of Kale is to run JupyterLab with the Jupyter Kale extension installed.

Tagging language

Jupyter provides a tagging feature out of the box that lets you associate each cell with custom-defined tags.

The tags are used to tell Kale how to convert the notebook's code cells into an execution graph, by specifying the execution dependencies between the pipeline steps and which code cells to merge together.

The list of tags recognized by Kale:

Tag | Description
--- | -----------
block:<block_name> | Assign the current cell to a pipeline step
prev:<block_name> | Declare a dependency of the current cell's step on other pipeline steps
imports | Code to be added at the beginning of every pipeline step; particularly useful for cells containing import statements
functions | Code to be added at the beginning of every pipeline step, after the imports; particularly useful for functions or statements used in multiple pipeline steps
parameters | To be used on cells that contain only variable assignments of primitive types (int, str, float, bool). These variables become pipeline parameters, with the assigned values as defaults
skip | Exclude the cell's code from the pipeline

Note: <block_name> must consist of lowercase alphanumeric characters or _, and cannot start with a digit (matching regex: ^[_a-z][_a-z0-9]*$).

Cell Merging

Multiple code cells performing a related task (e.g. some data processing) can be merged into a single pipeline step by tagging the first one with a block tag (e.g. block:data_processing) and leaving the cells below it untagged. Kale merges any untagged cell with the closest tagged cell above it, always skipping cells tagged with skip.

Cells can be merged even if they are not contiguous in the notebook, simply by tagging them with the same block name; the order of the cells in the notebook is preserved in the resulting pipeline step.

Notebook Tags

The cell tags can be added to the tags section of a code cell's metadata (see the nbformat documentation on cell metadata).
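For example, two consecutive cells in a raw .ipynb file might look like this (the cell sources are illustrative); the second, untagged cell gets merged into the data_processing step:

{
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
        "tags": ["block:data_processing", "prev:loaddata"]
    },
    "outputs": [],
    "source": ["df = df.dropna()"]
},
{
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
        "tags": []
    },
    "outputs": [],
    "source": ["df = df.reset_index(drop=True)"]
}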

Notebook Metadata Spec

In order to deploy a pipeline to Kubeflow Pipelines, Kale needs some additional information, such as the names of the experiment and the pipeline, its description, volume mounts, etc.

All this information can be embedded in the notebook's metadata section (see the nbformat spec). Kale expects a metadata entry named kubeflow_notebook with the following spec:

Key | Required | Description | Spec
--- | -------- | ----------- | ----
experiment_name | Yes | Name of the KFP experiment | Free text
pipeline_name | Yes | Name of the pipeline | Alphanumeric characters or -
pipeline_description | No | Description of the pipeline | Free text
volumes | No | A list of volume specs | See the volume spec below
docker_image | No | Base docker image for the pipeline steps | -

Volume spec:

Key | Required | Description | Spec
--- | -------- | ----------- | ----
type | Yes | Type of volume to be mounted | One of pv, pvc, new_pvc
name | Yes | Name of the existing or new resource | K8s-compliant resource name
mount_point | Yes | Mount point in the pipeline step's filesystem | Valid Unix path
size | Yes (for the pv and new_pvc types) | Size of the volume | Integer
size_type | Yes (when defining size) | Storage size unit | One of Gi, Mi, Ki
snapshot | Yes | When true, snapshot the volume at the end of the pipeline | Bool
snapshot_name | Yes (when snapshot is true) | Name of the snapshot resource | K8s-compliant resource name

A sample notebook metadata entry:

"kubeflow_notebook": {
    "experiment_name": "Titanic Experiment",
    "pipeline_name": "ml-comparison",
    "pipeline_description": "ML Pipeline predicting survival score of passengers of Titanic",
    "docker_image": "docker.io/kubeflow-kale/launcher:latest",
    "volumes": [
        {
            "type": "new_pvc",
            "name": "titanic-data-pvc",
            "mount_point": "/data",
            "size": "1",
            "size_type": "Gi",
            "snapshot": true,
            "snapshot_name": "titanic-data-snapshot"
        }
    ]
}

Data passing

When a notebook is split into separate execution steps (each pipeline step runs inside its own Docker container), the data dependencies between the steps would otherwise prevent the pipeline from executing properly.

Kale provides seamless execution without any user intervention by detecting these data dependencies and marshalling the necessary data between the steps.
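Conceptually, the mechanism boils down to serializing each object that a later step needs to a location all steps can reach, such as a mounted volume, and loading it back before the consuming code runs. A simplified sketch of the idea (not Kale's actual marshalling code; the directory below is hypothetical):

import os
import pickle

MARSHAL_DIR = "/data/.marshal"  # hypothetical directory on a shared volume

def save(obj, name):
    # producer step: persist an object a later step depends on
    os.makedirs(MARSHAL_DIR, exist_ok=True)
    with open(os.path.join(MARSHAL_DIR, name + ".pkl"), "wb") as f:
        pickle.dump(obj, f)

def load(name):
    # consumer step: restore the object before the cell code runs
    with open(os.path.join(MARSHAL_DIR, name + ".pkl"), "rb") as f:
        return pickle.load(f)

Kale injects the equivalent of save(df, "df") at the end of the producing step and df = load("df") at the top of the consuming one, so the notebook code itself needs no changes.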

Contribute

Just clone the repository to your local machine and install the package in a virtual environment.

# Clone the repo to your local environment
git clone https://github.com/kubeflow-kale/kale
cd kale
# Install the package in your virtualenv
pip install -r requirements.txt

For a more detailed explanation of the internals of Kale, head over to Architecture.md

