example-seldon's Introduction

⚠️ kubeflow/example-seldon is not maintained

This repository has been deprecated and archived on Nov 30th, 2021.

Train and Deploy Machine Learning Models on Kubernetes with Kubeflow and Seldon-Core

The example will be the MNIST handwritten digit classification task. We will train 3 different models to solve this task:

  • A TensorFlow neural network model.
  • A scikit-learn random forest model.
  • An R least squares model.

We will then show various rolling deployments:

  1. Deploy the single Tensorflow model.
  2. Do a rolling update to an AB test of the Tensorflow model and the sklearn model.
  3. Do a rolling update to a Multi-armed Bandit over all 3 models to direct traffic in real time to the best model.
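To make the multi-armed bandit idea concrete, here is a minimal sketch of how a router could direct traffic toward the best-performing model. It assumes an epsilon-greedy strategy, which this README does not confirm; the class name `EpsilonGreedyRouter` and its methods are hypothetical and are not part of the seldon-core API.

```python
import random


class EpsilonGreedyRouter:
    """Hypothetical router that picks one of n_arms models per request."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.n_arms = n_arms
        self.epsilon = epsilon          # probability of exploring a random arm
        self.counts = [0] * n_arms      # number of rewards seen per arm
        self.values = [0.0] * n_arms    # running mean reward per arm
        self.rng = random.Random(seed)

    def route(self):
        """Return the index of the model that should serve the next request."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)   # explore
        return self.values.index(max(self.values))   # exploit the best arm so far

    def send_feedback(self, arm, reward):
        """Fold a reward (e.g. 1.0 for a correct prediction) into the arm's mean."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


router = EpsilonGreedyRouter(n_arms=3, epsilon=0.1, seed=0)
arm = router.route()
router.send_feedback(arm, reward=1.0)
```

With three arms, indices 0 to 2 would stand for the TensorFlow, scikit-learn and R models; the mapping is an assumption made for illustration only.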

In the following we will:

  1. Install kubeflow and seldon-core on a kubernetes cluster
  2. Train the models
  3. Serve the models

Requirements

  • gcloud
  • kubectl
  • ksonnet
  • argo

Setup

There is a consolidated script to create the demo which can be found here. For a step by step guide do the following:

  1. Install kubeflow on GKE. This should create kubeflow in a namespace kubeflow. We suggest you use the command line install so you can easily modify your Ksonnet installation. Ensure you have the environment variables KUBEFLOW_SRC and KFAPP set. OAuth is preferred because, with basic auth, port-forwarding to Ambassador is insufficient.

  2. Install seldon. Go to the Ksonnet application folder set up in the previous step and run

    cd ${KUBEFLOW_SRC}/${KFAPP}/ks_app
    
    ks pkg install kubeflow/seldon
    ks generate seldon seldon
    ks apply default -c seldon
    
  3. Install Helm

    kubectl -n kube-system create sa tiller
    kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
    helm init --service-account tiller
    kubectl rollout status deploy/tiller-deploy -n kube-system
    
  4. Create an NFS disk and persistent volume claim called nfs-1. You can follow one guide on creating an NFS volume using Google Filestore here. A consolidated set of steps is shown here.

  5. Add Cluster Roles so Argo can start jobs successfully

    kubectl create clusterrolebinding my-cluster-admin-binding --clusterrole=cluster-admin --user=$(gcloud info --format="value(config.account)")
    kubectl create clusterrolebinding default-admin2 --clusterrole=cluster-admin --serviceaccount=kubeflow:default
    
  6. Install Seldon Analytics Dashboard

    helm install seldon-core-analytics --name seldon-core-analytics --set grafana_prom_admin_password=password --set persistence.enabled=false --repo https://storage.googleapis.com/seldon-charts --namespace kubeflow
    
  7. Port forward the dashboard when running

    kubectl port-forward $(kubectl get pods -n kubeflow -l app=grafana-prom-server -o jsonpath='{.items[0].metadata.name}') -n kubeflow 3000:3000
    
  8. Visit http://localhost:3000/dashboard/db/prediction-analytics?refresh=5s&orgId=1 and login using "admin" and the password you set above when launching with helm.

MNIST models

Tensorflow Model

SKLearn Model

R Model

Train the Models

Follow the steps in ./notebooks/training.ipynb to:

  • Run Argo Jobs for each model to:
    • Create training images and push to repo
    • Run training
    • Create runtime prediction images and push to repo
    • Deploy individual runtime model
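As an illustration of what a runtime prediction image wraps, the sketch below shows the general shape of a Python model class with a `predict(X, feature_names)` method, which is the interface the seldon-core Python wrapper looks for. The class name `StubMnistClassifier` and its placeholder scoring are hypothetical; the real models in this repository load trained weights instead.

```python
class StubMnistClassifier:
    """Hypothetical stand-in for a trained MNIST model."""

    def __init__(self):
        # A real runtime image would load trained weights here,
        # for example from the mounted persistent volume.
        self.n_classes = 10

    def predict(self, X, feature_names):
        """Return one row of 10 class scores for each input row in X.

        Placeholder logic: every digit gets the same score.
        """
        uniform = 1.0 / self.n_classes
        return [[uniform] * self.n_classes for _ in X]


model = StubMnistClassifier()
scores = model.predict([[0.0] * 784], feature_names=None)
```

Each input row is assumed to be a flattened 28x28 image, that is, 784 pixel values.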

To push the Docker images to your own repo, you will need to set up your Docker credentials as a Kubernetes secret containing a config.json. To do this, find your Docker home (typically ~/.docker) and run kubectl create secret generic docker-config --from-file=config.json=${DOCKERHOME}/config.json --type=kubernetes.io/config to create the secret.
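For readers who prefer to see what that secret contains, the following sketch builds a roughly equivalent Secret manifest in Python. The helper name `docker_config_secret` is hypothetical, and the manifest is an approximation of what the kubectl command above would create, not output captured from a cluster.

```python
import base64
import json


def docker_config_secret(config_json_text, name="docker-config"):
    """Build a Secret manifest holding the given Docker config.json text."""
    json.loads(config_json_text)  # fail early if the file is not valid JSON
    encoded = base64.b64encode(config_json_text.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "kubernetes.io/config",      # same type as in the command above
        "data": {"config.json": encoded},    # Secret data values are base64-encoded
    }


# Illustrative input only; a real config.json holds registry credentials.
secret = docker_config_secret('{"auths": {}}')
print(json.dumps(secret, indent=2))
```

The printed JSON could be saved to a file and applied with kubectl apply -f, assuming your current namespace is the one the Argo jobs run in.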

Serve the Models

Follow the steps in ./notebooks/serving.ipynb to:

  1. Deploy the single Tensorflow model.
  2. Do a rolling update to an AB test of the Tensorflow model and the sklearn model.
  3. Do a rolling update to a Multi-armed Bandit over all 3 models to direct traffic in real time to the best model.
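The AB test in step 2 can be pictured as a SeldonDeployment whose graph has a router with two model children. The sketch below builds such a manifest as a Python dict. It assumes seldon-core's built-in `RANDOM_ABTEST` router with a `ratioA` parameter, and the function name, container names and image arguments are hypothetical. It also omits the NFS volume mounts that the real deployments use.

```python
def ab_test_deployment(name, image_a, image_b, ratio_a=0.5):
    """Build a SeldonDeployment dict that splits traffic between two models."""

    def model_node(container_name):
        # Leaf node of the inference graph: a single model served over REST.
        return {
            "name": container_name,
            "type": "MODEL",
            "endpoint": {"type": "REST"},
            "children": [],
        }

    containers = [
        {"name": "tf-model", "image": image_a},
        {"name": "sk-model", "image": image_b},
    ]
    graph = {
        "name": "random-ab-test",
        "endpoint": {},
        "implementation": "RANDOM_ABTEST",
        # Fraction of traffic routed to the first child.
        "parameters": [{"name": "ratioA", "value": str(ratio_a), "type": "FLOAT"}],
        "children": [model_node(c["name"]) for c in containers],
    }
    return {
        "apiVersion": "machinelearning.seldon.io/v1alpha2",
        "kind": "SeldonDeployment",
        "metadata": {"name": name, "labels": {"app": "seldon"}},
        "spec": {
            "name": name,
            "predictors": [
                {
                    "name": name,
                    "replicas": 1,
                    "componentSpecs": [{"spec": {"containers": containers}}],
                    "graph": graph,
                }
            ],
        },
    }


# Placeholder image names, for illustration only.
deployment = ab_test_deployment("mnist-classifier", "example/tf-runtime:0.1", "example/sk-runtime:0.1")
```

Applying a manifest like this over an existing single-model deployment of the same name is what makes the change a rolling update rather than a fresh deployment.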

To ensure the notebook can run successfully, install the Python dependencies:

pip install -r notebooks/requirements.txt
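Once a model is deployed, serving it amounts to sending prediction requests. The sketch below only builds a request body and URL and does not send anything. It assumes the seldon-core REST payload shape with a `data.ndarray` field and an Ambassador route of the form `/seldon/<deployment>/api/v0.1/predictions`; the exact path can differ between seldon-core versions, and both function names are hypothetical.

```python
import json


def mnist_request(pixels):
    """Serialize one flattened 28x28 image as a prediction request body."""
    if len(pixels) != 784:
        raise ValueError("expected 784 pixel values, got {}".format(len(pixels)))
    return json.dumps({"data": {"ndarray": [list(pixels)]}})


def prediction_url(host, deployment="mnist-classifier"):
    """Build the assumed Ambassador URL for a deployment's predictions API."""
    return "http://{}/seldon/{}/api/v0.1/predictions".format(host, deployment)


body = mnist_request([0.0] * 784)
url = prediction_url("localhost:8002")  # host and port are placeholders
```

The body could then be posted with any HTTP client, using a Content-Type of application/json.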

If you have installed the Seldon-Core analytics, you can view them on the Grafana dashboard.

example-seldon's People

Contributors

jinchihe, nicholas-fwang, ryandawsonuk, ukclivecox, windkit, zijianjoy

example-seldon's Issues

Prediction Analytics Dashboard not showing metrics

I deployed a Keras MNIST model using seldon-core and I'm trying to monitor the model using seldon-core-analytics with Grafana dashboards. The cluster_monitoring dashboard looks fine, but the Prediction Analytics dashboard doesn't find my deployed model and all the panels are empty. I have installed kubeflow, seldon-core and seldon-core-analytics in the same namespace, and my model is deployed in that same namespace.

After checking the Prometheus service logs in the same namespace, I found the following message recurring:

level=warn ts=2019-07-17T16:24:27.691294684Z caller=scrape.go:836 component="scrape manager" scrape_pool=kubernetes-pods target=http://192.168.49.198:16686/metrics msg="append failed" err="\"INVALID\" is not a valid start token"

Can you please advise on this?

Thanks.

Not using GCloud to setup Seldon

Hello,
Can seldon-core be set up without using GCloud?
I mean, can I have the NFS set up on the VM itself?
Please share your thoughts.
Thanks in advance.

RESOURCE_ERROR:No valid versions with the prefix \"1.11\" found

I tried to deploy Seldon on GCP after changing the env.sh file and running create_demo.sh. I got the error below while deploying on GCP:
ERROR: (gcloud.deployment-manager.deployments.create) Error in Operation [operation-1608701216204-5b71af08aa508-44ff7cbc-8fd80a7f]: errors:

Please help!

Training error on all frameworks

I am running into an error training the model with all the different frameworks. This is a new installation and this is my first time through, so I expect that there is a missing dependency or something, but I cannot figure out how to debug this and find out what the problem is. The error does not occur until about 4 hours into a run, so I can replicate it reliably, but it takes a long time to do so.

Here is the error:

Name: kubeflow-tf-train-bp9ln
Namespace: kubeflow
ServiceAccount: default
Status: Failed
Message: child 'kubeflow-tf-train-bp9ln-480988007' failed
Created: Tue Apr 02 21:03:12 +0000 (1 week ago)
Started: Tue Apr 02 21:03:12 +0000 (1 week ago)
Finished: Tue Apr 02 21:03:21 +0000 (1 week ago)
Duration: 9 seconds
Parameters:
tfjob-version-hack: 1
version: 0.1
github-user: kubeflow
github-revision: master
docker-user: seldonio
build-push-image: false

STEP PODNAME DURATION MESSAGE
✖ kubeflow-tf-train-bp9ln child 'kubeflow-tf-train-bp9ln-480988007' failed
├---○ build-push when 'false == true' evaluated false
└---✖ train kubeflow-tf-train-bp9ln-480988007 8s Error from server (AlreadyExists): error when creating "/tmp/manifest.yaml": tfjobs.kubeflow.org "mnist-train-1" already exists

Wrapping model for MNIST Scikit-learn doesn't work

When trying to serve the MNIST Scikit-learn model (kubeflow-seldon example) using the latest code there (which uses s2i for building images), I'm getting the following error:

santiago@santiago-Inspiron-5559:~$ kubectl logs seldon-sk-deploy-wc9qg-927719684 main
Connecting to github.com (192.30.253.112:443)
wget: can't execute 'ssl_helper': No such file or directory
wget: error getting response: Connection reset by peer
tar: can't open 'source-to-image-v1.1.9a-40ad911d-linux-amd64.tar.gz': No such file or directory
Cannot connect to the Docker daemon at tcp://127.0.0.1:2375. Is the docker daemon running?
Cannot connect to the Docker daemon at tcp://127.0.0.1:2375. Is the docker daemon running?
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
./wrap.sh: line 14: ./s2i: not found
REPOSITORY TAG IMAGE ID CREATED SIZE
Pushing image to santiagomol40/skmnistclassifier_runtime:0.1
Login Succeeded
The push refers to a repository [docker.io/santiagomol40/skmnistclassifier_runtime]
An image does not exist locally with the tag: santiagomol40/skmnistclassifier_runtime

Looks like the container in charge of running s2i doesn't have SSL support.

deploy.sh fail with error "must provide URIs beginning with 'github.com'"

During installation the following command fails:
$ curl https://raw.githubusercontent.com/kubeflow/kubeflow/v${KUBEFLOW_VERSION}/scripts/deploy.sh | bash

It produces the error:
"ERROR Registries using protocol 'github' must provide URIs beginning with 'github.com' (optionally prefaced with 'http', 'https', 'www', and so on"
This is the full output:

$ export KUBEFLOW_VERSION=0.2.2
$ export KUBEFLOW_KS_DIR=/home/arllanos/ks_kubeflow_seldon
$ export KUBEFLOW_DEPLOY=false
$ curl https://raw.githubusercontent.com/kubeflow/kubeflow/v${KUBEFLOW_VERSION}/scripts/deploy.sh | bash
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1947  100  1947    0     0   2309      0 --:--:-- --:--:-- --:--:--  2306
++ pwd
+ KUBEFLOW_REPO=/home/arllanos/kubeflow_repo
+ KUBEFLOW_VERSION=0.2.2
+ KUBEFLOW_DEPLOY=false
+ [[ ! -d /home/arllanos/kubeflow_repo ]]
+ source /home/arllanos/kubeflow_repo/scripts/util.sh
+ check_install ks
+ which ks
+ check_install kubectl
+ which kubectl
+ DEPLOYMENT_NAME=kubeflow
+ KUBEFLOW_KS_DIR=/home/arllanos/ks_kubeflow_seldon
++ dirname /home/arllanos/ks_kubeflow_seldon
+ cd /home/arllanos
++ basename /home/arllanos/ks_kubeflow_seldon
+ ks init ks_kubeflow_seldon
INFO  Using context 'gke_silicon-cell-209113_us-central1-a_kubeflow-seldon-ml' from the kubeconfig file specified at the environment variable $KUBECONFIG
INFO  Creating environment "default" with namespace "default", pointing to cluster at address "https://35.224.223.191"
INFO  Generating ksonnet-lib data at path '/home/arllanos/ks_kubeflow_seldon/lib/v1.7.0'
INFO  ksonnet app successfully created! Next, try creating a component with `ks generate`.
+ cd /home/arllanos/ks_kubeflow_seldon
+ ks registry add kubeflow /home/arllanos/kubeflow_repo/kubeflow
ERROR Registries using protocol 'github' must provide URIs beginning with 'github.com' (optionally prefaced with 'http', 'https', 'www', and so on

-BTW, in setup.md there is a broken link (404-not found) in the following text:
"Install kubeflow - for details see here"

predict fails and seldondeployment missing .status

@cliveseldon
Calling predict on a deployment that returned success fails with a connection error. Attempting to debug this reveals that .status is missing from the seldondeployment. Any suggestions for how to debug this?

!kubectl get seldondeployments mnist-classifier -o jsonpath='{.status}'

returns nothing

!kubectl get seldondeployments mnist-classifier -o json
returns
{
  "apiVersion": "machinelearning.seldon.io/v1alpha2",
  "kind": "SeldonDeployment",
  "metadata": {
    "annotations": {
      "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"machinelearning.seldon.io/v1alpha2\",\"kind\":\"SeldonDeployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"seldon\"},\"name\":\"mnist-classifier\",\"namespace\":\"kubeflow\"},\"spec\":{\"annotations\":{\"deployment_version\":\"v1\",\"project_name\":\"MNIST Example\",\"seldon.io/engine-separate-pod\":\"false\",\"seldon.io/rest-connection-timeout\":\"100\"},\"name\":\"mnist-classifier\",\"predictors\":[{\"annotations\":{\"predictor_version\":\"v1\"},\"componentSpecs\":[{\"spec\":{\"containers\":[{\"image\":\"seldonio/deepmnistclassifier_runtime:0.2\",\"imagePullPolicy\":\"Always\",\"name\":\"tf-model\",\"volumeMounts\":[{\"mountPath\":\"/data\",\"name\":\"persistent-storage\"}]}],\"terminationGracePeriodSeconds\":1,\"volumes\":[{\"name\":\"persistent-storage\",\"volumeSource\":{\"persistentVolumeClaim\":{\"claimName\":\"nfs-1\"}}}]}}],\"graph\":{\"children\":[],\"endpoint\":{\"type\":\"REST\"},\"name\":\"tf-model\",\"type\":\"MODEL\"},\"name\":\"mnist-classifier\",\"replicas\":1}]}}\n"
    },
    "creationTimestamp": "2019-04-18T21:26:32Z",
    "generation": 1,
    "labels": {
      "app": "seldon"
    },
    "name": "mnist-classifier",
    "namespace": "kubeflow",
    "resourceVersion": "128631",
    "selfLink": "/apis/machinelearning.seldon.io/v1alpha2/namespaces/kubeflow/seldondeployments/mnist-classifier",
    "uid": "a3450e71-6220-11e9-a023-da0ed60f5a55"
  },
  "spec": {
    "annotations": {
      "deployment_version": "v1",
      "project_name": "MNIST Example",
      "seldon.io/engine-separate-pod": "false",
      "seldon.io/rest-connection-timeout": "100"
    },
    "name": "mnist-classifier",
    "predictors": [
      {
        "annotations": {
          "predictor_version": "v1"
        },
        "componentSpecs": [
          {
            "spec": {
              "containers": [
                {
                  "image": "seldonio/deepmnistclassifier_runtime:0.2",
                  "imagePullPolicy": "Always",
                  "name": "tf-model",
                  "volumeMounts": [
                    {
                      "mountPath": "/data",
                      "name": "persistent-storage"
                    }
                  ]
                }
              ],
              "terminationGracePeriodSeconds": 1,
              "volumes": [
                {
                  "name": "persistent-storage",
                  "volumeSource": {
                    "persistentVolumeClaim": {
                      "claimName": "nfs-1"
                    }
                  }
                }
              ]
            }
          }
        ],
        "graph": {
          "children": [],
          "endpoint": {
            "type": "REST"
          },
          "name": "tf-model",
          "type": "MODEL"
        },
        "name": "mnist-classifier",
        "replicas": 1
      }
    ]
  }
}
