kubeflow / kubeflow

Machine Learning Toolkit for Kubernetes

License: Apache License 2.0

Makefile 3.39% Python 9.64% Shell 0.19% Go 14.75% HTML 3.72% CSS 2.00% TypeScript 52.57% Dockerfile 1.63% JavaScript 9.69% PowerShell 0.35% Pug 0.70% SCSS 1.36%

google-kubernetes-engine jupyter kubeflow kubernetes machine-learning minikube ml notebook tensorflow

kubeflow's Introduction

Kubeflow is the cloud-native platform for machine learning operations: pipelines, training and deployment.

Documentation

Please refer to the official docs at kubeflow.org.

Working Groups

The Kubeflow community is organized into working groups (WGs), each with associated repositories, that focus on specific pieces of the ML platform.

Quick Links

PR Dashboard

Get Involved

Please refer to the Community page.

kubeflow's People

Contributors

Stargazers

Watchers

Forkers

foxish aronchick jlewi amygdala vishh mattf holdenk ksharpdabu srikarvaka firassghari ivan-rodriguez-ck esmevane addisonhuddy metral 3cham oleewere zhangruiskyline jdc08161063 zyyj007 cloudbring ml-lab squawell cartertsai rishabhbhardwaj xiaoq17 codebase-cn heyuhere vdt cristicmf hustnn chenqiangzhishen sidlinux22 betatim sun363587351 gnanam336 kod3r tony32769 cwkcyd chubbymaggie successren mehrdad-shokri dmueller2001 marami52 amitkumarj441 futurepw hoontme yew1eb dinneo zmoon111 markjacksonfishing john-lin aland-zhang nkhuyu winnerineast scorpiocph katacoda neilbryant joseph-chan zachhhhh limx59 amoahcp mutual-ai abechoi vicaire jarlene qdj0511 justin2061 mofelee aping-fo antzar rawindra111 kiskong ramchentheersridhar kubeflowio glasgabson littlezhou bright-i hamidmhl zephyr2018 tikyau karthikrk1 hamza2404 sozercan cymotiffany liuheng2cqupt changfengfeng kalengo leezqcst bjfanchen andyrao colinsongf samuell 40a adam-zhang qizheng09 skymysky sk38897 jingchun01 hbcbh1999 boragocode

kubeflow's Issues

Remove namespace as a package parameter

Most of our ksonnet packages have namespace as an explicit parameter.

This is a workaround for ksonnet/ksonnet#222. Once that's fixed and components can inherit namespace from the environment, we can remove namespace as an explicit parameter.

Support IAP on GKE

When Kubeflow is deployed on GKE, it should be easy to configure it to set up IAP to secure remote access to Kubeflow services.

Ksonnet manifests for setting up ingress and other resources
Docs explaining how to enable IAP
Use Cloud Endpoints (or some other proxy) to validate JWT and reject traffic that bypassed IAP
- Blocked by #104 - Can't connect to Jupyter Kernel using sidecar

Here are issues that need to be fixed to use Envoy as a proxy that handles JWT verification

istio/proxy#941 - Need support for the ES256 algorithm, which is what IAP uses
istio/proxy#939 - Need to support headers used by IAP
istio/proxy#929 - Envoy proxy needs to support rejecting all requests without valid JWT credentials
istio/proxy#930 - Need to support JWT validation without breaking GCP health checking
- We can work around this just by having 2 envoy proxies in sequence
- both can be running in the same pod
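The first two requirements in the list above (the IAP header and the ES256 algorithm) can be illustrated with a small pre-filter. This is a minimal sketch, assuming the IAP assertion arrives in the `x-goog-iap-jwt-assertion` header. It only reads the unverified JWT header to check the declared algorithm and does not verify the signature, which a real proxy must still do. The `check_request` function and its return strings are hypothetical.

```python
import base64
import json

IAP_HEADER = "x-goog-iap-jwt-assertion"  # header carrying the IAP-signed JWT


def jwt_algorithm(token):
    """Return the `alg` declared in a JWT header, or None if the token is malformed.

    This decodes the header only; it does NOT verify the signature.
    """
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header_b64 = parts[0]
    # Restore the base64url padding that JWTs strip.
    header_b64 += "=" * (-len(header_b64) % 4)
    try:
        header = json.loads(base64.urlsafe_b64decode(header_b64))
    except ValueError:  # covers bad base64, bad UTF-8 and bad JSON
        return None
    return header.get("alg") if isinstance(header, dict) else None


def check_request(headers):
    """Hypothetical pre-filter: reject traffic that did not come through IAP."""
    token = {k.lower(): v for k, v in headers.items()}.get(IAP_HEADER)
    if token is None:
        return "reject: missing IAP assertion"
    if jwt_algorithm(token) != "ES256":
        return "reject: unexpected algorithm"
    return "forward to signature verification"
```

A request without the header is rejected outright, which mirrors the requirement in istio/proxy#929.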

link to ksonnet might be confusing

ksonnet is a pre-req, but you link to the release page which only has a linux build.

so if someone is on mac they are sent on a goose chase, having to read the ksonnet readme to install ks from brew.

so the readme could use a bit more info about installing ks.

Proposal: our expectation on KubeFlow

Authors:

@DjangoPeng - Jingtian Peng <[email protected]>
@zjx-caicloud - Jianxin Zhang <[email protected]>
@gaocegege - Ce Gao <[email protected]>
@ScorpioCPH - Penghao Cen <[email protected]>

Motivation

TensorFlow users can run training jobs on Kubernetes with the support of this project: tensorflow/k8s, which is a CRD controller (or TensorFlow operator) for deploying distributed TensorFlow on Kubernetes.

It may not be sustainable for TensorFlow users (e.g. data scientists or deep learning engineers) to write complex YAML files to deploy their TensorFlow training jobs and other related services (e.g. launching TensorFlow Serving and TensorBoard). Users also cannot get the training status in the current design.

The KubeFlow project is dedicated to building a highly efficient, easy-to-use, cloud-native and scalable Machine Learning platform based on Kubernetes. Our goal is to build an end-to-end system that manages the whole lifecycle of Machine Learning jobs and makes TensorFlow Kubernetes-friendly and user-friendly.

This doc describes a user-oriented distributed ML platform with:

Defining a user spec to deploy TensorFlow Training/Serving/TensorBoard job
Workflow from user side
Components of the architecture

Use Cases

I know nothing about Kubernetes and just want to focus on developing TensorFlow models.
I don’t need to write YAML files for any kind of TensorFlow jobs, including training, serving and TensorBoard jobs.
I can specify computing resources (e.g. GPUs) which will be used in TensorFlow training or serving.
I can get the processing status and logs from KubeFlow API after triggering TensorFlow training jobs.
I can launch TensorFlow model as a service to expose trained TensorFlow model.
I can launch TensorBoard as a service which would track multiple experiments.

Goals

Implement a server
- Accept training code files, e.g. python files as input, and run training jobs on Kubernetes, with no need for additional configs.
- Watch the status of jobs running on Kubernetes and perform cleanup on the cluster.
Implement a monitor, to get the job status from Kubernetes and report specific status to the KubeFlow server
Update CRD specification and controller to fit the new design
For scheduling (future work):
- Implement customized scheduler or reuse the community project kubernetes-incubator/kube-arbitrator to support fine-grained scheduling for TensorFlow jobs
For resource management (future work):
- Use device plugin (in Kubelet) and NVIDIA k8s-device-plugin with nvidia-docker v2 for NVIDIA GPU scheduling
- Implement Resource Class with CRD and Custom Scheduler for different type GPUs scheduling (on homogeneous nodes)

Workflow

Fig 1 KubeFlow distributed training job

As shown in Fig 1, users develop ML models in JupyterHub. Then they send the TensorFlow Python script (e.g. inception_v3.py) to KubeFlow. On receiving the script, KubeFlow creates a TensorFlow distributed training job on Kubernetes. Finally, the TensorFlow training job delivers output files, such as tf.events for TensorBoard visualization and saved_model.pb for TensorFlow Serving.

Fig 2 KubeFlow serving job

As shown in Fig 2, users send the saved_model.pb file into KubeFlow. Then KubeFlow creates a TensorFlow Serving job to provide a public service. With the number of requests increasing, KubeFlow would automatically scale out the job.

Fig 3 KubeFlow TensorBoard job

As shown in Fig 3, users send the tf.events file into KubeFlow. Then KubeFlow creates a TensorBoard job to track multiple experiments.

Components

Fig 4 KubeFlow components

The design has several components:

Kubernetes Extended (to better serve Machine Learning workload)
- TensorFlow CRD
- TensorFlow Controller
- TensorFlow Scheduler
Server (to manage TensorFlow jobs and interact with the client)
Monitor (to watch and report statuses of all TensorFlow jobs)

CRD

According to the use case and design goal, TensorFlow jobs are always divided into three types as below:

Training Job
- Local Training Job
- Distributed Training Job
Serving Job
TensorBoard Job

The TensorFlow distributed training job is the most complicated and elusive among them. It adopts the classical PS-Worker distributed framework. The PS (Parameter Server) is responsible for storing and updating the model's parameters. The Worker is responsible for computing and applying gradients.

However, the native implementation of the TensorFlow distributed framework is not perfect. For example, PS processes block on join forever even after all Worker processes have completed their computation, which wastes resources. At the same time, Kubernetes knows nothing about the TensorFlow job because it lacks TensorFlow concepts and semantics, so it is difficult for native Kubernetes to monitor the TensorFlow job status in real time. KubeFlow aims to handle these problems and fill the gap between TensorFlow and Kubernetes.

The state-of-the-art implementation, tensorflow/k8s, generally works well, and much of that work could be reused in the new CRD design. We propose some changes based on tensorflow/k8s. We divide TensorFlow jobs into three types: training jobs, serving jobs and TensorBoard jobs.

In general, we launch more than one TensorFlow training job with multiple hyperparameter combinations to get the best training results. At the same time, we can use one TensorBoard instance to visualize all of the TensorFlow events output by the training jobs. This is a common case for data scientists and algorithm engineers. But in the current design, a TensorBoard job is bound to one TensorFlow job.

To support this, we need to decouple the TensorFlow job and the TensorBoard job. We can make some changes to the TensorFlow job CRD. First, both jobs can share TFReplicaSpec. Moreover, we can use TFReplicaSpec to specify the TensorFlow Serving job. Finally, we can add a type field to define the TensorFlow job type (one of Training, TensorBoard or Serving).

apiVersion: "tensorflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "training-job"
spec:
  type: Training
  tfReplicaSpec:
  - replicas: 2
    tfReplicaType: PS
    template:
      spec:
        containers:
        - name: tensorflow
          image: tensorflow/tensorflow:1.4.0
          command:
          - "/workdir/tensorflow/launch_training.sh"
          volumeMounts:
          - name: workdir
            mountPath: /workdir
        volumes:
        - name: workdir
          glusterfs:
            endpoints: <gluster-cluster>
            path: <gluster_vol_subpath>
        restartPolicy: OnFailure
  - replicas: 4
    tfReplicaType: Worker
    template:
      spec:
        containers:
        - name: tensorflow
          image: tensorflow/tensorflow:1.4.0
          command:
          - "/workdir/tensorflow/launch_training.sh"
          args:
          - "--data_dir=/workdir/data"
          - "--train_dir=/workdir/train"
          - "--model_dir=/workdir/model"
          volumeMounts:
          - name: workdir
            mountPath: /workdir
        volumes:
        - name: workdir
          glusterfs:
            endpoints: <gluster-cluster>
            path: <gluster_vol_subpath>
        restartPolicy: OnFailure

Algo 1 YAML config for distributed training job

Related issues: tensorflow/k8s#209
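The proposed `type` field and the shared `tfReplicaSpec` can also be sketched as plain data. The Python below is only an illustration under the assumptions of this proposal: the field names mirror Algo 1, while `make_tfjob`, `replica` and `ALLOWED_TYPES` are hypothetical helpers and not part of tensorflow/k8s. Volumes are omitted for brevity.

```python
import json

ALLOWED_TYPES = ("Training", "TensorBoard", "Serving")  # job types named in the proposal


def replica(replica_type, count, image, command):
    """One tfReplicaSpec entry; the same shape would serve every job type."""
    return {
        "replicas": count,
        "tfReplicaType": replica_type,
        "template": {"spec": {
            "containers": [{"name": "tensorflow", "image": image, "command": list(command)}],
            "restartPolicy": "OnFailure",
        }},
    }


def make_tfjob(name, job_type, replica_specs):
    """Build a TFJob manifest dict that carries the proposed `type` field."""
    if job_type not in ALLOWED_TYPES:
        raise ValueError("unknown TFJob type: %r" % (job_type,))
    return {
        "apiVersion": "tensorflow.org/v1alpha1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {"type": job_type, "tfReplicaSpec": list(replica_specs)},
    }


launch = ["/workdir/tensorflow/launch_training.sh"]
training = make_tfjob("training-job", "Training", [
    replica("PS", 2, "tensorflow/tensorflow:1.4.0", launch),
    replica("Worker", 4, "tensorflow/tensorflow:1.4.0", launch),
])
print(json.dumps(training, indent=2))
```

Because the job type lives in one field, a TensorBoard or Serving job would reuse `replica` unchanged and differ only in the `job_type` argument.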

Controller

tensorflow/k8s follows the CoreOS operator pattern to handle events from Kubernetes. We aim to refactor the project to the controller pattern, similar to kubernetes/sample-controller, to make it extensible and robust.

But it has the same function as tensorflow/k8s: the controller watches the shared state of the cluster through the Kubernetes apiserver and makes changes attempting to move the current state towards the desired state for TensorFlow jobs on Kubernetes.
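The "move the current state towards the desired state" step can be sketched as a pure function. This is a simplified illustration, not the tensorflow/k8s implementation: it assumes state is reduced to a count of pods per replica type, and the `reconcile` function and its action tuples are hypothetical.

```python
def reconcile(desired, current):
    """Return the actions needed to move `current` replica counts toward `desired`.

    Both arguments map a replica type (e.g. "PS", "Worker") to a pod count.
    """
    actions = []
    for replica_type in sorted(set(desired) | set(current)):
        want = desired.get(replica_type, 0)
        have = current.get(replica_type, 0)
        if have < want:
            actions.append(("create", replica_type, want - have))
        elif have > want:
            actions.append(("delete", replica_type, have - want))
    return actions


print(reconcile({"PS": 2, "Worker": 4}, {"PS": 2, "Worker": 1, "Master": 1}))
# [('delete', 'Master', 1), ('create', 'Worker', 3)]
```

A real controller would run this comparison each time the apiserver reports a change, and would apply the actions through the Kubernetes API.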

Server

The server is the most important component in KubeFlow. It is responsible for:

Setting up RESTful apiserver to listen for requests for creating TensorFlow jobs or requests from KubeFlow monitor for status updating.

Generating objects that can be accepted by Kubernetes apiserver to create TensorFlow jobs according to user requests.

Recording status of TensorFlow jobs in order to reschedule TensorFlow jobs when something goes wrong during training.

Monitor

The KubeFlow controller allows users to create TensorFlow training jobs on Kubernetes, but the parameter servers keep running even after the training job has finished. In addition, users cannot get the TensorFlow job's status unless they have access to Kubernetes.

We therefore aim to implement a component called the KubeFlow monitor to watch the status of TensorFlow jobs on Kubernetes. The monitor should report the status to the KubeFlow server, so that the server can move the current state towards the desired state.
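One way the monitor could derive a job-level status is sketched below. It assumes the monitor sees each pod as a `(replica_type, phase)` pair using the standard Kubernetes pod phases, and that only workers decide completion because parameter servers never exit on their own. The function names are hypothetical.

```python
def job_status(pods):
    """Summarize a TensorFlow job from (replica_type, phase) pairs.

    Phases are the Kubernetes pod phases: Pending, Running, Succeeded, Failed.
    """
    worker_phases = [phase for rtype, phase in pods if rtype == "Worker"]
    if any(phase == "Failed" for _, phase in pods):
        return "Failed"
    if worker_phases and all(phase == "Succeeded" for phase in worker_phases):
        return "Succeeded"
    if any(phase == "Running" for _, phase in pods):
        return "Running"
    return "Pending"


def replicas_to_clean_up(pods):
    """List replica types still running after the job has succeeded."""
    if job_status(pods) != "Succeeded":
        return []
    return sorted({rtype for rtype, phase in pods if phase == "Running"})


pods = [("PS", "Running"), ("PS", "Running"),
        ("Worker", "Succeeded"), ("Worker", "Succeeded")]
print(job_status(pods), replicas_to_clean_up(pods))
# Succeeded ['PS']
```

The second function covers the leftover parameter servers described above: once the workers are done, the server can be told which replicas to remove.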

Google Doc Version

To discuss the proposal and scope of KubeFlow continuously and conveniently, we created a gdoc of this proposal. Suggestions and comments are welcome! 🎉

Fault tolerant storage for JupyterHub

Jupyter pods are storing data in the pod volume. So if the pod dies you would lose any notebook/file edits.

We should be using a fault tolerant volume so that if the pod dies we don't lose our data.

/cc @foxish

Proposal: Discuss Kubeflow organization and community

Current proposal (by @aronchick):

Find an hour when we can discuss the overall community plans. This should include:
- What org this should roll up to
- If/when this moves to a foundation
- Calendar for the next ~6 months
- Community language for the website (e.g. guidelines, ownership, etc)

cc @willnorris @erikerlandson @foxish @elmiko

Make JupyterHub More Configurable

What are the common points of JupyterHub that users will want to configure?

Default images ?
Auth mode?
SSL?
- This is probably blocked on Kubernetes adding good support for KMS (#54)

In addition or alternatively, we could show people how they could easily provide their own KubeSpawner py file to use in their component.

We might want to look at the existing helm package for JupyterHub and think about whether that can be reused in some fashion.

/cc @foxish @yuvipanda

Katacoda Demo Scenario Python incompatibility - Warning

I was able to deploy demo scenario developed by Katacoda.

In this scenario, you learn how to deploy different Machine Learning workloads using Kubeflow and Kubernetes. The interactive environment is a two-node Kubernetes cluster allowing you to experience Kubeflow and deploy real workloads to understand how it can solve your problems.

The Kubeflow project is dedicated to making Machine Learning on Kubernetes easy, portable and scalable.

I went thru 7 steps and final result is below.
NAME                              READY   STATUS      RESTARTS   AGE
example-job-master-legq-0-xr9wm   0/1     Completed   0          17m
example-job-ps-legq-0-ddk8g       1/1     Running     0          17m
example-job-ps-legq-1-mdrzw       1/1     Running     0          17m
example-job-worker-legq-0-5gzsh   1/1     Running     0          17m
jupyter-admin                     1/1     Running     0          12m
model-client-job-katacoda-z7bq6   0/1     Completed   0          17s
model-server-584cf76db9-5cvw9     1/1     Running     0          19m
model-server-584cf76db9-g4dq6     1/1     Running     0          19m
model-server-584cf76db9-xzcc8     1/1     Running     0          19m
tf-hub-0                          1/1     Running     0          19m
tf-job-operator-6f7ccdfd4d-k8qw9  1/1     Running     0          19m
ed | tail -n1 | tr -s ' ' | cut -d ' ' -f 1) grep Complet
D1225 16:55:43.428628983 1 ev_posix.c:101]Using polling engine: poll
E1225 16:55:48.380718591 1 chttp2_transport.c:1810]close_transport: {"created":"@1514220948.380687479","description":"FD shutdown","file":"src/core/lib/iomgr/ev_poll_posix.c","file_line":427}
outputs {
key: "classes"
value {
dtype: DT_STRING
tensor_shape {
dim {
size: 1
}
dim {
size: 5
}
}
string_val: "comic book"
string_val: "rubber eraser, rubber, pencil eraser"
string_val: "coffee mug"
string_val: "pencil sharpener"
string_val: "envelope"
}
}
outputs {
key: "scores"
value {
dtype: DT_FLOAT
tensor_shape {
dim {
size: 1
}
dim {
size: 5
}
}
float_val: 8.31655883789
float_val: 5.18350791931
float_val: 4.77944898605
float_val: 4.31814956665
float_val: 4.29243946075
}
}

If you know the basics of Linux and Jupyter, it is easy to get thru the steps. I found only one WARNING, a Python version mismatch between 3.5 (libraries) and 3.6 (master/worker versions), while running the "Hello Tensorflow!" code in my Jupyter notebook. But the demo completed fine.

Thanks,

Sunil Sabat

Incorporate tf.transform

We should include tf.transform

At a minimum we should include tf.transform in our Jupyter images so people can use it in Jupyter via the DirectPipelineRunner.

TfServing uber tracking bug

Opening an uber tracking bug to keep track of various features/improvements to ensure TfServing is well supported in Kubeflow

Publish Docker images #50
Health/liveness checks #368
Monitoring #369
GPU support

/cc @rhaertel80 @kow3ns

CUJ: Make Kubeflow flow

Opening this bug to track work related to the following CUJ.

Sam develops an ML model in Jupyter with some small sample data.
After having tweaked the model, Sam now wants to train with a slightly larger dataset.
Sam is familiar with Python, but does not understand Kubernetes.
Sam is provided with a python library that will package his model and run it using TF Job using the data
Sam provides
Sam can then interpret the output of the experiment from Jupyter.
Sam can trigger more jobs or experiments in the future and possibly extend the client library to perform more complex tasks.
When Sam is finished with training, Sam can invoke the same library to serve the trained model either via a new deployment or update an existing deployment.
Throughout this process, Sam did not have to exit his notebook interface.
Tensorboard plugin in Jupyter is able to introspect model generated via Jupyter and TF Job

release .yaml manifest

Hi, having to install ksonnet to try kubeflow and then learn how to use it to be able to have the manifests is a bit cumbersome.

Would be nice to release the working manifests in the github release page, so that one can just kubectl apply to get going and test kubeflow quickly.

Set suitable defaults for JupyterHub service type and make it configurable?

I think the original configuration of JupyterHub was to automatically create an external load balancer.

Is that the best option?

Maybe we should add parameters to the JupyterHub component and make the service type easily configurable.

Failed to connect to my Hub at http://tf-hub-0:8081/hub/api (attempt 1/5). Is it running?

usermod: no changes
Execute the command as jovyan
[W 2018-01-03 06:36:22.606 SingleUserNotebookApp configurable:168] Config option `open_browser` not recognized by `SingleUserNotebookApp`.  Did you mean `browser`?
/opt/conda/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
[I 2018-01-03 06:36:24.227 SingleUserNotebookApp handlers:46] jupyter_tensorboard extension loaded.
[I 2018-01-03 06:36:24.257 SingleUserNotebookApp extension:38] JupyterLab alpha preview extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[I 2018-01-03 06:36:24.258 SingleUserNotebookApp extension:88] Running the core application with no additional extensions or settings
[I 2018-01-03 06:36:24.262 SingleUserNotebookApp singleuser:365] Starting jupyterhub-singleuser server version 0.8.1
[I 2018-01-03 06:36:44.287 SingleUserNotebookApp log:122] 302 GET /user/admin/ → /user/admin/tree? (@172.17.0.6) 0.72ms
[I 2018-01-03 06:36:44.288 SingleUserNotebookApp log:122] 302 GET /user/admin/ → /user/admin/tree? (@172.17.0.6) 0.53ms
[E 2018-01-03 06:36:44.288 SingleUserNotebookApp singleuser:354] Failed to connect to my Hub at http://tf-hub-0:8081/hub/api (attempt 1/5). Is it running?
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/site-packages/jupyterhub/singleuser.py", line 351, in check_hub_version

Add missing features to TfJob controller ksonnet component

The TfJob controller is missing cloud specific bits that are in the helm package.

We should add them so that we can deprecate the helm package.

Tooling to manage configuration and deployment

This is an issue to discuss how we manage the configuration in this repo. Options that have come up so far:

The existing kubectl apply workflow
helm
ksonnet

Since we want a unified approach to deploying the entire suite here,
we should pick a direction that's best suitable for current and future components here.

@jlewi @kow3ns @yuvipanda

Setup PR Dashboard

We should set up the Prow PR Dashboard to make it easier to see which PRs need attention.

Secure proxy

Opening this issue to track addition of a secure proxy.

Kubeflow will consist of many applications running insecure web apps. Already the stack contains Jupyter notebooks and TensorBoard.

It would be nice if Kubeflow included a secure proxy so that a user could access Kubeflow components at URLs like

https://my-kubeflow.my-domain.io/juptyerhub
https://my-kubeflow.my-domain.io/tf-jobs/my-job/tensorboard

Some issues to consider.

Some of these apps are ephemeral. For example,
* TfJob spins up a TensorBoard instance whose lifetime is tied to the life time of the job. So we'd like to dynamically add routes.
We'd like to integrate nicely with different Clouds
* For example, on GKE we'd like to use IAP

I think we can use Ambassador to do this

Ambassador is a proxy based on Envoy
Mappings can be dynamically added just by adding annotations to a K8s service
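To make the annotation idea concrete, here is a rough sketch of a Service manifest that carries an Ambassador mapping. It assumes the `getambassador.io/config` annotation key and the `ambassador/v0` Mapping fields (`name`, `prefix`, `service`) as I understand them from the Ambassador documentation of this period, so treat those as assumptions to verify. The `service_with_mapping` helper, the `app` selector label and the port are hypothetical.

```python
def service_with_mapping(name, prefix, port):
    """Build a Service manifest whose annotation asks Ambassador to route `prefix` to it."""
    mapping = "\n".join([
        "---",
        "apiVersion: ambassador/v0",
        "kind: Mapping",
        "name: %s-mapping" % name,
        "prefix: %s" % prefix,
        "service: %s:%d" % (name, port),
    ])
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            # Assumed annotation key; Ambassador watches Services for it.
            "annotations": {"getambassador.io/config": mapping},
        },
        "spec": {
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": port}],
        },
    }


svc = service_with_mapping("my-job-tensorboard", "/tf-jobs/my-job/tensorboard/", 6006)
print(svc["metadata"]["annotations"]["getambassador.io/config"])
```

With this shape, TfJob would only need to create and delete the Service along with the job, and the route would appear and disappear with it.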

Here's a diagram illustrating what this would look like

Using Ambassador everything sits behind a single ingress point. The ingress would be configured differently on different clouds but everything else should be the same.

On GKE we can secure the ingress using IAP

Hopefully Ambassador supports (or will support) Envoy's JWT features so that we can use Ambassador to reject traffic that bypassed IAP

Some open questions

How do people securely connect to non-GKE clusters?
* Do people use VPN? Do we need an IAP equivalent?
Can we configure Ambassador to do JWT validation?
On non-GKE clusters I think we could use oauth-proxy to replace IAP
* This could authenticate the request using a variety of providers (Github, Google, Facebook) and provide a signed JWT. At which point the solution is largely the same as in the GKE case (i.e. use Envoy to validate JWT and reject traffic without a JWT)
* I think we'd want to integrate oauth-proxy into our envoy proxy
* Ambassador already has support for external auth (but not oauth-proxy)
* Not sure about contour
- How much can we get done for Kubecon May?

Alternatives

It looks like Heptio has a similar project, Contour
- I think contour is designed as an ingress controller

Not sure what the benefits of Contour vs. Ambassador might be. Anyone have an opinion?

I've gotten Ambassador working so that's what I'm probably going to start with.

Add Argo package to our ksonnet registry

https://applatix.com/open-source/argo/

This looks like it could be very useful for

Declaratively describing pipelines/workflows
Building containers from notebooks

We should create an Argo package in our Ksonnet registry to make it easy to deploy Argo with Kubeflow.

We should be able to use the libsonnet files in our test infrastructure as a starting point.

Securely supply credentials and secrets to Jupyter pods

Plumbing of secrets etc tracked upstream in jupyterhub/kubespawner#110

tf-cnn prototype doesn't add GPU resource requests

If we configure a tf-cnn component to use GPUs, we don't actually end up specifying GPUs in the resource request.

e.g.

ks param set --env=gke cnn num_gpus 1
ks param set --env=gke cnn num_workers 1

ks apply gke -c cnn

Produces pod spec

spec:
  containers:
  - args:
    - python
    - tf_cnn_benchmarks.py
    - --batch_size=32
    - --model=resnet50
    - --variable_update=parameter_server
    - --flush_stdout=true
    - --num_gpus=1
    env:
    - name: TF_CONFIG
      value: '{"cluster":{"master":["cnn-master-nvsh-0:2222"],"ps":["cnn-ps-nvsh-0:2222"],"worker":["cnn-worker-nvsh-0:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}'
    image: gcr.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
    imagePullPolicy: IfNotPresent
    name: tensorflow
    resources:
      requests:
        cpu: 100m
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-mnnrt
      readOnly: true
    workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
  dnsPolicy: ClusterFirst

So the image is correctly set to use the GPU but we are missing GPUs in the resources; e.g. it should look like

resources:
   limits:
       nvidia.com/gpu: 1
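The missing step amounts to adding a GPU limit whenever `num_gpus` is positive. The actual fix belongs in the ksonnet prototype, so the Python below is only a sketch of the logic, with `with_gpu_limit` as a hypothetical helper.

```python
import copy


def with_gpu_limit(container, num_gpus):
    """Return a copy of a container spec with an NVIDIA GPU limit when num_gpus > 0."""
    patched = copy.deepcopy(container)
    if num_gpus > 0:
        limits = patched.setdefault("resources", {}).setdefault("limits", {})
        limits["nvidia.com/gpu"] = num_gpus
    return patched


container = {"name": "tensorflow", "resources": {"requests": {"cpu": "100m"}}}
print(with_gpu_limit(container, 1)["resources"])
# {'requests': {'cpu': '100m'}, 'limits': {'nvidia.com/gpu': 1}}
```

The existing CPU request is preserved, and a CPU-only component (`num_gpus` of 0) is left unchanged.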

Clean up repo after switching to ksonnet

Address cleanup following merge of #36

Update README to discuss ksonnet
Remove hacks directory

Using Kubeflow on a local cluster with GPUs

I didn't see anything in the guide about mapping the NVIDIA driver, so I'm a little confused about how to do that.

Should Kubeflow publish and maintain TF Serving Docker Images?

Opening this issue to consider whether Kubeflow should publish and curate TF Serving Docker images.

I think there is a reasonable expectation in the community that users shouldn't have to build Docker images just to serve models: see

A number of folks have already publicly shared Docker images (see tensorflow/serving#513). Kubeflow has also made a Docker image publicly available based on this Dockerfile.

Kubeflow's ksonnet component for serving depends on having a publicly available Docker image to prevent customers from having to build their own image.

I think what's missing is a commitment to maintaining and supporting these images. I think Kubeflow should try to figure out what that would mean and whether we want to take that on.

Here are some questions to consider

Which versions of TensorFlow serving would we support?
What validation/verification of images would we perform?
Would we provide GPU images?
Are there other artifacts we should release? e.g. the python client libraries for TF serving?

Initial proposal

We provide a Docker image for each TensorFlow Serving minor release
- Starting with the current 1.4.0 release
We only provide an image for CPU

/cc @bhack @chrisolston @rhaertel80 @lluunn @nfiedel @bhupchan @ddysher

AWS support!

In the interest of portability and a true ability to run workloads "everywhere", we should support AWS.

There is definitely work to make this happen and priority is likely Google, but AWS has ML offerings and researchers should be able to use them in a similarly slick manner to GOOG.

Get rid of master in tf_cnn example; use WORKER 0 as chief

Now that kubeflow/training-operator#192 has been fixed, TfJob can run TensorFlow code that uses worker 0 as the chief.

We should update the tf-cnn example and get rid of the master and just use Worker 0 as the chief.
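For reference, the chief selection this change relies on can be expressed against the `TF_CONFIG` value that TfJob injects. This is a sketch only: `is_chief` is a hypothetical helper, and it assumes the master replica is the chief whenever one is still present in the cluster spec.

```python
import json


def is_chief(tf_config):
    """Decide whether this replica is the chief, given the TF_CONFIG JSON string.

    If the cluster still has a "master" replica, the master is the chief;
    otherwise worker 0 takes that role.
    """
    config = json.loads(tf_config)
    task = config.get("task", {})
    cluster = config.get("cluster", {})
    if "master" in cluster:
        return task.get("type") == "master"
    return task.get("type") == "worker" and task.get("index") == 0


no_master = {
    "cluster": {"ps": ["cnn-ps-0:2222"], "worker": ["cnn-worker-0:2222", "cnn-worker-1:2222"]},
    "task": {"type": "worker", "index": 0},
}
print(is_chief(json.dumps(no_master)))
# True
```

In a training script, the string would come from `os.environ["TF_CONFIG"]`.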

@lluunn Do you want to take this?

Extend KubeSpawner and its UI to handle Persistent Volume Claims

In the JupyterHub spawner you can specify the Docker image and resource requirements. It would be nice if you could also specify volume claims and mount points so that users can easily attach shared volumes to their notebooks.

According to jupyterhub/jupyterhub/issues/1580 we can do this just by adding some properties to our KubeSpawner.
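As a rough illustration of what "adding some properties" could mean, the sketch below builds the two lists that KubeSpawner's `volumes` and `volume_mounts` settings take. The helper name, the claim name and the mount path are hypothetical, and the claim is assumed to exist already.

```python
def pvc_settings(claim_name, mount_path, volume_name="shared"):
    """Return (volumes, volume_mounts) lists in the pod-spec shape KubeSpawner accepts."""
    volumes = [{
        "name": volume_name,
        "persistentVolumeClaim": {"claimName": claim_name},
    }]
    volume_mounts = [{"name": volume_name, "mountPath": mount_path}]
    return volumes, volume_mounts


volumes, volume_mounts = pvc_settings("team-data", "/home/jovyan/shared")
# In jupyterhub_config.py these would be assigned as:
#   c.KubeSpawner.volumes = volumes
#   c.KubeSpawner.volume_mounts = volume_mounts
print(volumes, volume_mounts)
```

Exposing `claim_name` and `mount_path` in the spawner UI would then be a matter of feeding form values into this function.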

Include eclipse-che in the Kubeflow stack

I think we should investigate including eclipse-che in the Kubeflow stack.

Motivation:

I think ML requires an IDE/editor. For example, a lot of TensorFlow models are multi-file packages. So a notebook environment may not be sufficient if you want to work with these packages.

In addition, if you're developing and building containers locally, pushing them from your local machine to the Cloud can be a big bottleneck in the dev-test cycle. If we can move development and container building into the cluster, that can eliminate the bottleneck.

There are a couple of reasons why I think eclipse-che is a great choice

The UI is a web-app; so it should be performant when running on a remote cluster.
Eclipse-che runs on openshift so hopefully it can be ported to K8s without much difficulty
Custom runtimes are a really nice feature for allowing users to run their code in customized Docker images that contain all the libraries they need.

Support for other Deep Learning Libraries

First of all it's awesome seeing an initiative to help bring a bit of DevOps to ML, something I feel is well overdue - however I have a question just to clarify my own understanding of what KubeFlow actually is.

The Kubeflow project is dedicated to making Machine Learning on Kubernetes easy, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way for spinning up best of breed OSS solutions.

Regarding this statement in the readme, it seems to imply that other OSS solutions will be supported; however, I'm not sure to what extent, especially regarding TensorFlow. My uncertainty arises from the fact that this is a Google-run endeavour and from the project name itself (implying a mash-up of K8s and TF). As such, would it be correct to think that other dataflow programming libraries (Caffe, CNTK, etc.) are out of this project's scope?

Delete inception.tar.gz

We have a copy of inception checked into the repo here.

I don't recall why we checked it into the repo. I think we already host a public copy on GCS for use in our samples.

We should remove it from the repo; it's just unnecessary bloat.

tf-job missing from ksonnet registry yaml

kubeflow/tf-job is not in the registry.yaml file. This prevents tf-job from showing up when someone does a ks pkg list

E2E Testing For Kubeflow.

We need to set up continuous E2E testing for google/kubeflow

Create a basic E2E test
- Deploy and verify JupyterHub is working
- Deploy and verify TfJob is working
- Verify TfServing is working
Set up prow for google/kubeflow
- Set up presubmit tests
- Set up postsubmit tests

I think it makes sense to write the E2E test first, and once we have it we can set up prow integration.

Tensorboard support

This issue tracks Tensorboard support in kubeflow.

There is already a plugin installed in some of our images to integrate Tensorboard with Jupyter.
We require some additional vetting to ensure this works as expected through the proxy.

cc @jlewi @vishh

Split up core ksonnet package into separate packages

The core package consists of

JupyterHub
TfJob CRD
NFS Provisioner

These should probably each be their own package. The core component could then depend on these packages.

The reason we didn't do this in the initial PR #36 is that ksonnet doesn't yet support having packages depend on other packages; see ksonnet/ksonnet#231

Proposal: Official Jupyter Images for Kubeflow

Opening this issue to discuss providing one or more official Kubeflow Jupyter images

Some things to consider:

What libraries/frameworks/tools should be in the image
- TensorFlow?
- GRPC client libraries?
- xgboost?
- scikits?
- ksonnet? helm?
Do we want to produce a small number of fat images or many smaller images?
Can we reuse existing images? e.g. TensorFlow docker images? Jupyter images?

Related:
#50 Docker images for TF Serving
#37 Size of our Jupyter images

/cc @flx42 @yuvipanda

Postsubmits need to use registry components at commit being tested

In postsubmit tests we need to pull registry components at the commit being tested by prow and not head.

ProwTestCase library

See discussion #71 (comment)

I think it might make sense to create a class ProwTestCase inspired by unittest.TestCase to handle a bunch of the boilerplate around writing tests intended to run as a step in an Argo workflow under prow.

So a test file might look like

import prow  # proposed library; it does not exist yet


class TfJobTest(prow.TestCase):
    def testCpu(self):
        ...  # test body elided

    def testGpu(self):
        ...  # test body elided


if __name__ == "__main__":
    prow.main()

Some of the boilerplate that the test framework could take care of is

Creating junit files to report results to prow
Uploading results and artifacts and logs to location used by Gubernator
The entrypoint could be used to handle common "Prow" specific flags (e.g. Gubernator bucket) that writer of the test infrastructure shouldn't have to think about.
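The first item in the list above, creating junit files, is small enough to sketch with the standard library. The function below is a hypothetical piece of such a framework and emits only the minimal `testsuite`/`testcase`/`failure` elements. Whether prow and Gubernator need further attributes is not covered here.

```python
import xml.etree.ElementTree as ET


def junit_xml(suite_name, results):
    """Render test results as a JUnit XML string.

    `results` maps a test name to None on success or a failure message.
    """
    failures = sum(1 for message in results.values() if message is not None)
    suite = ET.Element("testsuite", name=suite_name,
                       tests=str(len(results)), failures=str(failures))
    for test_name, message in results.items():
        case = ET.SubElement(suite, "testcase", name=test_name)
        if message is not None:
            ET.SubElement(case, "failure", message=message)
    return ET.tostring(suite, encoding="unicode")


print(junit_xml("TfJobTest", {"testCpu": None, "testGpu": "timed out"}))
```

A `prow.main()` entrypoint could collect results from each test method, call a function like this, and write the string to the artifacts location before exiting.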

Proposal: Include very basic tracking of usage by default

Using something like Spartakus (https://github.com/kubernetes-incubator/spartakus), ping back to a central server information about the Kubeflow deployment once per day. It should be absolutely anonymous, with zero PII. Just how many components are deployed, and how many pods are running - with a unique identifier to track deployments that last for more than one day.

We should also enable opting out with a single flag, something like --report-metrics=false during ksonnet deployment.
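To make the proposal concrete, here is a minimal sketch of what such a daily report could contain and how the opt-out could short-circuit it. This is not Spartakus's actual payload format; the field names, build_report, maybe_report, and the example component names are all assumptions for illustration.

```python
import json
import uuid


def build_report(deployment_id, components, pod_count):
    """Return the anonymous payload: two counts plus an opaque ID."""
    return {
        "deployment_id": deployment_id,
        "component_count": len(components),
        "pod_count": pod_count,
    }


def maybe_report(report_metrics, deployment_id, components, pod_count):
    """Return the JSON body to send, or None when the user opted out."""
    if not report_metrics:
        return None
    report = build_report(deployment_id, components, pod_count)
    return json.dumps(report, sort_keys=True)


# A random UUID, generated once per deployment and reused each day,
# identifies the deployment without saying who runs it.
deployment_id = str(uuid.uuid4())
# "component-a" and "component-b" are placeholder names.
print(maybe_report(True, deployment_id, ["component-a", "component-b"], 7))
```

Only counts and the random identifier leave the cluster in this sketch; component names are used locally to compute the count and are not included in the payload.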

Syntax error of juypterhub.yaml for Kubernetes 1.6

I got an error when validating juypterhub.yaml while getting started with the latest Kubeflow on Kubernetes 1.6.

Here is the error message.

KubeFlow or Kubeflow?

I'm confused about the project name. Is it KubeFlow or Kubeflow?

Personally, I think KubeFlow is better because Kube and Flow are each taken from one of two communities, Kubernetes and TensorFlow (and this could be extended to the wider ML community in the future).

Proposal: kubeflow-scheduler

Kubeflow Scheduler

Motivation
Use Cases
API
Design
Where to Start

Status: Draft
Version: Alpha
Implementation Owner: TBD

Authors:

@ScorpioCPH - Penghao Cen
@gaocegege - Ce Gao

Motivation

Kubeflow has a controller (or operator) that makes it easy to create ML jobs, which are defined by a CRD. When we create new ML jobs, kube-scheduler reacts by scheduling them onto nodes that satisfy their requests.

This works well for most workloads but is not good enough for ML workloads.
Instead, we need a more advanced scheduler to help us schedule ML workloads more efficiently.
We can achieve this goal with a custom scheduler.

Use Cases

For distributed ML workloads, communication bandwidth will be a bottleneck, so we want the pods of a job to land on the same machine as much as possible.
ML workloads can be accelerated by hardware accelerators (e.g. GPUs), so we want to ensure these workloads are scheduled only on nodes with the specialized hardware.
GPUs come in various versions with different properties (e.g. cores, memory), so we want to ensure that our workloads have enough resources.
- Case A: our training jobs will require more than 8GB of GPU memory because we have a large model.
- Case B: we want some serving jobs to use NVIDIA Tesla K80 GPUs for better performance when running inference.
GPUs can be connected together for much higher performance (e.g. via NVLink), so hardware topology needs to be considered.

API

Kubernetes has a proposal for Resource Class, which provides a better resource representation. We can use this feature for resource management.

In the TFJob spec, we can tell kubeflow-scheduler which resources we request by adding a Resource Class to the spec, in the same way as a CPU/memory request.

For example, we want a TF worker to use 2 NVIDIA Tesla K80 GPUs while training.

spec:
  containers:
    - image: tf_training_worker_gpu_1
      name: tensorflow_worker_gpu_1
      resources:
        requests:
          nvidia.com/gpu.tesla.k80: 2
        limits:
          nvidia.com/gpu.tesla.k80: 2

Design

Resource Class

The details of Resource Class are still TBD, as it is under discussion now.
But we can use a CRD and label selectors to implement a simple version of it (supporting only homogeneous architectures for now), as proposed here.

Scheduler

Kubeflow-scheduler should work well alongside the default scheduler (kube-scheduler). It will watch for Pods created by kubeflow-controller and use predicate functions to evaluate whether a Node satisfies the requirements of each Pod.

These predicate functions may check:

Are the Pods CPU-sensitive (e.g. TensorFlow PS) or GPU-sensitive (e.g. TensorFlow Worker)?
Are the Pods in a group belonging to the same job (e.g. workers of the same TensorFlow training job)?
Are the Pods network-sensitive, requesting InfiniBand for high throughput and low latency?
Are the Pods storage-sensitive, requesting SSD hardware for high I/O speed?
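The predicate idea can be sketched in a few lines. This is a simplified model, not the kube-scheduler API: Node, Pod, the two predicate functions, and the label strings are hypothetical names chosen for illustration, and a real scheduler would read these properties from the Kubernetes API instead of in-memory objects.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gpus: int = 0
    labels: set = field(default_factory=set)  # e.g. {"infiniband", "ssd"}


@dataclass
class Pod:
    name: str
    gpus: int = 0
    required_labels: set = field(default_factory=set)


def has_enough_gpus(pod, node):
    """GPU predicate: the node must offer at least the requested GPUs."""
    return node.gpus >= pod.gpus


def has_required_hardware(pod, node):
    """Hardware predicate: every label the pod requires is on the node."""
    return pod.required_labels <= node.labels


PREDICATES = [has_enough_gpus, has_required_hardware]


def feasible_nodes(pod, nodes):
    """Return the names of nodes for which every predicate holds."""
    return [n.name for n in nodes if all(p(pod, n) for p in PREDICATES)]


nodes = [
    Node("cpu-1"),
    Node("gpu-1", gpus=2, labels={"infiniband"}),
    Node("gpu-2", gpus=4, labels={"infiniband", "ssd"}),
]
worker = Pod("worker", gpus=2, required_labels={"infiniband"})
print(feasible_nodes(worker, nodes))
```

Keeping each check as a separate function means that a new requirement, such as grouping Pods of the same job, could be added as another entry in PREDICATES without touching the filtering loop.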

Where to Start

@mitake has implemented a demo here.
The resource-scheduler is another reference.
Another project, kube-arbitrator, may also be helpful for this proposal.

python model server

@aronchick / @jlewi - I enjoyed the BoF: Machine Learning on Kubernetes talk this evening, thanks for organizing it!

It would be great if kubeflow could include a Python ML model container with support for:

conda environment.yml requirements management
- I use conda to build production ML containers that support multiple model types including
  - scikit-learn
  - theano
  - xgboost gradient boosted trees
gRPC and REST prediction APIs with Swagger documentation
horizontal pod autoscaling
Prometheus application and system metrics

K8s support for KMS

I think we want to make it easy for people to plug their secrets into their Kubeflow deployment. For example, to support SSL with JupyterHub, users need to be able to plug in their certificates.

I think this is blocked on K8s supporting KMS:

kubernetes/kubernetes#51965

Once K8s supports KMS I think we can just expose relevant parameters in our ksonnet prototypes to fetch the key from the appropriate KMS.

Looks like there might be some early support for
- Google KMS
- Vault

(Opening this issue with the new label K8s dependencies to make it easy to track K8s features of high interest).

Doc gen for kubeflow ksonnet registry

There should be a way to autogenerate README files for each ksonnet component in our registry.

Tutorial(s) that correspond to CUJs

We need a tutorial that walks a user through one or more CUJs (critical user journeys) as a way of illustrating the value of Kubeflow.

This might be something like

Deploy Kubeflow
Open up an existing TensorFlow notebook that provides a tutorial
Train locally
Train distributed (with GPUs) using TfJob
Deploy the trained model
Send some predictions

kubeflow/training-operator#195 has some links to existing tutorials about TensorFlow and K8s.

Size of gcr.io/kubeflow/tensorflow-notebook-*

From the README:

We also ship standard docker images that you can use for training Tensorflow models with Jupyter.
gcr.io/kubeflow/tensorflow-notebook-cpu
gcr.io/kubeflow/tensorflow-notebook-gpu
[...] Note that GPU-based image is several gigabytes in size and may take a few minutes to localize.

("localize"?)

They are both large:

$ docker images gcr.io/kubeflow/tensorflow-notebook-gpu:latest
REPOSITORY                                TAG                 IMAGE ID            CREATED             SIZE
gcr.io/kubeflow/tensorflow-notebook-gpu   latest              e68d36c67064        2 weeks ago         7.11GB

$ docker images gcr.io/kubeflow/tensorflow-notebook-cpu:latest
REPOSITORY                                TAG                 IMAGE ID            CREATED             SIZE
gcr.io/kubeflow/tensorflow-notebook-cpu   latest              9cb2a6008740        2 weeks ago         5.17GB

Are the Dockerfiles public for these images? I can probably do a quick PR to reduce the size.

You might be interested in the improvements I made to the devel-gpu Dockerfile for TensorFlow:
tensorflow/tensorflow#15355

Also, it would be helpful if you could chime in on this RFE:
tensorflow/tensorflow#15284
Maybe we can have a single image with Jupyter+TensorFlow+TensorBoard? That would shrink the other TensorFlow images that are shipped today (e.g. gpu and devel-gpu).

Presubmits need to pull registry from the PR branch

In presubmit tests we want to pull registry components from the PR/Commit actually being tested.

It looks like we can do this just by using the GitHub URL of the fork from which the PR is submitted,
e.g.

ks registry add kubeflow-jlewi github.com/jlewi/kubeflow/tree/azure_config/kubeflow
ks pkg install kubeflow-jlewi/core@d95c554
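A presubmit step could derive those two commands from the PR metadata. The sketch below only formats strings and does not invoke ks; registry_commands is a hypothetical helper, and it assumes (as in the example above) that the fork is named kubeflow and that the registry lives in its kubeflow/ directory.

```python
def registry_commands(owner, branch, commit, package="core"):
    """Build the ks commands that pin a package to the PR's fork and commit."""
    registry = f"kubeflow-{owner}"
    # Assumes the fork keeps the repository name and registry directory.
    url = f"github.com/{owner}/kubeflow/tree/{branch}/kubeflow"
    return [
        f"ks registry add {registry} {url}",
        f"ks pkg install {registry}/{package}@{commit}",
    ]


for command in registry_commands("jlewi", "azure_config", "d95c554"):
    print(command)
```

Called with the values from the example, this reproduces the two commands shown above; in a real presubmit the owner, branch, and commit would come from whatever variables prow exposes to the job.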

Add LICENSE file: Apache License, Version 2.0

https://github.com/google/addlicense

Support ml_workbench

Is ml_workbench something we should include in our Docker images for Jupyter?

Document specs for a "reasonable" starter cluster

Clearly, this is highly dependent on what we have available for folks to run.

I repurposed a tiny GKE cluster (n1-standard-1 with autoscaling from 1-5 nodes). I was able to apply the manifests in components successfully, but wasn't able to schedule my model-server pods due to resource limitations. It looks like the bare minimum for the model server pods is a full CPU and a gig of RAM, and they would be happy to get more.

I think we should document a recommended minimum cluster config to get through these demos.
