
kubebench's Introduction

Kubeflow is the cloud-native platform for machine learning operations - pipelines, training, and deployment.


Documentation

Please refer to the official docs at kubeflow.org.

Working Groups

The Kubeflow community is organized into working groups (WGs) with associated repositories that focus on specific pieces of the ML platform.

Get Involved

Please refer to the Community page.

kubebench's People

Contributors

akado2009, andreyvelich, ddutta, gaocegege, jeffwan, jlewi, libbyandhelen, owoshch, pingsutw, ramdootp, swiftdiaries, xyhuang, yanniszark

kubebench's Issues

Simplify installation of needed resources

We can provide a jsonnet package that installs all related resources to get users started (advanced users can still use their own resources and skip this installation). The package can include the following (a rough manifest sketch follows the list):

  • secrets (if specified): GitHub token, GCP credentials
  • NFS service: an NFS server and its corresponding Kubernetes Service
  • volume: NFS PV & PVC
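As a rough illustration of the volume piece (all names, sizes, and the server address below are placeholders, not the actual package contents), the PV/PVC pair could look like:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: kubebench-pv          # placeholder name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: kubebench-nfs.default.svc.cluster.local   # the NFS Service created above (or its cluster IP)
    path: /
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kubebench-pvc         # placeholder name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""        # bind to the pre-created PV instead of a dynamic storage class
  resources:
    requests:
      storage: 10Gi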

Cannot install kubebench

Hello.
I'm following the installation guide.
ks registry add kubebench github.com/kubeflow/kubebench/tree/master/kubebench

But I found that this command is not working...

Here is my log

ERROR could not find ksonnet app

I already checked that ksonnet and Kubeflow are working properly.

"step-run" fails during long benchmark run

This is caused by using "successCondition" in the Argo step to track the status of the created Kubeflow resources (TFJob); the step times out after a few minutes if the "successCondition" is not met, but the Kubeflow resources running benchmarks may need more time than the step can wait.
We need a more robust way to track the status of the created Kubeflow resources.
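For reference, the current pattern looks roughly like the Argo resource template below (the condition expressions and the manifest are illustrative placeholders, not the actual kubebench step):

- name: step-run
  resource:
    action: create
    # The step only waits a limited time for these conditions to be met, which is the root of the problem.
    successCondition: status.succeeded > 0     # placeholder condition for the created TFJob
    failureCondition: status.failed > 0        # placeholder condition
    manifest: |
      # the TFJob manifest generated by the configurator goes here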

kubeflow job monitor/waiter

We need a monitor function in the kubebench controller image that polls the status of deployed Kubeflow jobs until the desired status (success/failure/etc.) is met. The monitor will run in the "wait for job finish" step in #50. It will take a job manifest (generated in the previous step), a success/failure condition, and an optional timeout as input, and return an error if the failure condition is met or the timeout is reached.
We might take Argo's resource execution functionality as a reference (it contains polling of k8s resource status, but with a fixed timeout, which causes #17).
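A rough sketch of how the "wait for job finish" step could invoke such a monitor (the command name, flags, and paths below are hypothetical, not an existing CLI):

- name: wait-for-job
  container:
    image: kubeflow/kubebench-helper:0.0.1           # controller image; tag reused from the API example below
    command: ["kubebench-monitor"]                   # hypothetical binary name
    args:
      - --manifest=/kubebench/config/tf-job.yaml     # job manifest generated by the previous step (placeholder path)
      - --success-condition=status.succeeded > 0     # placeholder condition
      - --failure-condition=status.failed > 0        # placeholder condition
      - --timeout=2h                                 # optional; no hard-coded limit like Argo's resource step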

Add basic benchmark workflow

The benchmark workflow is implemented as an Argo workflow and should consist of config retrieval, benchmark run, and reporting steps.

remove --log-dir from runner

The example tf-cnn-benchmark runner has a required --log-dir argument. This is not user friendly, since it forces users to change their custom runner code. We can remove this argument and instead provide an environment variable pointing to the automatically mounted volume.
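For example (the mount path below is a placeholder; the variable name reuses $KUBEBENCH_OUTPUT_DIR from the tf-cnn workload refactoring issue further down), the runner container could simply receive:

env:
  - name: KUBEBENCH_OUTPUT_DIR     # replaces the required --log-dir argument
    value: /kubebench/output       # placeholder path on the auto-mounted volume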

Configurator should generate manifests directly from kubeflow job prototypes

Right now we use a separate set of job prototypes nested inside the configurator itself. This can be buggy and can also confuse users, since those prototypes might behave differently from the identically named ones provided by Kubeflow.
Instead, we should use the prototypes from Kubeflow directly, with the necessary modifications mixed in.

need an actual smoke test

Right now we have a placeholder for the job test in the e2e workflow. We need an actual smoke test with a dummy benchmark job. The test needs to cover the following:

  • create a config file in a PVC mount
  • create a kubebench job and run it
  • check the results from the PVC mount
  • clean up

re-consider how we use pvc

Currently it is mandatory for users to provide a PVC when creating a kubebench job. The PVC is mainly used in 3 ways:

  1. store user-provided, pre-defined benchmark configs
  2. store intermediate results during benchmarks
  3. store outputs and results

1) and 3) are expected to become replaceable by cloud locations in the future, and 2) does not need to be controlled by the user. So we should provide optional PVC parameters rather than a mandatory one.

We can consider the following changes (a parameter sketch follows the list):

  • have an optional PVC parameter for users to provide custom benchmark configs (replaceable by cloud locations)
  • have an implicit PVC (created automatically) for intermediate results; users can store desired runner data in this PVC
  • have an optional PVC parameter for users to store their results (replaceable by cloud locations)
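A rough sketch of the resulting kubebench-job parameters (all names and values are illustrative, not a committed API):

configPVC: my-config-pvc      # optional: user-provided benchmark configs (replaceable by a cloud location)
experimentPVC: ""             # implicit: created automatically for intermediate runner data when unset
resultPVC: my-result-pvc      # optional: final results and reports (replaceable by a cloud location)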

Improve kubebench workflow

Here is the workflow I would suggest (a rough Argo sketch follows the list):

  • step 1: run configurator
    The configurator lives in the kubebench controller image; it takes a tfjob config file and generates a tfjob manifest.
  • step 2: run job
    The job uses a user-provided image to run the distributed benchmark code; the job parameters are specified in the step 1 output.
  • step 3: wait for job finish
    We need to hold the workflow until the job finishes; we can include a function in the controller image that keeps polling the status of Kubeflow jobs until they finish.
  • step 4: run output processor
    The output processor is a user-provided image that collects outputs from the distributed pods above, aggregates them into result records, and puts them in a designated location.
  • step 5: run reporter
    The reporter should be added to the kubebench controller image; it takes the results from the above location and sends them to a user-specified destination (e.g. update a local .csv, a remote DB, etc.).
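A very rough Argo skeleton of these five steps (template and step names are placeholders, and the individual template definitions are omitted; this is a sketch, not the actual implementation):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: kubebench-job-
spec:
  entrypoint: kubebench-workflow
  templates:
    - name: kubebench-workflow
      steps:
        - - name: run-configurator        # step 1: generate the tfjob manifest from the config file
            template: configurator
        - - name: run-job                 # step 2: create the tfjob from the generated manifest
            template: job
        - - name: wait-for-job            # step 3: poll the tfjob status until it finishes
            template: wait
        - - name: run-output-processor    # step 4: collect and aggregate outputs from the pods
            template: output-processor
        - - name: run-reporter            # step 5: send the results to the user-specified destination
            template: reporter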

CSV reporter should access the input/output files through relative path

The CSV reporter currently assumes an absolute path. Users should not need to know the mount point of the experiment volumes. We should let the CSV reporter (an example follows the list):

  • retrieve the input file through a path relative to $KUBEBENCH_EXP_RESULT_DIR on the experiments volume
  • write the output file through a path relative to $KUBEBENCH_EXP_ROOT on the experiments volume
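For example (the file names and the --input-file flag are illustrative), the reporter would resolve its arguments against those variables instead of taking absolute paths:

--input-file=result.json    ->  $KUBEBENCH_EXP_RESULT_DIR/result.json
--report-file=report.csv    ->  $KUBEBENCH_EXP_ROOT/report.csv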

namespace of kubeflow job should be same as that of kubebench workflow by default

Since the tf-job is configured through a separate config file, the namespace of the job can be different from the one where we launch the kubebench workflow. This is both confusing to users and can cause failures, e.g. a persistent volume cannot be shared between a runner and a reporter in different namespaces.
We should fix this in the refactored workflow.
One way to do it is to have the configurator set the namespace of the generated manifest to that of the workflow.
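In other words (the values below are illustrative), the configurator would rewrite the generated manifest's metadata along these lines:

# before: namespace comes from the user's separate config file
metadata:
  name: my-tf-job
  namespace: some-other-namespace
# after: namespace overridden to match the namespace of the kubebench workflow
metadata:
  name: my-tf-job
  namespace: default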

Split reporter into result parsing and reporting

Currently the reporter has 2 functions: parsing the results from benchmark-generated logs, and reporting the results to a storage location. The former is bound to specific benchmark runners, while the latter should be common across all benchmark runners.
We need to split the reporter into 2 separate steps and turn the reporting part into a common module. We also need a common protocol for feeding data into the reporting part.

What is the initial milestone for kubebench?

Are we planning to use kubebench to regularly run a set of benchmarks and publish them? For example, would we run TFCNN regularly and publish those metrics?

Are we planning on publishing benchmarks for some of the other operators, like Horovod?

Improve kubebench API

We need to improve the API for the kubebench job. The current idea is to have 2 tiers of parameters: the first tier specifies the kubebench job workflow (e.g. the location of the second-tier configs, the type and location of outputs and reports, etc.), and the second tier specifies the tfjob (and the like) with job-specific parameters. The reason we need 2 tiers of config is that we want to decouple kubebench's configuration/reporting workflow from the actual TF jobs' parameters (which are what we actually want to benchmark). We expect users to keep a rarely changing 1st-tier config that specifies the config/result locations, store a bunch of 2nd-tier configs in .yaml files, and have the 1st-tier config point to them, so that they can run multiple benchmark jobs with a single-line parameter change.

Example 1st tier config:
(For now, these will be key-value parameters fed directly into the kubebench-job ksonnet prototype to generate the workflow. Q: shall we use a .yaml file instead?)

name: my-kubebench-job
namespace: default

configuratorImage: kubeflow/kubebench-helper:0.0.1  # image info is just for example
configuratorCmd: kubebench-configurator
configuratorArgs: --source=local,--runner-config=config/tf-cnn-scenario-1.yaml
configuratorSecrets: github-token,gcloud-cred
configuratorVolumes: kubebench-pvc

outputProcessorImage: some-repo/tf-cnn-output-processor:1.0
outputProcessorCmd: python main.py
outputProcessorArgs: null
outputProcessorSecrets: null
outputProcessorVolumes: kubebench-pvc

reporterImage: kubeflow/kubebench-helper:0.0.1
reporterCmd: kubebench-reporter
reporterArgs: --dest=local,--type=csv,--report-file=report/report.csv
reporterSecrets: null
reporterVolumes: kubebench-pvc

Example 2nd tier config:
(this will be persisted in a yaml file and picked up by the kubebench configurator to generate a runner job)

metadata:
  name: my-test-scenario
spec:
  prototype:
    name: tf-job
    package: tf-job
    registry: github.com/kubeflow/kubeflow/tree/master/kubeflow
  parameters:
    name: my-tf-job
    namespace: default
    args: null
    image: null
    image_gpu: null
    image_pull_secrets: null
    num_masters: 1
    num_ps: 1
    num_workers: 1
    num_gpus: 1


refactor reporter

The refactored reporter should go into the kubebench controller image. It should fit in the "run reporter" step in #50. It should take a key-value styled input from the previous step and send the record to an external file or DB. The initially supported data destinations are CSV and BigQuery.
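A hypothetical example of such a key-value record (field names and values are illustrative; the actual fields depend on the benchmark and the agreed protocol):

job_name: my-tf-job
num_workers: 1
num_gpus: 1
images_per_sec: 123.4    # example metric from a tf-cnn run
status: success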

tf-cnn workload refactoring

The workload should consist of 2 parts, a runner and an output processor, each corresponding to a step as in #50. The runner will run the main tf-cnn job and save useful results/logs in a shared output directory; the output processor will read from the output directory and compile a JSON file. We can assume that the shared directory is mounted by the workflow and is accessible via the $KUBEBENCH_OUTPUT_DIR variable in the container.
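For illustration only (the file names and fields are placeholders), the shared directory could end up looking like:

$KUBEBENCH_OUTPUT_DIR/
  worker-0.log      # raw logs saved by the runner
  worker-1.log
  result.json       # compiled by the output processor, e.g. {"images_per_sec": 123.4}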

kubeflow-core component creation issue on the latest

level=error msg="no prototype names matched 'kubeflow-core'"
level=error msg="could not find component: "kubeflow-core" is not a component or a module"
level=error msg="could not find component: "kubeflow-core" is not a component or a module"

Controller image build failed

It shows the following error:

WARNING: Ignoring APKINDEX.70c88391.tar.gz: No such file or directory
WARNING: Ignoring APKINDEX.5022a8a2.tar.gz: No such file or directory
ERROR: unsatisfiable constraints:
  ca-certificates (missing):
    required by: world[ca-certificates]
  wget (missing):
    required by: world[wget]
The command '/bin/sh -c apk add ca-certificates wget && update-ca-certificates' returned a non-zero code: 2

We need to run "apk update" before installing packages, e.g. change the command to "apk update && apk add ca-certificates wget && update-ca-certificates".

Run a list of benchmarks automatically

We need to be able to pick up a list of benchmark configs and run them one by one automatically. We can start by providing a helper script that repeatedly creates kubebench jobs with different configs.
