
kubebench's Introduction

Kubeflow is the cloud-native platform for machine learning operations - pipelines, training, and deployment.


Documentation

Please refer to the official docs at kubeflow.org.

Working Groups

The Kubeflow community is organized into working groups (WGs) with associated repositories that focus on specific pieces of the ML platform.

Get Involved

Please refer to the Community page.

kubebench's People

Contributors

akado2009, andreyvelich, ddutta, gaocegege, jeffwan, jlewi, libbyandhelen, owoshch, pingsutw, ramdootp, swiftdiaries, xyhuang, yanniszark

kubebench's Issues

Simplify installation of needed resources

We can provide a jsonnet package that installs all related resources to get users started (advanced users can still use their own resources and skip this installation). The package can include the following (a rough manifest sketch follows the list):

  • secrets (if specified): GitHub token, GCP credentials
  • NFS service: an NFS server and its corresponding Kubernetes Service
  • volume: NFS PV & PVC
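As a rough illustration of the volume piece (all names, sizes, and the server address below are placeholders, not the actual package contents), the PV/PVC pair could look like:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: kubebench-pv          # placeholder name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: kubebench-nfs.default.svc.cluster.local   # the NFS Service created above (or its cluster IP)
    path: /
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kubebench-pvc         # placeholder name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""        # bind to the pre-created PV instead of a dynamic storage class
  resources:
    requests:
      storage: 10Gi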

Cannot install kubebench

Hello.
I'm following the installation guide.
ks registry add kubebench github.com/kubeflow/kubebench/tree/master/kubebench

But I found that this command is not working...

Here is my log

ERROR could not find ksonnet app

I already checked that ksonnet and Kubeflow are working properly.

"step-run" fails during long benchmark run

This is caused by using "successCondition" in the Argo step to track the status of the created Kubeflow resources (TFJob); the step times out after a few minutes if the "successCondition" is not met, but the Kubeflow resources running benchmarks may need more time than the step can wait.
We need a more robust way to track the status of the created Kubeflow resources.
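For reference, the current pattern looks roughly like the Argo resource template below (the condition expressions and the manifest are illustrative placeholders, not the actual kubebench step):

- name: step-run
  resource:
    action: create
    # The step only waits a limited time for these conditions to be met, which is the root of the problem.
    successCondition: status.succeeded > 0     # placeholder condition for the created TFJob
    failureCondition: status.failed > 0        # placeholder condition
    manifest: |
      # the TFJob manifest generated by the configurator goes here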

kubeflow job monitor/waiter

We need a monitor function in the kubebench controller image that polls the status of deployed Kubeflow jobs until the desired status (success/failure/etc.) is met. The monitor will run in the "wait for job finish" step in #50. It will take a job manifest (generated in the previous step), a success/failure condition, and an optional timeout as input, and return an error if the failure condition is met or the timeout is reached.
We might take Argo's resource execution functionality as a reference (it contains polling of k8s resource status, but with a fixed timeout, which causes #17).
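A rough sketch of how the "wait for job finish" step could invoke such a monitor (the command name, flags, and paths below are hypothetical, not an existing CLI):

- name: wait-for-job
  container:
    image: kubeflow/kubebench-helper:0.0.1           # controller image; tag reused from the API example below
    command: ["kubebench-monitor"]                   # hypothetical binary name
    args:
      - --manifest=/kubebench/config/tf-job.yaml     # job manifest generated by the previous step (placeholder path)
      - --success-condition=status.succeeded > 0     # placeholder condition
      - --failure-condition=status.failed > 0        # placeholder condition
      - --timeout=2h                                 # optional; no hard-coded limit like Argo's resource step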

Add basic benchmark workflow

The benchmark workflow is implemented as an Argo workflow and should consist of config retrieval, benchmark run, and reporting steps.

remove --log-dir from runner

The example tf-cnn-benchmark runner has a required --log-dir argument. This is not user friendly, since it forces users to change their custom runner code. We can remove this argument and instead provide an environment variable pointing to the automatically mounted volume.
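For example (the mount path below is a placeholder; the variable name reuses $KUBEBENCH_OUTPUT_DIR from the tf-cnn workload refactoring issue further down), the runner container could simply receive:

env:
  - name: KUBEBENCH_OUTPUT_DIR     # replaces the required --log-dir argument
    value: /kubebench/output       # placeholder path on the auto-mounted volume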

Configurator should generate manifests directly from kubeflow job prototypes

Right now we use a separate set of job prototypes nested inside the configurator itself. This can be buggy and can also confuse users, since those prototypes might behave differently from the identically named ones provided by Kubeflow.
Instead, we should use the prototypes from Kubeflow directly, with the necessary modifications mixed in.

need an actual smoke test

Right now we have a placeholder for the job test in the e2e workflow. We need an actual smoke test with a dummy benchmark job. The test needs to cover the following:

  • create a config file in a PVC mount
  • create a kubebench job and run it
  • check the results from the PVC mount
  • clean up

re-consider how we use pvc

Currently it is mandatory for users to provide a PVC when creating a kubebench job. The PVC is mainly used in 3 ways:

  1. store user-provided, pre-defined benchmark configs
  2. store intermediate results during benchmarks
  3. store outputs and results

1) and 3) are expected to become replaceable by cloud locations in the future, and 2) does not need to be controlled by the user. So we should provide optional PVC parameters rather than a mandatory one.

We can consider the following changes (a parameter sketch follows the list):

  • have an optional PVC parameter for users to provide custom benchmark configs (replaceable by cloud locations)
  • have an implicit PVC (created automatically) for intermediate results; users can store desired runner data in this PVC
  • have an optional PVC parameter for users to store their results (replaceable by cloud locations)
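A rough sketch of the resulting kubebench-job parameters (all names and values are illustrative, not a committed API):

configPVC: my-config-pvc      # optional: user-provided benchmark configs (replaceable by a cloud location)
experimentPVC: ""             # implicit: created automatically for intermediate runner data when unset
resultPVC: my-result-pvc      # optional: final results and reports (replaceable by a cloud location)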

Improve kubebench workflow

Here is the workflow I would suggest (a rough Argo sketch follows the list):

  • step 1: run configurator
    The configurator lives in the kubebench controller image; it takes a tfjob config file and generates a tfjob manifest.
  • step 2: run job
    The job uses a user-provided image to run the distributed benchmark code; the job parameters are specified in the step 1 output.
  • step 3: wait for job finish
    We need to hold the workflow until the job finishes; we can include a function in the controller image that keeps polling the status of Kubeflow jobs until they finish.
  • step 4: run output processor
    The output processor is a user-provided image that collects outputs from the distributed pods above, aggregates them into result records, and puts them in a designated location.
  • step 5: run reporter
    The reporter should be added to the kubebench controller image; it takes the results from the above location and sends them to a user-specified destination (e.g. update a local .csv, a remote DB, etc.).
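A very rough Argo skeleton of these five steps (template and step names are placeholders, and the individual template definitions are omitted; this is a sketch, not the actual implementation):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: kubebench-job-
spec:
  entrypoint: kubebench-workflow
  templates:
    - name: kubebench-workflow
      steps:
        - - name: run-configurator        # step 1: generate the tfjob manifest from the config file
            template: configurator
        - - name: run-job                 # step 2: create the tfjob from the generated manifest
            template: job
        - - name: wait-for-job            # step 3: poll the tfjob status until it finishes
            template: wait
        - - name: run-output-processor    # step 4: collect and aggregate outputs from the pods
            template: output-processor
        - - name: run-reporter            # step 5: send the results to the user-specified destination
            template: reporter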

CSV reporter should access the input/output files through relative path

The CSV reporter currently assumes an absolute path. Users should not need to know the mount point of the experiment volumes. We should let the CSV reporter (an example follows the list):

  • retrieve the input file through a path relative to $KUBEBENCH_EXP_RESULT_DIR on the experiments volume
  • write the output file through a path relative to $KUBEBENCH_EXP_ROOT on the experiments volume
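For example (the file names and the --input-file flag are illustrative), the reporter would resolve its arguments against those variables instead of taking absolute paths:

--input-file=result.json    ->  $KUBEBENCH_EXP_RESULT_DIR/result.json
--report-file=report.csv    ->  $KUBEBENCH_EXP_ROOT/report.csv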

namespace of kubeflow job should be same as that of kubebench workflow by default

Since the tf-job is configured through a separate config file, the namespace of the job can be different from the one where we launch the kubebench workflow. This is both confusing to users and can cause failures, e.g. a persistent volume cannot be shared between a runner and a reporter in different namespaces.
We should fix this in the refactored workflow.
One way to do it is to have the configurator set the namespace of the generated manifest to that of the workflow.
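In other words (the values below are illustrative), the configurator would rewrite the generated manifest's metadata along these lines:

# before: namespace comes from the user's separate config file
metadata:
  name: my-tf-job
  namespace: some-other-namespace
# after: namespace overridden to match the namespace of the kubebench workflow
metadata:
  name: my-tf-job
  namespace: default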

Split reporter into result parsing and reporting

Currently the reporter has 2 functions: parsing the results from benchmark-generated logs, and reporting the results to a storage location. The former is bound to specific benchmark runners, while the latter should be common across all benchmark runners.
We need to split the reporter into 2 separate steps and turn the reporting part into a common module. We also need a common protocol for feeding data into the reporting part.

What is the initial milestone for kubebench?

Are we planning to use kubebench to regularly run a set of benchmarks and publish them? For example, would we run TFCNN regularly and publish those metrics?

Are we planning on publishing benchmarks for some of the other operators, like Horovod?

Improve kubebench API

We need to improve the API for the kubebench job. The current idea is to have 2 tiers of parameters: the first tier specifies the kubebench job workflow (e.g. the location of the second-tier configs, the type and location of outputs and reports, etc.), and the second tier specifies the tfjob (and the like) with job-specific parameters. The reason we need 2 tiers of config is that we want to decouple kubebench's configuration/reporting workflow from the actual TF jobs' parameters (which are what we actually want to benchmark). We expect users to keep a rarely changing 1st-tier config that specifies the config/result locations, store a bunch of 2nd-tier configs in .yaml files, and have the 1st-tier config point to them, so that they can run multiple benchmark jobs with a single-line parameter change.

Example 1st tier config:
(For now, these will be key-value parameters fed directly into the kubebench-job ksonnet prototype to generate the workflow. Q: shall we use a .yaml file instead?)

name: my-kubebench-job
namespace: default

configuratorImage: kubeflow/kubebench-helper:0.0.1  # image info is just for example
configuratorCmd: kubebench-configurator
configuratorArgs: --source=local,--runner-config=config/tf-cnn-scenario-1.yaml
configuratorSecrets: github-token,gcloud-cred
configuratorVolumes: kubebench-pvc

outputProcessorImage: some-repo/tf-cnn-output-processor:1.0
outputProcessorCmd: python main.py
outputProcessorArgs: null
outputProcessorSecrets: null
outputProcessorVolumes: kubebench-pvc

reporterImage: kubeflow/kubebench-helper:0.0.1
reporterCmd: kubebench-reporter
reporterArgs: --dest=local,--type=csv,--report-file=report/report.csv
reporterSecrets: null
reporterVolumes: kubebench-pvc

Example 2nd tier config:
(this will be persisted in a yaml file and picked up by the kubebench configurator to generate a runner job)

metadata:
  name: my-test-scenario
spec:
  prototype:
    name: tf-job
    package: tf-job
    registry: github.com/kubeflow/kubeflow/tree/master/kubeflow
  parameters:
    name: my-tf-job
    namespace: default
    args: null
    image: null
    image_gpu: null
    image_pull_secrets: null
    num_masters: 1
    num_ps: 1
    num_workers: 1
    num_gpus: 1


refactor reporter

The refactored reporter should go into the kubebench controller image. It should fit in the "run reporter" step in #50. It should take a key-value styled input from the previous step and send the record to an external file or DB. The initially supported data destinations are CSV and BigQuery.
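A hypothetical example of such a key-value record (field names and values are illustrative; the actual fields depend on the benchmark and the agreed protocol):

job_name: my-tf-job
num_workers: 1
num_gpus: 1
images_per_sec: 123.4    # example metric from a tf-cnn run
status: success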

tf-cnn workload refactoring

The workload should consist of 2 parts, a runner and an output processor, each corresponding to a step as in #50. The runner will run the main tf-cnn job and save useful results/logs in a shared output directory; the output processor will read from the output directory and compile a JSON file. We can assume that the shared directory is mounted by the workflow and is accessible via the $KUBEBENCH_OUTPUT_DIR variable in the container.
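For illustration only (the file names and fields are placeholders), the shared directory could end up looking like:

$KUBEBENCH_OUTPUT_DIR/
  worker-0.log      # raw logs saved by the runner
  worker-1.log
  result.json       # compiled by the output processor, e.g. {"images_per_sec": 123.4}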

kubeflow-core component creation issue on the latest

level=error msg="no prototype names matched 'kubeflow-core'"
level=error msg="could not find component: "kubeflow-core" is not a component or a module"
level=error msg="could not find component: "kubeflow-core" is not a component or a module"

Controller image build failed

It shows the following error:

WARNING: Ignoring APKINDEX.70c88391.tar.gz: No such file or directory
WARNING: Ignoring APKINDEX.5022a8a2.tar.gz: No such file or directory
ERROR: unsatisfiable constraints:
  ca-certificates (missing):
    required by: world[ca-certificates]
  wget (missing):
    required by: world[wget]
The command '/bin/sh -c apk add ca-certificates wget && update-ca-certificates' returned a non-zero code: 2

We need to run "apk update" before installing packages, e.g. change the command to "apk update && apk add ca-certificates wget && update-ca-certificates".

Run a list of benchmarks automatically

We need to be able to pick up a list of benchmark configs and run them one by one automatically. We can start by providing a helper script that repeatedly creates kubebench jobs with different configs.
