volcano's Issues

Job GC

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Make sure "completed" and "terminated" jobs are eventually removed.
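
One possible shape for this is a TTL-based sweep, sketched below. The Job type, the phase names, and the collect helper are all hypothetical stand-ins rather than the real Volcano API; the sketch only shows the selection logic, not the actual deletion.

```go
package main

import (
	"fmt"
	"time"
)

// JobPhase and Job are hypothetical stand-ins for the real Job API types.
type JobPhase string

const (
	Running    JobPhase = "Running"
	Completed  JobPhase = "Completed"
	Terminated JobPhase = "Terminated"
)

type Job struct {
	Name       string
	Phase      JobPhase
	FinishedAt time.Time
}

// expired reports whether a finished job has outlived ttl at time now.
// Jobs that are not Completed or Terminated never expire.
func expired(j Job, ttl time.Duration, now time.Time) bool {
	if j.Phase != Completed && j.Phase != Terminated {
		return false
	}
	return !now.Before(j.FinishedAt.Add(ttl))
}

// collect returns the names of the jobs that a GC pass would delete.
func collect(jobs []Job, ttl time.Duration, now time.Time) []string {
	var names []string
	for _, j := range jobs {
		if expired(j, ttl, now) {
			names = append(names, j.Name)
		}
	}
	return names
}

func main() {
	now := time.Date(2019, 4, 1, 12, 0, 0, 0, time.UTC)
	jobs := []Job{
		{Name: "old-done", Phase: Completed, FinishedAt: now.Add(-2 * time.Hour)},
		{Name: "fresh-done", Phase: Terminated, FinishedAt: now.Add(-5 * time.Minute)},
		{Name: "still-running", Phase: Running},
	}
	fmt.Println(collect(jobs, time.Hour, now))
}
```

A real controller would presumably run such a pass periodically, or requeue each job at its expiry time, and the TTL would need to be configurable per job or per cluster.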

Setup travis as CI env

So far, I have found the following two issues:

  • there is no hack/verify-gofmt.sh for make verify
  • both e2e-test and e2e-test-kind are missing their scripts

Reclaim CI failed

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

Description:

The Queue E2E test failed as follows; it seems there are not enough resources for the reclaim e2e test.

• Failure [19.615 seconds]
Queue E2E Test
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/queue.go:26
  Reclaim [It]
  /home/travis/gopath/src/volcano.sh/volcano/test/e2e/queue.go:27
  Expected error:
      <*errors.errorString | 0xc00028f4f0>: {
          s: "expected replica <1> is too small",
      }
      expected replica <1> is too small
  not to have occurred
  /home/travis/gopath/src/volcano.sh/volcano/test/e2e/queue.go:57

refer to https://travis-ci.com/volcano-sh/volcano/jobs/188302297 for more detail :)

Resolve the golint issues

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
Resolve all the golint issues that are currently ignored via the .golint_failures file.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Add e2e test for admission service

Is this a BUG REPORT or FEATURE REQUEST?:

/kind test

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • master

How do we support Job.Spec update

I noticed we want to support Job.Spec update in Controller.updateJob

But the generated request is

	req := apis.Request{
		Namespace: newJob.Namespace,
		JobName:   newJob.Name,

		Event: vkbatchv1.OutOfSyncEvent,
	}

But in syncJob, if no pods are provided in the request, it will create new pods for the Job, so it will fail, and the resulting status is unknown.

By the way, I am not very familiar with the entire state machine, so maybe I am missing something.

Fix state machine issue

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

A new enqueue state has been introduced, but it is unfinished; we need to keep working on it and fix the related test case issues.

NOTE: Some test cases expect the job status to go pending->running/xxxxx, which is incorrect with the new enqueue status; please update them all as well.

Add error handling for exit code

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Got this requirement from a user; it's better to support error handling based on exit code.

Resource Reservation to avoid starvation

Description:

When batch jobs have to compete with each other or with elastic jobs for resources, the resources that become available are likely to be taken immediately by elastic jobs. Batch jobs need multiple resources to be available before they can be dispatched. If the cluster is always busy, a large batch job could be pending indefinitely. The more processors a parallel job requires, the worse the problem is. Resource reservation solves this problem by reserving resources as they become available, until there are enough reserved resources to run the batch job.
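
The accumulation step can be sketched roughly as follows. This is only an illustration under simplifying assumptions: resources are counted in identical units, there is a single pending batch job, and the reservation type is hypothetical and not part of Volcano or kube-batch.

```go
package main

import "fmt"

// reservation tracks resource units held back for one pending batch job.
// The type and its fields are hypothetical and not part of Volcano.
type reservation struct {
	need     int // units the batch job needs before it can be dispatched
	reserved int // units held back so far
}

// offer is called each time one unit becomes free. It returns true when the
// unit is kept for the batch job and false when elastic jobs may take it.
func (r *reservation) offer() bool {
	if r.reserved < r.need {
		r.reserved++
		return true
	}
	return false
}

// ready reports whether enough units are reserved to dispatch the batch job.
func (r *reservation) ready() bool {
	return r.reserved >= r.need
}

func main() {
	r := &reservation{need: 3}
	for i := 1; i <= 4; i++ {
		kept := r.offer()
		fmt.Printf("unit %d: kept for batch job=%v, ready=%v\n", i, kept, r.ready())
	}
}
```

The trade-off is that reserved units sit idle until the batch job can start, so a real scheduler would likely need a timeout or backfill policy on top of this.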

Add event on actions

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Currently, we only record events for Commands; it's better to also record an event for each action of a job.

Queue controller and related cli

Currently, users can only create a Queue for scheduling, but it's hard to get more information about it, e.g. how many jobs are in the queue or which plugins the queue uses; and if the Queue is deleted, its jobs are still there :( It's better to have a QueueController to manage the Queue's lifecycle and update its status, and to have a related command line for users to get its info.

Support TaskSpec level error handling

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Currently, we only support Job level and Task instance level error handling; TaskSpec level error handling is also necessary, e.g. an MPI job should be completed when the mpirun Pod completes successfully.

make mutating and validating admission controllers consistent

func mutateSpec(tasks []v1alpha1.TaskSpec, basePath string) (patch []patchOperation) {
	for index := range tasks {
		// add default task name
		taskName := tasks[index].Name
		if len(taskName) == 0 {
			tasks[index].Name = v1alpha1.DefaultTaskSpec
		}
	}
	patch = append(patch, patchOperation{
		Op:    "replace",
		Path:  basePath,
		Value: tasks,
	})

	return patch
}

If the user does not specify task names for a job, the default name is applied to every unnamed task in the mutating stage, and the validating admission controller then rejects the Job creation because of duplicate task names.
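
One way to make the two stages agree is to derive a unique default from the task index. The sketch below is self-contained and uses hypothetical stand-ins: TaskSpec is reduced to a Name field, and the value of defaultTaskSpec is an assumption rather than something taken from the v1alpha1 package.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultTaskSpec stands in for v1alpha1.DefaultTaskSpec; its value here is
// an assumption made for this sketch.
const defaultTaskSpec = "default"

// TaskSpec is a reduced, hypothetical version of v1alpha1.TaskSpec.
type TaskSpec struct {
	Name string
}

// setDefaultNames gives every unnamed task a default name suffixed with its
// index, so that several unnamed tasks do not end up with the same name.
func setDefaultNames(tasks []TaskSpec) {
	for i := range tasks {
		if tasks[i].Name == "" {
			tasks[i].Name = defaultTaskSpec + strconv.Itoa(i)
		}
	}
}

// hasDuplicateNames mirrors the kind of check a validating stage would run.
func hasDuplicateNames(tasks []TaskSpec) bool {
	seen := make(map[string]bool, len(tasks))
	for _, t := range tasks {
		if seen[t.Name] {
			return true
		}
		seen[t.Name] = true
	}
	return false
}

func main() {
	tasks := []TaskSpec{{}, {Name: "worker"}, {}}
	setDefaultNames(tasks)
	fmt.Println(tasks, hasDuplicateNames(tasks))
}
```

A user-supplied name could still collide with a generated one, so the validating check has to stay in place either way.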

Set default value of PodGroup in admission controller

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Currently, the default values of PodGroup are set by the operator/customized controller, which is inconvenient for developers. It's better to set those default values on PodGroup for all users/developers.

Unable to get csr when building test cluster

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:

Error log:

certificatesigningrequest.certificates.k8s.io/integration-admission-service.kube-system created
NAME                                        AGE   REQUESTOR          CONDITION
integration-admission-service.kube-system   0s    kubernetes-admin   Pending
certificatesigningrequest.certificates.k8s.io/integration-admission-service.kube-system approved
ERROR: After approving csr integration-admission-service.kube-system, the signed certificate did not appear on the resource. Giving up after 10 attempts.
Error: plugin "gen-admission-secret" exited with error
Install volcano chart
NAME:   integration
LAST DEPLOYED: Mon Apr  1 03:20:04 2019
NAMESPACE: kube-system
STATUS: DEPLOYED

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Add example on MPI Job

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Add an example on how to run MPI job :)

Update Imports

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
I see imports used in files with format "volcano.sh/volcano/XXX/XXX/XXX"

What you expected to happen:
They should be of the format "github.com/volcano-sh/volcano/XXX/XXX/XXX"

Pass conformance test

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Cherry pick related PR in kube-batch to volcano-sh/kube-batch for conformance test.

/cc @asifdxtreme

Deleting helm chart exits with error

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

While deleting the helm chart we get an error, and all CRDs still exist.

# helm delete sid
Error: deletion completed with 1 error(s): mutatingwebhookconfigurations.admissionregistration.k8s.io "sid-mutate-job" already exists

Because of this, to deploy it the next time we need to delete all the CRDs and then deploy again.

What you expected to happen:
Deleting the helm chart should exit cleanly.

Allow multiple sync job workers to run in parallel

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:

Currently, there is only one goroutine worker syncing jobs. For large-scale jobs, this will be a bottleneck.
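
A common pattern for this is to shard requests across several workers by hashing the job key, so that requests for the same job are always handled by the same goroutine and stay ordered. The following is a minimal sketch of that idea, not the controller's actual code; workerFor, the queue layout, and the "namespace/name" keys are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// workerFor maps a job key to one of n workers, so that all requests for the
// same job are handled by the same goroutine.
func workerFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

func main() {
	const workers = 4
	queues := make([]chan string, workers)

	var wg sync.WaitGroup
	var mu sync.Mutex
	handled := make(map[int][]string) // worker id -> job keys it processed

	for i := range queues {
		queues[i] = make(chan string, 8)
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Each worker drains only its own queue.
			for key := range queues[id] {
				mu.Lock()
				handled[id] = append(handled[id], key)
				mu.Unlock()
			}
		}(i)
	}

	// Dispatch requests; both "ns/job-a" requests land on the same worker.
	for _, key := range []string{"ns/job-a", "ns/job-b", "ns/job-a"} {
		queues[workerFor(key, workers)] <- key
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
	fmt.Println(handled)
}
```

Sharding by key avoids two workers syncing the same job concurrently, at the cost of uneven load when a few jobs generate most of the requests.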

Add PodGroupController to create shadow PodGroup

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Currently, kube-batch creates a shadow PodGroup from the pod's OwnerReference for upstream objects, e.g. Deployment. This makes Queue-related features harder, e.g. Queue's status; it's better to have such a controller to create PodGroups for upstream objects.

Enable robot for Volcano

Currently, we still merge code manually; it's better to have a robot for it. We can leverage a robot from another community, e.g. Kubernetes.

11 tests failed in CI

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

Description:
11 tests failed in CI; we need to get them fixed ASAP before the release.



[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodFailed; Action: TerminateJob 

/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:102


[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodFailed; Action: AbortJob 

/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:139


[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodEvicted; Action: RestartJob 

/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:174


[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodEvicted; Action: TerminateJob 

/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:218


[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodEvicted; Action: AbortJob 

/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:262


[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: Any; Action: RestartJob 

/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:306


[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: TaskCompleted; Action: CompletedJob 

/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:468


[Fail] Job Error Handling [It] job level LifecyclePolicy, error code: 3; Action: RestartJob 

/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:507


[Fail] Job E2E Test [It] Gang scheduling 

/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_scheduling.go:109


[Fail] MPI E2E Test [It] will run and complete finally 

/home/travis/gopath/src/volcano.sh/volcano/test/e2e/mpi.go:74


[Fail] Job E2E Test: Test Job Command [It] Suspend pending job 

/home/travis/gopath/src/volcano.sh/volcano/test/e2e/command.go:142

xref https://travis-ci.com/volcano-sh/volcano/jobs/197649052

Support Job plugins

Both MPI and TensorFlow need a hostfile for their workers, and MPI jobs need more, e.g. ssh authentication. It's better to provide related plugins for different workloads.

The YAML file might look like the following:

spec:
  plugins:
    ssh: ["seed"]
    env: [""]

For example, if ssh is enabled, the job controller should create the related RSA public/private keys and mount them for ssh.
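
The key generation part could look roughly like the sketch below, which uses only the Go standard library. generateKeyPair is a hypothetical helper, and the PEM encodings chosen here are an assumption: an OpenSSH authorized_keys entry uses a different public key format, which would need extra encoding (for example via golang.org/x/crypto/ssh). Storing and mounting the keys is out of scope for this sketch.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"encoding/pem"
	"fmt"
)

// generateKeyPair is a hypothetical helper that creates an RSA key pair and
// returns both halves PEM-encoded.
func generateKeyPair(bits int) (privPEM, pubPEM []byte, err error) {
	key, err := rsa.GenerateKey(rand.Reader, bits)
	if err != nil {
		return nil, nil, err
	}

	// Private key in PKCS#1 form, the layout used by classic id_rsa files.
	privPEM = pem.EncodeToMemory(&pem.Block{
		Type:  "RSA PRIVATE KEY",
		Bytes: x509.MarshalPKCS1PrivateKey(key),
	})

	// Public key in PKIX form; this is not the authorized_keys format.
	pubDER, err := x509.MarshalPKIXPublicKey(&key.PublicKey)
	if err != nil {
		return nil, nil, err
	}
	pubPEM = pem.EncodeToMemory(&pem.Block{Type: "PUBLIC KEY", Bytes: pubDER})

	return privPEM, pubPEM, nil
}

func main() {
	priv, pub, err := generateKeyPair(2048)
	if err != nil {
		fmt.Println("key generation failed:", err)
		return
	}
	fmt.Printf("private key: %d bytes, public key: %d bytes\n", len(priv), len(pub))
}
```

In a controller, the two byte slices would presumably be stored in a Secret and mounted into each Pod of the job.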

Refactor Delay Pod Creation by admission controller

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Currently, we delay pod creation in the job controller, which makes two scenarios hard:

  1. vk.job cannot work with other schedulers
  2. enqueue cannot support other operators

To resolve the above issues, we prefer to add an admission controller that checks the PodGroup's status for them. If they do not use a PodGroup, PodGroupController will create a shadow one for them.

Makefile cleanup

Is this a BUG REPORT or FEATURE REQUEST?:

/kind cleanup

Description:

  • release is almost equal to all
  • docker target should be images
  • build info is necessary
  • cannot build a release from macOS or other platforms

Support ScheduledJob/CronJob

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Scheduled jobs are a common requirement for high performance workloads.

Speed up E2E tests

/kind bug

Currently, Travis spends almost 26 minutes finishing the e2e tests; we need to figure out how to speed them up.

Ran for 26 min 14 sec
Ran 33 of 33 Specs in 773.302 seconds
SUCCESS! -- 33 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestE2E (773.30s)
PASS
ok  	volcano.sh/volcano/test/e2e	773.323s
release "integration" deleted
Running kind: [kind delete cluster --name integration]
Deleting cluster "integration" ...
$KUBECONFIG is still set to use /home/travis/.kube/kind-config-integration even though that file has been deleted, remember to unset it
Volcano logs are currently not supported.
The command "make e2e-test-kind" exited with 0.

Support Task/Job retry

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Description:

Support task/job retry; if it still fails after the retry count is reached, mark it as Failed.
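
The intended behaviour might be sketched as below. runWithRetry, the phase strings, and the idea of modelling a task as a plain function are illustrative assumptions only; the real controller would react to Pod events instead of calling a function in a loop.

```go
package main

import (
	"errors"
	"fmt"
)

// errTaskFailed is a placeholder error used by the examples below.
var errTaskFailed = errors.New("task failed")

// runWithRetry calls run up to maxAttempts times. It returns "Completed" after
// the first success, or "Failed" once every attempt has failed, together with
// the number of attempts made. The phase names are placeholders.
func runWithRetry(maxAttempts int, run func() error) (phase string, attempts int) {
	for attempts < maxAttempts {
		attempts++
		if err := run(); err == nil {
			return "Completed", attempts
		}
	}
	return "Failed", attempts
}

func main() {
	calls := 0
	// flaky fails twice and then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errTaskFailed
		}
		return nil
	}
	fmt.Println(runWithRetry(5, flaky))
	fmt.Println(runWithRetry(2, func() error { return errTaskFailed }))
}
```

Whether the count applies per task or per job, and whether it resets after a successful restart, are open questions this sketch does not answer.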

The docker image name should align with binaries'

Is this a BUG REPORT or FEATURE REQUEST?:

/kind cleanup

Description:

In Makefile, our binaries are vk-controller, vk-scheduler and so on; but the docker image is volcanosh/volcano-scheduler. It's better to make them align with each other to avoid confusion.

/cc @asifdxtreme

Makefile cleanup

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

Description:

It's better to support following targets in Makefile:

  1. make: only make related binaries, e.g. controller, scheduler
  2. make images: build related docker images
  3. make e2e-test-kind: run e2e test with kind
  4. make unit-test: run unit test
  5. make integration-test: run integration test
