Comments (9)
@MadhavJivrajani Great! I will let you know when I have a doc.
from kuberay.
I have already worked on a document. I will let you know when it is ready for review.
Thanks for the hard work! I'm currently querying the RayCluster CRD for error messages such as resource quota issues. Would love to see what improvements you make, and I'll try to make it to the sync!
Honestly, I don't think KubeRay should handle and expose K8s Pod errors. You can think of RayCluster as equivalent to multiple ReplicaSets. ReplicaSetStatus doesn't include "Pod failure" in its status. Maybe we can introduce a new conditions field to handle Pod-level observability. Currently, the RayCluster state includes Failed, which is quite undefined and makes the state machine rather messy. I am planning to refactor the RayCluster status soon. If you are interested, we can work on it together, or you can provide feedback on my design document.
I'd be happy to help out here in case the help is needed @han-steve!
Thanks for the response. I agree that the status state machine can get messy with pod failure statuses. An alternative would be to use the Conditions field to reflect the errors in the underlying cluster. For example, ReplicaSet and Deployment use a Condition to inform a user that the pods fail to scale up due to a resource quota error. They also produce events that can be easily seen with a kubectl describe.
Our goal is to surface the underlying error to the user so they know if a job is pending or stuck due to resource quota errors. If there's no plan to surface these conditions, we'll query the associated ray cluster for this info to show to the user. Thanks again for taking a look!
Hi @han-steve @MadhavJivrajani,
I have scheduled a meeting for the RayCluster status improvement work stream on July 10, 8:30 - 8:55 AM PT. You can add the following Google Calendar to subscribe to events for the Ray / KubeRay open-source community.
Hi @kevin85421, we have a similar requirement: we want to expose the errors encountered by Ray pods to users. The main reason is that users of Ray jobs can resolve some of these errors themselves, without further involvement or debugging. Please let me know if you have already published the doc or if there are any meeting notes from this. Thanks.