Comments (6)
@kyle-v6x Could you describe what behavior you would expect? Setting a small downscaling factor could block downscaling altogether, even if traffic is at 10% of what the current N running replicas are expected to handle, as defined by your own autoscaling config.
The purpose of a small downscaling factor is to prevent the Serve controller from making drastic decisions, e.g. dropping from 100 replicas to 10. However, it should not prevent the replica count from dropping from 2 to 1 when current traffic is far below what 2 replicas can handle.
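The distinction above can be sketched as follows (a minimal illustration, not Ray's actual implementation; the function name and rounding choices are mine):

```python
import math

def smoothed_downscale(current_replicas, desired_replicas, factor):
    # Dampen the downscale delta by the factor; flooring the (negative)
    # step lets a small cluster still shed a replica when one is excess.
    delta = desired_replicas - current_replicas  # negative when downscaling
    return max(desired_replicas, current_replicas + math.floor(factor * delta))

# A drastic drop is softened: 100 -> 55 instead of 100 -> 10.
print(smoothed_downscale(100, 10, 0.5))  # 55
# But a small cluster can still go from 2 -> 1 under light traffic.
print(smoothed_downscale(2, 1, 0.5))  # 1
```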
So there's a few issues with the merged solution:
- It changes how the scaling factor operates for all values. Say you have a scaling factor of 0.6: the new code drastically changes the autoscaler's behaviour, since it always removes one replica regardless of the smoothed value. For a cluster with 5 workers during a low-traffic period, this can cause a 20% mismatch in capacity. There is no easy way to explain this behaviour, and it contradicts the config parameter.
- Previously, low scaling factors could actually be a benefit by providing a 'soft-min' for replica counts. After this point, the cluster would only scale down if the traffic was 0.
I would argue that the previous behaviour was not an issue, as it followed the expected behaviour of a low multiplicative factor. The new behaviour breaks smoothing for any cluster with a low number of replicas. It would be nice to revert the change at least until another solution is found that doesn't also break large scaling values (>= 0.6).
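To make the complaint concrete, here is a sketch contrasting the two behaviours as described in this thread (function names are illustrative, not Ray's code): the old path scales the delta and can round to a no-op, while the new path, as described, removes one replica whenever the desired count is lower.

```python
import math

def old_decision(current, desired, factor):
    # Pre-change sketch: scale the delta; ceil rounds the negative step
    # toward zero, so a small factor can stall downscaling ("soft min").
    return current + math.ceil(factor * (desired - current))

def new_decision(current, desired, factor):
    # Post-change behaviour as described above: a downscale decision
    # always removes exactly one replica, regardless of the factor.
    return current - 1 if desired < current else current

print(old_decision(5, 1, 0.2))  # 5: step of -0.8 rounds up to 0, no downscale
print(new_decision(5, 1, 0.2))  # 4: one replica dropped, a 20% capacity step
```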
Regarding potential long-term solutions after reverting the current solution:
- Compute a momentum value based on previous scaling decisions, so each decision is not made independently
- Apply some log factor to squish the possible inputs
- Apply some conditional logic to increase the smoothing factor for low replica counts:
  if smoothing_factor < 0.55:
      smoothing_factor = min(0.55, smoothing_factor + 1 / num_running_replicas)
From my own testing, the third option seems the most ideal.
@kyle-v6x Could you help me better understand your issue?
For a cluster with 5 workers during a low-traffic period, this could cause a 20% mismatch in capacity.
In this case, Serve will decrease the number of replicas only if traffic is consistently less than 80% of what your 5 workers are expected to handle. If you don't want the number of replicas to decrease in this case, you can tune target_ongoing_requests (if you'd rather have each worker handle less traffic), min_replicas (if you want each worker to handle the same amount of traffic but don't want the number of workers to drop below 5), or downscale_delay_s (if you want to decrease the frequency of downscale decisions to prevent oscillation).
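For reference, a sketch of where those knobs live in a Serve autoscaling config (the field names follow the Ray Serve autoscaling docs; the values here are placeholders, not recommendations):

```python
# Placeholder values; tune for your own workload.
autoscaling_config = {
    "min_replicas": 5,             # hard floor: never drop below 5 workers
    "max_replicas": 20,
    "target_ongoing_requests": 2,  # lower => each replica handles less traffic
    "downscale_delay_s": 600,      # wait longer before enforcing a downscale
}
```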
Previously, low scaling factors could actually be a benefit by providing a 'soft-min' for replica counts. After this point, the cluster would only scale down if the traffic was 0.
Although this was a side effect of low scaling factors before, I don't think it was the originally intended purpose. Our motivation for introducing the scaling factor was to prevent Serve from introducing too big a "delta" in the number of replicas running in the cluster, to prevent oscillation in the replica graph. If a "soft min" is your goal (i.e. only decrease replicas when traffic is zero, but otherwise keep some positive minimum of replicas), in my opinion this is more of a feature request. Would you like to file a GitHub issue for this and assign it to the Serve team?
Of course.
In this case, Serve will decrease the number of replicas only if traffic is consistently at less than 80% of what your 5 workers is expected to handle.
With the changes made, it will decrease the workers by one on the first downscaling_delay check, completely bypassing the scaling factor. As the warning previously mentioned, the scaling factor only caused issues below ~0.6, but these changes apply to all ranges of scaling factors, even those above 0.6. So the new logic always results in one fewer worker than before the change. With 100 workers that is marginal, but at something like 5 workers it is a 20% error. This change fundamentally breaks the downscaling factor for clusters with a low number of replicas, even when the scaling factor is greater than 0.6.
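The arithmetic behind that 20% figure is trivial, but it shows why a forced one-replica step hurts small clusters the most:

```python
def forced_step_error(num_replicas):
    # Relative capacity change from removing exactly one replica.
    return 1 / num_replicas

print(forced_step_error(5))    # 0.2  -> a 20% capacity step at 5 workers
print(forced_step_error(100))  # 0.01 -> a marginal 1% step at 100 workers
```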
or downscale_delay_s (if you want to decrease the frequency of downscale decisions to prevent oscillation)
In this case, downscale_delay is not a solution. As per the logic here, the decision to downscale is only enforced if it has occurred consistently. However, decision_num_replicas is only calculated from the last decision. For a deployment with spikes in traffic, downscale_delay therefore only results in huge delayed oscillations. This is exactly why the downscaling factor has been so useful. Alternatively, you could store an average or minimum decision_num_replicas over the downscale_delay_s window, but that is beyond the scope of this issue.
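A sketch of that averaging alternative (hypothetical; aggregating decision_num_replicas this way is not something Ray currently exposes): average the per-interval desired counts over the delay window instead of acting only on the latest one, so a single quiet sample between spikes does not drive a deep drop.

```python
from collections import deque

class WindowedDecision:
    """Average recent desired replica counts over a fixed window."""
    def __init__(self, window_size):
        self.history = deque(maxlen=window_size)

    def desired(self, instantaneous_desired):
        self.history.append(instantaneous_desired)
        return round(sum(self.history) / len(self.history))

w = WindowedDecision(window_size=3)
print(w.desired(10))  # 10
print(w.desired(2))   # 6: one quiet interval no longer targets 2 replicas
print(w.desired(8))   # 7: spiky traffic settles near its windowed average
```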
Although this was a side effect of low scaling factors before, I think this is not the original intended purpose.
I understand this was not the intended purpose, but it is a natural result of applying a scaler to the decision. If you would like to remove this behaviour, something like the conditional adjustment I suggested above does so without breaking scaling at low replica counts.
I notice this issue was created internally, so perhaps it was causing issues with your own deployments?
One more note:
Since the new changes only apply a maximum of -1 replica per scaling decision, I can try tuning a larger downscale_delay_s to handle low-replica scaling, but this would come at the cost of less agile downscaling during peak times, and it seems like a workaround rather than a solution.
Although I still think this change should be reverted, the feature suggestion here would provide a valid solution/workaround as well.