Comments (6)
@kyle-v6x Could you describe what behavior you would expect? Setting a small downscaling factor could block downscaling altogether, even if traffic is at 10% of what the current N running replicas are expected to handle, as defined by your own autoscaling config.
The purpose of a small downscaling factor is to prevent the Serve controller from making drastic decisions, e.g. dropping from 100 replicas to 10. However, it should not prevent the replica count from dropping from 2 to 1 when current traffic is far below what 2 replicas can handle.
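The distinction above can be sketched as follows (a minimal illustration, not Ray's actual implementation; the function name and rounding choices are mine):

```python
import math

def smoothed_downscale(current_replicas, desired_replicas, factor):
    # Dampen the downscale delta by the factor; flooring the (negative)
    # step lets a small cluster still shed a replica when one is excess.
    delta = desired_replicas - current_replicas  # negative when downscaling
    return max(desired_replicas, current_replicas + math.floor(factor * delta))

# A drastic drop is softened: 100 -> 55 instead of 100 -> 10.
print(smoothed_downscale(100, 10, 0.5))  # 55
# But a small cluster can still go from 2 -> 1 under light traffic.
print(smoothed_downscale(2, 1, 0.5))  # 1
```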
So there's a few issues with the merged solution:
- It changes how the scaling factor operates for all values. Say you have a scaling factor of 0.6: the new code drastically changes the autoscaler's behaviour, since it always removes one replica regardless of the smoothed value. For a cluster with 5 workers during a low-traffic period, this can cause a 20% mismatch in capacity. There is no easy way to explain this behaviour, and it contradicts the config parameter.
- Previously, low scaling factors could actually be a benefit by providing a 'soft-min' for replica counts. After this point, the cluster would only scale down if the traffic was 0.
I would argue that the previous behaviour was not an issue, as it followed the expected behaviour of a low multiplicative factor. The new behaviour breaks smoothing for any cluster with a low number of replicas. It would be nice to revert the change at least until another solution is found that doesn't also break large scaling values (>= 0.6).
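To make the complaint concrete, here is a sketch contrasting the two behaviours as described in this thread (function names are illustrative, not Ray's code): the old path scales the delta and can round to a no-op, while the new path, as described, removes one replica whenever the desired count is lower.

```python
import math

def old_decision(current, desired, factor):
    # Pre-change sketch: scale the delta; ceil rounds the negative step
    # toward zero, so a small factor can stall downscaling ("soft min").
    return current + math.ceil(factor * (desired - current))

def new_decision(current, desired, factor):
    # Post-change behaviour as described above: a downscale decision
    # always removes exactly one replica, regardless of the factor.
    return current - 1 if desired < current else current

print(old_decision(5, 1, 0.2))  # 5: step of -0.8 rounds up to 0, no downscale
print(new_decision(5, 1, 0.2))  # 4: one replica dropped, a 20% capacity step
```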
Regarding potential long-term solutions after reverting the current solution:
- Compute a momentum value based on previous scaling decisions, so each decision is not made independently
- Apply some log factor to squish the possible inputs
- Apply some conditional logic to increase the smoothing factor for low replica counts:
  if smoothing_factor < 0.55:
      smoothing_factor = min(0.55, smoothing_factor + 1 / num_running_replicas)
From my own testing, the third option seems the most ideal.
@kyle-v6x Could you help me better understand your issue?
For a cluster with 5 workers during a low-traffic period, this could cause a 20% mismatch in capacity.
In this case, Serve will decrease the number of replicas only if traffic is consistently less than 80% of what your 5 workers are expected to handle. If you don't want the number of replicas to decrease in this case, you can tune target_ongoing_requests (if you'd rather have each worker handle less traffic), min_replicas (if you want each worker to handle the same amount of traffic but don't want the number of workers to drop below 5), or downscale_delay_s (if you want to decrease the frequency of downscale decisions to prevent oscillation).
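For reference, a sketch of where those knobs live in a Serve autoscaling config (the field names follow the Ray Serve autoscaling docs; the values here are placeholders, not recommendations):

```python
# Placeholder values; tune for your own workload.
autoscaling_config = {
    "min_replicas": 5,             # hard floor: never drop below 5 workers
    "max_replicas": 20,
    "target_ongoing_requests": 2,  # lower => each replica handles less traffic
    "downscale_delay_s": 600,      # wait longer before enforcing a downscale
}
```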
Previously, low scaling factors could actually be a benefit by providing a 'soft-min' for replica counts. After this point, the cluster would only scale down if the traffic was 0.
Although this was a side effect of low scaling factors before, I don't think it was the originally intended purpose. Our motivation for introducing the scaling factor was to prevent Serve from introducing too big a "delta" in the number of replicas running in the cluster, to prevent oscillation in the replica graph. If a "soft min" is your goal (i.e. only decrease replicas when traffic is zero, but otherwise keep some positive minimum of replicas), in my opinion this is more of a feature request. Would you like to file a GitHub issue for this and assign it to the Serve team?
Of course.
In this case, Serve will decrease the number of replicas only if traffic is consistently at less than 80% of what your 5 workers is expected to handle.
With the changes made, it will decrease the workers by one on the first downscaling_delay check, completely bypassing the scaling factor. As the warning previously mentioned, the scaling factor only caused issues below ~0.6, but these changes apply to all ranges of scaling factors, even those above 0.6. So the new logic always results in one fewer worker than before the change. With 100 workers that is marginal, but at something like 5 workers it is a 20% error. This change fundamentally breaks the downscaling factor for clusters with a low number of replicas, even when the scaling factor is greater than 0.6.
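The arithmetic behind that 20% figure is trivial, but it shows why a forced one-replica step hurts small clusters the most:

```python
def forced_step_error(num_replicas):
    # Relative capacity change from removing exactly one replica.
    return 1 / num_replicas

print(forced_step_error(5))    # 0.2  -> a 20% capacity step at 5 workers
print(forced_step_error(100))  # 0.01 -> a marginal 1% step at 100 workers
```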
or downscale_delay_s (if you want to decrease the frequency of downscale decisions to prevent oscillation)
In this case, downscale_delay is not a solution. As per the logic here, the decision to downscale is only enforced if it has occurred consistently. However, decision_num_replicas is only calculated from the last decision. For a deployment with spikes in traffic, downscale_delay therefore only results in huge delayed oscillations. This is exactly why the downscaling factor has been so useful. Alternatively, you could store an average or minimum decision_num_replicas over the downscale_delay_s window, but that is beyond the scope of this issue.
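A sketch of that averaging alternative (hypothetical; aggregating decision_num_replicas this way is not something Ray currently exposes): average the per-interval desired counts over the delay window instead of acting only on the latest one, so a single quiet sample between spikes does not drive a deep drop.

```python
from collections import deque

class WindowedDecision:
    """Average recent desired replica counts over a fixed window."""
    def __init__(self, window_size):
        self.history = deque(maxlen=window_size)

    def desired(self, instantaneous_desired):
        self.history.append(instantaneous_desired)
        return round(sum(self.history) / len(self.history))

w = WindowedDecision(window_size=3)
print(w.desired(10))  # 10
print(w.desired(2))   # 6: one quiet interval no longer targets 2 replicas
print(w.desired(8))   # 7: spiky traffic settles near its windowed average
```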
Although this was a side effect of low scaling factors before, I think this is not the original intended purpose.
I understand this was not the intended purpose, but it is a natural result of applying a scaler to the decision. If you would like to remove this behaviour, something like the conditional adjustment I suggested above does so without breaking scaling at low replica counts.
I notice this issue was created internally, so perhaps it was causing issues with your own deployments?
One more note:
Since the new changes only apply a maximum of -1 replica per scaling decision, I can try tuning a larger downscale_delay_s to handle low-replica scaling, but this would come at the cost of less agile downscaling during peak times, and it seems like a workaround rather than a solution.
Although I still think this change should be reverted, the feature suggestion here would provide a valid solution/workaround as well.