
Comments (17)

tonya11en avatar tonya11en commented on September 26, 2024 5

@ipoval I think #34257 will fix this. Please verify if you have a chance.

tonya11en avatar tonya11en commented on September 26, 2024 4

Let's just add a boolean parameter that will give the behavior you expect. We can't change the existing default behavior, but I think it's fair to have some flag called something like strict_window_enforcement.

This should be pretty straightforward to implement if you want to submit the patch. I'm happy to answer questions and review.
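
To illustrate the idea (a rough sketch only — the member name strict_window_enforcement_ and how the flag is plumbed through from the config are assumptions, not existing code), the opt-in guard inside averageRps() could look roughly like:

uint32_t ThreadLocalControllerImpl::averageRps() const {
  if (historical_data_.empty() || global_data_.requests == 0) {
    return 0;
  }
  // Hypothetical opt-in behavior: with strict enforcement enabled, report 0 RPS
  // until a full sampling window of samples has been accumulated, so the filter
  // cannot start rejecting during the warm-up period.
  if (strict_window_enforcement_ && ageOfOldestSample() < sampling_window_) {
    return 0;
  }
  // The remainder mirrors the existing calculation discussed in this thread.
  using std::chrono::seconds;
  const seconds secs = std::max(seconds(1), std::chrono::duration_cast<seconds>(ageOfOldestSample()));
  return global_data_.requests / secs.count();
}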

tonya11en avatar tonya11en commented on September 26, 2024 3

Your suggestion about averageRps() not requiring to be thread-local sounds reasonable, but aren't you afraid that it will require other parts also be transformed into non-thread-local equivalents, which could potentially impact performance?

@ipoval I'm not really worried about performance impact, since we'd only be introducing an atomic incr/decr. We can also benchmark it to ensure there's no impact. I'm not sure what you meant by "seems that it negatively affects some other logic", but I'd assume the unit tests would fail if it affected the behavior of the filter.

@Jayden-Lind thanks for the suggestion. I'll start with a unit test and see if your suggested change addresses the problem. It's been a while since I've looked at this code, so I'll need to get familiar again.

I should be able to do it this week.

tonya11en avatar tonya11en commented on September 26, 2024 2

/assign @tonya11en

tonya11en avatar tonya11en commented on September 26, 2024 2

Sounds good to me! Don't feel like you have to write a big formal document or anything. A few paragraphs on the high level approach should suffice and we can iterate on the specifics in the PR.

Here's our guide on contributing to the project:
https://github.com/envoyproxy/envoy/blob/main/CONTRIBUTING.md

Let me know if there's anything else I can do to help you get going.

ipoval avatar ipoval commented on September 26, 2024 2

@tonya11en 👋 , and sorry for the delay with the response

we tried experimenting with some options for ThreadLocalControllerImpl::averageRps() that we had in mind, but unfortunately we are still not getting the expected behaviour:

  1. one of the things that was tried is:
     if (historical_data_.empty() || global_data_.requests == 0 || ageOfOldestSample() < sampling_window_) { return 0; }
  • it helped with waiting for the sampling-window interval before starting rejections, but it negatively affected the number of rejections - it seems to negatively affect some other logic
  2. another option that could have been acceptable for us would be:
     ENVOY_LOG(debug, "Global data requests: {}", global_data_.requests);
     return global_data_.requests / std::chrono::duration_cast<seconds>(sampling_window_).count();
  • but debug logs show that global_data_.requests is a lot smaller than I would expect it to be -
    with configuration:
    sampling_window: 60s | rps_threshold: 1
    and Envoy started with a single worker thread:
    ./envoy-static --concurrency 1 -c sp_envoy_config_adm_control.yaml -l debug |& grep 'source/extensions/filters/http/admission_control'
    => global_data_.requests was always around 60 (the size of the window), even though a large number of curl requests were sent

I'd like to ask whether you still have time to take a look at a possible fix. 🙇‍♂️

Your suggestion about averageRps() not requiring to be thread-local sounds reasonable, but aren't you afraid that it will require other parts also be transformed into non-thread-local equivalents, which could potentially impact performance?

Jayden-Lind avatar Jayden-Lind commented on September 26, 2024 1

Hey @tonya11en,

A crude fix for this could be something like the code below. In testing, it consistently waited until the oldest sample in the window was as old as the sampling window before rejections started.

uint32_t ThreadLocalControllerImpl::averageRps() const {
  if (historical_data_.empty() || global_data_.requests == 0) {
    return 0;
  }
  using std::chrono::seconds;
  std::chrono::microseconds oldestSampleAge = ageOfOldestSample();
  seconds secs = std::max(seconds(1), std::chrono::duration_cast<seconds>(oldestSampleAge));
  // Only report a non-zero RPS once the oldest sample is (roughly) as old as the
  // sampling window, i.e. a full window of samples has been accumulated; until
  // then the filter sees 0 RPS and will not reject.
  if (secs == sampling_window_ || secs == sampling_window_ - seconds(1)) {
    return global_data_.requests / secs.count();
  }
  return 0;
}

ipoval avatar ipoval commented on September 26, 2024 1

@tonya11en, thanks for the quick turnaround on the fix. We have done some testing, and the results still show behaviour that does not wait for a full sampling window of events before starting rejections. The change seconds secs = std::max(sampling_window_, std::chrono::duration_cast<seconds>(ageOfOldestSample())); does help, in that the filter now only reacts to larger spikes of failed requests, but it still does not wait for the full window.

Testing was done with configuration: sampling_window(60), aggression(3), rps_threshold(1) + ./envoy-static --concurrency 1

  • while true; do curl -i 'http://localhost:9001/response?status=504'; sleep 0.2; done
  • the first request going through the filter: HTTP/1.1 504 Gateway Timeout | date: Tue, 21 May 2024 04:51:24 GMT
  • the first rejected request through the filter: HTTP/1.1 503 Service Unavailable | date: Tue, 21 May 2024 04:51:40 GMT
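
Back-of-the-envelope numbers for why the larger denominator alone cannot delay the first rejection until the window is full (assuming the curl loop above produces roughly 4-5 requests per second):

    ~16s of traffic  =>  ~75 sampled requests
    75 / 60s         ≈   1.25 rps, which is already above rps_threshold = 1

so the threshold is crossed well before a full 60 seconds of samples has been accumulated.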

I wonder if there is a case here that a unit test is missing 🤷

@Jayden-Lind has made a quick spike on top of your branch: main...Jayden-Lind:envoy:jlind/fix#diff-acb1cd289fb89f10edeb48075c4760815b8adc52a8082e2fa98e98bf45949987R103 and it seems to wait for the sampling window, but it may not be the most intention-revealing/readable approach

thanks again and please ping us if you would like us to test some other approach

idrajeev12 avatar idrajeev12 commented on September 26, 2024

@tonya11en Could you please review and share your thoughts?

ipoval avatar ipoval commented on September 26, 2024

Adding the configuration example used in the original question:

- name: envoy.filters.http.admission_control
  typed_config:
    '@type': type.googleapis.com/envoy.extensions.filters.http.admission_control.v3.AdmissionControl
    enabled:
      default_value: true
      runtime_key: admission_control.enabled
    sampling_window: 60s
    sr_threshold:
      default_value:
        value: 90
      runtime_key: admission_control.sr_threshold
    aggression:
      default_value: 2
      runtime_key: admission_control.aggression
    rps_threshold:
      default_value: 1
      runtime_key: admission_control.rps_threshold
    max_rejection_probability:
      default_value:
        value: 10
      runtime_key: admission_control.max_rejection_probability
    success_criteria:
      http_criteria:
        http_success_status:
        - start: 100
          end: 500

while true; do curl -i http://egress.mesh:9001/response?status=500; sleep 0.5; done

HTTP/1.1 500 Internal Server Error
date: Tue, 23 Apr 2024 13:11:25 GMT
content-length: 0
x-envoy-upstream-service-time: 4
server: envoy
...
HTTP/1.1 503 Service Unavailable
date: Tue, 23 Apr 2024 13:11:34 GMT
server: envoy
content-length: 0

http.egress_http_....admission_control.rq_failure: 19 (444)
http.egress_http_....admission_control.rq_rejected: 1 (3)

We can observe that the first 503 (rejection) is generated after 9 seconds - long before the filter could have accumulated a full sliding window (60 seconds) of samples.

tonya11en avatar tonya11en commented on September 26, 2024

Thanks for the detailed info!

for example: having a sliding window of 60 seconds we can see rejections starting after the first few seconds from the cold start by having only a few failed requests and successes.

This is not the behavior I would expect and is likely a result of the issue you found:

We also noticed that seconds.count in averageRPS() implementation is derived from ageOfOldestSample() and we were curious would it be reasonable to use a sliding window interval instead of ageOfOldestSample()? We thought that it could improve the behavior when we see very early rejections from the cold start or when the traffic is sparse.

This is a bug. The docs are very clear on the expected behavior of the rps_threshold parameter:

If the average RPS of the sampling window is below this threshold, 
the request will not be rejected, even if the success rate is lower 
than sr_threshold. Defaults to 0.

Your sampling window is 60 seconds, so averageRPS() should be calculated as the number of requests seen over the last 60 seconds divided by 60. The current implementation, which divides by the age of the oldest sample instead, results in the broken behavior you've encountered at proxy start-up, and it's made worse by the fact that the admission controller tracks this thread-locally.

A naive fix would be simple, since all we'd need to do is modify this function, but looking at how this is implemented, I see no reason for the average RPS calculation to be thread-local. We'd likely be able to just change it to account for the global average RPS and use that number in the rejection probabilities. Making the average RPS a common value across all threads would make the behavior more sane, in my opinion.
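
As a very rough sketch of what I mean (the names and layout here are purely illustrative, not the filter's actual internals), the per-thread sample bookkeeping could also maintain a shared atomic counter that every worker reads when computing the average:

#include <atomic>
#include <chrono>
#include <cstdint>

// Illustrative only: one process-wide counter shared by all worker threads.
struct SharedRequestCounter {
  std::atomic<uint64_t> requests{0};
};

// Each worker would still keep its thread-local sample ring, but it would also
// bump the shared counter when a sample enters the window and decrement it when
// the sample ages out, so every thread sees the same request total.
inline void onRequestSampled(SharedRequestCounter& shared) {
  shared.requests.fetch_add(1, std::memory_order_relaxed);
}

inline void onSampleExpired(SharedRequestCounter& shared) {
  shared.requests.fetch_sub(1, std::memory_order_relaxed);
}

// The average RPS would then be derived from the shared counter over the full
// sampling window rather than from a single thread's view of the traffic.
inline uint32_t globalAverageRps(const SharedRequestCounter& shared,
                                 std::chrono::seconds sampling_window) {
  return static_cast<uint32_t>(shared.requests.load(std::memory_order_relaxed) /
                               sampling_window.count());
}

Since this is only an atomic increment/decrement per sample, I wouldn't expect a measurable performance hit, but we could benchmark to confirm.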

Now, if we take a look at your config, my initial thoughts are that your rps_threshold of 1 is too low to be very effective even if the average RPS were being calculated properly. If you need to mitigate this issue before we're able to patch it, I'd start by increasing the RPS threshold to be something larger and see if you can get by until I or someone else (maybe you? :D) can send out a PR.

I would also keep in mind that increasing your lookback window to 60s will result in a slower recovery time. You may have already seen the experiments in one of my earlier PR submissions, but I'm linking again for anyone else who may want to build an intuition for how the lookback window affects recovery times.

Let me know if you need me to clarify anything above. If you want to do the change, I'm happy to offer some guidance. Otherwise, I'll try to get around to submitting a PR in the next week or two.

idrajeev12 avatar idrajeev12 commented on September 26, 2024

Appreciate your quick response and support on this, @tonya11en. Let me review this and get back to you with an update.

ipoval avatar ipoval commented on September 26, 2024

@tonya11en , thank you!

What do you think about us working on this issue with our own resources and submitting a fix?
If that's OK, then the first step for us would probably be to come up with a proposal for the alternative implementation/fix and align with you.

tonya11en avatar tonya11en commented on September 26, 2024

Maybe I'm misunderstanding something, but I don't see an issue here. Your first rejection came 16 seconds after the first request went through the filter and all of your requests will return a 504, which counts as an error by default. So you would have sent ~75 requests, which puts you over the RPS threshold, and your success rate would be 0%.

What is the behavior you expect?

ipoval avatar ipoval commented on September 26, 2024

What is the behaviour you expect?

As I understand the documentation and the algorithm there are 2 conditions that need to be satisfied:

  1. request samples need to be accumulated for the whole sliding window interval
  2. RPS for the whole sliding window of samples needs to be above the threshold

In the mentioned example the sliding window is set to 60s, and our expectation is that the algorithm would wait for at least 60s before the first rejection happens even if within the sliding window we could have bursts of samples that already satisfy the RPS threshold
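
Expressed as a purely illustrative condition (the member names are borrowed from this thread, not from the filter's actual internals), the gate we expected looks roughly like:

// Sketch only: both conditions must hold before the filter may reject.
bool mayStartRejecting() const {
  return ageOfOldestSample() >= sampling_window_   // 1. a full sampling window of samples exists
      && averageRps() >= rps_threshold_;           // 2. the window's average RPS is at or above the threshold
}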

tonya11en avatar tonya11en commented on September 26, 2024

We should reopen this issue, since the discussion is still ongoing.

Regarding this:

request samples need to be accumulated for the whole sliding window interval

I think this is our disconnect. I agree with the 2nd condition, but the first one I do not. I think that the RPS threshold parameter gives enough control. If we were to wait for the entire sliding window to pass before activating the admission controller, we open ourselves up to overloading the servers.

My recommendation is that you increase the RPS threshold to be higher. A threshold of 1, as you have above, is much too small in my opinion. What happens if you raise the threshold to be 50-75% of your steady-state RPS?

ipoval avatar ipoval commented on September 26, 2024

thanks @tonya11en 🙇‍♂️ , I do understand your point.
RPS does sound like a powerful control check, but it also feels like the current approach undermines the meaning of the sliding-window configuration.
If a service owner is worried about server overload, they could set the sliding window to a smaller value (for example, 10 seconds instead of 60 seconds). This would allow them to react to more frequent bursts and be more aggressive in protecting the server.
However, if a service owner chooses a sliding window of 60s or 120s, I feel they would like to trade the risk of server overload for a more sustainable signal of significant degradation rather than noise.
This is how we understood the documentation 🤷

The above configuration can be understood as follows:

Calculate the request success-rate over a 120s sliding window.

"Calculate the request success-rate over a 120s sliding window." - this gives us the impression that success-rate will be calculated over the whole 120s window
