Comments (12)
I've released v2.4.0 which includes fixes in PRs #219 and #225 that specifically address this problem. We've been running it in production for the last week and I'm satisfied with the results.
Let me know if you have any feedback. Thanks for your help!
from gubernator.
Actually it looks to be related to https://github.com/mailgun/gubernator/blob/master/global.go#L225 which sets hits to zero in all cases on the broadcast, so when it arrives at the non-owning peer it will never pass the code added by #157 and report over the limit.
from gubernator.
@philipgough's findings seem pretty odd. Imagine this scenario:
- A rate limit for a given key is a 5/5 (hits/limit), with status
UNDER_LIMIT
. - On the 6th hit that gets to the key owner, it will go to 6/5
OVER_LIMIT
, a broadcast from the owner should happen: now we rungetLocalRateLimit
inside the owner with a request that has 0/5OVER_LIMIT
before broadcasting to get the most up-to-date status of that limit:Line 227 in 885519d
- This will run through each algorithm's code, which is where things get problematic and we end up getting status
UNDER_LIMIT
for some reason. 🤷
I tried to understand why that UNDER_LIMIT
status comes and I failed. 😭
Both leaky bucket and token bucket algorithms show this problem, potentially for different reasons... I don't know.
from gubernator.
@Baliedge would you mind taking a look at this if you can spare some time? Not sure if its an issue with understanding the correct usage or a valid problem! TY
from gubernator.
I stumbled upon this issue yesterday because I saw in our server logs that, for any given request, 429s were always returned by one specific host in our cluster (the owner host for that request). I spent a couple of hours looking into this and I believe I know what's going on.
I wrote a test (in a private repo, sorry) that does this:
- Create two peers with the default settings. Peer 2 will be the owner.
- Call GetRateLimits on peer 1 with
Behavior: GLOBAL, Amount: 1, Limit: 1, Duration: 1 * time.Minute,
. This is a cache miss in peer 1 so it goes to peer 2. Peer 2 gets the hit and should update peer 1 withremaining = 0, status = under_limit
- Verify that
Allowed = true
,Remaining = 0
. time.Sleep(1 * time.Second)
. This gives enough time to broadcast the change from peer 2 to peer 1.- Call GetRateLimits on peer 1, again, with
Behavior: GLOBAL, Amount: 1, Limit: 1, Duration: 1 * time.Minute,
. This is a cache hit in peer 1 - Verify that
Allowed = false
. This assertion fails.
I think the reason why the assertion fails is: in algorithms.go
, both algorithms have
rl := &RateLimitResp{
Status: Status_UNDER_LIMIT,
// ...
}
// ...
if r.Hits == 0 {
return rl, nil
}
When the owner of a key is done aggregating all the hits it got and makes a final peek
to GetRateLimits (ie it sends a copy of the request with hits=0)
, we don't want to blindly return whatever we had. We are not inspecting the value of r.Remaining
, which could be zero at this point. So we are not setting the status to OVER_LIMIT from the owners, which means the peers will not see it either.
Either we have to fix this in the algorithms themselves (risky, this code doesn't have unit tests!), or we need to change the code of global.broadcastPeers
so that after doing getLocalRateLimit
we manually inspect the value of Remaining
and if it's zero, we manually set the status to OVER_LIMIT.
Judging by the git blame, this problem has been there for roughly 3 years. But if my judgement is correct then it's odd that it was noticed by two independent reporters with very few days in between. As @philipgough said, the test passes on v2.0.0-rc.32. Which means this has been broken since October 2022 😅
from gubernator.
I might have a fix #216
from gubernator.
PR #219 was merged recently. Did this fix the issue reported here?
from gubernator.
@Baliedge: @philipgough kindly left a PR with a reproduction of the problem. So it would be cool to incorporate the test into the codebase (if it's not already done) and confirm wether the issue is fixed through the test.
from gubernator.
@douglascamata I'm not finding any PRs to your name. Could you provide a link?
from gubernator.
@Baliedge: as I said, it is @philipgough's PR... mentioned in his first comment in this issue. Here it is: #207
from gubernator.
@douglascamata @philipgough: Would you both please review take a look at PR #224?
from gubernator.
@Baliedge - sorry I was on vacation for a bit so missed this ping.
from gubernator.
Related Issues (20)
- Using Gubernator as a library HOT 1
- Scheduling over-limit requests HOT 17
- Question: Difference between name and unique_key HOT 4
- Update and publish Python files HOT 1
- Standard GRPC health check endpoint HOT 1
- CacheSize in the config is ignored HOT 1
- Helm repository https://mailgun.github.io/gubernator shows no entries HOT 1
- Please upgrade go.opentelemetry.io/* dependencies HOT 2
- Peer list update bug in K8s cluster HOT 3
- Adding rate_limit_name as tag on metrics HOT 5
- Bug: Latest Image doesn't work with docker-compose HOT 2
- Tests should verify Docker build
- Run benchmarks!
- Flaky test: `TestLeakyBucketGregorian`
- Logrus is in maintenance mode HOT 1
- More unit tests needed + Upload code coverage to CodeCov HOT 1
- Simplify SetupDaemonConfig API
- Flaky test: `TestPeerClientShutdown`
- Peering authentication HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from gubernator.