Code Monkey home page Code Monkey logo

Comments (4)

aembke avatar aembke commented on August 19, 2024

Hi @davlum

Thanks for the good callouts. I'll be linking to some code but I'll use the feat/resp3-support branch, since that's the foundation for the soon upcoming 5.0.0 major release.

You're right the parsing logic for CLUSTER NODES doesn't properly account for other flags in that section. That being said, I'm not sure how to best handle a situation where a node is marked as master, but in a fail state. If the corresponding replica hasn't been promoted then presumably that means (at the time we called CLUSTER NODES at least) that at least some section of the hash slots were not mapped to a working node. In that scenario I probably wouldn't necessarily fail immediately, but rather maybe spin a bit until the CLUSTER NODES response indicated at least that each hash slot was covered. Either way I don't think the client should return from wait_for_connect until either all the hash slots are covered, or it's clear things are really broken (depends on the fail_fast flag though). I could be misunderstanding this, but that strikes me as a bit of wrinkle to this.

I'll need to think on this a bit. The parsing logic probably needs some changes regardless, but the failure mode you mentioned (master is in a fail state, but the replica hasn't been promoted yet - or there's no replica) strikes me as something that'll require a bit more thought.

Regarding the sync_cluster question - it's true that sync_cluster is only called when handling a reconnection (or in response to MOVED/ASK), but I'm not sure that's an issue here. Under the hood the logic on an initial connection attempt still calls CLUSTER NODES (https://github.com/aembke/fred.rs/blob/feat/resp3-support/src/multiplexer/utils.rs#L813), which then sets the cached state on the Connections (https://github.com/aembke/fred.rs/blob/feat/resp3-support/src/multiplexer/utils.rs#L815). This is an Arc<RwLock<ClusterKeyCache>> and so it's the same underlying data as the cluster_state field on the RedisClientInner.

Maybe I'm misunderstand though since I'm not sure how the initial master node connection breaking would affect things (since that would invalidate the cache and result in another call to sync_cluster). Is that the failure mode you had in mind? Or am I totally misunderstanding this?

from fred.rs.

davlum avatar davlum commented on August 19, 2024

Ah my mistake I did some more investigating and this line here is the problem:

let mut servers: Vec<Arc<String>> = inner
.config
.read()
.server
.hosts()
.iter()
.map(|(h, p)| Arc::new(format!("{}:{}", h, p)))
.collect();

This list of hosts is never updated, which means if you've initially configured one host, you're connected to it and it fails you can never reconnect to the cluster.

Also worth noting that for some reason during infinite retries the sleep duration goes to 0 at some point.

from fred.rs.

aembke avatar aembke commented on August 19, 2024

Hey @davlum apologies for the delay getting back to you.

Thanks for the PR, I think I understand the issue better now. I'm working on a different branch now, and unfortunately it's pretty far off main at this point. I'll work these changes into the new branch and tag you in the changes when they're ready. Unfortunately the branches have diverged enough at this point that it's not really feasible for me to cherry pick any of these commits, but I'll do what I can to get your contributions into the new version with minimal changes.

Thanks for digging into this.

Also - do you have any more info on the sleep duration going to 0? That's pretty worrisome since it could result in a DoS on the server if done at scale (I've seen this happen during a big Elasticache rebalance operation and it is not fun). Do you happen to remember which backoff policy you were using at the time?

from fred.rs.

aembke avatar aembke commented on August 19, 2024

Hi @davlum, I took in the key change in your PR and put it in 5.0.0-beta.1. If you have any further issues please let me know, but in the meantime I'm going to close this.

from fred.rs.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.