Code Monkey home page Code Monkey logo

Comments (5)

gianm avatar gianm commented on July 19, 2024 2

We saw a double-leader situation recently when a ZK server cycled, and we suspect it has something to do with https://issues.apache.org/jira/browse/CURATOR-696. That Curator Jira suggests a bug was introduced by https://issues.apache.org/jira/browse/CURATOR-644 (PR: apache/curator#430).

It seems possible that this did introduce a bug, since that changed the logic from doing reset() always on reconnection (which would recreate the ephemeral znode) to doing getChildren(), which would look for existing ones, and then only call reset() if they could not be found.

We updated to Curator 5.4 some time ago, in #13302. So if this is indeed what’s going on, it has potentially been an issue since Druid 25.

What we saw specifically was this scenario:

  • OL 1 was leader prior to ZK connection loss

  • OL 1 reconnected to ZK and got a session id that we believe is a new session id (although we were not able to confirm that)

  • OL 1's LeaderLatch recipe checked the latch patch and saw an ephemeral znode there that it believed was its own, so it started leadership.

  • OL 2, 30s later, checked the latch path and saw no children at all (not even the one for OL 1). It created an ephemeral znode for itself, and started leadership.

We think what happened is that both OLs established new sessions, even though the old sessions hadn’t expired yet. Because the old sessions hadn’t expired yet, the old ephemeral znodes were still there upon reconnection. The old leader, OL 1, saw both old znodes there and assumed it was still leader. But because those znodes were associated with different sessions, they went away in 30s. When OL 2 noticed that, it assumed there was no active leader, so it became one and then we had two leaders.

from druid.

gianm avatar gianm commented on July 19, 2024 2

I commented on CURATOR-696 linking back here.

from druid.

razinbouzar avatar razinbouzar commented on July 19, 2024 1

@cryptoe can we re-open this issue since #16425 was reverted in #16445?

from druid.

razinbouzar avatar razinbouzar commented on July 19, 2024

Another observation is that this condition occurred during a ZK leader election change.
zk-leader-change

from druid.

razinbouzar avatar razinbouzar commented on July 19, 2024

@gianm Curator 5.7.0 includes the fix for https://issues.apache.org/jira/browse/CURATOR-696. I'm unsure when this version will be made available, but have asked here.

from druid.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.