
Comments (2)

Kilowhisky commented on May 23, 2024

I'm able to replicate it on AWS ECS (any container environment should do)

  1. Set up a configuration with 1 leader and 2 followers.
  2. When setting up the followers, use a DNS record to locate the leader (this is the important part).
  3. Start up the servers and begin writing keys and generally futzing with it. Make sure you write at least one OBJECT and one STRING, and issue SET, DEL, and DROP so that they appear in the AOF. Before the next step, make sure a STRING and an OBJECT type exist.
  4. Kill the leader with kill -9 <pid>.
  5. ECS/Docker should see that the task died and reboot the container. If it doesn't, do so manually. Since there is no persistent storage attached, it should reboot with a clean /data directory and reinitialize based on your config.
  6. The leader should come online and reattach to the followers, but with an empty AOF/DB.
  7. The followers report that everything is fine even though the leader and followers are no longer in sync: the leader is empty and the followers still have records.
  8. Begin writing records and notice that the followers attempt to stay in sync but don't really, as the old keys are never cleared.
  9. Repeat the kill on one of the followers and notice that it now comes into sync with the leader properly, as it downloads a new AOF.
  10. At this point it is in the state I described above: the leader and one of the followers are in sync, but the last remaining follower is holding onto old records.
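
One way to spot the state from step 10 from the outside is to compare per-server stats. This is only a minimal sketch: it assumes each server's `SERVER` output has already been fetched into a dict, and that it exposes `aof_size` and `num_objects` fields (the field names are assumptions here). `find_stale_followers` is a hypothetical helper, not part of Tile38.

```python
def find_stale_followers(leader_stats, follower_stats_by_name):
    """Return names of followers whose stats cannot match the leader's.

    Each stats dict is assumed to look like
    {"id": str, "aof_size": int, "num_objects": int};
    these field names are assumptions made for illustration.
    """
    stale = []
    for name, stats in sorted(follower_stats_by_name.items()):
        # A follower replays the leader's AOF, so it should never hold
        # more log than the leader does.
        longer_log = stats["aof_size"] > leader_stats["aof_size"]
        # Once caught up, object counts should agree.
        count_mismatch = stats["num_objects"] != leader_stats["num_objects"]
        if longer_log or count_mismatch:
            stale.append(name)
    return stale


# Made-up numbers mirroring step 10: one follower re-synced, one did not.
leader = {"id": "new-leader", "aof_size": 120, "num_objects": 2}
followers = {
    "follower-a": {"id": "a", "aof_size": 120, "num_objects": 2},
    "follower-b": {"id": "b", "aof_size": 9000, "num_objects": 7},
}
print(find_stale_followers(leader, followers))  # ['follower-b']
```

Object counts can differ briefly while writes are in flight, so a check like this is probably best repeated a few times before treating a follower as stale.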

So what really happened here is that the leader was not cleanly killed, and when it comes back online, it's empty. The followers don't notice this change and continue along thinking everything is good. This means the issue is that the leader's IP address changed during the reboot and the followers didn't re-verify that the leader they reconnected to has an AOF matching theirs (or even the same server id as before).
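
The missing re-verification could look roughly like the sketch below: the follower remembers the server id of the leader it last synced from and compares it on every reconnect. `LeaderIdentity` is a hypothetical class written for illustration; it is not how Tile38 is implemented.

```python
class LeaderIdentity:
    """Remembers which leader a follower last synced from (hypothetical)."""

    def __init__(self):
        self.server_id = None

    def needs_resync(self, current_server_id):
        """Return True when the leader's id differs from the one last seen."""
        if self.server_id is None:
            # First connection: nothing to compare against yet.
            self.server_id = current_server_id
            return False
        if current_server_id != self.server_id:
            # Same DNS name, different server: local data may be stale.
            self.server_id = current_server_id
            return True
        return False


identity = LeaderIdentity()
print(identity.needs_resync("leader-1"))  # False: first connect
print(identity.needs_resync("leader-1"))  # False: same leader again
print(identity.needs_resync("leader-2"))  # True: leader came back with a new id
```

Keying on the server id rather than the IP address means the check still works when the DNS record resolves to a new address after the container restarts.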

In my case the leader was dying because AOFSHRINK was not running properly, so it ran out of drive space. It's a solvable problem, but it still reveals that there is an issue.

from tile38.

tidwall commented on May 23, 2024

I can confirm this on my side. Normally, immediately after connecting to a leader, a follower will issue some md5 checks to the leader to determine if they share the same AOF, and if not, the follower's AOF should be reset to match the leader's.
I'll need to dig a little deeper, but I suspect the hard reset of the leader, which changes the server_id and empties the AOF, may be confusing the followers.
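
The idea behind the md5 check can be illustrated with an in-memory stand-in. The sketch below assumes a follower's AOF should be a byte-for-byte prefix of the leader's. `aof_matches` and `sync_follower` are hypothetical names, and a real check would run over the network rather than on two byte strings, so this is not the actual Tile38 code.

```python
import hashlib


def aof_matches(leader_aof: bytes, follower_aof: bytes) -> bool:
    """Return True if the follower's AOF is a prefix of the leader's AOF."""
    if len(follower_aof) > len(leader_aof):
        # The follower holds more log than the leader, for example
        # because the leader restarted with an empty AOF.
        return False
    size = len(follower_aof)
    leader_sum = hashlib.md5(leader_aof[:size]).hexdigest()
    follower_sum = hashlib.md5(follower_aof).hexdigest()
    return leader_sum == follower_sum


def sync_follower(leader_aof: bytes, follower_aof: bytes) -> bytes:
    """Return the AOF the follower should hold after connecting."""
    if aof_matches(leader_aof, follower_aof):
        # Shared history: keep local data and append the leader's tail.
        return follower_aof + leader_aof[len(follower_aof):]
    # Diverged history: discard local data and copy the leader's AOF.
    return leader_aof


# Stand-in log contents, not real AOF bytes.
old_log = b"entry-1\nentry-2\n"
print(sync_follower(b"", old_log))                    # b'' (leader was wiped)
print(sync_follower(b"entry-1\nentry-2\n", b"entry-1\n"))  # follower catches up
```

Under this model, a follower that still holds the old log while the leader's AOF is empty fails the length check and gets reset, which is the outcome the followers in the report never reached.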

