Comments (11)

ibuildthecloud commented on May 23, 2024

Sorry, there isn't a lot of info in this issue. I'm mainly asking whether there is somewhere I should look to troubleshoot further. If I can't make any progress, I'll try to reproduce it with a small test case. Right now I'm running this within the full context of k3s.

from dqlite.

ibuildthecloud commented on May 23, 2024

I figured out that the line

	rv = DQLITE_ERROR;

effectively masks the error. After removing that line, I now get the error from libraft, which gives me something to work with.

freeekanayaka commented on May 23, 2024

Yeah, error propagation between dqlite and raft is currently sub-optimal. I'm working to improve the situation and provide better diagnostic messages for situations like this.

I hope that hacking around the code helps you spot what's wrong; otherwise, let me know.

ibuildthecloud commented on May 23, 2024

At https://github.com/canonical/raft/blob/25805d9a7c73c4e401e87ff7df67e50efa5db03a/src/raft.c#L136,
the recovery code checks that the node is unavailable. In my situation the node is in the RAFT_FOLLOWER state, so recovery fails.

The full scenario is as follows:

  1. Start two nodes.
  2. Kill node two.
  3. Node one kills itself because quorum is lost (that is how my application behaves).
  4. All nodes are now shut down.
  5. Start node one in "recovery" mode.
  6. Node.Recover() fails because the node is in the FOLLOWER state.

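Effectively, all I'm doing in the code is:

	node, err := dqlite.New(id, advertiseAddress, dbDir,
		dqlite.WithBindAddress(bindAddress),
		dqlite.WithDialFunc(dial),
		dqlite.WithNetworkLatency(20*time.Millisecond))
	if err != nil {
		return err
	}
	if err := node.Start(); err != nil {
		return err
	}
	// Fails here, because the node is already running as a follower.
	return node.Recover(singleNode)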

ibuildthecloud commented on May 23, 2024

So right now, if I remove the "UNAVAILABLE" check in libraft, everything works as expected.

freeekanayaka commented on May 23, 2024

Okay, Node.Recover() is intended to be called before Node.Start(). See the docstring of the low-level dqlite_node_recover() C API:

 * 1. Make sure no dqlite node in the cluster is running.

(which translates to "make sure that Node.Start() wasn't called yet").

I think there is some work to do, both in improving the documentation and in reporting better errors.

freeekanayaka commented on May 23, 2024

Note that removing the "UNAVAILABLE" check would make the API unsafe. It may happen to work in your specific case, but only by chance.

ibuildthecloud commented on May 23, 2024

@freeekanayaka I tried not starting the node and got a connection error. Maybe I was doing something wrong; I'll try again. Thanks.

ibuildthecloud commented on May 23, 2024

@freeekanayaka Thanks, I found the issue. Prior to doing the reset, I was contacting the node to find the surviving member's ID. Due to my lack of understanding of IDs, I didn't want to assume on the client side that I had the right ID for the surviving node. I'm guessing I don't have to do that? In my use case I only support resetting the cluster to one member. In that situation, I assume I can just hard-code the ID to 1? Actually, is ID 1 required?

freeekanayaka commented on May 23, 2024

> @freeekanayaka Thanks, I found the issue. Prior to doing the reset, I was contacting the node to find the surviving member's ID.

I'm not sure I understand what you mean here.

> Due to my lack of understanding of IDs, I didn't want to assume on the client side that I had the right ID for the surviving node. I'm guessing I don't have to do that?

I'm not clear about this either.

> In my use case I only support resetting the cluster to one member. In that situation, I assume I can just hard-code the ID to 1? Actually, is ID 1 required?

Resetting the cluster to one member should indeed be one of the most frequent use cases, and it should also be pretty straightforward to handle. As long as you:

  1. Use 1 for the id parameter of dqlite.New()
  2. Use 1 for the ID attribute of the single dqlite.NodeInfo object that you pass to Node.Recover()

you should be good, regardless of what ID the node actually had before.
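The two rules above can be captured in a small Go helper. This is a sketch under stated assumptions: `NodeInfo` here is a local stand-in for `dqlite.NodeInfo` reduced to an `ID` and an `Address` field, `singleNodeRecovery` and `recoveryID` are hypothetical names, and the address `10.0.0.1:9000` is made up for the example.

```go
package main

import "fmt"

// NodeInfo is a local stand-in for dqlite.NodeInfo, reduced to the two
// fields this sketch needs (an assumption, not the full go-dqlite type).
type NodeInfo struct {
	ID      uint64
	Address string
}

// recoveryID is the ID suggested in the thread for a cluster that is
// being reset to a single member.
const recoveryID uint64 = 1

// singleNodeRecovery (hypothetical helper) returns the ID to pass to
// dqlite.New() and the one-element configuration to pass to
// Node.Recover(). Both use ID 1, regardless of the node's previous ID.
func singleNodeRecovery(address string) (uint64, []NodeInfo) {
	return recoveryID, []NodeInfo{{ID: recoveryID, Address: address}}
}

func main() {
	id, cluster := singleNodeRecovery("10.0.0.1:9000")
	fmt.Println(id, cluster)

	// With the real library, the values would be used roughly as:
	//   node, err := dqlite.New(id, address, dbDir, ...)
	//   err = node.Recover(cluster) // before node.Start()
	//   err = node.Start()
}
```

Keeping both IDs behind one constant makes it hard to pass a mismatched ID to dqlite.New() and Node.Recover().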

ibuildthecloud commented on May 23, 2024

Thanks, everything is working now.
