
Comments (10)

werkt commented on June 3, 2024

Buildfarm is currently intolerant of redis configuration changes. When a redis cluster changes IPs/names, the change is not reflected in server/worker grouped communication. It's something I've been meaning to experiment with tolerating, but the current guidance is to run buildfarm only on stable redis clusters, and to bring it down when a cluster's layout (including slot allocations) changes. A number of modifications will be needed before the entire buildfarm cluster responds correctly to a reconfigured redis cluster.

from bazel-buildfarm.

aisbaa commented on June 3, 2024

Thanks for the explanation @werkt. What I've observed is that if I restart buildfarm after redis comes back up, it starts working as usual. I can give a demo if you're up for a Zoom call.

One more anecdotal data point: we have an old internal buildfarm build (before 2.8.0) which closes the TCP port if redis goes down. This allows me to use a TCP liveness probe, so when redis restarts, buildfarm gets restarted by kubernetes.

Edit:

"Redis clusters that change IPs/names are not reflected in server/worker grouped communication."

For redis we use a kubernetes ClusterIP service, which acts like a load balancer with a static IP address. We also pass the redis address as a domain name (e.g. redis://redis.buildfarm-java:6379).
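
To make the TCP-probe idea concrete, here is a minimal sketch of a reachability check against a redis URL of the form quoted above. It is not part of buildfarm; the function name `redis_tcp_reachable` and the 1 second timeout are assumptions made for illustration. It only tests whether a TCP connection can be opened, which is roughly what a TCP liveness probe observes, and says nothing about whether redis is actually answering commands.

```python
import socket
from urllib.parse import urlparse


def redis_tcp_reachable(url, timeout=1.0):
    """Return True if a TCP connection to the host/port in `url` succeeds.

    Hypothetical helper for illustration only; not a buildfarm API.
    """
    parsed = urlparse(url)
    host = parsed.hostname
    # Fall back to the conventional redis port when the URL omits one.
    port = parsed.port or 6379
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `redis_tcp_reachable("redis://redis.buildfarm-java:6379")` would report whether the service name resolves and accepts connections from wherever the check runs.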


werkt commented on June 3, 2024

I think I see - you're questioning the discrepancy between 'No available workers' and the threefold service availability on the workers.

The problem needs to be looked at on the servers first - conditions in the cluster are leading to them, not the workers, responding with UNAVAILABLE.

The servers need to see the manifest of current storage workers in redis to read/write blobs. I don't know whether a failing connection to redis will lead a server to anything other than the UNAVAILABLE behavior from above.
Further, if all of the workers listed in storage are inaccessible as named in the storage map, you will get the UNAVAILABLE response above.
I suggest trying this with a single instance: seed broken values, introduce redis shutdowns or node removals, and see the results. With explicit steps for an unexpected reproducer case, you can augment this issue and/or write a unit test.


aisbaa commented on June 3, 2024

I think we are moving away from the main problem I was referring to, which is that the buildfarm shard-worker is not healthy (as observed in the dashboard), but the liveness and readiness probes report that the shard-worker is fine.

Manually sending a curl request to the shard-worker readiness/liveness endpoint returns 200:

% curl -I http://10.32.8.185:9090/
HTTP/1.1 200 OK
Date: Tue, 20 Feb 2024 13:12:18 GMT
Content-type: text/plain; version=0.0.4; charset=utf-8
Content-length: 37180


werkt commented on June 3, 2024

200 on the prometheus endpoint doesn't necessarily indicate 'fine', but I agree that no public interface presents any difference when a worker can't talk to the backplane because it happens to be down. Ostensibly, the worker is fine, and coping with the fact that it can't talk to redis at a particular moment.

Starting a worker does fail currently if it cannot connect to redis, which happens because backplane initialization was never made tolerant to it. With little effort, I could actually make it start cleanly and function normally, without any connection available to redis.
Killing my connected redis server during steady state and then bringing it back up, however, does allow the worker to continue without a restart.
Workers with a disconnected redis are able to serve files as normal CAS members, and without an execution pipeline enabled, would essentially serve as standalone grpc CAS endpoints.

So the statement "Health check does not seem to work..." is not accurate: the worker is healthy. Redis, hosting the backplane, is not.

What do you want the worker to do differently when the backplane cannot be reached?
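
One possible answer, sketched here as a separate health endpoint rather than as a change to buildfarm itself: return 200 only while some backplane check passes, and 503 otherwise, so that a probe can tell the two states apart. Everything in this sketch is an assumption for illustration. `make_health_handler` is a hypothetical name, and `check` is any zero-argument callable returning a boolean, for instance a TCP or PING test against redis. Whether a worker that can still serve CAS requests should be reported as unhealthy is exactly the open question above; the sketch only shows how the signal could be exposed.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_health_handler(check):
    """Build a request handler whose GET status reflects `check()`.

    `check` is a hypothetical zero-argument callable that returns True
    when the backplane is considered reachable.
    """

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            healthy = bool(check())
            body = b"ok\n" if healthy else b"backplane unreachable\n"
            # 503 lets an HTTP probe distinguish "worker up, backplane
            # down" from "worker up, backplane reachable".
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            # Keep probe traffic out of stderr.
            pass

    return HealthHandler


def serve_health(port, check):
    """Serve the health endpoint on `port` until interrupted."""
    server = ThreadingHTTPServer(("0.0.0.0", port), make_health_handler(check))
    server.serve_forever()
```

Only GET is handled, so `curl -I` (which sends HEAD) would not exercise this handler; a plain `curl` against the chosen port would. The check runs on every request, so it should be cheap and bounded by a short timeout.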
