Comments (13)

claco avatar claco commented on July 22, 2024

Can you define "compute node goes hard down"? Was the host machine completely dead, etc?

from rpc-maas.

phalmos avatar phalmos commented on July 22, 2024

In this specific case, the compute node had a RAID card failure and kernel panicked. We also did not get an alert later, when the card was replaced during a maintenance window.

mancdaz avatar mancdaz commented on July 22, 2024

We need to follow up with the maas team and find out why alerts are not raised when an agent stops checking in.

claco avatar claco commented on July 22, 2024

I don't think that's ever been the case. If I stop the agent on my own servers, I don't get any notification. Did it use to work like that?

mattt416 avatar mattt416 commented on July 22, 2024

Not as far as I'm aware.

claco avatar claco commented on July 22, 2024

Hrm. In this case, I would think maybe the compute being down would mean there's something in nova service-list that we could monitor.

mancdaz avatar mancdaz commented on July 22, 2024

What if all compute nodes go down?

mancdaz avatar mancdaz commented on July 22, 2024

We should actually get something from this plugin. It checks the administrative state versus the actual state of nova services:

https://github.com/rcbops/rpc-maas/blob/master/nova_service_check.py

cfarquhar avatar cfarquhar commented on July 22, 2024

nova_service_check.py does check service state, but it only does so from the local host that it's checking. So, the monitor is in the same failure domain as the service it's monitoring.

https://github.com/rcbops/rpc-extras/blob/master/playbooks/roles/rpc_maas/tasks/local.yml#L242

@claco @mancdaz

mancdaz avatar mancdaz commented on July 22, 2024

Right, but the other compute nodes also poll their nova-api for nova services, which includes all compute nodes that are registered. If one is administratively up, but shows as down, we should be getting an alert.
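
The "administratively up, but shows as down" condition can be illustrated with a minimal sketch. It assumes service records shaped like the rows of nova service-list, with a Status of enabled/disabled and a State of up/down. The dict keys, the function name `find_unexpectedly_down`, and the host names are hypothetical and are not taken from nova_service_check.py.

```python
def find_unexpectedly_down(services):
    """Return services that are enabled (administratively up) but report state 'down'."""
    return [
        svc for svc in services
        if svc["status"] == "enabled" and svc["state"] == "down"
    ]


# Illustrative data only; the host names are made up.
services = [
    {"binary": "nova-compute", "host": "compute1", "status": "enabled", "state": "up"},
    {"binary": "nova-compute", "host": "compute2", "status": "enabled", "state": "down"},
    # Disabled on purpose, so being down is expected and should not alert.
    {"binary": "nova-compute", "host": "compute3", "status": "disabled", "state": "down"},
]

for svc in find_unexpectedly_down(services):
    print("%s on %s is enabled but down" % (svc["binary"], svc["host"]))
```

Only the enabled-but-down record is reported; a service that was disabled deliberately is ignored.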

claco avatar claco commented on July 22, 2024

So, I was digging w/ @cloudnull. The checks on the computes pass --host (the node itself), so they would only error if they themselves go offline.

If you don't pass --host, it does return all metrics for all services...but currently, we're not dropping in that type of no-host check anywhere. In addition to that, the no-host version lists all nova services, and when that list hits the metrics max per check, well, then things get interesting.

claco avatar claco commented on July 22, 2024

@mancdaz @mattt416 While not ideal, in the interim, we could add an aggregate 1/0 flag so when something in nova service-list goes down, we can at least drop in a check/alert for it.

I'm not sure about adding the check/alarms to all the infras via ansible, as we might not want to keep them. At the very least, we should be able to manually configure some checks when the changed file is in place.
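
A rough sketch of what such an aggregate 1/0 flag might look like, so that a single metric is emitted no matter how many services are listed. It assumes the same service-list-like records as nova service-list (Status enabled/disabled, State up/down). The function names, the metric name `nova_services_all_up`, and the `metric <name> <type> <value>` output line are assumptions for illustration, not the actual plugin code.

```python
def all_enabled_services_up(services):
    """Return 1 if every enabled service reports state 'up', otherwise 0.

    Disabled services are ignored. An empty list yields 1, so a real check
    would probably also want to alert when no services are returned at all.
    """
    enabled = [svc for svc in services if svc["status"] == "enabled"]
    return 1 if all(svc["state"] == "up" for svc in enabled) else 0


def format_metric(name, value):
    """Render one metric line; the line format is an assumption."""
    return "metric %s uint32 %d" % (name, value)


# Illustrative data only; the host names are made up.
sample = [
    {"binary": "nova-compute", "host": "compute1", "status": "enabled", "state": "up"},
    {"binary": "nova-compute", "host": "compute2", "status": "enabled", "state": "down"},
]

print(format_metric("nova_services_all_up", all_enabled_services_up(sample)))
```

An alarm could then trigger whenever the flag is 0, which sidesteps the metrics-per-check limit at the cost of not saying which service is down.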

mancdaz avatar mancdaz commented on July 22, 2024

This is being taken care of by the 'agent down' alerting currently being rolled out by maas themselves, so it is not something we need to track in this repo.
