Comments (13)
Can you define "compute node goes hard down"? Was the host machine completely dead, etc?
from rpc-maas.
In this specific case, the compute node had a RAID card failure and kernel panicked. Subsequently, when the card was replaced during a maintenance, we also did not get an alert.
from rpc-maas.
We need to follow up with the maas team and find out why alerts are not raised when an agent stops checking in.
from rpc-maas.
I don't think that's ever been the case. If I stop the agent on my own servers, I don't get any notification. Did it use to work like that?
from rpc-maas.
Not as far as I'm aware.
from rpc-maas.
Hrm. In this case, I would think maybe the compute being down would mean there's something in nova service-list that we could monitor.
from rpc-maas.
What if all compute nodes go down?
On Fri, 3 Apr 2015 01:51 Christopher H. Laco wrote:
> Hrm. In this case, I would think maybe the compute being down would mean
> there's something in nova service-list that we could monitor.
Reply to this email directly or view it on GitHub
#186 (comment).
from rpc-maas.
We should actually get something from this plugin. It checks the administrative
state versus the actual state of nova services:
https://github.com/rcbops/rpc-maas/blob/master/nova_service_check.py
On Fri, 3 Apr 2015 08:54 Darren Birkett wrote:
> What if all compute nodes go down?
>
> On Fri, 3 Apr 2015 01:51 Christopher H. Laco wrote:
>> Hrm. In this case, I would think maybe the compute being down would mean
>> there's something in nova service-list that we could monitor.
from rpc-maas.
nova_service_check.py does check service state, but it only does so from the local host that it's checking. So, the monitor is in the same failure domain as the service it's monitoring.
https://github.com/rcbops/rpc-extras/blob/master/playbooks/roles/rpc_maas/tasks/local.yml#L242
from rpc-maas.
Right, but the other compute nodes also poll their nova-api for nova services, which includes all compute nodes that are registered. If one is administratively up, but shows as down, we should be getting an alert.
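To make the "administratively up, but shows as down" comparison concrete, here is a minimal sketch. It is not the code in nova_service_check.py. It assumes each service is a plain dict with the `status` (enabled/disabled) and `state` (up/down) columns that `nova service-list` prints; the function name and sample hosts are hypothetical.

```python
def find_unexpectedly_down(services):
    """Return services that are administratively enabled but reported down."""
    return [
        s for s in services
        if s["status"] == "enabled" and s["state"] == "down"
    ]


# Hypothetical sample rows shaped like the output of `nova service-list`.
services = [
    {"binary": "nova-compute", "host": "compute1", "status": "enabled", "state": "up"},
    {"binary": "nova-compute", "host": "compute2", "status": "enabled", "state": "down"},
    {"binary": "nova-compute", "host": "compute3", "status": "disabled", "state": "down"},
]

for s in find_unexpectedly_down(services):
    print("%s on %s is enabled but down" % (s["binary"], s["host"]))
```

A deliberately disabled service (compute3 here) is not flagged, so only a mismatch between the two columns would raise an alert.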
from rpc-maas.
So, I was digging w/ @cloudnull. The checks on the computes pass --host (themselves), so they would only error if they themselves go offline.
If you don't pass --host, it does return all metrics for all services...but currently, we're not dropping in that type of no-host check anywhere. In addition to that, the no-host version lists all nova services, and when that list hits the metrics max per check, well, then things get interesting.
from rpc-maas.
@mancdaz @mattt416 While not ideal, in the interim, we could add an aggregate 1/0 flag so when something in nova service-list goes down, we can at least drop in a check/alert for it.
I'm not sure about adding the check/alarms to all the infras via ansible, as we might not want to keep them. At the very least, we should be able to manually configure some checks when the changed file is in place.
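A rough sketch of what such an aggregate flag could look like, under a few assumptions: services are dicts with the `status` and `state` columns from `nova service-list`, only enabled services count toward the flag, the metric name `nova_services_all_up` is made up, and the `metric <name> <type> <value>` line is modeled on the monitoring agent's plugin output format rather than copied from the existing plugin.

```python
def aggregate_flag(services):
    """Return 1 if every enabled service is up, otherwise 0."""
    for s in services:
        # Assumption: disabled services are ignored, since they are down on purpose.
        if s["status"] == "enabled" and s["state"] != "up":
            return 0
    return 1


def format_metric(name, value):
    """Format a single integer metric line (format is an assumption, see above)."""
    return "metric %s uint32 %d" % (name, value)


# Hypothetical sample rows shaped like the output of `nova service-list`.
sample = [
    {"binary": "nova-compute", "host": "compute1", "status": "enabled", "state": "up"},
    {"binary": "nova-compute", "host": "compute2", "status": "enabled", "state": "down"},
]

print(format_metric("nova_services_all_up", aggregate_flag(sample)))
```

Because this emits one metric no matter how many services are registered, it would sidestep the metrics-per-check limit, at the cost of not saying which service is down.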
from rpc-maas.
This is being taken care of by the 'agent down' alerting currently being rolled out by maas themselves, and as such is not something we need to track in this repo.
from rpc-maas.
Related Issues (20)
- conntack count not monitored inside the containers
- Multiple HP checks break if battery status isn't shown
- Add support for HTTP proxy with rackspace-monitoring-agent
- Split maas_external_ip_address into multiple vars
- Remote Horizon check needs a trailing slash
- Swift/Ceph URE monitoring with smartctl
- virtual block devices installed as disk utilization alarms
- Grafana Dashboard wont show graphs
- files/pip-constraints doesn't have setuptools, wheels, pip.
- Process check incorrectly matches itself
- Add testing for MaaS poller functionality
- Horizon Check fails with self Sign CA
- Monitor increases of softnet drops
- Create more specific checks around ceph
- Deprecated virtualenv argument "--no-site-packages"
- possible ubuntu 18.04 upgrade issues
- Octavia - MaaS fails on Octavia if there are multiple projects named 'admin'
- Container storage check plugin fails when /etc/mtab is a symlink
- verify-maas does no longer work with Ubuntu 20.04
- breaking changes in netaddr