ceph / ceph-mixins
A set of Grafana dashboards and Prometheus alerts for Ceph.
License: GNU Lesser General Public License v2.1
If the underlying storage disk is not responding or is inaccessible, we should raise an alert.
The readme really could use an example of how to use this with kube-prometheus, for those building a custom Prometheus setup for Kubernetes, and not just hacking the result of a manual build into Prometheus somehow.
Perhaps based on https://github.com/prometheus-operator/kube-prometheus/blob/main/examples/mixin-inclusion.jsonnet, but an example that actually works for the ceph mixin?
Clock skew detected on a host. Example: NTP is not configured correctly on this host
If OSDs are in the down state for more than a defined time, it indicates disk failure and backfill, which the admin should be aware of because of the additional load that recovery places on the backend (i.e. client IO could be affected during backfill).
If the ceph cluster remains in state HEALTH_ERR for more than 10 mins, raise a critical alert.
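A minimal sketch of what such a rule could look like in a Prometheus rules file. It assumes the ceph_health_status metric exported by the Ceph mgr prometheus module (0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR); the group name, alert name and annotation text are illustrative only and not taken from this repo.

```yaml
# Sketch only: assumes ceph_health_status with 0 = OK, 1 = WARN, 2 = ERR.
groups:
  - name: cluster-state-alert.rules      # hypothetical group name
    rules:
      - alert: CephClusterErrorState     # hypothetical alert name
        expr: ceph_health_status > 1     # fires only on HEALTH_ERR
        for: 10m                         # must persist for 10 minutes
        labels:
          severity: critical
        annotations:
          description: Storage cluster has been in HEALTH_ERR for more than 10 minutes.
```

The same shape with `ceph_health_status == 1` and `severity: warning` would cover a HEALTH_WARN case.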
There are a few alerts raised with severity info. They are not actually alerts and should be removed. Also, some spelling and grammatical typos should be corrected.
Is this repo abandoned or replaced?
The Ceph GitHub project has two sets of mixins. The one at https://github.com/ceph/ceph/tree/main/monitoring/ceph-mixin seems to be better maintained than the one here. There seem to be significant differences between the two collections. I'm not clear which one to use, and I am considering a mix of both since both seem to have valuable checks. Should the unique checks here be merged into the main ceph repo? Curious how others are handling this.
Inconsistent PGs represent a data reliability issue and need action. This needs to be subtly different: PGs not being active is a natural recovery state, but if the rate of recovery is 0 or a negative value, that indicates a problem that needs follow-up.
Rook handles mon down conditions. If Rook is unable to bring back the mons required for quorum within a defined time limit (say 15 mins), raise an alert.
Should be prometheus: k8s
Should be prometheus-ceph-rules
Should be default
As in the commits ae51e28, 082c58a
The previous duration might have provided a more extensive window to average out momentary stuff, ensuring only persistent issues were flagged.
We are seeing this alert a lot in our deployment, causing unnecessary noise.
I think we are duplicating info in our current annotations:
For example
annotations: {
  message: 'Network interface is down on {{ $labels.host }}',
  description: 'Interface {{ $labels.interface }} on host {{ $labels.host }} is down. Please check cabling, and network switch logs',
  storage_type: $._config.storageType,
  severity_level: 'warning',
},
message and description have almost the same information.
Maybe instead we just use description?
Also, what is the storage_type field, and why does it matter when resolving the alert?
I think it would be best if we ended up with something that looks like https://blog.pvincent.io/2017/12/prometheus-blog-series-part-5-alerting-rules/#provide-context-to-facilitate-resolution
His example:
alert: Lots_Of_Billing_Jobs_In_Queue
expr: sum(jobs_in_queue{service="billing-processing"}) > 100
for: 5m
labels:
  severity: major
annotations:
  summary: Billing queue appears to be building up (consistently more than 100 jobs waiting)
  dashboard: https://grafana.monitoring.intra/dashboard/db/billing-overview
  impact: Billing is experiencing delays, causing orders to be marked as pending
  runbook: https://wiki.intra/runbooks/billing-queues-issues.html
Also, I think severity should be a label, not an annotation. Alertmanager can only use labels for alert routing: https://prometheus.io/docs/alerting/configuration/#route
So if it's an annotation, you can't send warning alerts to email and critical alerts to opsgenie/pagerduty.
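To illustrate the routing point, here is a rough sketch of an Alertmanager configuration that splits alerts by a severity label. The receiver names are hypothetical, and the actual notifier settings (email_configs, pagerduty_configs and so on) are left out, so the receivers as written deliver nothing.

```yaml
# Sketch only: routing on the "severity" label set by the alerting rule.
route:
  receiver: email-team             # hypothetical default receiver (e.g. warnings)
  routes:
    - matchers:
        - severity="critical"      # matches labels only; annotations are not visible here
      receiver: pager              # hypothetical receiver for critical alerts
receivers:
  - name: email-team               # notifier config omitted in this sketch
  - name: pager                    # notifier config omitted in this sketch
```

A rule that carries severity_level only as an annotation would always fall through to the default receiver with this setup.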
If PGs in the cluster remain in an inconsistent state for more than a defined time period (say ~1hr), raise an alert.
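One possible shape for this rule, as a sketch. It assumes the Ceph mgr prometheus module exposes a per-pool ceph_pg_inconsistent gauge; if the metric is named differently in a given deployment, the expression would need adjusting. The group and alert names are made up for illustration.

```yaml
# Sketch only: assumes a ceph_pg_inconsistent gauge (PG count per pool).
groups:
  - name: pg-state-alert.rules             # hypothetical group name
    rules:
      - alert: CephPGsInconsistentTooLong  # hypothetical alert name
        expr: sum(ceph_pg_inconsistent) > 0
        for: 1h                            # the ~1hr window suggested above
        labels:
          severity: warning
        annotations:
          description: One or more PGs have been inconsistent for more than 1 hour.
```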
Out of curiosity, how do these dashboards relate to the dashboards in the ceph repo at
cc @jan--f
The alert time should be set to 2 sec for the sample rules yaml file.
If data recovery has not happened after waiting for, say, ~2hrs following the unavailability of a storage disk, raise an alert.
If the ceph cluster remains in HEALTH_WARN state for more than 10 mins, raise a warning alert
The Makefile doesn't contain any targets for prometheus_alerts, prometheus_rules or dashboards_out. When run, you get:
ceph-mixins $ make prometheus_alerts.yaml
make: No rule to make target 'prometheus_alerts.yaml'. Stop.
ceph-mixins $ make prometheus_rules.yaml
make: No rule to make target 'prometheus_rules.yaml'. Stop.
ceph-mixins $ make dashboards_out
make: No rule to make target 'dashboards_out'. Stop.
Reading the Makefile, there are no references to any of these items, and when make is run without arguments it only produces prometheus_alert_rules.yaml.
What happened? No more dashboards, etc.?
The directions say to do the following:
To generate Prometheus Alert file
$ make prometheus_alerts.yaml
To generate Prometheus Rule file
$ make prometheus_rules.yaml
To generate Grafana Dashboard configs
$ make dashboards_out
After restarting MON pods in the cluster, one after the other, the following alert messages were seen in the console.
Ceph Monitor "rook-ceph-mgr": instance 10.131.0.16:9283 has seen 3.86 leader changes per minute recently.
Ceph Monitor "rook-ceph-mgr": instance 10.131.0.16:9283 has seen 4.47 leader changes per minute recently.
Ceph Monitor "rook-ceph-mgr": instance 10.131.0.16:9283 has seen 2.85 leader changes per minute recently.
The alert mentions "rook-ceph-mgr", which is not a ceph monitor. The message needs to be updated to provide proper alerts.