Issues from the ceph/ceph-mixins repository

Raise an alert if OSD disk is not responding/inaccessible

If the underlying storage disk is not responding or is inaccessible, we should raise an alert.

Raise an alert if a network interface is down

Raise alert for Network transmit/receive errors

Add unit tests for alerting rules

Using with kube-prometheus?

The README could really use an example of how to use this with kube-prometheus, for those building a custom Prometheus setup for Kubernetes and not just hacking the result of a manual build into Prometheus somehow.

Perhaps based on https://github.com/prometheus-operator/kube-prometheus/blob/main/examples/mixin-inclusion.jsonnet, but an example that actually works for the ceph mixin?

Raise an alert if Clock skew detected.

Clock skew detected on a host. Example: NTP is not configured correctly on this host

Raise an alert if few OSDs in ceph cluster are in down state

If OSDs are in the down state for more than a defined time, this indicates disk failure and backfill. The admin should be aware of it because of the additional load that recovery places on the backend (i.e. client IO could be affected during backfill).

Add alert for cluster in error state

If the ceph cluster remains in state HEALTH_ERR for more than 10 mins raise a critical alert.
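
A minimal sketch of what such a rule could look like, assuming the `ceph_health_status` gauge exported by the Ceph mgr Prometheus module (0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR). The group name, alert name and annotation wording are placeholders, not taken from this repo:

```yaml
groups:
  - name: cluster-state-alert.rules   # hypothetical group name
    rules:
      - alert: CephClusterErrorState   # hypothetical alert name
        # Assumes ceph_health_status encodes HEALTH_ERR as 2.
        expr: ceph_health_status == 2
        # The condition must hold continuously for 10 minutes before firing.
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Storage cluster has been in HEALTH_ERR state for more than 10m.
```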

Raise an alert if a network interface is flapping

Correct the messages and severity levels

There are a few alerts raised with severity info. They are not actually alerts and should be removed. Also, some spelling and grammatical errors should be corrected.

Duplicate mixin locations

Is this repo abandoned or replaced?

The Ceph GitHub project has two sets of mixins. The one at https://github.com/ceph/ceph/tree/main/monitoring/ceph-mixin seems to be better maintained than the one here. There seem to be significant differences between the two collections. I'm not clear which one to use, and I'm considering a mix of both since both seem to have valuable checks. Should the unique checks here be merged into the main ceph repo? Curious how others are handling this.

Raise an alert if PGs in stuck state

Inconsistent PGs represent a data reliability issue and need action. This needs to be subtly different: PGs not being active is a natural recovery state, but a recovery rate of 0 or a negative value indicates a problem that needs follow-up.

Raise an alert if we can't sustain a node failure in ceph

Raise an alert if mon quorum is lost for the ceph cluster for more than defined time

Rook handles mon down conditions. If Rook is unable to bring back the mons required for quorum within a defined time limit (say 15 mins), raise an alert.
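
As a rough sketch of the condition described above, assuming `ceph_mon_quorum_status` is 1 for each monitor currently in quorum and `ceph_mon_metadata` has one series per deployed monitor (both as exported by the Ceph mgr Prometheus module). The alert and group names are hypothetical, and a multi-cluster setup would need a `by (...)` clause on each aggregation:

```yaml
groups:
  - name: quorum-alert.rules   # hypothetical group name
    rules:
      - alert: CephMonQuorumLost   # hypothetical alert name
        # Left side: number of mons in quorum. "or vector(0)" keeps the left side
        # present when no mon reports quorum (count() of nothing is empty).
        # Right side: strict majority of the deployed mons.
        expr: |
          (count(ceph_mon_quorum_status == 1) or vector(0))
            <
          (floor(count(ceph_mon_metadata) / 2) + 1)
        # Give Rook 15 minutes to restore the mons before alerting.
        for: 15m
        labels:
          severity: critical
        annotations:
          description: Ceph mon quorum has been lost for more than 15m.
```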

Raise an alert if a ceph storage node goes down

Few general fixes required for 1.0

ceph-mixins/extras/manifests/prometheus-rules.yaml

Line 5 in 46e3c59

prometheus: alert

Should be prometheus: k8s

ceph-mixins/extras/manifests/prometheus-rules.yaml

Line 7 in 46e3c59

name: prometheus-alert-rules

Should be prometheus-ceph-rules

ceph-mixins/extras/manifests/prometheus-rules.yaml

Line 8 in 46e3c59

namespace: openshift-storage

Should be default

Required telemeter rules

Issue with CephMonHighNumberOfLeaderChanges Alert Duration and Thresholds

Why was the alert duration changed from 15 minutes to 5 minutes?

As in the commits ae51e28, 082c58a

The previous duration might have provided a longer window to average out momentary spikes, ensuring only persistent issues were flagged.

Is the threshold of > 0.95 leader changes per minute too low?

We are seeing this alert a lot in our deployment, causing unnecessary noise.
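
For reference, a simplified sketch of where the two settings in question live in a rule of this kind. It assumes the `ceph_mon_num_elections` counter from the Ceph mgr Prometheus module and is not a verbatim copy of the rule in this repo; the group name is hypothetical:

```yaml
groups:
  - name: quorum-alert.rules   # hypothetical group name
    rules:
      - alert: CephMonHighNumberOfLeaderChanges
        # rate() yields elections per second averaged over the 5m range window;
        # multiplying by 60 converts that to elections per minute.
        # 0.95 is the threshold questioned above.
        expr: rate(ceph_mon_num_elections[5m]) * 60 > 0.95
        # The duration questioned above: how long the expression must stay
        # true before the alert fires (previously 15m).
        for: 5m
        labels:
          severity: warning
```

Raising either the threshold or the `for` duration would reduce how often short bursts of elections turn into a firing alert.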

Annotation convention

I think we are duplicating info in our current annotations:

For example

            annotations: {
              message: 'Network interface is down on {{ $labels.host }}',
              description: 'Interface {{ $labels.interface }} on host {{ $labels.host }} is down. Please check cabling, and network switch logs',
              storage_type: $._config.storageType,
              severity_level: 'warning',
            },

message and description have almost the same information.

Maybe instead we just do description?

Also, what is the storage_type field, and why does it matter when resolving the alert?

I think it would be best if we ended up with something that looks like https://blog.pvincent.io/2017/12/prometheus-blog-series-part-5-alerting-rules/#provide-context-to-facilitate-resolution

His example

alert: Lots_Of_Billing_Jobs_In_Queue
expr: sum(jobs_in_queue{service="billing-processing"}) > 100
for: 5m
labels:
   severity: major
annotations:
   summary: Billing queue appears to be building up (consistently more than 100 jobs waiting)
   dashboard: https://grafana.monitoring.intra/dashboard/db/billing-overview
   impact: Billing is experiencing delays, causing orders to be marked as pending
   runbook: https://wiki.intra/runbooks/billing-queues-issues.html

Also, I think severity should be a label, not an annotation. Alertmanager can only use labels for alert routing: https://prometheus.io/docs/alerting/configuration/#route

So if it's an annotation, you can't send warning alerts to email and critical alerts to opsgenie/pagerduty.
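
To illustrate the routing point, here is a sketch of an Alertmanager configuration that splits alerts by a `severity` label. The receiver names are hypothetical and their integration settings (email, pager) are omitted; the `matchers` syntax assumes Alertmanager 0.22 or later:

```yaml
route:
  # Default receiver for anything not matched by a child route.
  receiver: email-team            # hypothetical receiver name
  routes:
    # Routing matches on alert labels only; annotations are not visible here.
    - matchers:
        - severity="critical"
      receiver: oncall-pager      # hypothetical receiver name
    - matchers:
        - severity="warning"
      receiver: email-team
receivers:
  # Integration blocks (email_configs, pagerduty_configs, ...) omitted.
  - name: email-team
  - name: oncall-pager
```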

Raise an alert if ceph PGs remain in inconsistent state for too long

If PGs in the cluster remain in inconsistent state for more than a defined time period (~1hr), raise an alert.

Raise an alert if MDS services not UP for defined time

Grafana dashboards?

Out of curiosity, how do these dashboards relate to the dashboards in the ceph repo at

https://github.com/ceph/ceph/tree/f386db64332837dacf8f45ec13aa07ca7d6a0b1d/monitoring/grafana/dashboards

cc @jan--f

Raise a critical alert if overall cluster utilization crosses 95%
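
One possible shape for this rule, assuming the `ceph_cluster_total_used_raw_bytes` and `ceph_cluster_total_bytes` gauges from the Ceph mgr Prometheus module. The names and the hold time are placeholders:

```yaml
groups:
  - name: cluster-utilization-alert.rules   # hypothetical group name
    rules:
      - alert: CephClusterCriticallyFull   # hypothetical alert name
        # Assumption: both gauges carry identical labels, so the division
        # matches series one-to-one and yields a 0..1 utilization ratio.
        expr: ceph_cluster_total_used_raw_bytes / ceph_cluster_total_bytes >= 0.95
        for: 5m   # hypothetical hold time; the issue does not specify one
        labels:
          severity: critical
        annotations:
          description: Storage cluster utilization has crossed 95%.
```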

Change the alert time for rules in sample rules yaml file

The alert time should be set to 2 sec in the sample rules yaml file.

Handle NAN values in latency recording rule
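
A sketch of one way to avoid NaN in such a rule, assuming the `ceph_osd_op_r_latency_sum` and `ceph_osd_op_r_latency_count` counters from the Ceph mgr Prometheus module. The rule and group names are hypothetical. The idea is to filter out zero denominators before dividing, since 0/0 evaluates to NaN in PromQL:

```yaml
groups:
  - name: ceph-latency.rules   # hypothetical group name
    rules:
      - record: osd:ceph_osd_op_r_latency:avg5m   # hypothetical rule name
        # Average read latency = rate of summed latency / rate of op count.
        # If an OSD served no reads in the window, both rates are 0 and the
        # plain division would produce NaN. The "> 0" comparison filters
        # those series out of the denominator, so no sample is recorded
        # for them instead of NaN.
        expr: |
          rate(ceph_osd_op_r_latency_sum[5m])
            /
          (rate(ceph_osd_op_r_latency_count[5m]) > 0)
```

With this approach idle OSDs simply have no sample for that interval; whether a gap or an explicit 0 is preferable depends on how the dashboards consume the series.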

Raise an alert if data recovery is taking longer than expected time

If data recovery still has not happened after waiting for, say, ~2hrs because a storage disk is unavailable, raise an alert.

Recording rule required for ceph latency

Add alert for cluster state in warning

If the ceph cluster remains in HEALTH_WARN state for more than 10 mins, raise a warning alert

Make does not work for dashboard, alerts or rules

The Makefile doesn't contain any targets for prometheus_alerts, prometheus_rules or dashboards_out. When run, you get:
ceph-mixins $ make prometheus_alerts.yaml
make: No rule to make target 'prometheus_alerts.yaml'. Stop.
ceph-mixins $ make prometheus_rules.yaml
make: No rule to make target 'prometheus_rules.yaml'. Stop.
ceph-mixins $ make dashboards_out
make: No rule to make target 'dashboards_out'. Stop.

When reading the Makefile there are no references to any of these items, and when make is run without arguments it only produces prometheus_alert_rules.yaml.

What happened? No more dashboards, etc.?

The directions say to do the following:
To generate Prometheus Alert file

$ make prometheus_alerts.yaml

To generate Prometheus Rule file

$ make prometheus_rules.yaml

To generate Grafana Dashboard configs

$ make dashboards_out

Grafana Dependency Not Required

ceph-mixins/extras/operator/jsonnet/kube-prometheus.libsonnet

Line 4 in e22dc26

(import 'grafana.libsonnet') +

ceph-mixins/extras/operator/jsonnet/kube-prometheus.libsonnet

Lines 10 to 12 in e22dc26

    
      grafana+:: {
        dashboardDefinitions: configMapList.new(super.dashboardDefinitions),
      },

Wrong pod name and hostname shown in alert CephMonHighNumberOfLeaderChanges

After restarting MON pods in the cluster, one after the other, the following alert messages were seen in the console.

Ceph Monitor "rook-ceph-mgr": instance 10.131.0.16:9283 has seen 3.86 leader changes per minute recently.
Ceph Monitor "rook-ceph-mgr": instance 10.131.0.16:9283 has seen 4.47 leader changes per minute recently.
Ceph Monitor "rook-ceph-mgr": instance 10.131.0.16:9283 has seen 2.85 leader changes per minute recently.

The alert mentions "rook-ceph-mgr", which is not a ceph monitor. The message needs to be updated to provide proper alerts.

