ceph-mixins's Introduction

Prometheus Monitoring Mixin for Ceph

A set of Prometheus alerts for Ceph.

The scope of this project is to provide Ceph specific Prometheus rule files using Prometheus Mixins.

Prerequisites

  • Jsonnet [Install Jsonnet]

    Jsonnet is a data templating language for app and tool developers.

    The mixin project uses Jsonnet to provide reusable and configurable configs for Grafana Dashboards and Prometheus Alerts.

  • Jsonnet-bundler [Install Jsonnet-bundler]

    Jsonnet-bundler is a package manager for jsonnet.

  • Promtool

    1. Download Go (>=1.11) and install it on your system.
    2. Set up the GOPATH environment variable.
    3. Run $ go get -d github.com/prometheus/prometheus/cmd/promtool
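
    On Go 1.17 and newer, go get is deprecated for installing binaries; promtool can instead be installed with:

    $ go install github.com/prometheus/prometheus/cmd/promtool@latest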

How to use?

Manually generate configs and rules

You can clone this repository, manually generate the Grafana dashboard configs and Prometheus rules files, and apply them according to your setup.

$ git clone https://github.com/ceph/ceph-mixins.git
$ cd ceph-mixins

To get dependencies

$ jb install

To generate Prometheus Alert file

$ make prometheus_alert_rules.yaml

To generate Prometheus Rule file

$ make prometheus_rules.yaml

The prometheus_alert_rules.yaml and prometheus_rules.yaml files then need to be passed to your Prometheus server.
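
Before loading them, you can validate the generated files with promtool:

$ promtool check rules prometheus_alert_rules.yaml
$ promtool check rules prometheus_rules.yaml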

ceph-mixins's Issues

Duplicate mixin locations

Is this repo abandoned or replaced?

The Ceph GitHub project has two sets of mixins. The one at https://github.com/ceph/ceph/tree/main/monitoring/ceph-mixin seems to be more actively maintained than the one here, and there seem to be significant differences between the two collections. I'm not sure which one to use, and I'm considering using a mix of both since each has valuable checks. Should the unique checks here be merged into the main Ceph repo? I'm curious how others are handling this.

Issue with CephMonHighNumberOfLeaderChanges Alert Duration and Thresholds

  • Why was the alert duration changed from 15 minutes to 5 minutes?

See commits ae51e28 and 082c58a.

The previous duration provided a wider window to average out momentary election churn, ensuring that only persistent issues were flagged.

  • Is the threshold of > 0.95 leader changes per minute too low?

We are seeing this alert a lot in our deployment, and it causes unnecessary noise.
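
For reference, here is a minimal sketch of the rule with the longer duration restored. It assumes the alert is built on the ceph_mon_num_elections counter; check the expression in the generated rules before applying anything like this:

alert: CephMonHighNumberOfLeaderChanges
expr: rate(ceph_mon_num_elections[5m]) * 60 > 0.95
for: 15m
labels:
   severity: warning
annotations:
   description: Ceph Monitor {{ $labels.ceph_daemon }} has seen {{ $value }} leader changes per minute recently.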

Correct the messages and severity levels

A few alerts are raised with severity info. They are not actually actionable alerts and should be removed. Some spelling and grammatical typos should also be corrected.

Make does not work for dashboard, alerts or rules

The Makefile doesn't contain targets for prometheus_alerts, prometheus_rules, or dashboards_out. When run, you get:
ceph-mixins $ make prometheus_alerts.yaml
make: No rule to make target 'prometheus_alerts.yaml'. Stop.
ceph-mixins $ make prometheus_rules.yaml
make: No rule to make target 'prometheus_rules.yaml'. Stop.
ceph-mixins $ make dashboards_out
make: No rule to make target 'dashboards_out'. Stop.

Reading the Makefile, there are no references to any of these targets, and when make is run without arguments it only produces prometheus_alert_rules.yaml.

What happened? Are the dashboards no longer provided?

The directions say to do the following:
To generate Prometheus Alert file

$ make prometheus_alerts.yaml

To generate Prometheus Rule file

$ make prometheus_rules.yaml

To generate Grafana Dashboard configs

$ make dashboards_out
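
As a workaround, the alert file can be generated directly with jsonnet. This is a sketch that assumes the standard monitoring-mixins layout, with a mixin.libsonnet at the repository root exposing a prometheusAlerts field:

$ jsonnet -J vendor -S -e 'std.manifestYamlDoc((import "mixin.libsonnet").prometheusAlerts)' > prometheus_alert_rules.yaml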

Wrong pod name and hostname shown in alert CephMonHighNumberOfLeaderChanges

After restarting MON pods in the cluster, one after the other, the following alert messages were seen in the console.

Ceph Monitor "rook-ceph-mgr": instance 10.131.0.16:9283 has seen 3.86 leader changes per minute recently.
Ceph Monitor "rook-ceph-mgr": instance 10.131.0.16:9283 has seen 4.47 leader changes per minute recently.
Ceph Monitor "rook-ceph-mgr": instance 10.131.0.16:9283 has seen 2.85 leader changes per minute recently.

The alert mentions "rook-ceph-mgr", which is a manager pod, not a Ceph monitor. The message template needs to be updated so the alert reports the correct daemon.

Annotation convention

I think we are duplicating info in our current annotations:

For example

            annotations: {
              message: 'Network interface is down on {{ $labels.host }}',
              description: 'Interface {{ $labels.interface }} on host {{ $labels.host }} is down. Please check cabling, and network switch logs',
              storage_type: $._config.storageType,
              severity_level: 'warning',
            },

message and description carry almost the same information.

Maybe we should just use description?

Also, what is the storage_type field for? Why does it matter when resolving the alert?

I think it would be best if we ended up with something like https://blog.pvincent.io/2017/12/prometheus-blog-series-part-5-alerting-rules/#provide-context-to-facilitate-resolution.

His example

alert: Lots_Of_Billing_Jobs_In_Queue
expr: sum(jobs_in_queue{service="billing-processing"}) > 100
for: 5m
labels:
   severity: major
annotations:
   summary: Billing queue appears to be building up (consistently more than 100 jobs waiting)
   dashboard: https://grafana.monitoring.intra/dashboard/db/billing-overview
   impact: Billing is experiencing delays, causing orders to be marked as pending
   runbook: https://wiki.intra/runbooks/billing-queues-issues.html

Also, I think severity should be a label, not an annotation. Alertmanager can only use labels for alert routing: https://prometheus.io/docs/alerting/configuration/#route

So if it's an annotation, you can't send warning alerts to email and critical alerts to Opsgenie/PagerDuty.
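
For illustration, a minimal Alertmanager routing fragment that only works when severity is a label (the receiver names here are hypothetical, and the receivers themselves would still need to be defined):

route:
  receiver: default
  routes:
  - match:
      severity: warning
    receiver: team-email
  - match:
      severity: critical
    receiver: pagerduty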

Raise an alert if PGs in stuck state

Inconsistent PGs represent a data reliability issue and need action. This alert needs to be subtly different: PGs not being active is a natural recovery state, but a recovery rate of zero or a negative value indicates a problem that needs follow-up. A sketch of both alerts follows below.
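
One possible shape for these alerts, as a sketch only. The metric names (ceph_pg_inconsistent, ceph_pg_degraded, ceph_osd_recovery_ops) are assumptions based on the ceph-mgr Prometheus module and should be verified against the exporter in use:

alert: CephPGsInconsistent
expr: ceph_pg_inconsistent > 0
for: 1m
labels:
   severity: critical
annotations:
   description: '{{ $value }} placement groups are inconsistent, which indicates a data reliability issue.'

alert: CephPGRecoveryStalled
expr: ceph_pg_degraded > 0 and on() sum(rate(ceph_osd_recovery_ops[15m])) == 0
for: 15m
labels:
   severity: warning
annotations:
   description: Placement groups are degraded but no recovery progress was observed over the last 15 minutes.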
