ceph-mixins's Introduction

Prometheus Monitoring Mixin for Ceph

A set of Prometheus alerts for Ceph.

The scope of this project is to provide Ceph specific Prometheus rule files using Prometheus Mixins.

Prerequisites

  • Jsonnet [Install Jsonnet]

    Jsonnet is a data templating language for app and tool developers.

    The mixin project uses Jsonnet to provide reusable and configurable configs for Grafana Dashboards and Prometheus Alerts.

  • Jsonnet-bundler [Install Jsonnet-bundler]

    Jsonnet-bundler is a package manager for jsonnet.

  • Promtool

    1. Download Go (>=1.11) and install it on your system.
    2. Set up the GOPATH environment variable.
    3. Run $ go get -d github.com/prometheus/prometheus/cmd/promtool
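
    On Go 1.17 and newer, go get is deprecated for installing binaries; promtool can instead be installed with:

    $ go install github.com/prometheus/prometheus/cmd/promtool@latest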

How to use?

Manually generate configs and rules

You can clone this repository, manually generate the Grafana dashboard configs and Prometheus rules files, and apply them according to your setup.

$ git clone https://github.com/ceph/ceph-mixins.git
$ cd ceph-mixins

To get dependencies

$ jb install

To generate Prometheus Alert file

$ make prometheus_alert_rules.yaml

To generate Prometheus Rule file

$ make prometheus_rules.yaml

The prometheus_alert_rules.yaml and prometheus_rules.yaml files then need to be passed to your Prometheus server.
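
Before loading them, you can validate the generated files with promtool:

$ promtool check rules prometheus_alert_rules.yaml
$ promtool check rules prometheus_rules.yaml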

ceph-mixins's Issues

Duplicate mixin locations

Is this repo abandoned or replaced?

The Ceph GitHub project has two sets of mixins. The one at https://github.com/ceph/ceph/tree/main/monitoring/ceph-mixin seems to be more actively maintained than the one here, and there seem to be significant differences between the two collections. I'm not sure which one to use, and I'm considering using a mix of both since each has valuable checks. Should the unique checks here be merged into the main Ceph repo? I'm curious how others are handling this.

Issue with CephMonHighNumberOfLeaderChanges Alert Duration and Thresholds

  • Why was the alert duration changed from 15 minutes to 5 minutes?

See commits ae51e28 and 082c58a.

The previous duration provided a wider window to average out momentary election churn, ensuring that only persistent issues were flagged.

  • Is the threshold of > 0.95 leader changes per minute too low?

We are seeing this alert a lot in our deployment, and it causes unnecessary noise.
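
For reference, here is a minimal sketch of the rule with the longer duration restored. It assumes the alert is built on the ceph_mon_num_elections counter; check the expression in the generated rules before applying anything like this:

alert: CephMonHighNumberOfLeaderChanges
expr: rate(ceph_mon_num_elections[5m]) * 60 > 0.95
for: 15m
labels:
   severity: warning
annotations:
   description: Ceph Monitor {{ $labels.ceph_daemon }} has seen {{ $value }} leader changes per minute recently.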

Correct the messages and severity levels

A few alerts are raised with severity info. They are not actually actionable alerts and should be removed. Some spelling and grammatical typos should also be corrected.

Make does not work for dashboard, alerts or rules

The Makefile doesn't contain targets for prometheus_alerts, prometheus_rules, or dashboards_out. When run, you get:
ceph-mixins $ make prometheus_alerts.yaml
make: No rule to make target 'prometheus_alerts.yaml'. Stop.
ceph-mixins $ make prometheus_rules.yaml
make: No rule to make target 'prometheus_rules.yaml'. Stop.
ceph-mixins $ make dashboards_out
make: No rule to make target 'dashboards_out'. Stop.

Reading the Makefile, there are no references to any of these targets, and when make is run without arguments it only produces prometheus_alert_rules.yaml.

What happened? Are the dashboards no longer provided?

The directions say to do the following:
To generate Prometheus Alert file

$ make prometheus_alerts.yaml

To generate Prometheus Rule file

$ make prometheus_rules.yaml

To generate Grafana Dashboard configs

$ make dashboards_out
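
As a workaround, the alert file can be generated directly with jsonnet. This is a sketch that assumes the standard monitoring-mixins layout, with a mixin.libsonnet at the repository root exposing a prometheusAlerts field:

$ jsonnet -J vendor -S -e 'std.manifestYamlDoc((import "mixin.libsonnet").prometheusAlerts)' > prometheus_alert_rules.yaml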

Wrong pod name and hostname shown in alert CephMonHighNumberOfLeaderChanges

After restarting MON pods in the cluster, one after the other, the following alert messages were seen in the console.

Ceph Monitor "rook-ceph-mgr": instance 10.131.0.16:9283 has seen 3.86 leader changes per minute recently.
Ceph Monitor "rook-ceph-mgr": instance 10.131.0.16:9283 has seen 4.47 leader changes per minute recently.
Ceph Monitor "rook-ceph-mgr": instance 10.131.0.16:9283 has seen 2.85 leader changes per minute recently.

The alert mentions "rook-ceph-mgr", which is a manager pod, not a Ceph monitor. The message template needs to be updated so the alert reports the correct daemon.

Annotation convention

I think we are duplicating info in our current annotations:

For example

            annotations: {
              message: 'Network interface is down on {{ $labels.host }}',
              description: 'Interface {{ $labels.interface }} on host {{ $labels.host }} is down. Please check cabling, and network switch logs',
              storage_type: $._config.storageType,
              severity_level: 'warning',
            },

message and description carry almost the same information.

Maybe we should just use description?

Also, what is the storage_type field for? Why does it matter when resolving the alert?

I think it would be best if we ended up with something like https://blog.pvincent.io/2017/12/prometheus-blog-series-part-5-alerting-rules/#provide-context-to-facilitate-resolution.

His example

alert: Lots_Of_Billing_Jobs_In_Queue
expr: sum(jobs_in_queue{service="billing-processing"}) > 100
for: 5m
labels:
   severity: major
annotations:
   summary: Billing queue appears to be building up (consistently more than 100 jobs waiting)
   dashboard: https://grafana.monitoring.intra/dashboard/db/billing-overview
   impact: Billing is experiencing delays, causing orders to be marked as pending
   runbook: https://wiki.intra/runbooks/billing-queues-issues.html

Also, I think severity should be a label, not an annotation. Alertmanager can only use labels for alert routing: https://prometheus.io/docs/alerting/configuration/#route

So if it's an annotation, you can't send warning alerts to email and critical alerts to Opsgenie/PagerDuty.
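
For illustration, a minimal Alertmanager routing fragment that only works when severity is a label (the receiver names here are hypothetical, and the receivers themselves would still need to be defined):

route:
  receiver: default
  routes:
  - match:
      severity: warning
    receiver: team-email
  - match:
      severity: critical
    receiver: pagerduty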

Raise an alert if PGs in stuck state

Inconsistent PGs represent a data reliability issue and need action. This alert needs to be subtly different: PGs not being active is a natural recovery state, but a recovery rate of zero or a negative value indicates a problem that needs follow-up. A sketch of both alerts follows below.
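
One possible shape for these alerts, as a sketch only. The metric names (ceph_pg_inconsistent, ceph_pg_degraded, ceph_osd_recovery_ops) are assumptions based on the ceph-mgr Prometheus module and should be verified against the exporter in use:

alert: CephPGsInconsistent
expr: ceph_pg_inconsistent > 0
for: 1m
labels:
   severity: critical
annotations:
   description: '{{ $value }} placement groups are inconsistent, which indicates a data reliability issue.'

alert: CephPGRecoveryStalled
expr: ceph_pg_degraded > 0 and on() sum(rate(ceph_osd_recovery_ops[15m])) == 0
for: 15m
labels:
   severity: warning
annotations:
   description: Placement groups are degraded but no recovery progress was observed over the last 15 minutes.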
