flo-at / minmon

MinMon - an opinionated minimal monitoring and alarming tool

License: Apache License 2.0

Dockerfile 0.45% Rust 99.55%

monitoring alarming uptime linux

minmon's Introduction

MinMon - an opinionated minimal monitoring and alarming tool (for Linux)

This tool is just a single binary and a config file. No database, no GUI, no graphs. Just monitoring and alarms. I wrote this because the existing alternatives I could find were too heavy, mainly focused on nice GUIs with graphs (not on alarming), too complex to set up, or targeted at cloud/multi-instance setups.

Checks

The checks read the measurement values that will be monitored by MinMon.

Actions

An action is triggered when a check's alarm changes its state or a report event is triggered.

Report

The absence of alarms can mean two things: everything is okay or the monitoring/alarming failed altogether. That's why MinMon can trigger regular report events to let you know that it's up and running.

Design decisions

No complex scripting language.
No fancy config directory structure - just a single TOML file.
No users, groups or roles.
No cryptic abbreviations. The few extra letters in the config file won't hurt anyone.
There are no predefined threshold names like "Warning" or "Critical". You might want more than just two, or only one. So that's up to you to define in the config.
The same check plugin can be used multiple times. You might want different levels to trigger different actions for different filesystems at different intervals.
Alarms are timed in "cycles" (i.e. multiples of the interval of the check) instead of seconds. It's not very user-friendly but helps to keep the internal processing and the code simple and efficient.
Alarms stand for themselves - they are not related. This means that depending on your configuration, two (or more) events may be triggered at the same time for the same check. There are cases where this could be undesirable.
Simple, clean, bloat-free code with good test coverage.
Depending on your configuration, there may be similar or identical blocks in the config file. This is a consequence of the flexibility and simplicity of the config file format.
All times and dates are UTC. No fiddling with local times and time zones.
No internal state is stored between restarts.
As of now it's only for Linux but it should be easy to adapt to other *NIXes or maybe even Windows.
Some of the things mentioned above may change in the future (see Roadmap).

Config file

The config file uses the TOML format and has the following sections:

Architecture

System overview

graph TD
    A(Config file) --> B(Main loop)
    B -->|interval| C(Check 1)
    B -.-> D(Check 2..n)
    C -->|data| E(Alarm 1)
    C -.-> F(Alarm 2..m)
    E -->|cycles, repeat_cycles| G(Action)
    E -->|recover_cycles| H(Recover action)
    E -->|error_repeat_cycles| I(Error action)
    E --> J(Error recover action)

    style C fill:green;
    style D fill:green;
    style E fill:red;
    style F fill:red;
    style G fill:blue;
    style H fill:blue;
    style I fill:blue;
    style J fill:blue;

Alarm state machine

Each alarm has 3 possible states: "Good", "Bad" and "Error".
It takes cycles consecutive bad data points to trigger the transition from "Good" to "Bad" and recover_cycles good ones to go back. These transitions trigger the action and recover_action actions. During the "Bad" state, action will be triggered again every repeat_cycles cycles (if repeat_cycles is not 0).

The "Error" state is a bit special as it only "shadows" the other states. An error means that there is no data available at all, e.g. the filesystem usage for /home could not be determined. Since this should rarely ever happen, the transition to the error state always triggers the error_action on the first cycle. If there is valid data on the next cycle, the state machine continues as if the error state did not exist and the error_recover_action is triggered.

stateDiagram-v2
    direction LR

    [*] --> Good
    Good --> Good
    Good --> Bad: action/cycles
    Good --> Error: error_action

    Bad --> Good: recover_action/recover_cycles
    Bad --> Bad: action/repeat_cycles
    Bad --> Error: error_action

    Error --> Good: error_recover_action
    Error --> Bad: error_recover_action
    Error --> Error: error_action/error_repeat_cycles
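The transitions described above can be sketched in a few lines of Rust. This is only an illustration under stated assumptions, not MinMon's actual code: the `Alarm`, `State` and `Event` types are hypothetical, a data point is reduced to "bad or not", and the repeat counter is assumed to count every cycle spent in the "Bad" state.

```rust
/// State as seen from outside. `Error` only shadows the Good/Bad state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Good,
    Bad,
    Error,
}

/// Actions a single cycle can trigger.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Action,
    RecoverAction,
    ErrorAction,
    ErrorRecoverAction,
}

/// Hypothetical alarm model; the first four fields mirror the config keys.
pub struct Alarm {
    cycles: u32,
    repeat_cycles: u32,
    recover_cycles: u32,
    error_repeat_cycles: u32,
    bad: bool,         // shadowed Good/Bad state
    streak: u32,       // consecutive data points contradicting `bad`
    since_action: u32, // cycles spent in Bad since `action` last fired (assumption)
    error_cycles: u32, // consecutive cycles without data (0 = no error)
}

impl Alarm {
    pub fn new(cycles: u32, repeat_cycles: u32, recover_cycles: u32, error_repeat_cycles: u32) -> Self {
        Alarm {
            cycles,
            repeat_cycles,
            recover_cycles,
            error_repeat_cycles,
            bad: false,
            streak: 0,
            since_action: 0,
            error_cycles: 0,
        }
    }

    pub fn state(&self) -> State {
        if self.error_cycles > 0 {
            State::Error
        } else if self.bad {
            State::Bad
        } else {
            State::Good
        }
    }

    /// Process one cycle: `None` means no data, `Some(true)` a bad data point.
    pub fn update(&mut self, data: Option<bool>) -> Vec<Event> {
        let mut events = Vec::new();
        let is_bad = match data {
            Some(is_bad) => is_bad,
            None => {
                // Fire on the first error cycle, then every `error_repeat_cycles`.
                let repeat = self.error_repeat_cycles;
                if self.error_cycles == 0 || (repeat > 0 && self.error_cycles % repeat == 0) {
                    events.push(Event::ErrorAction);
                }
                self.error_cycles += 1;
                return events;
            }
        };
        if self.error_cycles > 0 {
            // Valid data again: leave the error state, then continue as usual.
            self.error_cycles = 0;
            events.push(Event::ErrorRecoverAction);
        }
        if self.bad {
            self.streak = if is_bad { 0 } else { self.streak + 1 };
            if !is_bad && self.streak >= self.recover_cycles {
                self.bad = false;
                self.streak = 0;
                events.push(Event::RecoverAction);
            } else {
                self.since_action += 1;
                if self.repeat_cycles > 0 && self.since_action >= self.repeat_cycles {
                    self.since_action = 0;
                    events.push(Event::Action);
                }
            }
        } else {
            self.streak = if is_bad { self.streak + 1 } else { 0 };
            if is_bad && self.streak >= self.cycles {
                self.bad = true;
                self.streak = 0;
                self.since_action = 0;
                events.push(Event::Action);
            }
        }
        events
    }
}
```

Keeping the error counter separate from the `bad` flag is what lets the error state "shadow" the other two: when data returns, the alarm resumes in whichever state it was in before.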

Example

Check the mountpoint at /home every minute. If the usage level exceeds 70% for 3 consecutive cycles (i.e. 3 minutes), the "Warning" alarm triggers the "Webhook 1" action. The action repeats every 100 cycles until the "Warning" alarm recovers. This happens after 5 consecutive cycles below 70% which also triggers the "Webhook 1" action. If there is an error while checking the filesystem usage, the "Log error" action is triggered. This is repeated every 200 cycles.
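Because alarms are timed in cycles, the wall-clock time is always the cycle count multiplied by the check interval. A minimal sketch of that arithmetic (the helper name `cycles_to_duration` is made up for this illustration):

```rust
use std::time::Duration;

/// Convert a number of cycles into wall-clock time for a given check interval.
pub fn cycles_to_duration(interval_secs: u64, cycles: u32) -> Duration {
    Duration::from_secs(interval_secs * u64::from(cycles))
}
```

With the 60-second interval of this example, `cycles = 3` is 180 seconds, `recover_cycles = 5` is 300 seconds, `repeat_cycles = 100` is 6000 seconds (100 minutes) and `error_repeat_cycles = 200` is 12000 seconds (200 minutes).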

Config

[[checks]]
interval = 60
name = "Filesystem usage"
type = "FilesystemUsage"
mountpoints = ["/home"]

[[checks.alarms]]
name = "Warning"
level = 70
cycles = 3
repeat_cycles = 100
action = "Webhook 1"
recover_cycles = 5
recover_action = "Webhook 1"
error_repeat_cycles = 200
error_action = "Log error"

[[actions]]
name = "Webhook 1"
type = "Webhook"
url = "https://example.com/hook1"
body = """{"text": "{{check_name}}: Alarm '{{alarm_name}}' for mountpoint '{{check_id}}' changed state to *{{alarm_state}}* at {{level}}."}"""
headers = {"Content-Type" = "application/json"}

[[actions]]
name = "Log error"
type = "Log"
level = "Error"
template = """{{check_name}} check didn't have valid data for alarm '{{alarm_name}}' and id '{{alarm_id}}': {{check_error}}."""

# The commented-out lines below sketch how to add another check and alarm.
# [[checks]]
# name = "System pressure"
# type = "PressureAverage"
# cpu = true
# avg60 = true
#
# [[checks.alarms]]
# name = "Warning"
# level = 80
# action = "Another action"

The webhook text will be rendered into something like "Filesystem usage: Alarm 'Warning' for mountpoint '/home' changed state to *Bad* at 70."

Diagram

graph TD
    A(example.toml) --> B(Main loop)
    B -->|every 60 seconds| C(FilesystemUsage 1: '/home')
    C -->|level '/home': 60%| D(LevelAlarm 1: 70%)
    D -->|cycles: 3, repeat_cycles: 100| E(Action: Webhook 1)
    D -->|recover_cycles: 5| F(Recover action: Webhook 1)
    D -->|error_repeat_cycles: 200| G(Error action: Log error)

    style C fill:green;
    style D fill:red;
    style E fill:blue;
    style F fill:blue;
    style G fill:blue;

Some (more exotic) ideas

Just to give some ideas of what's possible:

Run it locally on your workstation and let it send you notifications to your desktop environment using the Process action and notify-send when the filesystem fills up.
Use the report in combination with the Webhook action and telepush and let it send you "I'm still alive, since {{minmon_uptime_iso}}!" once a week to your Telegram messenger for peace of mind.
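To give an idea of what an uptime value could look like in such a message, here is a small sketch that formats a number of seconds as an ISO 8601 duration. It is an assumption that `{{minmon_uptime_iso}}` uses this notation; the function name `iso8601_duration` is hypothetical and not part of MinMon.

```rust
/// Format a number of seconds as an ISO 8601 duration, e.g. "P1DT2H3M4S".
pub fn iso8601_duration(total_secs: u64) -> String {
    let days = total_secs / 86_400;
    let hours = total_secs % 86_400 / 3_600;
    let minutes = total_secs % 3_600 / 60;
    let seconds = total_secs % 60;

    let mut out = String::from("P");
    if days > 0 {
        out.push_str(&format!("{}D", days));
    }
    // The time part is only written if it is non-zero, or if the whole
    // duration is zero (in which case "PT0S" is emitted).
    if hours > 0 || minutes > 0 || seconds > 0 || days == 0 {
        out.push('T');
        if hours > 0 {
            out.push_str(&format!("{}H", hours));
        }
        if minutes > 0 {
            out.push_str(&format!("{}M", minutes));
        }
        if seconds > 0 || total_secs == 0 {
            out.push_str(&format!("{}S", seconds));
        }
    }
    out
}
```

One week of uptime (604800 seconds) would come out as `P7D`.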

Placeholders

To improve the reusability of the actions, it's possible to define custom placeholders for the report, events, checks, alarms and actions. When an action is triggered, the placeholders (generic and custom) are merged into the final placeholder map. Inside the action (depending on the type of the action) the placeholders can be used in one or more config fields using the {{placeholder_name}} syntax. There are also some generic placeholders that are always available. Placeholders that don't have a value available when the action is triggered will be replaced by an empty string.
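The substitution step can be pictured with the following sketch. It is not MinMon's implementation, just a minimal stand-alone renderer for the `{{placeholder_name}}` syntax; the function name `render` and the handling of an unterminated `{{` (kept literally) are assumptions made for this example.

```rust
use std::collections::HashMap;

/// Replace every `{{name}}` in `template` with its value from `placeholders`.
/// Unknown placeholders are replaced by an empty string.
pub fn render(template: &str, placeholders: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find("{{") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find("}}") {
            Some(end) => {
                let name = after[..end].trim();
                if let Some(value) = placeholders.get(name) {
                    out.push_str(value);
                }
                rest = &after[end + 2..];
            }
            None => {
                // No closing braces: keep the remainder untouched (assumption).
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

Rendering `"{{check_name}}: '{{alarm_name}}' {{missing}}!"` with `check_name = "Filesystem usage"` and `alarm_name = "Warning"` yields `Filesystem usage: 'Warning' !`, because the unknown placeholder collapses to an empty string.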

Filters

Filters can be applied to transform the measurement data. This has different use cases. For example:

Compensate for fluctuations in the measurement.
Determine the total network traffic over a number of cycles.

They can be configured for checks, in which case they affect all alarms that belong to the check, or alarms individually. Having both options reduces duplication in the config file in some cases. The check is the preferred place for filtering because it's only done once for all alarms which reduces memory and CPU usage.
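Both use cases boil down to keeping the last few measurements in a window. The sketch below shows the idea with a hypothetical `Window` type; MinMon's real filter names and options are not shown in this README, so nothing here should be read as its configuration or API.

```rust
use std::collections::VecDeque;

/// Fixed-size sliding window over the most recent measurements.
pub struct Window {
    size: usize,
    values: VecDeque<f64>,
}

impl Window {
    /// Create a window holding at most `size` values (at least one).
    pub fn new(size: usize) -> Self {
        let size = size.max(1);
        Window { size, values: VecDeque::with_capacity(size) }
    }

    /// Add a measurement, dropping the oldest one if the window is full.
    pub fn push(&mut self, value: f64) {
        if self.values.len() == self.size {
            self.values.pop_front();
        }
        self.values.push_back(value);
    }

    /// Average over the window: smooths out fluctuations.
    pub fn average(&self) -> Option<f64> {
        if self.values.is_empty() {
            None
        } else {
            Some(self.sum() / self.values.len() as f64)
        }
    }

    /// Sum over the window: e.g. total traffic over a number of cycles.
    pub fn sum(&self) -> f64 {
        self.values.iter().sum()
    }
}
```

Filtering at the check level corresponds to one such window shared by all alarms, while filtering per alarm needs one window for each alarm, which is where the extra memory and CPU usage comes from.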

Installation

Docker image

To pull the docker image use

docker pull ghcr.io/flo-at/minmon:latest

or the example docker-compose.yml file.
In both cases, read-only mount your config file to /etc/minmon.toml.

Build and install using cargo

Make sure cargo and OpenSSL are correctly installed on your local machine.
You can either install MinMon from crates.io using

cargo install --all-features minmon

Or if you already checked out the repository, you can build and install your local copy like this:

cargo install --all-features --path .

Copy the systemd.minmon.service file to /etc/systemd/system/minmon.service and place your config file at path /etc/minmon.toml. You can enable and start the service with systemctl daemon-reload && systemctl enable --now minmon.service.

Install from the AUR (Arch Linux)

Use your package manager of choice to install the minmon package from the AUR.
Place your config file at path /etc/minmon.toml. You can enable and start the service with systemctl daemon-reload && systemctl enable --now minmon.service.

systemd integration (optional)

Build with --features systemd to enable support for systemd.

Logging to journal.
Notify systemd about start-up completion (Type=notify).
Periodically reset systemd watchdog (WatchdogSec=x).

lm_sensors integration (optional)

Build with --features sensors to enable support for lm_sensors.
For the docker image, optionally mount your lm_sensors config file(s) to /etc/sensors.d/.
Note: libsensors is not cooperative and might theoretically block the event loop.

Contributions

See CONTRIBUTING.md


minmon's Issues

Adjust default values so everything is opt-in.

Everything that is monitored should be visible in the config file. That means everything should be opt-in. Right now some check options are enabled by default (e.g. "memory" in MemoryUsage). This is a breaking change regarding existing config files, so this has to be done before the v1.0.0 release.

Release with binary

Hi, would you consider publishing compiled binary with your releases please?

For people who are not Rust developers, compiling minmon adds complexity. I just followed your instructions, but my cargo install command ended with this error:

error: linker `cc` not found
  |
  = note: No such file or directory (os error 2)

error: could not compile `proc-macro2` (build script) due to 1 previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `libc` (build script) due to 1 previous error
error: failed to compile `minmon v0.9.0 (/home/urza/minmon/minmon)`, intermediate artifacts can be found at `/home/urza/minmon/minmon/target`.
To reuse those artifacts with a future compilation, set the environment variable `CARGO_TARGET_DIR` to that path.

I can imagine that I'm not the only one who would appreciate it if you published a binary with each release that we could just drop into the system.

Thanks

Environment variables as placeholders

Discussed in #193

Originally posted by cascandaliato May 8, 2024
I'm using minmon through Docker and some of the actions are Pushover webhooks that require API keys. I'd like to keep my Pushover keys in a .env file and inject them into the container environment.

Reminder: re-enable the Docker ARMv7 image

It does not build currently and I couldn't find the reason. I guess when a new image is released on the Docker Hub it will work again. For the next release the ARMv7 image generation should be possible again.

Allow placeholders in Webhook headers

Discussed in #193

Allow placeholders in Webhook action header values.

Implement timeout for Process action

Currently the process will run without a timeout. It might run long enough for another instance to be started for the same action. This may lead to confusing notifications.

Build ARM Docker images

There should be support for Raspberry Pi and similar boards.

"Read file" check

I would like to be able to have minmon read a log file and check for keywords.

Use case: run a smartmontools script against my local hard drives for SMART, output a file and have minmon monitor the file for "Error".

Is this currently possible?

Update: or is the processexit feature what I should be doing?

Cross-platform support

Right now only Linux is supported. The changes required to also support other *NIXes should be small.

Optionally use rustls instead of OpenSSL/native-tls

Discussed in #60

Originally posted by kyoheiu March 2, 2023
Thank you for this amazing project.
I wonder why openssl is needed: Is this because of lettre's feature?
FWIW it seems lettre supports rustls: https://docs.rs/lettre/latest/lettre/#smtp-over-tls-via-the-rustls-crate

Rework the documentation

Documentation is incomplete and structure is not ideal.

Passing custom Environment variables on Report Action

I made a new action for the report that is a webhook, and it requires passing an authentication header which I am storing as a secret. When I load it up with the new configuration, the action fails. Nginx logs a 403 status code, so I assume the action isn't passing the environment variable to the header properly.

Below are my report config sections:

[[report.events]]
name = "Report"
action = "Ntfy Report"

[[actions]]
name = "Ntfy Report"
type = "Webhook"
url = "http://ntfy-svc.ntfy.svc.cluster.local/health"
body = "[{{env:MINMON_HOST}}] Monitoring uptime {{minmon_uptime_iso}} System uptime {{system_uptime_iso}}"
headers = {"Authorization" = "Bearer {{env:MINMON_NTFY_TOKEN}}"}

Add error_recover_action

Even though the error_action usually means you have to manually check what's wrong with the system because it only happens if something is configured incorrectly (e.g. because the system somehow changed), it would be good to have an error_recover_action to make the state-machine "complete" (i.e. actions on all transitions).

This requires #5.

Add info about shadowed state to error action placeholders

The error state possibly shadows the good/bad state. It would be good to have placeholders available in the error action that indicate which state is shadowed (uuid, maybe timestamp, ..).

Run cargo-clippy and rustfmt in a GitHub action

It doesn't need to be a separate action; might be part of the "test" action.

cargo clippy --all-features
cargo fmt --check --all

Update debian base image to bookworm

Debian bullseye reached its EOL so it's time to move to bookworm.

Documentation is Lacking for things missing in Config Example

As a user who understands the basics of TOML, I don't understand what format the config has to follow to be valid.

E.g. in the example config everything is [[x]], but when I use [[log]] I get an error. Only after opening a discussion did I realize it should be [log].

Another example:

[report]
interval = 6
events = ["Event 1"]

[event]
name = "Event 1"
action = "Webhook 1a"

[[actions]]
name = "Webhook 1a"
type = "Webhook"
url = "https://discord.com/api/webhooks/"
body = """{"username": "Webhook", "avatar_url": "https://i.imgur.com/4M34hi2.png", "content": "Report"}"""
headers = {"Content-Type" = "application/json"}

I get the error:

[+] Running 1/0
 ✔ Container minmon-minmon-1  Created  0.0s
Attaching to minmon-minmon-1
minmon-minmon-1  | Exiting due to error: Failed to parse config file: TOML parse error at line 36, column 11
minmon-minmon-1  |    |
minmon-minmon-1  | 36 | events = ["Event 1"]
minmon-minmon-1  |    |           ^^^^^^^^^
minmon-minmon-1  | invalid type: string "Event 1", expected struct ReportEvent
minmon-minmon-1  |
minmon-minmon-1  |
minmon-minmon-1 exited with code 1

I don't know how to format ReportEvent and trying to base it off of the examples doesn't show me enough to get a working example. More information would be appreciated.

Add random initial delay to checks

This is to avoid CPU or I/O spikes that could happen when there are many checks run at the same time that don't actually need to run at the same time if there was a random delay between them.
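One way to picture the idea without pulling in a random number generator is to derive a stable offset from the check name. Note that this swaps true randomness for a deterministic hash-based spread; it is only a sketch, and `initial_delay_secs` is a hypothetical helper, not something MinMon provides.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Spread checks over their interval by hashing the check name.
/// Returns an offset in the range `0..interval_secs` (0 for a zero interval).
pub fn initial_delay_secs(check_name: &str, interval_secs: u64) -> u64 {
    if interval_secs == 0 {
        return 0;
    }
    let mut hasher = DefaultHasher::new();
    check_name.hash(&mut hasher);
    hasher.finish() % interval_secs
}
```

The delay never exceeds one interval, so the first measurement of each check is postponed by less than one cycle.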

Startup delay

It would be good to have a configurable delay that MinMon waits after it starts, especially after the system booted.

The monitored services may still be starting up and error actions may trigger because of that.
Also the internet connection might not be ready right away.

With systemd there are ways around this but the configuration is non-trivial and it should be an easy thing to do even without systemd.

configurable start-up delay
configurable min. distance to system boot
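The two options could be combined by waiting for whichever takes longer. The following sketch assumes exactly that semantics; the function and parameter names are made up for illustration.

```rust
use std::time::Duration;

/// How long to wait before the first check runs.
/// `startup_delay`: fixed delay after MinMon starts.
/// `min_boot_distance`: minimum system uptime before checks may run.
pub fn initial_wait(
    startup_delay: Duration,
    min_boot_distance: Duration,
    system_uptime: Duration,
) -> Duration {
    // Time still missing until the system has been up long enough (0 if already reached).
    let until_boot_distance = min_boot_distance.saturating_sub(system_uptime);
    startup_delay.max(until_boot_distance)
}
```

For example, with a 10-second start-up delay and a 120-second minimum distance to boot, a system that has been up for 30 seconds would wait 90 seconds, while one that has been up for 5 minutes would wait only the 10 seconds.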

Support config directories

It would be good to have the option to use a config directory as an alternative to the single config file. See #20

Add "alarm_duration" and maybe "good_duration" on recover/action

It would be nice to be able to add the duration an alarm was good/bad before its state changed to the placeholders. This way you can easily see how long e.g. a service was down or the RAM critically full after the alarm recovered again.

Rethink just a single toml file

The reasoning behind many/most monitoring solutions having "fancy" config directory structure is that it's a requirement.

People need to be able to put monitoring in place for something they bring up with automated tools.

Editing a TOML file is a nonstarter (and no, concatenating into it doesn't work either).

I would recommend going through the mental or actual code exercise of parsing some example nagios configs. Steelman why things are the way they are and you'll find it illuminating (I did it a decade ago for an in house golang monitoring server).

https://en.wikipedia.org/wiki/Straw_man#Steelmanning

Testing an alert

I set the filesystem threshold to 85. My mount was 86 to see if my alert would work. Nothing happened and nothing in the logs. Shouldn't the log show an alert? I will post my config if necessary.

systemd journal logging with extra fields

The systemd journal supports storing extra fields (check/alarm/action name, ..) along with the message. To be able to implement this with the log crate, the currently unstable kv_unstable feature is required.

add tests for MemoryUsage check

rework the MemoryUsage module to match the PressureAverage check and add tests

Network device rx/tx byte counter

Check network device rx and/or tx byte count.
Handle counter wrap-around (64 bit, so interval should not really matter).
https://www.kernel.org/doc/html/latest/networking/statistics.html

See #24.
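Wrap-around handling for an unsigned 64-bit counter can be done with wrapping subtraction. A minimal sketch (the function names are hypothetical):

```rust
/// Bytes transferred between two readings of a 64-bit counter.
/// `wrapping_sub` gives the right result even if the counter wrapped once.
pub fn counter_delta(previous: u64, current: u64) -> u64 {
    current.wrapping_sub(previous)
}

/// Average rate over one check interval; `None` for a zero-length interval.
pub fn bytes_per_second(previous: u64, current: u64, interval_secs: u64) -> Option<u64> {
    if interval_secs == 0 {
        None
    } else {
        Some(counter_delta(previous, current) / interval_secs)
    }
}
```

This only works if the counter wraps at most once between two readings, which is why the width of the counter matters more than the interval.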

monitoring docker: no such container

Hello, would it be possible to consider non existent container as a standard alarm condition (i.e. not running)?

afaik container states can be:

1. running/null
2. running/healthy
3. running/unhealthy
4. exited
5. non existent

Right now, states 1-4 are monitored as expected (1-2 OK, 3-4 ALARM), but state 5 just triggers a monitoring error. However, I still consider a container that I explicitly list in the minmon config, even a non-existent one, as not running and hence deserving an alarm. (Yes, I use stateless containers with bind mounts, so when they are stopped they are removed, and Docker no longer knows about their state.) I've checked the code: the Docker library returns an error and replies with "Docker error: Docker responded with status code 404: No such container: mycontainer". Unfortunately I can't code in Rust, but would it be possible to detect this special condition (for example with a regexp on the output) and make it a standard alarm instead of an error, equal to state 4?
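The requested behaviour can be summarised as a small mapping from container state to check result. This is a sketch of the proposal only, not of MinMon's code; the enum names and the `missing_is_alarm` switch are hypothetical.

```rust
/// The five container states listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContainerState {
    Running,
    Healthy,
    Unhealthy,
    Exited,
    Missing,
}

/// What the check reports for one cycle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Reading {
    Ok,
    Alarm,
    Error,
}

/// Map a container state to a reading. With `missing_is_alarm` set,
/// a non-existent container is treated like an exited one.
pub fn classify(state: ContainerState, missing_is_alarm: bool) -> Reading {
    match state {
        ContainerState::Running | ContainerState::Healthy => Reading::Ok,
        ContainerState::Unhealthy | ContainerState::Exited => Reading::Alarm,
        ContainerState::Missing if missing_is_alarm => Reading::Alarm,
        ContainerState::Missing => Reading::Error,
    }
}
```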

Make HTTP(S) dependencies optional

Add a new cargo build option for http related functionality like the WebHook. The goal is to have a minimal set of required dependencies and make everything else optional.

DBUS_SESSION_BUS_ADDRESS is not set for SystemdUnitStatus with user-units

The process spawned by the SystemdUnitStatus check is missing the DBUS_SESSION_BUS_ADDRESS environment variable if a uid is specified. This results in an invalid exit status code so this combination is not usable currently.

How DockerContainerStatus works?

Hi, thanks for the great app.

I'd like to know how DockerContainerStatus works: in my environment and config, it doesn't seem to work, even though a container with that name is running.
I'm running minmon by sudo docker run -d -v ./minmon.toml:/etc/minmon.toml:ro --name minmon ghcr.io/flo-at/minmon:latest.

My config for DockerContainerStatus:

[[checks]]
interval = 60
name = "Docker containers"
type = "DockerContainerStatus"
containers = ["gollum"]

[[checks.alarms]]
name = "Warning"
cycles = 3
repeat_cycles = 100
action = "Webhook 3"
recover_cycles = 5
recover_action = "Webhook 3"
error_repeat_cycles = 200
error_action = "Log error"

[[actions]]
name = "Webhook 3"
type = "Webhook"
url = "https://discord.com/api/webhooks/foo"
body = """{"content": "`{{check_name}}`: Alarm `{{alarm_name}}` for container `{{check_id}}` changed state to *{{state}}."}"""
headers = {"Content-Type" = "application/json"}

$ sudo docker ps -a
869484fff354   gollumwiki/gollum:v5.3.0       "/docker-run.sh"         10 days ago    Up 10 days             0.0.0.0:4567->4567/tcp, :::4567->4567/tcp                                                  gollum

Error message:

2023-07-27T21:12:18Z [INFO] Alarm 'Warning' from check 'Docker containers' will be triggered after 3 bad cycles and recover after 5 good cycles.
2023-07-27T21:12:18Z [INFO] Check 'Docker containers' will be triggered every 60 seconds.
2023-07-27T21:12:44Z [WARN] Check 'Docker containers' got no data for id 'gollum': Docker error: error trying to connect: No such file or directory (os error 2)
2023-07-27T21:12:44Z [ERROR] Alarm 'Warning', id 'gollum' from check 'Docker containers' got an error: Docker error: error trying to connect: No such file or directory (os error 2)
2023-07-27T21:12:44Z [WARN] Alarm 'Warning', id 'gollum' from check 'Docker containers' changing from good to error state.
2023-07-27T21:12:44Z [INFO] Action 'Log error' triggered for alarm 'Warning', id 'gollum' from check 'Docker containers'.
2023-07-27T21:12:44Z [ERROR] Docker containers check didn't have valid data for alarm 'Warning' and id 'gollum': Docker error: error trying to connect: No such file or directory (os error 2).

It seems this line Docker error: error trying to connect: No such file or directory (os error 2) is the cause of this error, but I don't get what that means in this case. Container named gollum is running as docker ps -a shows, but minmon cannot get that information, or do I miss something?

Thanks in advance.

FD leak on systemd notify

Every time sd_notify is called, an unbound dgram socket is leaked. This is probably not a bug in MinMon but in one of the dependencies. For now it's better to disable the watchdog until I figure out where exactly this happens and how to avoid it.

Add uptime placeholders to report.

Implement new placeholders system_uptime and minmon_uptime for report events.

Point in time in addition to interval for report trigger

Hi,

I would love to have the possibility to have a point in time configurable instead of an interval. This would, e.g., enable that alive reports from different servers arrive at the same time so that it is easier to see if a report is missing.

A crontab-like definition style would be an option to define it.
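For a fixed daily time in UTC, the trigger reduces to computing the seconds until the next occurrence. A minimal sketch of that calculation follows; the function is hypothetical and far simpler than a full crontab-style schedule.

```rust
/// Seconds from `now_unix` (Unix time) until the next occurrence of
/// `hour:minute` UTC. Returns a full day if that time is exactly now.
pub fn secs_until_next(now_unix: u64, hour: u32, minute: u32) -> u64 {
    const DAY: u64 = 86_400;
    let target = u64::from(hour) * 3_600 + u64::from(minute) * 60;
    let now_in_day = now_unix % DAY;
    if target > now_in_day {
        target - now_in_day
    } else {
        DAY - now_in_day + target
    }
}
```

Servers using the same target time would then send their reports at the same moment, regardless of when each instance was started.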

Best regards,
Thomas

Improve documentation for example config(s)

I'm very interested in trying out MinMon to monitor my selfhosted servers, but it is not very clear to me how I can set up /etc/minmon.toml to have multiple checks/alarms/actions. It could very well just be my lack of experience/understanding of TOML, though I still think it would be beneficial for possible users to have some more examples.

For instance, when I see the example given in the README, It's very unclear how I could adapt it to have multiple checks

[[checks]]
interval = 60
name = "Filesystem usage"
type = "FilesystemUsage"
mountpoints = ["/home"]

If I understand correctly, this just correlates to a simple hashmap, which would prevent me from declaring the "name" key multiple times (to create additional checks). Do we create nested hashmaps, how do we name them (maybe [[checks.FilesystemUsage]])?
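In TOML, a `[[checks]]` header does not define a single map: it appends one more table to an array named `checks`, and a following `[[checks.alarms]]` appends to the `alarms` array of the most recently defined check. Repeating the `[[checks]]` block is therefore how additional checks are declared. In Rust terms the resulting shape is roughly the following; the struct definitions are illustrative only and are not MinMon's actual types.

```rust
/// One `[[checks.alarms]]` block.
#[derive(Debug)]
pub struct Alarm {
    pub name: String,
    pub level: u8,
}

/// One `[[checks]]` block together with its alarms.
#[derive(Debug)]
pub struct Check {
    pub name: String,
    pub kind: String, // the `type` key in the config
    pub interval: u32,
    pub alarms: Vec<Alarm>,
}

/// The whole file: every `[[checks]]` header adds one element.
#[derive(Debug, Default)]
pub struct Config {
    pub checks: Vec<Check>,
}

/// Equivalent of a config file with two `[[checks]]` blocks.
pub fn example_config() -> Config {
    Config {
        checks: vec![
            Check {
                name: "Filesystem usage".to_string(),
                kind: "FilesystemUsage".to_string(),
                interval: 60,
                alarms: vec![Alarm { name: "Warning".to_string(), level: 70 }],
            },
            Check {
                name: "Memory usage".to_string(),
                kind: "MemoryUsage".to_string(),
                interval: 60,
                alarms: vec![Alarm { name: "Warning".to_string(), level: 80 }],
            },
        ],
    }
}
```

Since each block is a separate array element, the `name` key can appear once per block without any conflict.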

Retry action on next trigger if failed previously

If an action fails, it will be logged but that's all right now. It would be better to retry the action on the next trigger.

Fail to install on debian

Hello

I wanted to try minmon but I can't get it to install on Debian 11.

# cargo install --all-features minmon
    Updating crates.io index
  Downloaded minmon v0.4.1
error: failed to parse manifest at `/root/.cargo/registry/src/github.com-1ecc6299db9ec823/minmon-0.4.1/Cargo.toml`

Caused by:
  invalid type: unit variant, expected string only for key `profile.release.strip`
root@pve2:~# cargo install --all-features minmon
    Updating crates.io index
error: failed to download `minmon v0.4.1`

Caused by:
  unable to get packages from source

Caused by:
  failed to parse manifest at `/root/.cargo/registry/src/github.com-1ecc6299db9ec823/minmon-0.4.1/Cargo.toml`

Caused by:
  invalid type: unit variant, expected string only for key `profile.release.strip`

The Cargo.toml looks like

...
[profile.release]
lto = true
panic = "abort"
strip = true

[dependencies.async-trait]
...

I'm using cargo 1.46.0

Am I doing something wrong ?
