
moira's Introduction

Moira 2.0

Moira is a real-time alerting tool, based on Graphite or Prometheus/VictoriaMetrics metrics.

Installation

Docker Compose is the easiest way to try Moira:

git clone https://github.com/moira-alert/docker-compose.git
cd docker-compose
docker-compose pull
docker-compose up

See more on our documentation page.

Feed data in Graphite format to localhost:2003:

echo "local.random.diceroll 4 `date +%s`" | nc localhost 2003
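
The same datapoint can be sent from a program. The following is a minimal sketch in Go, assuming the listener on localhost:2003 accepts the Graphite plaintext protocol over TCP as the shell example does; the helper names graphiteLine and sendMetric are made up for this illustration.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// graphiteLine formats one datapoint in the Graphite plaintext protocol:
// "<metric path> <value> <unix timestamp>\n".
func graphiteLine(metric string, value float64, unixTS int64) string {
	return fmt.Sprintf("%s %v %d\n", metric, value, unixTS)
}

// sendMetric opens a TCP connection and writes a single datapoint.
func sendMetric(addr, metric string, value float64) error {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(graphiteLine(metric, value, time.Now().Unix())))
	return err
}

func main() {
	// Address and metric name are the ones used in the shell example above.
	if err := sendMetric("localhost:2003", "local.random.diceroll", 4); err != nil {
		fmt.Println("send failed:", err)
	}
}
```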

Configure triggers at localhost:8080 using your browser.

Other installation methods are available; see the documentation.

Contribution

Check our contribution guidelines.

Getting Started

See our user guide that is based on a number of real-life scenarios, from simple and universal to complicated and specific.

What is in the other repositories

Code in this repository is the backend part of the Moira monitoring application.

Contact us

If you have any questions, you can ask us on Telegram.

Thanks

SKB Kontur

Moira was originally developed and is supported by SKB Kontur, a B2G company based in Ekaterinburg, Russia. We express gratitude to our company for encouraging us to open-source Moira and for giving back to the community that created Graphite and many other useful DevOps tools.

moira's People

Contributors

alexakulov, almostinf, androndo, arxa1l, aswinmprabhu, beevee, borovskyav, dependabot[bot], dimedrolity, dmitryanchikov, errx, idoqo, ifireice, imavroukakis, jiexa24, kamaev, kiskachimaria, kissken, kotbauk, litleleprikon, maksgalimz, metikovvadim, msaf1980, nixolay, oxoxoekb, pliner, santflamel, tetrergeru, titusjaka, zhelyabuzhsky

moira's Issues

Debug wrong time range in remote trigger requests

Current UTC time: 1542799764

There is this trigger:
https://moira.skbkontur.ru/trigger/VostokProductionConsumerMessagesCountTooLow

It has this target:
aliasByNode(movingMax(vostok.prod.*.grouphost.vostok*.message_count,'6h'),2,4)

Requesting it from carbonapi (https://graphite.skbkontur.ru/carbonapi/render?format=json&target=aliasByNode(movingMax(vostok.prod.*.grouphost.vostok04.message_count,'6h'),2,4)&from=1542797505) returns the following:

[
    {
        "target": "consumer-metric.vostok04",
        "datapoints": [...
        ]
    },
    {
        "target": "consumer-tracing.vostok04",
        "datapoints": [...
        ]
    }
]

If the from time is omitted, a bunch of unrelated points come back (see datapoints-vostok04.json).

In the carbonapi logs:

{
  "level": "INFO",
  "timestamp": "2018-11-21T14:18:11.478+0300",
  "logger": "access",
  "message": "request served",
  "data": {
    "handler": "render",
    "carbonapi_uuid": "b669a0da-52e8-4c17-bc33-f3a7b50290b7",
    "username": "devops",
    "url": "/render?format=json&from=1542798465&target=aliasByNode%28movingMax%28vostok.prod.%2A.grouphost.vostok%2A.message_count%2C%276h%27%29%2C2%2C4%29&until=1542799091",
    "peer_ip": "127.0.0.1",
    "host": "carbonapi",
    "format": "json",
    "use_cache": true,
    "request_method": "GET",
    "targets": [
      "aliasByNode(movingMax(vostok.prod.*.grouphost.vostok*.message_count,'6h'),2,4)"
    ],
    "cache_timeout": 60,
    "metrics": [
      "vostok.prod.*.grouphost.vostok*.message_count"
    ],
    "runtime": 0.070850722,
    "http_code": 200,
    "carbonzipper_response_size_bytes": 23950,
    "send_globs": true,
    "from": 1542798465,
    "until": 1542799091,
    "from_raw": "1542798465",
    "until_raw": "1542799091",
    "uri": "/render?format=json&from=1542798465&target=aliasByNode%28movingMax%28vostok.prod.%2A.grouphost.vostok%2A.message_count%2C%276h%27%29%2C2%2C4%29&until=1542799091",
    "from_cache": false,
    "zipper_requests": 2
  }
}

So the time in CarbonAPI is correct as well.

Trigger state:

{
  "metrics": {
    "consumer-logs.vostok04": {
      "event_timestamp": 1538487834,
      "state": "OK",
      "suppressed": false,
      "timestamp": 1542721005,
      "value": 12985
    },
    "consumer-logs.vostok06": {
      "event_timestamp": 1538980093,
      "state": "OK",
      "suppressed": false,
      "timestamp": 1542798015,
      "value": 1231950
    },
    "consumer-metric.vostok04": {
      "event_timestamp": 1542795105,
      "state": "OK",
      "suppressed": false,
      "timestamp": 1542798015,
      "value": 106
    },
    "consumer-metric.vostok06": {
      "event_timestamp": 1538980093,
      "state": "OK",
      "suppressed": false,
      "timestamp": 1542798015,
      "value": 4426
    },
    "consumer-sentry.vostok04": {
      "event_timestamp": 1542720975,
      "state": "WARN",
      "suppressed": false,
      "timestamp": 1542721005,
      "value": 0,
      "maintenance": 1543401956
    },
    "consumer-sentry.vostok06": {
      "event_timestamp": 1538980093,
      "state": "OK",
      "suppressed": false,
      "timestamp": 1542798015,
      "value": 2559585
    },
    "consumer-tracing.vostok04": {
      "event_timestamp": 1542795225,
      "state": "OK",
      "suppressed": false,
      "timestamp": 1542798015,
      "value": 179
    },
    "consumer-tracing.vostok06": {
      "event_timestamp": 1540967175,
      "state": "OK",
      "suppressed": false,
      "timestamp": 1542798015,
      "value": 2585
    },
    "metrics-aggregator.vostok04": {
      "event_timestamp": 1542720855,
      "state": "WARN",
      "suppressed": false,
      "timestamp": 1542721005,
      "value": 0,
      "maintenance": 1543401958
    },
    "metrics-aggregator.vostok06": {
      "event_timestamp": 1540994235,
      "state": "OK",
      "suppressed": false,
      "timestamp": 1542798015,
      "value": 2585
    }
  },
  "score": 2,
  "state": "OK",
  "timestamp": 1542798105,
  "event_timestamp": 1538141709,
  "last_successful_check_timestamp": 1542798105,
  "trigger_id": "VostokProductionConsumerMessagesCountTooLow"
}

Graceful shutdown of telebot listener

If you use the Telegram bot, you must make sure that the telebot listener runs on only one instance of Moira. For this we use a distributed lock and extend it every half-TTL interval. But if we fail to extend it before the lock record expires, another instance can register the bot and start a new listener. In this case we should stop the listener on the stale instance. But telebot does not support graceful shutdown of the polling goroutine.
For now I use a dirty hack and try to register the bot again after failing to extend the lock record. In most cases this helps, but the situation described above is still possible.
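
The desired behavior can be sketched as follows, assuming the listener could accept a context.Context and exit when it is cancelled (which, per this issue, telebot did not support). The Lock interface, runWithLock and flakyLock are hypothetical names for illustration and are not Moira's or telebot's API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Lock is a hypothetical distributed lock that can be extended.
type Lock interface {
	Extend() error
}

// keepLock extends the lock every ttl/2. It returns the first Extend error,
// or nil once ctx is cancelled.
func keepLock(ctx context.Context, lock Lock, ttl time.Duration) error {
	ticker := time.NewTicker(ttl / 2)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			if err := lock.Extend(); err != nil {
				return err
			}
		}
	}
}

// runWithLock runs listen while the lock is held and cancels its context as
// soon as the lock can no longer be extended.
func runWithLock(parent context.Context, lock Lock, ttl time.Duration, listen func(context.Context)) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()
	done := make(chan struct{})
	go func() {
		defer close(done)
		defer cancel() // stop extending the lock if the listener exits first
		listen(ctx)
	}()
	err := keepLock(ctx, lock, ttl)
	cancel() // ask the listener to stop
	<-done   // wait until it has actually stopped
	return err
}

// flakyLock is a test double: Extend succeeds okCalls times, then fails.
type flakyLock struct{ okCalls int }

func (l *flakyLock) Extend() error {
	if l.okCalls <= 0 {
		return errors.New("lock lost")
	}
	l.okCalls--
	return nil
}

// demo runs a fake listener that only waits for cancellation.
func demo() (listenerStopped bool, err error) {
	err = runWithLock(context.Background(), &flakyLock{okCalls: 2}, 20*time.Millisecond,
		func(ctx context.Context) {
			<-ctx.Done() // a real listener would poll here until cancelled
			listenerStopped = true
		})
	return listenerStopped, err
}

func main() {
	stopped, err := demo()
	fmt.Println("listener stopped:", stopped, "reason:", err)
}
```

Waiting on the done channel before returning is the point of the design: the instance gives up the bot only after its own listener has exited, so two listeners never overlap because of this instance.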

Do not allow creating simple mode trigger with several targets via API

There are two types of triggers in Moira: simple mode and advanced mode.

In simple mode you can specify one target, and in advanced mode you can specify several targets. Web UI enforces this, but API does not. Therefore, using API you can create simple mode triggers with several targets.

Consider this trigger, created from API:

{
  "id": "GraphiteStoragesFreeSpace",
  "name": "Graphite storage free space low",
  "desc": "Graphite ClickHouse",
  "targets": [
    "aliasByNode(DevOps.system.graphite01.disk._mnt_data.gigabyte_percentfree, 2, 4)",
    "aliasByNode(DevOps.system.sd2-graphite01.disk._mnt_data.gigabyte_percentfree, 2, 4)",
    "aliasByNode(DevOps.system.bst-graphite01.disk.root.gigabyte_percentfree, 2, 4)",
    "aliasByNode(DevOps.system.dtl-graphite01.disk._mnt_data.gigabyte_percentfree, 2, 4)"
  ],
  "warn_value": 10,
  "error_value": 5,
  "trigger_type": "falling",
  "tags": [
    "Normal",
    "DevOps",
    "DevOpsGraphite-duty"
  ],
  "ttl_state": "NODATA",
  "ttl": 600,
  "sched": ...,
  "expression": "",
  "patterns": [
    "DevOps.system.graphite01.disk._mnt_data.gigabyte_percentfree",
    "DevOps.system.sd2-graphite01.disk._mnt_data.gigabyte_percentfree",
    "DevOps.system.bst-graphite01.disk.root.gigabyte_percentfree",
    "DevOps.system.dtl-graphite01.disk._mnt_data.gigabyte_percentfree"
  ],
  "is_remote": false,
  "mute_new_metrics": false,
  "throttling": 0
}

It is a simple mode trigger, created via API. Simple triggers have only simple thresholds (warn_value and error_value), but no expression. Yet this trigger has four targets. This leads to a non-obvious error: Moira silently ignores all targets for this trigger, except the first one.

We should not allow creating triggers in simple mode with several targets via API. Instead, we should return an error.
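
A minimal sketch of such a check, assuming simple mode is recognized by an empty expression as described above. The Trigger struct, validateTrigger and the error value are illustrative and not taken from Moira's code.

```go
package main

import (
	"errors"
	"fmt"
)

// Trigger holds only the fields relevant to this check; the field names
// mirror the "targets" and "expression" keys of the JSON above.
type Trigger struct {
	Targets    []string
	Expression string
}

// ErrSimpleModeTargets is a hypothetical error returned to the API client.
var ErrSimpleModeTargets = errors.New("simple mode trigger cannot have more than one target")

// validateTrigger rejects triggers that have no expression (simple mode)
// but specify several targets.
func validateTrigger(t Trigger) error {
	if t.Expression == "" && len(t.Targets) > 1 {
		return ErrSimpleModeTargets
	}
	return nil
}

func main() {
	bad := Trigger{Targets: []string{"target.one", "target.two"}}
	fmt.Println(validateTrigger(bad))

	good := Trigger{Targets: []string{"target.one"}}
	fmt.Println(validateTrigger(good))
}
```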

Logical expressions for tags

Currently tags are always combined with "and". It would be nice to be able to combine tags in subscriptions with logical expressions using "and", "or", "not", etc.
For example:

  • tag1 && (tag2 || tag3)
  • tag1 && tag2 && !tag3
  • (tag1 && tag2) || ( tag1 && tag3)
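
Expressions like the ones above could be evaluated with a small recursive-descent parser. This is only a sketch under stated assumptions: tag names contain no spaces and none of the characters !, (, ), & or |, and matchTags is a made-up function name, not part of Moira.

```go
package main

import (
	"fmt"
	"strings"
)

type parser struct {
	tokens []string
	pos    int
	tags   map[string]bool
}

func (p *parser) peek() string {
	if p.pos < len(p.tokens) {
		return p.tokens[p.pos]
	}
	return ""
}

func (p *parser) next() string {
	tok := p.peek()
	p.pos++
	return tok
}

// parseOr handles: and ("||" and)*
func (p *parser) parseOr() (bool, error) {
	left, err := p.parseAnd()
	for err == nil && p.peek() == "||" {
		p.next()
		var right bool
		right, err = p.parseAnd()
		left = left || right
	}
	return left, err
}

// parseAnd handles: unary ("&&" unary)*
func (p *parser) parseAnd() (bool, error) {
	left, err := p.parseUnary()
	for err == nil && p.peek() == "&&" {
		p.next()
		var right bool
		right, err = p.parseUnary()
		left = left && right
	}
	return left, err
}

// parseUnary handles: "!" unary | "(" or ")" | tag
func (p *parser) parseUnary() (bool, error) {
	switch tok := p.next(); tok {
	case "!":
		v, err := p.parseUnary()
		return !v, err
	case "(":
		v, err := p.parseOr()
		if err == nil && p.next() != ")" {
			err = fmt.Errorf("missing closing parenthesis")
		}
		return v, err
	case "", ")", "&&", "||":
		return false, fmt.Errorf("unexpected token %q", tok)
	default:
		return p.tags[tok], nil // a tag is true if the trigger carries it
	}
}

// matchTags reports whether the set of trigger tags satisfies expr.
func matchTags(expr string, tags []string) (bool, error) {
	for _, op := range []string{"&&", "||", "!", "(", ")"} {
		expr = strings.ReplaceAll(expr, op, " "+op+" ")
	}
	set := make(map[string]bool, len(tags))
	for _, t := range tags {
		set[t] = true
	}
	p := &parser{tokens: strings.Fields(expr), tags: set}
	ok, err := p.parseOr()
	if err == nil && p.pos < len(p.tokens) {
		err = fmt.Errorf("unexpected token %q", p.peek())
	}
	return ok, err
}

func main() {
	ok, err := matchTags("tag1 && (tag2 || tag3)", []string{"tag1", "tag3"})
	fmt.Println(ok, err)
}
```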

Slack notifications: utilize threads feature to organize a channel as a stream of incidents

It's often convenient to include an actionable description for each trigger to ease the learning curve for new on-call engineers. However, these descriptions are generally quite long. This causes messages to occupy a lot of screen space and create unnecessary bloat:

image
image

There's also a good practice to put discussion regarding each alert into its own thread, which makes the channel easier to navigate:

image

We propose two improvements to the Slack notification system to help establish the thread-per-alert practice in on-call channels:

  1. An option to automatically start a thread each time a trigger fires and post its description in this thread as a first message. This only applies to harmful transitions (like X --> ERROR).

  2. Weave messages about transitions which bring the trigger back into normal state (like X --> OK) into the threads caused by respective harmful transitions.

With these capabilities, one would be able to organize the channel as a stream of "incidents", each with its own history thread.

Alerts arrive in Telegram regardless of the "watch time" settings

We noticed strange behavior: alerts from Moira arrive during a time interval that is outside the "watch time".
For example, the watch time is set to "At specific interval: 12:00 - 23:59",
but a notification arrived in Telegram at 10:09.
Please check.
Thanks.
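
For reference, the check a user would expect can be sketched as below. This does not claim to explain the reported behavior; it only illustrates that the comparison has to be made in the subscriber's time zone. Window and its methods are hypothetical and are not Moira's schedule model, and the date in main is arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// Window is a daily watch interval in minutes since local midnight.
type Window struct{ Start, End int }

// containsMinute reports whether minute-of-day m lies inside the window,
// bounds included. A window with Start > End wraps past midnight.
func (w Window) containsMinute(m int) bool {
	if w.Start <= w.End {
		return m >= w.Start && m <= w.End
	}
	return m >= w.Start || m <= w.End
}

// Allows converts t to the subscriber's location before comparing, so the
// result does not depend on the server's own time zone.
func (w Window) Allows(t time.Time, loc *time.Location) bool {
	local := t.In(loc)
	return w.containsMinute(local.Hour()*60 + local.Minute())
}

func main() {
	w := Window{Start: 12 * 60, End: 23*60 + 59} // 12:00 - 23:59
	// A fixed UTC+3 offset stands in for the user's zone in this sketch.
	loc := time.FixedZone("UTC+3", 3*60*60)
	sent := time.Date(2018, 1, 1, 7, 9, 0, 0, time.UTC) // 10:09 in UTC+3
	fmt.Println("notification allowed:", w.Allows(sent, loc))
}
```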

Change color and state of metric when maintenance is turned on

Hello,

Is there any way to change the color and state of a metric when turning on maintenance? For example, if a metric is in the error (red) state but we turn on maintenance for it (because we want to ignore it for some time), then as I understand it we won't get any notifications for it. But is it possible to change the state from error to something else?

Restore 'expression' key in API

On moira-test, the response to a trigger data request currently has no "expression" key. In dev and production Moira this key exists. Which one is correct?

Unknown contact type

notifier config

redis:
  host: localhost
  port: "6379"
  dbid: 0
graphite:
  enabled: ""
  uri: localhost:2003
  prefix: moira
  interval: 60s0ms
log:
  log_file: stdout
  log_level: debug
notifier:
  sender_timeout: 10s0ms
  resending_timeout: "24:00"
  senders:
    - type: mail
      smtp_host: mail.example.ru
      smtp_port: 25
      mail_from: [email protected]
      insecure_tls: true
  moira_selfstate:
    enabled: "false"
    redis_disconect_delay: 30
    last_metric_received_delay: 60
    last_check_delay: 60
    contacts: []
    notice_interval: 300
  front_uri: http://alert.example.ru
timezone: Europe/Moscow

After adding a delivery channel and testing Email [email protected], the log shows:
Nov  8 11:56:48 moira moira-notifier[32677]: 2017-11-08 11:56:48.411#011notifier#011WARNING#011Can't send message after 3 try: Unknown contact type [package of 1 notifications to [email protected]]. Retry again after 1 min
Nov  8 11:56:48 moira moira-notifier[32677]: 2017-11-08 11:56:48.411#011notifier#011DEBUG#011Scheduled notification for contact email:[email protected] trigger  at 2017/11/08 11:57:48 (1510131468)

Aggregate notifications for mass problems

Sometimes something bad happens, and many of your triggers fall into NODATA (for example).
It would be nice if Moira could notice such mass problems and replace a million messages with one. A single message saying "everything is bad everywhere" is much more informative than a flurry of alerts.

Fix eternal loader on nonexisting trigger page (mobile)

When opening mobile Moira via a link to a nonexistent trigger, we see an endless spinner. We would like the interface to behave properly.

storage-schemas.conf in all DEB packages

Is the file storage-schemas.conf needed in the moira-checker, moira-cli, moira-filter and moira-notifier packages?
It causes a problem during installation, with this error:
attempt to overwrite "/etc/moira/storage-schemas.conf", which is already available in the moira-api 2.0-beta1.2-1 package

Add images to notifications

Hi everyone!
All Graphite API interfaces can render an image for a metric.
We would really like to be able to enable sending images to Slack and email in the notifier, as Grafana does, for example.
The image can be generated on the fly and uploaded to a shared WebDAV server (for a start), and then a link to it can be posted to Slack and email.

Wrong timezone in email

In the notifier.yml config:
timezone: Europe/Moscow
In the web interface everything is correct.
In email notifications the time is in UTC.

Process last-checks that do not have any metrics stored

When a last check is deleted, we delete all metrics matching the trigger's patterns. After that, if metrics stop arriving for some of the remaining last checks, those last checks are never checked again and their state never changes.

How this affects the user:

  1. We will never again send a NODATA event for such metrics
  2. We will not delete such metrics automatically if the trigger has NODATA state == DEL
  3. We will never clear the last value of the metric if it was not NIL

How to solve the problem:
If, after fetching metrics, we see that there are last checks with no metrics, run checkForNoData on them.
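
Finding the affected metrics amounts to a set difference between the metrics recorded in the last check and the metrics that were actually fetched. The sketch below shows only that step; staleMetrics is a hypothetical helper, the map values stand for last-seen timestamps, and the real NODATA handling (checkForNoData in the issue) is left as a comment.

```go
package main

import (
	"fmt"
	"sort"
)

// staleMetrics returns metric names that are present in the last check but
// absent from the freshly fetched data, in sorted order.
func staleMetrics(lastCheck map[string]int64, fetched []string) []string {
	seen := make(map[string]bool, len(fetched))
	for _, name := range fetched {
		seen[name] = true
	}
	var stale []string
	for name := range lastCheck {
		if !seen[name] {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// Illustrative data: metric name -> timestamp of the last check.
	lastCheck := map[string]int64{
		"consumer-logs.vostok04":   1542721005,
		"consumer-metric.vostok04": 1542798015,
	}
	fetched := []string{"consumer-metric.vostok04"}
	for _, name := range staleMetrics(lastCheck, fetched) {
		// A real checker would run its NODATA logic for this metric here.
		fmt.Println("needs NODATA check:", name)
	}
}
```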

moira-cli-2.0.0-1.x86_64.rpm doesn't work

Trying to install moira-cli using
rpm -i moira-cli-2.0.0-1.x86_64.rpm
the command fails with the message

file /usr/lib/systemd/system from install of moira-cli-2.0.0-1.x86_64 conflicts with file from package systemd-219-42.el7_4.4.x86_64

this happens because the package contains the empty directory /usr/lib/systemd/system/

[root@moira02 moira_pkg]# rpm -qlp moira-cli-2.0.0-1.x86_64.rpm
/etc/moira/cli.yml
/usr/bin/moira-cli
/usr/lib/systemd/system

telegram notification does not work

notifier.yml

  senders:
    - type: telegram
      api_token: "xxxxxxxxxxxxxxxxx"

The token definitely works with the Grafana bot.
In the log:
Nov 8 17:30:23 moira moira-notifier[8659]: 2017-11-08 17:30:23.243#011notifier#011WARNING#011Can't send message after 12 try: Failed to send message to telegram contact @ihard: failed to get username uuid: Nil returned. . Retry again after 1 min

Debug weird metric spike

Figure out what caused this spike, which appeared out of nowhere. It seems a test can easily be written based on the data in the attachments.

Can't send message WARNING

Hi

I'm trying to set up Moira.
I installed it from the released deb packages (from here https://github.com/moira-alert/moira/releases), version 2.2.0-1.

I'm trying to set up notifications, but without success.

The notifier log shows:
WARNING Can't send message after 41 try: Unknown contact type 'mail' [package of 1 notifications to [email protected]]. Retry again after 1 min

--- notifier.yml

redis:
  host: localhost
  port: "6379"
  dbid: 0
graphite:
  enabled: true
  uri: "localhost:2003"
  prefix: DevOps.Moira
  interval: 60s
log:
  log_file: stdout
  log_level: debug
notifier:
  sender_timeout: 10s
  resending_timeout: "1:00"
  senders:
    - type: mail
      mail_from: moira
      smtp_host: smtp.gmail.com
      smtp_port: 465
      # Skip SMTP server certificate chain validation if false
      insecure_tls: true
      # Uses "mail_from" if empty
      smtp_user: [email protected]
      smtp_pass: pass
      # Email template file path (standard Go templates), if empty use default template
      # template_file: ...
  moira_selfstate:
    enabled: true
    redis_disconect_delay: 60s
    last_metric_received_delay: 120s
    last_check_delay: 120s
    notice_interval: 300s
  front_uri: http://localhost
  timezone: Europe/Moscow

--- web.json

{
  "api_url": "/api",
  "contacts": [
    {"type": "mail", "validation": "^.+@.+\\..+$", "icon": "email"},
    {"type": "pushover", "validation": "", "img": "pushover.ico"},
    {"type": "slack", "validation": "^[@#].+$", "img": "slack.ico"},
    {"type": "telegram", "validation": "^@.+$", "img": "telegram.ico", "title": "@channel only", "help": "required to grant @KonturMoiraBot admin privileges"},
    {"type": "twilio sms", "img": "twilio_sms.ico"},
    {"type": "twilio voice", "img": "twilio_voice.ico"}
  ],
  "supportEmail": "[email protected]"
}

notifier/mail doesn't understand html tags in description field

Hello.
I receive notifications via notifier/mail. The trouble is that the text lines are concatenated into one, like:
Description: Heap Size Usage : http://server/dashboard/db/call-manager Please, don’t use mon-alert.link - it is for monitoring admins group purpose.
Neither < br > nor \n line breaks work; they are passed through as is:
Description: Heap Size Usage :<br> http://server/dashboard/db/call-manager \nPlease, don’t use mon-alert.link - it is for monitoring admins group purpose.
According to the source code, all the tags are forcibly rewritten.
I need the ability to use raw HTML tags inside the Description field of a trigger so that I can format my own messages manually.
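
The behavior described here matches what Go's html/template does by default. The sketch below assumes the mail sender renders the description through html/template (the notifier config elsewhere mentions "standard Go templates"); the template text and function names are illustrative, not Moira's.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// mailTmpl is a stand-in for an email template with one description field.
var mailTmpl = template.Must(template.New("mail").Parse("Description: {{.}}"))

func render(desc interface{}) string {
	var b strings.Builder
	if err := mailTmpl.Execute(&b, desc); err != nil {
		return "render error: " + err.Error()
	}
	return b.String()
}

// renderEscaped passes the description as a plain string, so html/template
// escapes any markup in it.
func renderEscaped(desc string) string { return render(desc) }

// renderTrusted marks the description as template.HTML, which html/template
// emits verbatim. This is only safe when the text comes from a trusted author.
func renderTrusted(desc string) string { return render(template.HTML(desc)) }

func main() {
	desc := "Heap Size Usage :<br> http://server/dashboard/db/call-manager"
	fmt.Println(renderEscaped(desc))
	fmt.Println(renderTrusted(desc))
}
```

Emitting user-supplied HTML verbatim opens the door to markup injection in emails, so an opt-in setting or a whitelist of allowed tags would be a safer design than disabling escaping globally.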

Review tag sorting in dropdown lists

In search results, we would like to show tags that match by prefix first, and then all the rest.

prometheus and moira

Good afternoon!

Have you considered using Prometheus (https://prometheus.io) as an additional metrics source?

Tell user why the trigger is in NODATA state

The trigger currently has no metrics, so they do not transition to the NODATA state (that is a separate bug). But the user cannot tell why the trigger itself is in NODATA.

ToDo:

  • design
  • frontend
  • backend

Expected to see webhooks in Moira (use case inside)

We want a service (daemon) to track statistics from different sources (graphite is one of the sources).

In case the statistics indicate problems in our application, the daemon will help to display a message for users on the application page.

malformed config files installed by rpm

The RPMs of moira-filter, moira-checker and moira-notifier install malformed config files; in each there is at row 6:

graphite:
  enabled: ""

instead of

graphite:
  enabled: true

so starting services fails with errors like these:

gen 09 09:21:19 moira-filter[2044]: Can not read settings: Can't parse config file [/etc/moira/filter.yml] [yaml: unmarshal errors:
gen 09 09:21:19 moira-filter[2044]: line 6: cannot unmarshal !!str `` into bool]

/etc/moira/notifier.yml has one more problem:

  moira_selfstate:
    enabled: "false"

instead of

  moira_selfstate:
    enabled: false

so starting the service returns this error:

gen 09 10:34:57 moira-notifier[4574]: line 18: cannot unmarshal !!str false into bool]

Types of Triggers

We could add different types of triggers:

  • Push type for real-time alerting. This is like the current trigger type.
  • Pull type for simple anomaly detection. This trigger has to pull data from Graphite storage periodically, like all other Graphite-based monitoring systems (bosun, beacon).
  • HealthCheck type is a standard scheduled check over HTTP, FTP and some other popular protocols (like statuscake.com, uptimerobot.com).
  • WatchDog is a trigger that waits for periodic HTTP requests that set its state. It is necessary for tracking the state of sparse and periodic operations like backups (healthchecks.io, deadmanssnitch.com).

Add status handler

For example:

GET /api/status
{
"status":"fail",
"redis":"ok",
"filter": "ok", 
"checker": "fail"
}
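
A handler producing this response could look roughly like the sketch below. It assumes the component checks are supplied by the caller, and returning HTTP 503 on failure is a design choice made for this illustration, not something the issue specifies; Status, overall and statusHandler are made-up names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Status mirrors the JSON document proposed above.
type Status struct {
	Status  string `json:"status"`
	Redis   string `json:"redis"`
	Filter  string `json:"filter"`
	Checker string `json:"checker"`
}

// overall is "ok" only if every component reports "ok".
func overall(parts ...string) string {
	for _, p := range parts {
		if p != "ok" {
			return "fail"
		}
	}
	return "ok"
}

// statusHandler builds a handler from a function that probes the components.
func statusHandler(check func() Status) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		s := check()
		s.Status = overall(s.Redis, s.Filter, s.Checker)
		w.Header().Set("Content-Type", "application/json")
		if s.Status != "ok" {
			// Assumption: a non-200 code lets load balancers react to failures.
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		_ = json.NewEncoder(w).Encode(s)
	}
}

// demo calls the handler in-process with a failing checker.
func demo() (int, string) {
	h := statusHandler(func() Status {
		return Status{Redis: "ok", Filter: "ok", Checker: "fail"}
	})
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/api/status", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := demo()
	fmt.Println(code)
	fmt.Print(body)
}
```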

Notifier tries to deliver messages to the contact that has been deleted

Notifier tries to deliver messages to a contact that has been deleted.
After the subscription is deleted, it still keeps trying to send messages.
Is there a TTL?

in log:

2018-01-09 22:18:47.830 notifier        WARNING Can't send message after 106 try: Failed to send message to telegram contact Moira-alerts: failed to get username uuid: Nil returned. . Retry again after 1 min
2018-01-09 22:18:47.830 notifier        WARNING Can't send message after 85 try: Failed to send message to telegram contact Moira-alerts: failed to get username uuid: Nil returned. . Retry again after 1 min

Enable alerts only for selected days interval

Hello,
We need alerts only for the intervals 1-25 Jan, 1-25 Apr, 1-25 Jul, 1-25 Oct (yes, this is for VAT reporting).
This is because our metrics could show 0 value on days outside these intervals.
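
The requested schedule reduces to a simple date predicate. Here is a small sketch of it, with hypothetical function names; how such a rule would be configured in a subscription is not covered.

```go
package main

import (
	"fmt"
	"time"
)

// inWindow reports whether a calendar date falls on days 1-25 of the first
// month of a quarter (January, April, July, October). month is 1-12.
func inWindow(month, day int) bool {
	firstOfQuarter := (month-1)%3 == 0
	return firstOfQuarter && day >= 1 && day <= 25
}

// alertsAllowed applies the rule to a timestamp in its own location.
func alertsAllowed(t time.Time) bool {
	return inWindow(int(t.Month()), t.Day())
}

func main() {
	fmt.Println("alerts allowed now:", alertsAllowed(time.Now()))
}
```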
