
9volt's Introduction

9volt


A modern, distributed monitoring system written in Go.

Another monitoring system? Why?

While there are plenty of solutions for monitoring and alerting using time-series data, there aren't many (or any?) modern solutions for 'regular'/'old-school' remote monitoring similar to Nagios and Icinga.

9volt offers the following out of the box:

  • Single binary deploy
  • Fully distributed
  • Incredibly easy to scale to hundreds of thousands of checks
  • Uses etcd for all configuration
  • Real-time configuration pick-up (update etcd - 9volt immediately picks up the change)
  • Support for assigning checks to specific (groups of) nodes
    • Helpful for getting around network restrictions (or requiring certain checks to run from a specific region)
  • Interval-based monitoring (i.e. run check XYZ every 1s, 1d, 1y or even 1ms)
  • Natively supported monitors:
    • TCP
    • HTTP
    • Exec
    • DNS
  • Natively supported alerters:
    • Slack
    • Pagerduty
    • Email
  • RESTful API for querying current monitoring state and loaded configuration
  • Comes with a built-in, React-based UI that provides another way to view and manage the cluster
  • Comes with a built-in monitor and alerter config management util (that parses and syncs YAML-based configs to etcd)
    • ./9volt cfg --help

Usage

  • Install/setup etcd
  • Download latest 9volt release
  • Start server: ./9volt server -e http://etcd-server-1.example.com:2379 -e http://etcd-server-2.example.com:2379 -e http://etcd-server-3.example.com:2379
  • Optional: use 9volt cfg for managing configs
  • Optional: add 9volt to be managed by supervisord, upstart or some other process manager
  • Optional: Several configuration params can be passed to 9volt via env vars

... or, if you prefer to do things via Docker, check out these docs.

H/A and scaling

Scaling 9volt is incredibly simple: launch another 9volt service on a separate host and point it at the same etcd hosts as the main 9volt service.

Your main 9volt node will produce output similar to this when it detects a node join:

[screenshot: node-join log output]

Checks will be automatically divided among all 9volt instances.

If one of the nodes goes down, a new leader is elected (if the node that went down was the previous leader) and checks are redistributed among the remaining nodes.

This produces output similar to the following (also available in the event stream via the API and UI):

[screenshot: node-leave log output]
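The redistribution described above can be sketched as a simple round-robin assignment. This is a hypothetical helper (`distribute` is not 9volt's actual function; the real director logic may differ):

```go
package main

import "fmt"

// distribute assigns each check to a member round-robin — a sketch of
// how a director could spread checks across cluster members.
func distribute(members, checks []string) map[string][]string {
	assignments := make(map[string][]string)
	if len(members) == 0 {
		return assignments
	}
	for i, check := range checks {
		m := members[i%len(members)]
		assignments[m] = append(assignments[m], check)
	}
	return assignments
}

func main() {
	// Two members, three checks: node-a gets two, node-b gets one.
	fmt.Println(distribute(
		[]string{"node-a", "node-b"},
		[]string{"check-1", "check-2", "check-3"},
	))
}
```

When a member leaves, re-running the same assignment over the remaining members yields the redistributed layout.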

API

API documentation can be found here.

Minimum requirements (can handle ~1,000-3,000 <10s interval checks)

  • 1 x 9volt instance (1 core, 256MB RAM)
  • 1 x etcd node (1 core, 256MB RAM)

Note: In the minimum configuration, you can run both 9volt and etcd on the same node.

Recommended (production) requirements (can handle 10,000+ <10s interval checks)

  • 3 x 9volt instances (2+ cores, 512MB RAM)
  • 3 x etcd nodes (2+ cores, 1GB RAM)

Configuration

While you can manage 9volt alerter and monitor configs via the API, another approach to config management is to use the built-in config utility (9volt cfg <flags>).

This utility allows you to scan a given directory for any YAML files that resemble 9volt configs (the file must contain either 'monitor' or 'alerter' sections) and it will automatically parse, validate and push them to your etcd server(s).

By default, the utility will keep your local configs in sync with your etcd server(s). In other words, if the utility comes across a config entry in etcd that does not exist in your local config files, it will remove that entry from etcd (and vice versa). This behavior can be turned off with the --nosync flag.
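The sync pass boils down to a set difference: anything in etcd that no longer exists locally gets deleted. A minimal sketch (the `staleKeys` helper is hypothetical, not the actual utility code):

```go
package main

import "fmt"

// staleKeys returns config keys present in etcd but missing from the
// local YAML files — the entries a sync pass would remove from etcd.
func staleKeys(local, remote []string) []string {
	seen := make(map[string]bool, len(local))
	for _, k := range local {
		seen[k] = true
	}
	var stale []string
	for _, k := range remote {
		if !seen[k] {
			stale = append(stale, k)
		}
	}
	return stale
}

func main() {
	local := []string{"monitor/http-frontend"}
	remote := []string{"monitor/http-frontend", "monitor/old-check"}
	fmt.Println(staleKeys(local, remote)) // [monitor/old-check]
}
```

Running the same comparison in the other direction gives the entries that need to be pushed to etcd.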

[screenshot: cfg run output]

You can look at an example of a YAML-based config here.

Docs

Read through the docs dir.

Suggestions/ideas

Got a suggestion/idea? Something that is preventing you from using 9volt over another monitoring system because of a missing feature? Submit an issue and we'll see what we can do!

9volt's People

Contributors

caledhwa, dselans, jessedearing, jondowdle, relistan, talpert


9volt's Issues

Add event API endpoint

Add event API endpoints.

  • GET /api/v1/event -- fetch all events
  • GET /api/v1/event?type=foo,bar -- fetch events for the given type(s)
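A sketch of what the filter could look like. The `Event` shape and `filterEvents`/`eventHandler` names are hypothetical, not the real 9volt API code:

```go
package main

import (
	"encoding/json"
	"net/http"
	"strings"
)

// Event is a minimal, hypothetical event shape.
type Event struct {
	Type    string `json:"type"`
	Message string `json:"message"`
}

// filterEvents returns events matching the comma-separated type list;
// an empty list means "all events".
func filterEvents(events []Event, types string) []Event {
	if types == "" {
		return events
	}
	want := make(map[string]bool)
	for _, t := range strings.Split(types, ",") {
		want[t] = true
	}
	var out []Event
	for _, e := range events {
		if want[e.Type] {
			out = append(out, e)
		}
	}
	return out
}

// eventHandler serves GET /api/v1/event with an optional ?type= filter.
func eventHandler(events []Event) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(filterEvents(events, r.URL.Query().Get("type")))
	}
}

func main() {
	http.Handle("/api/v1/event", eventHandler([]Event{{Type: "error", Message: "boom"}}))
	// http.ListenAndServe(":8080", nil) // left commented so the sketch exits
}
```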

Multiple alerter problem

The fact that we allow multiple alerters for both critical and warning means that if someone configures "slack" in both the warning and critical lists, they will get two "resolve" notifications whenever a check resolves.

In other words:

WarningAlerter: ["slack", "email"]
CriticalAlerter: ["pagerduty"]

Would work fine - when the check resolves, it will send a resolve to pagerduty, to slack, and to email.

If however we configured:

WarningAlerter: ["slack", "email"]
CriticalAlerter: ["slack", "pagerduty"]

You will get two similar notifications in Slack about the resolve.

It's not horrible, but it is annoying. I can't yet think of a way around this case without tearing down the ability to have multiple alerters...
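One possible fix is to deduplicate the union of the two alerter lists before sending resolves. A hedged sketch (`resolveTargets` is hypothetical, not 9volt code):

```go
package main

import "fmt"

// resolveTargets merges the warning and critical alerter lists and drops
// duplicates, so each alerter receives exactly one resolve notification.
func resolveTargets(warning, critical []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, a := range append(append([]string{}, warning...), critical...) {
		if !seen[a] {
			seen[a] = true
			out = append(out, a)
		}
	}
	return out
}

func main() {
	// "slack" appears in both lists but only gets one resolve.
	fmt.Println(resolveTargets(
		[]string{"slack", "email"},
		[]string{"slack", "pagerduty"},
	)) // [slack email pagerduty]
}
```

This keeps multiple alerters intact while collapsing the duplicate-resolve case described above.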

Add "DisableUntil" functionality

Not 100% convinced this is needed (or something that should be controlled via an external entity) but dropping it here so we don't forget about it.

Single or multiple dalClients?

Not sure if there is any benefit or downside to using a single dalClient vs. spawning one for each "big" component. I believe I read that the etcd client is thread-safe, but I am not sure if there are any other implications to using the same one for performing work across 20 goroutines.

Need to do some research

UI: Clean up/update event view

Event view should differentiate between different types of events (by displaying a different icon). Potentially provide a 'filter' or a 'search' to only display events that match certain strings.

UI

Let's figure out how the UI will work.

  • Is it a separate application?
  • Is it client-side JS bundled with the 9volt server?
  • Does the UI talk to the API or etcd directly?
  • What is the basic functionality the UI needs?

TCP check

Write a TCP check that performs an outbound connection to a specific port and optionally can expect a string.

Write state functionality

State is going to be updated by checks.

State is used for being able to provide the API with information about what the member/cluster is doing.

Idea is that the state should be written out (periodically) by each member to etcd; API then accesses and exposes data from etcd.

Semi-random initial startup failures

Seeing sort of random startup failures:

fullstop:9volt dselans$ go run main.go -e http://localhost:2379 -d
DEBU[0000] cluster: Launching cluster engine components...
DEBU[0000] director: Launching director components...
INFO[0000] manager: Starting manager components...
INFO[0000] alerter: Starting alerter components...
DEBU[0000] cluster: Launching director heartbeat...
DEBU[0000] cluster: Launching member monitor...
INFO[0000] 9volt has started! API address: http://0.0.0.0:8080 MemberID: 365c6d2c
DEBU[0000] cluster: Launching director monitor...
DEBU[0000] api: Starting API server
DEBU[0000] cluster: Launching member heartbeat...
INFO[0000] cluster-directorMonitor: Not a director, but etcd says we are (updating state)!
INFO[0000] cluster-directorMonitor: Taking over director role
INFO[0000] director-stateListener: Starting up etcd watchers
DEBU[0000] director-distributeChecks: Performing member existence verification
DEBU[0000] director-checkConfigWatcher: Launching...
DEBU[0000] CollectCheckStats: map[]
DEBU[0000] director-verifyMemberExistence: Detected 'set' action for key /9volt/cluster/members/365c6d2c
INFO[0000] director-distributeChecks: Performing check distribution across members in cluster
DEBU[0000] director-distributeChecks: Distributing checks between 1 cluster members
DEBU[0000] director: Assigning check '/9volt/monitor/monitor_config_4' to member '365c6d2c'
DEBU[0000] manager: Received a 'set' watcher event for '/9volt/cluster/members/365c6d2c/config/Lzl2b2x0L21vbml0b3IvbW9uaXRvcl9jb25maWdfNA==' (value: '/9volt/monitor/monitor_config_4')
DEBU[0000] director: Assigning check '/9volt/monitor/monitor_config_1' to member '365c6d2c'
FATA[0000] cluster-memberHeartbeat: Unable to create initial member dir: Creating member config dir failed: 102: Not a file (/9volt/cluster/members/365c6d2c/config) [47541]
exit status 1

Definitely a race of some sort; didn't gain any insight from a quick overview of things. Need to do a deep dive.

HTTP alerter

Idea - an alerter that can perform a GET/POST/PUT to somewhere on an incoming alert.

Write manager functionality

Write manager functionality.

Manager should:

  1. Monitor our own member dir - watch for adds/removes/updates
  2. Launch/shutdown checkers

Exec check

Jesse wrote the initial exec check - need to elaborate on it a bit more:

  • Should have a max runtime
  • Expected return code
  • Expected output

NRPE support

Not sure what all of this entails, leaving a ticket just to keep this in mind.

Would be nice if we were able to piggyback off of existing NRPE configs. Does this mean we implement an NRPE receiver? I don't recall exactly how NRPE works, but I recall the protocol being pretty simple.

Config translator utility

Need some sort of a utility for translating configs from something human readable and dropping them into etcd.

Write alerter functionality

Alerter should support different kinds of alerting mechanisms (slack, pagerduty, email to start?).

Spawn an alert channel; checks write to the alert channel when they want to send out an alert; alerter listens on alert channel for incoming messages and spins up alerter workers to handle the message sending via whatever alerting method.

We could cache the entirety of the alerting config in memory to save on a bunch of fetchAlertConfig() calls to etcd. Not sure yet. Bonus points for this. Also: if etcd ever goes away, so do the node's alerting abilities - so we should probably cache it from the get-go, so that we can survive an etcd outage/restart without (too many) issues.

Write basic checks

@jessedearing has already done a good chunk of it, but I don't remember where we left off.

Creating ticket to just track progress.

Checks that should be written out of the gate:

  1. Basic TCP check (w/ banner check?)
  2. HTTP Status code check
  3. HTTP content + status code check (maybe this goes into the main http check?)
  4. ICMP ... (this would be nice, but requires root privs...)

Bonus:

  1. MySQL check

Improve individual monitor config validation

monitor.go currently performs a singular "monitor config" validation check; improve this so that each individual monitor has its own validation check... or something along those lines.

ICMP check

Write an ICMP check.

Need to think this through as it'll require root privs.

Propagate internal errors/warnings to event queue

This should be pretty quick - propagate useful errors/warnings/state changes to the event queue.

Note: Lots of places where we simply log.Errorf() and move on - potentially good places to add a message to the event queue instead.

Update/improve/fix alert message contents

As of right now, the pagerduty alerter title event does not contain a whole lot of useful info (no idea what error condition occurred, what check, what host, etc). The slack alerter has a similar problem.

Should probably get that taken care of soon.

Make setup script idempotent

The setup script should refuse to overwrite the existing values in the etcd cluster unless a flag is passed in indicating the user is fully aware of what he/she is doing.

Expand alert messages

Alert messages are pretty plain - they should be a bit more informative (and ultimately have links that point to the actual state data (via the API?)). This probably makes the most sense to do after another alerter (or two) are complete.

Refactor and expand validation in monitor.go

Validation could be a lot more elegant in monitor.go (maybe relying on the check itself to expose some validation method via the interface?).

Need to refactor this and make it generally better.

DNS check

Write a DNS server check:

  • Check that given DNS server is able to resolve something
  • Maybe check that it's able to resolve something in a given timeframe?

SQL check

Don't know if this should be driver specific or left generic. Some ideas:

  • Run $query on given database server, expect results
  • Run $query on given database server, receive data in $period of time
  • I would really like to have something that can monitor SHOW SLAVE STATUS and ensure that replication is up
  • Likewise, it'd be nice to alert on lagging replication

So some questions here:

  • Separate check for different drivers?
  • Maybe two/three separate checks - sql_check, sql_repl_check, sql_stats_check?

Lots of room for discussion here.

Add 'sync' feature in 9volt-cfg

Right now, when you remove an entry from your YAML configs and run 9volt-cfg, it won't remove the old entries from etcd. I think this is unwanted behavior. I want to be able to run 9volt-cfg -e http://localhost:2379 . and have the latest and greatest configs applied - any others wiped.

Will stick this behind a --sync flag with a default set to true.

Unit tests

Once baseline functionality is in place, we can start working out unit tests.

Implement global error list

Implement some sort of a circular buffer error list that core components (ie. manager, director, cluster, etc.) can append to. The purpose of the global error list is to be able to expose it via the API and remove the necessity for users to sift through logs to determine "why did my config not get loaded" and so forth.
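A circular error buffer along those lines could look like this. The `ErrorList` type is a hypothetical sketch of the ticket's idea, not existing 9volt code:

```go
package main

import "fmt"

// ErrorList is a fixed-size circular buffer of recent error strings;
// once full, new entries overwrite the oldest.
type ErrorList struct {
	buf  []string
	next int
	full bool
}

func NewErrorList(size int) *ErrorList {
	return &ErrorList{buf: make([]string, size)}
}

// Add records an error message, evicting the oldest entry when full.
func (e *ErrorList) Add(msg string) {
	e.buf[e.next] = msg
	e.next = (e.next + 1) % len(e.buf)
	if e.next == 0 {
		e.full = true
	}
}

// Recent returns the stored entries, oldest first.
func (e *ErrorList) Recent() []string {
	if !e.full {
		return append([]string{}, e.buf[:e.next]...)
	}
	return append(append([]string{}, e.buf[e.next:]...), e.buf[:e.next]...)
}

func main() {
	el := NewErrorList(3)
	for _, m := range []string{"a", "b", "c", "d"} {
		el.Add(m)
	}
	fmt.Println(el.Recent()) // [b c d] — "a" was evicted
}
```

An API endpoint could then expose `Recent()` so users don't have to sift through logs. For concurrent writers (manager, director, cluster), the real implementation would need a mutex around Add/Recent.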

Figure out tag usage

Figure out how we can utilize "tags"; there's lots of options here - need to think through use cases.

Support 'disable' clause in checks

Setting 'enable' to 'false' allows a check to be disabled while staying defined in etcd. When a check is enabled, it will begin being monitored.

Ability to tag checks for specific nodes

Not sure how this functionality should work but the idea is to allow folks to tie a check to a specific set of nodes.

Maybe add node tags during startup via flags or env vars? And then the checks can be tied to those nodes?

Exec alerter

Idea - have an alerter that can exec some script/bin upon an alert.

Flesh out API

API should:

  • Work through ALL members of the cluster
  • Provide statistics about:
    • How many checks each member is responsible for
  • Allow you to:
    • (Un)silence alerts
    • Enable/disable checks
    • Perform CRUD operations for both alerts and checks (?)

UI: State view

Implement a "state" view - something akin to nagios main view which displays any warning/critical checks.

Some things it should have:

  • continuous update in the background
  • (selectable) pagination -- i.e. display X results per page; allow that to be selected in the UI
