
9volt's Introduction

9volt


A modern, distributed monitoring system written in Go.

Another monitoring system? Why?

While there are plenty of solutions for monitoring and alerting using time-series data, there aren't many (or any?) modern solutions for 'regular'/'old-school' remote monitoring similar to Nagios and Icinga.

9volt offers the following out of the box:

  • Single binary deploy
  • Fully distributed
  • Incredibly easy to scale to hundreds of thousands of checks
  • Uses etcd for all configuration
  • Real-time configuration pick-up (update etcd - 9volt immediately picks up the change)
  • Support for assigning checks to specific (groups of) nodes
    • Helpful for getting around network restrictions (or requiring certain checks to run from a specific region)
  • Interval-based monitoring (i.e. run check XYZ every 1s, 1d, 1y or even 1ms)
  • Natively supported monitors:
    • TCP
    • HTTP
    • Exec
    • DNS
  • Natively supported alerters:
    • Slack
    • Pagerduty
    • Email
  • RESTful API for querying current monitoring state and loaded configuration
  • Comes with a built-in, React-based UI that provides another way to view and manage the cluster
  • Comes with a built-in monitor and alerter config management util (that parses and syncs YAML-based configs to etcd)
    • ./9volt cfg --help

Usage

  • Install/setup etcd
  • Download latest 9volt release
  • Start server: ./9volt server -e http://etcd-server-1.example.com:2379 -e http://etcd-server-2.example.com:2379 -e http://etcd-server-3.example.com:2379
  • Optional: use 9volt cfg for managing configs
  • Optional: add 9volt to be managed by supervisord, upstart or some other process manager
  • Optional: Several configuration params can be passed to 9volt via env vars

... or, if you prefer to do things via Docker, check out these docs.

H/A and scaling

Scaling 9volt is incredibly simple: launch another 9volt service on a separate host and point it at the same etcd hosts as the main 9volt service.

Your main 9volt node will produce output similar to this when it detects a node join:

[screenshot: node-join log output]

Checks will be automatically divided among all 9volt instances.

If one of the nodes goes down, a new leader is elected (if the node that went down was the previous leader) and checks are redistributed among the remaining nodes.

This produces output similar to the following (also available in the event stream via the API and UI):

[screenshot: node-leave log output]
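The redistribution described above can be sketched as a simple round-robin assignment. This is a hypothetical helper (`distribute` is not 9volt's actual function; the real director logic may differ):

```go
package main

import "fmt"

// distribute assigns each check to a member round-robin — a sketch of
// how a director could spread checks across cluster members.
func distribute(members, checks []string) map[string][]string {
	assignments := make(map[string][]string)
	if len(members) == 0 {
		return assignments
	}
	for i, check := range checks {
		m := members[i%len(members)]
		assignments[m] = append(assignments[m], check)
	}
	return assignments
}

func main() {
	// Two members, three checks: node-a gets two, node-b gets one.
	fmt.Println(distribute(
		[]string{"node-a", "node-b"},
		[]string{"check-1", "check-2", "check-3"},
	))
}
```

When a member leaves, re-running the same assignment over the remaining members yields the redistributed layout.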

API

API documentation can be found here.

Minimum requirements (can handle ~1,000-3,000 <10s interval checks)

  • 1 x 9volt instance (1 core, 256MB RAM)
  • 1 x etcd node (1 core, 256MB RAM)

Note: In the minimum configuration, you can run both 9volt and etcd on the same node.

Recommended (production) requirements (can handle 10,000+ <10s interval checks)

  • 3 x 9volt instances (2+ cores, 512MB RAM)
  • 3 x etcd nodes (2+ cores, 1GB RAM)

Configuration

While you can manage 9volt alerter and monitor configs via the API, another approach to config management is to use the built-in config utility (9volt cfg <flags>).

This utility allows you to scan a given directory for any YAML files that resemble 9volt configs (the file must contain either 'monitor' or 'alerter' sections) and it will automatically parse, validate and push them to your etcd server(s).

By default, the utility will keep your local configs in sync with your etcd server(s). In other words, if the utility comes across a config entry in etcd that does not exist in your local config files, it will remove that entry from etcd (and vice versa). This behavior can be turned off with the --nosync flag.
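The sync pass boils down to a set difference: anything in etcd that no longer exists locally gets deleted. A minimal sketch (the `staleKeys` helper is hypothetical, not the actual utility code):

```go
package main

import "fmt"

// staleKeys returns config keys present in etcd but missing from the
// local YAML files — the entries a sync pass would remove from etcd.
func staleKeys(local, remote []string) []string {
	seen := make(map[string]bool, len(local))
	for _, k := range local {
		seen[k] = true
	}
	var stale []string
	for _, k := range remote {
		if !seen[k] {
			stale = append(stale, k)
		}
	}
	return stale
}

func main() {
	local := []string{"monitor/http-frontend"}
	remote := []string{"monitor/http-frontend", "monitor/old-check"}
	fmt.Println(staleKeys(local, remote)) // [monitor/old-check]
}
```

Running the same comparison in the other direction gives the entries that need to be pushed to etcd.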

[screenshot: cfg run output]

You can look at an example of a YAML-based config here.

Docs

Read through the docs dir.

Suggestions/ideas

Got a suggestion/idea? Something that is preventing you from using 9volt over another monitoring system because of a missing feature? Submit an issue and we'll see what we can do!

9volt's People

Contributors

caledhwa, dselans, jessedearing, jondowdle, relistan, talpert


9volt's Issues

Add event API endpoint

Add event API endpoints.

  • GET /api/v1/event -- fetch all events
  • GET /api/v1/event?type=foo,bar -- fetch events for the given type(s)
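A sketch of what the filter could look like. The `Event` shape and `filterEvents`/`eventHandler` names are hypothetical, not the real 9volt API code:

```go
package main

import (
	"encoding/json"
	"net/http"
	"strings"
)

// Event is a minimal, hypothetical event shape.
type Event struct {
	Type    string `json:"type"`
	Message string `json:"message"`
}

// filterEvents returns events matching the comma-separated type list;
// an empty list means "all events".
func filterEvents(events []Event, types string) []Event {
	if types == "" {
		return events
	}
	want := make(map[string]bool)
	for _, t := range strings.Split(types, ",") {
		want[t] = true
	}
	var out []Event
	for _, e := range events {
		if want[e.Type] {
			out = append(out, e)
		}
	}
	return out
}

// eventHandler serves GET /api/v1/event with an optional ?type= filter.
func eventHandler(events []Event) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(filterEvents(events, r.URL.Query().Get("type")))
	}
}

func main() {
	http.Handle("/api/v1/event", eventHandler([]Event{{Type: "error", Message: "boom"}}))
	// http.ListenAndServe(":8080", nil) // left commented so the sketch exits
}
```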

Multiple alerter problem

The fact that we allow multiple alerters for both critical and warning means that if someone configures "slack" in both the warning and critical lists, they will get two "resolve" notifications whenever a check resolves.

In other words:

WarningAlerter: ["slack", "email"]
CriticalAlerter: ["pagerduty"]

Would work fine - when the check resolves, it will send a resolve to pagerduty, to slack, and to email.

If however we configured:

WarningAlerter: ["slack", "email"]
CriticalAlerter: ["slack", "pagerduty"]

You will get two similar notifications in Slack about the resolve.

It's not horrible, but it is annoying. I can't yet think of a way around this case without tearing down the ability to have multiple alerters...
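One possible fix is to deduplicate the union of the two alerter lists before sending resolves. A hedged sketch (`resolveTargets` is hypothetical, not 9volt code):

```go
package main

import "fmt"

// resolveTargets merges the warning and critical alerter lists and drops
// duplicates, so each alerter receives exactly one resolve notification.
func resolveTargets(warning, critical []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, a := range append(append([]string{}, warning...), critical...) {
		if !seen[a] {
			seen[a] = true
			out = append(out, a)
		}
	}
	return out
}

func main() {
	// "slack" appears in both lists but only gets one resolve.
	fmt.Println(resolveTargets(
		[]string{"slack", "email"},
		[]string{"slack", "pagerduty"},
	)) // [slack email pagerduty]
}
```

This keeps multiple alerters intact while collapsing the duplicate-resolve case described above.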

Add "DisableUntil" functionality

Not 100% convinced this is needed (or something that should be controlled via an external entity) but dropping it here so we don't forget about it.

Single or multiple dalClients?

Not sure if there is any benefit or downside to using a single dalClient vs. spawning one for each "big" component. I believe I read that the etcd client is thread-safe, but I am not sure if there are any other implications to using the same one for performing work across 20 goroutines.

Need to do some research

UI: Clean up/update event view

Event view should differentiate between different types of events (by displaying a different icon). Potentially provide a 'filter' or a 'search' to only display events that match certain strings.

UI

Let's figure out how the UI will work.

  • Is it a separate application?
  • Is it client-side JS bundled with the 9volt server?
  • Does the UI talk to the API or etcd directly?
  • What is the basic functionality the UI needs?

TCP check

Write a TCP check that performs an outbound connection to a specific port and optionally can expect a string.

Write state functionality

State is going to be updated by checks.

State is used for being able to provide the API with information about what the member/cluster is doing.

Idea is that the state should be written out (periodically) by each member to etcd; API then accesses and exposes data from etcd.

Semi-random initial startup failures

Seeing sort of random startup failures:

fullstop:9volt dselans$ go run main.go -e http://localhost:2379 -d
DEBU[0000] cluster: Launching cluster engine components...
DEBU[0000] director: Launching director components...
INFO[0000] manager: Starting manager components...
INFO[0000] alerter: Starting alerter components...
DEBU[0000] cluster: Launching director heartbeat...
DEBU[0000] cluster: Launching member monitor...
INFO[0000] 9volt has started! API address: http://0.0.0.0:8080 MemberID: 365c6d2c
DEBU[0000] cluster: Launching director monitor...
DEBU[0000] api: Starting API server
DEBU[0000] cluster: Launching member heartbeat...
INFO[0000] cluster-directorMonitor: Not a director, but etcd says we are (updating state)!
INFO[0000] cluster-directorMonitor: Taking over director role
INFO[0000] director-stateListener: Starting up etcd watchers
DEBU[0000] director-distributeChecks: Performing member existence verification
DEBU[0000] director-checkConfigWatcher: Launching...
DEBU[0000] CollectCheckStats: map[]
DEBU[0000] director-verifyMemberExistence: Detected 'set' action for key /9volt/cluster/members/365c6d2c
INFO[0000] director-distributeChecks: Performing check distribution across members in cluster
DEBU[0000] director-distributeChecks: Distributing checks between 1 cluster members
DEBU[0000] director: Assigning check '/9volt/monitor/monitor_config_4' to member '365c6d2c'
DEBU[0000] manager: Received a 'set' watcher event for '/9volt/cluster/members/365c6d2c/config/Lzl2b2x0L21vbml0b3IvbW9uaXRvcl9jb25maWdfNA==' (value: '/9volt/monitor/monitor_config_4')
DEBU[0000] director: Assigning check '/9volt/monitor/monitor_config_1' to member '365c6d2c'
FATA[0000] cluster-memberHeartbeat: Unable to create initial member dir: Creating member config dir failed: 102: Not a file (/9volt/cluster/members/365c6d2c/config) [47541]
exit status 1

Definitely a race of some sort; didn't gain any insight from a quick overview of things. Need to do a deep dive.

HTTP alerter

Idea - an alerter that can perform a GET/POST/PUT to somewhere on an incoming alert.

Write manager functionality

Write manager functionality.

Manager should:

  1. Monitor our own member dir - watch for adds/removes/updates
  2. Launch/shutdown checkers

Exec check

Jesse wrote the initial exec check - need to elaborate on it a bit more:

  • Should have a max runtime
  • Expected return code
  • Expected output

NRPE support

Not sure what all of this entails, leaving a ticket just to keep this in mind.

Would be nice if we were able to piggyback off of existing NRPE configs. Does this mean we implement an NRPE receiver? I don't recall exactly how NRPE works, but I recall the protocol being pretty simple.

Config translator utility

Need some sort of a utility for translating configs from something human readable and dropping them into etcd.

Write alerter functionality

Alerter should support different kinds of alerting mechanisms (slack, pagerduty, email to start?).

Spawn an alert channel; checks write to the alert channel when they want to send out an alert; alerter listens on alert channel for incoming messages and spins up alerter workers to handle the message sending via whatever alerting method.

We could cache the entirety of the alerting config in memory to save on a bunch of fetchAlertConfig() calls to etcd. Not sure yet. Bonus points for this. Also: if etcd ever goes away, so do the node's alerting abilities - so we should probably cache it from the get-go, so that we can survive an etcd outage/restart without (too many) issues.

Write basic checks

@jessedearing has already done a good chunk of it, but I don't remember where we left off.

Creating ticket to just track progress.

Checks that should be written out of the gate:

  1. Basic TCP check (w/ banner check?)
  2. HTTP Status code check
  3. HTTP content + status code check (maybe this goes into the main http check?)
  4. ICMP ... (this would be nice, but requires root privs...)

Bonus:

  1. MySQL check

Improve individual monitor config validation

monitor.go currently performs a singular "monitor config" validation check; improve this so that each individual monitor has its own validation check... or something along those lines.

ICMP check

Write an ICMP check.

Need to think this through as it'll require root privs.

Propagate internal errors/warnings to event queue

This should be pretty quick - propagate useful errors/warnings/state changes to the event queue.

Note: Lots of places where we simply log.Errorf() and move on - potentially good places to add a message to the event queue instead.

Update/improve/fix alert message contents

As of right now, the pagerduty alerter title event does not contain a whole lot of useful info (no idea what error condition occurred, what check, what host, etc). The slack alerter has a similar problem.

Should probably get that taken care of soon.

Make setup script idempotent

The setup script should refuse to overwrite the existing values in the etcd cluster unless a flag is passed in indicating the user is fully aware of what he/she is doing.

Expand alert messages

Alert messages are pretty plain - they should be a bit more informative (and ultimately have links that point to the actual state data (via the API?)). This probably makes the most sense to do after another alerter (or two) are complete.

Refactor and expand validation in monitor.go

Validation could be a lot more elegant in monitor.go (maybe relying on the check itself to expose some validation method via the interface?).

Need to refactor this and make it generally better.

DNS check

Write a DNS server check:

  • Check that given DNS server is able to resolve something
  • Maybe check that it's able to resolve something in a given timeframe?

SQL check

Don't know if this should be driver specific or left generic. Some ideas:

  • Run $query on given database server, expect results
  • Run $query on given database server, receive data in $period of time
  • I would really like to have something that can monitor SHOW SLAVE STATUS and ensure that replication is up
  • Likewise, it'd be nice to alert on lagging replication

So some questions here:

  • Separate check for different drivers?
  • Maybe two/three separate checks - sql_check, sql_repl_check, sql_stats_check?

Lots of room for discussion here.

Add 'sync' feature in 9volt-cfg

Right now, when you remove an entry from your YAML configs and run 9volt-cfg, it won't remove the old entries from etcd. I think this is unwanted behavior. I want to be able to run 9volt-cfg -e http://localhost:2379 . and have the latest and greatest configs applied - any others wiped.

Will stick this behind a --sync flag with a default set to true.

Unit tests

Once baseline functionality is in place, we can start working out unit tests.

Implement global error list

Implement some sort of a circular buffer error list that core components (ie. manager, director, cluster, etc.) can append to. The purpose of the global error list is to be able to expose it via the API and remove the necessity for users to sift through logs to determine "why did my config not get loaded" and so forth.
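A circular error buffer along those lines could look like this. The `ErrorList` type is a hypothetical sketch of the ticket's idea, not existing 9volt code:

```go
package main

import "fmt"

// ErrorList is a fixed-size circular buffer of recent error strings;
// once full, new entries overwrite the oldest.
type ErrorList struct {
	buf  []string
	next int
	full bool
}

func NewErrorList(size int) *ErrorList {
	return &ErrorList{buf: make([]string, size)}
}

// Add records an error message, evicting the oldest entry when full.
func (e *ErrorList) Add(msg string) {
	e.buf[e.next] = msg
	e.next = (e.next + 1) % len(e.buf)
	if e.next == 0 {
		e.full = true
	}
}

// Recent returns the stored entries, oldest first.
func (e *ErrorList) Recent() []string {
	if !e.full {
		return append([]string{}, e.buf[:e.next]...)
	}
	return append(append([]string{}, e.buf[e.next:]...), e.buf[:e.next]...)
}

func main() {
	el := NewErrorList(3)
	for _, m := range []string{"a", "b", "c", "d"} {
		el.Add(m)
	}
	fmt.Println(el.Recent()) // [b c d] — "a" was evicted
}
```

An API endpoint could then expose `Recent()` so users don't have to sift through logs. For concurrent writers (manager, director, cluster), the real implementation would need a mutex around Add/Recent.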

Figure out tag usage

Figure out how we can utilize "tags"; there's lots of options here - need to think through use cases.

Support 'disable' clause in checks

Setting 'enable' to 'false' allows a check to be disabled while staying defined in etcd. When a check is enabled, it will begin being monitored.

Ability to tag checks for specific nodes

Not sure how this functionality should work but the idea is to allow folks to tie a check to a specific set of nodes.

Maybe add node tags during startup via flags or env vars? And then the checks can be tied to those nodes?

Exec alerter

Idea - have an alerter that can exec some script/bin upon an alert.

Flesh out API

API should:

  • Work through ALL members of the cluster
  • Provide statistics about:
    • How many checks each member is responsible for
  • Allow you to:
    • (Un)silence alerts
    • Enable/disable checks
    • Perform CRUD operations for both alerts and checks (?)

UI: State view

Implement a "state" view - something akin to nagios main view which displays any warning/critical checks.

Some things it should have:

  • continuous update in the background
  • (selectable) pagination -- i.e. display X results per page; allow that to be selected in the UI
