skx / overseer

A golang-based remote protocol tester for testing sites & service availability

License: GNU General Public License v2.0

Go 99.40% Shell 0.60%

golang monitoring networking testing

overseer's Introduction

Steve Kemp

I've been programming for over half my life, and I'm comfortable creating, developing, maintaining, and improving software written in C, C++, Emacs Lisp, Perl, Ruby, Java, Shell, TCL, etc.
- Most of my new projects are written in Golang.
- But also I've been writing code in Z80 assembly language, to run under CP/M and the humble ZX Spectrum 48k.
My interests primarily revolve around compilers, interpreters, domain-specific languages, and virtual machines.
- Examples of scripting languages include a simple BASIC, a simple FORTH, a simple Lisp, and a simple TCL.
- DSLs are great tools for automation, etc.
Location: Helsinki, Finland.
Occupation: Sysadmin / Devops / Cloud-person.

overseer's Issues

Add dns probe

This should allow specifying a name-server, a name/type to lookup, and an expected result.

Examine fork, and take "inspiration" :)

Nice fork of this project here:

https://github.com/cmaster11/overseer

Need to examine it in detail, but this change looks great, and worth folding back:

cmaster11@47ffe39

Add queue-length test.

For paranoia purposes I have a cronjob which monitors the queue length of pending-jobs.

Add a test to overseer that measures the queue length, and it can monitor itself. Unless the worker(s) are dead, of course!

It should be possible to have persistent notifiers

Right now there are three notifiers:

IRC
MQ
Purppura

The first two do the same thing when a notification happens:

Join/connect to a server
Send a notification.
Disconnect

It would be neater if they could join in their constructor, and remain connected thereafter.

So the IRC notifier would join the channel and idle
The MQ notifier would connect to the queue.

And only at the notification stage would they actually notify. I'd suggest changing/extending the notifier class to have:

Setup()
Notify()

In the case of purppura the Setup() method would be a NOP, but for IRC/MQ they would join/connect there, and trigger notifications on demand via those persistent connections.
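
The Setup()/Notify() split described above could look roughly like the following. This is a minimal sketch: the interface name, the fakeMQ type, and the nopNotifier type are hypothetical stand-ins, not overseer's actual code, and no real IRC or MQ connection is made.

```go
package main

import (
	"errors"
	"fmt"
)

// Notifier is a hypothetical interface following the proposed split:
// connect once in Setup, then reuse that connection in Notify.
type Notifier interface {
	Setup() error
	Notify(msg string) error
}

// fakeMQ stands in for a real MQ client so the sketch runs without a server.
type fakeMQ struct {
	connects int
	sent     []string
}

// Setup records a connection attempt; a real notifier would connect here.
func (f *fakeMQ) Setup() error {
	f.connects++
	return nil
}

// Notify sends over the existing connection and refuses to run before Setup.
func (f *fakeMQ) Notify(msg string) error {
	if f.connects == 0 {
		return errors.New("notify called before setup")
	}
	f.sent = append(f.sent, msg)
	return nil
}

// nopNotifier models the purppura case, where Setup has nothing to do.
type nopNotifier struct{ sent []string }

func (n *nopNotifier) Setup() error { return nil }

func (n *nopNotifier) Notify(msg string) error {
	n.sent = append(n.sent, msg)
	return nil
}

func main() {
	mq := &fakeMQ{}
	var n Notifier = mq
	if err := n.Setup(); err != nil {
		panic(err)
	}
	for _, m := range []string{"test a failed", "test b failed"} {
		if err := n.Notify(m); err != nil {
			panic(err)
		}
	}
	// One connection, two notifications.
	fmt.Println(mq.connects, len(mq.sent))
}
```

The point of the shape is that the connection cost is paid once per process rather than once per notification.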

We use both mq & redis - use only one.

We use the MQ-queue for posting results.
We use redis as a queue for jobs.

It seems odd to use two queues, so it makes sense we should only use one.

I'd lean towards using redis for both ..

Later arguments should override earlier ones

The following input gives surprising results:

 https://steve.fi/ must run http with status 301 with status 302 with status 'any'

I expect the end result to be status equal to any; however, the parser parses the arguments in reverse, and the actual value of status is 301.

This becomes important once executable/macro-aware files are added, as per #25.
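
The expected behaviour can be sketched as a left-to-right walk over the "with NAME VALUE" pairs, where each assignment overwrites any earlier one. The regular expression and the parseArgs helper are assumptions made for this illustration and are not overseer's real parser.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// withArg matches one "with NAME VALUE" pair; VALUE may be single-quoted.
// The pattern is an assumption for this sketch, not overseer's grammar.
var withArg = regexp.MustCompile(`with\s+(\S+)\s+('[^']*'|\S+)`)

// parseArgs walks the pairs in order of appearance, so a later value for
// the same name replaces an earlier one.
func parseArgs(line string) map[string]string {
	args := make(map[string]string)
	for _, m := range withArg.FindAllStringSubmatch(line, -1) {
		args[m[1]] = strings.Trim(m[2], "'")
	}
	return args
}

func main() {
	line := "https://steve.fi/ must run http with status 301 with status 302 with status 'any'"
	fmt.Println(parseArgs(line)["status"]) // prints "any"
}
```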

Add IRC and/or XMPP notifiers

These will just issue updates on failures.

HTTP test needs to be improved

The HTTP test is a little special because it does its own DNS resolution: you can't easily call an HTTP site by IP address.

i.e. "get http://example.com" is different from "get http://127.0.0.1". It could be possible to send the suitable Host: header, but I chose to do it a different way.

Implementation distractions aside, we look up both IPv4 and IPv6 addresses but only run the test against the first. We should test all of them.

Password details logged when DNS lookups fail.

We allow users to define tests with parameters, in the input file. The parameters are used for the more specific tests, such as testing SMTP, IMAP, etc.

When a test is executed we submit the result of that test to redis, where it is assumed another component will route the result to the appropriate person (if the test fails. Typically nobody cares about tests that pass).

It has never been an explicit goal of the project to assume any particular deployment, but it has been assumed that results might be sent to "junior" people, so an attempt was made to censor password-parameters that were used in tests, before they were submitted to Redis.

I recently migrated my email from my own server to Google, and once that migration was complete I was surprised to receive an SMS which contained my cleartext password.

Given input like this:

mail.steve.org.uk must run smtp with port 587 with username 'steve' with password 'secret'

I had expected an alert to show with password 'CENSORED'. However there is a case when this censoring doesn't happen: When DNS resolution fails. After my migration was complete I dropped the DNS-record for mail.steve.org.uk, and that meant the test couldn't be executed because the target couldn't be determined. That short-circuited the testing-process and resulted in an alert of the form:

Test failed - mail.steve.org.uk could not be resolved

mail.steve.org.uk must run smtp with port 587 with username 'steve' with password 'secret'

This isn't a major security issue, because there is no explicit threat-model defined, but it is worth noting and fixing for the future.

Our FTP probe could be improved

Right now we:

Connect
Look for a 220 banner.

Instead we should allow actually retrieving a file, post-login, and content-matching the response as we do in the HTTP-test.

Allow processing executable files

If an input file is regular then parse as-is.
If an input file is executable then parse the output of running it

This allows macros:

 #!/bin/sh
 /usr/bin/cpp <<EOF

 #define tinc(x) x must run tcp with port 655

 tinc(1.2.3.4)
 tinc(2.3.4.5) with port 666
 EOF

HTTP-probe should allow regexp testing of body

I'm currently testing the blogspam stats via:

  https://blogspam.net/xml/stats must run http with content '<spam>'

That works, but it would be better to test that actual numbers were returned.

To resolve this I'd suggest we allow:

  with regexp '<spam>[0-9]+</spam>'

The regexp option would work like the existing content option, but would be a regular expression which must match.
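
A minimal sketch of such a check, using Go's standard regexp package. The bodyMatches helper is hypothetical. The pattern used here carries no `^`/`$` anchors, since anchors placed in the middle of a pattern would stop it from matching a tag embedded in a larger body.

```go
package main

import (
	"fmt"
	"regexp"
)

// bodyMatches reports whether the response body matches the given pattern.
// An invalid pattern is returned as an error rather than a failed test.
func bodyMatches(pattern, body string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.MatchString(body), nil
}

func main() {
	body := "<stats><spam>1234</spam><ok>99</ok></stats>"
	ok, err := bodyMatches(`<spam>[0-9]+</spam>`, body)
	fmt.Println(ok, err)
}
```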

Allow timeout settings

There are two commands that work with tests:

overseer local [file1 .. fileN]
overseer worker -redis=central.queue.example.com:6379

Both should allow the timeout period to be specified.

We don't need the notifier/ package

This was useful when we had different notification methods, but now it splits the code in ways that aren't required.

We could drop that, and move the Notify method into cmd_worker.go.

What would be useful is to have a redis_queue package for adding/awaiting entries, that could be used by the enqueue/worker processes.

Deploy a telegram bot ..

At the moment we have a couple of bridges:

When failures occur post to purppura
When failures occur post to IRC

We should add a new one:

When failures occur post to telegram

This will be the default bridge, as it is common to use telegram, and this will mean we can show a decent docker setup without any external dependency (beyond the telegram bot config).

Delivery of notifications via MQ sometimes fails on the last test which is applied.

NOTE: Before reading this particular issue note that it is 100% specific to the "local" worker

This morning I received a pair of alerts informing me that blogspam had an SSL certificate nearing expiration (one for IPv4 and one for IPv6). This alert was expected, and once I renewed the certificate I expected the notifications to clear, but they did not.

The test is in one file:

   root@www ~ # cat /opt/overseer/tests.d/blogspam.conf 
   # BlogSpam
   https://blogspam.net/xml/stats must run http with content '<spam>'

Running this test manually, like so, should have triggered an MQ notification:

    root@www ~ # /opt/overseer/bin/overseer local -verbose -mq localhost:1883 /opt/overseer/tests.d/blogspam.conf 
    Running 'http' test against blogspam.net (2a01:4f8:151:6083::101)
    SSLExpiration testing: blogspam.net:443
    SSLExpiration - certificate: blogspam.net expires in 2158 hours (89 days)
    SSLExpiration - certificate: Let's Encrypt Authority X3 expires in 24899 hours (1037 days)
    SSLExpiration - certificate: DST Root CA X3 expires in 29625 hours (1234 days)
    	[1/5] - Test passed.
    Running 'http' test against blogspam.net (176.9.183.101)
    SSLExpiration testing: blogspam.net:443
    SSLExpiration - certificate: blogspam.net expires in 2158 hours (89 days)
    SSLExpiration - certificate: Let's Encrypt Authority X3 expires in 24899 hours (1037 days)
    SSLExpiration - certificate: DST Root CA X3 expires in 29625 hours (1234 days)
    	[1/5] - Test passed.

So what went wrong? Well this is what should have happened:

Open the notifier (i.e. MQ connection)
Parse the tests.
- For each test run it
- For each test publish the result over MQ
No more tests? Exit

It's the last bit that is the problem:

The result of the final test was published to MQ
The process exited

However the MQ publishing didn't await an ack, or confirmation, so the actual action was:

Fire the message at MQ
Exit
- Before that message was delivered to MQ.

This behaviour explains why the overseer worker mode of operation wasn't affected - because in that mode the worker keeps running forever, and the persistent notification setup (as implemented in #17) meant that there was no join/part to the MQ server.

In fact if you look at an older commit you can see where I added some code to work around this problem:

//
// This seems to be necessary ..  Sigh
//
    time.Sleep(500 * time.Millisecond)

Adding a sleep is a bad solution because you never know how long you need to sleep - what you actually need to do is await the MQ-delivery, or otherwise have an acknowledgement of some kind.

In conclusion:

Our MQ-publish must await a successful delivery.
- If that means a new/different client library then so be it.

Drop the local-mode

Currently there are two modes of operation:

local
- Parses test-files and locally executes them.
enqueue + worker
- Parses test-files, adds them to a queue, then fetches them.

To ease documentation, and simplify things generally I think the first mode should go away. This was flagged up in #27 although that was a distinct issue with transient MQ failures caused by the local-process exiting before MQ messages had been submitted.

Panic in worker

If using a distributed setup, i.e. redis, then the following is enough to crash a worker:

    $ overseer enqueue ./pop.txt
    $ overseer worker -verbose

Where pop.txt contains:

    pop.gmx.com must run pop3
    pop.gmx.com must run pop3s insecure

This is a fun one to diagnose. It turns out that run_test_string is invoked to parse the output from redis - which is breaking state - and the regexp used in that parsing is [a-z]. So pop3 doesn't match in full; instead the system wants to invoke the pop handler. We've recently decided to trust the parser, such that run_test looks up the handler and invokes it:

tmp := protocols.ProtocolHandler(test_type)
tmp.SetLine(input)

So this bug comes from two things:

Parsing the input outside the parser - and not handling numbers.
Not checking for validity.

Suggest we update the parser package to export:

New()
ParseFile()
ParseLine()

Then the run_test_string can go away, and the worker can use ParseLine. All existing users of Parse will be updated to use ParseFile instead.

This will cleanup our API, avoid breaking state - by parsing outside parser/ - and allow the crash to be resolved.
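
Both halves of the bug can be reproduced in a small sketch: a letters-only pattern truncates pop3 to pop, and a validity check turns the resulting unknown protocol into an error instead of a crash. The regular expressions, the handlers table, and this parseLine signature are assumptions for illustration, not the real parser package.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// lettersOnly mimics the problematic [a-z] protocol match.
	lettersOnly = regexp.MustCompile(`^(\S+)\s+must\s+run\s+([a-z]+)`)
	// withDigits also accepts digits, so "pop3" is captured whole.
	withDigits = regexp.MustCompile(`^(\S+)\s+must\s+run\s+([a-z0-9]+)`)
)

// handlers is a hypothetical registry of known protocol names.
var handlers = map[string]bool{"pop3": true, "pop3s": true, "http": true}

// parseLine extracts target and protocol, rejecting unknown protocols
// rather than handing a missing handler to the caller.
func parseLine(re *regexp.Regexp, line string) (target, proto string, err error) {
	m := re.FindStringSubmatch(line)
	if m == nil {
		return "", "", fmt.Errorf("unparsable line: %q", line)
	}
	if !handlers[m[2]] {
		return "", "", fmt.Errorf("unknown protocol %q", m[2])
	}
	return m[1], m[2], nil
}

func main() {
	line := "pop.gmx.com must run pop3"
	_, _, err := parseLine(lettersOnly, line)
	fmt.Println("letters only:", err)
	target, proto, err := parseLine(withDigits, line)
	fmt.Println("with digits:", target, proto, err)
}
```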

Notification system could do with some love

Currently we have a single method, in core.go for raising/clearing events.

This should be moved to something more flexible for other users.

Protocol-tests should be self-documenting

We should add an Example() method to the protocol-handler API, to output a human-readable overview of the test.

This would allow viewing the documentation & examples with commands like this:

 overseer examples

Or:

 overseer examples ftp

We should be able to read a configuration file.

Most of the commands are very similar, so we should be able to read the same file for all of them.

Perhaps a JSON file? Of course overriding the defaults via the command-line "-config $file" might be hard, so cheat and use $OVERSEER to point to a configuration file ..

SSL-fetch should fail if certificate is expiring "soon".

This is something I'm missing.

HTTP Probe should support making POST requests

We'll control this via:

   with data 'foo=this will be posted'

Merge in tag-support

The only part of this code, in production, which is not published is the tag-support. Mostly because it was a quick solution to the problem of determining whether a test-failure is global, or specific to one location.

Assume you're running distributed testing, from multiple AWS regions, and you see a failure. You might want to look for that same failure from other nodes.

If you keep state in your alert-gateway that's trivial, but you do need to be able to tag "identical" tests such that you can recognize they're identical. The way I do that is adding a per-node tag to each test-result.

Merge in from git.steve.org.uk to github.com. Then we're all open and stuff :)

Allow sending emails on failure.

overseer stores the results of tests in a redis-queue. From there we expect an external process to pop the notifications off and handle them.

We currently have two examples which do that; one to post to my own monitoring system purppura, and one to post to an IRC channel. It might be nice to have an example of sending emails on failure.

Email authentication will be hard, so we'll just assume we can invoke /usr/lib/sendmail, or /usr/sbin/sendmail as appropriate.

Need to update the bridge-documentation to include this, also note that the testing is stateless so we might have a lot of emails for a flapping service.

Parser should reject bogus options

Most of our protocol tests allow extra options to be specified:

   target.example.com must run $SERVICE with OPTION_NAME OPTION_VALUE

The protocol tests know which options they support, so this should be a parser error, because header is not an option name which is currently supported:

   http://example.com/ must run http with header "Foo: 1"

Add a method to our protocol-test to return supported-options and we can rule out lines that have bogus entries.

Allow the use of redis password for queue-operations

Currently we assume that the redis-host used for the queue-based operations doesn't need a password.

That should be permitted.

Passwords leaked in the notifiers

Assuming you have the following input:

  imap.company.com must run imap with username '[email protected]' with password 'secret'

If you're using the MQ / Purppura notifiers then they will receive a copy of the input. In the case of MQ you'll see this logged:

    {"input":"1.2.3.4 must run imaps with password 'secret' with username '[email protected]'",
     ...}

This is because the raw input is given to the notifier. We should censor out passwords (as used in MySQL, HTTP, POP3(s), IMAP(s), etc) in our notifications.
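
Censoring could be done on the raw input before it is handed to any notifier. The sketch below assumes a hypothetical censor helper and a regular expression written for this example. It replaces the value with 'CENSORED' wherever the password argument appears in the line.

```go
package main

import (
	"fmt"
	"regexp"
)

// passwordArg matches the word "password" followed by a quoted or bare value.
// The pattern is an assumption for this sketch.
var passwordArg = regexp.MustCompile(`(\bpassword\s+)('[^']*'|"[^"]*"|\S+)`)

// censor replaces every password value in a test line with 'CENSORED'.
func censor(input string) string {
	return passwordArg.ReplaceAllString(input, "${1}'CENSORED'")
}

func main() {
	line := "1.2.3.4 must run imaps with password 'secret' with username 'steve'"
	fmt.Println(censor(line))
}
```

Applying this at the point where the test is first parsed, rather than just before sending, also covers early exits such as a failed DNS lookup.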

Remove the notifiers

I'm leaning towards simplicity and removing all notifiers - instead solely announcing pass/fails via an MQ-queue.

External processes can watch the queue and inject results elsewhere easily enough, and this cuts down on complexity.

Redis connection to unix-socket

Hello,
I really like this idea of monitoring systems. (I don't want any graphics etc. Just a mail or so if something is broken.)

I want to run the software on a shared host. So it would be cool to use a redis server over a unix socket.

Unfortunately I didn't find a way to enable it. Am I missing something?

Kind regards

Our implementation could be improved

Right now every protocol handler has three methods:

RunTest( string target )
SetInput
SetOptions

Setting the objects is clearly pointless if we could pass them into the RunTest method. Similarly the input-line should be part of the test.

I'd suggest we rework the API such that the protocol-testing methods would have the single method:

RunTest(tst parser.Test, target string, opts TestOptions ) error

i.e. We pass the test-Object into the method, and the test-options too.

Similarly each method calls ParseArguments to parse ports, passwords, etc. That should be a job done by the parser - half the reason for unifying the options was so that we'd only need to parse them once.

So our updated parser.Test object would look like this:

   type Test struct {
   	Target string
   	Type   string
   	Input  string
   	Arguments map[string]string
   }

i.e. We'd add input rather than using SetInput, and we'd add the map of arguments.

Lots of churn, but should be a simple change and one that benefits us.
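
The reworked API might be sketched as follows. The Test struct mirrors the one above; the TestOptions fields, the ProtocolTest interface name, and the toy portCheck handler are assumptions, and no network access takes place.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Test mirrors the struct proposed above.
type Test struct {
	Target    string
	Type      string
	Input     string
	Arguments map[string]string
}

// TestOptions is a hypothetical set of global options.
type TestOptions struct {
	Verbose bool
	Timeout time.Duration
}

// ProtocolTest is the single-method handler interface suggested above.
type ProtocolTest interface {
	RunTest(tst Test, target string, opts TestOptions) error
}

// portCheck is a toy handler: it only checks that a port argument was
// supplied by the parser, standing in for a real network probe.
type portCheck struct{}

func (portCheck) RunTest(tst Test, target string, opts TestOptions) error {
	if tst.Arguments["port"] == "" {
		return errors.New("no port given for " + target)
	}
	return nil
}

func main() {
	var h ProtocolTest = portCheck{}
	tst := Test{
		Target:    "ftp.example.com",
		Type:      "ftp",
		Input:     "ftp.example.com must run ftp with port 2121",
		Arguments: map[string]string{"port": "2121"},
	}
	err := h.RunTest(tst, "192.0.2.1", TestOptions{Timeout: 10 * time.Second})
	fmt.Println(err)
}
```

Since the arguments arrive already parsed, a handler no longer needs its own ParseArguments step or any setter methods.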

Our SMTP test should allow authentication validation

Rather than merely connecting to the SMTP-port we should also test a username/password combination.

This will require using STARTTLS, because the golang net/smtp won't even try a (plain) auth attempt without it.

Add tftp probe

Should be simple.

HTTP-probe should support basic-authentication

Via with username 'xx' with password 'yy'.

This site can be used for testing:

https://jigsaw.w3.org/HTTP/Digest/

Further improve argument parsing

We've now updated each protocol-test with a list of arguments that it can support. This improves our parser because it means this is an error that is discovered at parse-time:

  ftp.example.com must run ftp with username moi

This fails because the FTP-probe only supports a single, optional, argument, port:

   ftp.example.com must run ftp with port 2121

We can improve things further by restricting the arguments to match patterns, i.e. to differentiate between these two cases:

ftp.example.com must run ftp with port ftp
ftp.example.com must run ftp with port 21

In this case we'd change from port being a valid argument to port must match \d+.

In short our API would return a map of names/regexps, not just an array of names.
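
A map of names to patterns could be checked along these lines. The ftpArgs table and the validateArgs helper are hypothetical, sketched only to show the idea of validating values at parse time.

```go
package main

import (
	"fmt"
	"regexp"
)

// ftpArgs lists the arguments a hypothetical FTP probe accepts, each with
// the pattern its value must match.
var ftpArgs = map[string]string{
	"port": `^[0-9]+$`,
}

// validateArgs rejects unknown argument names and values that do not
// match the pattern registered for their name.
func validateArgs(allowed, given map[string]string) error {
	for name, value := range given {
		pattern, ok := allowed[name]
		if !ok {
			return fmt.Errorf("unsupported argument %q", name)
		}
		matched, err := regexp.MatchString(pattern, value)
		if err != nil {
			return err
		}
		if !matched {
			return fmt.Errorf("argument %q: value %q does not match %s", name, value, pattern)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateArgs(ftpArgs, map[string]string{"port": "21"}))
	fmt.Println(validateArgs(ftpArgs, map[string]string{"port": "ftp"}))
	fmt.Println(validateArgs(ftpArgs, map[string]string{"username": "moi"}))
}
```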

The HTTP test is more complex than it needs to be.

Unlike all other tests the HTTP-test performs its own DNS-resolution, this is because we actually always test against IP address rather than hostnames.

So the code to process jobs has to lookup the targets and enqueue a test for each discovered IPv4/IPv6 address. However resolving "http://example.com" fails, so I chose to make the code special-case the HTTP test.

Instead we should look to see if the target is a URI, and if so lookup against the hostname. That would also allow:

  ftp://ftp.example.com/ must run ftp

Instead of just:

 ftp.example.com must run ftp

URIs cannot have custom ports

The following fails:

   http://localhost:8080/ must run http

There is a DNS-failure trying to resolve the hostname localhost:8080. Even if that worked though we have a second issue - we look at the URI to find the port:

port := 80
if strings.HasPrefix(tst.Target, "https:") {
	port = 443
}

So that needs updating too.
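
Both problems can be addressed by letting the standard net/url package split the target. The hostAndPort helper below is a hypothetical sketch: it returns the bare hostname for DNS resolution and prefers an explicit port over the scheme default.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// hostAndPort splits a URI target into the hostname to resolve and the
// port to connect to. An explicit port wins over the scheme default.
func hostAndPort(target string) (string, int, error) {
	u, err := url.Parse(target)
	if err != nil {
		return "", 0, err
	}
	port := 80
	if u.Scheme == "https" {
		port = 443
	}
	if p := u.Port(); p != "" {
		port, err = strconv.Atoi(p)
		if err != nil {
			return "", 0, err
		}
	}
	return u.Hostname(), port, nil
}

func main() {
	for _, t := range []string{"http://localhost:8080/", "https://steve.fi/"} {
		host, port, err := hostAndPort(t)
		fmt.Println(host, port, err)
	}
}
```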

Add staticcheck to the CI script.

I ported over the CI system to the new YAML-syntax, but didn't add the staticcheck tool as I did for some other projects - largely because there were a lot of issues ;)

Add the check.

Add pop3 probe

This should be simple.

Missing protocol-test for psql

This should be simple enough, and should be based upon the MySQL probe, I guess.

Misleading error message for SSL certificate expiration

Not necessarily an issue, because I think it's really an edge case, but reporting for the sake of knowledge.

On May 30th a root CA SSL cert will expire and will be replaced by a new one, valid through 2038, which is already available.

https://support.sectigo.com/articles/Knowledge/Sectigo-AddTrust-External-CA-Root-Expiring-May-30-2020

The curious case here is that I've been running Overseer on a quite old Docker build (~180 days), which probably did not yet include the new CA certs. Because of this, Overseer started reporting an expiring SSL cert for my domain (24 days left), while in reality it was the CA cert (11 days) that was about to expire.

So the issue is just that Overseer did not report that the expiring SSL cert was the CA's rather than the one for mydomain.com.

This is absolutely an edge case, but relevant in long-running systems (especially docker, where you have to rebuild an image to update SSL certs).

The logs were:

SSLExpiration - certificate: *.mydomain.com expires in 596 hours (24 days)
SSLExpiration - certificate: Sectigo RSA Domain Validation Secure Server CA expires in 93092 hours (3878 days)
SSLExpiration - certificate: USERTrust RSA Certification Authority expires in 154892 hours (6453 days)
SSLExpiration - certificate: *.mydomain.com expires in 596 hours (24 days)
SSLExpiration - certificate: Sectigo RSA Domain Validation Secure Server CA expires in 93092 hours (3878 days)
SSLExpiration - certificate: USERTrust RSA Certification Authority expires in 271 hours (11 days)
SSLExpiration - certificate: AddTrust External CA Root expires in 271 hours (11 days)

The reported error was:

Error: SSL certificate will expire in 287 hours (11 days)