NOTE: Before reading this issue, be aware that it is 100% specific to the "local" worker.
This morning I received a pair of alerts informing me that blogspam had an SSL certificate nearing expiration (one for IPv4 and one for IPv6). These alerts were expected, and once I renewed the certificate I expected the notifications to clear, but they did not.
The test is in one file:
root@www ~ # cat /opt/overseer/tests.d/blogspam.conf
# BlogSpam
https://blogspam.net/xml/stats must run http with content '<spam>'
Running this test manually, like so, should have triggered an MQ notification:
root@www ~ # /opt/overseer/bin/overseer local -verbose -mq localhost:1883 /opt/overseer/tests.d/blogspam.conf
Running 'http' test against blogspam.net (2a01:4f8:151:6083::101)
SSLExpiration testing: blogspam.net:443
SSLExpiration - certificate: blogspam.net expires in 2158 hours (89 days)
SSLExpiration - certificate: Let's Encrypt Authority X3 expires in 24899 hours (1037 days)
SSLExpiration - certificate: DST Root CA X3 expires in 29625 hours (1234 days)
[1/5] - Test passed.
Running 'http' test against blogspam.net (176.9.183.101)
SSLExpiration testing: blogspam.net:443
SSLExpiration - certificate: blogspam.net expires in 2158 hours (89 days)
SSLExpiration - certificate: Let's Encrypt Authority X3 expires in 24899 hours (1037 days)
SSLExpiration - certificate: DST Root CA X3 expires in 29625 hours (1234 days)
[1/5] - Test passed.
So what went wrong? Well, this is what should have happened:
- Open the notifier (i.e. the MQ connection).
- Parse the tests.
- Run each test.
- Publish the result of each test over MQ.
- No more tests? Exit.
It's the last bit that is the problem:
- The result of the final test was published to MQ
- The process exited
However, the MQ publish didn't wait for an ack or any other confirmation, so what actually happened was:
- Fire the message at MQ.
- Exit, before that message had been delivered to MQ.
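The race can be reproduced without a real broker. The following is a minimal sketch using only the standard library; `fakeBroker`, `Publish`, and `Count` are hypothetical names invented for illustration and are not part of overseer or of any MQ client library. It assumes a client that hands messages to a background goroutine and returns immediately, which matches the behaviour described above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeBroker is a hypothetical stand-in for the MQ server. A message only
// counts as received once the simulated network delivery has completed.
type fakeBroker struct {
	mu       sync.Mutex
	received []string
}

// Count reports how many messages have been delivered so far.
func (b *fakeBroker) Count() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.received)
}

// Publish hands the message to a background goroutine, as an asynchronous
// client would, and returns a channel that is closed after delivery.
func (b *fakeBroker) Publish(msg string) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		time.Sleep(50 * time.Millisecond) // simulated network latency
		b.mu.Lock()
		b.received = append(b.received, msg)
		b.mu.Unlock()
		close(done)
	}()
	return done
}

func main() {
	b := &fakeBroker{}

	// Fire-and-forget: if main returned here, the message would be lost.
	ack := b.Publish("blogspam.net: test passed")
	fmt.Println("delivered before waiting:", b.Count())

	// Blocking on the acknowledgement makes it safe to exit afterwards.
	<-ack
	fmt.Println("delivered after waiting:", b.Count())
}
```

The point of the sketch is that the caller needs some handle on the in-flight message, here a channel, so that exiting can be made conditional on delivery.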
This behaviour explains why the overseer worker mode of operation wasn't affected: in that mode the worker keeps running forever, and the persistent notification setup (as implemented in #17) means there is no join/part to the MQ server.
In fact if you look at an older commit you can see where I added some code to work around this problem:
//
// This seems to be necessary .. Sigh
//
time.Sleep(500 * time.Millisecond)
Adding a sleep is a bad solution because you never know how long you need to sleep. What you actually need to do is await the MQ delivery, or otherwise receive an acknowledgement of some kind.
In conclusion:
- Our MQ-publish must await a successful delivery.
- If that means a new/different client library then so be it.