andreas-schroeder / kafka-health-check Goto Github PK

View Code? Open in Web Editor NEW

240.0 12.0 84.0 117 KB

Health Check for Kafka Brokers.

License: MIT License

Makefile 0.53% Go 97.09% Shell 0.90% Dockerfile 1.48%

kafka health-check cluster-health

kafka-health-check's Introduction

Kafka Health Check

Health checker for Kafka brokers and clusters that operates by checking whether:

a message inserted in a dedicated health check topic becomes available for consumers,
the broker can stay in the ISR of a replication check topic,
the broker is in the in-sync replica set for all partitions it replicates,
under-replicated partitions exist,
out-of-sync replicas exist,
offline partitions exist, and
the metadata of the cluster and the ZooKeeper metadata are consistent with each other.

Status

Release version is 0.1.0

Compiled binaries are available for Linux, macOS, and FreeBSD.

Use Cases

Submit a pull request to have your use case listed here!

Self-healing cluster

At AutoScout24, in order to reduce operational workload, we use kafka-health-check to automatically restart broker nodes as they become unhealthy.

In-place rolling updates

At AutoScout24, to keep the OS up to date of our clusters running on AWS, we perform regular in-place rolling updates. As we run immutable servers, we terminate each broker and replace them with fresh EC2 instances (keeping the previous broker ids). In order not to jeopardy the cluster stability when terminating brokers, we verify that the cluster is healthy before taking one broker offline. Similarly, we wait for the broker coming back online to fully catch up before proceeding with the next broker. To achieve this, we use the cluster health information provided by kafka-health-check.

Usage

Usage of kafka-health-check:
  -broker-host string
    	ip address or hostname of broker host (default "localhost")
  -broker-id uint
    	id of the Kafka broker to health check
  -broker-port uint
    	Kafka broker port (default 9092)
  -check-interval duration
    	how frequently to perform health checks (default 10s)
  -no-topic-creation
    	disable automatic topic creation and deletion
  -replication-failures-count uint
    	number of replication failures before broker is reported unhealthy (default 5)
  -replication-topic string
    	name of the topic to use for replication checks - use one per cluster, defaults to broker-replication-check
  -server-port uint
    	port to open for http health status queries (default 8000)
  -topic string
    	name of the topic to use - use one per broker, defaults to broker-<id>-health-check
  -zookeeper string
    	ZooKeeper connect string (e.g. node1:2181,node2:2181,.../chroot)

Broker Health

Broker health can be queried at /:

$ curl -s <broker-host>:8000/
{
    "broker": 1,
    "status": "sync"
}

Return codes and status values are:

200 with sync for a healthy broker that is fully in sync with all leaders.
200 with imok for a healthy broker that replays messages of its health check topic, but is not fully in sync.
500 with nook for an unhealthy broker that fails to replay messages in its health check topic within 200 milliseconds or if it fails to stay in the ISR of the replication check topic for more checks than replication-failures-count (default 5).

The returned json contains details about replicas the broker is lagging behind:

$ curl -s <broker-host>:8000/
{
    "broker": 3,
    "status": "imok",
    "out-of-sync": [
        {
            "topic": "mytopic",
            "partition": 0
        }
    ],
    "replication-failures": 1
}

Cluster Health

Cluster health can be queried at /cluster:

$ curl -s <broker-host>:8000/cluster
{
    "status": "green"
}

Return codes and status values are:

200 with green if all replicas of all partitions of all topics are in sync and metadata is consistent.
200 with yellow if one or more partitions are under-replicated and metadata is consistent.
500 with red if one or more partitions are offline or metadata is inconsistent.

The returned json contains details about metadata status and partition replication:

$ curl -s <broker-host>:8000/cluster
{
    "status": "yellow",
    "topics": [
        {
            "topic": "mytopic",
            "status": "yellow",
            "partitions": {
                "1": {
                    "status": "yellow",
                    "OSR": [
                        3
                    ]
                },
                "2": {
                    "status": "yellow",
                    "OSR": [
                        3
                    ]
                }
            }
        }
    ]
}

The fields for additional info and structures are:

topics for topic replication status: [{"topic":"mytopic","status":"yellow","partitions":{"2":{"status":"yellow","OSR":[3]}}}] In this data, OSR means out-of-sync replica and contains the list of all brokers that are not in the ISR.
metadata for inconsistencies between ZooKeeper and Kafka metadata: [{"broker":3,"status":"red","problem":"Missing in ZooKeeper"}]
zookeeper for problems with ZooKeeper connection or data, contains a single string: "Fetching brokers failed: ..."

Supported Kafka Versions

Tested with the following Kafka versions:

2.0.0
1.1.1
1.1.0
1.0.0
0.11.0.2
0.11.0.1
0.11.0.0
0.10.2.1
0.10.2.0
0.10.1.1
0.10.1.0
0.10.0.1
0.10.0.0
0.9.0.1
0.9.0.0

Kafka 0.8 is not supported.

see the compatibility spec for the full list of executed compatibility checks. To execute the compatibility checks, run make compatibility. Running the checks requires Docker.

Building

Run make to build after running make deps to restore the dependencies using govendor.

Prerequisites

Make to run the Makefile
Go 1.8 since it's written in Go

Notable Details on Health Check Behavior

When first started, the check tries to find the Kafka broker to check in the cluster metadata. Then, it tries to find the health check topic, and creates it if missing by communicating directly with ZooKeeper(configuration: 10 seconds message lifetime, one single partition assigned to the broker to check). This behavior can be disabled by using -no-topic-creation.
The check also creates one replication check topic for the whole cluster. This topic is expanded to all brokers that are checked.
When shutting down, the check deletes to health check topic partition by communicating directly with ZooKeeper. It also shrinks the partition assignment of the replication check topic, and deletes it when stopping the last health check process. This behavior can be disabled by using -no-topic-creation.
The check will try to create the health check and replication check topics only on its first connection after startup. If the topic disappears later while the check is running, it will not try to re-create its topics.
If the broker health check fails, the cluster health will be set to red.
For each check pass, the Kafka cluster metadata is fetched from ZooKeeper, i.e. the full data on brokers and topic partitions with replicas.

kafka-health-check's People

Contributors

Stargazers

Watchers

Forkers

pbatey daniilyar ustream starkers athlinksengineering jobseekerltd vicchugu 10sr hugebdu independentip gronau-it-cloud-computing wilson ltnilaysahu psycotica0-shopify dbrambilla fperalta-stratio balboah leffen briansorahan cusystem nazeemkhan77 gruntwork-io trashguy nichegriffy doesitscripters mishamo tobyclemson hussainssabir vanveele fabzo credify andpol skolomiiets garnachod lythobv aar6ncai saschawitte sherenator icyfork v2nek xmariachi uepoch dmiruke sphere-lab wh1tehorse claudio-benfatto adevinta bgreenlee nenoteerawat pdkgamage sirishankar iketutg chengkala santakd yudhaputrama rafaellccoelho pooyakn renjunqu rk295 franmrl cloud-66

kafka-health-check's Issues

Getting broker address directly from zookeeper

Can we have just one-endpoint, which can give information about all the three brokers.

so /cluster endpoint would list all the brokers

The zookeeper address should be good enough to get all the broker metadata

Improve logging on connectivity issues

When the health check cannot connect to Kafka because of misconfiguration, the logged output is not very informative. It would be valuable to improve logging behavior under these conditions so that it can be easily troubleshooted and remedied.

Zookeeper is spaming with "old client" messages

When using kafka-health-check, there are a lot of such warning messages every few seconds in Zookeeper (3.4.12 version)
Connection request from old client /<ip>:<port>; will be dropped if server is in r-o mode

Internet says that this warning relates to C client library which doesn't support read only mode comparing to Java one.

When i stop health-checker, these spam warnings disappear.

Am I using this correctly?

Hi,

I'm using the precompiled binary on Linux, running on a local machine which is pointing at a remote server.

My cmd line looks like this (machine names obscured)

./kafka-health-check -broker-host kafka-host -broker-id 1 -zookeeper kafka-host:2181

When I run this, I get the following output...

time="2018-01-09T11:50:13Z" level=info msg="using topic broker-1-health-check for broker 1 health check"
time="2018-01-09T11:50:13Z" level=info msg="using topic broker-1-health-check for broker 1 replication check"

I don't get any other output, and I'm not sure if that's correct or not. I'm not seeing any topics created in Kafka. If I query the health status on port 8000 I get

{"status": "nook"}

As far as I can tell, Kafka is working correctly, does anyone have any idea what's going on and is there any way I can diagnose it?

incorrect in-sync replica count for broker health check topic

Getting the following error for the :
"producer failure - broker unhealthy: not enough in-sync replicas (19)"

However, kafka-topic.sh --describe shows that the topic has one in-sync replica, so this error condition is incorrect. I am digging into the code, but I am wondering if this is a zookeeper connection issue?

Health Check Just Stalls

First, thanks for creating this tool. It's a big improvement over a simple TCP listener check.

I'm running into a sporadic issue where, when I reboot my Kafka cluster, one of the nodes' kafka-health-check daemon will just stall without doing anything. Here's the only log output I see:

time="2018-04-30T09:43:33Z" level=info msg="using topic broker-2-health-check for broker 2 health check"
time="2018-04-30T09:43:33Z" level=info msg="using topic broker-2-health-check for broker 2 replication check"
time="2018-04-30T09:43:38Z" level=info msg="unable to connect to broker, retrying in 5s (cannot connect)"
time="2018-04-30T09:43:43Z" level=info msg="unable to connect to broker, retrying in 5s (cannot connect)"
time="2018-04-30T09:45:08Z" level=info msg="using topic broker-2-health-check for broker 2 health check"
time="2018-04-30T09:45:08Z" level=info msg="using topic broker-2-health-check for broker 2 replication check"

Note that the 09:45 time stamp is when I manually restarted my supervisord service that runs kafka-health-check.

This issue only occurs after I have an initially healthy cluster, and then being rolling out an update across the cluster. Notably, each replacement Kafka broker retains the same broker-id, so I'm wondering if that's what's tripping up kafka-health-check?

Here's the command I'm using to run it:

kafka-health-check -zookeeper 172.31.4.179:2181,172.31.25.146:2181,172.31.20.115:2181 -broker-port 9094 -broker-id 2 -broker-host 127.0.0.1

And of course this works fine on other Kafka brokers. If it's any help, here's my Kafka broker config:

broker.id=2
listeners=EXTERNAL://0.0.0.0:9092,INTERNAL://0.0.0.0:9093,HEALTHCHECK://127.0.0.1:9094
advertised.listeners=EXTERNAL://13.127.215.56:9092,INTERNAL://172.31.17.29:9093,HEALTHCHECK://127.0.0.1:9094
listener.security.protocol.map=EXTERNAL:PLAINTEXT,INTERNAL:PLAINTEXT,HEALTHCHECK:PLAINTEXT
inter.broker.listener.name=INTERNAL
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/opt/kafka/kafka-logs/data
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
min.insync.replicas=1
default.replication.factor=1
unclean.leader.election.enable=true
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=172.31.20.115:2181,172.31.4.179:2181,172.31.25.146:2181
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=0

Any help is much appreciated!

enable brokerHost variable for network environments where localhost will not work

Hi Andreas,

First of all--great job on this! I've recently deployed it and it's working well.

My deployment network setup does not allow binding to localhost, so I needed to add a brokerHost variable. I forked the project and will submit a pull request today or over the weekend.

--John

Questions on functionality

Hi @andreas-schroeder ,

Thanks for building this check! I have some questions on some of the behavior I have been experiencing while using this tool and want to make sure I am using it properly.

Replication and partitions count is set to 1 on the broker-replication-check topic. This doesn't seem correct as it wouldn't be able to determine if all brokers were in the ISR.

I recently have encountered a Kafka outage and after I recovered the cluster, I had to restart the Kafka health check service on ALL of my brokers so that it would detect that the brokers were healthy again. This is most likely due to losing connection to the cluster/zookeeper, however in one of my environments where I am experiencing issues getting kafka-health-check to report a healthy cluster, I can see that it does make reconnect attempts.

INFO[0037] closing connection and reconnecting         
 
INFO[0042] found partition id 1 for broker 0 in topic "broker-0-health-check" 

INFO[0042] found partition id 2 for broker 0 in topic "broker-replication-check"
 
INFO[0042] reconnected

I am still unable to figure out why kafka-health-check will not report green on this cluster. I have recompiled the check with an increased timeout without any progress. This is on a fresh Kafka cluster with only the consumer_offsets partition. It will just report NOOK and continue in a loop as mentioned above.

Thank you!

Does this support SASL auth? could not connect to broker

SSL support: Does it connect to brokers configured with ssl

Cannot connect to Broker (Kafka 2.0.0)

Running a 3-node-cluster with a Zookeeper on each node, Kafka 2.0.0. Although trying different things and studying the readme, i do not get a connection to a broker on the same machine, or any broker of my cluster. I tried:

-zookeeper localhost:2181
-zookeeper <full qualified hostname>:2181
I also tried adding -broker-host, -broker-port and -broker-id, in different variations, no success. Error is unable to connect to broker, retrying in 5s (cannot connect).

Either i am missing something obvious that i could not extract from the Readme, or connection to a 2.0.0 broker is not possible. What is the commandline option(s) that is/are expected to do a health check against a local zookeeper/kafka? Thanks.

Provide one-shot check mode

Provide start options that execute the checks once. The check should wait for the broker to become healthy within a specified deadline, and report success or failure through exit codes 0 and 1 respectively.

Motivation: Container health checks have to be specified as repeatedly executed commands. Providing a one-shot check mode would allow to use kafka-health-check directly instead of using curl to query the daemonized kafka-health-check process.

Name in Linux deployable has a typo

Hi @andreas-schroeder, I have just started testing the utility to health-check a multi-node Kafka cluster setup (based on HashiCorp stack) and it's looking good so far.
One minor observation - the name of binary in the Linux deployable (kafka-health-check_0.0.0_linux_amd64.tar.gz) is 'kafha-health-check' instead of 'kafka-health-check'. Please could you correct this or let me know if I can.
Also is it not time now to call the version 1.0.0 ?
Cheers.

Timeout on setting CheckInterval >= 30 sec

Producer is unable to produce when the health check interval is set more than 30sec (have tried 20 seconds, works fine ) and gives producer failure - broker unhealthy: read tcp : i/o timeout
Healthcheck reconnects afterwards . Is there any setting for timeout , couldn't find it on the source code.

Panic while shrinking replication topic

time="2017-09-26T11:01:05+02:00" level=info msg="closing connection and reconnecting"
time="2017-09-26T11:01:09+02:00" level=info msg="error while reconnecting:connect was asked to stop"
time="2017-09-26T11:01:09+02:00" level=info msg="connecting to ZooKeeper ensemble **masked**:2181,**masked**:2181,**masked**:2181"
time="2017-09-26T11:01:09+02:00" level=info msg="creating node /admin/delete_topics/broker-3-health-check"
time="2017-09-26T11:01:14+02:00" level=info msg="Shrinking replication check topic to exclude broker 3"
panic: runtime error: slice bounds out of range

goroutine 1 [running]:
github.com/andreas-schroeder/kafka-health-check/check.delAll(0xc42062e090, 0x4, 0x4, 0xc400000003, 0x1, 0x0, 0x0)
        /home/travis/gopath/src/github.com/andreas-schroeder/kafka-health-check/check/int32_slice.go:29 +0x162
github.com/andreas-schroeder/kafka-health-check/check.(*HealthCheck).deleteTopic(0xc4200ea000, 0x874360, 0xc42000e048, 0x0, 0x0, 0x72f414, 0x18, 0x0, 0x0, 0x0)
        /home/travis/gopath/src/github.com/andreas-schroeder/kafka-health-check/check/setup.go:304 +0x398
github.com/andreas-schroeder/kafka-health-check/check.(*HealthCheck).closeConnection(0xc4200ea000, 0x1)
        /home/travis/gopath/src/github.com/andreas-schroeder/kafka-health-check/check/setup.go:284 +0x261
github.com/andreas-schroeder/kafka-health-check/check.(*HealthCheck).CheckHealth(0xc4200ea000, 0xc420016540, 0xc4200165a0, 0xc420016360)
        /home/travis/gopath/src/github.com/andreas-schroeder/kafka-health-check/check/health_check.go:99 +0x941
main.main()
        /home/travis/gopath/src/github.com/andreas-schroeder/kafka-health-check/main.go:20 +0x150
time="2017-09-26T11:01:14+02:00" level=info msg="using topic broker-3-health-check for broker 3 health check"
time="2017-09-26T11:01:14+02:00" level=info msg="using topic broker-3-health-check for broker 3 replication check"
time="2017-09-26T11:01:19+02:00" level=info msg="found partition id 0 for broker 3 in topic \"broker-3-health-check\""
time="2017-09-26T11:01:19+02:00" level=info msg="found partition id 0 for broker 3 in topic \"broker-replication-check\""
time="2017-09-26T11:01:19+02:00" level=info msg="starting health check loop"

Fetching topics from ZooKeeper failed

Hi,
I get the following error message from the application when trying to list topics from ZooKeeper:

INFO[0236] metadata could not be retrieved, assuming broker unhealthy: Fetching topics from ZooKeeper failed: json: cannot unmarshal object into Go value of type map[int32][]int32

This expects the type of json value partitions in map[int32][]int32 however the actual type is map[string][]int32

[zk: localhost:2181(CONNECTED) 3] get /brokers/topics/broker-1-health-check   
{"version":1,"partitions":{"0":[1]}}
cZxid = 0x10000001b
ctime = Mon Oct 24 09:53:51 CEST 2016
mZxid = 0x10000001b
mtime = Mon Oct 24 09:53:51 CEST 2016

KAFKA_VERSION="0.10.0.1"
SCALA_VERSION="2.11"

sigterm not handled ?

Maybe It's my fault but seems that the process doesn't handle graceful process termination...
SIGTERM does nothing.... only sigkill works :(

Create releases with precompiled binaries

It would be handy to have versioned releases with binaries for common platforms.

Single node down shows other node as broken

Steps to reproduce:

Scale up cluster with 2 brokers.
Run kafka health check on each node.
Stop the first broker.
Hit the broker healthcheck endpoint on the second broker.

Expected result:

200 imok

Actual result:

500 nook

Presumably this is related to the fact that the broker healthcheck topic is not replicated onto the first broker and the second broker sees it as unhealthy when the first is down. Perhaps the second broker's healthcheck should ignore the healthcheck topic of the first?

The cluster is also shown as red from the second broker and yellow from the first, which shows an inconsistent view of things. Perhaps the broker healthcheck topics should be filtered out when polling for cluster health.

Health-check prints "cannot fetch metadata. No topics created?" after kafka has been restarted

Hello. I'm testing your health check:

Start kafka
Start health check
It prints 200 {"broker":0,"status":"sync"}
Stop kafka
It prints 500 {"status": "nook"}
Start kafka.
Health check still prints {"status": "nook"} and logs metadata could not be retrieved, assuming broker unhealthy: cannot fetch metadata. No topics created?
After health check restart it prints 200 {"broker":0,"status":"sync"}

Why is it still has unhealthy state? kafka_2.12-2.2.0

Add option for min.insync.replicas

Observation

In our production environment the min.insync.replicas setting is set to 2 for stability reasons. As the kafka_health_check is creating its non-replicating topics with a replication-factor of 1 this makes the cluster check fail because Kafka is not accepting writes to those topics.

Expectation

The topic is replicated to ensure the min.insync.replicas expectation is fulfilled and then checked.