The /health
endpoint returns status "UNKNOWN" in the body and HTTP status code 200 when an uncaught exception crashes the application.
The application crashes and the stack trace is logged correctly, but since the health check still does not return an error status, the failure is never detected by any orchestration layer.
This became a problem in our cluster. For example, when a Kafka node fails, every thread consuming a partition that had that node as its leader throws an exception and dies, which brings the whole application down. This is the intended behaviour for Kafka Streams: the application crashes and you restart it, manually or automatically.
What we expected
The health-check endpoint would mark the crashed application as unhealthy, Kubernetes would restart the pod, and the application would reconnect to the new leader.
What actually happened
The health-check status changed to "UNKNOWN", which is still served with HTTP 200 and therefore treated as "OK". Kubernetes, logically, did nothing.
Is this intended behaviour? Currently, the only way for us to get auto-healing is to inject a bash script that parses the response body and treats "UNKNOWN" as unhealthy. To me this defeats the purpose of a health-check endpoint: if you need a bash script anyway, you could just check whether the process is running.
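For reference, our current workaround looks roughly like the sketch below. The endpoint URL and the exact status strings are assumptions about our setup, not something the health endpoint guarantees:

```shell
#!/usr/bin/env sh
# Workaround sketch: treat an "UNKNOWN" (or "DOWN") health status in the
# response body as unhealthy, since the endpoint itself still answers 200.

# is_healthy BODY -> exit status 0 if the body looks healthy,
# non-zero if it contains UNKNOWN or DOWN.
is_healthy() {
  case "$1" in
    *UNKNOWN*|*DOWN*) return 1 ;;
    *) return 0 ;;
  esac
}

# Liveness-probe entry point (commented out so the sketch is self-contained).
# The URL is an assumption; point it at your actual /health endpoint.
# body="$(curl -fsS "${HEALTH_URL:-http://localhost:8080/health}")" || exit 1
# is_healthy "$body" || { echo "unhealthy: $body" >&2; exit 1; }
```

We wire this up as an `exec` liveness probe so Kubernetes restarts the pod when the script exits non-zero — which is exactly the indirection we would like the HTTP status code of the endpoint to make unnecessary.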