[BUG] forwarderHealth fails to recover from failures

deadok22 commented on June 2, 2024 8

[BUG] forwarderHealth fails to recover from failures

from datadog-agent.

Comments (17)

deadok22 commented on June 2, 2024 3

@sgnn7 thanks for the update. I'm glad to hear the backend issue got resolved.

> I'll leave the issue open for a few days for visibility but will close it after that since the incident has been resolved.

Please consider fixing the healthcheckLoop ending, as per the issue's description. With that patched, customers' agents will be able to recover without requiring an operator's attention.
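
To illustrate the request above, here is a minimal Go sketch of a health-check loop that records a failed validation and keeps running, so a later success flips the state back to healthy. This is not the Agent's actual code: `checkLoop`, `health`, `flakyValidator`, and `errForbidden` are hypothetical names, and a plain channel of ticks stands in for whatever timer the real loop uses.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errForbidden is a hypothetical stand-in for the backend rejecting
// an API key validation request with a 403.
var errForbidden = errors.New("backend returned 403")

// health holds the current health state behind a mutex so that a
// probe handler could read it while the loop updates it.
type health struct {
	mu      sync.Mutex
	healthy bool
}

func (h *health) set(ok bool) {
	h.mu.Lock()
	h.healthy = ok
	h.mu.Unlock()
}

func (h *health) get() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.healthy
}

// checkLoop re-validates on every tick until ticks is closed.
// A failed validation only marks the state unhealthy; the loop does
// not return, so the next successful validation restores health.
func checkLoop(ticks <-chan struct{}, validate func() error, h *health) {
	for range ticks {
		h.set(validate() == nil)
	}
}

// flakyValidator fails the first `failures` calls and succeeds afterwards,
// imitating a short-lived backend incident.
func flakyValidator(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errForbidden
		}
		return nil
	}
}

func main() {
	h := &health{}
	ticks := make(chan struct{}, 3)
	for i := 0; i < 3; i++ {
		ticks <- struct{}{}
	}
	close(ticks)

	// Two failed validations followed by one success.
	checkLoop(ticks, flakyValidator(2), h)
	fmt.Println("healthy after recovery:", h.get())
}
```

The only design point is that a validation failure changes state instead of terminating the loop; how often to re-validate is left open.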

bberg-indeed commented on June 2, 2024 3

When a pod fails the readiness check, Kubernetes is unable to route traffic to it via the service. On a pod with 4/4 containers ready, I can do this:

% kubectl -n datadog exec -it datadog-agent-xgdbx -c agent -- curl agent.datadog.svc.cluster.local.:8126
404 page not found

On a pod where the agent container is failing its readiness check, the command just hangs indefinitely, presumably because Kubernetes is unwilling to route traffic to a pod that is not ready.

At least this is noticeable. When sending UDP metrics traffic to the service, it's likely that the packets are just silently dropped.

deadok22 commented on June 2, 2024 3

@sgnn7

@bberg-indeed's comment implies that Datadog customers who route traffic to their agents through Kubernetes Service resources, and who haven't restarted those agents, are still losing data as a result of https://status.datadoghq.com/incidents/q2d98y2qv54j

jszwedko commented on June 2, 2024 2

Reopening this issue since the Agent changes to retry API key validation failures haven't been implemented yet.

aokomorowski commented on June 2, 2024 1

I can confirm it happens for us as well. A simple restart does the job; too bad the issue doesn't cause the liveness probe to fail.

voiski commented on June 2, 2024 1

@sgnn7 If this is the trick, can we add flags that enable retry for auto recovery, or put other mechanisms behind some config? That way you can preserve the current behavior but allow customization. After all, it looks like it was an intermittent problem.

As a use case, manually triggered AWS ECS tasks will stay up and blocked if sidecars fail. As seen here, the Datadog agent recovered and sent signals to the server, but the agent health command kept reporting failure. For more context: manual ECS task triggers can be part of an automated workflow, so having the agent self-recover would simplify operations and avoid incidents.

deadok22 commented on June 2, 2024

I shared a flare from that instance in this support case, too: https://help.datadoghq.com/hc/en-us/requests/1583229

posquit0 commented on June 2, 2024

Same here. Many clusters in production are experiencing the same issue with datadog-agent.

deadok22 commented on June 2, 2024

I guess it's worth pointing out that this failure can only occur if the Datadog backend responds with a 403, if I'm reading the code right. That hints at a problem with API key validation in the backend.

theplatformer commented on June 2, 2024

This happened for us today as well. As others have mentioned, restarting the pods with the unready agent containers looks to have sorted things as of about an hour ago.

smaugs commented on June 2, 2024

I can confirm this issue too.

pcn commented on June 2, 2024

Same. We got alerts on readiness failing, and were scrambling to understand this yesterday.

voiski commented on June 2, 2024

There was an incident at Datadog yesterday. We also had some instances impacted, but we only noticed them today.
https://status.datadoghq.com/incidents/q2d98y2qv54j

sgnn7 commented on June 2, 2024

Hi everyone,
Thank you for reporting this.

The problem (as @voiski correctly identified) is tied to an already-known short-lived incident that occurred yesterday that could have impacted the API key validation of some Agents. If needed and as others have suggested, a restart of the Agent should correct the health reporting. We will be revisiting the health logic of the Agent in the future to see what we can do to prevent this from reoccurring.

I'll leave the issue open for a few days for visibility but will close it after that since the incident has been resolved.

Thanks,
Srdjan

sgnn7 commented on June 2, 2024

@deadok22

> Please consider fixing the healthcheckLoop ending, as per the issue's description. With that patched, customers' agents will be able to recover without requiring an operator's attention.

Fair. I'll wait for the postmortem but, as mentioned in my original comment, we will be looking into this. Overall, though, what should happen to the Agent when a key is detected as "invalid" is a deceptively tricky question, and the options may not have great trade-offs in terms of overall UX.

sgnn7 commented on June 2, 2024

Thank you for reporting the issue here, and we appreciate all the contributors for their responses! As mentioned in the previous comments, Datadog engineers detected the issue on Mar 6 and resolved the cause promptly.

If you're interested in learning more details about the issue and its resolution, feel free to request a Root Cause Analysis (RCA) by opening a ticket with our Datadog support (you can reference the status page link https://status.datadoghq.com/incidents/q2d98y2qv54j in the support ticket).

Thank you all for your input, and I will mark this thread as resolved.
