Comments (17)

deadok22 commented on June 2, 2024 3

@sgnn7 thanks for the update. I'm glad to hear the backend issue got resolved.

> I'll leave the issue open for a few days for visibility but will close it after that since the incident has been resolved.

Please consider fixing the healthcheckLoop ending, as per the issue's description. With that patched, customers' agents will be able to recover without requiring an operator's attention.
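A minimal sketch of the requested behavior in Go, assuming (as described in this thread) that the health loop currently stops after a 403 from API key validation. The names `runHealthLoop`, `validateFunc`, and `ErrForbidden` are hypothetical and are not taken from the Agent's source. The loop keeps polling after a failure, so a later successful validation flips the state back to healthy without a restart.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrForbidden is a hypothetical stand-in for a 403 response
// from the API key validation endpoint.
var ErrForbidden = errors.New("api key validation: 403 forbidden")

// validateFunc is a hypothetical hook that performs one validation attempt.
type validateFunc func() error

// runHealthLoop calls validate maxChecks times, sleeping interval between
// attempts, and records whether each attempt left the agent healthy.
// It does not return on the first failure, so a later success is observed.
func runHealthLoop(validate validateFunc, interval time.Duration, maxChecks int) []bool {
	states := make([]bool, 0, maxChecks)
	for i := 0; i < maxChecks; i++ {
		err := validate()
		states = append(states, err == nil)
		if i < maxChecks-1 {
			time.Sleep(interval)
		}
	}
	return states
}

func main() {
	// Simulated backend: two 403 responses, then the key is accepted again.
	calls := 0
	validate := func() error {
		calls++
		if calls <= 2 {
			return ErrForbidden
		}
		return nil
	}
	fmt.Println(runHealthLoop(validate, 10*time.Millisecond, 4))
	// prints [false false true true]
}
```

A real loop would run until shutdown rather than for a fixed number of checks; the bound here only keeps the example finite.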

from datadog-agent.

bberg-indeed commented on June 2, 2024 3

When a pod fails the readiness check, Kubernetes is unable to route traffic to it via the service. On a pod with 4/4 containers ready, I can do this:

% kubectl -n datadog exec -it datadog-agent-xgdbx -c agent -- curl agent.datadog.svc.cluster.local.:8126
404 page not found

On a pod where the agent container is failing its readiness check, the command just hangs indefinitely, presumably because Kubernetes is unwilling to route traffic to a pod that is not ready.

At least this is noticeable. When sending UDP metrics traffic to the service, it's likely that the packets are just silently dropped.

deadok22 commented on June 2, 2024 3

@sgnn7

@bberg-indeed's comment implies that Datadog customers who route traffic to the agents through Kubernetes Service resources, and who haven't restarted the agents, are still losing data as a result of https://status.datadoghq.com/incidents/q2d98y2qv54j

jszwedko commented on June 2, 2024 2

Reopening this issue since the Agent changes to retry API key validation failures haven't been implemented yet.

aokomorowski commented on June 2, 2024 1

I can confirm it happens for us as well. A simple restart does the job. Too bad the issue doesn't cause the liveness probe to fail.

voiski commented on June 2, 2024 1

@sgnn7 If this is the trick, could we get flags that enable retries for auto-recovery, or other mechanisms, behind some config? That way you preserve the current behavior but allow customization. After all, it looks like it was an intermittent problem.

As a use case, manually triggered AWS ECS tasks will stay up and blocked if sidecars fail. As seen here, the Datadog agent recovered and sent signals to the server, but the agent health command kept reporting failure. For more context: manual ECS task triggers can be part of an automated workflow. Having the agent self-recover would simplify operations and avoid incidents.
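To illustrate the kind of opt-in setting meant here, a rough Go sketch follows. `HealthConfig`, its fields, and `nextDelay` are all hypothetical; no such options exist in the Agent's configuration as far as this thread shows. With the flag off, a 403 ends validation as it does today; with it on, retries are scheduled with a capped exponential backoff.

```go
package main

import (
	"fmt"
	"time"
)

// HealthConfig is a hypothetical set of options for this sketch only.
type HealthConfig struct {
	RetryOnForbidden bool          // opt in to retrying after a 403
	BaseDelay        time.Duration // delay before the first retry
	MaxDelay         time.Duration // upper bound for the backoff
}

// nextDelay reports how long to wait before retry number attempt
// (starting at 0) and whether a retry should happen at all.
func nextDelay(cfg HealthConfig, statusCode, attempt int) (time.Duration, bool) {
	if statusCode != 403 {
		return 0, false // only the 403 case is handled in this sketch
	}
	if !cfg.RetryOnForbidden {
		return 0, false // flag off: keep the current give-up behavior
	}
	d := cfg.BaseDelay
	// Double the delay once per previous attempt, stopping at the cap.
	for i := 0; i < attempt && d < cfg.MaxDelay; i++ {
		d *= 2
	}
	if d > cfg.MaxDelay {
		d = cfg.MaxDelay
	}
	return d, true
}

func main() {
	cfg := HealthConfig{RetryOnForbidden: true, BaseDelay: time.Second, MaxDelay: 8 * time.Second}
	for attempt := 0; attempt < 5; attempt++ {
		d, retry := nextDelay(cfg, 403, attempt)
		fmt.Println(attempt, d, retry)
	}
}
```

Defaulting the flag to off would keep existing deployments unchanged, which is the trade-off raised above.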

deadok22 commented on June 2, 2024

I shared a flare from that instance in this support case, too: https://help.datadoghq.com/hc/en-us/requests/1583229

posquit0 commented on June 2, 2024

Same here. Many clusters in production are experiencing the same issue with datadog-agent.

deadok22 commented on June 2, 2024

I guess it's worth pointing out that this failure can only occur if the Datadog backend responds with a 403, if I'm reading the code right. That hints at a problem with API key validation in the backend.

theplatformer commented on June 2, 2024

This also happened to us today. As others have mentioned, restarting the pods with the unready agent containers looks to have sorted things as of about an hour ago.

smaugs commented on June 2, 2024

I can confirm this issue too.

pcn commented on June 2, 2024

Same. We got alerts on readiness failing, and were scrambling to understand this yesterday.

voiski commented on June 2, 2024

There was a Datadog incident yesterday. We also had some instances impacted, but we only noticed them today.
https://status.datadoghq.com/incidents/q2d98y2qv54j

sgnn7 commented on June 2, 2024

Hi everyone,
Thank you for reporting this.

The problem (as @voiski correctly identified) is tied to an already-known short-lived incident that occurred yesterday that could have impacted the API key validation of some Agents. If needed and as others have suggested, a restart of the Agent should correct the health reporting. We will be revisiting the health logic of the Agent in the future to see what we can do to prevent this from reoccurring.

I'll leave the issue open for a few days for visibility but will close it after that since the incident has been resolved.

Thanks,
Srdjan

sgnn7 commented on June 2, 2024

@deadok22

> Please consider fixing the healthcheckLoop ending, as per the issue's description. With that patched, customers' agents will be able to recover without requiring an operator's attention.

Fair. I'll wait for the postmortem but, as mentioned in my original comment, we will be looking into this. Overall, though, what should happen to the Agent when a key is detected as "invalid" is a deceptively tricky question, and the available options may not have great trade-offs in terms of overall UX.

sgnn7 commented on June 2, 2024

Thank you for reporting the issue here, and we appreciate all the contributors for their responses! As mentioned in the previous comments, Datadog engineers detected the incident on Mar 6 and resolved the cause promptly.

If you're interested in learning more details about the issue and its resolution, feel free to request a Root Cause Analysis (RCA) by opening a ticket with our Datadog support (you can reference the status page link https://status.datadoghq.com/incidents/q2d98y2qv54j in the support ticket).

Thank you all for your input, and I will mark this thread as resolved.
