Comments (17)
@sgnn7 thanks for the update. I'm glad to hear the backend issue got resolved.
I'll leave the issue open for a few days for visibility but will close it after that since the incident has been resolved.
Please consider fixing the healthcheckLoop ending, as per the issue's description. With that patched, customers' agents will be able to recover without requiring an operator's attention.
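For illustration, here is a minimal sketch of a health loop that keeps re-validating instead of ending on the first failure. It is not the Agent's actual code: `runHealthLoop`, `validateFunc`, `simulate`, and the reporting callback are hypothetical names, and the real validation call and interval are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// validateFunc stands in for whatever call checks the API key against the
// backend. It is a hypothetical hook, not a function from datadog-agent.
type validateFunc func() bool

// runHealthLoop validates once up front and then on every tick, reporting
// the latest result each time. It stops only when ctx is cancelled, so a
// later successful validation flips the reported state back to healthy.
func runHealthLoop(ctx context.Context, validate validateFunc, interval time.Duration, report func(healthy bool)) {
	report(validate())
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		// Check cancellation first so the loop exits deterministically.
		if ctx.Err() != nil {
			return
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			report(validate())
		}
	}
}

// simulate runs the loop against a scripted sequence of validation results
// and returns the health states that were reported, in order.
func simulate(results []bool) []bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	i := 0
	validate := func() bool {
		r := results[i]
		if i < len(results)-1 {
			i++
		}
		return r
	}
	var seen []bool
	report := func(healthy bool) {
		seen = append(seen, healthy)
		if len(seen) == len(results) {
			cancel() // stop once every scripted result has been reported
		}
	}
	runHealthLoop(ctx, validate, time.Millisecond, report)
	return seen
}

func main() {
	// Healthy, then two failed validations, then the backend recovers.
	fmt.Println(simulate([]bool{true, false, false, true}))
}
```

The point of the sketch is only that a failed validation updates the reported state without terminating the loop, so recovery needs no restart.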
from datadog-agent.
When a pod fails the readiness check, Kubernetes is unable to route traffic to it via the service. On a pod with 4/4 containers ready, I can do this:
% kubectl -n datadog exec -it datadog-agent-xgdbx -c agent -- curl agent.datadog.svc.cluster.local.:8126
404 page not found
On a pod where the agent container is failing its readiness check, the command just hangs indefinitely, presumably because Kubernetes is unwilling to route traffic to a pod that is not ready.
At least this is noticeable. When sending UDP metrics traffic to the service, it's likely that the packets are just silently dropped.
from datadog-agent.
@bberg-indeed's comment implies that Datadog customers who use Kubernetes Service resources to route traffic to the agents, and who haven't restarted those agents, are still losing data as a result of https://status.datadoghq.com/incidents/q2d98y2qv54j
from datadog-agent.
Reopening this issue since the Agent changes to retry API key validation failures haven't been implemented yet.
from datadog-agent.
I can confirm it happens for us as well. A simple restart does the job. Too bad the issue doesn't cause the liveness probe to fail.
from datadog-agent.
@sgnn7 If this is the trick, can we add flags that enable retries for auto-recovery, or put other mechanisms behind some config? That way you can preserve the current behavior but allow customization. After all, it looks like it was an intermittent problem.
As a use case, manually triggered aws-ecs tasks will stay up and blocked if sidecars fail. As seen, the Datadog agent recovered and sent signals to the server, but the agent health command kept reporting a failure. To give more context: manual ECS task triggers can be part of an automated workflow. Having the agent self-recover would simplify operations and avoid incidents.
from datadog-agent.
I shared a flare from that instance in this support case, too: https://help.datadoghq.com/hc/en-us/requests/1583229
from datadog-agent.
Same here. Many clusters in production are experiencing the same issue with datadog-agent.
from datadog-agent.
I guess it's worth pointing out that this failure can only occur if the Datadog backend responds with a 403, if I'm reading the code right. That hints at a problem with API key validation in the backend.
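A rough sketch of that reading, under stated assumptions: only a 403 marks the key as invalid, a 200 marks it as valid, and any other response leaves the state undetermined. The mapping for 200 and for other codes is my assumption, and `classify` and `keyStatus` are hypothetical names, not identifiers from datadog-agent.

```go
package main

import (
	"fmt"
	"net/http"
)

// keyStatus is a hypothetical three-way result for an API key check.
type keyStatus int

const (
	keyUnknown keyStatus = iota // response says nothing definite about the key
	keyValid                    // backend accepted the key
	keyInvalid                  // backend rejected the key
)

// String makes the status readable when printed.
func (s keyStatus) String() string {
	switch s {
	case keyValid:
		return "valid"
	case keyInvalid:
		return "invalid"
	default:
		return "unknown"
	}
}

// classify maps an HTTP status code from a validation request to a
// keyStatus. Only 403 is treated as a rejection; errors such as 5xx are
// kept separate so a backend outage is not mistaken for a bad key.
func classify(statusCode int) keyStatus {
	switch statusCode {
	case http.StatusOK:
		return keyValid
	case http.StatusForbidden:
		return keyInvalid
	default:
		return keyUnknown
	}
}

func main() {
	for _, code := range []int{200, 403, 503} {
		fmt.Println(code, classify(code))
	}
}
```

If the backend wrongly returns 403 for a good key, as seems to have happened here, a classifier like this still reports the key as invalid, which is why the result needs to be re-checked rather than treated as final.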
from datadog-agent.
This also happened to us today. As others have mentioned, restarting the pods with the unready agent containers looks to have sorted things out as of about an hour ago.
from datadog-agent.
I can confirm this issue too.
from datadog-agent.
Same. We got alerts on readiness failing, and were scrambling to understand this yesterday.
from datadog-agent.
There was an incident at Datadog yesterday. We also had some instances impacted, but we only noticed them today.
https://status.datadoghq.com/incidents/q2d98y2qv54j
from datadog-agent.
Hi everyone,
Thank you for reporting this.
The problem (as @voiski correctly identified) is tied to a known, short-lived incident that occurred yesterday and could have impacted the API key validation of some Agents. If needed, and as others have suggested, a restart of the Agent should correct the health reporting. We will be revisiting the health logic of the Agent in the future to see what we can do to prevent this from recurring.
I'll leave the issue open for a few days for visibility but will close it after that since the incident has been resolved.
Thanks,
Srdjan
from datadog-agent.
Please consider fixing the healthcheckLoop ending, as per the issue's description. With that patched, customers' agents will be able to recover without requiring an operator's attention.
Fair. I'll wait for the postmortem but, as mentioned in my original comment, we will be looking into this. Overall, what should happen to the Agent when a key is detected as "invalid" is a deceptively tricky question, and the possible answers may not have great trade-offs in terms of overall UX.
from datadog-agent.
Thank you for reporting the issue here, and we appreciate all the contributors for their responses! As mentioned in the previous comments, Datadog engineers detected the incident on Mar 6 and promptly resolved its cause.
If you're interested in learning more details about the issue and its resolution, feel free to request a Root Cause Analysis (RCA) by opening a ticket with our Datadog support (you can reference the status page link https://status.datadoghq.com/incidents/q2d98y2qv54j in the support ticket).
Thank you all for your input, and I will mark this thread as resolved.
from datadog-agent.