josepdcs / kubectl-prof
kubectl-prof is a kubectl plugin to profile applications on Kubernetes with minimal overhead.
License: Apache License 2.0
When I run with Java and async-profiler cpu events, all the flamegraphs are downloaded:
--tool async-profiler -e cpu -l java -o flamegraph -t 2m --interval 60s
But with Java and async-profiler alloc events, I get an error saying "Checksum does not match":
--tool async-profiler -e alloc -l java -o flamegraph -t 2m --interval 60s
Default profiling tool async-profiler will be used ... ✔
Verifying target pod ... ✔
Launching profiler ... ✔
Profiling ... ✔
Checksum does not match, retrying: /tmp/agent-flamegraph-1382909.html.gz...
Checksum does not match, retrying: /tmp/agent-flamegraph-1382909.html.gz...
With alloc events, when I remove --interval 60s, a single flamegraph is produced and downloaded.
It looks like a timing issue. I tried changing perfDelayBetweenJobs, but it doesn't seem to introduce a delay at all, as there is no delay between the job timestamps. I do see the sleep in the code in ./internal/agent/profiler/jvm/async_profiler.go.
{"type":"log","data":{"time":"2024-02-01T19:55:57.560460874Z","level":"debug","msg":"The target filesystem is: /run/containerd/io.containerd.runtime.v2.task/k8s.io/9207e89a6d90b33dc9082d185a43c82a20c217e5b42bf42dd1dc33409829e9dc/rootfs"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.560837314Z","level":"debug","msg":"pgrep -P 3327888"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.580797179Z","level":"debug","msg":"pgrep -P 3328340"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.598762725Z","level":"debug","msg":"pgrep -P 3328373"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.615486974Z","level":"debug","msg":"The PIDs to be profiled: [3328373]"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.615520289Z","level":"debug","msg":"cp -r /app/async-profiler /tmp"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.618155249Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh -o flamegraph -d 60 -f /tmp/agent-flamegraph-3328373.html -e alloc --fdtransfer 3328373"}}
{"type":"log","data":{"time":"2024-02-01T19:56:57.756881691Z","level":"debug","msg":"stat -c%s /tmp/agent-flamegraph-3328373.html.gz"}}
{"type":"result","data":{"time":"2024-02-01T19:56:57.757942168Z","result-type":"flamegraph","file":"/tmp/agent-flamegraph-3328373.html.gz","file-size-in-bytes":28507,"checksum":"2ddf9c65b8b12630eabfc2ffd6d9d61f","compressor-type":"gzip"}}
{"type":"log","data":{"time":"2024-02-01T19:56:57.758313068Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh -o flamegraph -d 60 -f /tmp/agent-flamegraph-3328373.html -e alloc --fdtransfer 3328373"}}
{"type":"log","data":{"time":"2024-02-01T19:57:57.859065758Z","level":"debug","msg":"stat -c%s /tmp/agent-flamegraph-3328373.html.gz"}}
{"type":"result","data":{"time":"2024-02-01T19:57:57.860134251Z","result-type":"flamegraph","file":"/tmp/agent-flamegraph-3328373.html.gz","file-size-in-bytes":27931,"checksum":"385cf11bce4a902a073427e3b521f28b","compressor-type":"gzip"}}
{"type":"log","data":{"time":"2024-02-01T19:57:57.860312937Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh -o flamegraph -d 60 -f /tmp/agent-flamegraph-3328373.html -e alloc --fdtransfer 3328373"}}
{"type":"log","data":{"time":"2024-02-01T19:58:57.961279147Z","level":"debug","msg":"stat -c%s /tmp/agent-flamegraph-3328373.html.gz"}}
{"type":"result","data":{"time":"2024-02-01T19:58:57.962588016Z","result-type":"flamegraph","file":"/tmp/agent-flamegraph-3328373.html.gz","file-size-in-bytes":34371,"checksum":"6673e8b7ae742ee567d29eb6dcc7d77c","compressor-type":"gzip"}}
{"type":"progress","data":{"time":"2024-02-01T19:58:57.962825035Z","stage":"ended"}}
{"type":"log","data":{"time":"2024-02-01T20:03:57.976570146Z","level":"warn","msg":"Maximum allowed time 5m0s surpassed. Cleaning up and auto-deleting the agent..."}}
{"type":"log","data":{"time":"2024-02-01T20:03:57.976643458Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh stop 3328373"}}
{"type":"log","data":{"time":"2024-02-01T20:03:57.990292613Z","level":"debug","msg":"Trying to remove file: /tmp/agent-flamegraph-3328373.html"}}
{"type":"log","data":{"time":"2024-02-01T20:03:57.990424459Z","level":"debug","msg":"Trying to remove file: /tmp/agent-flamegraph-3328373.html.gz"}}
Maybe it would help if the filename had a timestamp or an interval counter for each interval, rather than reusing the same filename (agent-flamegraph-1382909.html) every time. Would the same apply to perfRecordOutputFileName and perfScriptOutputFileName?
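A per-iteration suffix along those lines could be sketched as follows. Note that agentFileName is a hypothetical helper for illustration, not the plugin's actual naming code:

```go
package main

import "fmt"

// agentFileName appends the iteration number to the per-PID output file, so
// each interval writes a distinct file instead of every job reusing
// /tmp/agent-flamegraph-<pid>.html and racing with the downloader.
func agentFileName(prefix string, pid, iteration int) string {
	return fmt.Sprintf("/tmp/%s-%d-%d.html", prefix, pid, iteration)
}

func main() {
	// Three 60s intervals would then produce three distinct files.
	for i := 1; i <= 3; i++ {
		fmt.Println(agentFileName("agent-flamegraph", 1382909, i))
	}
}
```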
Enable use of perf, bpf and https://github.com/flamegraph-rs/flamegraph
Hi there,
Just wanted to know if there would be interest in supporting the Arm64 architecture?
Is any work underway on the topic, or are there foreseen challenges?
I can try to contribute if needed.
Thanks!
New awesome async-profiler version: https://github.com/async-profiler/async-profiler/releases/tag/v3.0
It could be useful to be able to specify the concrete PID to be profiled, if it is known: --pid 343434
First of all, @josepdcs, thank you for your efforts; this project is the best-supported project for performance sampling so far.
When an application container contains multiple processes, capturing requires specifying the target process name, just like the --pgrep parameter in the upstream tool.
Here is a low-overhead sampling profiler for PHP 7+:
https://github.com/adsr/phpspy
https://github.com/reliforp/reli-prof
It would be useful to be able to profile apps that launch more than one child process, even in a model of hierarchical processes. This occurs in Python, Node.js, etc.
For example, in
https://github.com/josepdcs/kubectl-prof/blob/main/internal/cli/kubernetes/job/jvm.go#L76
The capability in the JobConfig for perf sampling can be lowered from SYS_ADMIN to just PERFMON and SYSLOG. Kernels prior to v5.9 may require SYS_PTRACE as well.
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html#perf-events-access-control
The permissions required for perf are:
sysctl -w kernel.kptr_restrict=0
sysctl -w kernel.perf_event_paranoid=1
or the capabilities PERFMON and SYSLOG, which is confirmed in the kernel code at the following locations:
https://elixir.bootlin.com/linux/v5.15.148/source/tools/perf/util/util.c#L290
https://elixir.bootlin.com/linux/v5.15.148/source/kernel/kallsyms.c#L794
I modified the line mentioned, built and tested the plugin with Java and async-profiler, and the profiler returns the output. This is some of the output of --dry-run:
name: kubectl-prof
resources:
  limits:
    cpu: "1"
securityContext:
  capabilities:
    add:
      - PERFMON
      - SYSLOG
  privileged: true
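The kernel-version dependence described above could be sketched like this. requiredCaps is a hypothetical helper, and the exact capability set should be verified against the perf-security docs:

```go
package main

import "fmt"

// requiredCaps returns the capability set discussed above: kernels v5.9+
// expose CAP_PERFMON, so SYS_ADMIN can be dropped in favour of PERFMON plus
// SYSLOG (for reading /proc/kallsyms); older kernels may additionally need
// SYS_PTRACE.
func requiredCaps(major, minor int) []string {
	if major > 5 || (major == 5 && minor >= 9) {
		return []string{"PERFMON", "SYSLOG"}
	}
	return []string{"PERFMON", "SYSLOG", "SYS_PTRACE"}
}

func main() {
	fmt.Println(requiredCaps(5, 15)) // a v5.15 kernel, as in the links above
}
```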
% kubectl prof normcore-room-07156def-a20e-411e-8c4e-8775f63e0d0b -t 1m --lang clang++ -o flamegraph
Default profiling tool bpf will be used ... ✔
Verifying target pod ... ✔
Launching profiler ... ❌
FATA[2024-02-15T15:50:04-05:00] Job.batch "kubectl-prof-clang++-bpf-a46bf2fc-a284-435f-87fe-34f3383959d1" is invalid: [metadata.name: Invalid value: "kubectl-prof-clang++-bpf-a46bf2fc-a284-435f-87fe-34f3383959d1": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.labels: Invalid value: "kubectl-prof-clang++-bpf-a46bf2fc-a284-435f-87fe-34f3383959d1": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')]
It looks like the clang++ in the profiler pod name is failing validation on my Kubernetes deployment. Is there a way to set a custom name, or to not include the language in the profiler pod name?
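One way to make the language part of the job name RFC 1123-safe would be to map disallowed characters before building the name. sanitizeLabel below is a hypothetical sketch, not the plugin's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeLabel lowercases the input, replaces anything outside [a-z0-9-]
// with '-', and trims leading/trailing '-', so the result is safe for an
// RFC 1123 subdomain segment and for label values.
func sanitizeLabel(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' {
			b.WriteRune(r)
		} else {
			b.WriteRune('-')
		}
	}
	return strings.Trim(b.String(), "-")
}

func main() {
	// "clang++" -> "clang--" after mapping, then trimmed to "clang".
	fmt.Println(sanitizeLabel("clang++"))
}
```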
The raw output format is provided by most profiling tools (async-profiler, py-spy, rbspy, perf, bcc-profile, etc.), and it can be transformed into a flamegraph using Brendan Gregg's tool or, even better, read directly by the SpeedScope tool.
Hello @josepdcs. Thank you for your effort on developing this tool.
I noticed that there is an option --runtime-path with which one can specify an alternative container runtime install path.
root@kubectl-flame-658d9ffb4-zwfls:/workspace# kubectl prof | grep runtime-path
--runtime-path string Use a different container runtime install path (default "/run/containerd/")
...
However, no matter what I set this to, like /host/var/snap/microk8s/common/run/containerd/
or /host/data/snap/microk8s/common/run/containerd/
, kubectl-prof
appears to ignore this.
{"type":"log","data":{"time":"2024-01-16T05:13:36.376271942Z","level":"debug","msg":"{\"Duration\":5000000000,\"Interval\":5000000000,\"UID\":\"21488509-a797-4dc9-b3db-ff368fa1c55a\",\"ContainerRuntime\":\"containerd\",\"ContainerID\":\"804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe\",\"PodUID\":\"db24dd86-715f-4de3-9f37-962aa5aa32b7\",\"Language\":\"java\",\"Event\":\"itimer\",\"Compressor\":\"gzip\",\"Tool\":\"async-profiler\",\"OutputType\":\"flamegraph\",\"FileName\":\"\",\"HeapDumpSplitInChunkSize\":\"\",\"AdditionalArguments\":null}"}}
{"type":"progress","data":{"time":"2024-01-16T05:13:36.376699686Z","stage":"started"}}
{"type":"log","data":{"time":"2024-01-16T05:13:36.376769131Z","level":"debug","msg":"The target filesystem is: /run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe/rootfs"}}
{"type":"error","data":{"reason":"read file failed: /run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe/init.pid: open /run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe/init.pid: no such file or directory"}}
{"type":"log","data":{"time":"2024-01-16T05:18:36.378708007Z","level":"warn","msg":"Maximum allowed time 5m0s surpassed. Cleaning up and auto-deleting the agent..."}}
{"type":"log","data":{"time":"2024-01-16T05:18:36.378739311Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh stop"}}
As one can see in "/run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe/init.pid: no such file or directory", it is always /run/containerd/ that it attempts to read from.
This issue is preventing me from profiling pods running on a microk8s-based node, with containerd apparently installed by snap. Is there any workaround available for my situation?
Thank you.
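For context, the rootfs path the agent reports appears to be derived from the runtime path plus the containerd task layout. A hypothetical sketch of honouring the --runtime-path flag when building that path (rootFileSystemOf is illustrative, not the agent's actual code):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// rootFileSystemOf builds a containerd v2 task rootfs path from a
// configurable runtime path, rather than hard-coding /run/containerd/.
func rootFileSystemOf(runtimePath, containerID string) string {
	return filepath.Join(runtimePath,
		"io.containerd.runtime.v2.task/k8s.io", containerID, "rootfs")
}

func main() {
	// With a snap-installed microk8s runtime path, the agent would then
	// look under the host-mounted snap directory instead.
	fmt.Println(rootFileSystemOf(
		"/host/var/snap/microk8s/common/run/containerd",
		"804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe"))
}
```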
./kubectl-prof quickstart-es-default-0 -n default --pgrep Elasticsearch -e cpu -l java -t 500s -o raw --log-level debug
Default profiling tool async-profiler will be used ... ✔
Verifying target pod ... ✔
Launching profiler ... ✔
Profiling ... ✔
Error: open /tmp/agent-raw-2037586-1.txt: no such file or directory ❌
{"type":"log","data":{"time":"2024-04-25T08:39:05.102930398Z","level":"debug","msg":"{\"Duration\":500000000000,\"Interval\":500000000000,\"UID\":\"d4e71b1e-3537-402d-a4bd-57981c1aeb3e\",\"ContainerRuntime\":\"containerd\",\"ContainerRuntimePath\":\"/run/containerd\",\"ContainerID\":\"2661c207673062ac9b40389fb8d25fbc00f1e3a1cbb12f3157f37bc5ac5bad1c\",\"PodUID\":\"d37565e8-e463-4f9d-b43a-31f1fe68aaa2\",\"Language\":\"java\",\"Event\":\"cpu\",\"Compressor\":\"gzip\",\"Tool\":\"async-profiler\",\"OutputType\":\"raw\",\"FileName\":\"\",\"HeapDumpSplitInChunkSize\":\"\",\"PID\":\"\",\"Pgrep\":\"Elasticsearch\",\"AdditionalArguments\":null,\"Iteration\":0}"}}
{"type":"progress","data":{"time":"2024-04-25T08:39:05.103293142Z","stage":"started"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.103421331Z","level":"debug","msg":"The target filesystem is: /run/containerd/io.containerd.runtime.v2.task/k8s.io/2661c207673062ac9b40389fb8d25fbc00f1e3a1cbb12f3157f37bc5ac5bad1c/rootfs"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.103716862Z","level":"debug","msg":"pgrep -P 2037504"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.117411991Z","level":"debug","msg":"pgrep -P 2037516"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.128687515Z","level":"debug","msg":"pgrep -P 2037586"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.139730059Z","level":"debug","msg":"/app/get-ps-command.sh 2037586"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.164718085Z","level":"debug","msg":"ps command output: /usr/share/elasticsearch/jdk/bin/java -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -Djava.security.manager=allow -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j2.formatMsgNoLookups=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=org.elasticsearch.preallocate --enable-native-access=org.elasticsearch.nativeaccess -Des.cgroups.hierarchy.override=/ -XX:ReplayDataFile=logs/replay_pid%p.log -Des.distribution.type=docker -XX:+UseG1GC -Djava.io.tmpdir=/tmp/elasticsearch-9751218612527351347 --add-modules=jdk.incubator.vector -XX:+HeapDumpOnOutOfMemoryError -XX:+ExitOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m -Xms1024m -Xmx1024m -XX:MaxDirectMemorySize=536870912 -XX:G1HeapRegionSize=4m -XX:InitiatingHeapOccupancyPercent=30 -XX:G1ReservePercent=15 --module-path /usr/share/elasticsearch/lib --add-modules=jdk.net --add-modules=ALL-MODULE-PATH -m org.elasticsearch.server/org.elasticsearch.bootstrap.Elasticsearch"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.164846621Z","level":"debug","msg":"The PIDs to be profiled: [2037586]"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.164878472Z","level":"debug","msg":"cp -r /app/async-profiler /tmp"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.167856531Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh -o collapsed -d 500 -f /tmp/agent-raw-2037586-1.txt -e cpu --fdtransfer 2037586"}}
{"type":"error","data":{"reason":"open /tmp/agent-raw-2037586-1.txt: no such file or directory"}}
{"type":"log","data":{"time":"2024-04-25T08:47:25.837430888Z","level":"debug","msg":"Received signal: terminated"}}
{"type":"log","data":{"time":"2024-04-25T08:47:25.83748281Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh stop 2037586"}}
{"type":"log","data":{"time":"2024-04-25T08:47:25.84230416Z","level":"debug","msg":"Profiling finished properly. Bye!"}}
rpc error: code = NotFound desc = an error occurred when try to find container "87727459221fec3493c440103aab9f80390663a3cfc3672a770b5f64f1037aa9": not found
How can I debug this problem? Please help!
The default configuration could be stored in the .kube folder as a YAML file: kubectl-prof.yml
Hey all, I've been able to get kubectl-prof working with the perf profiler and it works beautifully (thanks for the hard work here!). However, I'd really like to use the eBPF profiler, but on DigitalOcean, their host instances do not have the kheaders module available.
I've reached out to DO and it doesn't seem like the lack of kheaders is going to change anytime soon. I was looking at CO-RE (https://web.archive.org/web/20220522105208/https://www.seekret.io/blog/handling-the-challenge-of-deploying-ebpf-into-the-wild/) as a potential workaround.
Is this something that I could help implement here?