josepdcs / kubectl-prof
kubectl-prof is a kubectl plugin to profile applications on Kubernetes with minimal overhead.
License: Apache License 2.0
When I run with Java and async-profiler cpu events, all the flamegraphs are downloaded:
--tool async-profiler -e cpu -l java -o flamegraph -t 2m --interval 60s
But with Java and async-profiler alloc events, I get an error saying "Checksum does not match":
--tool async-profiler -e alloc -l java -o flamegraph -t 2m --interval 60s
Default profiling tool async-profiler will be used ... ✔
Verifying target pod ... ✔
Launching profiler ... ✔
Profiling ... ✔
Checksum does not match, retrying: /tmp/agent-flamegraph-1382909.html.gz...
Checksum does not match, retrying: /tmp/agent-flamegraph-1382909.html.gz...
With alloc events, when I remove --interval 60s, a single flamegraph is produced and downloaded.
It looks like a timing issue. I tried changing perfDelayBetweenJobs, but it doesn't seem to introduce a delay at all, as there is no delay between the job timestamps. I do see the sleep in the code in ./internal/agent/profiler/jvm/async_profiler.go.
{"type":"log","data":{"time":"2024-02-01T19:55:57.560460874Z","level":"debug","msg":"The target filesystem is: /run/containerd/io.containerd.runtime.v2.task/k8s.io/9207e89a6d90b33dc9082d185a43c82a20c217e5b42bf42dd1dc33409829e9dc/rootfs"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.560837314Z","level":"debug","msg":"pgrep -P 3327888"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.580797179Z","level":"debug","msg":"pgrep -P 3328340"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.598762725Z","level":"debug","msg":"pgrep -P 3328373"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.615486974Z","level":"debug","msg":"The PIDs to be profiled: [3328373]"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.615520289Z","level":"debug","msg":"cp -r /app/async-profiler /tmp"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.618155249Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh -o flamegraph -d 60 -f /tmp/agent-flamegraph-3328373.html -e alloc --fdtransfer 3328373"}}
{"type":"log","data":{"time":"2024-02-01T19:56:57.756881691Z","level":"debug","msg":"stat -c%s /tmp/agent-flamegraph-3328373.html.gz"}}
{"type":"result","data":{"time":"2024-02-01T19:56:57.757942168Z","result-type":"flamegraph","file":"/tmp/agent-flamegraph-3328373.html.gz","file-size-in-bytes":28507,"checksum":"2ddf9c65b8b12630eabfc2ffd6d9d61f","compressor-type":"gzip"}}
{"type":"log","data":{"time":"2024-02-01T19:56:57.758313068Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh -o flamegraph -d 60 -f /tmp/agent-flamegraph-3328373.html -e alloc --fdtransfer 3328373"}}
{"type":"log","data":{"time":"2024-02-01T19:57:57.859065758Z","level":"debug","msg":"stat -c%s /tmp/agent-flamegraph-3328373.html.gz"}}
{"type":"result","data":{"time":"2024-02-01T19:57:57.860134251Z","result-type":"flamegraph","file":"/tmp/agent-flamegraph-3328373.html.gz","file-size-in-bytes":27931,"checksum":"385cf11bce4a902a073427e3b521f28b","compressor-type":"gzip"}}
{"type":"log","data":{"time":"2024-02-01T19:57:57.860312937Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh -o flamegraph -d 60 -f /tmp/agent-flamegraph-3328373.html -e alloc --fdtransfer 3328373"}}
{"type":"log","data":{"time":"2024-02-01T19:58:57.961279147Z","level":"debug","msg":"stat -c%s /tmp/agent-flamegraph-3328373.html.gz"}}
{"type":"result","data":{"time":"2024-02-01T19:58:57.962588016Z","result-type":"flamegraph","file":"/tmp/agent-flamegraph-3328373.html.gz","file-size-in-bytes":34371,"checksum":"6673e8b7ae742ee567d29eb6dcc7d77c","compressor-type":"gzip"}}
{"type":"progress","data":{"time":"2024-02-01T19:58:57.962825035Z","stage":"ended"}}
{"type":"log","data":{"time":"2024-02-01T20:03:57.976570146Z","level":"warn","msg":"Maximum allowed time 5m0s surpassed. Cleaning up and auto-deleting the agent..."}}
{"type":"log","data":{"time":"2024-02-01T20:03:57.976643458Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh stop 3328373"}}
{"type":"log","data":{"time":"2024-02-01T20:03:57.990292613Z","level":"debug","msg":"Trying to remove file: /tmp/agent-flamegraph-3328373.html"}}
{"type":"log","data":{"time":"2024-02-01T20:03:57.990424459Z","level":"debug","msg":"Trying to remove file: /tmp/agent-flamegraph-3328373.html.gz"}}
Maybe it would help if the filename had a timestamp or an interval counter for each interval, rather than reusing the same filename (agent-flamegraph-1382909.html) every time. Would the same apply to perfRecordOutputFileName and perfScriptOutputFileName?
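A per-iteration suffix along those lines could be sketched as follows. Note that agentFileName is a hypothetical helper for illustration, not the plugin's actual naming code:

```go
package main

import "fmt"

// agentFileName appends the iteration number to the per-PID output file, so
// each interval writes a distinct file instead of every job reusing
// /tmp/agent-flamegraph-<pid>.html and racing with the downloader.
func agentFileName(prefix string, pid, iteration int) string {
	return fmt.Sprintf("/tmp/%s-%d-%d.html", prefix, pid, iteration)
}

func main() {
	// Three 60s intervals would then produce three distinct files.
	for i := 1; i <= 3; i++ {
		fmt.Println(agentFileName("agent-flamegraph", 1382909, i))
	}
}
```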
Enable use of perf, bpf and https://github.com/flamegraph-rs/flamegraph
Hi there,
Just wanted to know if there would be interest in supporting the Arm64 architecture?
Is any work underway on the topic, or are there foreseen challenges?
I can try to contribute if needed.
Thanks!
New awesome async-profiler version: https://github.com/async-profiler/async-profiler/releases/tag/v3.0
It could be useful to be able to specify the concrete PID to be profiled, if it is known: --pid 343434
First of all, @josepdcs, thank you for your efforts; this project is the best-supported project for performance sampling so far.
When an application container contains multiple processes, capturing requires specifying the target process name, just like the --pgrep parameter in the upstream tool.
Here is a low-overhead sampling profiler for PHP 7+:
https://github.com/adsr/phpspy
https://github.com/reliforp/reli-prof
It would be useful to be able to profile apps that launch more than one child process, even in a model of hierarchical processes. This occurs in Python, Node.js, etc.
For example, in
https://github.com/josepdcs/kubectl-prof/blob/main/internal/cli/kubernetes/job/jvm.go#L76
The capability in the JobConfig for perf sampling can be lowered from SYS_ADMIN to just PERFMON and SYSLOG. Kernels prior to v5.9 may require SYS_PTRACE as well.
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html#perf-events-access-control
The permissions required for perf are:
sysctl -w kernel.kptr_restrict=0
sysctl -w kernel.perf_event_paranoid=1
or the capabilities PERFMON and SYSLOG, which is confirmed in the kernel code at the following locations:
https://elixir.bootlin.com/linux/v5.15.148/source/tools/perf/util/util.c#L290
https://elixir.bootlin.com/linux/v5.15.148/source/kernel/kallsyms.c#L794
I modified the line mentioned, built and tested the plugin with Java and async-profiler, and the profiler returns the output. This is some of the output of --dry-run:
name: kubectl-prof
resources:
  limits:
    cpu: "1"
securityContext:
  capabilities:
    add:
      - PERFMON
      - SYSLOG
  privileged: true
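The kernel-version dependence described above could be sketched like this. requiredCaps is a hypothetical helper, and the exact capability set should be verified against the perf-security docs:

```go
package main

import "fmt"

// requiredCaps returns the capability set discussed above: kernels v5.9+
// expose CAP_PERFMON, so SYS_ADMIN can be dropped in favour of PERFMON plus
// SYSLOG (for reading /proc/kallsyms); older kernels may additionally need
// SYS_PTRACE.
func requiredCaps(major, minor int) []string {
	if major > 5 || (major == 5 && minor >= 9) {
		return []string{"PERFMON", "SYSLOG"}
	}
	return []string{"PERFMON", "SYSLOG", "SYS_PTRACE"}
}

func main() {
	fmt.Println(requiredCaps(5, 15)) // a v5.15 kernel, as in the links above
}
```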
% kubectl prof normcore-room-07156def-a20e-411e-8c4e-8775f63e0d0b -t 1m --lang clang++ -o flamegraph
Default profiling tool bpf will be used ... ✔
Verifying target pod ... ✔
Launching profiler ... ❌
FATA[2024-02-15T15:50:04-05:00] Job.batch "kubectl-prof-clang++-bpf-a46bf2fc-a284-435f-87fe-34f3383959d1" is invalid: [metadata.name: Invalid value: "kubectl-prof-clang++-bpf-a46bf2fc-a284-435f-87fe-34f3383959d1": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.labels: Invalid value: "kubectl-prof-clang++-bpf-a46bf2fc-a284-435f-87fe-34f3383959d1": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')]
It looks like the clang++ in the profiler pod name is failing validation on my Kubernetes deployment. Is there a way to set a custom name, or to not include the language in the profiler pod name?
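One way to make the language part of the job name RFC 1123-safe would be to map disallowed characters before building the name. sanitizeLabel below is a hypothetical sketch, not the plugin's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeLabel lowercases the input, replaces anything outside [a-z0-9-]
// with '-', and trims leading/trailing '-', so the result is safe for an
// RFC 1123 subdomain segment and for label values.
func sanitizeLabel(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' {
			b.WriteRune(r)
		} else {
			b.WriteRune('-')
		}
	}
	return strings.Trim(b.String(), "-")
}

func main() {
	// "clang++" -> "clang--" after mapping, then trimmed to "clang".
	fmt.Println(sanitizeLabel("clang++"))
}
```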
The raw output format is provided by most profiling tools (async-profiler, py-spy, rbspy, perf, bcc-profile, etc.), and it can be transformed into a flamegraph using Brendan Gregg's tool or, even better, read directly by the SpeedScope tool.
Hello @josepdcs. Thank you for your effort on developing this tool.
I noticed that there is an option --runtime-path with which one can specify an alternative container runtime install path.
root@kubectl-flame-658d9ffb4-zwfls:/workspace# kubectl prof | grep runtime-path
--runtime-path string Use a different container runtime install path (default "/run/containerd/")
...
However, no matter what I set this to, like /host/var/snap/microk8s/common/run/containerd/
or /host/data/snap/microk8s/common/run/containerd/
, kubectl-prof
appears to ignore this.
{"type":"log","data":{"time":"2024-01-16T05:13:36.376271942Z","level":"debug","msg":"{\"Duration\":5000000000,\"Interval\":5000000000,\"UID\":\"21488509-a797-4dc9-b3db-ff368fa1c55a\",\"ContainerRuntime\":\"containerd\",\"ContainerID\":\"804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe\",\"PodUID\":\"db24dd86-715f-4de3-9f37-962aa5aa32b7\",\"Language\":\"java\",\"Event\":\"itimer\",\"Compressor\":\"gzip\",\"Tool\":\"async-profiler\",\"OutputType\":\"flamegraph\",\"FileName\":\"\",\"HeapDumpSplitInChunkSize\":\"\",\"AdditionalArguments\":null}"}}
{"type":"progress","data":{"time":"2024-01-16T05:13:36.376699686Z","stage":"started"}}
{"type":"log","data":{"time":"2024-01-16T05:13:36.376769131Z","level":"debug","msg":"The target filesystem is: /run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe/rootfs"}}
{"type":"error","data":{"reason":"read file failed: /run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe/init.pid: open /run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe/init.pid: no such file or directory"}}
{"type":"log","data":{"time":"2024-01-16T05:18:36.378708007Z","level":"warn","msg":"Maximum allowed time 5m0s surpassed. Cleaning up and auto-deleting the agent..."}}
{"type":"log","data":{"time":"2024-01-16T05:18:36.378739311Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh stop"}}
As one can see in "/run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe/init.pid: no such file or directory", it is always /run/containerd/ that it attempts to read from.
This issue is preventing me from profiling pods running on a microk8s-based node, with containerd apparently installed by snap. Is there any workaround available for my situation?
Thank you.
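For context, the rootfs path the agent reports appears to be derived from the runtime path plus the containerd task layout. A hypothetical sketch of honouring the --runtime-path flag when building that path (rootFileSystemOf is illustrative, not the agent's actual code):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// rootFileSystemOf builds a containerd v2 task rootfs path from a
// configurable runtime path, rather than hard-coding /run/containerd/.
func rootFileSystemOf(runtimePath, containerID string) string {
	return filepath.Join(runtimePath,
		"io.containerd.runtime.v2.task/k8s.io", containerID, "rootfs")
}

func main() {
	// With a snap-installed microk8s runtime path, the agent would then
	// look under the host-mounted snap directory instead.
	fmt.Println(rootFileSystemOf(
		"/host/var/snap/microk8s/common/run/containerd",
		"804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe"))
}
```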
./kubectl-prof quickstart-es-default-0 -n default --pgrep Elasticsearch -e cpu -l java -t 500s -o raw --log-level debug
Default profiling tool async-profiler will be used ... ✔
Verifying target pod ... ✔
Launching profiler ... ✔
Profiling ... ✔
Error: open /tmp/agent-raw-2037586-1.txt: no such file or directory ❌
{"type":"log","data":{"time":"2024-04-25T08:39:05.102930398Z","level":"debug","msg":"{\"Duration\":500000000000,\"Interval\":500000000000,\"UID\":\"d4e71b1e-3537-402d-a4bd-57981c1aeb3e\",\"ContainerRuntime\":\"containerd\",\"ContainerRuntimePath\":\"/run/containerd\",\"ContainerID\":\"2661c207673062ac9b40389fb8d25fbc00f1e3a1cbb12f3157f37bc5ac5bad1c\",\"PodUID\":\"d37565e8-e463-4f9d-b43a-31f1fe68aaa2\",\"Language\":\"java\",\"Event\":\"cpu\",\"Compressor\":\"gzip\",\"Tool\":\"async-profiler\",\"OutputType\":\"raw\",\"FileName\":\"\",\"HeapDumpSplitInChunkSize\":\"\",\"PID\":\"\",\"Pgrep\":\"Elasticsearch\",\"AdditionalArguments\":null,\"Iteration\":0}"}}
{"type":"progress","data":{"time":"2024-04-25T08:39:05.103293142Z","stage":"started"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.103421331Z","level":"debug","msg":"The target filesystem is: /run/containerd/io.containerd.runtime.v2.task/k8s.io/2661c207673062ac9b40389fb8d25fbc00f1e3a1cbb12f3157f37bc5ac5bad1c/rootfs"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.103716862Z","level":"debug","msg":"pgrep -P 2037504"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.117411991Z","level":"debug","msg":"pgrep -P 2037516"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.128687515Z","level":"debug","msg":"pgrep -P 2037586"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.139730059Z","level":"debug","msg":"/app/get-ps-command.sh 2037586"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.164718085Z","level":"debug","msg":"ps command output: /usr/share/elasticsearch/jdk/bin/java -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -Djava.security.manager=allow -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j2.formatMsgNoLookups=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=org.elasticsearch.preallocate --enable-native-access=org.elasticsearch.nativeaccess -Des.cgroups.hierarchy.override=/ -XX:ReplayDataFile=logs/replay_pid%p.log -Des.distribution.type=docker -XX:+UseG1GC -Djava.io.tmpdir=/tmp/elasticsearch-9751218612527351347 --add-modules=jdk.incubator.vector -XX:+HeapDumpOnOutOfMemoryError -XX:+ExitOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m -Xms1024m -Xmx1024m -XX:MaxDirectMemorySize=536870912 -XX:G1HeapRegionSize=4m -XX:InitiatingHeapOccupancyPercent=30 -XX:G1ReservePercent=15 --module-path /usr/share/elasticsearch/lib --add-modules=jdk.net --add-modules=ALL-MODULE-PATH -m org.elasticsearch.server/org.elasticsearch.bootstrap.Elasticsearch"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.164846621Z","level":"debug","msg":"The PIDs to be profiled: [2037586]"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.164878472Z","level":"debug","msg":"cp -r /app/async-profiler /tmp"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.167856531Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh -o collapsed -d 500 -f /tmp/agent-raw-2037586-1.txt -e cpu --fdtransfer 2037586"}}
{"type":"error","data":{"reason":"open /tmp/agent-raw-2037586-1.txt: no such file or directory"}}
{"type":"log","data":{"time":"2024-04-25T08:47:25.837430888Z","level":"debug","msg":"Received signal: terminated"}}
{"type":"log","data":{"time":"2024-04-25T08:47:25.83748281Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh stop 2037586"}}
{"type":"log","data":{"time":"2024-04-25T08:47:25.84230416Z","level":"debug","msg":"Profiling finished properly. Bye!"}}
rpc error: code = NotFound desc = an error occurred when try to find container "87727459221fec3493c440103aab9f80390663a3cfc3672a770b5f64f1037aa9": not found
How can I debug this problem? Please help!
The default configuration could be stored in the .kube folder as a YAML file: kubectl-prof.yml
Hey all, I've been able to get kubectl-prof working with the perf profiler and it works beautifully (thanks for the hard work here!). However, I'd really like to use the eBPF profiler, but on DigitalOcean, their host instances do not have the kheaders module available.
I've reached out to DO and it doesn't seem like the lack of kheaders is going to change anytime soon. I was looking at CO-RE (https://web.archive.org/web/20220522105208/https://www.seekret.io/blog/handling-the-challenge-of-deploying-ebpf-into-the-wild/) as a potential workaround.
Is this something that I could help implement here?