
kubectl-prof's Issues

Flamegraphs for Java with -e alloc and --interval don't download ("Checksum does not match" error)

When I run with java async-profiler cpu events, all the flamegraphs are downloaded:

--tool async-profiler -e cpu -l java -o flamegraph -t 2m --interval 60s

but with java async-profiler alloc events I get a "Checksum does not match" error:

--tool async-profiler -e alloc -l java -o flamegraph -t 2m --interval 60s

Default profiling tool async-profiler will be used ... ✔
Verifying target pod ... ✔
Launching profiler ... ✔
Profiling ... ✔

Checksum does not match, retrying: /tmp/agent-flamegraph-1382909.html.gz...
Checksum does not match, retrying: /tmp/agent-flamegraph-1382909.html.gz...

With alloc events, when I remove --interval 60s, a single flamegraph is produced and downloaded.

It looks like a timing issue. I tried changing perfDelayBetweenJobs, but it doesn't seem to introduce a delay at all, as there is no delay between the job timestamps. I can see the sleep in the code in ./internal/agent/profiler/jvm/async_profiler.go.

{"type":"log","data":{"time":"2024-02-01T19:55:57.560460874Z","level":"debug","msg":"The target filesystem is: /run/containerd/io.containerd.runtime.v2.task/k8s.io/9207e89a6d90b33dc9082d185a43c82a20c217e5b42bf42dd1dc33409829e9dc/rootfs"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.560837314Z","level":"debug","msg":"pgrep -P 3327888"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.580797179Z","level":"debug","msg":"pgrep -P 3328340"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.598762725Z","level":"debug","msg":"pgrep -P 3328373"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.615486974Z","level":"debug","msg":"The PIDs to be profiled: [3328373]"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.615520289Z","level":"debug","msg":"cp -r /app/async-profiler /tmp"}}
{"type":"log","data":{"time":"2024-02-01T19:55:57.618155249Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh -o flamegraph -d 60 -f /tmp/agent-flamegraph-3328373.html -e alloc --fdtransfer 3328373"}}
{"type":"log","data":{"time":"2024-02-01T19:56:57.756881691Z","level":"debug","msg":"stat -c%s /tmp/agent-flamegraph-3328373.html.gz"}}
{"type":"result","data":{"time":"2024-02-01T19:56:57.757942168Z","result-type":"flamegraph","file":"/tmp/agent-flamegraph-3328373.html.gz","file-size-in-bytes":28507,"checksum":"2ddf9c65b8b12630eabfc2ffd6d9d61f","compressor-type":"gzip"}}
{"type":"log","data":{"time":"2024-02-01T19:56:57.758313068Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh -o flamegraph -d 60 -f /tmp/agent-flamegraph-3328373.html -e alloc --fdtransfer 3328373"}}
{"type":"log","data":{"time":"2024-02-01T19:57:57.859065758Z","level":"debug","msg":"stat -c%s /tmp/agent-flamegraph-3328373.html.gz"}}
{"type":"result","data":{"time":"2024-02-01T19:57:57.860134251Z","result-type":"flamegraph","file":"/tmp/agent-flamegraph-3328373.html.gz","file-size-in-bytes":27931,"checksum":"385cf11bce4a902a073427e3b521f28b","compressor-type":"gzip"}}
{"type":"log","data":{"time":"2024-02-01T19:57:57.860312937Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh -o flamegraph -d 60 -f /tmp/agent-flamegraph-3328373.html -e alloc --fdtransfer 3328373"}}
{"type":"log","data":{"time":"2024-02-01T19:58:57.961279147Z","level":"debug","msg":"stat -c%s /tmp/agent-flamegraph-3328373.html.gz"}}
{"type":"result","data":{"time":"2024-02-01T19:58:57.962588016Z","result-type":"flamegraph","file":"/tmp/agent-flamegraph-3328373.html.gz","file-size-in-bytes":34371,"checksum":"6673e8b7ae742ee567d29eb6dcc7d77c","compressor-type":"gzip"}}
{"type":"progress","data":{"time":"2024-02-01T19:58:57.962825035Z","stage":"ended"}}
{"type":"log","data":{"time":"2024-02-01T20:03:57.976570146Z","level":"warn","msg":"Maximum allowed time 5m0s surpassed. Cleaning up and auto-deleting the agent..."}}
{"type":"log","data":{"time":"2024-02-01T20:03:57.976643458Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh stop 3328373"}}
{"type":"log","data":{"time":"2024-02-01T20:03:57.990292613Z","level":"debug","msg":"Trying to remove file: /tmp/agent-flamegraph-3328373.html"}}
{"type":"log","data":{"time":"2024-02-01T20:03:57.990424459Z","level":"debug","msg":"Trying to remove file: /tmp/agent-flamegraph-3328373.html.gz"}}

Maybe it would help if the filename had a timestamp or an interval counter, rather than the same filename (agent-flamegraph-1382909.html) being reused for each interval. The same might apply to perfRecordOutputFileName and perfScriptOutputFileName.
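The per-interval counter suggested above could be sketched like this. This is a hypothetical helper (the function name and placement are mine, not kubectl-prof's): each interval run would write agent-flamegraph-&lt;pid&gt;-1.html, -2.html, and so on, so the agent never re-compresses a file the CLI is still downloading.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// intervalFileName appends the interval iteration number to the output
// file name, so consecutive interval runs never overwrite each other.
func intervalFileName(base string, iteration int) string {
	ext := filepath.Ext(base)             // ".html"
	stem := strings.TrimSuffix(base, ext) // "/tmp/agent-flamegraph-1382909"
	return fmt.Sprintf("%s-%d%s", stem, iteration, ext)
}

func main() {
	fmt.Println(intervalFileName("/tmp/agent-flamegraph-1382909.html", 1))
	// /tmp/agent-flamegraph-1382909-1.html
}
```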

Support for Arm64 architectures

Hi there,

Just wanted to know if there would be interest in supporting Arm64 architectures. Has any work started on the topic, or are there foreseen challenges?

I can try to contribute if needed.

Thx !

Support for specifying a process name

First of all, @josepdcs, thank you for your efforts; this is the best-supported project for performance sampling so far.

When an application container contains multiple processes, capturing requires a way to specify the target process name, just like the --pgrep parameter in the upstream project.
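The requested behavior could be sketched as follows. Names here are mine, not kubectl-prof internals: among the candidate PIDs discovered in the container, keep only those whose full command line contains the pattern, mirroring `pgrep -f`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// filterPIDs keeps only the candidate PIDs whose command line contains
// the --pgrep pattern, so multi-process containers profile the right one.
func filterPIDs(cmdlines map[int]string, pattern string) []int {
	var pids []int
	for pid, cmd := range cmdlines {
		if strings.Contains(cmd, pattern) {
			pids = append(pids, pid)
		}
	}
	sort.Ints(pids)
	return pids
}

func main() {
	candidates := map[int]string{
		3328340: "/bin/sh entrypoint.sh",
		3328373: "java -m org.elasticsearch.server/org.elasticsearch.bootstrap.Elasticsearch",
	}
	fmt.Println(filterPIDs(candidates, "Elasticsearch")) // [3328373]
}
```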

Capabilities in JobConfig can be reduced from SYS_ADMIN

for example in
https://github.com/josepdcs/kubectl-prof/blob/main/internal/cli/kubernetes/job/jvm.go#L76

The capability in the JobConfig for perf sampling can be lowered from SYS_ADMIN to just PERFMON and SYSLOG. Kernels prior to v5.9 may require SYS_PTRACE.

https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html#perf-events-access-control

The permissions required for perf are:

    sysctl -w kernel.kptr_restrict=0
    sysctl -w kernel.perf_event_paranoid=1

or the capabilities PERFMON and SYSLOG, which is confirmed in the kernel code at the following locations:
https://elixir.bootlin.com/linux/v5.15.148/source/tools/perf/util/util.c#L290
https://elixir.bootlin.com/linux/v5.15.148/source/kernel/kallsyms.c#L794

I modified the line mentioned, built and tested the plugin with Java and async-profiler, and the profiler returns the output. This is some of the output of --dry-run:

      name: kubectl-prof
        resources:
          limits:
            cpu: "1"
        securityContext:
          capabilities:
            add:
            - PERFMON
            - SYSLOG
          privileged: true
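The version gate described above could be sketched like this. The function and naming are mine (not kubectl-prof's), following the issue's claim: on kernels >= v5.9 perf sampling should need only PERFMON plus SYSLOG, while older kernels may additionally require SYS_PTRACE.

```go
package main

import "fmt"

// requiredCaps picks the capability set for perf sampling based on the
// kernel version, per the issue: PERFMON + SYSLOG on v5.9+, with
// SYS_PTRACE possibly needed on older kernels.
func requiredCaps(major, minor int) []string {
	if major > 5 || (major == 5 && minor >= 9) {
		return []string{"PERFMON", "SYSLOG"}
	}
	return []string{"PERFMON", "SYSLOG", "SYS_PTRACE"}
}

func main() {
	fmt.Println(requiredCaps(5, 15)) // [PERFMON SYSLOG]
}
```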

Clang++ target attempts to create profiler pod with invalid name

% kubectl prof normcore-room-07156def-a20e-411e-8c4e-8775f63e0d0b -t 1m --lang clang++ -o flamegraph
Default profiling tool bpf will be used ... ✔
Verifying target pod ... ✔
Launching profiler ... ❌
FATA[2024-02-15T15:50:04-05:00] Job.batch "kubectl-prof-clang++-bpf-a46bf2fc-a284-435f-87fe-34f3383959d1" is invalid: [metadata.name: Invalid value: "kubectl-prof-clang++-bpf-a46bf2fc-a284-435f-87fe-34f3383959d1": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.labels: Invalid value: "kubectl-prof-clang++-bpf-a46bf2fc-a284-435f-87fe-34f3383959d1": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')] 

It looks like the clang++ in the profiler pod name fails name validation on my Kubernetes deployment. Is there a way to set a custom name, or to not include the language in the profiler pod name?
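One possible fix, sketched below with hypothetical naming, is to sanitize the language segment before building the job name: lowercase it, replace every character outside the RFC 1123 set with a dash, and trim leading/trailing dashes, so "clang++" becomes "clang".

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// invalidLabelChars matches any run of characters not allowed in an
// RFC 1123 label segment (lowercase alphanumerics and '-').
var invalidLabelChars = regexp.MustCompile(`[^a-z0-9-]+`)

// sanitizeLabel makes a name segment safe for use in a Job name and label.
func sanitizeLabel(s string) string {
	s = strings.ToLower(s)
	s = invalidLabelChars.ReplaceAllString(s, "-")
	return strings.Trim(s, "-")
}

func main() {
	fmt.Println(sanitizeLabel("clang++")) // clang
}
```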

--runtime-path appears not to work

Hello @josepdcs. Thank you for your effort on developing this tool.

I noticed there is an option --runtime-path that lets one specify an alternative container runtime install path.

root@kubectl-flame-658d9ffb4-zwfls:/workspace# kubectl prof | grep runtime-path
      --runtime-path string             Use a different container runtime install path (default "/run/containerd/")
...

However, no matter what I set this to, like /host/var/snap/microk8s/common/run/containerd/ or /host/data/snap/microk8s/common/run/containerd/, kubectl-prof appears to ignore this.

{"type":"log","data":{"time":"2024-01-16T05:13:36.376271942Z","level":"debug","msg":"{\"Duration\":5000000000,\"Interval\":5000000000,\"UID\":\"21488509-a797-4dc9-b3db-ff368fa1c55a\",\"ContainerRuntime\":\"containerd\",\"ContainerID\":\"804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe\",\"PodUID\":\"db24dd86-715f-4de3-9f37-962aa5aa32b7\",\"Language\":\"java\",\"Event\":\"itimer\",\"Compressor\":\"gzip\",\"Tool\":\"async-profiler\",\"OutputType\":\"flamegraph\",\"FileName\":\"\",\"HeapDumpSplitInChunkSize\":\"\",\"AdditionalArguments\":null}"}}
{"type":"progress","data":{"time":"2024-01-16T05:13:36.376699686Z","stage":"started"}}
{"type":"log","data":{"time":"2024-01-16T05:13:36.376769131Z","level":"debug","msg":"The target filesystem is: /run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe/rootfs"}}
{"type":"error","data":{"reason":"read file failed: /run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe/init.pid: open /run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe/init.pid: no such file or directory"}}
{"type":"log","data":{"time":"2024-01-16T05:18:36.378708007Z","level":"warn","msg":"Maximum allowed time 5m0s surpassed. Cleaning up and auto-deleting the agent..."}}
{"type":"log","data":{"time":"2024-01-16T05:18:36.378739311Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh stop"}}

As one can see from the error /run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d095a1843d1baedff11f9e38f9f9cd967731e458f65feb15d1de6e9fe/init.pid: no such file or directory, it always attempts to read from /run/containerd/.
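The path in the error looks like it is built from a hard-coded runtime root; a sketch of the likely construction (function name and layout are my assumptions, not the actual kubectl-prof internals) would be, with the fix being to thread the --runtime-path value into the first argument instead of the default:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// rootFSPath builds the containerd task rootfs path for a container ID.
// runtimePath should come from --runtime-path rather than being fixed
// to "/run/containerd/".
func rootFSPath(runtimePath, containerID string) string {
	return filepath.Join(runtimePath, "io.containerd.runtime.v2.task/k8s.io", containerID, "rootfs")
}

func main() {
	fmt.Println(rootFSPath("/host/var/snap/microk8s/common/run/containerd", "804b035d"))
	// /host/var/snap/microk8s/common/run/containerd/io.containerd.runtime.v2.task/k8s.io/804b035d/rootfs
}
```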

This issue is preventing me from profiling pods running on a microk8s-based node, where containerd was apparently installed by snap. Is there any workaround available for my situation?

Thank you.

Using async-profiler to generate raw output for Java always fails with a "no such file or directory" error

./kubectl-prof quickstart-es-default-0 -n default --pgrep Elasticsearch -e cpu -l java -t 500s -o raw --log-level debug
Default profiling tool async-profiler will be used ... ✔
Verifying target pod ... ✔
Launching profiler ... ✔
Profiling ... ✔
Error: open /tmp/agent-raw-2037586-1.txt: no such file or directory ❌

{"type":"log","data":{"time":"2024-04-25T08:39:05.102930398Z","level":"debug","msg":"{"Duration":500000000000,"Interval":500000000000,"UID":"d4e71b1e-3537-402d-a4bd-57981c1aeb3e","ContainerRuntime":"containerd","ContainerRuntimePath":"/run/containerd","ContainerID":"2661c207673062ac9b40389fb8d25fbc00f1e3a1cbb12f3157f37bc5ac5bad1c","PodUID":"d37565e8-e463-4f9d-b43a-31f1fe68aaa2","Language":"java","Event":"cpu","Compressor":"gzip","Tool":"async-profiler","OutputType":"raw","FileName":"","HeapDumpSplitInChunkSize":"","PID":"","Pgrep":"Elasticsearch","AdditionalArguments":null,"Iteration":0}"}}
{"type":"progress","data":{"time":"2024-04-25T08:39:05.103293142Z","stage":"started"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.103421331Z","level":"debug","msg":"The target filesystem is: /run/containerd/io.containerd.runtime.v2.task/k8s.io/2661c207673062ac9b40389fb8d25fbc00f1e3a1cbb12f3157f37bc5ac5bad1c/rootfs"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.103716862Z","level":"debug","msg":"pgrep -P 2037504"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.117411991Z","level":"debug","msg":"pgrep -P 2037516"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.128687515Z","level":"debug","msg":"pgrep -P 2037586"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.139730059Z","level":"debug","msg":"/app/get-ps-command.sh 2037586"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.164718085Z","level":"debug","msg":"ps command output: /usr/share/elasticsearch/jdk/bin/java -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -Djava.security.manager=allow -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j2.formatMsgNoLookups=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=org.elasticsearch.preallocate --enable-native-access=org.elasticsearch.nativeaccess -Des.cgroups.hierarchy.override=/ -XX:ReplayDataFile=logs/replay_pid%p.log -Des.distribution.type=docker -XX:+UseG1GC -Djava.io.tmpdir=/tmp/elasticsearch-9751218612527351347 --add-modules=jdk.incubator.vector -XX:+HeapDumpOnOutOfMemoryError -XX:+ExitOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m -Xms1024m -Xmx1024m -XX:MaxDirectMemorySize=536870912 -XX:G1HeapRegionSize=4m -XX:InitiatingHeapOccupancyPercent=30 -XX:G1ReservePercent=15 --module-path /usr/share/elasticsearch/lib --add-modules=jdk.net --add-modules=ALL-MODULE-PATH -m org.elasticsearch.server/org.elasticsearch.bootstrap.Elasticsearch"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.164846621Z","level":"debug","msg":"The PIDs to be profiled: [2037586]"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.164878472Z","level":"debug","msg":"cp -r /app/async-profiler /tmp"}}
{"type":"log","data":{"time":"2024-04-25T08:39:05.167856531Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh -o collapsed -d 500 -f /tmp/agent-raw-2037586-1.txt -e cpu --fdtransfer 2037586"}}
{"type":"error","data":{"reason":"open /tmp/agent-raw-2037586-1.txt: no such file or directory"}}
{"type":"log","data":{"time":"2024-04-25T08:47:25.837430888Z","level":"debug","msg":"Received signal: terminated"}}
{"type":"log","data":{"time":"2024-04-25T08:47:25.83748281Z","level":"debug","msg":"/tmp/async-profiler/profiler.sh stop 2037586"}}
{"type":"log","data":{"time":"2024-04-25T08:47:25.84230416Z","level":"debug","msg":"Profiling finished properly. Bye!"}}
rpc error: code = NotFound desc = an error occurred when try to find container "87727459221fec3493c440103aab9f80390663a3cfc3672a770b5f64f1037aa9": not found

How can I debug this problem? Please help!

Use CO-RE to remove dependency on host machine kernel headers?

Hey all, I've been able to get kubectl-prof working with the perf profiler and it works beautifully (thanks for the hard work here!). However, I'd really like to use the eBPF profiler, but on DigitalOcean, their host instances do not have the kheaders module available.

I've reached out to DO and it doesn't seem like the lack of kheaders is going to change anytime soon. I was looking at CO-RE (https://web.archive.org/web/20220522105208/https://www.seekret.io/blog/handling-the-challenge-of-deploying-ebpf-into-the-wild/) as a potential workaround.

Is this something that I could help implement here?
