Code Monkey home page Code Monkey logo

surftrace's Introduction

mascot.png

1、简介

​ surftrace是一个ftrace的自动封装器和开发编译平台,既能让用户基于libbpf快速构建工程进行开发,也能作为ftrace的封装器进行trace命令编写。项目包含surftrace工具集和pylcc、glcc(python or generic C language for libbpf Compiler Collection),提供远程和本地eBPF的编译能力。

surftrace.png

1.1、ftrace原理与不足

​ ftrace是一个内核中的追踪器,用于帮助系统开发者或设计者查看内核运行情况,它可以被用来调试或者分析延迟/性能等常见问题。早期 ftrace 是一个 function tracer,仅能够记录内核的函数调用流程。如今ftrace已经成为一个开发框架,从2.6内核开始引入,是一套公认安全、可靠、高效的内核数据获取方式。

​ ftrace对使用者的要求比较高,以对内核符号 wake_up_new_task 进行trace,同时要获取入参(struct task_struct *)->comm 成员信息为例,启动配置需要经历三个步骤:

echo 'p:f0 wake_up_new_task comm=+0x678(%di):string' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on

​ 要想停止需要继续配置如下:

echo 0 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo -:f0 >> /sys/kernel/debug/tracing/kprobe_events
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on

​ 一共需要六个步骤。其中,最困难的是第一个参数解析步骤。通常情况下,需要使用gdb 加载对应内核vmlinux, 对 struct task_struct 结构体中 comm成员进行偏移计算。上述方法如果不经常使用,重新手工操作的时间成本非常高,导致真正直接采用ftrace对内核信息进行采集的案例非常少,相关资料文献也匮乏。

1.2、surftrace目标

​ surftrace的主要目标是为了降低内核trace难度,达到快速高效获取内核信息目标。综合来说要达到以下效果:

    1. 一键trace内核符号,并获取指定内核数据;
    1. 除了C和linux 操作系统内核,用户无需新增学习掌握其它知识点(需要获取数据进行二次处理除外);
    1. 覆盖大部分主流发行版内核;
    1. 类似bcc开发模式,达到libbpf最佳资源消耗;

2、surftrace 命令使用

​ 使用surftrace,需要满足以下条件:

    1. 公开发行版linux内核,支持目录参考:http://pylcc.openanolis.cn/version/ (持续更新)
    1. 内核支持ftrace,已配置了debugfs,root权限;
    1. Python2 >= 2.7; Python3 >= 3.5,已安装pip;

​ surftrace支持 remote(默认),local和gdb三种表达式解析器,要求分别如下:

    1. remote mode:可以访问pylcc.openanolis.cn
    1. local mode:从http://pylcc.openanolis.cn/db/ 下载对应arch和内核的下载到本地
    1. gdb mode:gdb version > 8.0,存放有对应内核的vmlinux;对于gdb模式而言,不受公开发行版内核限制(性能太弱,已经不再推荐)

2.1、安装

​ 我们以龙蜥 4.19.91-24.8.an8.x86_64内核为例,需要root用户,执行以下命令进行安装:

pip3 install surftrace
Collecting surftrace
  Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/b9/a2/f7e04bb8ebb12e6517162a70886e3ffe8d466437b15624590c9301fdcc52/surftrace-0.2.tar.gz
Building wheels for collected packages: surftrace
  Running setup.py bdist_wheel for surftrace ... done
  Stored in directory: /root/.cache/pip/wheels/cf/28/93/187f359be189bf0bf4a70197c53519c6ca54ffb957bcbebf5a
Successfully built surftrace
Installing collected packages: surftrace
Successfully installed surftrace-0.2

 0.6以上(含)的版本采用https流的方式与服务器传输数据,低于0.6版本采用tcp 流传输。后者服务将从2023年12月31号起后下线。

​ 检查安装是否成功

surftrace --help
usage: surftrace [-h] [-v VMLINUX] [-m MODE] [-d DB] [-r RIP] [-f FILE]
                 [-g GDB] [-F FUNC] [-o OUTPUT] [-l LINE] [-a ARCH] [-s] [-S]
                 [traces [traces ...]]

Trace ftrace kprobe events.

positional arguments:
  traces                set trace args.

optional arguments:
  -h, --help            show this help message and exit
  -v VMLINUX, --vmlinux VMLINUX
                        set vmlinux path.
  -m MODE, --mode MODE  set arg parser, fro
  -d DB, --db DB        set local db path.
  -r RIP, --rip RIP     set remote server ip, remote mode only.
  -f FILE, --file FILE  set input args path.
  -g GDB, --gdb GDB     set gdb exe file path.
  -F FUNC, --func FUNC  disasassemble function.
  -o OUTPUT, --output OUTPUT
                        set output bash file
  -l LINE, --line LINE  get file disasemble info
  -a ARCH, --arch ARCH  set architecture.
  -s, --stack           show call stacks.
  -S, --show            only show expressions.

examples:

2.2、常规函数入口trace

​ 接下来我们以 以下两个常用内核符号为例,它的原型定义如下:

void wake_up_new_task(struct task_struct *p);
struct file *do_filp_open(int dfd, struct filename *pathname, const struct open_flags *op);

2.2.1、追踪符号入口和返回点

  • 命令:surftrace 'p wake_up_new_task' 'r wake_up_new_task'
surftrace 'p wake_up_new_task' 'r wake_up_new_task'
echo 'p:f0 wake_up_new_task' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 'r:f1 wake_up_new_task' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f1/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 surftrace-2336  [001] ....  1447.877666: f0: (wake_up_new_task+0x0/0x280)
 surftrace-2336  [001] d...  1447.877670: f1: (_do_fork+0x153/0x3d0 <- wake_up_new_task)

​ 示例中入参有两个表达式,所有表达式要用单引号括起来。

  • 'p wake_up_new_task':p表示表示probe函数入口;
  • 'r wake_up_new_task':r表示probe函数返回位置;

​ 后面的wake_up_new_task是要trace的函数符号,这个符号必须要在tracing/available_filter_functions 中可以找到的。

2.2.2、获取函数入参

​ 要获取 do_filp_open 函数 第一个入参dfd,它的数据类型是:int。

​- 命令:surftrace 'p do_filp_open dfd=%0'

surftrace 'p do_filp_open dfd=%0'
echo 'p:f0 do_filp_open dfd=%di:u32' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 surftrace-2435  [001] ....  2717.606277: f0: (do_filp_open+0x0/0x100) dfd=4294967196
 AliYunDun-1812  [000] ....  2717.655955: f0: (do_filp_open+0x0/0x100) dfd=4294967196
 AliYunDun-1812  [000] ....  2717.856227: f0: (do_filp_open+0x0/0x100) dfd=4294967196
  • dfd是自定义变量,可以自行定义,名字不冲突即可
  • %0表示第一个入参,%1表示第二个……

​ 前面打印中,dfd是按照十进制显示的,可能没有十六进制那么直观,指定十六进制的方法:

​ 命令:surftrace 'p do_filp_open dfd=X%0'

surftrace 'p do_filp_open dfd=X%0'
echo 'p:f0 do_filp_open dfd=%di:x32' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 surftrace-2459  [000] ....  3137.167885: f0: (do_filp_open+0x0/0x100) dfd=0xffffff9c
 AliYunDun-1812  [001] ....  3137.171997: f0: (do_filp_open+0x0/0x100) dfd=0xffffff9c
 AliYunDun-1826  [001] ....  3137.201401: f0: (do_filp_open+0x0/0x100) dfd=0xffffff9c

​ 传参编号%前面使用了X进制类型标识符,共有SUX三种类型,分别对应有符号十进制、无符号十进制和十六进制,不指定默认为U类型。

2.2.3、解析入参结构体

​ wake_up_new_task入参类型为struct task_struct *,如果要获取入参中comm成员,即任务名,

​- 命令:surftrace 'p wake_up_new_task comm=%0->comm'

surftrace 'p wake_up_new_task comm=%0->comm'
echo 'p:f0 wake_up_new_task comm=+0xae0(%di):string' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 surftrace-2421  [000] ....  2368.261019: f0: (wake_up_new_task+0x0/0x280) comm="surftrace"
 bash-2392  [001] ....  2375.809655: f0: (wake_up_new_task+0x0/0x280) comm="bash"
 bash-2392  [001] ....  2379.038534: f0: (wake_up_new_task+0x0/0x280) comm="bash"
 bash-2392  [000] ....  2381.237443: f0: (wake_up_new_task+0x0/0x280) comm="bash"

​ 方法和C语言获取结构体成员方法一样。

​ 结构体类型可以级联访问:

 surftrace 'p wake_up_new_task uesrs=S%0->mm->mm_users'
echo 'p:f0 wake_up_new_task uesrs=+0x58(+0x850(%di)):s32' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 surftrace-2471  [001] ....  3965.234680: f0: (wake_up_new_task+0x0/0x280) uesrs=2
 bash-2392  [000] ....  3970.094475: f0: (wake_up_new_task+0x0/0x280) uesrs=1
 bash-2392  [000] ....  3971.954463: f0: (wake_up_new_task+0x0/0x280) uesrs=1
surftrace 'p wake_up_new_task node=%0->se.run_node.rb_left'
echo 'p:f0 wake_up_new_task node=+0xa8(%di):u64' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 surftrace-2543  [001] ....  5926.605145: f0: (wake_up_new_task+0x0/0x280) node=0
 bash-2392  [001] ....  5940.292293: f0: (wake_up_new_task+0x0/0x280) node=0
 bash-2392  [001] ....  5945.207106: f0: (wake_up_new_task+0x0/0x280) node=0
 systemd-journal-553   [000] ....  5953.211998: f0: (wake_up_new_task+0x0/0x280) node=0

2.2.4、设置过过滤器

​ 过滤器需要放在表达式最后,以f:开头,可以使用括号和&& ||逻辑表达式进行组合,具体写法可以参考ftrace文档说明

​ 命令行 surftrace 'p wake_up_new_task comm=%0->comm f:comm=="python3"'

surftrace 'p wake_up_new_task comm=%0->comm f:comm=="python3"'
echo 'p:f0 wake_up_new_task comm=+0xb28(%di):string' >> /sys/kernel/debug/tracing/kprobe_events
echo 'comm=="python3"' > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/filter
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 <...>-2640781 [002] .... 6305734.444913: f0: (wake_up_new_task+0x0/0x250) comm="python3"
 <...>-2640781 [002] .... 6305734.447806: f0: (wake_up_new_task+0x0/0x250) comm="python3"
 <...>-2640781 [002] .... 6305734.450897: f0: (wake_up_new_task+0x0/0x250) comm="python3"

 系统会默认提供 'common_pid', 'common_preempt_count', 'common_flags', 'common_type' 这5个变量作为过滤器,该变量由系统提供,无需额外定义。

2.2.5、函数内部追踪

​ 函数内部追踪需要结合函数内部汇编代码进行推导,该方法并不通用,该内容操作进供参考。反汇编do_filp_open函数

3699	in fs/namei.c
   0xffffffff812adb65 <+85>:	mov    %r13d,%edx
   0xffffffff812adb70 <+96>:	or     $0x40,%edx
   0xffffffff812adb73 <+99>:	mov    %r12,%rsi
   0xffffffff812adb76 <+102>:	mov    %rsp,%rdi
   0xffffffff812adb89 <+121>:	callq  0xffffffff812ac760 <path_openat>
   0xffffffff812adb92 <+130>:	mov    %rax,%rbx

3700	in fs/namei.c
   0xffffffff812adb8e <+126>:	cmp    $0xfffffffffffffff6,%rax
   0xffffffff812adb95 <+133>:	je     0xffffffff812adbb4 <do_filp_open+164>

3701	in fs/namei.c
   0xffffffff812adbb4 <+164>:	mov    %r13d,%edx
   0xffffffff812adbb7 <+167>:	mov    %r12,%rsi
   0xffffffff812adbba <+170>:	mov    %rsp,%rdi
   0xffffffff812adbbd <+173>:	callq  0xffffffff812ac760 <path_openat>
   0xffffffff812adbc2 <+178>:	mov    %rax,%rbx
   0xffffffff812adbc5 <+181>:	jmp    0xffffffff812adb97 <do_filp_open+135>

3702	in fs/namei.c
   0xffffffff812adb97 <+135>:	cmp    $0xffffffffffffff8c,%rbx
   0xffffffff812adb9b <+139>:	je     0xffffffff812adbc7 <do_filp_open+183>

 对应源码

struct file *do_filp_open(int dfd, struct filename *pathname,
  		const struct open_flags *op)
{
  	struct nameidata nd;
  	int flags = op->lookup_flags;
  	struct file *filp;
  
  	set_nameidata(&nd, dfd, pathname);
  	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
  	if (unlikely(filp == ERR_PTR(-ECHILD)))
  		filp = path_openat(&nd, op, flags);
  	if (unlikely(filp == ERR_PTR(-ESTALE)))
  		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
  	restore_nameidata();
  	return filp;
}

 要获取 3699行 filp = path_openat(&nd, op, flags | LOOKUP_RCU) 对应的filp的值

surftrace 'p do_filp_open+121 filp=X!(u64)%ax'
echo 'p:f0 do_filp_open+121 filp=%ax:x64' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 <...>-1315799 [006] d.Z. 6314249.201847: f0: (do_filp_open+0x79/0xd0) filp=0xffff929db2819840
 <...>-4006158 [014] d.Z. 6314249.326736: f0: (do_filp_open+0x79/0xd0) filp=0xffff929daeac48c0

 变量表达式:filp=X!(u64)%ax 中,使用!对寄存器类型进行数据类型强制转换,括号当中的是是数据类型定义。

 展开 struct file 结构体定义:

struct file {
    union {
        struct llist_node fu_llist;
        struct callback_head fu_rcuhead;
    } f_u;
    struct path f_path;
    struct inode *f_inode;
    const struct file_operations *f_op;
    spinlock_t f_lock;
    enum rw_hint f_write_hint;
    atomic_long_t f_count;
    unsigned int f_flags;
    fmode_t f_mode;
    struct mutex f_pos_lock;
    loff_t f_pos;
    struct fown_struct f_owner;
    const struct cred *f_cred;
    struct file_ra_state f_ra;
    u64 f_version;
    void *f_security;
    void *private_data;
    struct list_head f_ep_links;
    struct list_head f_tfile_llink;
    struct address_space *f_mapping;
    errseq_t f_wb_err;
}

​ 如果要获取此时的f_pos值,可以这样获取

  • 命令行:surftrace 'p do_filp_open+121 pos=X!(struct file*)%ax->f_pos'
surftrace 'p do_filp_open+121 pos=X!(struct file*)%ax->f_pos'
echo 'p:f0 do_filp_open+121 pos=+0x68(%ax):x64' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 <...>-1334277 [010] d.Z. 6314645.646230: f0: (do_filp_open+0x79/0xd0) pos=0x0
 <...>-2916553 [002] d.Z. 6314645.653164: f0: (do_filp_open+0x79/0xd0) pos=0x0
 <...>-2916553 [002] d.Z. 6314645.653253: f0: (do_filp_open+0x79/0xd0) pos=0x0

 获取方法和前面保持一致。

2.3、获取返回值

​ 前文已经描述采用r 对事件类型进行标识,返回寄存器统一用$retval标识,与ftrace保持一致,以获取do_filp_open函数返回值为例:

  • 命令行:surftrace 'r do_filp_open filp=$retval'
surftrace 'r do_filp_open filp=$retval'
echo 'r:f0 do_filp_open filp=$retval:u64' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 <...>-1362926 [010] d... 6315264.198718: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) filp=18446623804769722880
 <...>-4006154 [008] d... 6315264.256749: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) filp=18446623804770426624
 <...>-4006154 [008] d... 6315264.256776: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) filp=18446623804770425344

​ 获取 struct file 中f_pos成员

  • 命令行:surftrace 'r do_filp_open pos=$retval->f_pos'
surftrace 'r do_filp_open pos=$retval->f_pos'
echo 'r:f0 do_filp_open pos=+0x68($retval):u64' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 <...>-1371049 [008] d... 6315439.568814: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) pos=0
 systemd-journal-3665  [012] d... 6315439.568962: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) pos=0
 systemd-journal-3665  [012] d... 6315439.571519: f0: (do_sys_openat2+0x1b6/0x260 <- do_filp_open) pos=0

2.4、网络报文特殊处理

​ sk_buff 是linux网络协议栈重要的结构体,通过前面的方法,并不能直接解析到我们关注的报文内容,需要进行特殊处理。以追踪icmp接收ping报文为例,我们在__netif_receive_skb_core 函数中进行probe和过滤:

  • 命令行 surftrace 'p __netif_receive_skb_core proto=@(struct iphdr *)l3%0->protocol ip_src=@(struct iphdr *)%0->saddr ip_dst=@(struct iphdr *)l3%0->daddr data=X@(struct iphdr *)l3%0->sdata[1] f:proto==1&&ip_src==127.0.0.1'
  • 同时可能需要 执行 ping127.0.0.1
surftrace 'p __netif_receive_skb_core proto=@(struct iphdr *)l3%0->protocol ip_src=@(struct iphdr *)%0->saddr ip_dst=@(struct iphdr *)l3%0->daddr data=X@(struct iphdr *)l3%0->sdata[1] f:proto==1&&ip_src==127.0.0.1'
echo 'p:f0 __netif_receive_skb_core proto=+0x9(+0xe8(%di)):u8 ip_src=+0xc(+0xe8(%di)):u32 ip_dst=+0x10(+0xe8(%di)):u32 data=+0x16(+0xe8(%di)):x16' >> /sys/kernel/debug/tracing/kprobe_events
echo 'proto==1&&ip_src==0x100007f' > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/filter
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 <...>-1420827 [013] ..s1 6316511.011244: f0: (__netif_receive_skb_core+0x0/0xc10) proto=1 ip_src=127.0.0.1 ip_dst=127.0.0.1 data=0x4a0d
 <...>-1420827 [013] ..s1 6316511.011264: f0: (__netif_receive_skb_core+0x0/0xc10) proto=1 ip_src=127.0.0.1 ip_dst=127.0.0.1 data=0x4a15

​ 协议的获取表达式为 @(struct iphdr *)l3%0->protocol,和之前不一样的是,寄存器的结构体名左括号加了@符号进行特殊标记,表示需要用该结构体来解析skb->data指针数据,结构体名和右括号后加了l3标记(命名为右标记),表示当前skb->data指向了TCP/IP 层3位置。

  • 右标记有l2、l3、l4三个选项,也可以不标记,默认为l3,如 ip_src=@(struct iphdr *)%0->saddr,没有右标记。
  • 报文结构体有 'struct ethhdr', 'struct iphdr', 'struct icmphdr', 'struct tcphdr', 'struct udphdr'五类,如果协议栈层级和报文结构体对应不上,解析器会报参数错误,如右标记为l3,但是报文结构体是 struct ethhdr类型;
  • 'struct icmphdr', 'struct tcphdr', 'struct udphdr'这三个4层结构体增加了xdata成员,用于获取协议对应报文内容。xdata有 cdata. sdata, ldata, qdata, Sdata 五种类型,位宽对应 1 2 4 8 和字符串. 数组下标是按照位宽进行对齐的,如实例表达式中的 data=@(struct icmphdr*)l3%0->sdata[1],sdata[1]表示要提取icmp报文中的2~3字节内容
  • surftrace 会对以 ip_xx开头的变量进行ipv4<->u32 ,如 ip_src=@(struct iphdr *)%0->saddr,会转成对应的IP格式。对B16_、B32_、B64_、b16_、b32_、b64_开头的变量也会进行大小端转换,B开头按照16进制输出,b以10进制输出。

2.5、event

​ trace event 信息参考 /sys/kernel/debug/tracing/events目录下的事件 描述,以追踪wakeup等待超过10ms任务为例

​ 命令行 surftrace 'e sched/sched_stat_wait f:delay>1000000'

surftrace 'e sched/sched_stat_wait f:delay>1000000'
echo 'delay>1000000' > /sys/kernel/debug/tracing/instances/surftrace/events/sched/sched_stat_wait/filter
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/sched/sched_stat_wait/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
<idle>-0     [001] dN.. 11868700.419049: sched_stat_wait: comm=h2o pid=3046552 delay=87023763 [ns]
 <idle>-0     [005] dN.. 11868700.419049: sched_stat_wait: comm=h2o pid=3046617 delay=87360020 [ns]

2.6、全局变量以及指定地址访问

2.6.1、访问内核符号:

 以访问 task_group_cache 这个全局符号为例,它的定义如下:

static struct kmem_cache *task_group_cache __read_mostly;

 获取指针信息

surftrace 'p wake_up_new_task point=@task_group_cache'
echo 'p:f0 wake_up_new_task point=@task_group_cache' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 <...>-3626383 [000] .... 12192156.289170: f0: (wake_up_new_task+0x0/0x250) point=0xffff929dc0405500
 <...>-2282088 [006] .... 12192156.294148: f0: (wake_up_new_task+0x0/0x250) point=0xffff929dc0405500
 <...>-3626558 [001] .... 12192156.305044: f0: (wake_up_new_task+0x0/0x250) point=0xffff929dc0405500
 <...>-3626558 [001] .... 12192156.305133: f0: (wake_up_new_task+0x0/0x250) point=0xffff929dc0405500

 解析变量结构体内容:

surftrace 'p wake_up_new_task name=!(struct kmem_cache*)@task_group_cache->name size=!(struct kmem_cache*)@task_group_cache->size'
echo 'p:f0 wake_up_new_task name=+0x0(+0x58(@task_group_cache)):string size=+0x18(@task_group_cache):u32' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 <...>-3736660 [014] .... 12192459.242704: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704
 <...>-2282088 [008] .... 12192459.266579: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704
 <...>-3736816 [001] .... 12192459.278101: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704
 <...>-3736816 [001] .... 12192459.278169: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704

2.6.2、访问指定地址

 根据ftrace要求,访问地址必须要在内核地址范围内。继续以以访问 task_group_cache 这个全局符号为例,首先获取符号地址

cat /proc/kallsyms |grep task_group_cache
ffffffff8647bc30 d task_group_cache

 查询命令(注意,不同内核地址不一致,不能简单复制):

surftrace 'p wake_up_new_task name=!(struct kmem_cache*)@0xffffffff8647bc30->name size=!(struct kmem_cache*)@0xffffffff8647bc30->size'
echo 'p:f0 wake_up_new_task name=+0x0(+0x58(@0xffffffff8647bc30)):string size=+0x18(@0xffffffff8647bc30):u32' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/events/kprobes/f0/enable
echo 0 > /sys/kernel/debug/tracing/instances/surftrace/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/instances/surftrace/tracing_on
 <...>-3910607 [012] .... 12193362.784157: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704
 <...>-3386586 [012] .... 12193362.960034: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704
 <...>-3386586 [012] .... 12193362.963222: f0: (wake_up_new_task+0x0/0x250) name="task_group" size=704

3、surfGuide 使用

​ surfGuide可以直接运行,命令行已经有一些使用帮助提示。现在手头任务紧张,等有空了再补充完善吧。

​ 安装:pip install surfGuide

​ 然后运行 surfGuide 就可以使用了。

4、使用surfGuide发布通用命令

同上

5、接管surftrace数据进行开发处理

同上上

6、pylcc原理以及流程图

 pylcc在libbpf基础上进行封装,将复杂的编译工程交由容器执行 pylcc.png

6、1 准备工作

基本要求

  • 能力要求:熟悉c,libpf开发特性,python
  • python2.7 或者python3都可以运行,无需安装任何第三方库。
  • 环境要求:可以访问pylcc.openanolis.cn。后面编译容器发布了以后,可以自行搭建编译服务执行

6.2 实战

执行pip install pylcc安装

git clone [email protected]:aliyun/surftrace.git

示例代码 在目录 tool/pylcc/guide下

6.3.1 从hello world 开始

hello.py 代码

import time
from pylcc.lbcBase import ClbcBase

bpfPog = r"""
#include "lbc.h"

SEC("kprobe/wake_up_new_task")
int j_wake_up_new_task(struct pt_regs *ctx)
{
    struct task_struct* parent = (struct task_struct *)PT_REGS_PARM1(ctx);
    
    bpf_printk("hello lcc, parent: %d\n", _(parent->tgid));
    return 0;
}

char _license[] SEC("license") = "GPL";
"""

class Chello(ClbcBase):
    def __init__(self):
        super(Chello, self).__init__("hello", bpf_str=bpfPog)
        while True:
            time.sleep(1)

if __name__ == "__main__":
    hello = Chello()
    pass

6.3.1.1 bpf代码说明:

  • bpf代码需要包含 lbc.h 头文件,该头文件会包含以下头文件,并且会加上我们常见的宏定义和数据类型,详情参考后面的附录,
#include "vmlinux.h"
#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>
  • SEC的定义和函数内部实现与libbpf应用方法保持一致;
  • 访问结构体成员使用了_宏,该方法访问方式相对固定,下一节会提供core的获取方法;
  • 末尾不要遗忘 _license声明

6.3.1.2、python代码实现部分说明:

 python 部分代码从ClbcBase 类继承,__init__函数中,第一入参必须要指定,用于指定生成so的文件名。在执行完__init__函数后,bfp模块就已经注入到内核当中去执行了。

6.3.1.3、执行效果:

 执行 python2 hello.py 运行,并查看编译结果:

#cat /sys/kernel/debug/tracing/trace_pipe
           <...>-1091294 [005] d... 17658161.425644: : hello lcc, parent: 106880
           <...>-4142485 [003] d... 17658161.428568: : hello lcc, parent: 4142485
           <...>-4142486 [002] d... 17658161.430972: : hello lcc, parent: 4142486
           <...>-4142486 [002] d... 17658161.431228: : hello lcc, parent: 4142486
           <...>-4142486 [002] d... 17658161.431557: : hello lcc, parent: 4142486
           <...>-4142485 [003] d... 17658161.435385: : hello lcc, parent: 4142485
           <...>-4142490 [000] d... 17658161.437562: : hello lcc, parent: 4142490

 此时可以看到目录下新增了hello.so 文件,如果文件时间戳有更新,只要bpfProg部分内容不发生改变,就不会触发重编动作。如果bpfProg 发生变换,就会触发重新编译动作,生成新的so

6.3.2 往用户态传递信息

 代码参考 eventOut.py

import ctypes as ct
from pylcc.lbcBase import ClbcBase

bpfPog = r"""
#include "lbc.h"
#define TASK_COMM_LEN 16
struct data_t {
    u32 c_pid;
    u32 p_pid;
    char c_comm[TASK_COMM_LEN];
    char p_comm[TASK_COMM_LEN];
};

LBC_PERF_OUTPUT(e_out, struct data_t, 128);
SEC("kprobe/wake_up_new_task")
int j_wake_up_new_task(struct pt_regs *ctx)
{
    struct task_struct* parent = (struct task_struct *)PT_REGS_PARM1(ctx);
    struct data_t data = {};

    data.c_pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&data.c_comm, TASK_COMM_LEN);
    data.p_pid = BPF_CORE_READ(parent, pid);
    bpf_core_read(&data.p_comm[0], TASK_COMM_LEN, &parent->comm[0]);
    
    bpf_perf_event_output(ctx, &e_out, BPF_F_CURRENT_CPU, &data, sizeof(data));
    return 0;
}

char _license[] SEC("license") = "GPL";
"""

class CeventOut(ClbcBase):
    def __init__(self):
        super(CeventOut, self).__init__("eventOut", bpf_str=bpfPog)

    def _cb(self, cpu, data, size):
        e = self.getMap('e_out', data, size)
        print("current pid:%d, comm:%s. wake_up_new_task pid: %d, comm: %s" % (
            e.c_pid, e.c_comm, e.p_pid, e.p_comm
        ))

    def loop(self):
        self.maps['e_out'].open_perf_buffer(self._cb)
        try:
            self.maps['e_out'].perf_buffer_poll()
        except KeyboardInterrupt:
            print("key interrupt.")
            exit()

if __name__ == "__main__":
    e = CeventOut()
    e.loop()

6.3.2.1 bpf部分代码说明:

  • LBC_PERF_OUTPUT宏不能用原有的bpf_map_def ……BPF_MAP_TYPE_PERF_EVENT_ARRAY…… 替代,虽然是同样申明一个 perf maps,但如果用原始的声明方式,python在加载的时候将无法识别出对应的内核数据类型。
  • 可以使用 bpf_get_current_pid_tgid 等libbpf helper函数;
  • 可以使用 bpf_core_read 等方法;
  • 不可使用 bcc 独有的方法,如直接指针访问变量等;

6.3.2.2 python部分代码说明

 以loop函数为入口:

  • self.maps['e_out'].open_perf_buffer(self._cb)函数是为 e_out事件注册回调钩子函数,其中e_out命名与bpfProg中LBC_PERF_OUTPUT(e_out, struct data_t, 128) 对应;
  • self.maps['e_out'].perf_buffer_poll() 即poll 对应的event事件,与bpfProg中 bpf_perf_event_output(ctx, &e_out……对应;

 接下来看_cb 回调函数:

  • e = self.getMap('e_out', data, size) 将数据流生成对应的数据对象;
  • 生成了数据对象后,就可以通过成员的方式来访问数据对象,该对象成员与bpfProg中 struct data_t 定义保持一致

6.3.2.3 执行结果

python2 eventOut.py
current pid:241808, comm:python. wake_up_new_task parent pid: 241871, comm: python
current pid:1, comm:systemd. wake_up_new_task parent pid: 1, comm: systemd
……

6.3.3 动态修改bpfProg代码

 在3.2的基础上,参考dynamicVar.py,如果只想动态过滤parent进程id为 241871,可以借鉴bcc的思路进行替换,大部分代码与eventOut.py一致,首先在bpfProg代码添加了过滤动作:

   	……
	u32 pid = BPF_CORE_READ(parent, pid);
    if (pid != FILTER_PID) {
        return 0;
    }
	……

 然后在main入口处进行替换

if __name__ == "__main__":
    bpfPog = bpfPog.replace("FILTER_PID", sys.argv[1])
    e = CdynamicVar()
    e.loop

 将要过滤的参数传入,执行效果

python2 dynamicVar.py 241871
current pid:241808, comm:python. wake_up_new_task pid: 241871, comm: python
current pid:241808, comm:python. wake_up_new_task pid: 241871, comm: python
current pid:241808, comm:python. wake_up_new_task pid: 241871, comm: python

6.3.4 hash map应用

 代码参考 hashMap.py,大部分代码与eventOut.py一致。

6.3.4.1 bpf 部分代码

 定义hashmap

LBC_HASH(pid_cnt, u32, u32, 1024);

使用方法和libbfp一致

	u32 *pcnt, cnt;
    
    pcnt =  bpf_map_lookup_elem(&pid_cnt, &pid);
    cnt  = pcnt ? *pcnt + 1 : 1;
    bpf_map_update_elem(&pid_cnt, &pid, &cnt, BPF_ANY);

6.3.4.2 python部分代码

 查询maps的位置在exit退出之前打印所有信息

			……
            dMap = self.maps['pid_cnt']
            print(dMap.get())
            exit()

 哈希表对象可以直接由 self.maps['pid_cnt'] 方法获取到,可以调用get函数,获取到dict对象。

 除了BPF_MAP_TYPE_HASH,lcc当前还支持BPF_MAP_TYPE_LRU_HASH、BPF_MAP_TYPE_PERCPU_HASH、 BPF_MAP_TYPE_LRU_PERCPU_HASH等类型,更多类型支持在完善中,敬请期待。

6.3.5、call stack获取

 获取内核调用栈是bpf一项非常重要的调试功能,参考 callStack.py,大部分代码与eventOut.py一致。

6.3.5.1、bpf部分代码说明

 外传的数据结构体中增加stack_id成员,接下来定义一个call stack成员

struct data_t {
    u32 c_pid;
    u32 p_pid;
    char c_comm[TASK_COMM_LEN];
    char p_comm[TASK_COMM_LEN];
    u32 stack_id;
};

LBC_PERF_OUTPUT(e_out, struct data_t, 128);
LBC_STACK(call_stack,32);

 在处理函数中记录call stack

data.stack_id = bpf_get_stackid(ctx, &call_stack, KERN_STACKID_FLAGS);

6.3.5.2、python部分代码

 通过getStacks传入stack_id,即可获取调用栈符号数组,然后列出来即可

		stacks = self.maps['call_stack'].getStacks(e.stack_id)
		print("call trace:")
		for s in stacks:
			print(s)

6.3.5.3、执行结果

python callStack.py
remote server compile success.
current pid:1, comm:systemd. wake_up_new_task pid: 1, common: systemd
call trace:
startup_64
do_syscall_64
entry_SYSCALL_64_after_swapgs

6.3.6、py与bpf.c文件分离

 参考 codeSeparate.py 和 independ.bpf.c,它的功能实现和eventOut.py 完全一致,不一样的是将python和bpf.c的功能拆分到了两个文件中去实现。  我们只需要关注下__init__函数

    def __init__(self):
        super(codeSeparate, self).__init__("independ")

 它没有了 bpf_str 入参,此时lcc会尝试从当前目录上下,去找independ.bpf.c并提请编译加载。

6.4 pylcc 与 bcc 对比性能优势

 由于bcc 库内部集成了庞大的 LLVM/Clang 库,使其在使用过程中会遇到一些问题:

    1. 在每个工具启动时,都会占用较高的 CPU 和内存资源来编译 BPF 程序,在系统资源已经短缺的服务器上运行可能引起问题;
    1. 依赖于内核头文件包,必须将其安装在每个目标主机上。即便如此,如果需要内核中未 export 的内容,则需要手动将类型定义复制/粘贴到 BPF 代码中;

 以tools/pylcc/pytool/filelife.py工具为例,与bcc/tools/filelife功能一致的前提下,性能差异对比:

    1. lcc 由于不在本地编译,无本地cpu冲高过程;而采用bcc 可以监控到明显的CPU冲高过程

bcc-compile.png

    1. 运行阶段内存占用对比
pylcc bcc
rss 10352 92288
vmpeak 207444 369672
vmdata 201284 363484

 汇总对比如下表,同样的python应用,pylcc在cpu和mem等资源消耗均比bcc有较明显的优势

pylcc bcc
启动阶段
cpu占用
0% 50%+
运行阶段
rss占用
1 9

7 附录、

7.1、lbc.h头文件已定义的信息

#ifndef LBC_LBC_H
#define LBC_LBC_H

#define _LINUX_POSIX_TYPES_H
#define __ASM_GENERIC_POSIX_TYPES_H

#define PERF_MAX_STACK_DEPTH 127
#define BPF_F_FAST_STACK_CMP	(1ULL << 9)

#define KERN_STACKID_FLAGS (0 | BPF_F_FAST_STACK_CMP)
#define USER_STACKID_FLAGS (0 | BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK)

typedef unsigned long long u64;
typedef signed long long s64;
typedef unsigned int u32;
typedef signed int s32;
typedef unsigned short u16;
typedef signed short s16;
typedef unsigned char u8;
typedef signed char s8;

enum {
    BPF_ANY         = 0, /* create new element or update existing */
    BPF_NOEXIST     = 1, /* create new element if it didn't exist */
    BPF_EXIST       = 2, /* update existing element */
    BPF_F_LOCK      = 4, /* spin_lock-ed map_lookup/map_update */
};

#define LBC_PERF_OUTPUT(MAPS, CELL, ENTRIES) \
    struct bpf_map_def SEC("maps") MAPS = { \
        .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY, \
        .key_size = sizeof(int), \
        .value_size = sizeof(s32), \
        .max_entries = ENTRIES, \
    }

#define LBC_HASH(MAPS, KEY_T, VALUE_T, ENTRIES) \
    struct bpf_map_def SEC("maps") MAPS = { \
        .type = BPF_MAP_TYPE_HASH, \
        .key_size = sizeof(KEY_T), \
        .value_size = sizeof(VALUE_T), \
        .max_entries = ENTRIES, \
    }

#define LBC_LRU_HASH(MAPS, KEY_T, VALUE_T, ENTRIES) \
    struct bpf_map_def SEC("maps") MAPS = { \
        .type = BPF_MAP_TYPE_LRU_HASH, \
        .key_size = sizeof(KEY_T), \
        .value_size = sizeof(VALUE_T), \
        .max_entries = ENTRIES, \
    }

#define LBC_PERCPU_HASH(MAPS, KEY_T, VALUE_T, ENTRIES) \
    struct bpf_map_def SEC("maps") MAPS = { \
        .type = BPF_MAP_TYPE_PERCPU_HASH, \
        .key_size = sizeof(KEY_T), \
        .value_size = sizeof(VALUE_T), \
        .max_entries = ENTRIES, \
    }

#define LBC_LRU_PERCPU_HASH(MAPS, KEY_T, VALUE_T, ENTRIES) \
    struct bpf_map_def SEC("maps") MAPS = { \
        .type = BPF_MAP_TYPE_LRU_PERCPU_HASH, \
        .key_size = sizeof(KEY_T), \
        .value_size = sizeof(VALUE_T), \
        .max_entries = ENTRIES, \
    }

#define LBC_STACK(MAPS, ENTRIES) \
    struct bpf_map_def SEC("maps") MAPS = { \
        .type = BPF_MAP_TYPE_STACK_TRACE, \
        .key_size = sizeof(u32), \
        .value_size = PERF_MAX_STACK_DEPTH * sizeof(u64), \
        .max_entries = ENTRIES, \
    }

#define _(P) ({typeof(P) val = 0; bpf_probe_read((void*)&val, sizeof(val), (const void*)&P); val;})

#include "vmlinux.h"
#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>

#ifndef NULL
#define NULL ((void*)0)
#endif
#ifndef ntohs
#define ntohs(x) (0xff00 & x << 8) \
                |(0x00ff & x >> 8)
#endif
#ifndef ntohl
#define ntohl(x) (0xff000000 & x << 24) \
                |(0x00ff0000 & x <<  8) \
                |(0x0000ff00 & x >>  8) \
                |(0x000000ff & x >> 24)
#endif
#ifndef ntohll
#define ntohll(x) ((((long long)ntohl(x))<<32) + (ntohl((x)>>32)))
#endif
#define BPF_F_CURRENT_CPU 0xffffffffULL

#endif //LBC_LBC_H

8、生成surftrace db 方法

8.1、准备工作

 以解析anolis发行版,rpm包名:kernel-debug-debuginfo-4.19.91-23.4.an8.x86_64.rpm为例。需要准备好一台x86_64实例,确保该实例可以访问上面的url。

8.1.1、环境上已经安装了docker,

  下载容器镜像

docker pull liaozhaoyan/dbhive

8.1.2、在host机器上准备好存放目录,目录结构如下:

# tree
tree
.
└── x86_64
    ├── btf
    │   └── anolis
    ├── db
    │   └── anolis
    ├── funcs
    │   └── anolis
    ├── head
    │   └── anolis
    ├── pack
    │   └── anolis
    └── vmlinux
        └── anolis

 1.顶级目录可以定义,本例定义为dbhive;

 2.一级目录为arch名,当前容器仅支持x86_64和aarch64

 3.二级目录为各个功能组目录,可以执行以下命令批量创建:

export RELEASE=anolis
mkdir -p btf/$RELEASE  db/$RELEASE  funcs/$RELEASE  head/$RELEASE  pack/$RELEASE  vmlinux/$RELEASE

 4.三级目录是发行版的名字,已经在步骤3中创建好了

8.1.3、拉起容器

docker run --net=host --privileged=true -v /root/1ext/vmhive:/home/vmhive/ --name dbhived -itd liaozhaoyan/dbhive /usr/sbin/init

8.1.4、进入容器里面执行生成db文件动作:

docker exec -it dbhived bash
cd /home/dbhive/
python3 getVmlinux.py
proc kernel-debug-debuginfo-4.19.91-23.4.an8.x86_64.rpm, x86_64
4728267 blocks
strip: /home/vmhive/x86_64/btf/anolis/stlpkyQL: warning: allocated section `.BTF' not in segment
gen /home/vmhive/x86_64/db/anolis/info-debuginfo-4.19.91-23.4.an8.x86_64.db
No symbol "__int128" in current context.
failed to parse type __int128
This context has class, struct or enum irte, not a union.
……

 此时开始解析所有的内核符号,解析完毕以后,会在host侧的vmhive/x86_64/db/anolis 目录下生成用于surftrace使用的db文件。

surftrace's People

Contributors

liaozhaoyan avatar alibaba-oss avatar cylinder1 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.