Comments (5)
$ grep timeout results.csv | wc -l
79
$ grep failed results.csv | grep -v EstimatedTimeTooLong | wc -l
14 # 4 of these should be fixed soon
$ grep succeed results.csv | wc -l
31 # we might bring this number up to 40
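The grep tallies above can also be computed in one pass; a minimal Python sketch, assuming results.csv has one row per trace whose line contains a status keyword (the sample rows below are hypothetical stand-ins for the real file):

```python
from collections import Counter

def tally(lines):
    """Count trace outcomes the same way the grep pipeline above does."""
    counts = Counter()
    for line in lines:
        if "timeout" in line:
            counts["timeout"] += 1
        # mirror `grep failed | grep -v EstimatedTimeTooLong`
        if "failed" in line and "EstimatedTimeTooLong" not in line:
            counts["failed"] += 1
        if "succeed" in line:
            counts["succeed"] += 1
    return counts

# hypothetical rows standing in for results.csv
rows = [
    "a.tar.gz,timeout",
    "b.tar.gz,failed",
    "c.tar.gz,failed,EstimatedTimeTooLong",
    "d.tar.gz,succeed",
]
print(tally(rows))  # Counter({'timeout': 1, 'failed': 1, 'succeed': 1})
```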
from hase.
$ ls -la recordings | wc -l
165
$ wc -l results.csv
124
from hase.
Good case/bad case time estimation for some long traces:
https://gist.github.com/Mic92/c8ef5f064206b810b0c3b5ddc83e38bf
from hase.
Problem:
- our bug database shows that replaying traces on the order of 1e6 instructions
  would take too long for many of our traces: https://gist.github.com/Mic92/c8ef5f064206b810b0c3b5ddc83e38bf
- order of month(s) to replay
- maybe some con
- time spent in mostly symbolic execution/constraint solving (flamegraph: https://dl.thalheim.io/a-x4-8fIhrPwQUNCYblaeQ/flamegraph.svg)
- maybe some solving could be avoided/delayed, but probably not in the amount we need
Possible idea:
- get the call stack with return pointer (and maybe registers) on every context switch with perf_events
- we already have switch events; we have to re-assemble the instruction stream
- the kernel deschedules a process at least every 20ms by default (50 Hz), but at most every 4ms (250 Hz)
- cpu: 3 GHz, 3_000_000_000 instructions per second per core (more of an upper bound):
  - snapshots at 50 Hz: one every 60e6 instructions
  - snapshots at 250 Hz: one every 12e6 instructions
  - what are realistic numbers here?
- with bad luck this would not produce close-to-exit samples
  -> ignore those traces, easy to detect without decoding the whole trace
- every system call also already introduces a context switch
- might affect some i/o-heavy applications -> instruction-counter-based rate limiter based on BPF?
- Example output: sudo perf script
.perf-wrapped 8653 27352.980404: 1 cycles:ppp:
ffffffffa1b884a3 perf_ctx_unlock+0x3 ([kernel.kallsyms])
ffffffffa1b95dd4 perf_event_exec+0x184 ([kernel.kallsyms])
ffffffffa1c3c848 setup_new_exec+0xc8 ([kernel.kallsyms])
ffffffffa1c91076 load_elf_binary+0x2e6 ([kernel.kallsyms])
ffffffffa1c3a7e0 search_binary_handler+0x90 ([kernel.kallsyms])
ffffffffa1c3c318 __do_execve_file.isra.37+0x6b8 ([kernel.kallsyms])
ffffffffa1c3c654 __x64_sys_execve+0x34 ([kernel.kallsyms])
ffffffffa1a041de do_syscall_64+0x4e ([kernel.kallsyms])
ffffffffa2200088 entry_SYSCALL_64_after_hwframe+0x44 ([kernel.kallsyms])
7facebd2c6a7 [unknown] ([unknown])
  ...
- basically just a flag to set in our existing perf code
- use angr's call_state and callstack plugin -> supports pushing initial frames
- avoid tricky compiler optimizations:
  - compile with -fno-omit-frame-pointer, also used by some companies in production (Netflix)
  - could in theory also work with -fomit-frame-pointer and some DWARF-based post-processing
Work plan:
- Joerg: implement perf part
- Liran: implement angr part
- initial prototype should be in the order of days
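The snapshot-spacing arithmetic above is easy to sanity-check; a quick sketch, assuming the 3 GHz / ~1 instruction-per-cycle upper bound from the notes:

```python
CLOCK_HZ = 3_000_000_000  # assumed 3 GHz core, ~1 instruction per cycle (upper bound)

def instructions_between_snapshots(sched_hz):
    """Upper bound on instructions executed between two scheduler ticks."""
    return CLOCK_HZ // sched_hz

print(instructions_between_snapshots(50))   # 60000000 -> 60e6 at CONFIG_HZ=50
print(instructions_between_snapshots(250))  # 12000000 -> 12e6 at CONFIG_HZ=250
```

In practice the spacing is shorter, since system calls and blocking i/o introduce extra context switches between ticks.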
from hase.
Real Application Benchmarking Results
Redis
- There is a redis server process running all the time, which is recorded.
- Benchmarking is done by redis-benchmark with the default configuration (100000 3-byte requests)
- Results of 5 runs (requests per second):
Benchmark | Hase / Original Ratio (should be < 1) |
---|---|
PING_INLINE | 1 |
PING_BULK | 1.01 |
SET | 1.01 |
GET | 1.01 |
INCR | 1 |
LPUSH | 1.01 |
RPUSH | 1 |
LPOP | 1 |
RPOP | 0.99 |
SADD | 0.99 |
HSET | 1.02 |
SPOP | 1 |
LPUSH (needed to benchmark LRANGE) | 1 |
LRANGE_100 (first 100 elements) | 1.03 |
LRANGE_300 (first 300 elements) | 0.99 |
LRANGE_500 (first 450 elements) | 1 |
LRANGE_600 (first 600 elements) | 0.99 |
MSET (10 keys) | 0.91 |
- There is another metric (user time + system time) that can be pretty accurate, because the redis server is probably only running when there is a request.
Hase (10ms) | Original (10ms) | Ratio |
---|---|---|
1911 | 1754.6 | 1.089137125 |
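The ratio in the CPU-time table is plain arithmetic over the two measurements above; a quick check:

```python
# user+system time from the table above (in the units reported there)
hase_cpu, orig_cpu = 1911, 1754.6

ratio = hase_cpu / orig_cpu
print(round(ratio, 9))  # 1.089137125
```

i.e. roughly 9% CPU-time overhead while recording, even though the throughput ratios stay near 1.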
nginx
- nginx is configured to have only one worker process, which is recorded.
- Benchmark is done by wrk -t 1 -d 10s -c 10 http://localhost/, using 1 thread with 10 connections for 10 seconds.
- Results of 10 runs:
Benchmark | Hase / Original Ratio (should be < 1) |
---|---|
Latency: | 1 (should be > 1) |
Req/Sec: | 0.97 |
requests: | 0.97 |
Requests/sec: | 0.97 |
Transfer/sec: | 0.97 |
logcabin
- logcabind is recorded, which is restarted for each run.
- Built-in benchmark logcabin-benchmark --writes 10000
- Results (time):
#Run | Hase (ms) | Original (ms) | Ratio |
---|---|---|---|
0 | 24211 | 23742.9 | |
1 | 24264.4 | 23479.3 | |
2 | 24403.5 | 23017.3 | |
3 | 23958.5 | 22874.9 | |
4 | 25075.3 | 23328 | |
average | 24382.54 | 23288.48 | 1.046978592 |
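The averages and ratio in the logcabin table follow directly from the five runs; a quick check:

```python
# per-run times (ms) from the table above
hase = [24211, 24264.4, 24403.5, 23958.5, 25075.3]
orig = [23742.9, 23479.3, 23017.3, 22874.9, 23328]

avg_hase = sum(hase) / len(hase)
avg_orig = sum(orig) / len(orig)
print(round(avg_hase, 2), round(avg_orig, 2))  # 24382.54 23288.48
print(round(avg_hase / avg_orig, 9))           # 1.046978592
```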
leveldb
- There is no server running, so the benchmark script itself is recorded.
- Built-in benchmark command db_bench
- Results of 5 runs (micros/op):
Benchmark | Hase / Original Ratio (should be > 1) |
---|---|
fillseq | 0.99 |
fillsync | 1 |
fillrandom | 0.97 |
overwrite | 1.06 |
readrandom | 0.99 |
readrandom | 1 |
readseq | 0.92 |
readreverse | 0.95 |
compact | 1.13 |
readrandom | 1 |
readseq | 1 |
readreverse | 1.02 |
fill100K | 1.03 |
crc32c | 1.02 |
snappycomp | 1 |
snappyuncomp | 1.03 |
acquireload | 1.3 |
sqlite
- There is no server running, so the benchmark program (java forked by python) is recorded.
- YCSB benchmark with workloada
- The results (10 runs) are weird.
Benchmark | Hase / Original Ratio |
---|---|
RunTime(ms) | 0.98 (should be > 1) |
Throughput(ops/sec) | 1.02 (should be < 1) |
Apache
- By configuring the mpm_worker module, it seems that I was able to run one worker process with one thread (together with one master process). However, recording the worker process results in no cpu trace.
- Running ab while recording the worker process even increases the throughput by 50%! wrk shows no such thing.
from hase.