Comments (5)
$ grep timeout results.csv | wc -l
79
$ grep failed results.csv | grep -v EstimatedTimeTooLong | wc -l
14 # 4 of these should be fixed soon
$ grep succeed results.csv | wc -l
31 # we might bring this number up to 40
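The grep tallies above can also be computed in one pass; a minimal Python sketch, assuming results.csv has one row per trace whose line contains a status keyword (the sample rows below are hypothetical stand-ins for the real file):

```python
from collections import Counter

def tally(lines):
    """Count trace outcomes the same way the grep pipeline above does."""
    counts = Counter()
    for line in lines:
        if "timeout" in line:
            counts["timeout"] += 1
        # mirror `grep failed | grep -v EstimatedTimeTooLong`
        if "failed" in line and "EstimatedTimeTooLong" not in line:
            counts["failed"] += 1
        if "succeed" in line:
            counts["succeed"] += 1
    return counts

# hypothetical rows standing in for results.csv
rows = [
    "a.tar.gz,timeout",
    "b.tar.gz,failed",
    "c.tar.gz,failed,EstimatedTimeTooLong",
    "d.tar.gz,succeed",
]
print(tally(rows))  # Counter({'timeout': 1, 'failed': 1, 'succeed': 1})
```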
from hase.
$ ls -la recordings | wc -l
165
$ wc -l results.csv
124
from hase.
Good case/bad case time estimation for some long traces:
https://gist.github.com/Mic92/c8ef5f064206b810b0c3b5ddc83e38bf
from hase.
Problem:
- our bug database shows that replaying traces on the order of 1e6 instructions
  would take too long for many of our traces: https://gist.github.com/Mic92/c8ef5f064206b810b0c3b5ddc83e38bf
- order of month(s) to replay
- maybe some con
- time spent in mostly symbolic execution/constraint solving (flamegraph: https://dl.thalheim.io/a-x4-8fIhrPwQUNCYblaeQ/flamegraph.svg)
- maybe some solving could be avoided/delayed, but probably not in the amount we need
Possible idea:
- get the call stack with return pointer (and maybe registers) on every context switch with perf_events
- we already have switch events; we have to re-assemble the instruction stream
- the kernel deschedules a process at least every 20ms by default (50 Hz), but at most every 4ms (250 Hz)
- cpu: 3 GHz, 3_000_000_000 instructions per second per core (more of an upper bound):
  - snapshots at 50 Hz: one every 60e6 instructions
  - snapshots at 250 Hz: one every 12e6 instructions
  - what are realistic numbers here?
- with bad luck this would not produce close-to-exit samples
  -> ignore those traces, easy to detect without decoding the whole trace
- every system call also already introduces a context switch
- might affect some i/o-heavy applications -> instruction-counter-based rate limiter based on BPF?
- Example output: sudo perf script
.perf-wrapped 8653 27352.980404: 1 cycles:ppp:
ffffffffa1b884a3 perf_ctx_unlock+0x3 ([kernel.kallsyms])
ffffffffa1b95dd4 perf_event_exec+0x184 ([kernel.kallsyms])
ffffffffa1c3c848 setup_new_exec+0xc8 ([kernel.kallsyms])
ffffffffa1c91076 load_elf_binary+0x2e6 ([kernel.kallsyms])
ffffffffa1c3a7e0 search_binary_handler+0x90 ([kernel.kallsyms])
ffffffffa1c3c318 __do_execve_file.isra.37+0x6b8 ([kernel.kallsyms])
ffffffffa1c3c654 __x64_sys_execve+0x34 ([kernel.kallsyms])
ffffffffa1a041de do_syscall_64+0x4e ([kernel.kallsyms])
ffffffffa2200088 entry_SYSCALL_64_after_hwframe+0x44 ([kernel.kallsyms])
7facebd2c6a7 [unknown] ([unknown])
  ...
- basically just a flag to set in our existing perf code
- use angr's call_state and callstack plugin -> supports pushing initial frames
- avoid tricky compiler optimizations:
  - compile with -fno-omit-frame-pointer, also used by some companies in production (Netflix)
  - could in theory also work with -fomit-frame-pointer and some DWARF-based post-processing
Work plan:
- Joerg: implement perf part
- Liran: implement angr part
- initial prototype should be in the order of days
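The snapshot-spacing arithmetic above is easy to sanity-check; a quick sketch, assuming the 3 GHz / ~1 instruction-per-cycle upper bound from the notes:

```python
CLOCK_HZ = 3_000_000_000  # assumed 3 GHz core, ~1 instruction per cycle (upper bound)

def instructions_between_snapshots(sched_hz):
    """Upper bound on instructions executed between two scheduler ticks."""
    return CLOCK_HZ // sched_hz

print(instructions_between_snapshots(50))   # 60000000 -> 60e6 at CONFIG_HZ=50
print(instructions_between_snapshots(250))  # 12000000 -> 12e6 at CONFIG_HZ=250
```

In practice the spacing is shorter, since system calls and blocking i/o introduce extra context switches between ticks.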
from hase.
Real Application Benchmarking Results
Redis
- There is a redis server process running all the time, which is recorded.
- Benchmarking is done by redis-benchmark with the default configuration (100000 3-byte requests)
- Results of 5 runs (requests per second):
Benchmark | Hase / Original Ratio (should be < 1) |
---|---|
PING_INLINE | 1 |
PING_BULK | 1.01 |
SET | 1.01 |
GET | 1.01 |
INCR | 1 |
LPUSH | 1.01 |
RPUSH | 1 |
LPOP | 1 |
RPOP | 0.99 |
SADD | 0.99 |
HSET | 1.02 |
SPOP | 1 |
LPUSH (needed to benchmark LRANGE) | 1 |
LRANGE_100 (first 100 elements) | 1.03 |
LRANGE_300 (first 300 elements) | 0.99 |
LRANGE_500 (first 450 elements) | 1 |
LRANGE_600 (first 600 elements) | 0.99 |
MSET (10 keys) | 0.91 |
- There is another metric (user time + system time) that can be pretty accurate, because the redis server is probably only running when there is a request.
Hase (10ms) | Original (10ms) | Ratio |
---|---|---|
1911 | 1754.6 | 1.089137125 |
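The ratio in the CPU-time table is plain arithmetic over the two measurements above; a quick check:

```python
# user+system time from the table above (in the units reported there)
hase_cpu, orig_cpu = 1911, 1754.6

ratio = hase_cpu / orig_cpu
print(round(ratio, 9))  # 1.089137125
```

i.e. roughly 9% CPU-time overhead while recording, even though the throughput ratios stay near 1.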
nginx
- nginx is configured to have only one worker process, which is recorded.
- Benchmark is done by wrk -t 1 -d 10s -c 10 http://localhost/, using 1 thread with 10 connections for 10 seconds.
- Results of 10 runs:
Benchmark | Hase / Original Ratio (should be < 1) |
---|---|
Latency: | 1 (should be > 1) |
Req/Sec: | 0.97 |
requests: | 0.97 |
Requests/sec: | 0.97 |
Transfer/sec: | 0.97 |
logcabin
- logcabind is recorded, which is restarted for each run.
- Built-in benchmark logcabin-benchmark --writes 10000
- Results (time):
#Run | Hase (ms) | Original (ms) | Ratio |
---|---|---|---|
0 | 24211 | 23742.9 | |
1 | 24264.4 | 23479.3 | |
2 | 24403.5 | 23017.3 | |
3 | 23958.5 | 22874.9 | |
4 | 25075.3 | 23328 | |
average | 24382.54 | 23288.48 | 1.046978592 |
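The averages and ratio in the logcabin table follow directly from the five runs; a quick check:

```python
# per-run times (ms) from the table above
hase = [24211, 24264.4, 24403.5, 23958.5, 25075.3]
orig = [23742.9, 23479.3, 23017.3, 22874.9, 23328]

avg_hase = sum(hase) / len(hase)
avg_orig = sum(orig) / len(orig)
print(round(avg_hase, 2), round(avg_orig, 2))  # 24382.54 23288.48
print(round(avg_hase / avg_orig, 9))           # 1.046978592
```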
leveldb
- There is no server running, so the benchmark script itself is recorded.
- Built-in benchmark command db_bench
- Results of 5 runs (micros/op):
Benchmark | Hase / Original Ratio (should be > 1) |
---|---|
fillseq | 0.99 |
fillsync | 1 |
fillrandom | 0.97 |
overwrite | 1.06 |
readrandom | 0.99 |
readrandom | 1 |
readseq | 0.92 |
readreverse | 0.95 |
compact | 1.13 |
readrandom | 1 |
readseq | 1 |
readreverse | 1.02 |
fill100K | 1.03 |
crc32c | 1.02 |
snappycomp | 1 |
snappyuncomp | 1.03 |
acquireload | 1.3 |
sqlite
- There is no server running, so the benchmark program (java forked by python) is recorded.
- YCSB benchmark with workloada
- The results (10 runs) are weird.
Benchmark | Hase / Original Ratio |
---|---|
RunTime(ms) | 0.98 (should be > 1) |
Throughput(ops/sec) | 1.02 (should be < 1) |
Apache
- By configuring the mpm_worker module, it seems that I was able to run one worker process with one thread (together with one master process). However, recording the worker process results in no cpu trace.
- Running ab while recording the worker process even increases the throughput by 50%! wrk shows no such thing.
from hase.