I would like to use your library to count cycles and instructions as a part of a stand

Provide integration into testing.B about perf-utils HOT 7 CLOSED

hodgesds commented on July 21, 2024

Provide integration into testing.B

from perf-utils.

Comments (7)

hodgesds commented on July 21, 2024

This is a really interesting idea, I think I have it mostly working in master now:

go test  -bench=.
goos: linux
goarch: amd64
pkg: github.com/hodgesds/iouring-go/go/src/github.com/hodgesds/perf-utils
BenchmarkProfiler-8              2050809               588 ns/op              32 B/op          1 allocs/op
BenchmarkCPUCycles-8               30423             39914 ns/op              32 B/op          1 allocs/op
BenchmarkThreadLocking-8        254143965                4.69 ns/op            0 B/op          0 allocs/op
BenchmarkRunBenchmarks-8         3119304               388 ns/op              1336 hardware_cpu_cycles/op             3314 hardware_instructions/op            0 B/op          0 allocs/op
PASS
ok      github.com/hodgesds/iouring-go/go/src/github.com/hodgesds/perf-utils    97.906s

from perf-utils.

clausecker commented on July 21, 2024

Very nice! Some issues I found: you seem to run f for b.N times. That was not really my intent; f should contain the logic to run the benchmark b.N times itself as otherwise, the entire point of b.N is negated (i.e. compensating for overhead in the support code). The RunBenchmark function should only call f once for each performance counter.

I think it might also be useful to shorten the names of the counters. Right now, they are quite bulky and make the out put hard to read. Here are some suggestions:

hardware_instructions → instructions / insns
hardware_cpu_cycles → cycles / cy
hardware_cache_ref → cache_refs / refs
hardware_cache_miss→ cache_misses / miss
hardware_bus_cycles → bus_cycles / buscy
hardware_stalled_cycles_fronted → stalls_front / frstall
hardware_stalled_cycles_backend → stalls_back / bestall
hardware_ref_cpu_cycles → ref_cycles / refcy
software_cpu_clock → cpu_clocks
software_task_clocks → task_clocks
software_page_faults → page_faults / faults
software_context_switches → context_switches / switches
software_cpu_migrations → migrations
software_minor_page_faults → minor_faults
software_major_page_faults → major_faults
software_alignment_faults → align_faults
software_emulation_faults → emu_faults
for the cache part, just leave off the hw_cache_ prefix

Another thing I noticed in my testing is that the strict flag is not ideal as is. If I run a benchmark with this feature turned on on a system that doesn't support performance counters or doesn't allow me to use them, the benchmark output is nearly unreadable among all the log messages (made even worse by the log message being printed b.N times len(eventAttr) times!). So perhaps a better idea would be:

if strict is not set, silently ignored unsupported counters. Do not log anything (testing.Verbose() could be used here, but the guidelines say not to make log output dependent on it as that can influence benchmark timings)
if strict is set, log the reason for the failure and then b.Skip() the benchmark (allowing people without performance counters to still run the benchmark suite to completion)

As for a minor note, I saw that you passed the last argument as

eventAttrs ...*unix.PerfEventAttr

I don't quite see why a pointer is used here. eventAttrs will be passed as a slice, so there shouldn't be any inefficiency with using members of type unix.PerfEventAttr directly. But perhaps there's a good reason for this choice.

If you like, I can make a patch set addressing these issues and send you a PR.

from perf-utils.

hodgesds commented on July 21, 2024

I'm not opposed to any of those things, I had a branch where I tried to get rid of the strict parameter and just have a preB function that could get passed in. That way the caller could determine if they wanted to lock the executing goroutine. It's a little tricky because It kind of goes back to the counts vs samples debate and in theory it should be able to support both (just change a few args to perf_event_open and update the PerfEventAttr struct).

Using the non pointer is fine since most of the helper functions return PerfEventAttr structs.

The naming function is not something i'm proud of, so any improvements would be welcome 😄 I would like to keep the hw_ and sw_ prefixes because it gives insight into the type of PerfEventAttr.

from perf-utils.

clausecker commented on July 21, 2024

It kind of goes back to the counts vs samples debate

I'm unfortunately not aware of this debate. What do you mean?

I tried to get rid of the strict parameter and just have a preB function that could get passed in. That way the caller could determine if they wanted to lock the executing goroutine.

As of now, the strict parameter doesn't influence whether the executing go routine is locked to a thread. It merely influences the behaviour on failure to obtain a performance counter. Is there anything I missed?

from perf-utils.

hodgesds commented on July 21, 2024

I added another parameter to optionally lock to a goroutine. It added a bunch of other logic in the code however, with the basic idea being that you need to create a perf group profiler with all the current threads in the process (ie. where every goroutine can run). The problem is this is racey with the runtime so new threads could be created while the child profilers are being created.

Other than that I cleaned up some of the naming so that should be far more legible on the output.

from perf-utils.

clausecker commented on July 21, 2024

Instead of multiple boolean arguments, it would be much easier to have an int bit mask. This scales better when future options will be added. Also consider the usability of the current design. A call looks like this:

RunBenchmarks(b, f, true, false, attrs...)

Nobody is going to be sure what those true and false operands really mean before looking at the manual for every call.

from perf-utils.

hodgesds commented on July 21, 2024

I've cleaned up these helper functions a bit (they still need some refactoring), but I've tested with both locked and non-locked versions and I get roughly the same results. The non locked version is tricky because the go runtime will create threads which makes the setup more difficult.

from perf-utils.

Provide integration into testing.B about perf-utils HOT 7 CLOSED

Comments (7)

Related Issues (1)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent