divan's Introduction

Divan

Comfy benchmarking for Rust projects, brought to you by Nikolai Vazquez.

Sponsor

If you or your company find Divan valuable, consider sponsoring on GitHub or donating via PayPal. Sponsorships help me advance what's possible with benchmarking in Rust.

Guide

A guide is being worked on. In the meantime, see:

Getting Started

  1. Add the following to your project's Cargo.toml:

    [dev-dependencies]
    divan = "0.1.14"
    
    [[bench]]
    name = "example"
    harness = false
  2. Create a benchmarks file at benches/example.rs¹ with your benchmarking code:

    fn main() {
        // Run registered benchmarks.
        divan::main();
    }
    
    // Register a `fibonacci` function and benchmark it over multiple cases.
    #[divan::bench(args = [1, 2, 4, 8, 16, 32])]
    fn fibonacci(n: u64) -> u64 {
        if n <= 1 {
            1
        } else {
            fibonacci(n - 2) + fibonacci(n - 1)
        }
    }
  3. Run your benchmarks with cargo bench:

    example       fastest  │ slowest  │ median   │ mean     │ samples │ iters
    ╰─ fibonacci           │          │          │          │         │
       ├─ 1       0.626 ns │ 1.735 ns │ 0.657 ns │ 0.672 ns │ 100     │ 819200
       ├─ 2       2.767 ns │ 3.154 ns │ 2.788 ns │ 2.851 ns │ 100     │ 204800
       ├─ 4       6.816 ns │ 7.671 ns │ 7.061 ns │ 7.167 ns │ 100     │ 102400
       ├─ 8       57.31 ns │ 62.51 ns │ 57.96 ns │ 58.55 ns │ 100     │ 12800
       ├─ 16      2.874 µs │ 3.812 µs │ 2.916 µs │ 3.006 µs │ 100     │ 200
       ╰─ 32      6.267 ms │ 6.954 ms │ 6.283 ms │ 6.344 ms │ 100     │ 100

See #[divan::bench] for info on benchmark function registration.

Examples

Practical example benchmarks can be found in the examples/benches directory. These can be benchmarked locally by running:

git clone https://github.com/nvzqz/divan.git
cd divan

cargo bench -q -p examples --all-features

More thorough usage examples can be found in the #[divan::bench] documentation.

License

Like the Rust project, this library may be used under either the MIT License or Apache License (Version 2.0).

Footnotes

  1. Within your crate directory, i.e. $CARGO_MANIFEST_DIR

divan's People

Contributors

cuviper · dnaka91 · nvzqz · thomcc


divan's Issues

A note of caution on putting benchmarks inside the crate

Rust doesn't inline across crate boundaries unless you use #[inline] or LTO, whereas inside the crate it can inline anything it thinks might be worthwhile. That means that a benchmark defined inside a crate may not be getting an accurate measure of the performance that a user of the crate would see.

A practical example of this was just noticed in a library I use, where the existing (libtest, not Divan) benchmarks inside the crate showed nearly twice the performance of Criterion benchmarks outside the crate.

Therefore, while putting benchmarks inside of your crate is very convenient, it is not necessarily a good idea in all cases.

shepmaster/jetscii#57 (comment)
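
For readers who want the mitigation spelled out, here is a minimal sketch (parse_digit is a hypothetical function; whether #[inline] is appropriate depends on the API):

// Without `#[inline]`, callers in other crates (including out-of-crate
// benchmarks) cannot inline this function unless LTO is enabled; the
// attribute makes the body available for cross-crate inlining.
#[inline]
pub fn parse_digit(b: u8) -> Option<u8> {
    b.is_ascii_digit().then(|| b - b'0')
}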

Using `with_inputs` or not

Hi,

I'm not sure I understand the use of with_inputs.

Let's say I have this benchmark:

#[divan::bench(
    types = [Ascii, Unicode],
    consts = LENS,
)]
fn to_ascii_uppercase<G: GenString, const N: usize>(bencher: Bencher) {
    let mut gen = G::default();
    bencher
        .counter(CharsCount::new(N))
        .with_inputs(|| gen.gen_string(N))
        .input_counter(BytesCount::of_str)
        .bench_local_refs(|s| s.to_ascii_uppercase());
}

I get that the time to create the strings wouldn't affect the benchmark.

But what if I create the string in the benchmark function body instead, like this?

#[divan::bench(
    types = [Ascii, Unicode],
    consts = LENS,
)]
fn to_ascii_uppercase<G: GenString, const N: usize>(bencher: Bencher) {
    let mut gen = G::default();
    let s = gen.gen_string(N);
    bencher
        .bench_local(|| s.to_ascii_uppercase());
}

The time to create the string is also not included in the benchmark, right?
Is the use of with_inputs, then, mostly to get additional helpers such as input_counter?
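
A standalone sketch contrasting the two shapes (placeholder data instead of the GenString setup; both exclude setup from timing, but with_inputs regenerates the input for every iteration while a captured value is built once and reused):

use divan::Bencher;

fn main() {
    divan::main();
}

// Fresh input per iteration: `with_inputs` runs untimed before each
// iteration, so every call sees a newly allocated String.
#[divan::bench]
fn fresh_input(bencher: Bencher) {
    bencher
        .with_inputs(|| "divan".repeat(64))
        .bench_local_refs(|s| s.to_ascii_uppercase());
}

// One input shared across all iterations: its construction is also
// untimed, but the same allocation is reused every iteration, which can
// make results unrepresentatively cache-warm.
#[divan::bench]
fn shared_input(bencher: Bencher) {
    let s = "divan".repeat(64);
    bencher.bench_local(|| s.to_ascii_uppercase());
}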

Paired Benchmarks

Paired benchmarking spreads measurement noise across benchmarks. It is used in Tango.

The general algorithm for measuring performance using paired benchmarking is as follows:

  1. Prepare the same input data for both the baseline and candidate algorithms.
  2. Execute the baseline algorithm and measure its execution time.
  3. Execute the candidate algorithm and measure its execution time.
  4. Record the difference in runtime between the baseline and candidate algorithms.
  5. These steps constitute a single run, which serves as a distinct benchmark observation. Multiple runs are then performed, with each subsequent run employing a new payload and a randomized order in which the baseline and candidate functions are executed. Randomization is necessary to account for different CPU states and cache effects.
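
To make the procedure concrete, here is a minimal sketch of steps 1 to 5 (a hypothetical helper, not an existing Divan or Tango API; alternation stands in for real randomization):

use std::time::Instant;

fn paired_diffs<T: Clone>(
    mut make_input: impl FnMut() -> T,
    baseline: impl Fn(T),
    candidate: impl Fn(T),
    runs: usize,
) -> Vec<i128> {
    let mut diffs = Vec::with_capacity(runs);
    for i in 0..runs {
        let input = make_input(); // same payload for both algorithms
        let diff = if i % 2 == 0 {
            let t = Instant::now();
            baseline(input.clone());
            let base = t.elapsed().as_nanos() as i128;
            let t = Instant::now();
            candidate(input);
            base - t.elapsed().as_nanos() as i128
        } else {
            let t = Instant::now();
            candidate(input.clone());
            let cand = t.elapsed().as_nanos() as i128;
            let t = Instant::now();
            baseline(input);
            t.elapsed().as_nanos() as i128 - cand
        };
        diffs.push(diff); // one observation: baseline time minus candidate time
    }
    diffs
}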

The advantages of using paired benchmarking include:

  • The difference metric is less susceptible to ephemeral biases in the system, as common-mode biases are eliminated.
  • Differences in means typically have lower variance since both algorithms are tested on the same input.
  • It becomes easier to identify and eliminate outliers that are not caused by algorithm changes, thereby enhancing test sensitivity.

@bazhenov, I would love to collaborate on how this approach could look in Divan. 🙂

Any good way to set group option "on this level"?

I'd love to separate my project's benchmark definitions into multiple files, which divan seems to support via multiple [[bench]] entries in Cargo.toml. However, I'd also like to set benchmark group options for each of these files, so that the #[bench] annotations don't have to repeat the same arguments over and over. Is there a good way to do that without introducing an "interim" module in each file?

I've tried:

Having a "main" file that uses submodules:

#[bench_group(threads = THREADS)]
mod multi_threaded;

fn main() {
    divan::main();
}

That fails because proc-macro attributes on non-inline (external-file) modules are not stable.

Using a module-global bench_group annotation

Same as above, one main.rs file that uses mod multi_threaded; but inside the multi_threaded.rs file,

#![bench_group(threads = THREADS)]

This fails because inner module-wide proc-macro attributes aren't stable.

Using an interim module in each benchmark file

Here, put the main function into multi_threaded.rs, then pull in a submodule inside the file as a layer to set those benchmark options on:

fn main() {
    divan::main();
}

#[divan::bench_group(
    threads = THREADS,
)]
mod multi_threaded {
    // ...
}

This works, but results in a tree like:

     Running benches/multi_threaded.rs (target/release/deps/multi_threaded-77aac12e33714668)
multi_threaded       fastest │ slowest │ median │ mean │ samples │ iters
╰─ multi_threaded            │         │        │      │         │
   ├─ bench_direct           │         │        │      │         │

...and I'd love to get rid of that interstitial module layer.

Possible solutions

  • Have a main function that accepts benchmark group options?
  • Allow setting the "default" benchmark options as globally-mutable state? (yikes?!)

warmup

Hello, it would be nice if it were possible to add warm-up runs. In my benchmark case the first run is always slower because some data has to be fetched into memory, and since that happens only once, it always drags down the "real" fastest result.
I would suggest #[divan::bench(warmup = 10)] or something similar, with the default being either zero or one.
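
Until something like that exists, one workaround is an untimed warm-up pass before handing the closure to the bencher; a minimal sketch with placeholder load_data/process functions:

fn load_data() -> Vec<u8> {
    vec![0u8; 1 << 20] // stand-in for data that must be faulted into memory
}

fn process(data: &[u8]) -> u64 {
    data.iter().map(|&b| b as u64).sum()
}

#[divan::bench]
fn warm_bench(bencher: divan::Bencher) {
    let data = load_data();
    divan::black_box(process(&data)); // warm-up pass, not timed
    bencher.bench_local(|| process(&data));
}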

Benches in src for lib

All the given code examples assume the benches are created in /examples/...
Is it possible to have benches in /src/something.rs for a library project?
I tried to use #[divan::bench] the same way I use #[test], in every module, and used
[dependencies]
divan = "0.1.0"
in Cargo.toml; however, cargo bench does not produce any results (it only runs tests).
Maybe we need some examples of how to do it in that case.

Benchmark within crate source code isn't running

I'm trying out Divan on Linux. My project uses a workspace layout. Within that workspace I have a crate foo. Within foo/benches I create a foo.rs file that has:

fn main() { eprintln!("Benchmark main"); divan::main() }

and register it within foo/Cargo.toml:

[[bench]]
name = "foo"
harness = false

Then within foo/src/lib.rs I put:

#[divan::bench]
pub fn bench() {
  eprintln!("Running registered benchmark");
}

I see "Benchmark main" when I run cargo bench -p foo but not "Running registered benchmark". If I move the bench function to foo/benches/foo.rs then everything works. Not sure what I need to do to make this work.

Semi-related: does putting benchmark code within the crate mean that downstream dependents compile the benchmarks? I'm hoping it at least gets dead-code-stripped, but it would be good to call this out in the docs.

Support for wasm_bindgen

Hi! What do you think about benchmarking in a WASM browser environment? This is something that criterion currently lacks.

Adding unrelated code affects benchmark

Heya! I've been doing advent-of-code and trying out Divan while doing so. Someone showed me some code and I added it to my project to benchmark. Simply adding the code to lib.rs caused one of my benchmarks to rise by 100 microseconds. Commenting or uncommenting this one line (which is not used in the benchmark) in lib.rs causes the part2_nom benchmark to inflate or deflate accordingly.

// pub mod part2_subject;

I've pared the example down as much as I could at the moment. I can try to make a smaller one if I have more time.

https://github.com/ChristopherBiscardi/advent-of-code/tree/divan-benchmark-inflation

This should be enough to run the benchmarks:

cargo bench -p day-01

benchmarks with subject module

day_01                 fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ part1               35.7 µs       │ 93.41 µs      │ 36.14 µs      │ 37.26 µs      │ 100     │ 100
├─ part2               134.3 µs      │ 163.5 µs      │ 134.7 µs      │ 137.1 µs      │ 100     │ 100
├─ part2_aho_corasick  26.7 ms       │ 37.09 ms      │ 27.08 ms      │ 27.31 ms      │ 100     │ 100
├─ part2_nom           467.4 µs      │ 568.4 µs      │ 472.3 µs      │ 480.6 µs      │ 100     │ 100
╰─ part2_subject       135.7 µs      │ 183.7 µs      │ 136.2 µs      │ 138.5 µs      │ 100     │ 100

benchmarks without subject module

day_01                 fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ part1               33.58 µs      │ 99.66 µs      │ 33.99 µs      │ 35.05 µs      │ 100     │ 100
├─ part2               134.9 µs      │ 222.4 µs      │ 139.5 µs      │ 141.3 µs      │ 100     │ 100
├─ part2_aho_corasick  27.6 ms       │ 29.34 ms      │ 28.48 ms      │ 28.41 ms      │ 100     │ 100
╰─ part2_nom           361.2 µs      │ 455.5 µs      │ 375.9 µs      │ 378.3 µs      │ 100     │ 100

Here's each run side by side for ease of viewing:

├─ part2_nom           467.4 µs      │ 568.4 µs      │ 472.3 µs      │ 480.6 µs      │ 100     │ 100
╰─ part2_nom           361.2 µs      │ 455.5 µs      │ 375.9 µs      │ 378.3 µs      │ 100     │ 100

Support for generic benchmark groups

Generic types and consts options should be providable through #[divan::bench_group]:

#[divan::bench_group(types = [Vec<i32>, HashSet<i32>, BTreeSet<i32>, LinkedList<i32>])]
mod group {
    #[divan::bench]
    fn bench1<T>() {}

    #[divan::bench]
    fn bench2<T>() {}
}

This would be equivalent to:

#[divan::bench_group]
mod group {
    #[divan::bench(types = [Vec<i32>, HashSet<i32>, BTreeSet<i32>, LinkedList<i32>])]
    fn bench1<T>() {}

    #[divan::bench(types = [Vec<i32>, HashSet<i32>, BTreeSet<i32>, LinkedList<i32>])]
    fn bench2<T>() {}
}

This can be achieved by having #[divan::bench_group] rewrite each #[divan::bench] to include the appropriate types or consts option. If a benchmark already has its own types or consts option, we can skip it. To automatically make types visible within the group without importing from the super scope, we could create our own type aliases in the parent to then be used via super:: on rewrite.

RFC: Introduce `BenchmarkContext` argument for closure passed to benchmark function

For the bench function on divan::Bencher, I'd like to introduce a BenchmarkContext (naming open to bikeshedding). The motivating use case is to expose a function called something like inject_simulated_delay, which can be used to add simulated overhead to the benchmark without actually incurring it. An example is benchmarking a caching algorithm that would normally do expensive I/O on a miss: in a microbenchmark, it's more convenient to simulate performance by injecting how long I think the I/O would take.

You can hang other state off the context too (e.g. which sample/iteration the benchmark is currently on, in case that matters for some applications).

Not sure if there's any better alternative to make this work, since bench functions consume self, which is intrinsic to the design.

Happy to actually implement it but was curious about thoughts first.
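
To make the shape of the idea concrete, here is a toy model (entirely hypothetical; Divan has no BenchmarkContext today):

use std::time::{Duration, Instant};

// Toy model: simulated delay is accumulated as bookkeeping rather than
// slept, then added to the measured wall time afterwards.
struct BenchmarkContext {
    simulated: Duration,
}

impl BenchmarkContext {
    fn inject_simulated_delay(&mut self, d: Duration) {
        self.simulated += d;
    }
}

fn main() {
    let mut ctx = BenchmarkContext { simulated: Duration::ZERO };
    let start = Instant::now();
    // Benchmarked body: a simulated cache miss charges the would-be I/O cost.
    ctx.inject_simulated_delay(Duration::from_micros(250));
    let measured = start.elapsed() + ctx.simulated;
    println!("measured (incl. simulated I/O): {measured:?}");
}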

Allow for filtering beyond function names

Currently, filtering only works up to crate → module → function. Parameters to the function are not considered: generic types, constants, and thread counts.

The following code:

#[divan::bench(
    types = [i32, String],
    consts = [0, 42],
    threads = [1, 2],
)]
fn bench<T, const N: usize>() {}

...produces this output tree structure:

example
╰─ bench
   ├─ i32
   │  ├─ 0
   │  │  ├─ t=1
   │  │  ╰─ t=2
   │  ╰─ 42
   │     ├─ t=1
   │     ╰─ t=2
   ╰─ String
      ├─ 0
      │  ├─ t=1
      │  ╰─ t=2
      ╰─ 42
         ├─ t=1
         ╰─ t=2

However, we can only filter against example::bench. We cannot filter against example::bench::String::42::t=2. I think this is something we should try to support.

We should also consider whether we want benchmark parameters to be treated like namespacing when filtering. Perhaps instead they should be made to look more like Rust code: example::bench::<String, 42>(t=2)? I think the previous format is easier to reason about.

`bencher.with_inputs()` affects benchmark timing

  • While trying Divan, I noticed that when I added a bench with a heavier input, the timing was quite a bit higher.
  • Since I had only used basic benches prior, I thought it was my function in bench_values(), yet when I removed it there was still notable overhead.
  • After removing the mock-data function call for the input, that overhead was gone.

The docs for with_inputs() imply this should not contribute to timing:

Generate inputs for the benchmarked function.

Time spent generating inputs does not affect benchmark timing.

$ cargo bench

Timer precision: 10 ns
example                fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ with_dynamic_input  660.7 ns      │ 1.312 µs      │ 670.7 ns      │ 718.5 ns      │ 100     │ 200
╰─ with_static_input   0.004 ns      │ 0.008 ns      │ 0.004 ns      │ 0.004 ns      │ 100     │ 819200
// benches/example.rs
use divan::{black_box, Bencher};
use std::collections::HashMap;

fn main() {
  divan::main();
}

#[divan::bench]
fn with_dynamic_input(b: Bencher) {
    b.with_inputs(|| {
        mock_data()
    }).bench_values(|_| {
        black_box(42);
    });
}

#[divan::bench]
fn with_static_input(b: Bencher) {
    b.with_inputs(|| {
        42
    }).bench_values(|_| {
        black_box(42);
    });
}


// Something to generate a larger input:
// The actual method randomizes fixed-length strings

fn mock_data() -> Vec<String> {
    (0..128).map(|_| {
      "divan".to_string()
    }).collect()
}

I've tried looking through the current docs and the announcement blog post, but it's not clear to me why this is happening.

If I instead swap the mock data generator function for one that just sleeps instead of allocates:

use std::{thread, time};

fn wait_a_bit() {
  let delay = time::Duration::from_nanos(500);
  thread::sleep(delay);
}

It will take longer to complete, but does not affect the timing:

Timer precision: 10 ns
example                fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ with_dynamic_input  0.002 ns      │ 1.747 ns      │ 0.018 ns      │ 0.093 ns      │ 100     │ 409600
╰─ with_static_input   0 ns          │ 0.002 ns      │ 0 ns          │ 0.001 ns      │ 100     │ 819200

It only affects timing when I return the mocked data, and the impact scales with the size of the input. I assume it's related to the example description of Drop behaviour here, but I'm not sure how to exclude that overhead.

Estimating timer accuracy

I had a thought about this and was wondering what you think:

Divan does not use timer accuracy because it wasn’t clear how accuracy can be obtained without a more accurate reference timer, when Instant is usually implemented with the most accurate timer. I’m open to making sample size scaling smarter, but the current approach works well enough.

I don't have the idea fully realized but I'm wondering if this approach might work:

  1. Benchmark how long it takes to do a clock frequency's worth of additions (e.g. if the CPU is 6 GHz, do 6 billion additions). You don't actually need to retrieve the clock frequency unless you want to provide better bounds on how long the measurement takes. This gives us an estimate for how long it takes to add a single number. By picking a large number, we can make the overhead of the coarser clock arbitrarily insignificant (e.g. if it's 1 µs and we measured 1 s worth of additions, our estimate of 1 addition is accurate to within 0.0001%).
  2. Sample the inaccurate clock every N iterations of the loop such that time for N additions is within the accuracy you want out of the inaccurate clock (e.g. if an addition takes 1ns and RDTSC takes 41ns, 1 million iterations through the loop should get us to within 1.68 ns clock accuracy). That way you're only sampling the timer infrequently to get a higher resolution estimate across more samples.
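
A rough sketch of step 1 (illustrative only; black_box itself adds a small per-iteration cost):

use std::hint::black_box;
use std::time::Instant;

// Estimate the cost of one dependent addition by timing a long chain of
// them; with about a second of work, a microsecond of clock overhead
// becomes negligible.
fn estimate_add_ns(iters: u64) -> f64 {
    let mut acc: u64 = 0;
    let t0 = Instant::now();
    for i in 0..iters {
        acc = black_box(acc.wrapping_add(i)); // each add depends on the last
    }
    let elapsed_ns = t0.elapsed().as_nanos() as f64;
    black_box(acc);
    elapsed_ns / iters as f64
}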

Measuring RDTSC cost accurately in terms of Instant is a similar approach: measure Instant::now, RDTSC, and nanosleep for 1 second. The Instant elapsed gives you 1s + sleep jitter + Instant overhead + 2 × RDTSC overhead. Since nanosleep has a granularity of 100ns, you should be able to reverse-compute the most likely values of the overheads based on the actual value (i.e. round down to the nearest 100ns and see if the answers make sense; if they're too fast for the RDTSC/Instant overhead estimates, subtract another 100ns).

It may be a good idea to pin the CPU affinity to a single core when doing this, to remove Linux kernel scheduling vagaries. Also, P/E cores might make things complicated, but I think that needs to be solved at a different level (i.e. controlling the affinity of benchmarks running on these cores & knowing which core a benchmark is pinned to). For benchmarks, only using P cores may be a good default, and the user could explicitly select if they want a benchmark run on E cores instead (or the same benchmark run for both P & E).

I'm not 100% sure about the math/logic here, but it seems to me like this approach should work. Also, I'll note that additions aren't actually 1 ns unless you have blocked the pipeline. For example, my CPU has 12 execution ports, which means it can actually do a dependency-free addition in 83 picoseconds. That's something else to be careful with when reasoning about benchmarking additions and counting loops; in the former, I think you want to benchmark dependent additions, because the loop over the benchmark is typically going to be a similarly dependent addition on the previous value of the loop counter. Obviously this stuff only matters for functions that take on the order of nanoseconds.

RFC: Ability to add post-result counters

I was benchmarking a cache library and would like the output to contain the cache hit rate, so that I can understand the benchmark's performance across multiple dimensions. I'm thinking something like:

let mut hits = 0;
let mut misses = 0;
b.output_counter(|| counter::Ratio::new("Hit performance", hits, hits + misses))
    .bench(|| {
        // adjust hits & misses
    })

It's a little bit complicated since hits & misses borrowing is not going to work like that without unsafe / RefCell / atomics. An alternative approach could be for .bench to return something:

let mut hits = 0;
let mut misses = 0;
b.bench(|| {
    // adjust hits & misses
    (hits, misses)
}).output_counter(|(hits, misses)| counter::Ratio::new("Hit performance", hits, hits + misses))

This is all a bit hand-wavy, as I don't have the exact details of the API in mind (open to suggestions), but I'm curious whether there's any interest in me adding something like this.

Differentiate provided generic types on name collision

The following code:

struct String;

#[divan::bench(types = [String, std::string::String])]
fn bench<T>() {}

...produces this output tree structure:

example
╰─ bench
   ├─ String
   ╰─ String

This makes it very difficult to tell which String is being referred to. It's not impossible, since when names collide the sort order follows input order.
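
A workaround that may help today, assuming Divan prints the tokens as written in the types option rather than a resolved path: give one of the colliding types a distinct alias.

struct String;
type StdString = std::string::String;

#[divan::bench(types = [String, StdString])]
fn bench<T>() {}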

LLD does not seem amused by your linker tricks

Attempting to compile any divan benchmark with rustc configured to use LLD as the linker results in this sort of error wall:

error: linking with `cc` failed: exit status: 1
  |
  = note: LC_ALL="C" PATH=… "cc" "-m64" "/tmp/rustcEomvAT/symbols.o" […many .rcgu.o and .rlib arguments elided…] "-Wl,-Bdynamic" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl" "-lc" "-Wl,--eh-frame-hdr" "-Wl,-z,noexecstack" "-o" "…/target/release/deps/string-2da0f7c46c7d5894" "-Wl,--gc-sections" "-pie" "-Wl,-z,relro,-z,now" "-Wl,-O1" "-nodefaultlibs" "-fuse-ld=lld"
  = note: ld.lld: error: undefined symbol: __start_linkme_GROUP_ENTRIES
          >>> referenced by divan.4931c5b03bfdaec2-cgu.11
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.11.rcgu.o:(divan::entry::GROUP_ENTRIES::he3bc601364989c0c) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> referenced by divan.4931c5b03bfdaec2-cgu.12
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.12.rcgu.o:(divan::divan::Divan::run_action::h868ac4b65a34c7a0) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> the encapsulation symbol needs to be retained under --gc-sections properly; consider -z nostart-stop-gc (see https://lld.llvm.org/ELF/start-stop-gc)
          
          ld.lld: error: undefined symbol: __stop_linkme_GROUP_ENTRIES
          >>> referenced by divan.4931c5b03bfdaec2-cgu.11
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.11.rcgu.o:(divan::entry::GROUP_ENTRIES::he3bc601364989c0c) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> referenced by divan.4931c5b03bfdaec2-cgu.12
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.12.rcgu.o:(divan::divan::Divan::run_action::h868ac4b65a34c7a0) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          
          ld.lld: error: undefined symbol: __start_linkm2_GROUP_ENTRIES
          >>> referenced by divan.4931c5b03bfdaec2-cgu.11
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.11.rcgu.o:(divan::entry::GROUP_ENTRIES::he3bc601364989c0c) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> referenced by divan.4931c5b03bfdaec2-cgu.12
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.12.rcgu.o:(divan::divan::Divan::run_action::h868ac4b65a34c7a0) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> the encapsulation symbol needs to be retained under --gc-sections properly; consider -z nostart-stop-gc (see https://lld.llvm.org/ELF/start-stop-gc)
          
          ld.lld: error: undefined symbol: __stop_linkm2_GROUP_ENTRIES
          >>> referenced by divan.4931c5b03bfdaec2-cgu.11
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.11.rcgu.o:(divan::entry::GROUP_ENTRIES::he3bc601364989c0c) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> referenced by divan.4931c5b03bfdaec2-cgu.12
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.12.rcgu.o:(divan::divan::Divan::run_action::h868ac4b65a34c7a0) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> referenced by divan.4931c5b03bfdaec2-cgu.12
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.12.rcgu.o:(divan::divan::Divan::run_action::h868ac4b65a34c7a0) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          
          ld.lld: error: undefined symbol: __start_linkme_BENCH_ENTRIES
          >>> referenced by divan.4931c5b03bfdaec2-cgu.11
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.11.rcgu.o:(divan::entry::BENCH_ENTRIES::h58af2438b168de8a) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> referenced by divan.4931c5b03bfdaec2-cgu.12
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.12.rcgu.o:(divan::divan::Divan::run_action::h868ac4b65a34c7a0) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> the encapsulation symbol needs to be retained under --gc-sections properly; consider -z nostart-stop-gc (see https://lld.llvm.org/ELF/start-stop-gc)
          
          ld.lld: error: undefined symbol: __stop_linkme_BENCH_ENTRIES
          >>> referenced by divan.4931c5b03bfdaec2-cgu.11
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.11.rcgu.o:(divan::entry::BENCH_ENTRIES::h58af2438b168de8a) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> referenced by divan.4931c5b03bfdaec2-cgu.12
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.12.rcgu.o:(divan::divan::Divan::run_action::h868ac4b65a34c7a0) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          
          ld.lld: error: undefined symbol: __start_linkm2_BENCH_ENTRIES
          >>> referenced by divan.4931c5b03bfdaec2-cgu.11
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.11.rcgu.o:(divan::entry::BENCH_ENTRIES::h58af2438b168de8a) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> referenced by divan.4931c5b03bfdaec2-cgu.12
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.12.rcgu.o:(divan::divan::Divan::run_action::h868ac4b65a34c7a0) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> the encapsulation symbol needs to be retained under --gc-sections properly; consider -z nostart-stop-gc (see https://lld.llvm.org/ELF/start-stop-gc)
          
          ld.lld: error: undefined symbol: __stop_linkm2_BENCH_ENTRIES
          >>> referenced by divan.4931c5b03bfdaec2-cgu.11
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.11.rcgu.o:(divan::entry::BENCH_ENTRIES::h58af2438b168de8a) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          >>> referenced by divan.4931c5b03bfdaec2-cgu.12
          >>>               divan-988a33f007dc5e94.divan.4931c5b03bfdaec2-cgu.12.rcgu.o:(divan::divan::Divan::run_action::h868ac4b65a34c7a0) in archive /home/hadrien/Bureau/RustModernization/divan/target/release/deps/libdivan-988a33f007dc5e94.rlib
          collect2: error: ld returned 1 exit status
          

error: could not compile `examples` (bench "string") due to previous error

Given that many people who care about build performance use LLD as their linker, and IIRC it's destined to become rustc's default linker on Linux once remaining bugs are sorted out, you'll probably want to figure out what is going on here and make your linker tricks work with LLD if at all possible :)
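
For what it's worth, the diagnostic itself suggests a candidate workaround (untested here): pass -z nostart-stop-gc so that --gc-sections retains the __start_/__stop_ encapsulation symbols, e.g. in .cargo/config.toml:

[target.x86_64-unknown-linux-gnu]
rustflags = ["-Clink-arg=-fuse-ld=lld", "-Clink-arg=-Wl,-z,nostart-stop-gc"]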

Output format to JSON

I'm trying divan at the moment and find it super easy to get started with 🎉

Unfortunately, I need to integrate this into CI. Ideally I'd find a way to get the results into a JSON file so I can create a comment on a PR that shows the results. (Example here)

Is it possible to do this with divan?

How do I initialise a structure so it's shared between threads of the same bench iteration?

I'm trying to benchmark a concurrent data structure, and I want to benchmark its read/write behaviour under thread contention. However, unlike all of the threaded examples in the documentation, this structure's performance characteristics change as it is modified: internal parts of it are consumed or rearranged by different threads, so it needs to be constructed anew for each run of the benchmark.

This means:

  • If the tested structure is made static, or initialised in the benchmark function body before calling divan::Bencher methods, then only the first iteration sees the structure as it was constructed. The other iterations see one whose contents have been consumed by the first iteration, with almost no work left to benchmark.

    #[divan::bench(threads=[1, 2, 4, 8, 16])]
    fn benchmark_function(bencher: divan::Bencher) {
        static x: MyStruct = create_structure();
        bencher
            .bench(|| x.consume_contents());
    }
  • If it is initialised in with_inputs, each thread gets its own copy of the whole structure, so they never contend.

    #[divan::bench(threads=[1, 2, 4, 8, 16])]
    fn benchmark_function(bencher: divan::Bencher) {
        bencher
            .with_inputs(|| create_structure())
            .bench_values(|x| x.consume_contents());
    }

Either the structure is constructed once, then shared among all iterations (the first option), or constructed separately for each thread, and never shared (the second option). I need a way to make it constructed once per benchmark run, and shared only among threads that are part of the same benchmark run. Do I correctly understand that this is currently not possible using the threads option?


My current workaround is to start a const number of threads myself inside the with_inputs closure and have them wait at a std::sync::Barrier, then as part of the bench_local_values closure, release the Barrier and join the threads to time them:

#[divan::bench(consts = [1, 2, 4, 8, 16])]
fn benchmark_function<const THREADS: usize>(bencher: divan::Bencher) {
    use std::sync::{Arc, Barrier};
    bencher
        .with_inputs(|| -> (Vec<std::thread::JoinHandle<_>>, _) {
            let x: Arc<MyStruct> = Arc::new(create_structure());
            let barrier = Arc::new(Barrier::new(THREADS + 1));
            let threads = (0..THREADS).map(|_| {
                let x = x.clone();
                let barrier = barrier.clone();
                std::thread::spawn(move || {
                    barrier.wait();
                    x.consume_contents();
                })
            }).collect();
            (threads, barrier)
        })
        .bench_local_values(|(threads, barrier)| {
            barrier.wait();
            for t in threads {
                t.join().unwrap();
            }
        });
}

This works, but there's a lot of code duplicating what I imagine Divan would do internally to implement the threads option.

I also see worse performance when benchmarking with 1 thread using this method than I do from an otherwise-identical benchmark with #[divan::bench(threads = [1])], probably because Divan doesn't use a Barrier when single-threaded. That's smart, and another reason why I feel this use-case could be handled by Divan itself.

Am I missing a better existing way to do this?

  • If yes, could an example be added illustrating it?

  • If no, do you think this use-case could be handled by Divan?

Feature request: nextest support for divan

The test runner nextest has support for running benchmarks with cargo nextest run --benches. Currently, benchmarks written with criterion create the necessary output for nextest to parse and understand.

It would be nice for divan to produce output in this format too. The format is not nextest-specific, to quote from the other issue:

Note that none of these changes do anything special for nextest -- they just make criterion implement more of the libtest command-line interface, just enough to make nextest happy.

Additional context:


Thanks again for creating such a wonderful library! 🧡

Fluctuating CPU MHz makes `tsc` profiling lie

This isn't a fully debugged issue report, nor do I have a clean reproduction recipe yet...

Sometimes when I benchmark, attempting to force divan to use TSC, I get results where the timer accuracy is 20 ns (this is GitHub Codespaces, so extremely heavily virtualized):

$ cargo bench --bench basic -- --timer tsc --sample-size 81920
    Finished bench [optimized] target(s) in 0.02s
     Running benches/basic.rs (target/release/deps/basic-2115956f1e3e5fd7)
Timer precision: 20.03 ns
basic               fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ get_bounds       0.306 ns      │ 0.564 ns      │ 0.306 ns      │ 0.322 ns      │ 100     │ 8192000
├─ get_cold         0.306 ns      │ 0.505 ns      │ 0.306 ns      │ 0.313 ns      │ 100     │ 8192000
├─ get_sat          0.307 ns      │ 0.697 ns      │ 0.307 ns      │ 0.315 ns      │ 100     │ 8192000
├─ get_unreachable  0.306 ns      │ 0.924 ns      │ 0.338 ns      │ 0.407 ns      │ 100     │ 8192000
├─ get_unsafe       0.306 ns      │ 0.468 ns      │ 0.306 ns      │ 0.308 ns      │ 100     │ 8192000
├─ harp             1.847 ns      │ 3.775 ns      │ 1.848 ns      │ 2.088 ns      │ 100     │ 8192000
╰─ wrapping_add     1.848 ns      │ 2.068 ns      │ 1.848 ns      │ 1.862 ns      │ 100     │ 8192000

However, every now and then, divan figures out the timer accuracy is 29.85 ns, in which case I see wildly differing values:

$ cargo bench --bench basic -- --timer tsc --sample-size 81920
    Finished bench [optimized] target(s) in 0.05s
     Running benches/basic.rs (target/release/deps/basic-2115956f1e3e5fd7)
Timer precision: 29.85 ns
basic               fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ get_bounds       0.228 ns      │ 0.364 ns      │ 0.228 ns      │ 0.237 ns      │ 100     │ 8192000
├─ get_cold         0.228 ns      │ 0.823 ns      │ 0.228 ns      │ 0.238 ns      │ 100     │ 8192000
├─ get_sat          0.229 ns      │ 0.354 ns      │ 0.229 ns      │ 0.232 ns      │ 100     │ 8192000
├─ get_unreachable  0.228 ns      │ 0.502 ns      │ 0.26 ns       │ 0.26 ns       │ 100     │ 8192000
├─ get_unsafe       0.228 ns      │ 1.162 ns      │ 0.228 ns      │ 0.269 ns      │ 100     │ 8192000
├─ harp             1.769 ns      │ 3.097 ns      │ 1.892 ns      │ 1.972 ns      │ 100     │ 8192000
╰─ wrapping_add     1.77 ns       │ 3.783 ns      │ 1.77 ns       │ 1.97 ns       │ 100     │ 8192000

This happens also if I don't specify a sample size, but I wanted to fix it for this example so that it's clear the difference isn't because of a different iters value.

This is an extreme microbenchmark, but I think I was seeing similar behaviour on larger benchmarks as well.

Actually, while writing this, I think I might've figured out the issue: different CPUs on this system run at different MHz, and they seem to be constantly changing.

Not sure what divan can do about that :-)

$ cat /proc/cpuinfo  | grep MHz
cpu MHz         : 3242.971
cpu MHz         : 2785.335
cpu MHz         : 3234.546
cpu MHz         : 3245.406
cpu MHz         : 2913.887
cpu MHz         : 2933.662
cpu MHz         : 3096.559
cpu MHz         : 2445.426
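
One mitigation worth trying: pin the process to a single core so TSC readings come from one clock domain. A sketch using the core_affinity crate (which Divan already links, judging by its dependency list); note that threads Divan spawns for threaded benches would still need their own pinning:

fn main() {
    // Pin the main thread to the first reported core before benchmarking.
    if let Some(core) = core_affinity::get_core_ids().and_then(|ids| ids.into_iter().next()) {
        core_affinity::set_for_current(core);
    }
    divan::main();
}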

Add statistically-significant improvement reporting

Similar to what criterion does, but I think a useful starting point would just be a ±% change in times between runs (if it's determined that the two runs differ significantly given the variance of each)!

I imagine this is somewhat blocked on writing out the previous benchmark results somewhere they can be referenced first!
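
As a sketch of what the minimal version could compute (illustrative numbers; Criterion's actual procedure is more involved):

// Percent change between two runs, gated on Welch's t statistic.
fn welch_t(mean_a: f64, var_a: f64, n_a: f64, mean_b: f64, var_b: f64, n_b: f64) -> f64 {
    (mean_a - mean_b) / (var_a / n_a + var_b / n_b).sqrt()
}

fn main() {
    let (old_mean, old_var) = (480.6, 25.0); // e.g. mean and variance of a prior run, in µs
    let (new_mean, new_var) = (378.3, 20.0); // current run
    let n = 100.0; // samples per run
    let t = welch_t(old_mean, old_var, n, new_mean, new_var, n);
    if t.abs() > 2.0 {
        // ~95% level for large n; a real report would use proper degrees of freedom.
        println!("change: {:+.1}%", (new_mean - old_mean) / old_mean * 100.0);
    } else {
        println!("no statistically significant change");
    }
}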

`cargo flamegraph` support

Using https://github.com/flamegraph-rs/flamegraph with

cargo flamegraph --bench benchmarks

(with benches/benchmarks.rs present and defined in Cargo.toml) leads to benchmark names being printed but no benchmark code actually being executed.

It would be great if there were support between divan and flamegraph!
If I'm doing something wrong and this should already work, please let me know.

Weirdly different benchmark results for code that should be fairly identical

I have some benchmarks that look like this:

use std::mem::MaybeUninit;

fn main() {
    let _ = memcache::CRATE_USED;
    divan::main();
}

fn weird_results_impl(b: divan::Bencher, size: usize) {
    const NUM_ITEMS: usize = 100_000;
    const CAPACITY: usize = NUM_ITEMS;
    let cache = vec![Default::default(); CAPACITY];
    let values = (0..NUM_ITEMS)
        .map(|_| vec![MaybeUninit::<u8>::uninit(); size].into_boxed_slice())
        .collect::<Vec<_>>();
    b.counter(divan::counter::ItemsCount::new(NUM_ITEMS))
        .with_inputs(|| {
            (
                cache.clone(),
                values
                    .iter()
                    .enumerate()
                    .map(|(idx, v)| (idx % CAPACITY, v.clone()))
                    .collect::<Vec<_>>(),
            )
        })
        .bench_local_refs(|(cache, refs)| {
            for (entry, mem) in refs {
                cache[*entry] = std::mem::take(mem);
            }
        });
}

#[divan::bench]
fn weird_results_4kib(b: divan::Bencher) {
    weird_results_impl(b, 4 * 1024);
}

#[divan::bench]
fn weird_results_10b(b: divan::Bencher) {
    weird_results_impl(b, 10);
}

There's a fairly large discrepancy between the two:

my-crate               fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ weird_results_4kib  165.4 µs      │ 211.1 µs      │ 173.5 µs      │ 174.8 µs      │ 100     │ 100
│                      604.2 Mitem/s │ 473.5 Mitem/s │ 576.2 Mitem/s │ 571.8 Mitem/s │         │
╰─ weird_results_10b   80.53 µs      │ 110.5 µs      │ 83.22 µs      │ 84.07 µs      │ 100     │ 100
                       1.241 Gitem/s │ 904.2 Mitem/s │ 1.201 Gitem/s │ 1.189 Gitem/s │         │

This was run with mimalloc set as the allocator. AFAICT I'm not dropping any memory within the benchmark loop, and the body of the loop shouldn't be doing anything more than shuffling some pointers around (i.e. it should be the same amount of shuffling between the two runs, I think). Is there something wrong with my benchmark, or is this a bug in divan?

Output format to Markdown Tables

It would be very cool to be able to output the result of a diff between two benchmark runs as a Markdown table, highlighting the best numbers in bold or italic or both, and even keeping this configurable.

Reuse thread pool for threaded tests

Being able to run tests multi-threaded simply by adding DIVAN_THREADS=XXX to get a feeling for contention is super nice.
However, it seems like every benchmark run uses scoped threads under the hood.

Running a threaded benchmark through samply record, I end up with well beyond 6k "tracks", each of which is extremely short-lived, and it is pretty much impossible to select any of the background threads to do proper profiling.

It also appears that a large portion of the main-thread time is actually spent creating/destroying the threads themselves, at least on macOS where I tested this:

[Screenshot (2023-12-04): profiler view of the benchmark run]

Better document running benchmarks outside of `cargo bench`

Running the benchmarks using cargo bench works just fine as expected.
It also prints the typical cargo Running benches/functions.rs (target/release/deps/functions-4227941eb3ee4115) line.

However, running that target directly only lists the included benchmarks; it does not run them.
This is a bit confusing if you want to run the benchmark directly in a profiler like samply.

Actually running the benchmarks requires the --bench flag:

divan/src/divan.rs

Lines 367 to 376 in 0ff8585

self.action = if matches.get_flag("list") {
    Action::List
} else if matches.get_flag("test") || !matches.get_flag("bench") {
    // Either of:
    // `cargo bench -- --test`
    // `cargo test --benches`
    Action::Test
} else {
    Action::Bench
};

However that flag is not documented at all in the command line flags:

divan/src/cli.rs

Lines 128 to 129 in 0ff8585

// ignored:
.args([ignored_flag("bench"), ignored_flag("nocapture"), ignored_flag("show-output")])

It would be nice to actually document that flag, and maybe even provide an example of how to run the benchmarks in a profiler like samply, which for me is as simple as running samply record target/release/deps/functions-4227941eb3ee4115 --bench
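
Concretely, the difference looks like this (binary name taken from the run above):

# Without --bench, the harness takes the test path (Action::Test above):
target/release/deps/functions-4227941eb3ee4115

# With --bench, the benchmarks actually run:
target/release/deps/functions-4227941eb3ee4115 --bench

# And under a profiler:
samply record target/release/deps/functions-4227941eb3ee4115 --bench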
