ordo-one / package-benchmark
Swift benchmark runner with many performance metrics and great CI support
License: Apache License 2.0
As suggested, maybe this can help us kill @dynamicReplacement:
Have you tried calling the same methods as Argument Parser does, but from your library? E.g. you could provide a protocol called `Benchmark` that has a `main` method that looks like this:
public static func main() async {
    // Set up everything that the benchmark needs
    do {
        var command = try parseAsRoot()
        if var asyncCommand = command as? AsyncParsableCommand {
            try await asyncCommand.run()
        } else {
            try command.run()
        }
    } catch {
        exit(withError: error)
    }
    // Record benchmark results etc.
}
Since all methods are public in Argument Parser you should be able to replicate anything the package is doing in your own main method. Maybe I am missing something here though!
We can integrate a few of the dependencies into the project (e.g. BenchmarkClock isn't used anywhere else).
I'm sorry to be popping open a bunch of issues here - if there's a better place to ask for guidance, I'm totally game.
I tried adding a simple benchmark setup to a library I'm working on, adding it as an external package that imports mine with a local reference. That all appears to work correctly: `swift package resolve` is happy, `swift build` works, and `swift package benchmark` appears to work as well - except that there's no benchmark data visible.
The output I get from the command is:
Building for debugging...
Build complete! (0.35s)
Building targets in release mode for benchmark run...
Build complete! Running benchmarks...
If I run `swift package benchmark list`, it's the exact same output.
(This is with package-benchmark 0.8.0 and Swift 5.8 (installed from Xcode 14.3 beta) on macOS with an M1 processor.)
swift -version:
swift-driver version: 1.75.1 Apple Swift version 5.8 (swiftlang-5.8.0.117.11 clang-1403.0.22.8.60)
Target: arm64-apple-macosx13.0
The source of what I've done is public if you're willing to take a look, or I'd be happy to have any guidance on how to debug what's happening. I suspect I may be using the plugin in a manner that was slightly unexpected.
For a quick view of how I added things, the work in progress can be seen at https://github.com/heckj/CRDT/pull/34/files, and it should be very easy to reproduce with these steps:
git clone https://github.com/heckj/CRDT -b benchmark2
cd CRDT/ExternalBenchmarks
swift package resolve
swift package benchmark
It'd be nice to be able to do e.g.
.build/debug/Basic --filter myBenchmarkTarget --skip "slowstuff.*"
to regex-filter which benchmarks should be run for the target in debug mode as well; for more complex targets that would be nice for debuggability.
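Such a filter could plausibly be a simple regex predicate over benchmark names. A minimal sketch, with a hypothetical helper name and invented benchmark names (not the package's actual flag handling):

```swift
import Foundation

// Hypothetical helper: decide whether a benchmark should run, given
// optional --filter and --skip regex patterns applied to its name.
func shouldRun(_ name: String, filter: String?, skip: String?) -> Bool {
    func matches(_ pattern: String) -> Bool {
        name.range(of: pattern, options: .regularExpression) != nil
    }
    if let skip, matches(skip) { return false }
    if let filter { return matches(filter) }
    return true
}

let names = ["fastPath", "slowstuff-sort", "slowstuff-hash"]
let selected = names.filter { shouldRun($0, filter: nil, skip: "slowstuff.*") }
print(selected) // ["fastPath"]
```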
I was looking through the export options and spotted the JMH and `.tsv` exports, but one of the things I wanted to explore was taking the full-fidelity Histogram from package-histogram, loading the file back in using Histogram's Codable conformance, and exploring some local visualization with it (SwiftUI Charts, etc).
I'm not familiar with `.tsv` (other than an assumption that it stands for "time series values") and wanted to ask: is that a more sane path for reloading into my own Histogram instance, or would it be reasonable to extend the exports to emit JSON from Histogram's Codable conformance? I'm reading and learning package-histogram, but not 100% on all the pieces and parts, and what's legitimate for reproducing it by reading in a file.
I'd presumed, but not yet traced, that dumping a ton of values into even an HdrHistogram was fundamentally lossy, and that it didn't store the entirety of all the values submitted to it. I'm happy to do the PR to enable this thing I want, but figured I'd best ask first; maybe there's a path that's easily there and I'm overlooking it.
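For the JSON idea, the mechanics would just be a Codable round-trip. A minimal sketch using a hypothetical stand-in type (the real Histogram from package-histogram has a different, richer layout; only the encode/decode pattern is illustrated):

```swift
import Foundation

// Stand-in for the real Histogram type from package-histogram;
// only the Codable round-trip mechanics are illustrated here.
struct HistogramStandIn: Codable, Equatable {
    var lowestDiscernibleValue: UInt64
    var counts: [UInt64]
}

let original = HistogramStandIn(lowestDiscernibleValue: 1, counts: [0, 3, 7])
let json = try! JSONEncoder().encode(original)   // what an export could write to disk
let restored = try! JSONDecoder().decode(HistogramStandIn.self, from: json)
assert(restored == original)
print("round-trip ok")
```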
Should allow specifying whether tabs or commas are used between output fields in e.g. `histogramPercentiles`, as some tools only handle CSV.
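A sketch of what a configurable separator could look like (the helper name is hypothetical, not the package's API):

```swift
// Hypothetical writer helper: emit percentile rows with a configurable
// separator, so the same code path can produce TSV or CSV.
func formatRow(_ fields: [String], separator: String = "\t") -> String {
    fields.joined(separator: separator)
}

print(formatRow(["p50", "120"]))                   // tab-separated (default)
print(formatRow(["p50", "120"], separator: ","))   // comma-separated: "p50,120"
```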
We should rename `throughputScalingFactor` -> `scalingFactor` and add support for both scaled and unscaled output of benchmark metrics. This allows one to get e.g. the number of mallocs per actual invocation of the code under measurement, or the actual time spent in user CPU time per invocation, and not just the throughput.
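The scaled vs. unscaled distinction amounts to dividing a raw counter back down by the scaling factor; a minimal illustration with invented numbers:

```swift
// With scalingFactor = .kilo the measured loop body runs 1_000 times per
// recorded iteration, so a raw counter must be divided back down to get
// a per-invocation figure. Numbers below are invented for illustration.
let scalingFactor = 1_000      // .kilo
let totalMallocs = 5_000       // raw malloc count for one recorded iteration
let mallocsPerInvocation = totalMallocs / scalingFactor
print(mallocsPerInvocation)    // 5 mallocs per actual invocation
```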
In the current `main` (March 4th - commit: fa1a955), when you use the option `--grouping metric`, the benchmark names are removed.
Examples:
`swift package benchmark` command output
`swift package benchmark --grouping metric` command output
On the `main` branch, I ran through the various formats and exported them to verify they all worked as expected. The `tsv` case looks like it might well be "broken", although I don't have historical data to compare easily.
When I looked through the files written, there was only ever a single value in a long series down the list. For example, the exported file default.HistogramBenchmark.Mean.Syscalls_(total).tsv (attached for convenience) looks almost meaningless in its output with the variety of numbers, and flipping through the metrics makes me suspect it's iterating incorrectly.
If you flip through the various "Mean Syscalls" files in the attached zip, it looks like they build on each other, which doesn't seem correct.
Something like
`swift package benchmark init MyBenchmark`
- creates the `Benchmarks/MyBenchmark` directory
- creates the `Benchmarks/MyBenchmark/MyBenchmark.swift` source file with the boilerplate:
import Benchmark
import Foundation

let benchmarks = {
    Benchmark.defaultConfiguration = .init(scalingFactor: .kilo)
    Benchmark("SomeBenchmark") { benchmark in
        for _ in benchmark.scaledIterations {
            blackHole(Date()) // replace this line with your own benchmark
        }
    }
}
- outputs the `Package.swift` target snippet (e.g. to the pasteboard via `pbcopy`):

// MyBenchmark benchmark target
.executableTarget(
    name: "MyBenchmark",
    dependencies: [
        .product(name: "Benchmark", package: "package-benchmark"),
        .product(name: "BenchmarkPlugin", package: "package-benchmark"),
    ],
    path: "Benchmarks/MyBenchmark"
),
This can then just be copy-pasted into `Package.swift` and we're ready to go.
I was applying package-benchmark to a comparison effort between a number of different existing libraries and external packages, to compare how fast they did their work. When I dumped the output, I found that I wanted to sort and view the results as an ordered list, and then to knock together some simple graphs comparing the different libraries. The TextTable output, which looks so great, was completely in the way, and I ended up doing quite a lot of text editing to get the data into a spreadsheet, which I then used to re-order things, make charts, etc.
(I also found that TextTable was clipping some names, as I was getting pretty descriptive there)
The request is for a text-based output that applies a bit more directly in this direction: something that can be easily copied and pasted, or a `.csv` output that you could open with the spreadsheet of your choice.
I did look at the `tsv` format output, but didn't quite know what it was and how to interpret it, and it wasn't in any sort of obvious columnar form, so I don't think that applies.
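A sketch of the kind of plain columnar output being requested, with invented field names and data (not package-benchmark's actual result model):

```swift
// Hypothetical result record and a plain CSV dump of it,
// sortable and chartable in any spreadsheet.
struct BenchmarkResult {
    let name: String
    let metric: String
    let p50: Double
}

let results = [
    BenchmarkResult(name: "LibraryA.merge", metric: "wallClock(us)", p50: 812),
    BenchmarkResult(name: "LibraryB.merge", metric: "wallClock(us)", p50: 403),
]

// Sort as an ordered list, then emit one header row plus one row per result.
var csv = "name,metric,p50\n"
for r in results.sorted(by: { $0.p50 < $1.p50 }) {
    csv += "\(r.name),\(r.metric),\(r.p50)\n"
}
print(csv)
```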
E.g. `swift package benchmark list baselines`.
Should list baseline name, host machine, CPUs, memory and timestamp of the latest update; pick up from the file system / .json as needed.
Use `FilePath.DirectoryView` for iteration to avoid pulling in Foundation.
Currently all benchmarks in the same suite are run in the same process context to avoid process start/stop overhead and run the benchmarks faster - this is fundamentally fine for most use cases (CPU, malloc count, context switches etc), but fails for real/VM memory counters.
We should automatically run any memory size related benchmarks in isolation to get more usable numbers.
Currently, maxDuration specifies the max wallClock time of the test under measurement, excluding benchmark measurement overhead. This leads to the non-intuitive situation where a very fast test, where the benchmark measurement overhead is larger than the actual benchmark runtime, will have a true wall clock time significantly longer than expected.
We should let maxDuration control real-world wall clock time including benchmark overhead instead, as the main reason for controlling maxDuration is that you want a known runtime for a test; the current implementation does not give that.
Should possibly work if we restructure the targets a bit; then one can choose whether malloc analytics are required. Useful for when jemalloc can't/shouldn't be required, and possibly we can work around the Xcode crash for unit tests when jemalloc is interposed.
I found the commands to invoke benchmarks in the online docs, but I was hoping to be able to invoke something like `swift package benchmark --help` to get a list of the commands and how to use them, or perhaps `swift package plugin benchmark -help`.
Getting the CLI arguments exposed is what I was after; the DocC compiler plugin exposes some of its help in this fashion. Is that something possible here?
Likely a question more than a bug report. I checked out the repo (`main` branch) and installed jemalloc per the prerequisites (`brew install jemalloc`). After that, `swift build` worked fine, but `swift test` in the repository failed for me locally:
Building for debugging...
[10/10] Linking BenchmarkPackageTests
Build complete! (7.69s)
error: Exited with signal code 11
When I run the tests from within Xcode (currently using the 14.3 beta), the tests trap on jemalloc, with `je_free_default` in the stack trace.
Is there something additional I should be doing re: jemalloc? I'm not familiar with using custom allocators and what the requirements are around them, so I suspect I'm missing something in my setup to support running `swift test` without issue.
Try hooking swift_retain/swift_release to allow capturing ARC traffic as a metric.
Saw there's a dependency on https://github.com/SwiftPackageIndex/SPIManifest which is not used at the moment!
E.g. support a `--metric` command line option.
Right now, when a benchmark hangs and I terminate it with Ctrl-C, spawned processes remain running.
As outlined in https://forums.swift.org/t/benchmark-package-initial-release/60535/40 and using the repo at https://github.com/corymosiman12/swift-benchmark-testing, we don't properly capture the mallocs/frees; we need to investigate whether it's due to failing to properly interpose jemalloc for Foundation, or if we simply miss some relevant jemalloc statistics that should be included.
Currently it doesn't perfectly match wallClock inverted, due to truncation of precision in measurements - we need to scale appropriately.
Currently we capture OS and malloc stats for every benchmark run; we should avoid that overhead for a benchmark run that doesn't request those stats. Especially important for malloc stats, which turned out to have the most significant overhead.
Exploring benchmarks, I tried setting `Benchmark.defaultConfiguration.timeUnits = .microseconds`. It does exactly what it says for the metric Time (wall clock), but doesn't appear to do anything with Time (system CPU) (μs), Time (total CPU) (μs), or Time (user CPU) (μs). And in some cases, I'm seeing Time (system CPU) come back in ns regardless of what I've chosen.
I suspect it'll be a little easier to track if you either nail it down explicitly across the board (sometimes forcing a "round to 0" result), or just let them float with whatever comes back from the metric-gathering system and its relevant range.
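Nailing the units down across the board would amount to dividing every raw time measurement by one chosen unit's nanosecond magnitude; a hypothetical sketch (the enum and function are not the package's API):

```swift
// Hypothetical normalization: express every time metric in one chosen
// unit instead of letting each metric float with whatever the OS reports.
enum TimeUnit: Double {
    case nanoseconds = 1
    case microseconds = 1_000
    case milliseconds = 1_000_000
}

func convert(_ nanoseconds: Double, to unit: TimeUnit) -> Double {
    nanoseconds / unit.rawValue
}

print(convert(2_500, to: .microseconds)) // 2.5 μs
```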
An easy mistake to make when copy/pasting benchmarks; we should fatalError() so it's not silently discarded as it currently is.
(missed cleanup from PR review)
Currently it's duplicated, should just refactor.
Fix closures to be non-escaping instead (should fix the zero-time measurements properly).
Should port to Swift and migrate to http://hdrhistogram.org - there's a C implementation that we could wrap in a Swift package. Would make sense - seems very neat!
Working through writing a few sample benchmarks, I was exploring `Benchmark.defaultConfiguration.desiredIterations`, `Benchmark.defaultConfiguration.desiredDuration`, and how they relate to each other.
I do think the names could be improved by renaming them to `maxIterations` and `maxDuration`, respectively. I'm also thinking that as we assemble the documentation, expanding on `Benchmark.Configuration` (and maybe the article WritingBenchmarks.md), it would be worth calling out specifically that the run will go until the first of these two is hit.
And just to double-check: if you specify warmupIterations in the configuration, does that take place before either the iteration or duration limits are measured?
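The "first of these two wins" rule could be sketched like this, using the proposed maxIterations/maxDuration names and assuming warmup runs unmeasured beforehand (an illustration, not the package's actual run loop):

```swift
import Foundation

// Sketch of the stopping rule: run warmup first (assumed unmeasured),
// then iterate until EITHER maxIterations or maxDuration is reached.
func runBenchmark(maxIterations: Int,
                  maxDuration: TimeInterval,
                  warmupIterations: Int,
                  body: () -> Void) -> Int {
    for _ in 0..<warmupIterations { body() }   // warmup: not counted or timed
    let start = Date()
    var iterations = 0
    while iterations < maxIterations,
          Date().timeIntervalSince(start) < maxDuration {
        body()
        iterations += 1
    }
    return iterations
}

let measured = runBenchmark(maxIterations: 100, maxDuration: 1.0,
                            warmupIterations: 3) { _ = (0..<10).reduce(0, +) }
print(measured) // at most 100, fewer if the duration budget runs out first
```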
Currently relative (delta) and absolute (delta) thresholds are supported; we should optionally also be able to use thresholds as relative (absolute) and absolute (absolute) in addition.
Currently there's a requirement that the executable target has a Benchmark suffix for discovery. Let's change the heuristics to instead pick up the targets that have source paths in Benchmarks/ without considering their names.
As suggested by @ktoso, it'd allow for analysing results using e.g. https://jmh.morethan.io
Would be nice to have a function on the enum that returns a range for easier iteration, instead of the typical
for _ in 0..<benchmark.throughputScalingFactor.rawValue
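A hypothetical version of that helper, assuming a raw-value enum along the lines of the current throughputScalingFactor (names and cases here are illustrative):

```swift
// Illustrative stand-in for the package's scaling-factor enum,
// extended with the requested range helper.
enum ThroughputScalingFactor: Int {
    case one = 1
    case kilo = 1_000
    case mega = 1_000_000

    // Proposed helper: iterate the scaled count directly.
    var iterations: Range<Int> { 0..<rawValue }
}

var count = 0
for _ in ThroughputScalingFactor.kilo.iterations { count += 1 }
print(count) // 1000
```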
Should validate what the overhead of the time not spent in the actual benchmark code is (capturing OS, malloc statistics, sampling) and see what the magnitude is - and optimise if needed - there are some cheap wins (especially on Linux, where we could cache some open files etc) that could be done if needed. Need to put in probes and measure overhead as first step (i.e. total wall clock runtime vs. measured wall clock runtime delta).
As we've got a module and a type with the same name, we can get problems disambiguating:
xxx.swift:119:23: error: type 'Benchmark' has no member 'blackHole'
Benchmark.blackHole(transaction.getRetainedData())
The symbol can be disambiguated using the little-known `import (class|struct|func|protocol|enum) Module.Symbol` syntax:
import func Benchmark.blackHole
We should document this, or add a static func on Benchmark.
Shouldn't need it; cuts down on linking dependencies.
Instead:
// Use unbuffered stdout to help detect exactly which test was running in the event of a crash.
setbuf(stdout, nil)
Right now we just swap in the runner; we should instead reverse the array in Statistics when calculating, so we also get p99/p99.
Working on the `main` branch (commit: 7d6f5f9), running through the documentation and trying out the various examples to double-check them, I found that a default baseline compare was failing.
Reproduction:
swift package --allow-writing-to-package-directory benchmark baseline update
swift package benchmark baseline compare
Output of failure:
Building for debugging...
Build complete! (0.21s)
Building benchmark targets in release mode for benchmark run...
Building HistogramBenchmark
Building BenchmarkDateTime
Building Basic
Build complete!
Swift/RangeReplaceableCollection.swift:870: Fatal error: Can't remove last element from an empty collection
error: plugin process ended by an uncaught signal: 5 <command: /usr/bin/sandbox-exec -p '(version 1)
(deny default)
(import "system.sb")
(allow file-read*)
(allow process*)
(allow file-write*
(subpath "/private/tmp")
(subpath "/private/var/folders/8t/k6nw7pyx2qq77g8qq_g429080000gn/T")
)
(deny file-write*
(subpath "/Users/heckj/src/package-benchmark")
)
(allow file-write*
(subpath "/Users/heckj/src/package-benchmark/.build/plugins/Benchmark-Plugin/outputs")
(subpath "/Users/heckj/src/package-benchmark/.build/plugins/Benchmark-Plugin/cache")
)
' /Users/heckj/src/package-benchmark/.build/plugins/Benchmark-Plugin/cache/Benchmark_Plugin>, <output:
'Building benchmark targets in release mode for benchmark run...
Building HistogramBenchmark
Building BenchmarkDateTime
Building Basic
Build complete!
Swift/RangeReplaceableCollection.swift:870: Fatal error: Can'\''t remove last element from an empty collection
'>
Workaround: if you name the baseline, there's no issue, so running `swift package benchmark baseline compare default` works.
Currently the sample GitHub workflows fail as they use the old command line arguments; needs updating and testing.
Currently absolute thresholds compare using the current unit of measurement, e.g. a p25 = 10 for resident memory size would mean 10MB if resident memory size is measured in M, but 10KB if the current scale is K. It should be changed such that the absolute threshold is in the unscaled unit (bytes in this case).
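The proposed change amounts to storing thresholds in bytes and scaling only for display; a small illustration with invented numbers:

```swift
// Store the absolute threshold unscaled (bytes), so "p25 = 10" can't
// silently mean 10MB in one run and 10KB in another. Scale only when
// presenting the value.
let thresholdBytes = 10 * 1_048_576   // "10M", expressed unscaled
let displayedInMegabytes = Double(thresholdBytes) / 1_048_576
let displayedInKilobytes = Double(thresholdBytes) / 1_024
print(displayedInMegabytes, displayedInKilobytes) // 10.0 10240.0
```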
Check the output at apple/swift-nio#2392 (comment) - with very long names it gets truncated. (Now, those names specifically have a duplicate preamble that matches the benchmark target and could probably be shortened, but it'd be nice to have full dynamism - I think we put in an artificial max limit, and could bump it a bit.)