ordo-one / package-benchmark
Swift benchmark runner with many performance metrics and great CI support
License: Apache License 2.0
As suggested, maybe this can help us kill @dynamicReplacement:
Have you tried calling the same methods as Argument Parser does, but from your library? E.g. you could provide a protocol called `Benchmark` that has a `main` method that looks like this:
public static func main() async {
    // Set up everything that the benchmark needs
    do {
        var command = try parseAsRoot()
        if var asyncCommand = command as? AsyncParsableCommand {
            try await asyncCommand.run()
        } else {
            try command.run()
        }
    } catch {
        exit(withError: error)
    }
    // Record benchmark results etc.
}
Since all methods are public in Argument Parser you should be able to replicate anything the package is doing in your own main method. Maybe I am missing something here though!
We can integrate a few of the dependencies into the project (e.g. BenchmarkClock isn't used anywhere else).
I'm sorry to be popping open a bunch of issues here - if there's a better place to ask for guidance, I'm totally game.
I tried adding a simple benchmark setup to a library I'm working on, adding it as an external package that imports mine with a local reference. That all appears to work correctly: `swift package resolve` is happy, `swift build` works, and `swift package benchmark` appears to work as well - except that there's no benchmark data visible.
The output I get from the command is:
Building for debugging...
Build complete! (0.35s)
Building targets in release mode for benchmark run...
Build complete! Running benchmarks...
If I run `swift package benchmark list`, it's the exact same output.
(This is with package-benchmark 0.8.0 and Swift 5.8 (installed from Xcode 14.3 beta) on macOS with an M1 processor.)
swift -version:
swift-driver version: 1.75.1 Apple Swift version 5.8 (swiftlang-5.8.0.117.11 clang-1403.0.22.8.60)
Target: arm64-apple-macosx13.0
The source of what I've done is public if you're willing to take a look, or I'd be happy to have any guidance on how to debug what's happening. I suspect I may be using the plugin in a manner that was slightly unexpected.
For a quick view of how I added things, the work in progress can be seen at https://github.com/heckj/CRDT/pull/34/files, and it should be very easy to reproduce with these steps:
git clone https://github.com/heckj/CRDT -b benchmark2
cd CRDT/ExternalBenchmarks
swift package resolve
swift package benchmark
It'd be nice to be able to do e.g.
.build/debug/Basic --filter myBenchmarkTarget --skip "slowstuff.*"
to regex-filter which benchmarks should be run for the target in debug mode as well; for more complex targets that would be nice for debuggability.
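Such a filter could plausibly be a simple regex predicate over benchmark names. A minimal sketch, with a hypothetical helper name and invented benchmark names (not the package's actual flag handling):

```swift
import Foundation

// Hypothetical helper: decide whether a benchmark should run, given
// optional --filter and --skip regex patterns applied to its name.
func shouldRun(_ name: String, filter: String?, skip: String?) -> Bool {
    func matches(_ pattern: String) -> Bool {
        name.range(of: pattern, options: .regularExpression) != nil
    }
    if let skip, matches(skip) { return false }
    if let filter { return matches(filter) }
    return true
}

let names = ["fastPath", "slowstuff-sort", "slowstuff-hash"]
let selected = names.filter { shouldRun($0, filter: nil, skip: "slowstuff.*") }
print(selected) // ["fastPath"]
```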
I was looking through the export options and spotted the JMH and `.tsv` exports, but one of the things I wanted to explore was taking the full-fidelity Histogram from package-histogram, loading the file back in using Histogram's Codable conformance, and exploring some local visualization with it (SwiftUI Charts, etc).
I'm not familiar with `.tsv` (other than an assumption that it stands for "time series values") and wanted to ask: is that a more sane path for reloading into my own Histogram instance, or would it be reasonable to extend the exports to emit JSON from Histogram's Codable conformance? I'm reading and learning package-histogram, but not 100% on all the pieces and parts, and what's legitimate for reproducing it by reading in a file.
I'd presumed, but not yet traced, that dumping a ton of values into even an HdrHistogram was fundamentally lossy, and that it didn't store the entirety of all the values submitted to it. I'm happy to do the PR to enable this thing I want, but figured I'd best ask first; maybe there's a path that's easily there and I'm overlooking it.
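For the JSON idea, the mechanics would just be a Codable round-trip. A minimal sketch using a hypothetical stand-in type (the real Histogram from package-histogram has a different, richer layout; only the encode/decode pattern is illustrated):

```swift
import Foundation

// Stand-in for the real Histogram type from package-histogram;
// only the Codable round-trip mechanics are illustrated here.
struct HistogramStandIn: Codable, Equatable {
    var lowestDiscernibleValue: UInt64
    var counts: [UInt64]
}

let original = HistogramStandIn(lowestDiscernibleValue: 1, counts: [0, 3, 7])
let json = try! JSONEncoder().encode(original)   // what an export could write to disk
let restored = try! JSONDecoder().decode(HistogramStandIn.self, from: json)
assert(restored == original)
print("round-trip ok")
```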
Should allow specifying whether tabs or commas are used between output fields in e.g. `histogramPercentiles`, as some tools only handle CSV.
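A sketch of what a configurable separator could look like (the helper name is hypothetical, not the package's API):

```swift
// Hypothetical writer helper: emit percentile rows with a configurable
// separator, so the same code path can produce TSV or CSV.
func formatRow(_ fields: [String], separator: String = "\t") -> String {
    fields.joined(separator: separator)
}

print(formatRow(["p50", "120"]))                   // tab-separated (default)
print(formatRow(["p50", "120"], separator: ","))   // comma-separated: "p50,120"
```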
We should rename `throughputScalingFactor` -> `scalingFactor` and add support for both scaled and unscaled output of benchmark metrics. This allows one to get e.g. the number of mallocs per actual invocation of the code under measurement, or the actual time spent in user CPU time per invocation, and not just the throughput.
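The scaled vs. unscaled distinction amounts to dividing a raw counter back down by the scaling factor; a minimal illustration with invented numbers:

```swift
// With scalingFactor = .kilo the measured loop body runs 1_000 times per
// recorded iteration, so a raw counter must be divided back down to get
// a per-invocation figure. Numbers below are invented for illustration.
let scalingFactor = 1_000      // .kilo
let totalMallocs = 5_000       // raw malloc count for one recorded iteration
let mallocsPerInvocation = totalMallocs / scalingFactor
print(mallocsPerInvocation)    // 5 mallocs per actual invocation
```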
In the current `main` (March 4th - commit: fa1a955), when you use the option `--grouping metric`, the benchmark names are removed.
Examples:
`swift package benchmark` command output
`swift package benchmark --grouping metric` command output
On the `main` branch, I ran through the various formats and exported them to verify they all worked as expected. The `tsv` case looks like it might well be "broken", although I don't have historical data to compare easily.
When I looked through the files written, there was only ever a single value in a long series down the list. For example, the exported file default.HistogramBenchmark.Mean.Syscalls_(total).tsv (attached for convenience) looks almost meaningless in its output with the variety of numbers, and flipping through the metrics makes me suspect it's iterating incorrectly.
If you flip through the various "Mean Syscalls" files in the attached zip, it looks like they build on each other, which doesn't seem correct.
Something like
`swift package benchmark init MyBenchmark`
- creates the `Benchmarks/MyBenchmark` directory
- creates the `Benchmarks/MyBenchmark/MyBenchmark.swift` source file with the boilerplate:
import Benchmark
import Foundation

let benchmarks = {
    Benchmark.defaultConfiguration = .init(scalingFactor: .kilo)
    Benchmark("SomeBenchmark") { benchmark in
        for _ in benchmark.scaledIterations {
            blackHole(Date()) // replace this line with your own benchmark
        }
    }
}
- outputs the `Package.swift` target snippet (e.g. to the pasteboard via `pbcopy`):

// MyBenchmark benchmark target
.executableTarget(
    name: "MyBenchmark",
    dependencies: [
        .product(name: "Benchmark", package: "package-benchmark"),
        .product(name: "BenchmarkPlugin", package: "package-benchmark"),
    ],
    path: "Benchmarks/MyBenchmark"
),
This can then just be copy-pasted into `Package.swift` and we're ready to go.
I was applying package-benchmark to a comparison effort between a number of different existing libraries and external packages, to compare how fast they did their work. When I dumped the output, I found that I wanted to sort and view the results as an ordered list, and then to knock together some simple graphs comparing the different libraries. The TextTable output, which looks so great, was completely in the way, and I ended up doing quite a lot of text editing to get the data into a spreadsheet, which I then used to re-order things, make charts, etc.
(I also found that TextTable was clipping some names, as I was getting pretty descriptive there)
The request is for a text-based output that applies a bit more directly in this direction: something that can be easily copied and pasted, or a `.csv` output that you could open with the spreadsheet of your choice.
I did look at the `tsv` format output, but didn't quite know what it was and how to interpret it, and it wasn't in any sort of obvious columnar form, so I don't think that applies.
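A sketch of the kind of plain columnar output being requested, with invented field names and data (not package-benchmark's actual result model):

```swift
// Hypothetical result record and a plain CSV dump of it,
// sortable and chartable in any spreadsheet.
struct BenchmarkResult {
    let name: String
    let metric: String
    let p50: Double
}

let results = [
    BenchmarkResult(name: "LibraryA.merge", metric: "wallClock(us)", p50: 812),
    BenchmarkResult(name: "LibraryB.merge", metric: "wallClock(us)", p50: 403),
]

// Sort as an ordered list, then emit one header row plus one row per result.
var csv = "name,metric,p50\n"
for r in results.sorted(by: { $0.p50 < $1.p50 }) {
    csv += "\(r.name),\(r.metric),\(r.p50)\n"
}
print(csv)
```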
E.g. `swift package benchmark list baselines`.
Should list baseline name, host machine, CPUs, memory and timestamp of the latest update; pick up from the file system / .json as needed.
Use `FilePath.DirectoryView` for iteration to avoid pulling in Foundation.
Currently all benchmarks in the same suite are run in the same process context to avoid process start/stop overhead and run the benchmarks faster - this is fundamentally fine for most use cases (CPU, malloc count, context switches etc), but fails for real/VM memory counters.
We should automatically run any memory size related benchmarks in isolation to get more usable numbers.
Currently, maxDuration specifies the max wallClock time of the test under measurement, excluding benchmark measurement overhead. This leads to the non-intuitive situation where a very fast test, where the benchmark measurement overhead is larger than the actual benchmark runtime, will have a true wall clock time significantly longer than expected.
We should let maxDuration control real-world wall clock time including benchmark overhead instead, as the main reason for controlling maxDuration is that you want a known runtime for a test; the current implementation does not give that.
Should possibly work if we restructure the targets a bit; then one can choose whether malloc analytics are required. Useful for when jemalloc can't/shouldn't be required, and possibly we can work around the Xcode crash for unit tests when jemalloc is interposed.
I found the commands to invoke benchmarks in the online docs, but I was hoping to be able to invoke something like `swift package benchmark --help` to get a list of the commands and how to use them, or perhaps `swift package plugin benchmark -help`.
Getting the CLI arguments exposed is what I was after; the DocC compiler plugin exposes some of its help in this fashion. Is that something possible here?
Likely a question more than a bug report. I checked out the repo (`main` branch) and installed jemalloc per the prerequisites (`brew install jemalloc`). After that, `swift build` worked fine, but `swift test` in the repository failed for me locally:
Building for debugging...
[10/10] Linking BenchmarkPackageTests
Build complete! (7.69s)
error: Exited with signal code 11
When I run the tests from within Xcode (currently using the 14.3 beta), the tests trap on jemalloc, with `je_free_default` in the stack trace.
Is there something additional I should be doing re: jemalloc? I'm not familiar with using custom allocators and what the requirements are around them, so I suspect I'm missing something in my setup to support running `swift test` without issue.
Try hooking swift_retain/swift_release to allow capturing ARC traffic as a metric.
Saw there's a dependency on https://github.com/SwiftPackageIndex/SPIManifest which is not used at the moment!
E.g. support a `--metric` command line option.
Right now, when a benchmark hangs and I terminate it with Ctrl-C, spawned processes remain running.
As outlined in https://forums.swift.org/t/benchmark-package-initial-release/60535/40 and using the repo at https://github.com/corymosiman12/swift-benchmark-testing, we don't properly capture the mallocs/frees; we need to investigate whether it's due to failing to properly interpose jemalloc for Foundation, or if we simply miss some relevant jemalloc statistics that should be included.
Currently it doesn't perfectly match wallClock inverted, due to truncation of precision in measurements - we need to scale appropriately.
Currently we capture OS and malloc stats for every benchmark run; we should avoid that overhead for a benchmark run that doesn't request those stats. Especially important for malloc stats, which turned out to have the most significant overhead.
Exploring benchmarks, I tried setting `Benchmark.defaultConfiguration.timeUnits = .microseconds`. It does exactly what it says for the metric Time (wall clock), but doesn't appear to do anything with Time (system CPU) (μs), Time (total CPU) (μs), or Time (user CPU) (μs). And in some cases, I'm seeing Time (system CPU) come back in ns regardless of what I've chosen.
I suspect it'll be a little easier to track if you either nail it down explicitly across the board (sometimes forcing a "round to 0" result), or just let them float with whatever comes back from the metric-gathering system and its relevant range.
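Nailing the units down across the board would amount to dividing every raw time measurement by one chosen unit's nanosecond magnitude; a hypothetical sketch (the enum and function are not the package's API):

```swift
// Hypothetical normalization: express every time metric in one chosen
// unit instead of letting each metric float with whatever the OS reports.
enum TimeUnit: Double {
    case nanoseconds = 1
    case microseconds = 1_000
    case milliseconds = 1_000_000
}

func convert(_ nanoseconds: Double, to unit: TimeUnit) -> Double {
    nanoseconds / unit.rawValue
}

print(convert(2_500, to: .microseconds)) // 2.5 μs
```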
An easy mistake to make when copy/pasting benchmarks; we should fatalError() so it's not silently discarded as it currently is.
(missed cleanup from PR review)
Currently it's duplicated, should just refactor.
Fix closures to be non-escaping instead (should fix the zero-time measurements properly).
Should port to Swift and migrate to http://hdrhistogram.org - there's a C implementation that we could wrap in a Swift package. Would make sense - seems very neat!
Working through writing a few sample benchmarks, I was exploring `Benchmark.defaultConfiguration.desiredIterations`, `Benchmark.defaultConfiguration.desiredDuration`, and how they relate to each other.
I do think the names could be improved by renaming them to `maxIterations` and `maxDuration`, respectively. I'm also thinking that as we assemble the documentation, expanding on `Benchmark.Configuration` (and maybe the article WritingBenchmarks.md), it would be worth calling out specifically that the run will go until the first of these two is hit.
And just to double-check: if you specify warmupIterations in the configuration, does that take place before either the iteration or duration limits are measured?
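The "first of these two wins" rule could be sketched like this, using the proposed maxIterations/maxDuration names and assuming warmup runs unmeasured beforehand (an illustration, not the package's actual run loop):

```swift
import Foundation

// Sketch of the stopping rule: run warmup first (assumed unmeasured),
// then iterate until EITHER maxIterations or maxDuration is reached.
func runBenchmark(maxIterations: Int,
                  maxDuration: TimeInterval,
                  warmupIterations: Int,
                  body: () -> Void) -> Int {
    for _ in 0..<warmupIterations { body() }   // warmup: not counted or timed
    let start = Date()
    var iterations = 0
    while iterations < maxIterations,
          Date().timeIntervalSince(start) < maxDuration {
        body()
        iterations += 1
    }
    return iterations
}

let measured = runBenchmark(maxIterations: 100, maxDuration: 1.0,
                            warmupIterations: 3) { _ = (0..<10).reduce(0, +) }
print(measured) // at most 100, fewer if the duration budget runs out first
```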
Currently relative (delta) and absolute (delta) thresholds are supported; we should optionally also be able to use thresholds as relative (absolute) and absolute (absolute) in addition.
Currently there's a requirement that the executable target has a Benchmark suffix for discovery. Let's change the heuristics to instead pick up the targets that have source paths in Benchmarks/ without considering their names.
As suggested by @ktoso, it'd allow for analysing results using e.g. https://jmh.morethan.io
Would be nice to have a function on the enum that returns a range for easier iteration, instead of the typical
for _ in 0..<benchmark.throughputScalingFactor.rawValue
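A hypothetical version of that helper, assuming a raw-value enum along the lines of the current throughputScalingFactor (names and cases here are illustrative):

```swift
// Illustrative stand-in for the package's scaling-factor enum,
// extended with the requested range helper.
enum ThroughputScalingFactor: Int {
    case one = 1
    case kilo = 1_000
    case mega = 1_000_000

    // Proposed helper: iterate the scaled count directly.
    var iterations: Range<Int> { 0..<rawValue }
}

var count = 0
for _ in ThroughputScalingFactor.kilo.iterations { count += 1 }
print(count) // 1000
```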
Should validate what the overhead of the time not spent in the actual benchmark code is (capturing OS, malloc statistics, sampling) and see what the magnitude is - and optimise if needed - there are some cheap wins (especially on Linux, where we could cache some open files etc) that could be done if needed. Need to put in probes and measure overhead as first step (i.e. total wall clock runtime vs. measured wall clock runtime delta).
As we've got a module and a type with the same name, we can get problems disambiguating:
xxx.swift:119:23: error: type 'Benchmark' has no member 'blackHole'
Benchmark.blackHole(transaction.getRetainedData())
The symbol can be disambiguated using the little-known `import (class|struct|func|protocol|enum) Module.Symbol` syntax:
import func Benchmark.blackHole
We should document this, or add a static func on Benchmark.
Shouldn't need it; cuts down on linking dependencies.
Instead:
// Use unbuffered stdout to help detect exactly which test was running in the event of a crash.
setbuf(stdout, nil)
Right now we just swap in the runner; we should instead reverse the array in Statistics when calculating, so we also get p99/p99.
Working on the `main` branch (commit: 7d6f5f9), running through the documentation and trying out the various examples to double-check them, I found that a default baseline compare was failing.
Reproduction:
swift package --allow-writing-to-package-directory benchmark baseline update
swift package benchmark baseline compare
Output of failure:
Building for debugging...
Build complete! (0.21s)
Building benchmark targets in release mode for benchmark run...
Building HistogramBenchmark
Building BenchmarkDateTime
Building Basic
Build complete!
Swift/RangeReplaceableCollection.swift:870: Fatal error: Can't remove last element from an empty collection
error: plugin process ended by an uncaught signal: 5 <command: /usr/bin/sandbox-exec -p '(version 1)
(deny default)
(import "system.sb")
(allow file-read*)
(allow process*)
(allow file-write*
(subpath "/private/tmp")
(subpath "/private/var/folders/8t/k6nw7pyx2qq77g8qq_g429080000gn/T")
)
(deny file-write*
(subpath "/Users/heckj/src/package-benchmark")
)
(allow file-write*
(subpath "/Users/heckj/src/package-benchmark/.build/plugins/Benchmark-Plugin/outputs")
(subpath "/Users/heckj/src/package-benchmark/.build/plugins/Benchmark-Plugin/cache")
)
' /Users/heckj/src/package-benchmark/.build/plugins/Benchmark-Plugin/cache/Benchmark_Plugin>, <output:
'Building benchmark targets in release mode for benchmark run...
Building HistogramBenchmark
Building BenchmarkDateTime
Building Basic
Build complete!
Swift/RangeReplaceableCollection.swift:870: Fatal error: Can'\''t remove last element from an empty collection
'>
Workaround: if you name the baseline, there's no issue, so running `swift package benchmark baseline compare default` works.
Currently the sample GitHub workflows fail as they use the old command line arguments; needs updating and testing.
Currently absolute thresholds compare using the current unit of measurement, e.g. a p25 = 10 for resident memory size would mean 10MB if resident memory size is measured in M, but 10KB if the current scale is K. It should be changed such that the absolute threshold is in the unscaled unit (bytes in this case).
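The proposed change amounts to storing thresholds in bytes and scaling only for display; a small illustration with invented numbers:

```swift
// Store the absolute threshold unscaled (bytes), so "p25 = 10" can't
// silently mean 10MB in one run and 10KB in another. Scale only when
// presenting the value.
let thresholdBytes = 10 * 1_048_576   // "10M", expressed unscaled
let displayedInMegabytes = Double(thresholdBytes) / 1_048_576
let displayedInKilobytes = Double(thresholdBytes) / 1_024
print(displayedInMegabytes, displayedInKilobytes) // 10.0 10240.0
```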
Check the output at apple/swift-nio#2392 (comment) - with very long names it gets truncated. (Now, those names specifically have a duplicate preamble that matches the benchmark target and could probably be shortened, but it'd be nice to have full dynamism - I think we put in an artificial max limit, and could bump it a bit.)