Comments (10)
Well, that's entertaining. Could you tell me more about the environment you're running in? OS, GHC version, etc.?
from criterion.
OS: Ubuntu 11.10
Arch: x86_64
GHC: 7.0.3 (standard version in the Ubuntu repository)
I'm not able to reproduce it on GHC 7.4.1, but it reliably reproduces on GHC 7.0.3.
What library is responsible for measuring clock resolution (statistics, criterion, etc.)? I'd be happy to put in some traces or something to help debug.
It's criterion itself that measures the clock resolution: https://github.com/bos/criterion/blob/master/Criterion/Environment.hs#L37
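For context, the idea behind that resolution estimate can be sketched as a pure function over successive clock readings. This is a simplified illustration, not the real code in Criterion/Environment.hs:

```haskell
-- Simplified sketch (NOT criterion's actual implementation): estimate
-- clock resolution by taking successive clock readings and keeping the
-- positive differences.  On a coarse clock almost every difference is
-- the same quantum, which is what feeds the outlier analysis the
-- degenerate data discussed below.
resolutionSamples :: [Double] -> [Double]
resolutionSamples ts = filter (> 0) (zipWith (-) (tail ts) ts)

main :: IO ()
main = print (resolutionSamples [0, 4.0e-6, 4.0e-6, 8.0e-6])
-- A clock that only ticks in 4 us steps yields identical samples:
-- prints [4.0e-6,4.0e-6]
```

Identical readings produce a delta of 0 (dropped by the filter), and every real tick produces the same step size, so the sample list collapses onto a single value.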
It looks like the outlier calculation assumes there will be outliers, and produces bad results if all the samples are nearly identical.
In this case, the measurement of clock resolution on my system is coming up with the same value almost every time. I added some traces to Criterion.Analysis.classifyOutliers so it would use only the first ten values, and print out its internal state. Here's the output:
outliers = Outliers {samplesSeen = 10, lowSevere = 10, lowMild = 0, highMild = 0, highSevere = 10}
ssa = fromList [3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6]
loS = 3.814697265625e-6
loM = 3.814697265625e-6
hiM = 3.814697265625e-6
hiS = 3.814697265625e-6
q1 = 3.814697265625e-6
q3 = 3.814697265625e-6
iqr = 0.0
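The failure mode in that trace can be reproduced with a minimal boxplot-style classifier. This is a sketch with crude index-based quartiles, not criterion's actual code, but it uses the same non-strict fence comparisons the trace implies:

```haskell
import Data.List (sort)

-- Minimal sketch of boxplot outlier classification (NOT criterion's
-- exact code): count samples at or beyond the "severe" fences
-- q1 - 3*iqr and q3 + 3*iqr, using non-strict comparisons.
classify :: [Double] -> (Int, Int)   -- (lowSevere, highSevere)
classify xs = (count (<= loS) xs, count (>= hiS) xs)
  where
    sorted  = sort xs
    n       = length xs
    q1      = sorted !! (n `div` 4)        -- crude quartile by index
    q3      = sorted !! (3 * n `div` 4)
    iqr     = q3 - q1
    loS     = q1 - 3 * iqr
    hiS     = q3 + 3 * iqr
    count p = length . filter p

main :: IO ()
main =
  -- Ten identical samples: iqr = 0, so loS == hiS == the sample value
  -- and every sample satisfies BOTH fences:
  print (classify (replicate 10 3.814697265625e-6))  -- prints (10,10)
```

With iqr = 0 the fences collapse onto the data itself, so every point is double-counted as both a low-severe and a high-severe outlier, producing the ~200% outlier totals seen in this thread.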
Cool! Thanks for looking into this. That should make it easier to figure out a fix.
Any progress on this? I'm seeing the same behavior on Windows 7 64-bit, with criterion-0.6.0.1 and ghc-7.4.1, in this very simple example:
module Main where
import Control.DeepSeq
import Criterion.Main
import Data.List ( sort )
main = defaultMain [ bgroup "rands" . benchSorts $ [1..1000::Int] ]
benchSorts xs = seq (sum xs) [ bench "sort" (nf (sum . sort) xs) ]
Typical output is:
warming up
estimating clock resolution...
mean is 1.637714 us (320001 iterations)
found 638950 outliers among 319999 samples (199.7%)
318951 (99.7%) low severe
319999 (100.0%) high severe
estimating cost of a clock call...
mean is 123.0625 ns (16 iterations)
benchmarking rands/sort
Warning: Couldn't open /dev/urandom
Warning: using system clock for seed instead (quality will be lower)
mean: 15.09599 us, lb 14.81007 us, ub 15.52451 us, ci 0.950
std dev: 1.798179 us, lb 1.497669 us, ub 2.072980 us, ci 0.950
found 17 outliers among 100 samples (17.0%)
17 (17.0%) high severe
variance introduced by outliers: 84.204%
variance is severely inflated by outliers
It doesn't seem to be affected by optimization level. The same happens on a virtual machine running Linux:
warming up
estimating clock resolution...
mean is 5.303620 us (160001 iterations)
found 287114 outliers among 159999 samples (179.4%)
145205 (90.8%) low severe
141909 (88.7%) high severe
estimating cost of a clock call...
mean is 2.490415 us (21 iterations)
found 3 outliers among 21 samples (14.3%)
3 (14.3%) high mild
benchmarking rands/insertionSort
mean: 281.0402 us, lb 277.6374 us, ub 284.8547 us, ci 0.950
std dev: 18.47338 us, lb 16.33518 us, ub 20.99275 us, ci 0.950
variance introduced by outliers: 61.565%
variance is severely inflated by outliers
I also got the same with Windows 7 64-bit, criterion-0.6.1.1, ghc-7.4.1:
Warning: Couldn't open /dev/urandom
Warning: using system clock for seed instead (quality will be lower)
warming up
estimating clock resolution...
mean is 2.190758 us (320001 iterations)
found 639297 outliers among 319999 samples (199.8%)
319298 (99.8%) low severe
319999 (100.0%) high severe
estimating cost of a clock call...
mean is 179.1769 ns (15 iterations)
benchmarking tiny/fib 10
mean: 4.652202 us, lb 4.358087 us, ub 4.973060 us, ci 0.950
std dev: 1.549386 us, lb 1.293269 us, ub 2.013035 us, ci 0.950
found 2 outliers among 100 samples (2.0%)
2 (2.0%) high mild
variance introduced by outliers: 97.865%
variance is severely inflated by outliers
benchmarking tiny/fib 15
mean: 49.72007 us, lb 47.22011 us, ub 51.66477 us, ci 0.950
std dev: 11.37166 us, lb 9.687343 us, ub 12.79400 us, ci 0.950
found 173 outliers among 100 samples (173.0%)
94 (94.0%) low severe
79 (79.0%) high severe
variance introduced by outliers: 95.725%
variance is severely inflated by outliers
With these tests:
{-# LANGUAGE ScopedTypeVariables #-}
import Criterion.Main
main = defaultMain [
    bgroup "tiny" [ bench "fib 10" $ whnf fib 10
                  , bench "fib 15" $ whnf fib 15
                  ]
  ]
fib :: Int -> Int
fib n | n < 0     = error "negative!"
      | otherwise = go (fromIntegral n)
  where
    go 0 = 0
    go 1 = 1
    go n = go (n-1) + go (n-2)
classifyOutliers triggers this bug whenever the input contains many identical values:
>>> classifyOutliers (U.replicate 10 0)
Outliers {samplesSeen = 10, lowSevere = 10, lowMild = 0, highMild = 0, highSevere = 10}
This results in duplicate counts in the lowSevere and highSevere ranges, just like in the other reports here. It also triggers the bug even if not all the data is identical:
>>> classifyOutliers (U.fromList [0, 0, 0, 0, 0, 0, 1])
Outliers {samplesSeen = 7, lowSevere = 6, lowMild = 0, highMild = 0, highSevere = 7}
However, it was not clear what would cause it to receive long stretches of identical data, especially with 100 samples; you would think that would be very improbable. So I added a trace to see what the offending data was, and this is what it was getting:
benchmarking parse
collecting 100 samples, 1 iterations each, in estimated 10.58121 s
mean: 123.6324 ns, lb 47.33842 ns, ub 238.0733 ns, ci 0.950
std dev: 471.0889 ns, lb 368.2355 ns, ub 707.7345 ns, ci 0.950
fromList [2.7629260312427172e-6,8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,
8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,
-9.80969179760326e-8,8.555773984302174e-7,8.555773984302174e-7,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,
-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,
-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,
-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,1.09399597753178e-6,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
8.555773984302174e-7,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,
-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8]
found 179 outliers among 100 samples (179.0%)
79 (79.0%) low severe
100 (100.0%) high severe
The data has several repeating values, and the value -9.80969179760326e-8 sufficiently dominates the rest of the data set that it triggers the bug. I still don't have an answer for why that exact timing repeats so often. My initial guess was that it was getting near the limit of the timer's minimum step size, but that wasn't the case: benchmarks on trivial code like nf id () give smaller values in the e-9 range that are much more diverse. Here's a sample of some of them:
5.519538864781468e-9,5.444293899712435e-9,5.59069442957501e-9,5.468012421310282e-9,5.529353425442646e-9,5.553071947040493e-9,5.443476019657337e-9,5.621773871668741e-9,5.605416270566778e-9,5.51872098472637e-9,5.646310273321687e-9,5.549800426820101e-9,5.643038753101294e-9,5.516267344561075e-9,5.60459839051168e-9,5.443476019657337e-9,5.571065308252654e-9,5.783714122578182e-9,6.146852867041778e-9,5.763267121200728e-9,5.49173094290813e-9,5.488459422687737e-9,5.557161347315985e-9,5.440204499436944e-9,5.60459839051168e-9,5.5121779442855845e-9,5.502363383624406e-9,5.6185023514483485e-9,5.443476019657337e-9,5.632406312385018e-9,5.5391679861038246e-9,5.6086877907871706e-9,5.519538864781468e-9,5.605416270566778e-9,5.6185023514483485e-9,5.446747539877729e-9,5.632406312385018e-9,5.611959311007563e-9,5.512995824340683e-9,5.574336828473046e-9,5.45083694015322e-9,5.6185023514483485e-9,5.508906424065192e-9,5.45083694015322e-9,5.616048711283054e-9,5.45083694015322e-9,5.56370438775677e-9,5.621773871668741e-9,5.502363383624406e-9,5.584969269189323e-9,5.6185023514483485e-9,5.6078699107320724e-9,5.643038753101294e-9,6.047071500319799e-9,5.567793788032261e-9,5.638949352825803e-9,5.601326870291287e-9,5.471283941530675e-9,5.587422909354618e-9,5.526081905222253e-9,5.454108460373613e-9,5.638949352825803e-9,5.6561248339828655e-9,5.69456519657248e-9,5.532624945663039e-9,5.546528906599708e-9]
So the remaining question is what causes the timings to snap to certain values so abnormally.
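One way to investigate the snapping, independently of criterion, is to sample the clock directly in a tight loop and histogram the successive deltas: if the timer quantizes, the histogram collapses onto a few step sizes, which is exactly the shape of the trace data above. A rough probe (the `histogram` helper is my own, not part of any library):

```haskell
import Control.Monad (replicateM)
import Data.List (group, sort)
import System.CPUTime (getCPUTime)

-- Count occurrences of each distinct value, in ascending value order.
histogram :: Ord a => [a] -> [(a, Int)]
histogram = map (\g -> (head g, length g)) . group . sort

-- Rough probe: take many back-to-back clock readings and histogram
-- the deltas.  On a quantized timer most of the mass lands on one or
-- two delta values.
main :: IO ()
main = do
  ts <- replicateM 10000 getCPUTime          -- readings in picoseconds
  let deltas = zipWith (-) (tail ts) ts
  mapM_ print (histogram deltas)
```

The actual output is machine-dependent, but a heavily skewed histogram here would confirm that the snapping comes from the clock source rather than from criterion's analysis.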
Oops, closed this by accident.
Fix validated in f028c02e5ec26c5e24f626c2f53da936bbe926bf.
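The contents of that commit aren't shown in this thread. Purely as an illustration of the kind of change that addresses the degenerate case (an assumption, not necessarily what f028c02e actually does): making the fence comparisons strict means an iqr of 0 classifies nothing, instead of counting everything twice.

```haskell
import Data.List (sort)

-- Sketch of one possible fix (simplified, index-based quartiles):
-- strict comparisons at the severe fences, so that when iqr == 0 and
-- the fences collapse onto the data, no sample is classified.
classify' :: [Double] -> (Int, Int)   -- (lowSevere, highSevere)
classify' xs = (count (< loS) xs, count (> hiS) xs)
  where
    sorted  = sort xs
    n       = length xs
    q1      = sorted !! (n `div` 4)
    q3      = sorted !! (3 * n `div` 4)
    iqr     = q3 - q1
    loS     = q1 - 3 * iqr
    hiS     = q3 + 3 * iqr
    count p = length . filter p

main :: IO ()
main = do
  print (classify' (replicate 10 0))       -- prints (0,0): no outliers
  print (classify' [0, 0, 0, 0, 0, 0, 1])  -- prints (0,1): only the lone 1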