Comments (10)
Well, that's entertaining. Could you tell me more about the environment you're running in? OS, GHC version, etc.?
from criterion.
OS: Ubuntu 11.10
Arch: x86_64
GHC: 7.0.3 (standard version in the Ubuntu repository)
I'm not able to reproduce it on GHC 7.4.1, but it reliably reproduces on GHC 7.0.3.
What library is responsible for measuring clock resolution (statistics, criterion, etc.)? I'd be happy to put in some traces or something to help debug.
It's criterion itself that measures the clock resolution: https://github.com/bos/criterion/blob/master/Criterion/Environment.hs#L37
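For context, the idea behind that resolution estimate can be sketched as a pure function over successive clock readings. This is a simplified illustration, not the real code in Criterion/Environment.hs:

```haskell
-- Simplified sketch (NOT criterion's actual implementation): estimate
-- clock resolution by taking successive clock readings and keeping the
-- positive differences.  On a coarse clock almost every difference is
-- the same quantum, which is what feeds the outlier analysis the
-- degenerate data discussed below.
resolutionSamples :: [Double] -> [Double]
resolutionSamples ts = filter (> 0) (zipWith (-) (tail ts) ts)

main :: IO ()
main = print (resolutionSamples [0, 4.0e-6, 4.0e-6, 8.0e-6])
-- A clock that only ticks in 4 us steps yields identical samples:
-- prints [4.0e-6,4.0e-6]
```

Identical readings produce a delta of 0 (dropped by the filter), and every real tick produces the same step size, so the sample list collapses onto a single value.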
It looks like the outlier calculation assumes there will be outliers, and produces bad results if all the samples are nearly identical.
In this case, the measurement of clock resolution on my system is coming up with the same value almost every time. I added some traces to Criterion.Analysis.classifyOutliers so it would use only the first ten values, and print out its internal state. Here's the output:
outliers = Outliers {samplesSeen = 10, lowSevere = 10, lowMild = 0, highMild = 0, highSevere = 10}
ssa = fromList [3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6]
loS = 3.814697265625e-6
loM = 3.814697265625e-6
hiM = 3.814697265625e-6
hiS = 3.814697265625e-6
q1 = 3.814697265625e-6
q3 = 3.814697265625e-6
iqr = 0.0
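The failure mode in that trace can be reproduced with a minimal boxplot-style classifier. This is a sketch with crude index-based quartiles, not criterion's actual code, but it uses the same non-strict fence comparisons the trace implies:

```haskell
import Data.List (sort)

-- Minimal sketch of boxplot outlier classification (NOT criterion's
-- exact code): count samples at or beyond the "severe" fences
-- q1 - 3*iqr and q3 + 3*iqr, using non-strict comparisons.
classify :: [Double] -> (Int, Int)   -- (lowSevere, highSevere)
classify xs = (count (<= loS) xs, count (>= hiS) xs)
  where
    sorted  = sort xs
    n       = length xs
    q1      = sorted !! (n `div` 4)        -- crude quartile by index
    q3      = sorted !! (3 * n `div` 4)
    iqr     = q3 - q1
    loS     = q1 - 3 * iqr
    hiS     = q3 + 3 * iqr
    count p = length . filter p

main :: IO ()
main =
  -- Ten identical samples: iqr = 0, so loS == hiS == the sample value
  -- and every sample satisfies BOTH fences:
  print (classify (replicate 10 3.814697265625e-6))  -- prints (10,10)
```

With iqr = 0 the fences collapse onto the data itself, so every point is double-counted as both a low-severe and a high-severe outlier, producing the ~200% outlier totals seen in this thread.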
Cool! Thanks for looking into this. That should make it easier to figure out a fix.
Any progress on this? I'm seeing the same behavior on Windows 7 64-bit, with criterion-0.6.0.1 and ghc-7.4.1, in this very simple example:
module Main where
import Control.DeepSeq
import Criterion.Main
import Data.List ( sort )
main = defaultMain [ bgroup "rands" . benchSorts $ [1..1000::Int] ]
benchSorts xs = seq (sum xs) [ bench "sort" (nf (sum . sort) xs) ]
Typical output is:
warming up
estimating clock resolution...
mean is 1.637714 us (320001 iterations)
found 638950 outliers among 319999 samples (199.7%)
318951 (99.7%) low severe
319999 (100.0%) high severe
estimating cost of a clock call...
mean is 123.0625 ns (16 iterations)
benchmarking rands/sort
Warning: Couldn't open /dev/urandom
Warning: using system clock for seed instead (quality will be lower)
mean: 15.09599 us, lb 14.81007 us, ub 15.52451 us, ci 0.950
std dev: 1.798179 us, lb 1.497669 us, ub 2.072980 us, ci 0.950
found 17 outliers among 100 samples (17.0%)
17 (17.0%) high severe
variance introduced by outliers: 84.204%
variance is severely inflated by outliers
It doesn't seem to be affected by optimization level. The same happens on a virtual machine running Linux:
warming up
estimating clock resolution...
mean is 5.303620 us (160001 iterations)
found 287114 outliers among 159999 samples (179.4%)
145205 (90.8%) low severe
141909 (88.7%) high severe
estimating cost of a clock call...
mean is 2.490415 us (21 iterations)
found 3 outliers among 21 samples (14.3%)
3 (14.3%) high mild
benchmarking rands/insertionSort
mean: 281.0402 us, lb 277.6374 us, ub 284.8547 us, ci 0.950
std dev: 18.47338 us, lb 16.33518 us, ub 20.99275 us, ci 0.950
variance introduced by outliers: 61.565%
variance is severely inflated by outliers
I also got the same with Windows 7 64-bit, criterion-0.6.1.1, ghc-7.4.1:
Warning: Couldn't open /dev/urandom
Warning: using system clock for seed instead (quality will be lower)
warming up
estimating clock resolution...
mean is 2.190758 us (320001 iterations)
found 639297 outliers among 319999 samples (199.8%)
319298 (99.8%) low severe
319999 (100.0%) high severe
estimating cost of a clock call...
mean is 179.1769 ns (15 iterations)
benchmarking tiny/fib 10
mean: 4.652202 us, lb 4.358087 us, ub 4.973060 us, ci 0.950
std dev: 1.549386 us, lb 1.293269 us, ub 2.013035 us, ci 0.950
found 2 outliers among 100 samples (2.0%)
2 (2.0%) high mild
variance introduced by outliers: 97.865%
variance is severely inflated by outliers
benchmarking tiny/fib 15
mean: 49.72007 us, lb 47.22011 us, ub 51.66477 us, ci 0.950
std dev: 11.37166 us, lb 9.687343 us, ub 12.79400 us, ci 0.950
found 173 outliers among 100 samples (173.0%)
94 (94.0%) low severe
79 (79.0%) high severe
variance introduced by outliers: 95.725%
variance is severely inflated by outliers
With these tests:
{-# LANGUAGE ScopedTypeVariables #-}
import Criterion.Main
main = defaultMain [
    bgroup "tiny" [ bench "fib 10" $ whnf fib 10
                  , bench "fib 15" $ whnf fib 15
                  ]
  ]
fib :: Int -> Int
fib n | n < 0     = error "negative!"
      | otherwise = go (fromIntegral n)
  where
    go 0 = 0
    go 1 = 1
    go n = go (n-1) + go (n-2)
classifyOutliers triggers this bug whenever the input contains many identical values:
>>> classifyOutliers (U.replicate 10 0)
Outliers {samplesSeen = 10, lowSevere = 10, lowMild = 0, highMild = 0, highSevere = 10}
This results in duplicate counts in the lowSevere and highSevere ranges, just like in the other reports here. It also triggers the bug even if not all the data is identical:
>>> classifyOutliers (U.fromList [0, 0, 0, 0, 0, 0, 1])
Outliers {samplesSeen = 7, lowSevere = 6, lowMild = 0, highMild = 0, highSevere = 7}
However, it was not clear what would cause it to receive long stretches of identical data, especially with 100 samples; you would think that would be very improbable. So I added a trace to see what the offending data was, and this is what it was getting:
benchmarking parse
collecting 100 samples, 1 iterations each, in estimated 10.58121 s
mean: 123.6324 ns, lb 47.33842 ns, ub 238.0733 ns, ci 0.950
std dev: 471.0889 ns, lb 368.2355 ns, ub 707.7345 ns, ci 0.950
fromList [2.7629260312427172e-6,8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,
8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,
-9.80969179760326e-8,8.555773984302174e-7,8.555773984302174e-7,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,
-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,
-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,
-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,1.09399597753178e-6,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
8.555773984302174e-7,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,
-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,
-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8]
found 179 outliers among 100 samples (179.0%)
79 (79.0%) low severe
100 (100.0%) high severe
The data has several repeating values, and the value -9.80969179760326e-8 sufficiently dominates the rest of the data set that it triggers the bug. I still don't have an answer for why that exact timing repeats so often. My initial guess was that it was getting near the limit of the timer's minimum step size, but that wasn't the case: benchmarks on trivial code like nf id () give smaller values in the e-9 range that are much more diverse. Here's a sample of some of them:
5.519538864781468e-9,5.444293899712435e-9,5.59069442957501e-9,5.468012421310282e-9,5.529353425442646e-9,5.553071947040493e-9,5.443476019657337e-9,5.621773871668741e-9,5.605416270566778e-9,5.51872098472637e-9,5.646310273321687e-9,5.549800426820101e-9,5.643038753101294e-9,5.516267344561075e-9,5.60459839051168e-9,5.443476019657337e-9,5.571065308252654e-9,5.783714122578182e-9,6.146852867041778e-9,5.763267121200728e-9,5.49173094290813e-9,5.488459422687737e-9,5.557161347315985e-9,5.440204499436944e-9,5.60459839051168e-9,5.5121779442855845e-9,5.502363383624406e-9,5.6185023514483485e-9,5.443476019657337e-9,5.632406312385018e-9,5.5391679861038246e-9,5.6086877907871706e-9,5.519538864781468e-9,5.605416270566778e-9,5.6185023514483485e-9,5.446747539877729e-9,5.632406312385018e-9,5.611959311007563e-9,5.512995824340683e-9,5.574336828473046e-9,5.45083694015322e-9,5.6185023514483485e-9,5.508906424065192e-9,5.45083694015322e-9,5.616048711283054e-9,5.45083694015322e-9,5.56370438775677e-9,5.621773871668741e-9,5.502363383624406e-9,5.584969269189323e-9,5.6185023514483485e-9,5.6078699107320724e-9,5.643038753101294e-9,6.047071500319799e-9,5.567793788032261e-9,5.638949352825803e-9,5.601326870291287e-9,5.471283941530675e-9,5.587422909354618e-9,5.526081905222253e-9,5.454108460373613e-9,5.638949352825803e-9,5.6561248339828655e-9,5.69456519657248e-9,5.532624945663039e-9,5.546528906599708e-9]
So the remaining question is what causes the timings to snap to certain values so abnormally.
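One way to investigate the snapping, independently of criterion, is to sample the clock directly in a tight loop and histogram the successive deltas: if the timer quantizes, the histogram collapses onto a few step sizes, which is exactly the shape of the trace data above. A rough probe (the `histogram` helper is my own, not part of any library):

```haskell
import Control.Monad (replicateM)
import Data.List (group, sort)
import System.CPUTime (getCPUTime)

-- Count occurrences of each distinct value, in ascending value order.
histogram :: Ord a => [a] -> [(a, Int)]
histogram = map (\g -> (head g, length g)) . group . sort

-- Rough probe: take many back-to-back clock readings and histogram
-- the deltas.  On a quantized timer most of the mass lands on one or
-- two delta values.
main :: IO ()
main = do
  ts <- replicateM 10000 getCPUTime          -- readings in picoseconds
  let deltas = zipWith (-) (tail ts) ts
  mapM_ print (histogram deltas)
```

The actual output is machine-dependent, but a heavily skewed histogram here would confirm that the snapping comes from the clock source rather than from criterion's analysis.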
Oops, closed this by accident.
Fix validated in f028c02e5ec26c5e24f626c2f53da936bbe926bf.
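The contents of that commit aren't shown in this thread. Purely as an illustration of the kind of change that addresses the degenerate case (an assumption, not necessarily what f028c02e actually does): making the fence comparisons strict means an iqr of 0 classifies nothing, instead of counting everything twice.

```haskell
import Data.List (sort)

-- Sketch of one possible fix (simplified, index-based quartiles):
-- strict comparisons at the severe fences, so that when iqr == 0 and
-- the fences collapse onto the data, no sample is classified.
classify' :: [Double] -> (Int, Int)   -- (lowSevere, highSevere)
classify' xs = (count (< loS) xs, count (> hiS) xs)
  where
    sorted  = sort xs
    n       = length xs
    q1      = sorted !! (n `div` 4)
    q3      = sorted !! (3 * n `div` 4)
    iqr     = q3 - q1
    loS     = q1 - 3 * iqr
    hiS     = q3 + 3 * iqr
    count p = length . filter p

main :: IO ()
main = do
  print (classify' (replicate 10 0))       -- prints (0,0): no outliers
  print (classify' [0, 0, 0, 0, 0, 0, 1])  -- prints (0,1): only the lone 1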