haskell / criterion
A powerful but simple library for measuring the performance of Haskell code.
Home Page: http://www.serpentine.com/criterion
License: BSD 2-Clause "Simplified" License
I think there is an error in criterion's data analysis. Here is a simplified description of the algorithm as I understand it. For simplicity I will ignore the discreteness of the clock.
The problem is that the clock call cost is a measured quantity, and this measurement has an error (σ(t) from now on) which is never taken into account. This error corresponds to a shift of the timing distribution as a whole and cannot be eliminated: no averaging procedure can detect such a shift. So the error of a benchmark cannot be less than σ(t). It shouldn't be significant for long-running functions, but it is significant for functions which take about the same time or less to complete than getPOSIXTime itself. mwc-random's benchmarks should be affected.
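To make σ(t) concrete, here is a tiny base-only sketch (the sample values are invented, and the mean/stdDev helpers are illustration only, not criterion code): it estimates the clock-call cost and its spread. Since the whole timing distribution shifts by the clock-cost estimate, that spread lower-bounds the benchmark error, as argued above.

```haskell
import Data.List (genericLength)

-- Sample mean and (population) standard deviation, for illustration only.
mean :: [Double] -> Double
mean xs = sum xs / genericLength xs

stdDev :: [Double] -> Double
stdDev xs = sqrt (mean [ (x - m) ^ (2 :: Int) | x <- xs ])
  where m = mean xs

-- Hypothetical clock-call cost samples in nanoseconds. The estimate of the
-- clock cost has spread sigma(t); subtracting its mean from every benchmark
-- timing shifts the whole distribution, so averaging more benchmark samples
-- can never push the benchmark's error below sigma(t).
clockCost :: [Double]
clockCost = [120, 131, 118, 140, 125]

main :: IO ()
main = do
  putStrLn ("estimated clock cost: " ++ show (mean clockCost) ++ " ns")
  putStrLn ("sigma(t):             " ++ show (stdDev clockCost) ++ " ns")
```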
"Small changes to a program or its execution environment can perturb its layout, which affects caches and branch predictors. The impact of these layout changes is unpredictable and substantial: Mytkowicz et al. show that just changing the size of environment variables can trigger performance degradation as high as 300%; we find that simply changing the link order of object files can cause performance to decrease by as much as 57%. Failure to control for layout is a form of measurement bias. All executions constitute just one sample from the vast space of possible memory layouts. This limited sampling makes statistical tests inapplicable, since they depend on multiple samples over a space, often with a known distribution. As a result, it is currently not possible to test whether a code modification is the direct cause of any observed performance change, or if it is due to incidental effects like a different code, stack, or heap layout. A random memory layout eliminates the effect of layout on performance, and repeated randomization leads to normally-distributed execution times. This makes it straightforward to use standard statistical tests for performance evaluation." (courtesy of the paper at http://www.stabilizer-tool.org/)
In http://www.serpentine.com/criterion/tutorial.html,
the following paragraph contains an anchor tag without a meaningful target (for the phrase "normal form"):
We use nfIO to specify that after we run the IO action, its result must be evaluated
to normal form, i.e. so that all of its internal constructors are fully evaluated, and it
contains no thunks.
Did you mean it to point here: http://www.haskell.org/haskellwiki/Weak_head_normal_form ?
The documentation in Criterion.Main says, in the "Benchmarking pure code" part:
The first is a function which will cause results to be evaluated to head normal form (NF):
nf :: NFData b => (a -> b) -> a -> Pure
Should this say "normal form" instead of "head normal form"? Maybe I have misunderstood the terms.
When a comparison CSV file is produced, there doesn't seem to be any escaping done to the benchmark names.
(e.g. quoting fields that contain commas, or escaping literal quotation marks)
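A minimal RFC 4180-style field escaper shows what the fix could look like (a sketch, not criterion's code; the function name is hypothetical):

```haskell
-- A minimal RFC 4180-style field escaper: quote any field containing a
-- comma, quote, or line break, and double literal quotes inside it.
-- Criterion would apply something like this to benchmark names before
-- writing each CSV row.
escapeCsvField :: String -> String
escapeCsvField s
  | any (`elem` ",\"\n\r") s = '"' : concatMap doubleQuote s ++ "\""
  | otherwise                = s
  where
    doubleQuote '"' = "\"\""   -- literal quotes are doubled inside a quoted field
    doubleQuote c   = [c]

main :: IO ()
main = do
  putStrLn (escapeCsvField "plain name")
  putStrLn (escapeCsvField "name, with comma")
  putStrLn (escapeCsvField "say \"hi\"")
```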
Criterion build is failing on Mac (but not on Linux).
System details first:
$ which ghc
/Volumes/Data/scripts/ghc/7.6.1/bin/ghc
$ uname -a
Darwin desktop.local 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64
I get the error below when trying to do cabal install:
Building criterion-0.6.1.1...
[ 1 of 12] Compiling Paths_criterion ( dist/build/autogen/Paths_criterion.hs, dist/build/Paths_criterion.o )
dist/build/autogen/Paths_criterion.hs:21:13: Not in scope: `catch'
dist/build/autogen/Paths_criterion.hs:22:13: Not in scope: `catch'
dist/build/autogen/Paths_criterion.hs:23:14: Not in scope: `catch'
dist/build/autogen/Paths_criterion.hs:24:17: Not in scope: `catch'
cabal: Error: some packages failed to install:
criterion-0.6.1.1 failed during the building phase. The exception was:
ExitFailure 1
I get the following error in my criterion code
Exception inside child thread "(worker 0 of originator ThreadId 3)", ThreadId 7: thread blocked indefinitely in an MVar operation
benchmarks: thread blocked indefinitely in an MVar operation
Here's a simplified piece of code that replicates this error:
{-# LANGUAGE BangPatterns #-}
module Main where
import Criterion.Main
import Criterion.Config
import Data.Word
import Data.Foldable
whnfIter :: Int -> (a -> a) -> a -> Pure
whnfIter cnt f arg = whnf (\v -> foldl' (\a _ -> f a) v [0 .. cnt]) arg
main = defaultMainWith defaultConfig { cfgSamples = ljust 10 } (return ())
  [ bgroup "Morton Z"
      [ bcompare
          [ bench "addingNumbersIter1000" $! whnfIter 1000 ((7 +) :: Word -> Word) 9
          ]
      ]
  ]
Here are the GHC options I used; most of them are probably irrelevant, but I'm including them for completeness:
ghc-options: -O2 -optlo "-O3" -fllvm -optlc "-O3" -optlo "-std-compile-opts" -optlo "-bb-vectorize" -fllvm-tbaa -optlo "-regalloc=pbqp" -rtsopts -threaded -with-rtsopts=-N2
Preprocessing test suite 'tests' for criterion-1.0.0.2...
tests/Tests.hs:5:8:
Could not find module ‘Properties’
It would be nice to have error bars on summary plots so one could easily see whether a difference could be explained by statistical fluctuations. Sadly, flot lacks built-in error bars, but there is a plugin for it. Another possible approach is to use overlapping bar plots a la:
-----------
| | |
-----------
It seems statistics 0.10 moved Statistics.KernelDensity to Statistics.Sample.KernelDensity, which causes the criterion build to fail. You need to set an upper limit on the statistics version.
On Windows, getPOSIXTime uses the C function GetSystemTimeAsFileTime, which gives low-resolution timings. As observed in issue #11, it will return the same value many times in a row. criterion therefore produces low-quality statistics under Windows. QueryPerformanceCounter is the appropriate timing source for benchmarks on Windows; we can access it with Clock.getTime Clock.Monotonic from the clock package. See example data below.
My patch, conklech/criterion@b6f657c, fixes the problem on Windows, but might create compatibility problems on some POSIX systems.
In issue #2, @coreyoconnor gave a similar patch, coreyoconnor/criterion@37b1216, which uses Clock.ProcessCPUTime instead. That has different semantics, which was the goal there. But it's not clear that Clock.ProcessCPUTime gives benchmark-grade data on Windows. It uses GetProcessTimes under the hood; MSDN doesn't explain where the data comes from, and I haven't tested it.
Under POSIX, time calls gettimeofday and clock calls clock_gettime. Cursory Google research indicates that clock_gettime with CLOCK_MONOTONIC is the proper benchmark timesource; gettimeofday can go backwards, for instance when ntp is adjusting the clock.
However, it looks like some systems don't implement CLOCK_MONOTONIC. Perhaps just OS X? I don't know. Someone with better POSIX knowledge should look into compatibility issues with CLOCK_MONOTONIC in non-Win32 environments.
On my Windows 7 laptop, criterion using time (i.e. criterion-0.6.2.1) reports an estimated resolution of about 3.2 msec, with the weird 200% outlier rate reported by others. However, the final report data shows that all measurements share the same two or three values:
After dropping in clock and using Clock.Monotonic, the resolution is reported as 4.5 msec but the outlier rate problem is gone, and the final report shows an actual distribution of data:
Same problem as in haskell/statistics#52.
I don't know if criterion is supposed to work on Windows, but it doesn't seem to work. The clock estimation gets more outliers than samples. Impressive. :)
Benchmark program:
import Criterion.Main
main = defaultMain [bench "test" $ whnf succ 0]
Running this you get:
$ ghc --make Crit.hs
[1 of 1] Compiling Main ( Crit.hs, Crit.o )
Linking Crit.exe ...
Lennart@Lennart-Work /workspace/Proj
$ ./Crit
Warning: Couldn't open /dev/urandom
Warning: using system clock for seed instead (quality will be lower)
warming up
estimating clock resolution...
mean is 1.501650 us (640001 iterations)
found 1279037 outliers among 639999 samples (199.8%)
639038 (99.8%) low severe
639999 (100.0%) high severe
estimating cost of a clock call...
mean is 125.8999 ns (14 iterations)
found 1 outliers among 14 samples (7.1%)
1 (7.1%) high severe
benchmarking test
mean: 11.38051 ns, lb 10.87220 ns, ub 12.19332 ns, ci 0.950
std dev: 3.319025 ns, lb 2.605880 ns, ub 4.085354 ns, ci 0.950
found 12 outliers among 100 samples (12.0%)
12 (12.0%) high severe
variance introduced by outliers: 97.824%
variance is severely inflated by outliers
If a computation runs quite long, say, longer than 0.5s, then I get a result like:
OLS regression 544 ms 570 ms 0 s
R² goodness-of-fit 0.999 0.999 xxx
in the HTML report and then all following regression results are "xxx", too. This also implies that the overview diagram is not generated.
[10 of 12] Compiling Criterion.Report ( Criterion/Report.hs, dist/build/Criterion/Report.o )
Criterion/Report.hs:73:25:
No instance for (MonadIO Criterion)
arising from a use of `liftIO'
Possible fix: add an instance declaration for (MonadIO Criterion)
In the expression: liftIO
In the expression:
liftIO
$ do { tpl <- loadTemplate
[".", templateDir] (fromLJ cfgTemplate cfg);
L.writeFile name =<< formatReport reports tpl }
In a case alternative:
Last (Just name)
-> liftIO
$ do { tpl <- loadTemplate [".", ....] (fromLJ cfgTemplate cfg);
L.writeFile name =<< formatReport reports tpl }
cabal: Error: some packages failed to install:
criterion-0.6.0.0 failed during the building phase. The exception was:
ExitFailure 1
Here are my currently installed packages:
% ghc-pkg list
/usr/lib/ghc-7.0.4/package.conf.d
Cabal-1.10.2.0
HTTP-4000.1.1
HUnit-1.2.0.3
MissingH-1.1.0.3
X11-1.5.0.0
X11-xft-0.3
array-0.3.0.2
base-4.3.1.0
bin-package-db-0.0.0.0
binary-0.5.0.2
bytestring-0.9.1.10
containers-0.4.0.0
dataenc-0.14
directory-1.1.0.0
extensible-exceptions-0.1.1.2
ffi-1.0
filepath-1.2.0.0
ghc-7.0.4
ghc-binary-0.5.0.2
ghc-paths-0.1.0.8
ghc-prim-0.2.0.0
haddock-2.9.2
hashed-storage-0.4.13
haskeline-0.6.4.0
haskell2010-1.0.0.0
haskell98-1.1.0.1
hpc-0.5.0.6
hslogger-1.1.4
html-1.0.1.2
integer-gmp-0.2.0.3
mmap-0.4.1
mtl-1.1.1.1
network-2.2.1.7
old-locale-1.0.0.2
old-time-1.0.0.6
parsec-2.1.0.1
pretty-1.0.1.2
process-1.0.1.5
random-1.0.0.3
regex-base-0.93.1
regex-compat-0.92
regex-posix-0.94.1
rts-1.0
stm-2.1.2.2
syb-0.3.2
template-haskell-2.5.0.0
terminfo-0.3.1.3
time-1.1.2.0
time-1.2.0.3
unix-2.4.2.0
utf8-string-0.3.6
xhtml-3000.2.0.1
xmonad-0.9.2
xmonad-contrib-0.9.2
zlib-0.5.2.0
/home/ollie/.ghc/i386-linux-7.0.4/package.conf.d
aeson-0.5.0.0
asn1-data-0.6.1.2
attoparsec-0.10.1.0
attoparsec-conduit-0.0.1
attoparsec-enumerator-0.3
base-unicode-symbols-0.2.2.3
base64-bytestring-0.1.1.0
blaze-builder-0.3.0.2
blaze-builder-conduit-0.0.1
case-insensitive-0.4.0.1
cereal-0.3.5.1
certificate-1.0.1
cmdargs-0.9.2
conduit-0.1.1
cprng-aes-0.2.3
crypto-api-0.8
crypto-pubkey-types-0.1.0
cryptocipher-0.3.0
cryptohash-0.7.4
data-default-0.3.0
deepseq-1.2.0.1
dlist-0.5
double-conversion-0.2.0.4
entropy-0.2.1
enumerator-0.4.18
erf-2.0.0.0
failure-0.2.0
hashable-1.1.2.2
hastache-0.2.4
http-conduit-1.2.0
http-types-0.6.8
largeword-1.0.1
lifted-base-0.1.0.2
math-functions-0.1.1.0
monad-control-0.3.1
monad-par-0.1.0.3
mwc-random-0.11.0.0
network-2.3.0.8
parsec-3.1.2
primitive-0.4.1
semigroups-0.8
statistics-0.10.1.0
tagged-0.2.3.1
text-0.11.1.12
text-format-0.3.0.7
tls-0.8.5
tls-extra-0.4.2
transformers-0.2.2.0
transformers-base-0.4.1
unix-compat-0.3.0.1
unordered-containers-0.1.4.6
vector-0.9.1
vector-algorithms-0.5.3
zlib-bindings-0.0.2
zlib-conduit-0.0.1
And my GHC information:
% ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.0.4
Error is:
Preprocessing library criterion-0.8.0.0...
[ 1 of 14] Compiling Paths_criterion ( dist_prof/build/autogen/Paths_criterion.hs, dist_prof/build/Paths_criterion.o )
[ 2 of 14] Compiling Criterion.Config ( Criterion/Config.hs, dist_prof/build/Criterion/Config.o )
[ 3 of 14] Compiling Criterion.Monad ( Criterion/Monad.hs, dist_prof/build/Criterion/Monad.o )
[ 4 of 14] Compiling Criterion.Measurement ( Criterion/Measurement.hs, dist_prof/build/Criterion/Measurement.o )
[ 5 of 14] Compiling Criterion.IO.Printf ( Criterion/IO/Printf.hs, dist_prof/build/Criterion/IO/Printf.o )
[ 6 of 14] Compiling Criterion.Analysis.Types ( Criterion/Analysis/Types.hs, dist_prof/build/Criterion/Analysis/Types.o )
[ 7 of 14] Compiling Criterion.Analysis ( Criterion/Analysis.hs, dist_prof/build/Criterion/Analysis.o )
Criterion/Analysis.hs:129:29:
Couldn't match type `v0 Double -> Double'
with `Statistics.Types.Estimator'
Expected type: [Statistics.Types.Estimator]
Actual type: [v0 Double -> Double]
In the second argument of `resample', namely `ests'
In the expression:
resample gen ests numResamples samples :: IO [Resample]
In the second argument of `($)', namely
`\ gen -> resample gen ests numResamples samples :: IO [Resample]'
Criterion/Analysis.hs:130:55:
Couldn't match type `v0 Double -> Double'
with `Statistics.Types.Estimator'
Expected type: [Statistics.Types.Estimator]
Actual type: [v0 Double -> Double]
In the third argument of `B.bootstrapBCA', namely `ests'
In the expression: B.bootstrapBCA ci samples ests resamples
In a pattern binding:
[estMean, estStdDev] = B.bootstrapBCA ci samples ests resamples
Failed to install criterion-0.8.0.0
Here's an example of splitting the top-level chart vs. the current one:
https://bitbucket.org/carter/multicriterion-templates/src/b1c1b808b71c?at=master is the current repo for this.
I'm not necessarily saying "merge this in", but it might at least be worth having a centrally curated "plugins.md" / "addons.md" / "moretools.md" list or something
Criterion-0.6.2.0 fails to build with ghc-7.6.1 due to the removal of Prelude.catch in base-4.6.0.0. In particular, this causes the compilation of Paths_criterion to fail.
I'm not sure what would cause this -- how could it be getting more outliers than there are samples?
warming up
estimating clock resolution...
mean is 5.257008 us (160001 iterations)
found 243504 outliers among 159999 samples (152.2%)
120537 (75.3%) low severe
122967 (76.9%) high severe
estimating cost of a clock call...
mean is 31.80684 ns (52 iterations)
found 6 outliers among 52 samples (11.5%)
4 (7.7%) high mild
2 (3.8%) high severe
I was doing some testing where I'm acquiring and consuming resources - popping rows off of a database. I found that criterion was consuming way more rows than I had expected.
So I tried another IO process with the code below:
import Control.Monad
import Criterion
import Criterion.Main
import Data.IORef (newIORef, readIORef, modifyIORef')
cri :: IO Int
cri = do
ref <- newIORef 0
defaultMain
[ bench "ioref" $ nfIO $ modifyIORef' ref (+1)
]
readIORef ref
noncri :: Int -> IO Int
noncri n = do
ref <- newIORef 0
replicateM_ n $ modifyIORef' ref (+1)
readIORef ref
main :: IO ()
main = do
cri >>= print
noncri 100 >>= print
And I got this:
warming up
estimating clock resolution...
mean is 1.308630 us (640001 iterations)
found 3253 outliers among 639999 samples (0.5%)
3041 (0.5%) high severe
estimating cost of a clock call...
mean is 49.82543 ns (9 iterations)
found 2 outliers among 9 samples (22.2%)
2 (22.2%) high severe
benchmarking ioref
mean: 151.7543 ns, lb 150.2320 ns, ub 153.5662 ns, ci 0.950
std dev: 8.485390 ns, lb 7.286958 ns, ub 9.796859 ns, ci 0.950
found 9 outliers among 100 samples (9.0%)
9 (9.0%) high mild
variance introduced by outliers: 53.488%
variance is severely inflated by outliers
927667
100
Why would the run hit the IORef nearly a million times for a 100-sample config? Or is this a spurious way to use criterion?
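For what it's worth, here is a toy model of why a run configured for 100 samples can execute the action hundreds of thousands of times (this is not criterion's actual code; in particular, the factor of 1000 relating sample duration to clock resolution is an assumption):

```haskell
-- Each *sample* runs the action k times, where k is scaled so that one
-- sample lasts long enough relative to the estimated clock resolution to
-- be measurable; the total run count is samples * k.
itersPerSample :: Double -> Double -> Int
itersPerSample clockResolution costPerRun =
  ceiling (clockResolution * 1000 / costPerRun)  -- assumed scaling factor

totalRuns :: Int -> Double -> Double -> Int
totalRuns samples res cost = samples * itersPerSample res cost

main :: IO ()
main =
  -- ~1.3 us clock resolution and ~150 ns per modifyIORef', roughly as in
  -- the output above:
  print (totalRuns 100 1.3e-6 1.5e-7)
```

With those inputs the model predicts 866,700 runs, the same order of magnitude as the 927,667 observed, so the count is plausibly just iterations-per-sample at work rather than a bug.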
It appears Config.fromLJ will not terminate when called to extract a field which was left mempty in the defaultConfig.
For example:
> fromLJ cfgResults defaultConfig
"^CInterrupted.
cc @mboes
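A toy reconstruction of how such a loop can arise (a guess at the pattern; the names mirror Criterion.Config but this is not the actual source):

```haskell
import Data.Monoid (Last(..))

-- Toy config with one "last-write-wins" field, mirroring criterion's style.
data Config = Config { cfgBanner :: Last String }

defaultConfig :: Config
defaultConfig = Config { cfgBanner = mempty }

-- "from Last, Just": fall back to defaultConfig when the field is unset.
-- If the field is *also* mempty in defaultConfig, this recurses forever,
-- which would explain the observed non-termination.
fromLJ :: (Config -> Last a) -> Config -> a
fromLJ f cfg = case f cfg of
  Last (Just x) -> x
  Last Nothing  -> fromLJ f defaultConfig

main :: IO ()
main = putStrLn (fromLJ cfgBanner (Config (Last (Just "set"))))
-- fromLJ cfgBanner defaultConfig, by contrast, never terminates.
```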
I am currently trying to profile some code that is using the async package. It seems there is a problem with it, because directly after the first test using async the program crashes:
/tmp/criterion19204.dat: hPutBuf: illegal operation (handle is closed)
This is the error when running compiled with -threaded but without +RTS -N -RTS. When I enable the threaded runtime I get:
Exception inside child thread "(worker 1 of originator ThreadId 3)", ThreadId 14: thread blocked indefinitely in an MVar operation
[several of them]
Exception inside child thread "(worker 6 of originator ThreadId 3)", ThreadId 19: thread blocked indefinitely in an MVar operation
Is there anything known about this?
Subj.
The issue is that cassava's encode function now expects a list instead of a vector. The encode function from cassava-0.2.2.0 had type:
encode :: ToRecord a => Vector a -> ByteString
In cassava-0.3.0.0 it now has type:
encode :: ToRecord a => [a] -> ByteString
This causes problems when building Criterion.IO.Printf (and possibly other modules, but this is as far as I got):
Criterion/IO/Printf.hs:102:56:
Could not deduce (G.Vector [] a)
arising from a use of `G.singleton'
from the context (Csv.ToRecord a)
bound by the type signature for
writeCsv :: Csv.ToRecord a => a -> Criterion ()
at Criterion/IO/Printf.hs:(99,1)-(103,24)
Possible fix:
add (G.Vector [] a) to the context of
the type signature for
writeCsv :: Csv.ToRecord a => a -> Criterion ()
or add an instance declaration for (G.Vector [] a)
In the second argument of `(.)', namely `G.singleton'
In the second argument of `(.)', namely `Csv.encode . G.singleton'
In the second argument of `(.)', namely
`B.appendFile fn . Csv.encode . G.singleton'
I'd issue a pull request, but I wasn't sure if you wanted to immediately switch to cassava-0.3.0.0, still support cassava-0.2.2.0, or do some CPP magic to support both.
Benchmarking in IO often requires phases of initialization and cleanup, or resource acquisition and release, before and after the execution of each sample. Currently it is impossible to exclude those phases from the measurement.
To provide a flexible solution to this problem, I suggest updating the interface of the Benchmarkable class to the following:
class Benchmarkable a where
run :: a -> Int -> StartTimer -> StopTimer -> IO ()
type StartTimer = IO ()
type StopTimer = IO ()
After also adding the following:
newtype IOBenchmark = IOBenchmark (StartTimer -> StopTimer -> IO ())
instance Benchmarkable IOBenchmark where
  run (IOBenchmark f) n start stop = replicateM_ n (f start stop)
the user will finally be able to apply it like so:
main = defaultMain
[
bench "My wonderful benchmark" $ IOBenchmark $ \startTimer stopTimer -> do
db <- establishADBConnection
startTimer -- !
workWithDB db
stopTimer -- !
closeDBConnection db
cleanUp
]
One can imagine some more involved scenarios, where the timer starts and stops multiple times during execution of just a single sample.
Another important benefit of the suggested approach is that it will support monad transformers:
main = defaultMain
[
bench "My wonderful benchmark" $ IOBenchmark $ \startTimer stopTimer -> do
runDBConnectionT $ do
liftIO $ startTimer
work
liftIO $ stopTimer
]
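To sanity-check that the proposed signature is implementable, here is a self-contained toy harness (the withUserTimers name and the GHC.Clock-based timer are my own additions for illustration, not part of the proposal or of criterion):

```haskell
import Data.IORef (newIORef, readIORef, writeIORef, modifyIORef')
import Data.Word (Word64)
import GHC.Clock (getMonotonicTimeNSec)  -- base >= 4.11

type StartTimer = IO ()
type StopTimer  = IO ()

-- Toy harness for the proposed interface: the timers bracket only the
-- region the caller marks, so setup/teardown is excluded from the total.
withUserTimers :: (StartTimer -> StopTimer -> IO ()) -> IO Word64
withUserTimers body = do
  acc  <- newIORef 0   -- accumulated measured nanoseconds
  mark <- newIORef 0   -- timestamp recorded by the last startTimer
  let start = getMonotonicTimeNSec >>= writeIORef mark
      stop  = do t1 <- getMonotonicTimeNSec
                 t0 <- readIORef mark
                 modifyIORef' acc (+ (t1 - t0))
  body start stop
  readIORef acc

main :: IO ()
main = do
  ns <- withUserTimers $ \start stop -> do
    -- unmeasured setup (e.g. opening a connection) would go here
    start
    print (sum [1 .. 100000 :: Int])  -- the measured region
    stop
    -- unmeasured teardown would go here
  print (ns > 0)
```

Note that the accumulator also covers the "timer starts and stops multiple times per sample" scenario mentioned above, since each start/stop pair just adds to the total.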
I've tried to build criterion, but it failed, and the build produced the following very long error log:
http://pastebin.com/8VW3q5g9
I'm running Windows 7 x64 SP1, GHC 7.8.3.
The build fails with both cabal-install 1.20.0.3, cabal library 1.20.0.2 and cabal-install 1.18.0.5, cabal library 1.18.1.3.
A fresh installation of the latest Haskell Platform (2014.2.0.0) for x64 machines was used (so the MinGW GCC version is GCC (rubenvb-4.6.3) 4.6.3).
ghc-pkg check doesn't find any problems with any package.
A measurement including "µs" will trigger an error in System.IO if the default encoding doesn't support the µ character, with the following output:
fibber.exe: : commitBuffer: invalid argument (invalid character)
Notably this is an issue on Windows where the default encoding is latin1.
I think the best solution is to switch 'µ' to 'u'.
I tried using hSetEncoding and it works, but the output will look like junk on a non-UTF-8 locale.
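The 'µ' to 'u' fallback suggested above is a one-liner; a sketch (the function name is hypothetical):

```haskell
-- Degrade the micro sign to plain ASCII 'u' before printing, so report
-- output survives a console whose encoding cannot represent 'µ'.
asciiUnits :: String -> String
asciiUnits = map (\c -> if c == 'µ' then 'u' else c)

main :: IO ()
main = putStrLn (asciiUnits "mean: 1.501650 µs")
```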
Hello,
a compiled binary can't be run properly on another machine because of the error ': TemplateNotFound "report.tpl"'. If the "templates" dir from the .cabal package is shipped alongside the binary, and the binary is run with "-t templates/report.tpl", the report is missing pictures.
Ubuntu OS.
Best regards,
vlatko
Please see below for the error log on cabal install of criterion 0.6.0.0. The error doesn't happen with GHC 7.0.4. It will be very helpful to have criterion build on 6.12.3 as well because I am trying to pin-point the performance regression between 6.12.3 and 7.0.4 for my code. I am using RHEL 5 x86_64.
Configuring criterion-0.6.0.0...
Preprocessing library criterion-0.6.0.0...
Building criterion-0.6.0.0...
[ 1 of 12] Compiling Paths_criterion ( dist/build/autogen/Paths_criterion.hs, dist/build/Paths_criterion.o )
[ 2 of 12] Compiling Criterion.Analysis.Types ( Criterion/Analysis/Types.hs, dist/build/Criterion/Analysis/Types.o )
[ 3 of 12] Compiling Criterion.Types ( Criterion/Types.hs, dist/build/Criterion/Types.o )
[ 4 of 12] Compiling Criterion.Measurement ( Criterion/Measurement.hs, dist/build/Criterion/Measurement.o )
[ 5 of 12] Compiling Criterion.Config ( Criterion/Config.hs, dist/build/Criterion/Config.o )
[ 6 of 12] Compiling Criterion.Monad ( Criterion/Monad.hs, dist/build/Criterion/Monad.o )
[ 7 of 12] Compiling Criterion.IO ( Criterion/IO.hs, dist/build/Criterion/IO.o )
[ 8 of 12] Compiling Criterion.Analysis ( Criterion/Analysis.hs, dist/build/Criterion/Analysis.o )
[ 9 of 12] Compiling Criterion.Environment ( Criterion/Environment.hs, dist/build/Criterion/Environment.o )
[10 of 12] Compiling Criterion.Report ( Criterion/Report.hs, dist/build/Criterion/Report.o )
Criterion/Report.hs:73:24:
No instance for (MonadIO Criterion)
arising from a use of `liftIO' at Criterion/Report.hs:73:24-29
Possible fix: add an instance declaration for (MonadIO Criterion)
In the first argument of `($)', namely `liftIO'
In the expression:
liftIO
$ do { tpl <- loadTemplate
[".", templateDir](fromLJ cfgTemplate cfg);
L.writeFile name =<< formatReport reports tpl }
In a case alternative:
Last (Just name)
-> liftIO
$ do { tpl <- loadTemplate [".", ....] (fromLJ cfgTemplate cfg);
L.writeFile name =<< formatReport reports tpl }
cabal: Error: some packages failed to install:
criterion-0.6.0.0 failed during the building phase. The exception was:
ExitFailure 1
When I run
import Criterion.Main
main = defaultMain [
bcompare [
bench "exp" $ whnf exp (2 :: Double)
, bench "log" $ whnf log (2 :: Double)
, bench "sqrt" $ whnf sqrt (2 :: Double)
]
]
I get:
Not in scope: `bcompare'
I am using criterion 0.8.1, vector 10.9.1 and ghc 7.8.2.
The following code:
import Criterion.Config
import Criterion.Main
critConfig = defaultConfig
{ cfgSamples = ljust 4
-- , cfgResamples = ljust 2
}
main = do
defaultMainWith critConfig (return ())
[ bench "bug" $ print "squash"
]
causes a crash with the error message:
criterion-bug: ./Data/Vector/Generic.hs:249 ((!)): index out of bounds (-9223372036854775808,100000)
criterion-bug: thread blocked indefinitely in an MVar operation
The big negative number is equal to minBound :: Int. If you replace the 4 with a 5, then everything works fine.
The 100000 made me think the error might have something to do with cfgResamples. If you uncomment the line assigning it to 2, then the program works fine as long as cfgSamples is greater than 25. The program stops working, however, if you set it to a smaller number.
I tried searching with grep for where the library uses ! to index into a vector: grep -r "\!" Criterion. But unfortunately, this turned up no uses.
Code with problematic output attached below.
The issue is that the measurements differ: if defaultMain has multiple vector benchmarks, the measurements seem to include the time to build the function inputs as well. If I benchmark only one vector function, the measurement excludes the time to build the function input, as I expected it to. The code below has two bench functions, "DoubleV" and "IntV". If I benchmark both, I get ~400 ns for each. If I benchmark only "DoubleV" by commenting out "IntV", I get ~185 ns, which is the expected measurement. I didn't see this issue when I initially created the functions using List, but when I switched from List to unboxed Vectors, I did. So it seems to be specific to vectors.
I can reproduce it only for the code below in question - I tried to reproduce it for functions built using basic operations like foldl but couldn't.
-- | Begin Haskell code
import Data.ByteString.Internal (unsafeCreate, ByteString)
import qualified Data.Vector.Unboxed as V (Vector, forM_, fromList, replicate, length, Unbox)
import Data.Bits (shiftR)
import GHC.Int (Int16, Int32, Int64)
import GHC.Word (Word8, Word16, Word32, Word64)
import Unsafe.Coerce (unsafeCoerce)
import Criterion.Main
import Foreign (Ptr, poke, plusPtr)
import Data.IORef (newIORef, readIORef, writeIORef)

-- | Write a Word32 in little endian format
putWord32le :: Word32 -> Ptr Word8 -> IO ()
putWord32le w p = do
  poke p (fromIntegral w :: Word8)
  poke (p `plusPtr` 1) (fromIntegral (shiftR w 8)  :: Word8)
  poke (p `plusPtr` 2) (fromIntegral (shiftR w 16) :: Word8)
  poke (p `plusPtr` 3) (fromIntegral (shiftR w 24) :: Word8)
{-# INLINE putWord32le #-}

-- | Write a Word64 in little endian format
putWord64le :: Word64 -> Ptr Word8 -> IO ()
putWord64le w p = do
  poke p (fromIntegral w :: Word8)
  poke (p `plusPtr` 1) (fromIntegral (shiftR w 8)  :: Word8)
  poke (p `plusPtr` 2) (fromIntegral (shiftR w 16) :: Word8)
  poke (p `plusPtr` 3) (fromIntegral (shiftR w 24) :: Word8)
  poke (p `plusPtr` 4) (fromIntegral (shiftR w 32) :: Word8)
  poke (p `plusPtr` 5) (fromIntegral (shiftR w 40) :: Word8)
  poke (p `plusPtr` 6) (fromIntegral (shiftR w 48) :: Word8)
  poke (p `plusPtr` 7) (fromIntegral (shiftR w 56) :: Word8)
{-# INLINE putWord64le #-}

-- | Function to generate putWordNleV functions, N = 16,32,64
genFnPutWordNleV :: (V.Unbox a) => (a -> Ptr Word8 -> IO ()) -> Int -> V.Vector a -> Ptr Word8 -> IO ()
genFnPutWordNleV f n w p = do
  addr <- newIORef p                       -- store the initial pointer address
  V.forM_ w $ \x -> do                     -- loop over the WordN vector; output type is IO ()
    curAddr <- readIORef addr              -- get the address for the current wordN
    writeIORef addr (curAddr `plusPtr` n)  -- write the address for the next wordN
    f x curAddr                            -- put the current WordN; must be the last action to satisfy the output type IO ()

putWord64leV :: V.Vector Word64 -> Ptr Word8 -> IO ()
putWord64leV = genFnPutWordNleV putWord64le 8
{-# INLINE putWord64leV #-}

putWord32leV :: V.Vector Word32 -> Ptr Word8 -> IO ()
putWord32leV = genFnPutWordNleV putWord32le 4
{-# INLINE putWord32leV #-}

encodeDoubleV :: V.Vector Double -> ByteString
encodeDoubleV x = unsafeCreate (8 * V.length x) (putWord64leV $ unsafeCoerce x)

encodeInt32V :: V.Vector GHC.Int.Int32 -> ByteString
encodeInt32V x = unsafeCreate (4 * V.length x) (putWord32leV $ unsafeCoerce x)

main :: IO ()
main = do
  let intv    = V.fromList [1..10] :: V.Vector GHC.Int.Int32
      doublev = V.fromList [1..10] :: V.Vector Double
  defaultMain
    [ bench "DoubleV" $ whnf encodeDoubleV doublev
    , bench "IntV"    $ whnf encodeInt32V intv
    ]
-- | End Haskell Code
warming up
estimating clock resolution...
mean is 5.155132 us (160001 iterations)
found 275510 outliers among 159999 samples (172.2%)
135061 (84.4%) low severe
140449 (87.8%) high severe
estimating cost of a clock call...
mean is 47.38399 ns (35 iterations)
found 5 outliers among 35 samples (14.3%)
2 (5.7%) high mild
3 (8.6%) high severe
benchmarking DoubleV
mean: 407.7989 ns, lb 406.7614 ns, ub 409.2373 ns, ci 0.950
std dev: 6.238086 ns, lb 4.913598 ns, ub 9.148503 ns, ci 0.950
benchmarking IntV
mean: 401.6377 ns, lb 400.5134 ns, ub 402.9650 ns, ci 0.950
std dev: 6.253466 ns, lb 5.415961 ns, ub 7.395777 ns, ci 0.950
warming up
estimating clock resolution...
mean is 5.142457 us (160001 iterations)
found 269135 outliers among 159999 samples (168.2%)
132865 (83.0%) low severe
136270 (85.2%) high severe
estimating cost of a clock call...
mean is 46.25240 ns (35 iterations)
found 2 outliers among 35 samples (5.7%)
2 (5.7%) high severe
benchmarking DoubleV
mean: 183.1018 ns, lb 182.2975 ns, ub 184.4998 ns, ci 0.950
std dev: 5.297046 ns, lb 3.595863 ns, ub 8.951493 ns, ci 0.950
Compile details: ghc -O --make
Criterion Version: 0.5.1.1 (but cabal list shows installed versions as: 0.5.1.0, 0.5.1.1)
Compiler: GHC 7.0.3
OS: Mac Darwin Kernel Version 11.2.0, root:xnu-1699.24.8~1/RELEASE_X86_64 x86_64
StackOverflow post: http://stackoverflow.com/questions/8379191/forcing-evaluation-of-function-input-before-benchmarking-in-criterion (different title because I initially misunderstood the issue as problem of lazy evaluation of inputs)
Have a look at http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/ab6c10fc84/results/bench-foldl-and-iorefs-are-slow-llvm.html.
In the bar chart at the top, the first time is 5.02 ms.
In the details for the first benchmark, Mean execution time says 5.015 s.
Current outlier summary is a little confusing:
benchmarking Set/toList
mean: 180.5426 ns, lb 153.1244 ns, ub 217.2590 ns, ci 0.950
std dev: 511.4385 ns, lb 422.8353 ns, ub 804.1799 ns, ci 0.950
found *1771* outliers among 1000 samples (177.1%)
771 (77.1%) low severe
1000 (100.0%) high severe
variance introduced by outliers: 99.900%
variance is severely inflated by outliers
Obviously, criterion sums over low and high severity outliers, but is this really the expected behaviour?
Criterion/Analysis.hs:129:29:
Couldn't match type `v0 Double -> Double'
with `Statistics.Types.Estimator'
Expected type: [Statistics.Types.Estimator]
Actual type: [v0 Double -> Double]
In the second argument of `resample', namely `ests'
In the expression:
resample gen ests numResamples samples :: IO [Resample]
In the second argument of `($)', namely
`\ gen -> resample gen ests numResamples samples :: IO [Resample]'
Criterion/Analysis.hs:130:55:
Couldn't match type `v0 Double -> Double'
with `Statistics.Types.Estimator'
Expected type: [Statistics.Types.Estimator]
Actual type: [v0 Double -> Double]
In the third argument of `B.bootstrapBCA', namely `ests'
In the expression: B.bootstrapBCA ci samples ests resamples
In a pattern binding:
[estMean, estStdDev] = B.bootstrapBCA ci samples ests resamples
Adding < 0.11.0.0 to the statistics dependency in the cabal file fixes it.
I get strange results when benchmarking IO actions that don't do any actual IO. This issue usually arises when benchmarking streaming libraries like pipes and conduit when I try to test the raw overhead of the streaming machinery.
To illustrate what I mean, consider these two pipelines for pipes and conduit:
pipes, conduit :: (Monad m) => Int -> m ()
pipes n = runEffect $ for (each [1..n] >-> P.map (+1) >-> P.filter even) discard
conduit n = C.sourceList [1..n] $= C.map (+1) $= C.filter even $$ C.sinkNull
These don't actually do any IO, so they type-check in any monad. However, if I specialize them to the IO monad and benchmark them using nfIO, I get very different results from specializing them to the Identity monad and using whnf.
Here's the code I use to benchmark the two alternatives:
import Criterion.Main
import Data.Conduit
import qualified Data.Conduit.List as C
import Pipes
import qualified Pipes.Prelude as P
import Data.Functor.Identity
criterion :: Int -> IO ()
criterion n = defaultMain
[ bgroup "IO"
[ bench "pipes" $ nfIO (pipes n)
, bench "conduit" $ nfIO (conduit n)
]
, bgroup "Identity"
[ bench "pipes" $ whnf (runIdentity . pipes ) n
, bench "conduit" $ whnf (runIdentity . conduit) n
]
]
pipes, conduit :: (Monad m) => Int -> m ()
pipes n = runEffect $ for (each [1..n] >-> P.map (+1) >-> P.filter even) discard
conduit n = C.sourceList [1..n] $= C.map (+1) $= C.filter even $$ C.sinkNull
main = criterion (10^5)
... and here are the results. I want to highlight that the benchmark results for specializing to IO are very different from specializing to Identity. Usually the IO-based benchmarks are several orders of magnitude faster:
warming up
estimating clock resolution...
mean is 24.61508 ns (20480001 iterations)
found 390965 outliers among 20479999 samples (1.9%)
265634 (1.3%) high mild
125331 (0.6%) high severe
estimating cost of a clock call...
mean is 23.52013 ns (1 iterations)
benchmarking IO/pipes
mean: 71.52687 ns, lb 65.49627 ns, ub 91.83582 ns, ci 0.950
std dev: 50.86117 ns, lb 17.29679 ns, ub 113.8678 ns, ci 0.950
found 7 outliers among 100 samples (7.0%)
4 (4.0%) high mild
3 (3.0%) high severe
variance introduced by outliers: 98.970%
variance is severely inflated by outliers
benchmarking IO/conduit
mean: 1.327538 ms, lb 1.311253 ms, ub 1.353610 ms, ci 0.950
std dev: 103.8509 us, lb 73.29134 us, ub 144.2492 us, ci 0.950
found 11 outliers among 100 samples (11.0%)
2 (2.0%) high mild
9 (9.0%) high severe
variance introduced by outliers: 69.713%
variance is severely inflated by outliers
benchmarking Identity/pipes
mean: 3.579730 ms, lb 3.495507 ms, ub 3.642569 ms, ci 0.950
std dev: 371.3227 us, lb 294.6107 us, ub 451.7314 us, ci 0.950
found 35 outliers among 100 samples (35.0%)
22 (22.0%) low severe
6 (6.0%) high mild
7 (7.0%) high severe
variance introduced by outliers: 80.047%
variance is severely inflated by outliers
benchmarking Identity/conduit
mean: 16.22628 ms, lb 16.17274 ms, ub 16.30177 ms, ci 0.950
std dev: 322.3388 us, lb 245.0389 us, ub 416.8648 us, ci 0.950
found 8 outliers among 100 samples (8.0%)
7 (7.0%) high severe
variance introduced by outliers: 13.224%
variance is moderately inflated by outliers
Whenever I get a weird criterion result, I always compare to time. In this case, I timed the above program by just changing the main function to either:
main = pipes (10^8)
-- or:
main = conduit (10^8)
This gave the following results:
$ time ./bench # main = pipes (10^8)
real 0m2.405s
user 0m2.380s
sys 0m0.016s
...
$ time ./bench # main = conduit (10^8)
real 0m16.243s
user 0m16.161s
sys 0m0.048s
If you scale these down 1000x to match the criterion tests, they agree better with the Identity-based results. Also, common sense says that there is no way that pipes is chewing through 10^5 elements in 71 nanoseconds on my laptop, which is the other reason I don't believe the IO-based benchmarks.
So I already have a work-around for this (specialize to Identity instead of IO), but I thought you might be interested, because benchmarking IO-free IO actions seems to have triggered a weird corner case of criterion that could be worth investigating. This issue occurs with both criterion-0.8.0.0 and criterion-0.9.0.0, and occurs with the cfgPerformGC flag set to either True or False.
I'm also pinging @cartazio on this because I mentioned this issue to him previously and he was interested in example code for this.
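One plausible (unconfirmed) explanation for the impossibly fast IO numbers is sharing: `pipes n` is a constant expression, so GHC may evaluate the pure work hidden inside it once and reuse the result across iterations. The following base-only sketch (all names are hypothetical, not criterion internals) shows how a constant top-level thunk is evaluated only once, no matter how many times an action returning it is run:

```haskell
import Data.IORef
import System.IO.Unsafe (unsafePerformIO)

-- Counter recording how many times the "expensive" thunk is forced.
evalCount :: IORef Int
evalCount = unsafePerformIO (newIORef 0)
{-# NOINLINE evalCount #-}

-- A constant top-level thunk: its body runs at most once, no matter
-- how many times an action returning it is executed.
sharedResult :: Int
sharedResult =
  unsafePerformIO (modifyIORef' evalCount (+ 1)) `seq` sum [1 .. 1000]
{-# NOINLINE sharedResult #-}

-- Analogue of benchmarking a constant IO action such as 'pipes n'.
run :: IO Int
run = return sharedResult

main :: IO ()
main = do
  mapM_ (\_ -> run >>= (`seq` return ())) [1 .. 100 :: Int]
  readIORef evalCount >>= print  -- the work happened once, not 100 times
```

This is also consistent with the work-around: `whnf (runIdentity . pipes) n` benchmarks a function applied to an argument, which criterion re-applies on every iteration, so the work cannot be shared.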
Has it been considered to add a delay flag that sleeps between sample runs for a specified number of milliseconds? I can imagine a Haskell program that uses OS resources such as TCP connections or the like, which are not forcibly closed by the OS; this creates an issue when Haskell programs are immediately re-sampled by Criterion.
Users of criterion might want a --delay Xms flag. The delay would not be counted in any timings, but would let the Haskell code "cool down" and all OS resources be released, e.g. 2 seconds between runs.
Has a --delay Xms flag been considered?
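Absent such a flag, the behaviour can be approximated outside criterion. A minimal sketch (all names hypothetical; this is not a criterion API) that sleeps before each action, with the sleep falling outside anything that would be timed:

```haskell
import Control.Concurrent (threadDelay)

-- Convert a cool-down in milliseconds to threadDelay's microseconds.
cooldownMicros :: Int -> Int
cooldownMicros ms = ms * 1000

-- Run each named action, sleeping beforehand so resources (sockets,
-- file handles) from the previous run can be released by the OS.
withCooldown :: Int -> [(String, IO ())] -> IO ()
withCooldown delayMs = mapM_ step
  where
    step (name, act) = do
      threadDelay (cooldownMicros delayMs)  -- untimed cool-down
      putStrLn ("running " ++ name)
      act
```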
Currently criterion is only able to benchmark functions using wall-clock time. My experiments show that such measurements are quite sensitive to CPU load: execution time can easily double on a heavily loaded system. What's worse, such measurements are not reproducible. Every successive run can give a different answer, and the discrepancy cannot be explained by statistical fluctuations.
I did a quick experiment and replaced getPOSIXTime with getCPUTime. In that case the measurements did not depend on CPU load, as one would expect. So CPU time is a better performance metric for some functions, such as numeric ones.
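The experiment can be reproduced in a few lines (a sketch assuming the time package for getPOSIXTime; timeBoth and picosToSeconds are hypothetical helpers, not criterion code). Under external CPU load the wall-clock figure grows while the CPU-time figure stays roughly constant:

```haskell
import Control.Exception (evaluate)
import Data.Time.Clock.POSIX (getPOSIXTime)  -- from the 'time' package
import System.CPUTime (getCPUTime)           -- base; picosecond units

-- getCPUTime reports picoseconds; convert to seconds.
picosToSeconds :: Integer -> Double
picosToSeconds p = fromIntegral p / 1e12

-- Measure one action with both clocks, returning (wall, cpu) seconds.
timeBoth :: IO a -> IO (Double, Double)
timeBoth act = do
  w0 <- getPOSIXTime
  c0 <- getCPUTime
  _  <- act
  c1 <- getCPUTime
  w1 <- getPOSIXTime
  return (realToFrac (w1 - w0), picosToSeconds (c1 - c0))

main :: IO ()
main = do
  (wall, cpu) <- timeBoth (evaluate (sum [1 .. 1000000 :: Int]))
  putStrLn ("wall: " ++ show wall ++ " s, cpu: " ++ show cpu ++ " s")
```

Note that CPU time has much coarser resolution on some platforms and ignores time the benchmark spends blocked, so it suits pure numeric code better than IO-heavy code.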
Right now the format of the summary file is hard-coded in the function defaultMainWith, as far as I can tell. It would be wonderful to have the ability to select only certain numbers to be written to the summary file.
A complete log file is available at http://hydra.cryp.to/build/45829/nixlog/2/raw. I guess this issue is triggered by the latest version of unix-bytestring.
Building criterion-0.8.0.0...
Preprocessing library criterion-0.8.0.0...
[ 1 of 14] Compiling Paths_criterion ( dist/dist-sandbox-be38e152/build/autogen/Paths_criterion.hs, dist/dist-sandbox-be38e152/build/Paths_criterion.o )
[ 2 of 14] Compiling Criterion.Config ( Criterion/Config.hs, dist/dist-sandbox-be38e152/build/Criterion/Config.o )
[ 3 of 14] Compiling Criterion.Monad ( Criterion/Monad.hs, dist/dist-sandbox-be38e152/build/Criterion/Monad.o )
[ 4 of 14] Compiling Criterion.Measurement ( Criterion/Measurement.hs, dist/dist-sandbox-be38e152/build/Criterion/Measurement.o )
[ 5 of 14] Compiling Criterion.IO.Printf ( Criterion/IO/Printf.hs, dist/dist-sandbox-be38e152/build/Criterion/IO/Printf.o )
[ 6 of 14] Compiling Criterion.Analysis.Types ( Criterion/Analysis/Types.hs, dist/dist-sandbox-be38e152/build/Criterion/Analysis/Types.o )
[ 7 of 14] Compiling Criterion.Analysis ( Criterion/Analysis.hs, dist/dist-sandbox-be38e152/build/Criterion/Analysis.o )
Criterion/Analysis.hs:127:15:
No instance for (Data.Vector.Generic.Base.Vector v0 Double)
arising from a use of `mean'
Possible fix:
add an instance declaration for
(Data.Vector.Generic.Base.Vector v0 Double)
In the expression: mean
In the expression: [mean, stdDev]
In an equation for `ests': ests = [mean, stdDev]
Criterion/Analysis.hs:129:25:
Couldn't match type `primitive-0.5.2.1:Control.Monad.Primitive.PrimState
m0'
with `GHC.Prim.RealWorld'
Expected type: primitive-0.5.2.1:Control.Monad.Primitive.PrimState
IO
Actual type: primitive-0.5.2.1:Control.Monad.Primitive.PrimState
m0
Expected type: System.Random.MWC.Gen
(primitive-0.5.2.1:Control.Monad.Primitive.PrimState IO)
Actual type: System.Random.MWC.Gen
(primitive-0.5.2.1:Control.Monad.Primitive.PrimState m0)
In the first argument of `resample', namely `gen'
In the expression:
resample gen ests numResamples samples :: IO [Resample]
Criterion/Analysis.hs:129:29:
Couldn't match expected type `Statistics.Types.Estimator'
with actual type `v0 Double -> Double'
Expected type: [Statistics.Types.Estimator]
Actual type: [v0 Double -> Double]
In the second argument of `resample', namely `ests'
In the expression:
resample gen ests numResamples samples :: IO [Resample]
Obviously, based on some combination of my project's dependencies and the ones you've specified in criterion, cabal has chosen what it thinks is a valid build plan -- probably you're accepting a version of one of your dependencies that you don't actually support.
As reported at http://bugs.debian.org/736440 the criterion release contains these files without their corresponding source code
templates/js/excanvas-r3.min.js
templates/js/jquery-1.6.4.min.js
templates/js/jquery.flot-0.7.min.js
This causes problems for downstream distributions.
Could you please make a release of criterion that includes the corresponding source (i.e. non-minified, non-packed) files?
Bonus points for ensuring that the minified files are really derived from the source files, for example by adding a small Makefile and building them yourself, which would also serve as documentation about the tool used to create these files.
Thanks,
Joachim
The error is (in both GHC 7.6.3 and 7.8.3):
Criterion/Main.hs:28:7:
The export item `Benchmarkable(run)'
attempts to export constructors or class methods that are not visible here
The error was introduced in commit 925f3f2. Benchmarkable now seems to be a newtype instead of a class. I don't know how you want to fix it. I guess you could export run as a newtype unwrapper, but you may prefer another fix.
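The "newtype unwrapper" fix could look like this minimal sketch (the field name and type here are assumptions for illustration, not the actual criterion source). Declaring the newtype with a record selector and exporting `Benchmarkable(..)` restores something close to the old `Benchmarkable(run)` export item:

```haskell
-- Field name and argument type are assumptions, not criterion's
-- actual definition.
newtype Benchmarkable = Benchmarkable { run :: Int -> IO () }

-- A trivial benchmark body showing that 'run' unwraps the newtype
-- much like the old class method was used.
noop :: Benchmarkable
noop = Benchmarkable (\_iters -> return ())
```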
It would be nice if Criterion.Main-generated programs had an option for condensed command-line output. Often I am only interested in the mean values, and having one line per benchmark with just that number would be easier to read.
(Related to #17, but I'd like this output to be available by default, as it is especially the quick-shot benchmarks that benefit from a small overview.)
It seems reader was removed in the latest version of optparse-applicative; it works with 0.9.1.1.
Hey,
there is a brand-new hastache version 0.5.0 which breaks criterion. There is no upper bound in criterion.cabal, so this affects all standard builds.
Criterion/Report.hs:94:42:
Couldn't match kind `*' against `* -> *'
Kind incompatibility when matching types:
m0 :: * -> *
MuType :: (* -> *) -> *
In the expression:
mkGenericContext reportAnalysis $ H.encodeStr nym
In a case alternative:
('a' : 'n' : _)
-> mkGenericContext reportAnalysis $ H.encodeStr nym
Criterion/Report.hs:99:42:
Couldn't match expected type `IO (MuType IO)'
with actual type `MuType m0'
Expected type: H.MuContext IO
Actual type: B.ByteString -> MuType m0
In the third argument of `H.hastacheStr', namely `context'
In a stmt of a 'do' block:
H.hastacheStr H.defaultConfig template context
Greetings, Alex
The following code compares two functions for summing over a vector:
{-# LANGUAGE BangPatterns #-}
import Control.DeepSeq
import Criterion
import Criterion.Main
import Data.Vector.Unboxed as VU
import Data.Vector.Generic as VG
import qualified Data.Vector.Fusion.Stream as Stream
sumV :: (VG.Vector v a, Num a) => v a -> a
sumV = Stream.foldl' (+) 0 . VG.stream
main = do
let v10 = VU.fromList [0..9] :: VU.Vector Double
deepseq v10 $ return ()
defaultMain
[ bench "sumV" $ nf sumV v10
]
But suppose I change the last few lines to the following:
defaultMain
[ bench "sumV" $ nf sumV v10
, bench "VU.sum" $ nf VU.sum v10 -- Added this line
]
This, surprisingly, affects the runtime of the sumV benchmark: it makes it about 20% faster. Similarly, if we remove the sumV benchmark and leave the VU.sum benchmark, the VU.sum benchmark becomes about 20% slower. Tests were run with the patched criterion-1.0.0.2 I sent, on ghc-7.8.3 with the -O2 -fllvm flags.
What's going on is that different core is generated for the sumV and VU.sum benchmarks depending on whether the other benchmark is present. Essentially, the common bits are being factored out and placed in a function, and this function gets called in both benchmarks. This happens to make both benchmarks faster.
I'm not sure if this should be considered a "proper bug," but it confused me for an hour or so. It's something that criterion users (especially those performing really small benchmarks) probably should be aware of.
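One mitigation worth trying (a suggestion under the assumption that common-subexpression sharing is the culprit, not a verified fix) is to pin each benchmarked function behind its own NOINLINE top-level binding, which discourages GHC from generating different core for a function depending on which other benchmarks are present:

```haskell
import Data.List (foldl')

-- Each benchmarked function gets its own NOINLINE wrapper, so its
-- code is compiled once, independently of the benchmark list around it.
sumStrict :: [Double] -> Double
sumStrict = foldl' (+) 0
{-# NOINLINE sumStrict #-}

sumLazy :: [Double] -> Double
sumLazy = sum
{-# NOINLINE sumLazy #-}
```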
When benchmarking threading-intensive applications and libraries, I've started seeing this error crop up across multiple projects.
It is nondeterministic, but it seems to be related to the fact that each project passes the following options:
--regress=allocated:iters --regress=bytesCopied:iters --regress=cycles:iters \
--regress=numGcs:iters --regress=mutatorWallSeconds:iters --regress=gcWallSeconds:iters \
--regress=cpuTime:iters --raw report.criterion -o report.html
I suspect these options because I have not been able to reproduce it in a (minimal) run with no options. Nor have I (yet) been able to pin down, by process of elimination, which of these options introduces the bug.
I haven't yet tried to reproduce it in profiling mode to get a backtrace...
When attempting to collect benchmarking data, it occasionally happens that system noise will make a few of the iterations take much longer than other iterations. If one iteration out of a thousand does this and is ten times slower than the rest, this inflates the average by 10% and can cause havoc with the standard deviation. Even if only one iteration out of a hundred does this, then the average and standard deviation are useless even though the data is meaningful if you exclude these outliers.
I would like the ability to instruct Criterion to omit such outliers from its calculations. Even something as simple as removing the best and worst 10% of samples would often be sufficient.
Of course this could be abused (e.g., the standard deviation doesn't mean quite so much if you remove the best and worst 49% of samples), but for benchmarking things like CPU times of code that does no IO or external calculation it can be very useful as benchmark numbers often need to reflect the performance of the code being benchmarked instead of whatever system noise happened to randomly kick in.
Possible extensions of this idea include reporting the median and/or mode. Another possibility is to fit the sampling data to some sort of distribution that is flexible enough to account for system noise (e.g. a Poisson, bimodal(*), or mixture-based distribution) and then report the parameters of that distribution (e.g., the location of the peak, or peaks, rather than just the mean).
(*) The second peak represents when the system noise kicks up.
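The simplest variant proposed above, dropping the best and worst fraction of samples, can be sketched in a few lines (a symmetric trimmed mean; trimmedMean is a hypothetical name, not a criterion function):

```haskell
import Data.List (sort)

-- Mean of the samples remaining after discarding the smallest and
-- largest 'frac' of them (e.g. frac = 0.1 drops 10% from each end).
trimmedMean :: Double -> [Double] -> Double
trimmedMean frac xs
  | null kept = 0 / 0  -- NaN for empty input
  | otherwise = sum kept / fromIntegral (length kept)
  where
    sorted = sort xs
    n      = length xs
    k      = floor (frac * fromIntegral n)   -- samples dropped per side
    kept   = take (n - 2 * k) (drop k sorted)
```

For the scenario described above, one sample out of a hundred that is ten times slower than the rest is excluded entirely by even a small trim fraction, whereas the plain mean is inflated by it.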