haskell / criterion
A powerful but simple library for measuring the performance of Haskell code.
Home Page: http://www.serpentine.com/criterion
License: BSD 2-Clause "Simplified" License
I think there is an error in criterion's data analysis. Here is a simplified description of the algorithm as I understand it. For simplicity I will ignore the discreteness of the clock.
The problem is that the clock call cost is a measured quantity, and this measurement has an error (σ(t) from now on) which is never taken into account. This error corresponds to a shift of the timing distribution as a whole and cannot be eliminated: no averaging procedure can detect such a shift. So the error of a benchmark cannot be less than σ(t). It shouldn't be significant for long-running functions, but it is significant for functions which take about the same time or less to complete than getPOSIXTime itself. mwc-random's benchmarks should be affected.
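To make σ(t) concrete, here is a tiny base-only sketch (the sample values are invented, and the mean/stdDev helpers are illustration only, not criterion code): it estimates the clock-call cost and its spread. Since the whole timing distribution shifts by the clock-cost estimate, that spread lower-bounds the benchmark error, as argued above.

```haskell
import Data.List (genericLength)

-- Sample mean and (population) standard deviation, for illustration only.
mean :: [Double] -> Double
mean xs = sum xs / genericLength xs

stdDev :: [Double] -> Double
stdDev xs = sqrt (mean [ (x - m) ^ (2 :: Int) | x <- xs ])
  where m = mean xs

-- Hypothetical clock-call cost samples in nanoseconds. The estimate of the
-- clock cost has spread sigma(t); subtracting its mean from every benchmark
-- timing shifts the whole distribution, so averaging more benchmark samples
-- can never push the benchmark's error below sigma(t).
clockCost :: [Double]
clockCost = [120, 131, 118, 140, 125]

main :: IO ()
main = do
  putStrLn ("estimated clock cost: " ++ show (mean clockCost) ++ " ns")
  putStrLn ("sigma(t):             " ++ show (stdDev clockCost) ++ " ns")
```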
"Small changes to a program or its execution environment can perturb its layout, which affects caches and branch predictors. The impact of these layout changes is unpredictable and substantial: Mytkowicz et al. show that just changing the size of environment variables can trigger performance degradation as high as 300%; we find that simply changing the link order of object files can cause performance to decrease by as much as 57%. Failure to control for layout is a form of measurement bias. All executions constitute just one sample from the vast space of possible memory layouts. This limited sampling makes statistical tests inapplicable, since they depend on multiple samples over a space, often with a known distribution. As a result, it is currently not possible to test whether a code modification is the direct cause of any observed performance change, or if it is due to incidental effects like a different code, stack, or heap layout. A random memory layout eliminates the effect of layout on performance, and repeated randomization leads to normally-distributed execution times. This makes it straightforward to use standard statistical tests for performance evaluation." (courtesy of the paper at http://www.stabilizer-tool.org/)
In http://www.serpentine.com/criterion/tutorial.html,
the following paragraph contains an anchor tag without a meaningful target (for the phrase "normal form"):
We use nfIO to specify that after we run the IO action, its result must be evaluated
to normal form, i.e. so that all of its internal constructors are fully evaluated, and it
contains no thunks.
Did you mean it to point here: http://www.haskell.org/haskellwiki/Weak_head_normal_form ?
The documentation in Criterion.Main says, in the "Benchmarking pure code" part:
The first is a function which will cause results to be evaluated to head normal form (NF):
nf :: NFData b => (a -> b) -> a -> Pure
Should this say "normal form" instead of "head normal form"? Maybe I have misunderstood the terms.
When a comparison CSV file is produced, there doesn't seem to be any escaping done to the benchmark names.
(e.g. quoting fields that contain commas, or escaping literal quotation marks)
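A minimal RFC 4180-style field escaper shows what the fix could look like (a sketch, not criterion's code; the function name is hypothetical):

```haskell
-- A minimal RFC 4180-style field escaper: quote any field containing a
-- comma, quote, or line break, and double literal quotes inside it.
-- Criterion would apply something like this to benchmark names before
-- writing each CSV row.
escapeCsvField :: String -> String
escapeCsvField s
  | any (`elem` ",\"\n\r") s = '"' : concatMap doubleQuote s ++ "\""
  | otherwise                = s
  where
    doubleQuote '"' = "\"\""   -- literal quotes are doubled inside a quoted field
    doubleQuote c   = [c]

main :: IO ()
main = do
  putStrLn (escapeCsvField "plain name")
  putStrLn (escapeCsvField "name, with comma")
  putStrLn (escapeCsvField "say \"hi\"")
```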
Criterion build is failing on Mac (but not on Linux).
System details first:
$ which ghc
/Volumes/Data/scripts/ghc/7.6.1/bin/ghc
$ uname -a
Darwin desktop.local 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64
I get the error below when trying to do cabal install:
Building criterion-0.6.1.1...
[ 1 of 12] Compiling Paths_criterion ( dist/build/autogen/Paths_criterion.hs, dist/build/Paths_criterion.o )
dist/build/autogen/Paths_criterion.hs:21:13: Not in scope: `catch'
dist/build/autogen/Paths_criterion.hs:22:13: Not in scope: `catch'
dist/build/autogen/Paths_criterion.hs:23:14: Not in scope: `catch'
dist/build/autogen/Paths_criterion.hs:24:17: Not in scope: `catch'
cabal: Error: some packages failed to install:
criterion-0.6.1.1 failed during the building phase. The exception was:
ExitFailure 1
I get the following error in my criterion code
Exception inside child thread "(worker 0 of originator ThreadId 3)", ThreadId 7: thread blocked indefinitely in an MVar operation
benchmarks: thread blocked indefinitely in an MVar operation
Here's a simplified piece of code that replicates this error:
{-# LANGUAGE BangPatterns #-}
module Main where
import Criterion.Main
import Criterion.Config
import Data.Word
import Data.Foldable
whnfIter :: Int -> (a -> a) -> a -> Pure
whnfIter cnt f arg = whnf (\v -> foldl' (\a _ -> f a) v [0 .. cnt]) arg
main = defaultMainWith defaultConfig { cfgSamples = ljust 10 } (return ())
  [ bgroup "Morton Z"
      [ bcompare
          [ bench "addingNumbersIter1000" $! whnfIter 1000 ((7 +) :: Word -> Word) 9
          ]
      ]
  ]
Here are the GHC options I used; most of them are probably irrelevant, but I'm including them for completeness:
ghc-options: -O2 -optlo "-O3" -fllvm -optlc "-O3" -optlo "-std-compile-opts" -optlo "-bb-vectorize" -fllvm-tbaa -optlo "-regalloc=pbqp" -rtsopts -threaded -with-rtsopts=-N2
Preprocessing test suite 'tests' for criterion-1.0.0.2...
tests/Tests.hs:5:8:
Could not find module ‘Properties’
It would be nice to have error bars on summary plots so one could easily see whether a difference could be explained by statistical fluctuations. Sadly, flot lacks built-in error bars, but there is a plugin for it. Another possible approach is to use overlapping bar plots a la:
-----------
| | |
-----------
It seems statistics 0.10 moved Statistics.KernelDensity to Statistics.Sample.KernelDensity, which causes the criterion build to fail. You need to set an upper limit on the statistics version.
On Windows, getPOSIXTime uses the C function GetSystemTimeAsFileTime, which gives low-resolution timings. As observed in issue #11, it will return the same value many times in a row. criterion therefore produces low-quality statistics under Windows. QueryPerformanceCounter is the appropriate timing source for benchmarks on Windows; we can access it with Clock.getTime Clock.Monotonic from the clock package. See example data below.
My patch, conklech/criterion@b6f657c, fixes the problem on Windows, but might create compatibility problems on some POSIX systems.
In issue #2, @coreyoconnor gave a similar patch, coreyoconnor/criterion@37b1216, which uses Clock.ProcessCPUTime instead. That has different semantics, which was the goal there. But it's not clear that Clock.ProcessCPUTime gives benchmark-grade data on Windows. It uses GetProcessTimes under the hood; MSDN doesn't explain where the data comes from, and I haven't tested it.
Under POSIX, time calls gettimeofday and clock calls clock_gettime. Cursory Google research indicates that clock_gettime with CLOCK_MONOTONIC is the proper benchmark timesource; gettimeofday can go backwards, for instance when ntp is adjusting the clock.
However, it looks like some systems don't implement CLOCK_MONOTONIC. Perhaps just OS X? I don't know. Someone with better POSIX knowledge should look into compatibility issues with CLOCK_MONOTONIC in non-Win32 environments.
On my Windows 7 laptop, criterion using time (i.e. criterion-0.6.2.1) reports an estimated resolution of about 3.2 msec, with the weird 200% outlier rate reported by others. However, the final report data shows that all measurements share the same two or three values:
After dropping in clock and using Clock.Monotonic, the resolution is reported as 4.5 msec but the outlier rate problem is gone, and the final report shows an actual distribution of data:
Same problem as in haskell/statistics#52.
I don't know if criterion is supposed to work on Windows, but it doesn't seem to work. The clock estimation gets more outliers than samples. Impressive. :)
Benchmark program:
import Criterion.Main
main = defaultMain [bench "test" $ whnf succ 0]
Running this you get:
$ ghc --make Crit.hs
[1 of 1] Compiling Main ( Crit.hs, Crit.o )
Linking Crit.exe ...
Lennart@Lennart-Work /workspace/Proj
$ ./Crit
Warning: Couldn't open /dev/urandom
Warning: using system clock for seed instead (quality will be lower)
warming up
estimating clock resolution...
mean is 1.501650 us (640001 iterations)
found 1279037 outliers among 639999 samples (199.8%)
639038 (99.8%) low severe
639999 (100.0%) high severe
estimating cost of a clock call...
mean is 125.8999 ns (14 iterations)
found 1 outliers among 14 samples (7.1%)
1 (7.1%) high severe
benchmarking test
mean: 11.38051 ns, lb 10.87220 ns, ub 12.19332 ns, ci 0.950
std dev: 3.319025 ns, lb 2.605880 ns, ub 4.085354 ns, ci 0.950
found 12 outliers among 100 samples (12.0%)
12 (12.0%) high severe
variance introduced by outliers: 97.824%
variance is severely inflated by outliers
If a computation runs quite long, say, longer than 0.5s, then I get a result like:
OLS regression 544 ms 570 ms 0 s
R² goodness-of-fit 0.999 0.999 xxx
in the HTML report and then all following regression results are "xxx", too. This also implies that the overview diagram is not generated.
[10 of 12] Compiling Criterion.Report ( Criterion/Report.hs, dist/build/Criterion/Report.o )
Criterion/Report.hs:73:25:
No instance for (MonadIO Criterion)
arising from a use of `liftIO'
Possible fix: add an instance declaration for (MonadIO Criterion)
In the expression: liftIO
In the expression:
liftIO
$ do { tpl <- loadTemplate
[".", templateDir] (fromLJ cfgTemplate cfg);
L.writeFile name =<< formatReport reports tpl }
In a case alternative:
Last (Just name)
-> liftIO
$ do { tpl <- loadTemplate [".", ....] (fromLJ cfgTemplate cfg);
L.writeFile name =<< formatReport reports tpl }
cabal: Error: some packages failed to install:
criterion-0.6.0.0 failed during the building phase. The exception was:
ExitFailure 1
Here are my currently installed packages:
% ghc-pkg list
/usr/lib/ghc-7.0.4/package.conf.d
Cabal-1.10.2.0
HTTP-4000.1.1
HUnit-1.2.0.3
MissingH-1.1.0.3
X11-1.5.0.0
X11-xft-0.3
array-0.3.0.2
base-4.3.1.0
bin-package-db-0.0.0.0
binary-0.5.0.2
bytestring-0.9.1.10
containers-0.4.0.0
dataenc-0.14
directory-1.1.0.0
extensible-exceptions-0.1.1.2
ffi-1.0
filepath-1.2.0.0
ghc-7.0.4
ghc-binary-0.5.0.2
ghc-paths-0.1.0.8
ghc-prim-0.2.0.0
haddock-2.9.2
hashed-storage-0.4.13
haskeline-0.6.4.0
haskell2010-1.0.0.0
haskell98-1.1.0.1
hpc-0.5.0.6
hslogger-1.1.4
html-1.0.1.2
integer-gmp-0.2.0.3
mmap-0.4.1
mtl-1.1.1.1
network-2.2.1.7
old-locale-1.0.0.2
old-time-1.0.0.6
parsec-2.1.0.1
pretty-1.0.1.2
process-1.0.1.5
random-1.0.0.3
regex-base-0.93.1
regex-compat-0.92
regex-posix-0.94.1
rts-1.0
stm-2.1.2.2
syb-0.3.2
template-haskell-2.5.0.0
terminfo-0.3.1.3
time-1.1.2.0
time-1.2.0.3
unix-2.4.2.0
utf8-string-0.3.6
xhtml-3000.2.0.1
xmonad-0.9.2
xmonad-contrib-0.9.2
zlib-0.5.2.0
/home/ollie/.ghc/i386-linux-7.0.4/package.conf.d
aeson-0.5.0.0
asn1-data-0.6.1.2
attoparsec-0.10.1.0
attoparsec-conduit-0.0.1
attoparsec-enumerator-0.3
base-unicode-symbols-0.2.2.3
base64-bytestring-0.1.1.0
blaze-builder-0.3.0.2
blaze-builder-conduit-0.0.1
case-insensitive-0.4.0.1
cereal-0.3.5.1
certificate-1.0.1
cmdargs-0.9.2
conduit-0.1.1
cprng-aes-0.2.3
crypto-api-0.8
crypto-pubkey-types-0.1.0
cryptocipher-0.3.0
cryptohash-0.7.4
data-default-0.3.0
deepseq-1.2.0.1
dlist-0.5
double-conversion-0.2.0.4
entropy-0.2.1
enumerator-0.4.18
erf-2.0.0.0
failure-0.2.0
hashable-1.1.2.2
hastache-0.2.4
http-conduit-1.2.0
http-types-0.6.8
largeword-1.0.1
lifted-base-0.1.0.2
math-functions-0.1.1.0
monad-control-0.3.1
monad-par-0.1.0.3
mwc-random-0.11.0.0
network-2.3.0.8
parsec-3.1.2
primitive-0.4.1
semigroups-0.8
statistics-0.10.1.0
tagged-0.2.3.1
text-0.11.1.12
text-format-0.3.0.7
tls-0.8.5
tls-extra-0.4.2
transformers-0.2.2.0
transformers-base-0.4.1
unix-compat-0.3.0.1
unordered-containers-0.1.4.6
vector-0.9.1
vector-algorithms-0.5.3
zlib-bindings-0.0.2
zlib-conduit-0.0.1
And my GHC information:
% ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.0.4
Error is:
Preprocessing library criterion-0.8.0.0...
[ 1 of 14] Compiling Paths_criterion ( dist_prof/build/autogen/Paths_criterion.hs, dist_prof/build/Paths_criterion.o )
[ 2 of 14] Compiling Criterion.Config ( Criterion/Config.hs, dist_prof/build/Criterion/Config.o )
[ 3 of 14] Compiling Criterion.Monad ( Criterion/Monad.hs, dist_prof/build/Criterion/Monad.o )
[ 4 of 14] Compiling Criterion.Measurement ( Criterion/Measurement.hs, dist_prof/build/Criterion/Measurement.o )
[ 5 of 14] Compiling Criterion.IO.Printf ( Criterion/IO/Printf.hs, dist_prof/build/Criterion/IO/Printf.o )
[ 6 of 14] Compiling Criterion.Analysis.Types ( Criterion/Analysis/Types.hs, dist_prof/build/Criterion/Analysis/Types.o )
[ 7 of 14] Compiling Criterion.Analysis ( Criterion/Analysis.hs, dist_prof/build/Criterion/Analysis.o )
Criterion/Analysis.hs:129:29:
Couldn't match type `v0 Double -> Double'
with `Statistics.Types.Estimator'
Expected type: [Statistics.Types.Estimator]
Actual type: [v0 Double -> Double]
In the second argument of `resample', namely `ests'
In the expression:
resample gen ests numResamples samples :: IO [Resample]
In the second argument of `($)', namely
`\ gen -> resample gen ests numResamples samples :: IO [Resample]'
Criterion/Analysis.hs:130:55:
Couldn't match type `v0 Double -> Double'
with `Statistics.Types.Estimator'
Expected type: [Statistics.Types.Estimator]
Actual type: [v0 Double -> Double]
In the third argument of `B.bootstrapBCA', namely `ests'
In the expression: B.bootstrapBCA ci samples ests resamples
In a pattern binding:
[estMean, estStdDev] = B.bootstrapBCA ci samples ests resamples
Failed to install criterion-0.8.0.0
Here's an example of splitting the top-level chart vs. the current one:
https://bitbucket.org/carter/multicriterion-templates/src/b1c1b808b71c?at=master is the current repo for this.
I'm not necessarily saying "merge this in", but it might at least be worth having a centrally curated "plugins.md" / "addons.md" / "moretools.md" list or something
Criterion-0.6.2.0 fails to build with ghc-7.6.1 due to the removal of Prelude.catch in base-4.6.0.0. In particular, this causes the compilation of Paths_criterion to fail.
I'm not sure what would cause this -- how could it be getting more outliers than there are samples?
warming up
estimating clock resolution...
mean is 5.257008 us (160001 iterations)
found 243504 outliers among 159999 samples (152.2%)
120537 (75.3%) low severe
122967 (76.9%) high severe
estimating cost of a clock call...
mean is 31.80684 ns (52 iterations)
found 6 outliers among 52 samples (11.5%)
4 (7.7%) high mild
2 (3.8%) high severe
I was doing some testing where I'm acquiring and consuming resources - popping rows off of a database. I found that criterion was consuming way more rows than I had expected.
So I tried another IO process with the code below:
import Control.Monad
import Criterion
import Criterion.Main
import Data.IORef (newIORef, readIORef, modifyIORef')
cri :: IO Int
cri = do
ref <- newIORef 0
defaultMain
[ bench "ioref" $ nfIO $ modifyIORef' ref (+1)
]
readIORef ref
noncri :: Int -> IO Int
noncri n = do
ref <- newIORef 0
replicateM_ n $ modifyIORef' ref (+1)
readIORef ref
main :: IO ()
main = do
cri >>= print
noncri 100 >>= print
And I got this:
warming up
estimating clock resolution...
mean is 1.308630 us (640001 iterations)
found 3253 outliers among 639999 samples (0.5%)
3041 (0.5%) high severe
estimating cost of a clock call...
mean is 49.82543 ns (9 iterations)
found 2 outliers among 9 samples (22.2%)
2 (22.2%) high severe
benchmarking ioref
mean: 151.7543 ns, lb 150.2320 ns, ub 153.5662 ns, ci 0.950
std dev: 8.485390 ns, lb 7.286958 ns, ub 9.796859 ns, ci 0.950
found 9 outliers among 100 samples (9.0%)
9 (9.0%) high mild
variance introduced by outliers: 53.488%
variance is severely inflated by outliers
927667
100
Why would the run hit the IORef nearly a million times for a 100-sample config? Or is this a spurious way to use criterion?
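For what it's worth, here is a toy model of why a run configured for 100 samples can execute the action hundreds of thousands of times (this is not criterion's actual code; in particular, the factor of 1000 relating sample duration to clock resolution is an assumption):

```haskell
-- Each *sample* runs the action k times, where k is scaled so that one
-- sample lasts long enough relative to the estimated clock resolution to
-- be measurable; the total run count is samples * k.
itersPerSample :: Double -> Double -> Int
itersPerSample clockResolution costPerRun =
  ceiling (clockResolution * 1000 / costPerRun)  -- assumed scaling factor

totalRuns :: Int -> Double -> Double -> Int
totalRuns samples res cost = samples * itersPerSample res cost

main :: IO ()
main =
  -- ~1.3 us clock resolution and ~150 ns per modifyIORef', roughly as in
  -- the output above:
  print (totalRuns 100 1.3e-6 1.5e-7)
```

With those inputs the model predicts 866,700 runs, the same order of magnitude as the 927,667 observed, so the count is plausibly just iterations-per-sample at work rather than a bug.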
It appears Config.fromLJ will not terminate when called to extract a field which was left mempty in the defaultConfig.
For example:
> fromLJ cfgResults defaultConfig
"^CInterrupted.
cc @mboes
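A toy reconstruction of how such a loop can arise (a guess at the pattern; the names mirror Criterion.Config but this is not the actual source):

```haskell
import Data.Monoid (Last(..))

-- Toy config with one "last-write-wins" field, mirroring criterion's style.
data Config = Config { cfgBanner :: Last String }

defaultConfig :: Config
defaultConfig = Config { cfgBanner = mempty }

-- "from Last, Just": fall back to defaultConfig when the field is unset.
-- If the field is *also* mempty in defaultConfig, this recurses forever,
-- which would explain the observed non-termination.
fromLJ :: (Config -> Last a) -> Config -> a
fromLJ f cfg = case f cfg of
  Last (Just x) -> x
  Last Nothing  -> fromLJ f defaultConfig

main :: IO ()
main = putStrLn (fromLJ cfgBanner (Config (Last (Just "set"))))
-- fromLJ cfgBanner defaultConfig, by contrast, never terminates.
```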
I am currently trying to profile some code that is using the async package. It seems there is a problem with it, because directly after the first test using async the program crashes:
/tmp/criterion19204.dat: hPutBuf: illegal operation (handle is closed)
This is the error when running compiled with -threaded but without +RTS -N -RTS. When I enable the threaded runtime I get:
Exception inside child thread "(worker 1 of originator ThreadId 3)", ThreadId 14: thread blocked indefinitely in an MVar operation
[several of them]
Exception inside child thread "(worker 6 of originator ThreadId 3)", ThreadId 19: thread blocked indefinitely in an MVar operation
Is there anything known about this?
Subj.
The issue is that cassava's encode function now expects a list instead of a vector. The encode function from cassava-0.2.2.0 had type:
encode :: ToRecord a => Vector a -> ByteString
In cassava-0.3.0.0 it now has type:
encode :: ToRecord a => [a] -> ByteString
This causes problems when building Criterion.IO.Printf (and possibly other modules, but this is as far as I got):
Criterion/IO/Printf.hs:102:56:
Could not deduce (G.Vector [] a)
arising from a use of `G.singleton'
from the context (Csv.ToRecord a)
bound by the type signature for
writeCsv :: Csv.ToRecord a => a -> Criterion ()
at Criterion/IO/Printf.hs:(99,1)-(103,24)
Possible fix:
add (G.Vector [] a) to the context of
the type signature for
writeCsv :: Csv.ToRecord a => a -> Criterion ()
or add an instance declaration for (G.Vector [] a)
In the second argument of `(.)', namely `G.singleton'
In the second argument of `(.)', namely `Csv.encode . G.singleton'
In the second argument of `(.)', namely
`B.appendFile fn . Csv.encode . G.singleton'
I'd issue a pull request, but I wasn't sure if you wanted to immediately switch to cassava-0.3.0.0, still support cassava-0.2.2.0, or do some CPP magic to support both.
Benchmarking in IO often requires phases of initialization and cleanup, or resource acquisition and release, before and after the execution of each sample. Currently it is impossible to exclude those phases from the measurement.
To provide a flexible solution to this problem, I suggest updating the interface of the Benchmarkable class to the following:
class Benchmarkable a where
run :: a -> Int -> StartTimer -> StopTimer -> IO ()
type StartTimer = IO ()
type StopTimer = IO ()
After also adding the following:
newtype IOBenchmark = IOBenchmark (StartTimer -> StopTimer -> IO ())
instance Benchmarkable IOBenchmark where
  run (IOBenchmark f) n start stop = replicateM_ n (f start stop)
the user will finally be able to apply it like so:
main = defaultMain
[
bench "My wonderful benchmark" $ IOBenchmark $ \startTimer stopTimer -> do
db <- establishADBConnection
startTimer -- !
workWithDB db
stopTimer -- !
closeDBConnection db
cleanUp
]
One can imagine some more involved scenarios, where the timer starts and stops multiple times during execution of just a single sample.
Another important benefit of the suggested approach is that it will support monad transformers:
main = defaultMain
[
bench "My wonderful benchmark" $ IOBenchmark $ \startTimer stopTimer -> do
runDBConnectionT $ do
liftIO $ startTimer
work
liftIO $ stopTimer
]
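To sanity-check that the proposed signature is implementable, here is a self-contained toy harness (the withUserTimers name and the GHC.Clock-based timer are my own additions for illustration, not part of the proposal or of criterion):

```haskell
import Data.IORef (newIORef, readIORef, writeIORef, modifyIORef')
import Data.Word (Word64)
import GHC.Clock (getMonotonicTimeNSec)  -- base >= 4.11

type StartTimer = IO ()
type StopTimer  = IO ()

-- Toy harness for the proposed interface: the timers bracket only the
-- region the caller marks, so setup/teardown is excluded from the total.
withUserTimers :: (StartTimer -> StopTimer -> IO ()) -> IO Word64
withUserTimers body = do
  acc  <- newIORef 0   -- accumulated measured nanoseconds
  mark <- newIORef 0   -- timestamp recorded by the last startTimer
  let start = getMonotonicTimeNSec >>= writeIORef mark
      stop  = do t1 <- getMonotonicTimeNSec
                 t0 <- readIORef mark
                 modifyIORef' acc (+ (t1 - t0))
  body start stop
  readIORef acc

main :: IO ()
main = do
  ns <- withUserTimers $ \start stop -> do
    -- unmeasured setup (e.g. opening a connection) would go here
    start
    print (sum [1 .. 100000 :: Int])  -- the measured region
    stop
    -- unmeasured teardown would go here
  print (ns > 0)
```

Note that the accumulator also covers the "timer starts and stops multiple times per sample" scenario mentioned above, since each start/stop pair just adds to the total.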
I've tried to build criterion, but it failed, and the build produced the following very long error log:
http://pastebin.com/8VW3q5g9
I'm running Windows 7 x64 SP1, GHC 7.8.3.
The build fails with both cabal-install 1.20.0.3, cabal library 1.20.0.2 and cabal-install 1.18.0.5, cabal library 1.18.1.3.
A fresh installation of the latest Haskell Platform (2014.2.0.0) for x64 machines was used (so the MinGW GCC version is GCC (rubenvb-4.6.3) 4.6.3).
ghc-pkg check doesn't find any problems with any package.
A measurement including "µs" will trigger an error in System.IO if the default encoding doesn't support the µ character, with the following output:
fibber.exe: : commitBuffer: invalid argument (invalid character)
Notably this is an issue on Windows where the default encoding is latin1.
I think the best solution is to switch 'µ' to 'u'.
I tried using hSetEncoding and it works, but the output will look like junk on a non-UTF-8 locale.
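The 'µ' to 'u' fallback suggested above is a one-liner; a sketch (the function name is hypothetical):

```haskell
-- Degrade the micro sign to plain ASCII 'u' before printing, so report
-- output survives a console whose encoding cannot represent 'µ'.
asciiUnits :: String -> String
asciiUnits = map (\c -> if c == 'µ' then 'u' else c)

main :: IO ()
main = putStrLn (asciiUnits "mean: 1.501650 µs")
```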
Hello,
a compiled binary can't be run properly on another machine because of the error ': TemplateNotFound "report.tpl"'. If the "templates" dir from the .cabal package is shipped alongside the binary, and the binary is run with "-t templates/report.tpl", the report is missing pictures.
Ubuntu OS.
Best regards,
vlatko
Please see below for the error log on cabal install of criterion 0.6.0.0. The error doesn't happen with GHC 7.0.4. It will be very helpful to have criterion build on 6.12.3 as well because I am trying to pin-point the performance regression between 6.12.3 and 7.0.4 for my code. I am using RHEL 5 x86_64.
Configuring criterion-0.6.0.0...
Preprocessing library criterion-0.6.0.0...
Building criterion-0.6.0.0...
[ 1 of 12] Compiling Paths_criterion ( dist/build/autogen/Paths_criterion.hs, dist/build/Paths_criterion.o )
[ 2 of 12] Compiling Criterion.Analysis.Types ( Criterion/Analysis/Types.hs, dist/build/Criterion/Analysis/Types.o )
[ 3 of 12] Compiling Criterion.Types ( Criterion/Types.hs, dist/build/Criterion/Types.o )
[ 4 of 12] Compiling Criterion.Measurement ( Criterion/Measurement.hs, dist/build/Criterion/Measurement.o )
[ 5 of 12] Compiling Criterion.Config ( Criterion/Config.hs, dist/build/Criterion/Config.o )
[ 6 of 12] Compiling Criterion.Monad ( Criterion/Monad.hs, dist/build/Criterion/Monad.o )
[ 7 of 12] Compiling Criterion.IO ( Criterion/IO.hs, dist/build/Criterion/IO.o )
[ 8 of 12] Compiling Criterion.Analysis ( Criterion/Analysis.hs, dist/build/Criterion/Analysis.o )
[ 9 of 12] Compiling Criterion.Environment ( Criterion/Environment.hs, dist/build/Criterion/Environment.o )
[10 of 12] Compiling Criterion.Report ( Criterion/Report.hs, dist/build/Criterion/Report.o )
Criterion/Report.hs:73:24:
No instance for (MonadIO Criterion)
arising from a use of `liftIO' at Criterion/Report.hs:73:24-29
Possible fix: add an instance declaration for (MonadIO Criterion)
In the first argument of `($)', namely `liftIO'
In the expression:
liftIO
$ do { tpl <- loadTemplate
[".", templateDir](fromLJ cfgTemplate cfg);
L.writeFile name =<< formatReport reports tpl }
In a case alternative:
Last (Just name)
-> liftIO
$ do { tpl <- loadTemplate [".", ....] (fromLJ cfgTemplate cfg);
L.writeFile name =<< formatReport reports tpl }
cabal: Error: some packages failed to install:
criterion-0.6.0.0 failed during the building phase. The exception was:
ExitFailure 1
When I run
import Criterion.Main
main = defaultMain [
bcompare [
bench "exp" $ whnf exp (2 :: Double)
, bench "log" $ whnf log (2 :: Double)
, bench "sqrt" $ whnf sqrt (2 :: Double)
]
]
I get:
Not in scope: `bcompare'
I am using criterion 0.8.1, vector 10.9.1 and ghc 7.8.2.
The following code:
import Criterion.Config
import Criterion.Main
critConfig = defaultConfig
{ cfgSamples = ljust 4
-- , cfgResamples = ljust 2
}
main = do
defaultMainWith critConfig (return ())
[ bench "bug" $ print "squash"
]
causes a crash with the error message:
criterion-bug: ./Data/Vector/Generic.hs:249 ((!)): index out of bounds (-9223372036854775808,100000)
criterion-bug: thread blocked indefinitely in an MVar operation
The big negative number is equal to minBound :: Int. If you replace the 4 with a 5, then everything works fine.
The 100000 made me think the error might have something to do with cfgResamples. If you uncomment the line assigning it to 2, then the program works fine as long as cfgSamples is greater than 25. The program stops working, however, if you set it to a smaller number.
I tried searching with grep for where the library uses ! to index into a vector: grep -r "\!" Criterion. But unfortunately, this turned up no uses.
Code with problematic output attached below.
The issue is that the measurements differ: if defaultMain has multiple vector benchmarks, the measurements seem to include the time to build the function inputs as well. If I benchmark only one vector function, the measurement excludes the time to build the function input, as I expected it to. The code below has two bench functions, "DoubleV" and "IntV". If I benchmark both, I get ~400 ns for each. If I benchmark only "DoubleV" by commenting out "IntV", I get ~185 ns, which is the expected measurement. I didn't see this issue when I initially created the functions using List, but when I switched from List to unboxed Vectors, I did. So it seems to be specific to vectors.
I can reproduce it only for the code below in question - I tried to reproduce it for functions built using basic operations like foldl but couldn't.
-- | Begin Haskell code
import Data.ByteString.Internal (unsafeCreate, ByteString)
import qualified Data.Vector.Unboxed as V (Vector, forM_, fromList, replicate, length, Unbox)
import Data.Bits (shiftR)
import GHC.Int (Int16, Int32, Int64)
import GHC.Word (Word8, Word16, Word32, Word64)
import Unsafe.Coerce (unsafeCoerce)
import Criterion.Main
import Foreign (Ptr, poke, plusPtr)
import Data.IORef (newIORef, readIORef, writeIORef)

-- | Write a Word32 in little endian format
putWord32le :: Word32 -> Ptr Word8 -> IO ()
putWord32le w p = do
  poke p (fromIntegral w :: Word8)
  poke (p `plusPtr` 1) (fromIntegral (shiftR w 8)  :: Word8)
  poke (p `plusPtr` 2) (fromIntegral (shiftR w 16) :: Word8)
  poke (p `plusPtr` 3) (fromIntegral (shiftR w 24) :: Word8)
{-# INLINE putWord32le #-}

-- | Write a Word64 in little endian format
putWord64le :: Word64 -> Ptr Word8 -> IO ()
putWord64le w p = do
  poke p (fromIntegral w :: Word8)
  poke (p `plusPtr` 1) (fromIntegral (shiftR w 8)  :: Word8)
  poke (p `plusPtr` 2) (fromIntegral (shiftR w 16) :: Word8)
  poke (p `plusPtr` 3) (fromIntegral (shiftR w 24) :: Word8)
  poke (p `plusPtr` 4) (fromIntegral (shiftR w 32) :: Word8)
  poke (p `plusPtr` 5) (fromIntegral (shiftR w 40) :: Word8)
  poke (p `plusPtr` 6) (fromIntegral (shiftR w 48) :: Word8)
  poke (p `plusPtr` 7) (fromIntegral (shiftR w 56) :: Word8)
{-# INLINE putWord64le #-}

-- | Function to generate putWordNleV functions, N = 16,32,64
genFnPutWordNleV :: (V.Unbox a) => (a -> Ptr Word8 -> IO ()) -> Int -> V.Vector a -> Ptr Word8 -> IO ()
genFnPutWordNleV f n w p = do
  addr <- newIORef p                       -- store the initial pointer address
  V.forM_ w $ \x -> do                     -- loop over the WordN vector; output type is IO ()
    curAddr <- readIORef addr              -- get the address for the current wordN
    writeIORef addr (curAddr `plusPtr` n)  -- write the address for the next wordN
    f x curAddr                            -- put the current WordN; must be the last action to satisfy the output type IO ()

putWord64leV :: V.Vector Word64 -> Ptr Word8 -> IO ()
putWord64leV = genFnPutWordNleV putWord64le 8
{-# INLINE putWord64leV #-}

putWord32leV :: V.Vector Word32 -> Ptr Word8 -> IO ()
putWord32leV = genFnPutWordNleV putWord32le 4
{-# INLINE putWord32leV #-}

encodeDoubleV :: V.Vector Double -> ByteString
encodeDoubleV x = unsafeCreate (8 * V.length x) (putWord64leV $ unsafeCoerce x)

encodeInt32V :: V.Vector GHC.Int.Int32 -> ByteString
encodeInt32V x = unsafeCreate (4 * V.length x) (putWord32leV $ unsafeCoerce x)

main :: IO ()
main = do
  let intv    = V.fromList [1..10] :: V.Vector GHC.Int.Int32
      doublev = V.fromList [1..10] :: V.Vector Double
  defaultMain
    [ bench "DoubleV" $ whnf encodeDoubleV doublev
    , bench "IntV"    $ whnf encodeInt32V intv
    ]
-- | End Haskell Code
warming up
estimating clock resolution...
mean is 5.155132 us (160001 iterations)
found 275510 outliers among 159999 samples (172.2%)
135061 (84.4%) low severe
140449 (87.8%) high severe
estimating cost of a clock call...
mean is 47.38399 ns (35 iterations)
found 5 outliers among 35 samples (14.3%)
2 (5.7%) high mild
3 (8.6%) high severe
benchmarking DoubleV
mean: 407.7989 ns, lb 406.7614 ns, ub 409.2373 ns, ci 0.950
std dev: 6.238086 ns, lb 4.913598 ns, ub 9.148503 ns, ci 0.950
benchmarking IntV
mean: 401.6377 ns, lb 400.5134 ns, ub 402.9650 ns, ci 0.950
std dev: 6.253466 ns, lb 5.415961 ns, ub 7.395777 ns, ci 0.950
warming up
estimating clock resolution...
mean is 5.142457 us (160001 iterations)
found 269135 outliers among 159999 samples (168.2%)
132865 (83.0%) low severe
136270 (85.2%) high severe
estimating cost of a clock call...
mean is 46.25240 ns (35 iterations)
found 2 outliers among 35 samples (5.7%)
2 (5.7%) high severe
benchmarking DoubleV
mean: 183.1018 ns, lb 182.2975 ns, ub 184.4998 ns, ci 0.950
std dev: 5.297046 ns, lb 3.595863 ns, ub 8.951493 ns, ci 0.950
Compile details: ghc -O --make
Criterion Version: 0.5.1.1 (but cabal list shows installed versions as: 0.5.1.0, 0.5.1.1)
Compiler: GHC 7.0.3
OS: Mac Darwin Kernel Version 11.2.0, root:xnu-1699.24.8~1/RELEASE_X86_64 x86_64
StackOverflow post: http://stackoverflow.com/questions/8379191/forcing-evaluation-of-function-input-before-benchmarking-in-criterion (different title because I initially misunderstood the issue as problem of lazy evaluation of inputs)
Have a look at http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/ab6c10fc84/results/bench-foldl-and-iorefs-are-slow-llvm.html.
In the bar chart at the top, the first time is 5.02 ms.
In the details for the first benchmark, Mean execution time says 5.015 s.
Current outlier summary is a little confusing:
benchmarking Set/toList
mean: 180.5426 ns, lb 153.1244 ns, ub 217.2590 ns, ci 0.950
std dev: 511.4385 ns, lb 422.8353 ns, ub 804.1799 ns, ci 0.950
found *1771* outliers among 1000 samples (177.1%)
771 (77.1%) low severe
1000 (100.0%) high severe
variance introduced by outliers: 99.900%
variance is severely inflated by outliers
Obviously, criterion sums over low and high severity outliers, but is this really the expected behaviour?
Criterion/Analysis.hs:129:29:
Couldn't match type `v0 Double -> Double'
with `Statistics.Types.Estimator'
Expected type: [Statistics.Types.Estimator]
Actual type: [v0 Double -> Double]
In the second argument of `resample', namely `ests'
In the expression:
resample gen ests numResamples samples :: IO [Resample]
In the second argument of `($)', namely
`\ gen -> resample gen ests numResamples samples :: IO [Resample]'
Criterion/Analysis.hs:130:55:
Couldn't match type `v0 Double -> Double'
with `Statistics.Types.Estimator'
Expected type: [Statistics.Types.Estimator]
Actual type: [v0 Double -> Double]
In the third argument of `B.bootstrapBCA', namely `ests'
In the expression: B.bootstrapBCA ci samples ests resamples
In a pattern binding:
[estMean, estStdDev] = B.bootstrapBCA ci samples ests resamples
Adding < 0.11.0.0 to the statistics dependency in the cabal file fixes it.
I get strange results when benchmarking IO actions that don't do any actual IO. This issue usually arises when benchmarking streaming libraries like pipes and conduit when I try to test the raw overhead of the streaming machinery.
To illustrate what I mean, consider these two pipelines for pipes and conduit:
pipes, conduit :: (Monad m) => Int -> m ()
pipes n = runEffect $ for (each [1..n] >-> P.map (+1) >-> P.filter even) discard
conduit n = C.sourceList [1..n] $= C.map (+1) $= C.filter even $$ C.sinkNull
These don't actually do any IO, so they type-check in any monad. However, if I specialize them to the IO monad and benchmark them using nfIO, I get very different results from specializing them to the Identity monad and using whnf.
Here's the code I use to benchmark the two alternatives:
import Criterion.Main
import Data.Conduit
import qualified Data.Conduit.List as C
import Pipes
import qualified Pipes.Prelude as P
import Data.Functor.Identity
criterion :: Int -> IO ()
criterion n = defaultMain
[ bgroup "IO"
[ bench "pipes" $ nfIO (pipes n)
, bench "conduit" $ nfIO (conduit n)
]
, bgroup "Identity"
[ bench "pipes" $ whnf (runIdentity . pipes ) n
, bench "conduit" $ whnf (runIdentity . conduit) n
]
]
pipes, conduit :: (Monad m) => Int -> m ()
pipes n = runEffect $ for (each [1..n] >-> P.map (+1) >-> P.filter even) discard
conduit n = C.sourceList [1..n] $= C.map (+1) $= C.filter even $$ C.sinkNull
main = criterion (10^5)
... and here are the results. I want to highlight that the benchmark results for specializing to IO are very different from specializing to Identity. Usually the IO-based benchmarks are several orders of magnitude faster:
warming up
estimating clock resolution...
mean is 24.61508 ns (20480001 iterations)
found 390965 outliers among 20479999 samples (1.9%)
265634 (1.3%) high mild
125331 (0.6%) high severe
estimating cost of a clock call...
mean is 23.52013 ns (1 iterations)
benchmarking IO/pipes
mean: 71.52687 ns, lb 65.49627 ns, ub 91.83582 ns, ci 0.950
std dev: 50.86117 ns, lb 17.29679 ns, ub 113.8678 ns, ci 0.950
found 7 outliers among 100 samples (7.0%)
4 (4.0%) high mild
3 (3.0%) high severe
variance introduced by outliers: 98.970%
variance is severely inflated by outliers
benchmarking IO/conduit
mean: 1.327538 ms, lb 1.311253 ms, ub 1.353610 ms, ci 0.950
std dev: 103.8509 us, lb 73.29134 us, ub 144.2492 us, ci 0.950
found 11 outliers among 100 samples (11.0%)
2 (2.0%) high mild
9 (9.0%) high severe
variance introduced by outliers: 69.713%
variance is severely inflated by outliers
benchmarking Identity/pipes
mean: 3.579730 ms, lb 3.495507 ms, ub 3.642569 ms, ci 0.950
std dev: 371.3227 us, lb 294.6107 us, ub 451.7314 us, ci 0.950
found 35 outliers among 100 samples (35.0%)
22 (22.0%) low severe
6 (6.0%) high mild
7 (7.0%) high severe
variance introduced by outliers: 80.047%
variance is severely inflated by outliers
benchmarking Identity/conduit
mean: 16.22628 ms, lb 16.17274 ms, ub 16.30177 ms, ci 0.950
std dev: 322.3388 us, lb 245.0389 us, ub 416.8648 us, ci 0.950
found 8 outliers among 100 samples (8.0%)
7 (7.0%) high severe
variance introduced by outliers: 13.224%
variance is moderately inflated by outliers
Whenever I get a weird criterion result, I always compare to time. In this case, I timed the above program by just changing the main function to either:
main = pipes (10^8)
-- or:
main = conduit (10^8)
This gave the following results:
$ time ./bench # main = pipes (10^8)
real 0m2.405s
user 0m2.380s
sys 0m0.016s
...
$ time ./bench # main = conduit (10^8)
real 0m16.243s
user 0m16.161s
sys 0m0.048s
If you scale these down 1000x to match the criterion tests, they agree better with the Identity-based results. Also, common sense says that there is no way that pipes is chewing through 10^5 elements in 71 nanoseconds on my laptop, which is the other reason I don't believe the IO-based benchmarks.
So I already have a work-around for this (specialize to Identity instead of IO), but I thought you might be interested, because benchmarking IO-free IO actions seems to have triggered a weird corner case of criterion that could be worth investigating. This issue occurs with both criterion-0.8.0.0 and criterion-0.9.0.0, and occurs with the cfgPerformGC flag set to either True or False.
I'm also pinging @cartazio on this because I mentioned this issue to him previously and he was interested in example code for this.
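One plausible (unconfirmed) explanation for the impossibly fast IO numbers is sharing: `pipes n` is a constant expression, so GHC may evaluate the pure work hidden inside it once and reuse the result across iterations. The following base-only sketch (all names are hypothetical, not criterion internals) shows how a constant top-level thunk is evaluated only once, no matter how many times an action returning it is run:

```haskell
import Data.IORef
import System.IO.Unsafe (unsafePerformIO)

-- Counter recording how many times the "expensive" thunk is forced.
evalCount :: IORef Int
evalCount = unsafePerformIO (newIORef 0)
{-# NOINLINE evalCount #-}

-- A constant top-level thunk: its body runs at most once, no matter
-- how many times an action returning it is executed.
sharedResult :: Int
sharedResult =
  unsafePerformIO (modifyIORef' evalCount (+ 1)) `seq` sum [1 .. 1000]
{-# NOINLINE sharedResult #-}

-- Analogue of benchmarking a constant IO action such as 'pipes n'.
run :: IO Int
run = return sharedResult

main :: IO ()
main = do
  mapM_ (\_ -> run >>= (`seq` return ())) [1 .. 100 :: Int]
  readIORef evalCount >>= print  -- the work happened once, not 100 times
```

This is also consistent with the work-around: `whnf (runIdentity . pipes) n` benchmarks a function applied to an argument, which criterion re-applies on every iteration, so the work cannot be shared.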
Has it been considered to add a delay flag that sleeps between sample runs for a specified number of milliseconds? I can imagine a Haskell program that uses OS resources such as TCP connections or the like, which are not forcibly closed by the OS; this creates an issue when Haskell programs are immediately re-sampled by Criterion.
Users of criterion might want a --delay Xms flag. The delay would not be counted in any timings, but would let the Haskell code "cool down" and all OS resources be released, e.g. 2 seconds between runs.
Has a --delay Xms flag been considered?
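Absent such a flag, the behaviour can be approximated outside criterion. A minimal sketch (all names hypothetical; this is not a criterion API) that sleeps before each action, with the sleep falling outside anything that would be timed:

```haskell
import Control.Concurrent (threadDelay)

-- Convert a cool-down in milliseconds to threadDelay's microseconds.
cooldownMicros :: Int -> Int
cooldownMicros ms = ms * 1000

-- Run each named action, sleeping beforehand so resources (sockets,
-- file handles) from the previous run can be released by the OS.
withCooldown :: Int -> [(String, IO ())] -> IO ()
withCooldown delayMs = mapM_ step
  where
    step (name, act) = do
      threadDelay (cooldownMicros delayMs)  -- untimed cool-down
      putStrLn ("running " ++ name)
      act
```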
Currently criterion is only able to benchmark functions using wall-clock time. My experiments show that such measurements are quite sensitive to CPU load: execution time can easily double on a heavily loaded system. What's worse, such measurements are not reproducible. Every successive run can give a different answer, and the discrepancy cannot be explained by statistical fluctuations.
I did a quick experiment and replaced getPOSIXTime with getCPUTime. In that case the measurements did not depend on CPU load, as one would expect. So CPU time is a better performance metric for some functions, such as numeric ones.
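The experiment can be reproduced in a few lines (a sketch assuming the time package for getPOSIXTime; timeBoth and picosToSeconds are hypothetical helpers, not criterion code). Under external CPU load the wall-clock figure grows while the CPU-time figure stays roughly constant:

```haskell
import Control.Exception (evaluate)
import Data.Time.Clock.POSIX (getPOSIXTime)  -- from the 'time' package
import System.CPUTime (getCPUTime)           -- base; picosecond units

-- getCPUTime reports picoseconds; convert to seconds.
picosToSeconds :: Integer -> Double
picosToSeconds p = fromIntegral p / 1e12

-- Measure one action with both clocks, returning (wall, cpu) seconds.
timeBoth :: IO a -> IO (Double, Double)
timeBoth act = do
  w0 <- getPOSIXTime
  c0 <- getCPUTime
  _  <- act
  c1 <- getCPUTime
  w1 <- getPOSIXTime
  return (realToFrac (w1 - w0), picosToSeconds (c1 - c0))

main :: IO ()
main = do
  (wall, cpu) <- timeBoth (evaluate (sum [1 .. 1000000 :: Int]))
  putStrLn ("wall: " ++ show wall ++ " s, cpu: " ++ show cpu ++ " s")
```

Note that CPU time has much coarser resolution on some platforms and ignores time the benchmark spends blocked, so it suits pure numeric code better than IO-heavy code.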
Right now the format of the summary file is hard-coded in the function defaultMainWith, as far as I can tell. It would be wonderful to have the ability to select only certain numbers to be written to the summary file.
A complete log file is available at http://hydra.cryp.to/build/45829/nixlog/2/raw. I guess this issue is triggered by the latest version of unix-bytestring.
Building criterion-0.8.0.0...
Preprocessing library criterion-0.8.0.0...
[ 1 of 14] Compiling Paths_criterion ( dist/dist-sandbox-be38e152/build/autogen/Paths_criterion.hs, dist/dist-sandbox-be38e152/build/Paths_criterion.o )
[ 2 of 14] Compiling Criterion.Config ( Criterion/Config.hs, dist/dist-sandbox-be38e152/build/Criterion/Config.o )
[ 3 of 14] Compiling Criterion.Monad ( Criterion/Monad.hs, dist/dist-sandbox-be38e152/build/Criterion/Monad.o )
[ 4 of 14] Compiling Criterion.Measurement ( Criterion/Measurement.hs, dist/dist-sandbox-be38e152/build/Criterion/Measurement.o )
[ 5 of 14] Compiling Criterion.IO.Printf ( Criterion/IO/Printf.hs, dist/dist-sandbox-be38e152/build/Criterion/IO/Printf.o )
[ 6 of 14] Compiling Criterion.Analysis.Types ( Criterion/Analysis/Types.hs, dist/dist-sandbox-be38e152/build/Criterion/Analysis/Types.o )
[ 7 of 14] Compiling Criterion.Analysis ( Criterion/Analysis.hs, dist/dist-sandbox-be38e152/build/Criterion/Analysis.o )
Criterion/Analysis.hs:127:15:
No instance for (Data.Vector.Generic.Base.Vector v0 Double)
arising from a use of `mean'
Possible fix:
add an instance declaration for
(Data.Vector.Generic.Base.Vector v0 Double)
In the expression: mean
In the expression: [mean, stdDev]
In an equation for `ests': ests = [mean, stdDev]
Criterion/Analysis.hs:129:25:
Couldn't match type `primitive-0.5.2.1:Control.Monad.Primitive.PrimState
m0'
with `GHC.Prim.RealWorld'
Expected type: primitive-0.5.2.1:Control.Monad.Primitive.PrimState
IO
Actual type: primitive-0.5.2.1:Control.Monad.Primitive.PrimState
m0
Expected type: System.Random.MWC.Gen
(primitive-0.5.2.1:Control.Monad.Primitive.PrimState IO)
Actual type: System.Random.MWC.Gen
(primitive-0.5.2.1:Control.Monad.Primitive.PrimState m0)
In the first argument of `resample', namely `gen'
In the expression:
resample gen ests numResamples samples :: IO [Resample]
Criterion/Analysis.hs:129:29:
Couldn't match expected type `Statistics.Types.Estimator'
with actual type `v0 Double -> Double'
Expected type: [Statistics.Types.Estimator]
Actual type: [v0 Double -> Double]
In the second argument of `resample', namely `ests'
In the expression:
resample gen ests numResamples samples :: IO [Resample]
Obviously, based on some combination of my project's dependencies and the ones you've specified in criterion, cabal has chosen what it thinks is a valid build plan -- probably you're accepting a version of one of your dependencies that you don't actually support.
As reported at http://bugs.debian.org/736440 the criterion release contains these files without their corresponding source code
templates/js/excanvas-r3.min.js
templates/js/jquery-1.6.4.min.js
templates/js/jquery.flot-0.7.min.js
This causes problems for downstream distributions.
Could you please make a release of criterion that includes the corresponding source (i.e. non-minified, non-packed) files?
Bonus points for ensuring that the minified files are really derived from the source files, for example by adding a small Makefile and building them yourself, which would also serve as documentation about the tool used to create these files.
Thanks,
Joachim
The error is (in both GHC 7.6.3 and 7.8.3):
Criterion/Main.hs:28:7:
The export item `Benchmarkable(run)'
attempts to export constructors or class methods that are not visible here
The error was introduced in commit 925f3f2. Benchmarkable now seems to be a newtype instead of a class. I don't know how you want to fix it. I guess you could export run as a newtype unwrapper, but you may prefer another fix.
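The "newtype unwrapper" fix could look like this minimal sketch (the field name and type here are assumptions for illustration, not the actual criterion source). Declaring the newtype with a record selector and exporting `Benchmarkable(..)` restores something close to the old `Benchmarkable(run)` export item:

```haskell
-- Field name and argument type are assumptions, not criterion's
-- actual definition.
newtype Benchmarkable = Benchmarkable { run :: Int -> IO () }

-- A trivial benchmark body showing that 'run' unwraps the newtype
-- much like the old class method was used.
noop :: Benchmarkable
noop = Benchmarkable (\_iters -> return ())
```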
It would be nice if Criterion.Main-generated programs had an option for condensed command-line output. Often I am only interested in the mean values, and having one line per benchmark with just that number would be easier to read.
(Related to #17, but I'd like this output to be available by default, as it is especially the quick-shot benchmarks that benefit from a small overview.)
It seems reader was removed in the latest version of optparse-applicative; it works with 0.9.1.1.
Hey,
there is a brand-new hastache version 0.5.0 which breaks criterion. There is no upper bound in criterion.cabal, so this affects all standard builds.
Criterion/Report.hs:94:42:
Couldn't match kind `*' against `* -> *'
Kind incompatibility when matching types:
m0 :: * -> *
MuType :: (* -> *) -> *
In the expression:
mkGenericContext reportAnalysis $ H.encodeStr nym
In a case alternative:
('a' : 'n' : _)
-> mkGenericContext reportAnalysis $ H.encodeStr nym
Criterion/Report.hs:99:42:
Couldn't match expected type `IO (MuType IO)'
with actual type `MuType m0'
Expected type: H.MuContext IO
Actual type: B.ByteString -> MuType m0
In the third argument of `H.hastacheStr', namely `context'
In a stmt of a 'do' block:
H.hastacheStr H.defaultConfig template context
Greetings, Alex
The following code compares two functions for summing over a vector:
{-# LANGUAGE BangPatterns #-}
import Control.DeepSeq
import Criterion
import Criterion.Main
import Data.Vector.Unboxed as VU
import Data.Vector.Generic as VG
import qualified Data.Vector.Fusion.Stream as Stream
sumV :: (VG.Vector v a, Num a) => v a -> a
sumV = Stream.foldl' (+) 0 . VG.stream
main = do
let v10 = VU.fromList [0..9] :: VU.Vector Double
deepseq v10 $ return ()
defaultMain
[ bench "sumV" $ nf sumV v10
]
But suppose I change the last few lines to the following:
defaultMain
[ bench "sumV" $ nf sumV v10
, bench "VU.sum" $ nf VU.sum v10 -- Added this line
]
This, surprisingly, affects the runtime of the sumV benchmark: it makes it about 20% faster. Similarly, if we remove the sumV benchmark and leave the VU.sum benchmark, the VU.sum benchmark becomes about 20% slower. Tests were run with the patched criterion-1.0.0.2 I sent, on ghc-7.8.3 with the -O2 -fllvm flags.
What's going on is that different core is generated for the sumV and VU.sum benchmarks depending on whether the other benchmark is present. Essentially, the common bits are being factored out and placed in a function, and this function gets called in both benchmarks. This happens to make both benchmarks faster.
I'm not sure if this should be considered a "proper bug," but it confused me for an hour or so. It's something that criterion users (especially those performing really small benchmarks) probably should be aware of.
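One mitigation worth trying (a suggestion under the assumption that common-subexpression sharing is the culprit, not a verified fix) is to pin each benchmarked function behind its own NOINLINE top-level binding, which discourages GHC from generating different core for a function depending on which other benchmarks are present:

```haskell
import Data.List (foldl')

-- Each benchmarked function gets its own NOINLINE wrapper, so its
-- code is compiled once, independently of the benchmark list around it.
sumStrict :: [Double] -> Double
sumStrict = foldl' (+) 0
{-# NOINLINE sumStrict #-}

sumLazy :: [Double] -> Double
sumLazy = sum
{-# NOINLINE sumLazy #-}
```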
When benchmarking threading-intensive applications and libraries, I've started seeing this error crop up across multiple projects.
It is nondeterministic, but it seems to be related to the fact that each project passes the following options:
--regress=allocated:iters --regress=bytesCopied:iters --regress=cycles:iters \
--regress=numGcs:iters --regress=mutatorWallSeconds:iters --regress=gcWallSeconds:iters \
--regress=cpuTime:iters --raw report.criterion -o report.html
I suspect these options because I have not been able to reproduce it in a (minimal) run with no options. Nor have I (yet) been able to pin down, by process of elimination, which of these options introduces the bug.
I haven't yet tried to reproduce it in profiling mode to get a backtrace...
When attempting to collect benchmarking data, it occasionally happens that system noise will make a few of the iterations take much longer than other iterations. If one iteration out of a thousand does this and is ten times slower than the rest, this inflates the average by 10% and can cause havoc with the standard deviation. Even if only one iteration out of a hundred does this, then the average and standard deviation are useless even though the data is meaningful if you exclude these outliers.
I would like the ability to instruct Criterion to omit such outliers from its calculations. Even something as simple as removing the best and worst 10% of samples would often be sufficient.
Of course this could be abused (e.g., the standard deviation doesn't mean quite so much if you remove the best and worst 49% of samples), but for benchmarking things like CPU times of code that does no IO or external calculation it can be very useful as benchmark numbers often need to reflect the performance of the code being benchmarked instead of whatever system noise happened to randomly kick in.
Possible extensions of this idea include reporting the median and/or mode. Another possibility is to fit the sampling data to some sort of distribution that is flexible enough to account for system noise (e.g. a Poisson, bimodal(*), or mixture-based distribution) and then report the parameters of that distribution (e.g., the location of the peak, or peaks, rather than just the mean).
(*) The second peak represents when the system noise kicks up.
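The simplest variant proposed above, dropping the best and worst fraction of samples, can be sketched in a few lines (a symmetric trimmed mean; trimmedMean is a hypothetical name, not a criterion function):

```haskell
import Data.List (sort)

-- Mean of the samples remaining after discarding the smallest and
-- largest 'frac' of them (e.g. frac = 0.1 drops 10% from each end).
trimmedMean :: Double -> [Double] -> Double
trimmedMean frac xs
  | null kept = 0 / 0  -- NaN for empty input
  | otherwise = sum kept / fromIntegral (length kept)
  where
    sorted = sort xs
    n      = length xs
    k      = floor (frac * fromIntegral n)   -- samples dropped per side
    kept   = take (n - 2 * k) (drop k sorted)
```

For the scenario described above, one sample out of a hundred that is ten times slower than the rest is excluded entirely by even a small trim fraction, whereas the plain mean is inflated by it.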