stressdisk's Introduction

StressDisk

This is a program designed to stress test your disks and find failures in them.

Use it to soak test your new disks / memory cards / USB sticks before trusting your valuable data to them.

You can also use it to soak test new PC hardware for the same reason.

Note that it turns out to be quite a sensitive memory tester too, so errors can sometimes be caused by bad RAM in your computer rather than by the disk.

Install

StressDisk is a Go program and comes as a single binary file.

Download the relevant binary from

Alternatively, if you have Go installed, use

go install github.com/ncw/stressdisk@latest

If you want to modify the sources, it is recommended to check out the repository.

git clone https://github.com/ncw/stressdisk.git
cd stressdisk
go build .

You can then modify the source, rebuild as needed, and submit patches.

Usage

Use stressdisk -h to see all the options.

Disk soak testing utility

Automatic usage:
  stressdisk run directory            - auto fill the directory up and soak test it
  stressdisk cycle directory          - fill, test, delete, repeat - torture for flash
  stressdisk clean directory          - delete the check files from the directory

Manual usage:
  stressdisk help                       - this help
  stressdisk [ -s size ] write filename - write a check file
  stressdisk read filename              - read the check file back
  stressdisk reads filename             - ... repeatedly for duration set
  stressdisk check filename1 filename2  - compare two check files
  stressdisk checks filename1 filename2 - ... repeatedly for duration set

Full options:
  -cpuprofile string
        Write cpu profile to file
  -duration duration
        Duration to run test (default 24h0m0s)
  -logfile string
        File to write log to set to empty to ignore (default "stressdisk.log")
  -maxerrors uint
        Max number of errors to print per file (default 64)
  -nodirect
        Don't use O_DIRECT
  -s int
        Size of the check files (default 1000000000)
  -stats duration
        Interval to print stats (default 1m0s)
  -statsfile string
        File to load/store statistics data (default "stressdisk_stats.json")

Note that flags must be provided BEFORE the stressdisk command, eg
  stressdisk -duration 48h run /mnt
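
The ordering rule in the help text above matches how Go's standard flag package behaves: parsing stops at the first non-flag argument, so anything after the command word is treated as a plain argument. The sketch below illustrates this under the assumption that stressdisk parses its options with the standard flag package, which this README does not confirm. The parseArgs helper is hypothetical and only models the -duration flag.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// parseArgs is a hypothetical helper for illustration, not stressdisk's code.
// It parses a -duration flag and returns the remaining positional arguments.
func parseArgs(args []string) (time.Duration, []string, error) {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	duration := fs.Duration("duration", 24*time.Hour, "Duration to run test")
	if err := fs.Parse(args); err != nil {
		return 0, nil, err
	}
	return *duration, fs.Args(), nil
}

func main() {
	// Flag before the command: the flag is parsed.
	d, rest, _ := parseArgs([]string{"-duration", "48h", "run", "/mnt"})
	fmt.Println(d, rest) // 48h0m0s [run /mnt]

	// Flag after the command: parsing stopped at "run", so the default
	// duration is kept and "-duration 48h" is left as positional arguments.
	d, rest, _ = parseArgs([]string{"run", "-duration", "48h", "/mnt"})
	fmt.Println(d, rest) // 24h0m0s [run -duration 48h /mnt]
}
```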

Quickstart

Install your new media in your computer and format it (make a filesystem on it).

Open a terminal (or cmd prompt if running Windows).

To check the disk:

Linux: ./stressdisk run /media/nameofnewdisk
Windows: stressdisk.exe run F:

Let it run for 24 hours. It will finish on its own. Note whether any errors were reported. Then use the following to remove the check files:

Linux: ./stressdisk clean /media/nameofnewdisk
Windows: stressdisk.exe clean F:

If you find errors, then you can use the read / reads / check / checks sub-commands to investigate further.

2012/09/20 22:23:20 Exiting after running for > 30s
2012/09/20 22:23:20 
Bytes read:         20778 MByte ( 692.59 MByte/s)
Bytes written:          0 MByte (   0.00 MByte/s)
Errors:                 0
Elapsed time:  30.00033s

2012/09/20 22:23:20 PASSED with no errors

Stressdisk can be interrupted after it has written its check files and it will continue from where it left off.

The default running time for stressdisk is 24h, which is a sensible minimum. If you want to run it for longer, use -duration 48h, for instance.

Errors

If stressdisk finds an error it will print lines like this:

2019/03/07 10:55:09 0AA00000: 2D, A1 diff 8C

The fields are the offset, the file1 value, the file2 value and the diff, which is file1_value XOR file2_value, all in hexadecimal. For a single bit error the diff will be a power of two: 01, 02, 04, 08, 10, 20, 40 or 80.

This may give some insight into the problem (eg a single bit flipped, or errors starting at 4k boundaries), but may not.
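
As a worked example of reading the diff field, the sketch below recomputes the XOR for the sample line above (2D and A1) and counts how many bits differ. The classifyDiff helper is a hypothetical name used only for this illustration; it is not part of stressdisk.

```go
package main

import (
	"fmt"
	"math/bits"
)

// classifyDiff is a hypothetical helper for illustration.
// It returns the XOR of the two byte values and the number of bits that differ.
func classifyDiff(a, b byte) (diff byte, flipped int) {
	diff = a ^ b
	return diff, bits.OnesCount8(diff)
}

func main() {
	// Values taken from the sample error line: 2D, A1 diff 8C.
	diff, n := classifyDiff(0x2D, 0xA1)
	fmt.Printf("diff %02X, %d bit(s) flipped\n", diff, n) // diff 8C, 3 bit(s) flipped

	// A single bit error gives a diff with exactly one bit set.
	diff, n = classifyDiff(0x2D, 0x2F)
	fmt.Printf("diff %02X, %d bit(s) flipped\n", diff, n) // diff 02, 1 bit(s) flipped
}
```

So the sample line is not a single bit error: 8C has three bits set.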

However, the actual errors aren't that important: you shouldn't get any. If you do then:

  1. run memtest86 on the machine for 48 hours to check for RAM problems, if this passes then
  2. try the stressdisk test on another machine if you can, if this fails then
  3. discard or return the media

If you didn't get to step 3. then you'll need to play with the hardware of the machine, replace the RAM etc. Stressdisk errors are usually caused by bad media, but not always. Bad RAM is a fairly likely cause of stressdisk errors too.

Testing Flash

Stressdisk has a special mode which is good for giving flash / SSD media a hard time. The normal "run" test will fill the disk and read the files back continually, which is a good test but doesn't torture flash as much as it could, as writing is a much more intensive operation for flash than reading.

To test flash / SSD harder, "cycle" mode does lots of write cycles as well as read cycles. It works by filling the media with test files, verifying that the data is valid, deleting the test files, and repeating the write + verify process continually.

Caution: This will be destructive to flash media if run for long periods of time, since flash devices have a limited number of writes per sector/cell.

This Is Intentional! You can use this to stress test flash harder.

You can also use this mode to find the breaking point of flash devices to determine what the lifetime of the media is if you are quality testing flash media before making a bulk buy. The -statsfile option is useful when doing this to save persistent stats to disk in case the process is interrupted.

If you are merely interested in doing a less destructive test of the flash device for data integrity, then you should use the "run" mode, as this mode only writes the check files once and then does read operations to verify data integrity, which have little destructive penalty.

How it works

Stressdisk fills up your disk with identical large files (1 GB by default) full of random data. It then randomly chooses a pair of these and reads them back checking they are the same.

This causes the disk head to seek backwards and forwards across the disk surface very quickly which is the worst possible access pattern for disk drives and flushes out errors.

It seems to work equally well for non-rotating media.

The access patterns are designed so that your computer won't cache the data being read off the disk so your computer will be forced to read it off the disk.

Stressdisk uses OS specific commands to make sure the data isn't cached in RAM so that you won't just be testing your computer RAM.
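
The pairwise comparison described above can be sketched as follows. This is a minimal illustration and not stressdisk's actual implementation: compareReaders and compareBytes are hypothetical names, the data is held in memory rather than read from check files, and nothing here bypasses the OS cache (no O_DIRECT). The maxErrors parameter mirrors the idea of the -maxerrors option, capping how many differences are collected.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// isEnd reports whether err marks the end of a stream as seen by io.ReadFull.
func isEnd(err error) bool {
	return err == io.EOF || err == io.ErrUnexpectedEOF
}

// compareReaders is a hypothetical helper for illustration.
// It reads both streams in blocks of blockSize bytes (which must be > 0) and
// returns the offsets of differing bytes, collecting at most maxErrors of them.
func compareReaders(r1, r2 io.Reader, blockSize, maxErrors int) ([]int64, error) {
	buf1 := make([]byte, blockSize)
	buf2 := make([]byte, blockSize)
	var offsets []int64
	var base int64
	for {
		n1, err1 := io.ReadFull(r1, buf1)
		n2, err2 := io.ReadFull(r2, buf2)
		if err1 != nil && !isEnd(err1) {
			return offsets, err1
		}
		if err2 != nil && !isEnd(err2) {
			return offsets, err2
		}
		if n1 != n2 {
			return offsets, fmt.Errorf("streams differ in length after offset %d", base)
		}
		// Fast path: only scan byte by byte when the blocks are not identical.
		if !bytes.Equal(buf1[:n1], buf2[:n2]) {
			for i := 0; i < n1 && len(offsets) < maxErrors; i++ {
				if buf1[i] != buf2[i] {
					offsets = append(offsets, base+int64(i))
				}
			}
		}
		base += int64(n1)
		if err1 != nil { // both streams ended on this block
			return offsets, nil
		}
	}
}

// compareBytes wraps compareReaders for in-memory data.
func compareBytes(a, b []byte, blockSize, maxErrors int) ([]int64, error) {
	return compareReaders(bytes.NewReader(a), bytes.NewReader(b), blockSize, maxErrors)
}

func main() {
	offsets, err := compareBytes([]byte("hello world"), []byte("hellO worlD"), 4, 64)
	fmt.Println(offsets, err) // [4 10] <nil>
}
```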

History

I wrote the first version of stressdisk in about 1995 after discovering that the CD I had just written at great expense had bit errors in it. I discovered that my very expensive SCSI disk was returning occasional errors.

It has been used over the years to soak test 1000s of disks, memory cards and USB sticks, and has found many with errors. It has also found quite a few memory errors (bad RAM).

The original stressdisk was written in C with a perl wrapper but it was rather awkward to use because of that, so I re-wrote it in Go in 2012 as an exercise in learning Go and so that I could distribute it in an easy to run single executable format.

License

This is free software under the terms of the MIT license (check the COPYING file included in this package).

Contact and support

The project website is at:

There you can file bug reports, ask for help or contribute patches.

Authors

Contributors

  • Yves Junqueira for code review and helpful suggestions
  • dcabro for reporting the windows empty partition issue
  • Colin Lord for fixing documentation issues
  • Your name goes here!

stressdisk's Issues

Status file gets corrupted on restart

I'm doing a long term test to wear out an SD card by combining writes with periodic power outages.

I have set up stressdisk as a systemd script that runs on startup:

/usr/bin/stressdisk -statsfile /home/root/status  -logfile /home/root/stress-log-sd -duration 48h0m0s -stats 1m0s -s 500000 cycle /data/senic-hub/logs

I need to delete /home/root/status every time before running it, otherwise I end up with a failed start: 2018/06/25 16:26:44 error reading statsfile: EOF

sw malfunction?

I am trying to find dependable and trustworthy software for testing storage, whether spinning, USB or SSD. I have not found it yet: some tools look good but do not work with USB sticks, others do not recognize SSDs, etc.
The best so far is H2testw, but it has its issues.
Now testing stressdisk.

I used stressdisk with a USB connected half giga SSD on Windows 10 2004 x64, several times, with a full format between tests.
I did this on a unit that H2testw had no issue with.

I systematically had two issues:

a) Created files are 954MB (1GiB?). Because the space left at the end is smaller (774MB), stressdisk issues an error message, probably unrelated to a media error.
2020/11/23 11:09:13 Writing file "o:TST_0511" size 1000000000
2020/11/23 11:09:19
Bytes read: 1204228 MByte ( 423.25 MByte/s)
Bytes written: 1734176 MByte ( 96.56 MByte/s)
Errors: 0
Elapsed time: 26m0.0010488s

2020/11/23 11:09:20 Error while writing "o:TST_0511"
2020/11/23 11:09:20 Removing incomplete file "o:TST_0511"
2020/11/23 11:09:20 Starting round 1
2020/11/23 11:09:20 Reading file "o:TST_0468", "o:TST_0129"

b) After having read almost ten times the storage (5111GB read), I get the error:
"Error while reading "o:TST_0099": read o:TST_0099: A device which does not exist was specified"
Then stressdisk ends.
TST_0099 had been read 9 times before without issue.
The operating system shows the device with no issue: I can delete files, format it, and use it normally.
The other stressdisk tests had a similar outcome.

The first one looks like a kind of bug.
The second one is weird. Could it perhaps be a heating issue? A wrong message? A bug?

Otherwise, the media looks to be working properly.

Add an option to set BlockSize

It would be nice if that could be added in order to see if the current size being used is causing errors while testing a UDF drive with an open source driver being used to mount it.

Make a C++ version

It would be nice if you or someone else could do that, so that a developer I have working on a UDFS driver can figure out how it works and change it so that it causes the read errors for both of us instead of just me.

Reduce log verbosity

I'm running stressdisk as a systemd service and am finding the stats very useful.

Bytes read:           144 MByte (   8.73 MByte/s)
Bytes written:         78 MByte (   2.81 MByte/s)
Errors:                 0
Elapsed time:  1m0.038864945s

Especially when comparing them to the numbers that monitoring tools such as iostat give.

But the constant logging is too verbose:

Error while writing "/data/logs/TST_0004"
Removing incomplete file "/data/logs/TST_0004"
Reading file "/data/logs/TST_0002", "/data/logs/TST_0003"
Reading file "/data/logs/TST_0001", "/data/logs/TST_0002"
Reading file "/data/logs/TST_0000", "/data/logs/TST_0001"
Reading file "/data/logs/TST_0003", "/data/logs/TST_0000"
Removing 4 check files
Removing file "/data/logs/TST_0000"
Removing file "/data/logs/TST_0001"
Removing file "/data/logs/TST_0002"
Removing file "/data/logs/TST_0003"
....

It would be a great feature to be able to quiet those.

Too much logging on error

If stressdisk does find an error it can produce too many log messages, logging every difference in a 1 GB file!

Need to limit the difference logs

Can't run on an empty partition

stressdisk on Windows will not run on an empty partition. It reports "Couldn't read directory "X:": open X:: Access is denied."

As soon as I created one empty directory, it ran without problems. I ran this from an elevated prompt (with admin rights).

Thrashing of ada0 (boot device, hard disk drive) whilst testing a USB thumb drive at da0

Device

8 GB Verbatim STORE N GO s/n 17071802004381

Symptoms

Before killing stressdisk – note the xterm view of gstat -p:

2020-12-29 17:25:18

The desktop environment became completely unusable, I had to key Control-Alt-F2 then log in as root to perform a killing (and at ttyv1 things were also horribly slow to respond).

Also at some point I pulled out the drive, in an attempt to regain use of the desktop environment.

If I recall correctly, neither SIGTERM nor SIGKILL worked but (clutching at straws) SIGILL terminated the process. https://www.freebsd.org/cgi/man.cgi?sektion=3&query=signal

After termination of stressdisk:

2020-12-29 17:29:13

Terminal shows Errors but Software Reports 0 Errors

Screenshot 2024-07-26 at 11 31 26

I came across this some time ago and have now had the issue again. I'm testing a CFExpress card with stressdisk: there are write errors shown in the terminal window and log, but the software reports 0 errors under each segment.

How can this be? Is this a bug or expected behaviour?

Document the random behavior of choosing which files are being read?

Right now the documentation states:
It then randomly chooses a pair of these and reads them back checking they are the same.

So by reading this I would assume that due to random chance it would be possible that even if I let this run for multiple days one file was already read e.g. 4 times, while another file hasn't been read ever. In practice by looking at the implementation it seems to me that not just stressdisk cycle but also stressdisk run is implemented in rounds. And during each round each file is read exactly twice before progressing to the next round. (Ignoring the unlucky case where one file would be selected to be compared to itself then that check would probably be skipped).
Sorry if I got anything wrong here, I just had a brief glance at the code.

Do we want to document this behaviour? I think sometimes it's useful to make sure that every file has been read at least once if you want to test some hardware. I know stressdisk's main purpose seems to be to create a lot of seek stress to test the drive this way, but ensuring every file was tried at least once as soon as it prints "Starting round 2" would be a nice additional guarantee.

Or is this implementation detail undocumented on purpose? E.g. so the implementation can change in the future.

Errors are Ambiguous

I am using your program to validate solid state drives, but the error outputs do not make any sense. There is no description of what the errors actually are (1 bit wrong, 1 byte wrong, etc.). I am also seeing an abnormally high number of errors. Here is an example of the results of running the "cycle" run on an enterprise-grade intel SSD drive (240 GB) for 24 hours.

Read: 20,775,224 MByte (488.23 MByte/s)
Write: 10,506,396 MByte (240.3 MByte/s)
Errors: 75,202,464

I have also attached a log file for your reference.

stressdisk_enterprise_1.log

Problem on Mac OS 10.6.8 with new drive

First time user of stressdisk, so not sure where the issue is with this:

Mac Pro (2010), Mac OS 10.6.8
New WD Black 4TB in external eSATA dock. Dock has a fan and drive is room temp to the touch.

I started stressdisk yesterday with the command:

./stressdisk -duration=24h0m0s -logfile="stressdisk.log" run /Volumes/NewDisk/

It was writing fine, at about 175MB/s:

2015/04/11 14:19:54 Writing file "/Volumes/NewDisk/TST_0024" size 1000000000
2015/04/11 14:20:00 Writing file "/Volumes/NewDisk/TST_0025" size 1000000000
2015/04/11 14:20:05 Writing file "/Volumes/NewDisk/TST_0026" size 1000000000
2015/04/11 14:20:11 Writing file "/Volumes/NewDisk/TST_0027" size 1000000000
2015/04/11 14:20:16 Writing file "/Volumes/NewDisk/TST_0028" size 1000000000

As it tried to fill the disk, it slowed way, way down:
2015/04/12 03:13:40 Writing file "/Volumes/NewDisk/TST_3989" size 1000000000
2015/04/12 03:13:56 Writing file "/Volumes/NewDisk/TST_3990" size 1000000000
2015/04/12 08:48:03 Writing file "/Volumes/NewDisk/TST_3991" size 1000000000
2015/04/12 10:52:18 Writing file "/Volumes/NewDisk/TST_3992" size 1000000000
2015/04/12 12:03:40 Writing file "/Volumes/NewDisk/TST_3993" size 1000000000
2015/04/12 13:14:52 Writing file "/Volumes/NewDisk/TST_3994" size 1000000000
2015/04/12 14:26:27 Writing file "/Volumes/NewDisk/TST_3995" size 1000000000

We're past the 24 hours (started 4/11 14:17:42, currently 14:50:45) and it's now reading at about 25MB/s, which I assume is the random read portion of the test. Since it finished writing after the timer should have expired, will it ever stop? Also, are any error messages flushed out of whatever stream they are written to, so that if there are no errors visible I can assume that there have been none?

I'll let it continue to run for now. If I recall, if I kill it now and restart it, will it detect the disk as full and just read the existing files?
