Comments (14)
Thx for pointing that out. Should be fixed now with commit d3a85e9
from check_smart.
Hi @peternewman
Could you please share the current usage and output? I assume you're using -g?
from check_smart.
Yes I am. It'll be a little while until I can do so, and I've replaced the failed drive, but I should be able to get the historic output from Nagios.
It was correctly reporting the overall status of the check as Critical, and listing the associated faults and level with each drive, but it was listing as sdb, sdc, sda (i.e. broken first, but just by letter/discovery order, not respective criticality).
from check_smart.
Usage was:
check_smart.pl -g '/dev/sd[a-z]' -i auto --selftest --ssd-lifetime
Output was:
CRITICAL: [/dev/sdb] - Reported_Uncorrect is non-zero (1), Current_Pending_Sector is non-zero (1), Offline_Uncorrectable is non-zero (1) --- [/dev/sdc] - Health status: FAILED!, Attribute Raw_Read_Error_Rate failed at In_the_past, Attribute Reallocated_Sector_Ct failed at FAILING_NOW, Reallocated_Sector_Ct is non-zero (2047), Reported_Uncorrect is non-zero (133), Attribute Reallocated_Event_Count failed at FAILING_NOW, Reallocated_Event_Count is non-zero (2047) --- [/dev/sda] - Device is clean --- [/dev/sdd] - Device is clean
Reformatted into bullets to make my point clearer, it was like this:
- CRITICAL:
- [/dev/sdb] - Reported_Uncorrect is non-zero (1), Current_Pending_Sector is non-zero (1), Offline_Uncorrectable is non-zero (1)
- [/dev/sdc] - Health status: FAILED!, Attribute Raw_Read_Error_Rate failed at In_the_past, Attribute Reallocated_Sector_Ct failed at FAILING_NOW, Reallocated_Sector_Ct is non-zero (2047), Reported_Uncorrect is non-zero (133), Attribute Reallocated_Event_Count failed at FAILING_NOW, Reallocated_Event_Count is non-zero (2047)
- [/dev/sda] - Device is clean
- [/dev/sdd] - Device is clean
What I wanted was:
- CRITICAL:
- [/dev/sdc] - Health status: FAILED!, Attribute Raw_Read_Error_Rate failed at In_the_past, Attribute Reallocated_Sector_Ct failed at FAILING_NOW, Reallocated_Sector_Ct is non-zero (2047), Reported_Uncorrect is non-zero (133), Attribute Reallocated_Event_Count failed at FAILING_NOW, Reallocated_Event_Count is non-zero (2047)
- [/dev/sdb] - Reported_Uncorrect is non-zero (1), Current_Pending_Sector is non-zero (1), Offline_Uncorrectable is non-zero (1)
- [/dev/sda] - Device is clean
- [/dev/sdd] - Device is clean
i.e. sorted by priority, using the return status you'd get by checking each drive individually.
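The ordering requested above can be sketched roughly as follows. This is only an illustration in Python, not the plugin's Perl code: `SEVERITY_RANK` and `order_by_severity` are hypothetical names, the per-drive exit codes are assumed (sdb returning WARNING and sdc returning CRITICAL when checked on their own), and where UNKNOWN should rank is a guess.

```python
# Standard Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# Hypothetical ranking: lower rank is listed first (where UNKNOWN belongs is an assumption).
SEVERITY_RANK = {2: 0, 1: 1, 3: 2, 0: 3}


def order_by_severity(drives):
    """Return (device, exit_code) pairs with the worst status first.

    sorted() is stable, so drives with the same status keep discovery order.
    """
    return sorted(drives, key=lambda item: SEVERITY_RANK[item[1]])


# Discovery order as in the output above, with assumed per-drive exit codes.
drives = [("/dev/sdb", 1), ("/dev/sdc", 2), ("/dev/sda", 0), ("/dev/sdd", 0)]
ordered = order_by_severity(drives)
```

Because the sort is stable, the two clean drives stay in the order they were discovered, and only the relative position of the WARNING and CRITICAL drives changes.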
from check_smart.
The "problem" here is that the priority sorting already happens. CRITICAL drives are shown before OK drives. You can see this, as the critical drives /dev/sdb and /dev/sdc are showing up before the ok drives /dev/sda and /dev/sdd.
However check_smart does not know which critical drive (sdb or sdc) is more important or which state is more important.
You could work around this by using -w and setting higher thresholds for some defective sectors. E.g. -w Reported_Uncorrect=2,Current_Pending_Sector=2,Offline_Uncorrectable. This should then set drive sdb into OK state.
By the way: Although handy for quick checks, I'm not using -g parameter in real world monitoring, as I want each drive separately monitored and want to keep historical data of the SMART values, see https://www.claudiokuenzler.com/blog/1077/when-is-solid-state-drive-ssd-dead-analysis-crucial-mx500-1tb for example.
from check_smart.
The "problem" here is that the priority sorting already happens. CRITICAL drives are shown before OK drives. You can see this, as the critical drives /dev/sdb and /dev/sdc are showing up before the ok drives /dev/sda and /dev/sdd.
However check_smart does not know which critical drive (sdb or sdc) is more important or which state is more important.
I think it does. I'm pretty certain when I checked them individually that sdb was WARNING and sdc was CRITICAL, it's just it doesn't currently use the subtlety of that info, just the binary good/bad state.
You could work around this by using -w and setting higher thresholds for some defective sectors. E.g. -w Reported_Uncorrect=2,Current_Pending_Sector=2,Offline_Uncorrectable. This should then set drive sdb into OK state.
As above, I'm more interested in the general WARNING/CRITICAL ordering.
By the way: Although handy for quick checks, I'm not using -g parameter in real world monitoring, as I want each drive separately monitored and want to keep historical data of the SMART values, see https://www.claudiokuenzler.com/blog/1077/when-is-solid-state-drive-ssd-dead-analysis-crucial-mx500-1tb for example.
Thanks for the heads up, I figured I'd start by getting a check in place across all the machines with software RAID and hence no proper disk monitoring and go from there. I'm always rather nervous with manual config like that, as it becomes rather easy to miss a drive if a machine has more disks than expected. Fortunately for me, the data is pretty transient, so I'm really just interested in knowing the drive has, or is about to fail, rebuilding things and carrying on.
from check_smart.
As above, I'm more interested in the general WARNING/CRITICAL ordering.
Yes, this should definitely happen.
So if you do a manual check of sdb right now, is it CRITICAL or WARNING?
from check_smart.
As above, I'm more interested in the general WARNING/CRITICAL ordering.
Yes, this should definitely happen.
That's certainly what I'd like, but I don't see any code to do so currently (I'm not sure if you're saying you think it already should, or agreeing it's a feature to implement):
Lines 694 to 696 in 956f236
And e.g.:
Lines 664 to 665 in 956f236
versus
Lines 677 to 678 in 956f236
So if you do a manual check of sdb right now, is it CRITICAL or WARNING?
Yeah that works as expected:
/usr/local/bin/check_smart.pl -d /dev/sdb -i auto --selftest --ssd-lifetime; echo $?
WARNING: Drive <REDACTED> S/N <REDACTED>: Reported_Uncorrect is non-zero (5)|Raw_Read_Error_Rate=166912319 Spin_Up_Time=0 Start_Stop_Count=16 Reallocated_Sector_Ct=0 Seek_Error_Rate=346936906 Power_On_Hours=63372 Spin_Retry_Count=0 Power_Cycle_Count=16 End-to-End_Error=0 Reported_Uncorrect=5 Command_Timeout=4295041085 High_Fly_Writes=0 Airflow_Temperature_Cel=36 Temperature_Celsius=36 Hardware_ECC_Recovered=166912319 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0
1
from check_smart.
@peternewman can you please try with the 6.11 branch?
https://github.com/Napsty/check_smart/tree/6.11
How does that behave when you have both CRITICAL and WARNING drives in the same system?
from check_smart.
How does that behave when you have both CRITICAL and WARNING drives in the same system?
Thanks @Napsty. I've swapped my failed drive now unfortunately, so I would need to fake it by turning an existing warning into a critical.
I do see one big issue though:
Lines 703 to 704 in 7eecae6
You'll only ever get warning messages out, as you're not concatenating the two joins together, just setting $status_string twice!
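The referenced lines are not reproduced in this thread, so the following is only a Python illustration of the pattern being described (assigning the same variable twice instead of combining both joins), not the plugin's actual Perl. The message lists and variable names are made up for the example.

```python
# Hypothetical message lists for one drive.
critical_messages = ["Reported_Uncorrect is test critical"]
warning_messages = ["Reallocated_Sector_Ct is non-zero (25)"]

# Pattern described above: the second assignment replaces the first,
# so the critical messages never reach the output.
status_string = ", ".join(critical_messages)
status_string = ", ".join(warning_messages)
overwritten = status_string

# One possible fix: combine both lists, criticals first, then join once.
combined = ", ".join(critical_messages + warning_messages)
```

Joining the combined list once also keeps the separator consistent between the critical and warning parts of the message.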
from check_smart.
This still doesn't fix it in global mode unfortunately @Napsty. Note how /dev/sdc, where I fudged being under the threshold to generate my test critical, is listed after /dev/sdb, which only has warnings:
CRITICAL: [/dev/sdb] - [/dev/sdb] - Reallocated_Sector_Ct is non-zero (25), Reported_Uncorrect is non-zero (139), Reallocated_Event_Count is non-zero (25), Current_Pending_Sector is non-zero (55) --- [/dev/sdc] - Reported_Uncorrect is test critical[/dev/sdc] - --- [/dev/sda] - Device is clean --- [/dev/sdd] - Device is clean|
2
It does in single device mode though (N.B. I've changed to a different drive here and a different threshold), i.e. criticals are now correctly listed before warnings on a per-drive basis:
CRITICAL: Drive <redacted> S/N <redacted>: Reported_Uncorrect is test criticalReallocated_Sector_Ct is non-zero (25), Reallocated_Event_Count is non-zero (25), Current_Pending_Sector is non-zero (55)|Raw_Read_Error_Rate=65320951 Spin_Up_Time=0 Start_Stop_Count=17 Reallocated_Sector_Ct=25 Seek_Error_Rate=347559879 Power_On_Hours=63690 Spin_Retry_Count=0 Power_Cycle_Count=17 End-to-End_Error=0 Reported_Uncorrect=139 Command_Timeout=4295041085 High_Fly_Writes=0 Airflow_Temperature_Cel=38 Temperature_Celsius=38 Hardware_ECC_Recovered=65320951 Reallocated_Event_Count=25 Current_Pending_Sector=55 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0
2
I think in your current model you've got @drives_status_not_okay and @drives_status_okay, you either need to switch to a hash based model, or split @drives_status_not_okay into warning and critical.
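A hash-based model along the lines suggested above might look like this. It is a Python sketch under stated assumptions, not the plugin's Perl: `group_drives` and `render` are hypothetical helpers, the status names are plain strings, and the fallback to UNKNOWN when nothing was checked is a guess.

```python
from collections import defaultdict


def group_drives(results):
    """Group (device, status, message) tuples into one list per status."""
    buckets = defaultdict(list)
    for device, status, message in results:
        buckets[status].append(f"[{device}] - {message}")
    return buckets


def render(buckets):
    """Build a global-mode summary with the worst statuses listed first."""
    order = ("CRITICAL", "WARNING", "OK")
    # Overall status is the worst non-empty bucket (UNKNOWN fallback is an assumption).
    overall = next((status for status in order if buckets.get(status)), "UNKNOWN")
    lines = [line for status in order for line in buckets.get(status, [])]
    return f"{overall}: " + " --- ".join(lines)


results = [
    ("/dev/sdb", "WARNING", "Current_Pending_Sector is non-zero (55)"),
    ("/dev/sdc", "CRITICAL", "Reported_Uncorrect is test critical"),
    ("/dev/sda", "OK", "Device is clean"),
    ("/dev/sdd", "OK", "Device is clean"),
]
summary = render(group_drives(results))
```

Keying on the status means a further level can be added by extending `order`, rather than introducing another array.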
See also related #71 to improve the current formatting in global mode.
from check_smart.
Hi @peternewman. Can you try it with the newest check_smart.pl from the 6.11 branch please:
https://github.com/Napsty/check_smart/blob/6.11/check_smart.pl
from check_smart.
Commit 5dbacc7 now also adds an internal "notice" status for attributes appearing as "less than threshold".
Before the commit, attributes would show up in their lookup order, even when different thresholds are given:
root@server:~# ./check_smart.pl -d /dev/sg3 -i sat -w "Reallocated_Sector_Ct=500"
WARNING: Drive WD2000FYYZ-23UL S/N XXX: Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500), Reallocated_Event_Count is non-zero (36), Current_Pending_Sector is non-zero (2), Offline_Uncorrectable is non-zero (2)|Raw_Read_Error_Rate=0 Spin_Up_Time=4500 Start_Stop_Count=79 Reallocated_Sector_Ct=372 Seek_Error_Rate=0 Power_On_Hours=47246 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=35 G-Sense_Error_Rate=1 Power-Off_Retract_Count=30 Load_Cycle_Count=48 Temperature_Celsius=36 Hardware_ECC_Recovered=0 Reallocated_Event_Count=36 Current_Pending_Sector=2 Offline_Uncorrectable=2 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=2
After the commit, the "Reallocated_Sector_Ct" is moved to the end of the output:
root@server:~# ./check_smart.pl -d /dev/sg3 -i sat -w "Reallocated_Sector_Ct=500"
WARNING: Drive WD2000FYYZ-23UL S/N XXX: Reallocated_Event_Count is non-zero (36), Current_Pending_Sector is non-zero (2), Offline_Uncorrectable is non-zero (2), Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500)|Raw_Read_Error_Rate=0 Spin_Up_Time=4500 Start_Stop_Count=79 Reallocated_Sector_Ct=372 Seek_Error_Rate=0 Power_On_Hours=47247 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=35 G-Sense_Error_Rate=1 Power-Off_Retract_Count=30 Load_Cycle_Count=48 Temperature_Celsius=36 Hardware_ECC_Recovered=0 Reallocated_Event_Count=36 Current_Pending_Sector=2 Offline_Uncorrectable=2 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=2
5dbacc7 also splits the "not_okay" drives into "critical" and "warning" drives (as suggested by you). The critical (first) and warning (second) drives are then merged back together into the "not_okay" drives. This should ensure that critical drives appear first in the output.
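The "notice" ordering described above can be illustrated with a short Python sketch. This is not the code from the commit: `classify` is a hypothetical helper, and treating a value exactly equal to the threshold as a warning is an assumption.

```python
def classify(raw_values, thresholds):
    """Return messages for non-zero attributes, below-threshold notices last."""
    warnings, notices = [], []
    for name, value in raw_values.items():
        if value == 0:
            continue  # a zero raw value is not reported
        limit = thresholds.get(name)
        if limit is not None and value < limit:
            # Below the user-supplied threshold: keep it visible, but as a notice.
            notices.append(
                f"{name} is non-zero ({value}) (but less than threshold {limit})"
            )
        else:
            warnings.append(f"{name} is non-zero ({value})")
    return warnings + notices


# Raw values taken from the example output above, in attribute lookup order.
raw_values = {
    "Reallocated_Sector_Ct": 372,
    "Reallocated_Event_Count": 36,
    "Current_Pending_Sector": 2,
    "Offline_Uncorrectable": 2,
}
messages = classify(raw_values, {"Reallocated_Sector_Ct": 500})
```

With these inputs the Reallocated_Sector_Ct message moves to the end of the list, matching the order shown in the "after" output.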
from check_smart.
Fixed in #72
from check_smart.