I set up check_smart on a system which was already sick and had the following statuses:

Prioritise output by criticality (check_smart issue, CLOSED)

napsty commented on May 30, 2024

Prioritise output by criticality

from check_smart.

Comments (14)

Napsty commented on May 30, 2024

Thx for pointing that out. Should be fixed now with commit d3a85e9

Napsty commented on May 30, 2024

Hi @peternewman
Could you please share the current usage and output? I assume you're using -g?

peternewman commented on May 30, 2024

Yes I am. It'll be a little while until I can do so, and I've replaced the failed drive, but I should be able to get the historic output from Nagios.

It was correctly reporting the overall status of the check as Critical, and listing the associated faults and level with each drive, but it was listing as sdb, sdc, sda (i.e. broken first, but just by letter/discovery order, not respective criticality).

peternewman commented on May 30, 2024

Usage was:
check_smart.pl -g '/dev/sd[a-z]' -i auto --selftest --ssd-lifetime

Output was:
CRITICAL: [/dev/sdb] - Reported_Uncorrect is non-zero (1), Current_Pending_Sector is non-zero (1), Offline_Uncorrectable is non-zero (1) --- [/dev/sdc] - Health status: FAILED!, Attribute Raw_Read_Error_Rate failed at In_the_past, Attribute Reallocated_Sector_Ct failed at FAILING_NOW, Reallocated_Sector_Ct is non-zero (2047), Reported_Uncorrect is non-zero (133), Attribute Reallocated_Event_Count failed at FAILING_NOW, Reallocated_Event_Count is non-zero (2047) --- [/dev/sda] - Device is clean --- [/dev/sdd] - Device is clean

Reformatted into bullets to make my point clearer, it was like this:

CRITICAL:
[/dev/sdb] - Reported_Uncorrect is non-zero (1), Current_Pending_Sector is non-zero (1), Offline_Uncorrectable is non-zero (1)
[/dev/sdc] - Health status: FAILED!, Attribute Raw_Read_Error_Rate failed at In_the_past, Attribute Reallocated_Sector_Ct failed at FAILING_NOW, Reallocated_Sector_Ct is non-zero (2047), Reported_Uncorrect is non-zero (133), Attribute Reallocated_Event_Count failed at FAILING_NOW, Reallocated_Event_Count is non-zero (2047)
[/dev/sda] - Device is clean
[/dev/sdd] - Device is clean

What I wanted was:

CRITICAL:
[/dev/sdc] - Health status: FAILED!, Attribute Raw_Read_Error_Rate failed at In_the_past, Attribute Reallocated_Sector_Ct failed at FAILING_NOW, Reallocated_Sector_Ct is non-zero (2047), Reported_Uncorrect is non-zero (133), Attribute Reallocated_Event_Count failed at FAILING_NOW, Reallocated_Event_Count is non-zero (2047)
[/dev/sdb] - Reported_Uncorrect is non-zero (1), Current_Pending_Sector is non-zero (1), Offline_Uncorrectable is non-zero (1)
[/dev/sda] - Device is clean
[/dev/sdd] - Device is clean

i.e. sorted (prioritised) by the return status you'd get by checking each drive individually.
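
A minimal sketch of the ordering requested above, assuming each drive already has its own status available. The `%rank` table and the `@drives` list are hypothetical and only illustrate the idea; they are not taken from check_smart.pl.

```perl
use strict;
use warnings;

# Hypothetical ranking: lower number means "print earlier".
my %rank = (CRITICAL => 0, WARNING => 1, OK => 2);

# Illustrative per-drive results, as if each drive had been checked individually.
my @drives = (
    { dev => '/dev/sda', status => 'OK' },
    { dev => '/dev/sdb', status => 'WARNING' },
    { dev => '/dev/sdc', status => 'CRITICAL' },
    { dev => '/dev/sdd', status => 'OK' },
);

# Sort by severity first, then by device name so ties keep a predictable order.
my @sorted = sort {
    $rank{ $a->{status} } <=> $rank{ $b->{status} }
        or $a->{dev} cmp $b->{dev}
} @drives;

my $order = join(' ', map { $_->{dev} } @sorted);
print "$order\n";
```

With the sample data this lists sdc first, then sdb, then the two clean drives.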

Napsty commented on May 30, 2024

The "problem" here is that the priority sorting already happens. CRITICAL drives are shown before OK drives. You can see this, as the critical drives /dev/sdb and /dev/sdc are showing up before the ok drives /dev/sda and /dev/sdd.
However check_smart does not know which critical drive (sdb or sdc) is more important or which state is more important.

You could work around this by using -w and setting higher thresholds for some defective sectors. E.g. -w Reported_Uncorrect=2,Current_Pending_Sector=2,Offline_Uncorrectable. This should then set drive sdb into OK state.

By the way: Although handy for quick checks, I'm not using -g parameter in real world monitoring, as I want each drive separately monitored and want to keep historical data of the SMART values, see https://www.claudiokuenzler.com/blog/1077/when-is-solid-state-drive-ssd-dead-analysis-crucial-mx500-1tb for example.

peternewman commented on May 30, 2024

> The "problem" here is that the priority sorting already happens. CRITICAL drives are shown before OK drives. You can see this, as the critical drives /dev/sdb and /dev/sdc are showing up before the ok drives /dev/sda and /dev/sdd.
> However check_smart does not know which critical drive (sdb or sdc) is more important or which state is more important.

I think it does. I'm pretty certain when I checked them individually that sdb was WARNING and sdc was CRITICAL, it's just it doesn't currently use the subtlety of that info, just the binary good/bad state.

> You could work around this by using -w and setting higher thresholds for some defective sectors. E.g. -w Reported_Uncorrect=2,Current_Pending_Sector=2,Offline_Uncorrectable. This should then set drive sdb into OK state.

As above, I'm more interested in the general WARNING/CRITICAL ordering.

> By the way: Although handy for quick checks, I'm not using -g parameter in real world monitoring, as I want each drive separately monitored and want to keep historical data of the SMART values, see https://www.claudiokuenzler.com/blog/1077/when-is-solid-state-drive-ssd-dead-analysis-crucial-mx500-1tb for example.

Thanks for the heads up. I figured I'd start by getting a check in place across all the machines with software RAID (and hence no proper disk monitoring) and go from there. I'm always rather nervous with manual config like that, as it becomes rather easy to miss a drive if a machine has more disks than expected. Fortunately for me, the data is pretty transient, so I'm really just interested in knowing that a drive has failed, or is about to fail, then rebuilding things and carrying on.

Napsty commented on May 30, 2024

> As above, I'm more interested in the general WARNING/CRITICAL ordering.

Yes, this should definitely happen.

So if you do a manual check of sdb right now, is it CRITICAL or WARNING?

peternewman commented on May 30, 2024

> > As above, I'm more interested in the general WARNING/CRITICAL ordering.
>
> Yes, this should definitely happen.

That's certainly what I'd like, but I don't see any code to do so currently (I'm not sure if you're saying you think it already should, or agreeing it's a feature to implement):

check_smart/check_smart.pl

Lines 694 to 696 in 956f236

    if ($opt_g) {
        $status_string = $label.join(', ', @error_messages);
    }

And e.g.:

check_smart/check_smart.pl

Lines 664 to 665 in 956f236

    push(@error_messages, 'Disk temperature is higher than maximum');
    escalate_status('CRITICAL');

versus

check_smart/check_smart.pl

Lines 677 to 678 in 956f236

    push(@error_messages, 'Disk start_stop is higher than maximum');
    escalate_status('WARNING');

> So if you do a manual check of sdb right now, is it CRITICAL or WARNING?

Yeah that works as expected:

/usr/local/bin/check_smart.pl -d /dev/sdb -i auto --selftest --ssd-lifetime; echo $?
WARNING: Drive  <REDACTED> S/N <REDACTED>:  Reported_Uncorrect is non-zero (5)|Raw_Read_Error_Rate=166912319 Spin_Up_Time=0 Start_Stop_Count=16 Reallocated_Sector_Ct=0 Seek_Error_Rate=346936906 Power_On_Hours=63372 Spin_Retry_Count=0 Power_Cycle_Count=16 End-to-End_Error=0 Reported_Uncorrect=5 Command_Timeout=4295041085 High_Fly_Writes=0 Airflow_Temperature_Cel=36 Temperature_Celsius=36 Hardware_ECC_Recovered=166912319 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0
1
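
For context, the `1` echoed above is the Nagios plugin exit code for WARNING (0 is OK, 2 is CRITICAL). The excerpts call `escalate_status`; the following is only a sketch of how such a helper can keep the highest severity seen so far. It is not the actual check_smart.pl implementation: `%severity` and the file-level `$exit_status` are assumptions made for illustration, and UNKNOWN is deliberately left out.

```perl
use strict;
use warnings;

# Assumed ranking for this sketch only: OK < WARNING < CRITICAL.
# The numbers match the Nagios plugin exit codes for these three states.
my %severity = (OK => 0, WARNING => 1, CRITICAL => 2);

my $exit_status = 'OK';

# Hypothetical helper: raise the overall status, never lower it.
sub escalate_status {
    my ($requested) = @_;
    $exit_status = $requested
        if $severity{$requested} > $severity{$exit_status};
    return $exit_status;
}

escalate_status('WARNING');   # e.g. 'Disk start_stop is higher than maximum'
escalate_status('CRITICAL');  # e.g. 'Disk temperature is higher than maximum'
escalate_status('WARNING');   # a later WARNING must not downgrade the result

print "$exit_status (exit code $severity{$exit_status})\n";
```

The point is that the per-drive level is known at the time each message is pushed, so it could in principle be kept for sorting later.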

Napsty commented on May 30, 2024

@peternewman can you please try with the 6.11 branch?
https://github.com/Napsty/check_smart/tree/6.11
How does that behave when you have both CRITICAL and WARNING drives in the same system?

peternewman commented on May 30, 2024

> How does that behave when you have both CRITICAL and WARNING drives in the same system?

Thanks @Napsty . I've swapped my failed drive now unfortunately, so would need to fake it by making an existing warning a critical.

I do see one big issue though:

check_smart/check_smart.pl

Lines 703 to 704 in 7eecae6

    $status_string = join(', ', @error_messages);
    $status_string = join(', ', @warning_messages);

You'll only ever get warning messages out, as you're not concatenating the two joins together, just setting $status_string twice!
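
One possible way to avoid the overwrite, sketched with made-up sample messages: pass both arrays to a single `join`, errors first. This is an illustration of the idea and not necessarily how the branch was eventually fixed.

```perl
use strict;
use warnings;

# Sample data for illustration only.
my @error_messages   = ('Reported_Uncorrect is test critical');
my @warning_messages = (
    'Reallocated_Sector_Ct is non-zero (25)',
    'Current_Pending_Sector is non-zero (55)',
);

# A single join over both lists: errors come first, and the separator is
# only placed between items, so an empty list leaves no stray ', '.
my $status_string = join(', ', @error_messages, @warning_messages);

print "$status_string\n";
```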

peternewman commented on May 30, 2024

This still doesn't fix it in global mode unfortunately @Napsty . Note how /dev/sdc where I fudged being under the threshold to generate my test critical is listed after /dev/sdb which only has warnings:

CRITICAL: [/dev/sdb] - [/dev/sdb] - Reallocated_Sector_Ct is non-zero (25), Reported_Uncorrect is non-zero (139), Reallocated_Event_Count is non-zero (25), Current_Pending_Sector is non-zero (55) --- [/dev/sdc] - Reported_Uncorrect is test critical[/dev/sdc] -  --- [/dev/sda] - Device is clean --- [/dev/sdd] - Device is clean|
2

It does in single device mode though (N.B. I've changed to a different drive here and a different threshold), i.e. criticals are now correctly listed before warnings on a per drive basis:

CRITICAL: Drive  <redacted> S/N <redacted>:  Reported_Uncorrect is test criticalReallocated_Sector_Ct is non-zero (25), Reallocated_Event_Count is non-zero (25), Current_Pending_Sector is non-zero (55)|Raw_Read_Error_Rate=65320951 Spin_Up_Time=0 Start_Stop_Count=17 Reallocated_Sector_Ct=25 Seek_Error_Rate=347559879 Power_On_Hours=63690 Spin_Retry_Count=0 Power_Cycle_Count=17 End-to-End_Error=0 Reported_Uncorrect=139 Command_Timeout=4295041085 High_Fly_Writes=0 Airflow_Temperature_Cel=38 Temperature_Celsius=38 Hardware_ECC_Recovered=65320951 Reallocated_Event_Count=25 Current_Pending_Sector=55 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0
2

I think in your current model you've got @drives_status_not_okay and @drives_status_okay, you either need to switch to a hash based model, or split @drives_status_not_okay into warning and critical.

See also related #71 to improve the current formatting in global mode.
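
To make the suggestion above concrete, here is a rough sketch of the hash-based variant with one bucket per status. The names `%drive_status` and `%bucket` are hypothetical and do not exist in check_smart.pl; the statuses are sample data.

```perl
use strict;
use warnings;

# Hypothetical hash-based model: device => status from its individual check.
my %drive_status = (
    '/dev/sda' => 'OK',
    '/dev/sdb' => 'WARNING',
    '/dev/sdc' => 'CRITICAL',
    '/dev/sdd' => 'OK',
);

# One bucket per status instead of a single "not okay" list.
my %bucket = (CRITICAL => [], WARNING => [], OK => []);
for my $dev (sort keys %drive_status) {
    push @{ $bucket{ $drive_status{$dev} } }, $dev;
}

# Emit the buckets in order of criticality.
my @output_order = map { @{ $bucket{$_} } } qw(CRITICAL WARNING OK);

print join(' --- ', @output_order), "\n";
```

Splitting `@drives_status_not_okay` into two arrays would amount to the same thing with the CRITICAL and WARNING buckets as separate variables.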

Napsty commented on May 30, 2024

Hi @peternewman . Can you try it with the newest check_smart.pl from the 6.11 branch please:
https://github.com/Napsty/check_smart/blob/6.11/check_smart.pl

Napsty commented on May 30, 2024

Commit 5dbacc7 now also adds an internal "notice" status for attributes appearing as "less than threshold".

Before the commit, attributes would show up in their lookup order, even when different thresholds are given:

root@server:~# ./check_smart.pl -d /dev/sg3 -i sat -w "Reallocated_Sector_Ct=500"
WARNING: Drive  WD2000FYYZ-23UL S/N XXX:  Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500), Reallocated_Event_Count is non-zero (36), Current_Pending_Sector is non-zero (2), Offline_Uncorrectable is non-zero (2)|Raw_Read_Error_Rate=0 Spin_Up_Time=4500 Start_Stop_Count=79 Reallocated_Sector_Ct=372 Seek_Error_Rate=0 Power_On_Hours=47246 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=35 G-Sense_Error_Rate=1 Power-Off_Retract_Count=30 Load_Cycle_Count=48 Temperature_Celsius=36 Hardware_ECC_Recovered=0 Reallocated_Event_Count=36 Current_Pending_Sector=2 Offline_Uncorrectable=2 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=2

After the commit, the "Reallocated_Sector_Ct" is moved to the end of the output:

root@server:~# ./check_smart.pl -d /dev/sg3 -i sat -w "Reallocated_Sector_Ct=500" 
WARNING: Drive  WD2000FYYZ-23UL S/N XXX:  Reallocated_Event_Count is non-zero (36), Current_Pending_Sector is non-zero (2), Offline_Uncorrectable is non-zero (2), Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500)|Raw_Read_Error_Rate=0 Spin_Up_Time=4500 Start_Stop_Count=79 Reallocated_Sector_Ct=372 Seek_Error_Rate=0 Power_On_Hours=47247 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=35 G-Sense_Error_Rate=1 Power-Off_Retract_Count=30 Load_Cycle_Count=48 Temperature_Celsius=36 Hardware_ECC_Recovered=0 Reallocated_Event_Count=36 Current_Pending_Sector=2 Offline_Uncorrectable=2 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=2

5dbacc7 also splits the "not_okay" drives into "critical" and "warning" drives (as suggested by you). The critical (first) and warning (second) drives are then merged together into the "not_okay" drives. This should ensure that critical drives appear first in the output.
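
The "notice" reordering described above can be pictured roughly like this. It is a sketch under assumptions, not the code from the commit: the `level` field and the `@messages` structure are invented for illustration, and the message texts are copied from the example output.

```perl
use strict;
use warnings;

# Each message carries an assumed level; 'notice' marks a value that is
# non-zero but still below the threshold given with -w.
my @messages = (
    { level => 'notice',  text => 'Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500)' },
    { level => 'warning', text => 'Reallocated_Event_Count is non-zero (36)' },
    { level => 'warning', text => 'Current_Pending_Sector is non-zero (2)' },
    { level => 'warning', text => 'Offline_Uncorrectable is non-zero (2)' },
);

# Partition with two greps: everything that is not a notice keeps its
# lookup order, and notices are appended at the end.
my @ordered = (
    ( grep { $_->{level} ne 'notice' } @messages ),
    ( grep { $_->{level} eq 'notice' } @messages ),
);

my $status_string = join(', ', map { $_->{text} } @ordered);
print "$status_string\n";
```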

Napsty commented on May 30, 2024

Fixed in #72
