napsty / check_smart

Monitoring Plugin to check hard drives, solid state drives and NVMe drives using SMART

Home Page: https://www.claudiokuenzler.com/monitoring-plugins/check_smart.php

License: GNU General Public License v3.0

Perl 100.00%

nagios-plugins monitoring-plugins scsi disk monitoring smart cciss intel-raid sata sas

check_smart's Introduction

check_smart monitoring plugin

Full and up to date documentation

Please go to https://www.claudiokuenzler.com/monitoring-plugins/check_smart.php for complete and up-to-date documentation, including the changelog, extensive usage examples and monitoring configurations (for Nagios, Icinga 1, Icinga 2, Shinken and Naemon).

Introduction

This is a plugin to monitor the health and SMART attribute values of hard disk drives (HDD), solid state drives (SSD) and NVMe drives. The plugin is a fork of check_smart, released in 2009 by Kurt Yoder. Since then the plugin has undergone a lot of changes. It can monitor drives behind hardware controllers, and it has gained many parameters to fine-tune the checks and set thresholds (on a per-attribute basis).

Sudoers entry

This plugin needs to run as root, otherwise it cannot launch smartctl correctly. You have two options:

Launch the plugin itself as root with sudo
Launch the plugin as the Nagios user and the smartctl command as root with sudo

Entry in sudoers (of course adapt your paths if necessary):

nagios   ALL = NOPASSWD: /usr/lib/nagios/plugins/check_smart.pl    # for option 1
nagios   ALL = NOPASSWD: /usr/local/sbin/smartctl                  # for option 2

Successful tests/examples

SATA disk behind MDRaid (Software Raid) on Linux:

/usr/lib/nagios/plugins/check_smart.pl -d /dev/sda -i ata

MegaRAID on Linux:

/usr/lib/nagios/plugins/check_smart.pl -d /dev/sda -i megaraid,8

Intel RAID on FreeBSD 9.2 ("kldload mfip.ko" required):

/usr/local/libexec/nagios/check_smart.pl -d /dev/pass0 -i scsi

SATA drives behind Intel RAID on FreeBSD 9.2 ("kldload mfip.ko" required):

/usr/local/libexec/nagios/check_smart.pl -d /dev/pass12 -i sat

SCSI drives behind HP RAID (CCISS) on FreeBSD 6.0:

/usr/local/libexec/nagios/check_smart.pl -d /dev/ciss0 -i cciss,0
OK: no SMART errors detected|defect_list=0 sent_blocks=3093462752 temperature=24;;68

/usr/local/libexec/nagios/check_smart.pl -d /dev/ciss0 -i cciss,3
WARNING: 48 Elements in grown defect list | defect_list=48 sent_blocks=1137657348 temperature=22;;68

Using threshold option (-b) to ignore 1 bad element, warning only when 2 bad elements are found:

/usr/local/libexec/nagios/check_smart.pl -d /dev/ciss0 -i cciss,1 -b 2
OK: 1 Elements in grown defect list (but less than threshold 2)|defect_list=1;2;2;; sent_blocks=2769458900762624 temperature=27;;65

SCSI drives behind HP RAID (CCISS) on Linux (Ubuntu hardy):

/usr/lib/nagios/plugins/check_smart.pl -d /dev/cciss/c0d0 -i cciss,0        
OK: no SMART errors detected. |

Check all SATA disks (sda - sdz) at the same time on Linux:

/usr/lib/nagios/plugins/check_smart.pl -g "/dev/sd[a-z]" -i ata        
OK: [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean|

Check all SCSI disks behind Intel RAID on FreeBSD 9.2 ("kldload mfip.ko" required):

/usr/local/libexec/nagios/check_smart.pl -g /dev/pass[1-9] -i scsi
OK: [/dev/pass0] - Device is clean --- [/dev/pass1] - Device is clean --- [/dev/pass2] - Device is clean --- [/dev/pass3] - Device is clean --- [/dev/pass4] - Device is clean --- [/dev/pass5] - Device is clean --- [/dev/pass6] - Device is clean --- [/dev/pass7] - Device is clean --- [/dev/pass8] - Device is clean --- [/dev/pass9] - Device is clean |

Single SCSI drive on FreeBSD 10.1:

/usr/local/libexec/nagios/check_smart.pl -d /dev/da0 -i scsi
OK: no SMART errors detected. |sent_blocks=14067306 temperature=34;;60

Single NVMe drive on Linux:

/usr/lib/nagios/plugins/check_smart.pl -d /dev/nvme0 -i nvme
OK: Drive Samsung SSD 970 PRO 512GB S/N XXXXXXXXXXXXXXX: no SMART errors detected. |Temperature=34 Available_Spare=100 Available_Spare_Threshold=10 Percentage_Used=0 Data_Units_Read=2854 Data_Units_Written=107590 Host_Read_Commands=67150 Host_Write_Commands=1406316 Controller_Busy_Time=20 Power_Cycles=16 Power_On_Hours=105 Unsafe_Shutdowns=6 Media_and_Data_Integrity_Errors=0 Error_Information_Log_Entries=0 Warning__Comp._Temperature_Time=0 Critical_Comp._Temperature_Time=0 Temperature_Sensor_1=34 Temperature_Sensor_2=33

See https://www.claudiokuenzler.com/monitoring-plugins/check_smart.php for more examples.

check_smart's Issues

No performance data on NVMe drive

root@bullseye:~# /usr/lib/nagios/plugins/check_smart -d /dev/nvme1n1 -i nvme
OK: Drive  UCS-SDHPCIE 800GB S/N XXX: no SMART errors detected. |=0x00 =42 =100 =10 =0 =242 =2913064 =12586 =13282120 =26 =57 =4124 =44 =0 =0

Performance data is there - but the keys are empty
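The symptom can be checked mechanically. The following is a minimal sketch in Python, not code from the plugin: it splits a plugin output line at the pipe and counts performance data entries whose label is empty. The function names `parse_perfdata` and `empty_labels` are hypothetical, and the sketch assumes labels contain no spaces.

```python
def parse_perfdata(plugin_output):
    """Split plugin output at the first '|' and return (label, value) pairs."""
    _, _, perf = plugin_output.partition("|")
    pairs = []
    for token in perf.split():
        label, sep, value = token.partition("=")
        if sep:  # ignore tokens that contain no '='
            pairs.append((label, value))
    return pairs


def empty_labels(plugin_output):
    """Count performance data entries whose label is missing."""
    return sum(1 for label, _ in parse_perfdata(plugin_output) if not label)


# Shortened version of the output reported in this issue.
sample = "OK: Drive  UCS-SDHPCIE 800GB S/N XXX: no SMART errors detected. |=0x00 =42 =100"
print(empty_labels(sample))
```

A non-zero count indicates that values were collected but the attribute names were lost somewhere before the output was assembled.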

Prioritise output by criticality

I setup check_smart on a system which was already sick and had the following statuses:
sda - Okay
sdb - Warning - unrecoverable errors
sdc - Critical - due to die soon

It would be nice if the output listed sdc, sdb, sda so you know what to prioritise.

I had a quick look, and I think something like adding to a hash of arrays based on the local level, then joining them back up would do the trick, but didn't get a chance to implement it at the time.
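The "hash of arrays" idea could look roughly like the sketch below. It is written in Python for illustration only and is not taken from the plugin; `SEVERITY_ORDER`, `order_by_severity` and the tuple layout are assumptions.

```python
from collections import defaultdict

# Assumed ordering, most urgent first.
SEVERITY_ORDER = ["CRITICAL", "WARNING", "UNKNOWN", "OK"]


def order_by_severity(results):
    """results: iterable of (device, status, message) tuples.

    Statuses outside SEVERITY_ORDER are dropped in this sketch.
    """
    buckets = defaultdict(list)
    for device, status, message in results:
        buckets[status].append(f"[{device}] - {message}")
    ordered = []
    for status in SEVERITY_ORDER:
        ordered.extend(buckets[status])
    return " --- ".join(ordered)


results = [
    ("/dev/sda", "OK", "Device is clean"),
    ("/dev/sdb", "WARNING", "unrecoverable errors"),
    ("/dev/sdc", "CRITICAL", "due to die soon"),
]
print(order_by_severity(results))
```

Within one severity level the original device order is preserved, so the output stays predictable between runs.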

Wear_Leveling_Count is not reported as CRIT when disk is almost dead

We found some disks that are almost dead: the normalized Wear_Leveling_Count value is 1. The raw value is probably useful only for vendor software; see:
https://superuser.com/questions/1037644/samsung-ssd-wear-leveling-count-meaning

The disk is
Device Model: Samsung SSD 850 EVO 500GB

and the check is reporting OK :(

OK: Drive Samsung SSD 850 EVO 500GB S/N S3R3NF0J: no SMART errors detected. |Reallocated_Sector_Ct=0 Power_On_Hours=20837 Power_Cycle_Count=2 Wear_Leveling_Count=2981 Used_Rsvd_Blk_Cnt_Tot=0 Program_Fail_Cnt_Total=0 Erase_Fail_Count_Total=0 Runtime_Bad_Block=0 Uncorrectable_Error_Cnt=0 Airflow_Temperature_Cel=24 ECC_Error_Rate=0 CRC_Error_Count=0 POR_Recovery_Count=0 Total_LBAs_Written=572549452098

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       20837
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       2
177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       2981
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   099   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   076   060   000    Old_age   Always       -       24
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       572549492484
SMART Error Log Version: 1
No Errors Logged
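To illustrate what a check on the normalized VALUE column (rather than RAW_VALUE) might look like, here is a hedged Python sketch. It is not the plugin's implementation; the regular expression, the function name `low_normalized` and the floor of 5 are assumptions chosen for this example.

```python
import re

# Matches one row of the ATA attribute table printed by "smartctl -A".
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>.+)$"
)


def low_normalized(lines, names, floor):
    """Return {name: value} for listed attributes whose normalized VALUE <= floor."""
    hits = {}
    for line in lines:
        m = ROW.match(line)
        if m and m["name"] in names and int(m["value"]) <= floor:
            hits[m["name"]] = int(m["value"])
    return hits


rows = [
    "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0",
    "177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       2981",
]
# The floor of 5 is an arbitrary illustrative threshold.
print(low_normalized(rows, {"Wear_Leveling_Count"}, 5))
```

Because the drive reports a THRESH of 000 for this attribute, smartctl itself never flags it as failed, which is why a separate floor would be needed.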

check_smart.pl very slow on AlmaLinux 8

Hi,

When executing check_smart.pl on an AL8 server it takes 1 minute 30 seconds:

[root@dedicado]# cat /etc/redhat-release
AlmaLinux release 8.7 (Stone Smilodon)

[root@dedicado]# ./check_smart.pl -v
check_smart.pl v6.13.0
The monitoring plugins come with ABSOLUTELY NO WARRANTY. You may redistribute
copies of the plugins under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING.

[root@dedicado]# time ./check_smart.pl -d /dev/sda -i auto
OK: Drive  WDC WDS120G2G0A-00JH30 S/N 204806802724: no SMART errors detected. |Reallocated_Sector_Ct=0 Power_On_Hours=10739 Power_Cycle_Count=23 Block_Erase_Count=1258 Minimum_PE_Cycles_TLC=4 Max_Bad_Blocks_per_Die=0 Maximum_PE_Cycles_TLC=10 Total_Bad_Blocks=76 Grown_Bad_Blocks=0 Program_Fail_Count=0 Erase_Fail_Count=0 Average_PE_Cycles_TLC=4 Unexpected_Power_Loss=15 End-to-End_Error=0 Reported_Uncorrect=0 Command_Timeout=0 Temperature_Celsius=36 UDMA_CRC_Error_Count=0 Media_Wearout_Indicator=0 Available_Reservd_Space=100 NAND_GB_Written_TLC=483 NAND_GB_Written_SLC=7004 Host_Writes_GiB=3056 Host_Reads_GiB=163 Temp_Throttle_Status=0

real    1m30.962s
user    0m0.116s
sys     0m0.029s

In CentOS 7 it works OK.

Thanks,
Ignacio

Threshold/attribute naming inconsistency

This is a minor issue, but the word "count" is used in at least three different variants now:
Reallocated_Sector_Ct
Uncorrectable_Error_Cnt
Reallocated_Event_Count

Request: Auto-detect and count all drives on the system

Thanks for the great Nagios script. It is the best I have found!

It would be VERY helpful for a lot of people if you added a parameter to auto-detect and count all drives (NVMe, SSD, HDD) on the system, whether directly connected or behind a RAID controller.

For example:

Currently, if I want to check the health of my 4 SSDs connected via SATA to the motherboard, I have to run:

/usr/local/nagios/libexec/smart.pl -g '/dev/sd[c-e]' -i auto

And I need to add a separate check for my 7 disks that are connected to my HP Smart Array Controller:

/usr/local/nagios/libexec/smart.pl -g '/dev/sda' -i 'cciss,[0-6]'

The idea is to unify this functionality so that the script can be called without specifying the number of disks or the type of RAID controller (RAID detection could use 'lspci | grep -i raid', for example).

Then only a comma-separated list of thresholds would be needed as a parameter.

Bug with -g parameter on FreeBSD

Commit 06d655d introduced an issue on FreeBSD systems: drives named /dev/da[0-9] are no longer detected:

/usr/local/libexec/nagios/check_smart.pl -g /dev/da -i scsi
Could not find any valid block/character special device for pattern /dev/da !

The regex needs to be changed so that it looks up not only letters ([a-z]) but also numbers (for example, /dev/da2).
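The difference between the two kinds of pattern can be shown with a small Python sketch. Both expressions are hypothetical stand-ins and are not the plugin's actual regex; they only contrast a letters-only device suffix with one that also accepts digits.

```python
import re

# Hypothetical "before": the device name may only consist of letters.
LETTERS_ONLY = re.compile(r"^/dev/[a-z]+$")

# Hypothetical "after": a leading letter followed by letters or digits.
LETTERS_AND_DIGITS = re.compile(r"^/dev/[a-z][a-z0-9]*$")

for dev in ["/dev/sda", "/dev/da2", "/dev/pass12"]:
    print(dev, bool(LETTERS_ONLY.match(dev)), bool(LETTERS_AND_DIGITS.match(dev)))
```

Neither pattern covers nested paths such as /dev/cciss/c0d0; that would need a further extension.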

Kingston ssd wearout not detected

Hello,

Some SSD disks unexpectedly failed, and we have noticed this difference between good devices and failing devices:

good:

169 Remaining_Lifetime_Perc 0x0000   093   093   000    Old_age   Offline      -       93

bad:

169 Remaining_Lifetime_Perc 0x0000   000   000   000    Old_age   Offline      -       0

So this probably needs to be added to a list of values to check.

Samuel

Is it possible to hide results like "No health status line found"

Hi there,

I am running this script to check all available hard disks for SMART errors. I got this working; however, it is showing multiple "No health status line found" results in the Nagios GUI.

Is there a way to hide this from the output?

Here is what the output currently looks like in the Nagios GUI:
UNKNOWN: [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - No health status line found --- [/dev/sda] - No health status line found --- [/dev/sda] - No health status line found --- [/dev/sda] - No health status line found --- [/dev/sda] - No health status line found --- [/dev/sda] - No health status line found

Would be great to hide those "No health status line" results.

Add special monitoring on SSD attribute 202 (Percent_Lifetime_Remain)

Some SSDs (not all) expose attribute 202 (Percent_Lifetime_Remain), which could be monitored additionally.
Here's an SSD which reached a value of 99 (99% used) and which is now in FAILING_NOW state.

# smartctl -q noserial -a /dev/sda 
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-14-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     CT1000MX500SSD1
Firmware Version: M3CR010
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Mar  2 17:13:33 2021 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  30) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x0031)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   099   099   010    Old_age   Always       -       12
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       10954
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       19
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   099   099   000    Old_age   Always       -       12
173 Ave_Block-Erase_Count   0x0032   001   001   000    Old_age   Always       -       1485
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       16
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       42
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   058   032   000    Old_age   Always       -       42 (Min/Max 0/68)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       12
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       5
202 Percent_Lifetime_Remain 0x0030   001   001   001    Old_age   Offline  FAILING_NOW 99
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       29598563744
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       22433769966
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       39882762744

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 3

ATA Error Count: 0
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -2 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 ec 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:00.000  IDENTIFY DEVICE
  c8 00 00 00 00 00 00 00      00:00:00.000  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      9976         -
# 2  Short offline       Completed without error       00%      8677         -
# 3  Extended offline    Completed without error       00%      8647         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Add a way to ignore temperature failures of the past

We care about the current state of the disk and use it to determine whether it is failing now or will fail in the future. We don't care if the temperature once rose above the disk's internal thresholds.
Use this commit as an idea: papercapp@9a7d7bf. But don't hardcode it; use a switch or an ignore list instead.

New release / tag

Thanks @Napsty for doing such great work. It would be great if you could also release/tag new versions on GitHub. The last tag is from 5 May 2014, and looking at the plugin it seems you have bumped the version number a couple of times since then.

Thanks, Jan.

Use of uninitialized value $device in string eq

When running the plugin without any parameters it will show a perl warning:

# /usr/lib/nagios/plugins/check_smart
Use of uninitialized value $device in string eq at /usr/lib/nagios/plugins/check_smart line 143.
must specify a device!

check_smart v$Revision: 5.11 $ (monitoring-plugins 2.2)

The Monitoring Plugins come with ABSOLUTELY NO WARRANTY. You may redistribute
copies of the plugins under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING.
[...]

ATA read errors

I have 2 ATA disks, this one looks ok:

OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Throughput_Performance=82 Spin_Up_Time=0 Start_Stop_Count=4 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=27 Power_On_Hours=5690 Spin_Retry_Count=0 Power_Cycle_Count=4 Power-Off_Retract_Count=4 Load_Cycle_Count=4 Temperature_Celsius=34 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

and the other one has a pretty high Raw_Read_Error_Rate, though no warning is shown.

OK: no SMART errors detected. |Raw_Read_Error_Rate=3080195 Throughput_Performance=78 Spin_Up_Time=0 Start_Stop_Count=4 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=26 Power_On_Hours=5690 Spin_Retry_Count=0 Power_Cycle_Count=4 Power-Off_Retract_Count=4 Load_Cycle_Count=4 Temperature_Celsius=37 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

smartctl -a shows some errors as well:

Error 1 occurred at disk power-on lifetime: 4281 hours (178 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 f8 9e 19 00  Error: WP at LBA = 0x00199ef8 = 1679096

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 b0 c0 78 10 f1 40 00   3d+18:58:53.439  WRITE FPDMA QUEUED
  60 10 b8 f8 9e 19 40 00   3d+18:58:53.414  READ FPDMA QUEUED
  60 10 b0 70 87 19 40 00   3d+18:58:53.414  READ FPDMA QUEUED
  60 10 a8 c8 86 19 40 00   3d+18:58:53.409  READ FPDMA QUEUED
  ea 00 00 00 00 00 a0 00   3d+18:58:53.386  FLUSH CACHE EXT

Is it possible to detect this type of error? The whole output is here: http://paste.ubuntu.com/12078592/

Call smartctl only once

check_smart performs three checks and launches the smartctl command separately for each of them, three times in total. Consider running smartctl only once and evaluating all three checks from that single run.
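One way to approach this, sketched in Python rather than in the plugin's own Perl, is to request health (-H), identity (-i) and attributes (-A) in a single smartctl invocation and parse everything from that one output. The function names are hypothetical, the parser only handles ATA-style output, and the sample text is pieced together from lines quoted elsewhere on this page.

```python
import re


def smartctl_command(device, interface):
    """Build one invocation requesting health, identity and attributes."""
    return ["smartctl", "-d", interface, "-H", "-i", "-A", device]


def parse_once(output):
    """Extract health, model and raw attribute values from one smartctl output."""
    info = {"health": None, "model": None, "attributes": {}}
    row = re.compile(
        r"(\d+)\s+(\S+)\s+0x[0-9a-f]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
    )
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("SMART overall-health self-assessment test result:"):
            info["health"] = line.split(":", 1)[1].strip()
        elif line.startswith("Device Model:"):
            info["model"] = line.split(":", 1)[1].strip()
        else:
            m = row.match(line)
            if m:  # attribute table row: keep name and leading raw number
                info["attributes"][m.group(2)] = int(m.group(3))
    return info


sample = """Device Model:     Crucial_CT275MX300SSD1
SMART overall-health self-assessment test result: FAILED!
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       25718
"""
print(smartctl_command("/dev/sdb", "ata"))
print(parse_once(sample))
```

The exit code of the same run would also be available to the caller, so a separate silent invocation would not be needed in this design.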

Ignore overall health assessment

Hi,

I'm trying to configure check_smart to ignore attribute 202 (Percent_Lifetime_Used), using the following command:

/usr/lib/nagios/plugins/check_smart -i ata -E Percent_Lifetime_Used,202 -d /dev/sdb

However, the overall result is still FAILED because of the overall health assessment. Could we simply ignore it?

/usr/lib/nagios/plugins/check_smart -i ata -E Percent_Lifetime_Used,202 -d /dev/sdb --debug
Found /dev/sdb
###########################################################
(debug) CHECK 1: getting overall SMART health status for /dev/sdb 
###########################################################


(debug) executing:
sudo smartctl -d ata -Hi /dev/sdb

(debug) output:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
 Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF INFORMATION SECTION ===
 Device Model:     Crucial_CT275MX300SSD1
 Serial Number:    16441482E7D2
 LU WWN Device Id: 5 00a075 11482e7d2
 Firmware Version: M0CR040
 User Capacity:    275,064,201,216 bytes [275 GB]
 Sector Size:      512 bytes logical/physical
 Rotation Rate:    Solid State Device
 Form Factor:      2.5 inches
 Device is:        Not in smartctl database [for details use: -P showall]
 ATA Version is:   ACS-3 T13/2161-D revision 5
 SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
 Local Time is:    Thu Nov 14 16:47:52 2019 UTC
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled
 
 === START OF READ SMART DATA SECTION ===
 SMART overall-health self-assessment test result: FAILED!
 Drive failure expected in less than 24 hours. SAVE ALL DATA.
 No failed Attributes found.
 


(debug) parsing line:
Device Model:     Crucial_CT275MX300SSD1


(debug) found model:  Crucial_CT275MX300SSD1

(debug) parsing line:
Serial Number:    16441482E7D2


(debug) found serial number 16441482E7D2

(debug) parsing line:
SMART overall-health self-assessment test result: FAILED!


(debug) no 'PASSED' status; failing

###########################################################
(debug) CHECK 2: getting silent SMART health check
###########################################################


(debug) executing:
sudo smartctl -d ata -q silent -A /dev/sdb

(debug) exit code:
0

(debug) zero exit code, status OK

###########################################################
(debug) CHECK 3: getting detailed statistics
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing
###########################################################


(debug) executing:
sudo smartctl -d ata -A /dev/sdb

(debug) output:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
 Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF READ SMART DATA SECTION ===
 SMART Attributes Data Structure revision number: 16
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
   5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       25718
  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
 171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
 172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
 173 Unknown_Attribute       0x0032   001   001   000    Old_age   Always       -       1587
 174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
 184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
 194 Temperature_Celsius     0x0022   062   052   000    Old_age   Always       -       38 (Min/Max 23/48)
 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
 199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
 202 Unknown_SSD_Attribute   0x0030   000   000   001    Old_age   Offline  FAILING_NOW 100
 206 Unknown_SSD_Attribute   0x000e   100   100   000    Old_age   Always       -       0
 246 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       45166345756
 247 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1500301497
 248 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       28982653112
 180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       1257
 210 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
 


(debug) Raw Check List: Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block
(debug) Exclude List for Checks: Percent_Lifetime_Used,202
(debug) Exclude List for Perfdata: Percent_Lifetime_Used,202
(debug) Warning Thresholds:

(debug) Raw_Read_Error_Rate not in raw check list (raw value: 0)

(debug) Reallocated_Sector_Ct is OK (0)

(debug) Power_On_Hours not in raw check list (raw value: 25718)

(debug) Power_Cycle_Count not in raw check list (raw value: 5)

(debug) Runtime_Bad_Block is OK (0)

(debug) End-to-End_Error not in raw check list (raw value: 0)

(debug) Reported_Uncorrect not in raw check list (raw value: 0)

(debug) Temperature_Celsius not in raw check list (raw value: 38)

(debug) Reallocated_Event_Count not in raw check list (raw value: 0)

(debug) Current_Pending_Sector is OK (0)

(debug) Offline_Uncorrectable is OK (0)

(debug) UDMA_CRC_Error_Count not in raw check list (raw value: 0)

SMART Attribute Unknown_SSD_Attribute failed at FAILING_NOW but was set to be ignored
(debug) SMART Attribute Unknown_SSD_Attribute was set to be ignored

(debug) Unknown_SSD_Attribute not in raw check list (raw value: 0)

(debug) Unused_Rsvd_Blk_Cnt_Tot not in raw check list (raw value: 1257)

(debug) gathered perfdata:
Raw_Read_Error_Rate=0 Reallocated_Sector_Ct=0 Power_On_Hours=25718 Power_Cycle_Count=5 Runtime_Bad_Block=0 End-to-End_Error=0 Reported_Uncorrect=0 Temperature_Celsius=38 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Unknown_SSD_Attribute=0 Unused_Rsvd_Blk_Cnt_Tot=1257

###########################################################
(debug) LOCAL STATUS: CRITICAL, FINAL STATUS: CRITICAL
###########################################################


(debug) final status/output: CRITICAL
(debug) drives  ok: 
(debug) drives nok: Health status: FAILED!
(debug)   msg_list: Drive  Crucial_CT275MX300SSD1 S/N 16441482E7D2: ^Health status: FAILED!

CRITICAL: Drive  Crucial_CT275MX300SSD1 S/N 16441482E7D2:  Health status: FAILED!|Raw_Read_Error_Rate=0 Reallocated_Sector_Ct=0 Power_On_Hours=25718 Power_Cycle_Count=5 Runtime_Bad_Block=0 End-to-End_Error=0 Reported_Uncorrect=0 Temperature_Celsius=38 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Unknown_SSD_Attribute=0 Unused_Rsvd_Blk_Cnt_Tot=1257
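The decision being asked for can be reduced to a small piece of logic. The Python sketch below is only an illustration of that logic and not the plugin's code; `final_status` and its parameters are hypothetical names, and the numeric codes follow the usual monitoring plugin exit code convention.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional plugin exit codes


def final_status(health_passed, attribute_status, skip_self_assessment=False):
    """Combine the self-assessment result with the attribute check result."""
    status = attribute_status
    if not health_passed and not skip_self_assessment:
        status = max(status, CRITICAL)
    return status


# Situation from the debug output: health FAILED, attribute checks OK.
print(final_status(False, OK))
print(final_status(False, OK, skip_self_assessment=True))
```

With the switch enabled, only the attribute checks (which honour the exclude list) decide the result.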

Incorrect output with sat+megaraid,[0-3]

Hello,

I get correct results with this command:
./check_smart.pl -g /dev/sda -i sat+megaraid,0
OK: [sat+megaraid,0] - Device is clean|

But not with this one:
./check_smart.pl -g /dev/sda -i 'sat+megaraid,[0-3]'
UNKNOWN: [megaraid,0] - No health status line found[megaraid,0] - [megaraid,0] - --- [megaraid,1] - No health status line found[megaraid,1] - [megaraid,1] - --- [megaraid,2] - No health status line found[megaraid,2] - [megaraid,2] - --- [megaraid,3] - No health status line found[megaraid,3] - [megaraid,3] - |

It seems that with the second command the "sat+" prefix is not kept.
Should the command be written differently in this case, perhaps with some character escaped?

Thanks
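The expected behaviour, expanding the bracket range while keeping everything in front of the comma, can be sketched in Python as follows. This is an assumption about the desired result, not the plugin's implementation, and `expand_interface` is a hypothetical name.

```python
import re


def expand_interface(pattern):
    """Expand 'prefix,[A-B]' into one interface string per index, keeping the prefix."""
    m = re.fullmatch(r"(?P<prefix>.+),\[(?P<lo>\d+)-(?P<hi>\d+)\]", pattern)
    if not m:
        return [pattern]  # nothing to expand
    lo, hi = int(m["lo"]), int(m["hi"])
    return [f"{m['prefix']},{n}" for n in range(lo, hi + 1)]


print(expand_interface("sat+megaraid,[0-3]"))
print(expand_interface("megaraid,[0-1]"))
```

Because the prefix is captured as a whole, a combined type such as sat+megaraid survives the expansion unchanged.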

Use of uninitialized value $device

Use of uninitialized value $device in string eq at ./check_smart_orig.pl line 117.

No output after pipe

Hello

First of all, thank you for your time in writing this very useful plugin. I have installed the plugin with Nagios 4.4.14 on Rocky 9.

When I try it (check_smart -d /dev/sda -i auto), I only get part of the output instead of the complete result shown in the examples.

This is all I get. Nothing is shown after the pipe:

OK: Drive DELL MD32xx S/N 3C3005A: no SMART errors detected. |

Can you please let me know if I am doing something wrong? Is there something I need to change to make it work?

Thank you

6.12.0 regression: invalid interface

6.12.0 introduced a regression when using an interface that takes an additional comma-separated argument, for example -i megaraid,1:

~# /usr/lib/nagios/plugins/check_smart -d /dev/sda -i megaraid,1
invalid interface megaraid,1 for /dev/sda!

check_smart v6.12.0
The monitoring plugins come with ABSOLUTELY NO WARRANTY. You may redistribute
copies of the plugins under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING.

Usage: check_smart {-d=<block device>|-g=<block device glob>} -i=(auto|ata|scsi|3ware,N|areca,N|hpt,L/M/N|aacraid,H,L,ID|cciss,N|megaraid,N) [-r list] [-w list] [-b N] [-e list] [-E list] [--debug]

At least one of the below. -d supersedes -g
  -d/--device: a physical block device to be SMART monitored, eg /dev/sda. Pseudo-device /dev/bus/N is allowed.
  -g/--global: a glob pattern name of physical devices to be SMART monitored
       Example: '/dev/sd[a-z]' will search for all /dev/sda until /dev/sdz devices and report errors globally.
       It is also possible to use -g in conjunction with megaraid devices. Example: -i 'megaraid,[0-3]'.
       Does not output performance data for historical value graphing.
Note that -g only works with a fixed interface (e.g. scsi, ata) and megaraid,N.

Other options
  -i/--interface: device's interface type (auto|ata|scsi|nvme|3ware,N|areca,N|hpt,L/M/N|aacraid,H,L,ID|cciss,N|megaraid,N)
  (See http://www.smartmontools.org/wiki/Supported_RAID-Controllers for interface convention)
  -r/--raw Comma separated list of ATA or NVMe attributes to check
       ATA default: Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count
       NVMe default: Media_and_Data_Integrity_Errors
  -b/--bad: Threshold value for Current_Pending_Sector for ATA and 'grown defect list' for SCSI drives
  -w/--warn Comma separated list of thresholds for ATA drives (e.g. Reallocated_Sector_Ct=10,Current_Pending_Sector=62)
  -e/--exclude: Comma separated list of SMART attribute names or numbers which should be excluded (=ignored) with regard to checks
  -E/--exclude-all: Comma separated list of SMART attribute names or numbers which should be completely ignored for checks as well as perfdata
  -s/--selftest: Enable self-test log check
  -l/--ssd-lifetime: Check attribute 'Percent_Lifetime_Remain' available on some SSD drives
  --skip-self-assessment: Skip SMART self-assessment health status check
  -h/--help: this help
  -q/--quiet: When faults detected, only show faulted drive(s) (only affects output when used with -g parameter)
  --debug: show debugging information
  -v/--version: Version number
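
A check that accepts the comma forms could look roughly like the sketch below. It is written in Python for illustration only and is not the plugin's actual validation code. The pattern is built from the interface list in the usage text above, `is_valid_interface` is a hypothetical name, and the numeric fields (N, L/M/N, H,L,ID) are assumed to be plain non-negative integers. Forms used in other reports, such as sat or sat+megaraid,N, are not in that list and are therefore not covered here.

```python
import re

# Built from the interface list printed in the usage text; the numeric
# fields are assumed to be non-negative integers.
INTERFACE_RE = re.compile(
    r"auto|ata|scsi|nvme"
    r"|(?:3ware|areca|cciss|megaraid),\d+"
    r"|hpt,\d+/\d+/\d+"
    r"|aacraid,\d+,\d+,\d+"
)


def is_valid_interface(value):
    """Return True if the whole string is one of the documented interface forms."""
    return INTERFACE_RE.fullmatch(value) is not None


for candidate in ("megaraid,1", "megaraid", "hpt,1/2/3"):
    print(candidate, is_valid_interface(candidate))
```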

Reallocated_Sector_Ct & Seek_Error_Rate

Good day

Could your script please also monitor Reallocated_Sector_Ct and Seek_Error_Rate?

Kind Regards
Brent Clark

megaraid,N does not work with 6.12

I have configured a few checks for the HDDs on my AVAGO MegaRAID SAS 9361-8i controller:

command[check_smart_storage_drive0]=/usr/local/nagios/libexec/smart.pl -d '/dev/sda' -i 'megaraid,23' -r Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Command_Timeout,Temperature_Celsius,Airflow_Temperature_Cel -w Airflow_Temperature_Cel=45,Temperature_Celsius=45
command[check_smart_storage_drive1]=/usr/local/nagios/libexec/smart.pl -d '/dev/sda' -i 'megaraid,24' -r Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Command_Timeout,Temperature_Celsius,Airflow_Temperature_Cel -w Airflow_Temperature_Cel=45,Temperature_Celsius=45
command[check_smart_storage_drive2]=/usr/local/nagios/libexec/smart.pl -d '/dev/sda' -i 'megaraid,25' -r Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Command_Timeout,Temperature_Celsius,Airflow_Temperature_Cel -w Airflow_Temperature_Cel=45,Temperature_Celsius=45
command[check_smart_storage_drive3]=/usr/local/nagios/libexec/smart.pl -d '/dev/sda' -i 'megaraid,26' -r Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Command_Timeout,Temperature_Celsius,Airflow_Temperature_Cel -w Airflow_Temperature_Cel=45,Temperature_Celsius=45



command[check_smart_backups_drive0]=/usr/local/nagios/libexec/smart.pl -d '/dev/sdb' -i 'megaraid,27' -r Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Command_Timeout,Temperature_Celsius,Airflow_Temperature_Cel -w Airflow_Temperature_Cel=45,Temperature_Celsius=45
command[check_smart_backups_drive1]=/usr/local/nagios/libexec/smart.pl -d '/dev/sdb' -i 'megaraid,28' -r Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Command_Timeout,Temperature_Celsius,Airflow_Temperature_Cel -w Airflow_Temperature_Cel=45,Temperature_Celsius=45
command[check_smart_backups_drive2]=/usr/local/nagios/libexec/smart.pl -d '/dev/sdb' -i 'megaraid,29' -r Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Command_Timeout,Temperature_Celsius,Airflow_Temperature_Cel -w Airflow_Temperature_Cel=45,Temperature_Celsius=45

-----------------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                   Sp Type 
-----------------------------------------------------------------------------------
21:0     24 Onln   0 1.818 TB SATA HDD N   N  512B Hitachi HUA723020ALA640 U  -    
21:1     25 Onln   0 1.818 TB SATA HDD N   N  512B HGST HUS724020ALA640    U  -    
21:2     26 Onln   0 1.818 TB SATA HDD N   N  512B HGST HUS724020ALE640    U  -    
21:3     23 Onln   0 1.818 TB SATA HDD N   N  512B HGST HUS724020ALA640    U  -    
21:4     27 Onln   1 1.818 TB SATA HDD N   N  512B Hitachi HUA723020ALA640 U  -    
21:5     29 Onln   1 1.818 TB SATA HDD N   N  512B Hitachi HUA723020ALA640 U  -    
21:6     28 Onln   1 1.818 TB SATA HDD N   N  512B Hitachi HUA723020ALA640 U  -    
-----------------------------------------------------------------------------------

But after updating smart.pl to the latest version, when I run:

/usr/local/nagios/libexec/smart.pl -d '/dev/sda' -i 'megaraid,24'
invalid interface megaraid,24 for /dev/sda!

With the new version 6.12 on another server with NVMe drives, everything works fine and the performance data is correct, but with that version of the plugin I am unable to check SMART on my HDDs.

Feature request: An exclude-flag when using -g option

It would be wonderful if the -g option could be combined with an optional exclude flag.
Right now I have multiple hard drives on one of my systems, and also a 3G modem connected.
The 3G modem exposes a USB disk (/dev/sdd), and when I use the -g option, the overall health becomes:
UNKNOWN: [/dev/sda] - Device is clean ... [/dev/sdd] - No health status line found ---

As said, it would be nice if one could do the following:
/usr/lib/nagios/plugins/check_smart.pl -i ata -g /dev/sd --exclude /dev/sdd /dev/sde
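
The requested behaviour amounts to filtering the globbed device list before the checks run. A minimal sketch in Python, for illustration only: the exclude flag is the one proposed in this request and not an existing option, `filter_devices` is a hypothetical name, and allowing shell-style patterns in the exclude list is an extra assumption.

```python
from fnmatch import fnmatchcase


def filter_devices(devices, exclude_patterns):
    """Return the devices that match none of the exclude patterns."""
    return [
        device
        for device in devices
        if not any(fnmatchcase(device, pattern) for pattern in exclude_patterns)
    ]


found = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"]
print(filter_devices(found, ["/dev/sdd", "/dev/sde"]))
```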

status line 2000GB Gigabyte AORUS M.2 2280 PCIe 4.0 x4 NVMe

Your plugin works very well, thank you very much! For NVMe, it has worked in all my setups with Samsung devices.

Recently, I purchased this Gigabyte product:
2000GB Gigabyte AORUS M.2 2280 PCIe 4.0 x4 NVMe 1.3 3D-NAND TLC (GP-ASM2NE6200TTTD)

It does yield meaningful SMART output using

smartctl -a /dev/nvme0n1
nvme --smart-log /dev/nvme0n1

However, the plugin returns "UNKNOWN: Drive GIGABYTE GP-ASM2NE6200TTTD S/N SN...: No health status line found".

Is there a setting available to cure this?

Regards,

Michael Schefczyk

flag to disable temperature check

Some disks report a strange maximum temperature (25°C?) and build date (year 2002?):

# /var/lib/icinga2/checks/check_smart.pl -d /dev/sda -i auto
CRITICAL: Drive  Hitachi HDS722020ALA330 S/N 11111:  Disk temperature is higher than maximum|temperature=30;;25

# /var/lib/icinga2/checks/check_smart.pl -g /dev/sd\? -i auto
CRITICAL: [/dev/sda] - Disk temperature is higher than maximum --- [/dev/sdb] - Disk temperature is higher than maximum --- [/dev/sdc] - Disk temperature is higher than maximum --- [/dev/sdd] - Disk temperature is higher than maximum --- [/dev/sde] - Disk temperature is higher than maximum --- [/dev/sdf] - Disk temperature is higher than maximum --- [/dev/sdg] - Disk temperature is higher than maximum --- [/dev/sdh] - Disk temperature is higher than maximum --- [/dev/sdi] - Disk temperature is higher than maximum --- [/dev/sdj] - Disk temperature is higher than maximum --- [/dev/sdk] - Disk temperature is higher than maximum --- [/dev/sdl] - Disk temperature is higher than maximum --- [/dev/sdm] - Disk temperature is higher than maximum --- [/dev/sdn] - Disk temperature is higher than maximum --- [/dev/sdo] - Disk temperature is higher than maximum --- [/dev/sdp] - Disk temperature is higher than maximum|

# smartctl -a -d auto /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.11.22-3-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               Hitachi
Product:              HDS722020ALA330
Revision:             R001
Compliance:           SPC-3
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        10000 rpm
Logical Unit id:      
Serial number:        
Device type:          disk
Transport protocol:   Fibre channel (FCP-2)
Local Time is:        Sun Apr 10 22:48:26 2022 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     30 C
Drive Trip Temperature:        25 C

Manufactured in week 30 of year 2002
Specified cycle count over device lifetime:  4278190080
Accumulated start-stop cycles:  256
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.000           0
write:         0        0         0         0          0          0.000           0

Non-medium error count:        0

Device does not support Self Test logging

# smartctl -a /dev/sda -A
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.27-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               Seagate
Product:              ST4000VN008-2DR1
Revision:             R001
Compliance:           SPC-3
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        10000 rpm
Logical Unit id:      
Serial number:        
Device type:          disk
Transport protocol:   Fibre channel (FCP-2)
Local Time is:        Sun Apr 10 22:55:42 2022 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     30 C
Drive Trip Temperature:        25 C

Manufactured in week 30 of year 2002
Specified cycle count over device lifetime:  4278190080
Accumulated start-stop cycles:  256
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.000           0
write:         0        0         0         0          0          0.000           0

Non-medium error count:        0

Device does not support Self Test logging
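
The requested flag could work along the lines of the sketch below. It is Python for illustration only, not the plugin's code: `temperature_status` and `skip_temperature` are hypothetical names, and the comparison against "Drive Trip Temperature" is inferred from the CRITICAL message above rather than taken from the plugin source.

```python
import re


def temperature_status(smartctl_output, skip_temperature=False):
    """Compare current and trip temperature unless the check is skipped."""
    current = re.search(r"Current Drive Temperature:\s+(\d+) C", smartctl_output)
    trip = re.search(r"Drive Trip Temperature:\s+(\d+) C", smartctl_output)
    if skip_temperature or current is None or trip is None:
        return "OK"
    if int(current.group(1)) > int(trip.group(1)):
        return "CRITICAL"
    return "OK"


sample = (
    "Current Drive Temperature:     30 C\n"
    "Drive Trip Temperature:        25 C\n"
)
print(temperature_status(sample))
print(temperature_status(sample, skip_temperature=True))
```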

Temperature reported twice in metrics for some drives

I have some drives (Seagate Nytro XF1230) that report Temperature_Celsius twice: once with type Old_age and once with type Pre-fail. Both may be relevant performance data, but the current temperature, which is the most interesting information for metrics and fancy dashboards, is in the first one.

This is what these drives return in smartctl:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
...
194 Temperature_Celsius     0x0002   071   058   000    Old_age   Always       -       29 (Min/Max 17/42)
...
231 Temperature_Celsius     0x0033   100   100   001    Pre-fail  Always       -       100

A line from check_smart.pl for one of these drives looks like this:

OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Reallocated_Sector_Ct=0 Power_On_Hours=3706 Power_Cycle_Count=302 Program_Fail_Count_Chip=0 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=1432832 Used_Rsvd_Blk_Cnt_Chip=45 Used_Rsvd_Blk_Cnt_Tot=661 Unused_Rsvd_Blk_Cnt_Tot=0 Program_Fail_Cnt_Total=0 Erase_Fail_Count_Total=0 Runtime_Bad_Block=0 End-to-End_Error=0 Reported_Uncorrect=0 Command_Timeout=0 Unknown_SSD_Attribute=0 Airflow_Temperature_Cel=0 Temperature_Celsius=29 Hardware_ECC_Recovered=0 UDMA_CRC_Error_Count=0 Unknown_SSD_Attribute=0 Soft_ECC_Correction=0 Temperature_Celsius=100 Total_LBAs_Written=2973 Total_LBAs_Read=3900 Read_Error_Retry_Rate=992

This has become a problem after I upgraded my monitoring hosts from Debian stretch to buster. The Icinga 2 version 2.6 in stretch fed the first occurrence of the attribute in the performance data to Graphite, whereas the newer version 2.10 from buster seems to use the second. Therefore, all my disks now show a temperature of 99 or 100°C in the database.

As far as I understand it, labels in the performance data should be unique and the order of the label/value pairs in the performance data is irrelevant, so I think Icinga is not at fault here as the behavior in case of non-unique labels is undefined.

Since I deploy check_smart.pl via Ansible on all hosts and have a local copy of it in one of my Ansible roles, I fixed the problem by replacing the Regex on line 396 with something like this (could be better, but I'm not fluent in Perl and Regex, so this was the quick and dirty solution my brain came up with):

next unless $line =~ /^\s*\d+\s(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)/;
my ($attribute_name, $type, $when_failed, $raw_value) = ($1, $6, $8, $9);

and then a few lines later I added:

if ($attribute_name eq "Temperature_Celsius" and $type eq "Pre-fail") {
        next;
}

Maybe I'm not the only one with this or a similar issue and maybe there is a more generic way to do this (e.g. adding the attribute type to the label in the performance data or implementing more flexible excludes?), so I'll just leave this here.
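
One more generic variant is to keep both values but make the labels unique. The sketch below is Python for illustration only, not the plugin's Perl. `perfdata` is a hypothetical function, and appending the attribute ID to a repeated name is just one possible naming scheme.

```python
import re

# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
ATTRIBUTE_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def perfdata(lines):
    """Build perfdata; a repeated attribute name gets its ID appended."""
    seen = set()
    pairs = []
    for line in lines:
        match = ATTRIBUTE_RE.match(line)
        if match is None:
            continue  # header, "..." and other non-attribute lines
        attribute_id, name, raw_value = match.group(1), match.group(2), match.group(10)
        label = name if name not in seen else f"{name}_{attribute_id}"
        seen.add(name)
        pairs.append(f"{label}={raw_value}")
    return " ".join(pairs)


sample = [
    "194 Temperature_Celsius     0x0002   071   058   000    Old_age   Always       -       29 (Min/Max 17/42)",
    "...",
    "231 Temperature_Celsius     0x0033   100   100   001    Pre-fail  Always       -       100",
]
print(perfdata(sample))
```

The first occurrence keeps its plain label, so existing graphs for Temperature_Celsius would continue to receive the value from attribute 194.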

Make -g option work for 3ware interface

This is already supported for megaraid interface.

-g parameter is limited to 26 devices

The -g parameter matches at most 26 devices when, for instance, you pass:

-g /dev/sd

It matches /dev/sda, /dev/sdb, ... up to /dev/sdz.

But on a system with more disks (/dev/sdaa, /dev/sdab, /dev/sdac, etc.) these disks are ignored because the glob function does not match them:

line 78:

@dev =glob($opt_g."?");

The ? matches exactly one character, so /dev/sdaa is ignored (as is /dev/sda1, which is fine).

So it should match /dev/sd[a-z][a-z] or something similar (I don't know how to express that in Perl) in order to match /dev/sdaa too, but not numbered partitions like /dev/sda1.
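
The idea of combining a one-letter and a two-letter pattern can be sketched like this. It is Python for illustration only (the plugin is Perl), `whole_disks` is a hypothetical helper, and the demonstration uses a scratch directory with empty files instead of the real /dev.

```python
import glob
import os
import tempfile


def whole_disks(prefix):
    """Match one- and two-letter suffixes; names ending in a digit never match."""
    safe_prefix = glob.escape(prefix)
    paths = glob.glob(safe_prefix + "[a-z]") + glob.glob(safe_prefix + "[a-z][a-z]")
    # Sort sda..sdz before sdaa..sdzz, like the kernel's naming order.
    return sorted(paths, key=lambda path: (len(path), path))


with tempfile.TemporaryDirectory() as scratch:
    for name in ("sda", "sda1", "sdz", "sdaa", "sdab", "sdaa1"):
        open(os.path.join(scratch, name), "w").close()
    names = [os.path.basename(path) for path in whole_disks(os.path.join(scratch, "sd"))]
    print(names)
```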

check_smart does not support General Purpose Log

One of my computers has an SSD (model ADATA SP600NS34) which doesn't report its used life as a SMART attribute. It does, however, report it in the "General Purpose Log".

$ smartctl -l ssd /dev/sda
…
Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1              56  N--  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Supported tables can be listed with smartctl -l devstat,0 and each supported table can be retrieved using smartctl -l devstat,<page>. -l ssd is equivalent to -l devstat,7. The SMART attributes for the aforementioned SSD:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   100   100   000    Old_age   Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0013   100   100   050    Pre-fail  Always       -       0
  7 Unknown_SSD_Attribute   0x000b   100   100   050    Pre-fail  Always       -       0
  8 Unknown_SSD_Attribute   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       18904
 10 Unknown_SSD_Attribute   0x0013   100   100   050    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       658
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       93
169 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       4295950346
170 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       0
173 Unknown_Attribute       0x0012   143   143   000    Old_age   Always       -       25862866203
175 Program_Fail_Count_Chip 0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   099   099   020    Pre-fail  Always       -       1089
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       93
194 Temperature_Celsius     0x0022   063   063   030    Old_age   Always       -       37 (Min/Max 28/38)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
231 Temperature_Celsius     0x0033   100   100   005    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       691419471104
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1075000130176
240 Unknown_SSD_Attribute   0x0013   100   100   050    Pre-fail  Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       32170085668
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       9610546032

check_smart could retrieve the general purpose log pages and treat them similar to the SMART attributes.
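
Parsing the device statistics output might look like the sketch below. It is Python for illustration only; `parse_devstat` is a hypothetical function and the line layout is assumed to be exactly the one shown in the smartctl excerpt above.

```python
import re

# Columns: page, offset, size, value, flags, description
DEVSTAT_RE = re.compile(
    r"^(0x[0-9a-f]+)\s+(0x[0-9a-f]+)\s+(\d+)\s+(-?\d+)\s+(\S+)\s+(.+?)\s*$"
)


def parse_devstat(output):
    """Map each statistic description to its integer value."""
    stats = {}
    for line in output.splitlines():
        match = DEVSTAT_RE.match(line)
        if match:  # section headers and the flag legend do not match
            stats[match.group(6)] = int(match.group(4))
    return stats


sample = """Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1              56  N--  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
"""
print(parse_devstat(sample))
```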

Handling dots in attribute names

Reported by e-mail:

When you monitor NVMe with
Nvme_0 /usr/local/bin/check_smart.pl -d /dev/nvme0 -i nvme
the result looks like this:
Nvme_0 0 OK: Drive SAMSUNG MZVLB512HAJQ-00000 S/N S3W8NA0M345262: no SMART errors detected. |Temperature=25 Available_Spare=100 Available_Spare_Threshold=10 Percentage_Used=22 Data_Units_Read=35774467 Data_Units_Written=280451586 Host_Read_Commands=637677302 Host_Write_Commands=2270597693 Controller_Busy_Time=5846 Power_Cycles=22 Power_On_Hours=1268 Unsafe_Shutdowns=7 Media_and_Data_Integrity_Errors=0 Error_Information_Log_Entries=10 Warning__Comp._Temperature_Time=0 Critical_Comp._Temperature_Time=0 Temperature_Sensor_1=25 Temperature_Sensor_2=34
The problem is the "." in Critical_Comp._Temperature_Time: graphs are not generated.
I have added this at line 491:
$attribute_name =~ s/\.//g;
It works for me.
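
Note that in a regular expression an unescaped dot matches any character, so the substitution has to target a literal dot. The difference is shown below in Python for illustration (the same rule applies to Perl); `strip_dots` is a hypothetical helper name.

```python
import re


def strip_dots(attribute_name):
    """Remove literal dots only; the backslash makes the dot literal."""
    return re.sub(r"\.", "", attribute_name)


name = "Critical_Comp._Temperature_Time"
print(repr(re.sub(r".", "", name)))  # unescaped dot: every character is removed
print(repr(strip_dots(name)))        # escaped dot: only the "." is removed
```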

Percent_Lifetime_Remain threshold unset with -w

Hello

It seems there is an issue with the handling of the -w option. When I give a threshold for a particular smartctl attribute (not lifetime), the Percent_Lifetime_Remain threshold is no longer set to 90%:

warning => ./check_smart -i auto -g '/dev/sda' -w Reallocated_Sector_Ct=250 -l
ok => ./check_smart -i auto -g '/dev/sda' -w Reallocated_Sector_Ct=250,Percent_Lifetime_Remain=90 -l
ok => ./check_smart -i auto -g '/dev/sda' -l

Before working on a patch, can you tell me whether this behaviour is intended?

Regards
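
The behaviour one would expect is that user thresholds are merged over the defaults instead of replacing them. A minimal sketch in Python, for illustration only: `parse_warn` is a hypothetical function, and the only default used is the 90 for Percent_Lifetime_Remain mentioned above.

```python
DEFAULT_WARN = {"Percent_Lifetime_Remain": 90}


def parse_warn(option):
    """Start from the defaults, then let -w entries override or extend them."""
    thresholds = dict(DEFAULT_WARN)
    if option:
        for item in option.split(","):
            name, _, value = item.partition("=")
            thresholds[name] = int(value)
    return thresholds


print(parse_warn("Reallocated_Sector_Ct=250"))
```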

MegaRAID: errors not shown by disk ?

Hi,

Because we found HDDs with errors, I am trying to set up a Nagios/Icinga check for SMART status as well, and found this nice script, which also worked well with MegaRAID controllers in my first tests.

But the output is not very helpful in error cases because:

no complete device name is shown, like sat+megaraid,13 /dev/sda, to identify disks in the hardware RAID
no errors are shown, neither per device nor globally:

root@test ~ # /usr/lib/nagios/plugins/check_smart.pl -i  "sat+megaraid,[4-19]" -g "/dev/sd[a]"
OK: [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean|

Only if I enable the self-test log check do I get a global warning notification:

root@test ~ # /usr/lib/nagios/plugins/check_smart.pl -i  "sat+megaraid,[4-19]" -g "/dev/sd[a]" -s
WARNING: [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Device is clean --- [/dev/sda] - Self-test log contains errors|

and in the actual test case there are 2 disks with errors...

I think this would be a generally helpful improvement.
Best regards

Warning thresholds do NOT give the expected result.

Hi,

first of all, thank you for writing check_smart.pl.

When I try to check the temperature, the given argument is not honored:

# ./check_smart.pl -d /dev/sda -i megaraid,11 -w 'Temperature_Celsius=35' --debug

I would expect that the temperature from smartctl (37 °C) is over the limit of 35 °C and a WARNING should be displayed, but OK is shown.

See the --debug output below:

# ./check_smart.pl -d /dev/sda -i megaraid,11 -w 'Temperature_Celsius=35' --debug
Found /dev/sda
###########################################################
(debug) CHECK 1: getting overall SMART health status for /dev/sda 
###########################################################


(debug) executing:
sudo smartctl -d megaraid,11 -Hi /dev/sda

(debug) output:
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1127.19.1.el7.x86_64] (local build)
 Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF INFORMATION SECTION ===
 Model Family:     Western Digital RE4
 Device Model:     WDC WD1003FBYX-01Y7B1
 Serial Number:    WD-WMAW31105150
 LU WWN Device Id: 5 0014ee 206d2bb92
 Firmware Version: 01.01V02
 User Capacity:    1.000.204.886.016 bytes [1,00 TB]
 Sector Size:      512 bytes logical/physical
 Rotation Rate:    7200 rpm
 Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   ATA8-ACS (minor revision not indicated)
 SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
 Local Time is:    Mon Oct  5 08:04:52 2020 CEST
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled
 
 === START OF READ SMART DATA SECTION ===
 SMART Status not supported: ATA return descriptor not supported by controller firmware
 SMART overall-health self-assessment test result: PASSED
 Warning: This result is based on an Attribute check.
 
 Last login: Mo Okt  5 08:04:40 CEST 2020 on pts/0


(debug) parsing line:
Device Model:     WDC WD1003FBYX-01Y7B1


(debug) found model:  WDC WD1003FBYX-01Y7B1

(debug) parsing line:
Serial Number:    WD-WMAW31105150


(debug) found serial number WD-WMAW31105150

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED


(debug) found string 'PASSED'; status OK

###########################################################
(debug) CHECK 2: getting silent SMART health check
###########################################################


(debug) executing:
sudo smartctl -d megaraid,11 -q silent -A /dev/sda

Last login: Mo Okt  5 08:04:52 CEST 2020 on pts/0
(debug) exit code:
0

(debug) zero exit code, status OK

###########################################################
(debug) CHECK 3: getting detailed statistics from attributes
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing
###########################################################


(debug) executing:
sudo smartctl -d megaraid,11 -A /dev/sda

(debug) output:
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1127.19.1.el7.x86_64] (local build)
 Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF READ SMART DATA SECTION ===
 SMART Attributes Data Structure revision number: 16
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3
   3 Spin_Up_Time            0x0027   186   175   021    Pre-fail  Always       -       3675
   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       874
   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
   9 Power_On_Hours          0x0032   017   017   000    Old_age   Always       -       60999
  10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       45
 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       35
 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       838
 194 Temperature_Celsius     0x0022   110   107   000    Old_age   Always       -       37
 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
 198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
 200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
 
 Last login: Mo Okt  5 08:04:52 CEST 2020 on pts/0


(debug) Raw Check List: Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count
(debug) Exclude List for Checks: 
(debug) Exclude List for Perfdata: 
(debug) Warning Thresholds:
Temperature_Celsius=35

(debug) Raw_Read_Error_Rate not in raw check list (raw value: 3)

(debug) Spin_Up_Time not in raw check list (raw value: 3675)

(debug) Start_Stop_Count not in raw check list (raw value: 874)

(debug) Reallocated_Sector_Ct is OK (0)

(debug) Seek_Error_Rate not in raw check list (raw value: 0)

(debug) Power_On_Hours not in raw check list (raw value: 60999)

(debug) Spin_Retry_Count not in raw check list (raw value: 0)

(debug) Calibration_Retry_Count not in raw check list (raw value: 0)

(debug) Power_Cycle_Count not in raw check list (raw value: 45)

(debug) Power-Off_Retract_Count not in raw check list (raw value: 35)

(debug) Load_Cycle_Count not in raw check list (raw value: 838)

(debug) Temperature_Celsius not in raw check list (raw value: 37)

(debug) Reallocated_Event_Count is OK (0)

(debug) Current_Pending_Sector is OK (0)

(debug) Offline_Uncorrectable is OK (0)

(debug) UDMA_CRC_Error_Count not in raw check list (raw value: 0)

(debug) Multi_Zone_Error_Rate not in raw check list (raw value: 0)

(debug) gathered perfdata:
Raw_Read_Error_Rate=3 Spin_Up_Time=3675 Start_Stop_Count=874 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=60999 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=45 Power-Off_Retract_Count=35 Load_Cycle_Count=838 Temperature_Celsius=37 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0

###########################################################
(debug) LOCAL STATUS: OK, FINAL STATUS: OK
###########################################################


(debug) final status/output: OK
(debug) drives  ok: 
(debug) drives nok: 
(debug)   msg_list: Drive  WDC WD1003FBYX-01Y7B1 S/N WD-WMAW31105150: no SMART errors detected. 

OK: Drive  WDC WD1003FBYX-01Y7B1 S/N WD-WMAW31105150: no SMART errors detected. |Raw_Read_Error_Rate=3 Spin_Up_Time=3675 Start_Stop_Count=874 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=60999 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=45 Power-Off_Retract_Count=35 Load_Cycle_Count=838 Temperature_Celsius=37 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0

Did I misunderstand something?

Thank you!
Klaus.
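
The debug output reports "Temperature_Celsius not in raw check list", which suggests that a -w threshold is only evaluated for attributes that are also in the -r list. The sketch below models that reading in Python, for illustration only. It is an assumption drawn from the debug lines, not the plugin's actual code: `evaluate` is a hypothetical function, and both the "raw value above threshold" comparison and the default threshold of 0 are assumed.

```python
DEFAULT_RAW_LIST = [
    "Current_Pending_Sector", "Reallocated_Sector_Ct", "Program_Fail_Cnt_Total",
    "Uncorrectable_Error_Cnt", "Offline_Uncorrectable", "Runtime_Bad_Block",
    "Reported_Uncorrect", "Reallocated_Event_Count",
]


def evaluate(attributes, warn, raw_list=DEFAULT_RAW_LIST):
    """Return 'WARNING' if a checked attribute exceeds its threshold, else 'OK'."""
    for name, raw_value in attributes.items():
        if name not in raw_list:
            continue  # corresponds to "not in raw check list" in the debug output
        if raw_value > warn.get(name, 0):  # assumed comparison, assumed default 0
            return "WARNING"
    return "OK"


attributes = {"Reallocated_Sector_Ct": 0, "Temperature_Celsius": 37}
warn = {"Temperature_Celsius": 35}
print(evaluate(attributes, warn))
print(evaluate(attributes, warn, DEFAULT_RAW_LIST + ["Temperature_Celsius"]))
```

If that reading is right, adding Temperature_Celsius to the -r list (as the commands in the "megaraid,N" report do) may be what makes the threshold take effect.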

Add NVMe support

Hello,
The NVMe interface type is not currently supported, but NVMe drives are becoming popular.
My suggestion is to add an nvme option to the -i parameter.

Use of uninitialized value $exclude_list in split at ./check_smart.pl line 159.

When using the most recent version of the script I get this error message. It looks like -e option is required and not optional.

There's also no info about this option when executing the check with -h.

-g flag doesn't work with nvme-devices

Hi,

The -g flag doesn't work with NVMe devices the way it works with sata/megaraid/cciss:

Device list:
ls /dev/nvme*
/dev/nvme0 /dev/nvme0n1 /dev/nvme0n1p1 /dev/nvme1 /dev/nvme1n1 /dev/nvme1n1p1 /dev/nvme2 /dev/nvme2n1 /dev/nvme2n1p1 /dev/nvme3 /dev/nvme3n1 /dev/nvme3n1p1

Plugin Run:
./check_smart.pl -g /dev/nvme[0-3] -i auto
OK: [/dev/nvme0] - Device is clean|

Expected output:
OK: [nvme,0] - Device is clean --- [nvme,1] - Device is clean --- [nvme,2] - Device is clean --- [nvme,3] - Device is clean |

Define default "raw attribute" check list

Sometimes vendors use different attribute names. To catch as many as possible, this issue serves as placeholder/discussion to collect data. Help is much wanted and needed!
I will update the issue with relevant info from the comments below.

Note: This list does not apply to NVMe devices, as they use a different kind of attribute list.

Current Default List

As of version 6.0, the current defined default list is:
'Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block';

Important SMART attributes

Current_Pending_Sector

Attribute ID: 197
Hard drives: ✔️
Solid state drives: ❌
Vendor alias: Seem to use the same name

Reallocated_Sector_Ct

Attribute ID: 5
Hard drives: ✔️
Solid state drives: ✔️
Vendor alias: Seem to use the same name

Program_Fail_Cnt_Total

Attribute ID: 181
Hard drives: ❌
Solid state drives: ✔️
Vendor alias: TBD

Reported_Uncorrectable_Errors

Attribute ID: 187
Hard drives: ✔️
Solid state drives: ✔️
Vendor alias: Reported_Uncorrect (seen in SSDs and Seagate HDDs)

Uncorrectable_Error_Cnt

Attribute ID: 198
Hard drives: ✔️
Solid state drives: ✔️
Vendor alias: Offline_Uncorrectable (seen in Toshiba and WDC HDD, WDC SSD)

Suggested changes

Add Reported_Uncorrect and Reallocated_Event_Count to the default list
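
To illustrate what extending the default list would change, here is a small Python sketch (not the plugin's Perl code). The default list is copied from this issue. The helper nonzero_raw_attributes and the sample raw values are assumptions made up for illustration.

```python
# Default list as quoted in this issue (version 6.0).
DEFAULT_RAW_CHECK_LIST = [
    "Current_Pending_Sector", "Reallocated_Sector_Ct", "Program_Fail_Cnt_Total",
    "Uncorrectable_Error_Cnt", "Offline_Uncorrectable", "Runtime_Bad_Block",
]
# Additions suggested in this issue.
SUGGESTED_ADDITIONS = ["Reported_Uncorrect", "Reallocated_Event_Count"]


def nonzero_raw_attributes(raw_values, check_list):
    """Return (name, value) pairs from check_list whose raw value is non-zero."""
    return [(n, raw_values[n]) for n in check_list if raw_values.get(n, 0) != 0]


# Hypothetical raw values, not taken from a real drive.
sample = {"Reallocated_Sector_Ct": 0, "Reported_Uncorrect": 3, "Power_On_Hours": 100}

print(nonzero_raw_attributes(sample, DEFAULT_RAW_CHECK_LIST))
print(nonzero_raw_attributes(sample, DEFAULT_RAW_CHECK_LIST + SUGGESTED_ADDITIONS))
```

In this sample the drive only shows up as a problem once the vendor alias Reported_Uncorrect is part of the list.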

--warn option doesn't work, -w works fine

sudo /usr/lib/nagios/plugins/check_smart.pl --global '/dev/sd[a-b]' --interface sat -w 'Reallocated_Sector_Ct=2,Reallocated_Event_Count=2'
OK: [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean|

sudo /usr/lib/nagios/plugins/check_smart.pl --global '/dev/sd[a-b]' --interface sat --warn 'Reallocated_Sector_Ct=2,Reallocated_Event_Count=2'
check_smart.pl v6.7.0
The monitoring plugins come with ABSOLUTELY NO WARRANTY. You may redistribute
copies of the plugins under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING.
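
The second call prints the version banner instead of a check result, which suggests the long spelling is not handled like the short one. As a hedged illustration of the intended behaviour, this Python argparse sketch (the plugin itself is Perl) registers both spellings on the same option. The parser name and the dest names are assumptions.

```python
import argparse


def build_parser():
    """Register short and long spellings on the same destination."""
    parser = argparse.ArgumentParser(prog="check_smart_sketch")
    # "global" is a Python keyword, so store it under another name.
    parser.add_argument("-g", "--global", dest="global_pattern")
    parser.add_argument("-i", "--interface")
    parser.add_argument("-w", "--warn", default="")
    return parser


parser = build_parser()
short = parser.parse_args(["--global", "/dev/sd[a-b]", "--interface", "sat",
                           "-w", "Reallocated_Sector_Ct=2"])
long_ = parser.parse_args(["--global", "/dev/sd[a-b]", "--interface", "sat",
                           "--warn", "Reallocated_Sector_Ct=2"])
print(short.warn == long_.warn)
```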

-g doesn't work for -i sat

This is a regression; it used to work in 5.10. In the recent version I no longer see "sat" in the list of supported interfaces, yet the check doesn't report any error:

sudo /usr/lib/nagios/plugins/check_smart.pl -g /dev/sd[a-b] -i sat
OK: [/dev/sda] - Device is clean|

There's also sdb though:

sudo /usr/lib/nagios/plugins/check_smart.pl -d /dev/sdb -i sat
OK: Drive  ST2000VN004-2E4164[...]

Changing "sat" to either "ata" or "auto" leads to the same result: only the status of the first drive is reported.

Unable to match /dev/sd[a-z][a-z]* with -g

Hello,
I am unable to match all devices on a big disk node which has devices from /dev/sda to /dev/sdcc.
When I use "/dev/sd[a-z]" I get only /dev/sda to /dev/sdz. When I use "/dev/sd[a-z][^0-9]" I get /dev/sda1 to /dev/sdz9, which are only partitions. With "/dev/sd[a-z][a-z]" I get /dev/sdaa to /dev/sdcc, including partitions.
Any attempt with counts such as "{1,2}" or "+" results in "Could not find any valid block/character special device for pattern".
S
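
Shell-style globs have no repetition counts, which is why "{1,2}" and "+" cannot work as a glob. One possible workaround, sketched here in Python rather than the plugin's Perl, is to collect candidates with a broad glob and filter them with a regular expression. The function name select_whole_disks and the candidate list are assumptions for illustration.

```python
import re


def select_whole_disks(candidates, pattern=r"/dev/sd[a-z]{1,2}"):
    """Keep names that match the pattern completely, so partitions are dropped."""
    regex = re.compile(pattern)
    matches = (c for c in candidates if regex.fullmatch(c))
    # Sort by length first so /dev/sdz comes before /dev/sdaa.
    return sorted(matches, key=lambda c: (len(c), c))


# Hypothetical candidate list; on a real host it could come from glob.glob("/dev/sd*").
names = ["/dev/sdaa1", "/dev/sdcc", "/dev/sda", "/dev/sda1", "/dev/sdz", "/dev/sdaa"]
print(select_whole_disks(names))
```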

add aacraid

Hi, can you add support for SAS disks on Adaptec RAID controllers?

Here is the smartctl command line:
smartctl -a /dev/sda -d aacraid,0,0,1
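
As a rough sketch of what accepting this value could involve, the following Python snippet (the plugin itself is Perl) validates interface strings with a regular expression. The set of names is limited to interface types mentioned on this page and is an assumption, not the plugin's actual list. The aacraid form follows the H,L,ID triple used in the command above.

```python
import re

# Interface types mentioned on this page, plus the requested aacraid,H,L,ID form.
INTERFACE_RE = re.compile(
    r"(?:auto|ata|sat|scsi|nvme"
    r"|megaraid,\d+"
    r"|cciss,\d+"
    r"|aacraid,\d+,\d+,\d+)"
)


def is_supported_interface(value):
    """Return True if the whole string is one of the accepted interface forms."""
    return INTERFACE_RE.fullmatch(value) is not None


print(is_supported_interface("aacraid,0,0,1"))
print(is_supported_interface("aacraid,0"))
```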

Intel ssd wearout not reported when almost dead

Similar to #73. The disk is failing now but is not reported as critical.
The SMART output is:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1668
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       2
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2617 (2 65535)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error_Count  0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Drive_Temperature       0x0022   071   063   000    Old_age   Always       -       29 (Min/Max 19/38)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       29
197 Pending_Sector_Count    0x0012   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       7005511
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       8396
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       1
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       100130
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   092   092   000    Old_age   Always       -       0
234 Thermal_Throttle_Status 0x0032   100   100   000    Old_age   Always       -       0/0
235 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2617 (2 65535)
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       7005511
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       71050
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       8255662

The disk info

=== START OF INFORMATION SECTION ===
Model Family:     Intel S4510/S4610/S4500/S4600 Series SSDs
Device Model:     INTEL SSDSC2KB240G8
Serial Number:    :)
LU WWN Device Id: 5 5cd2e4 151dfac3f
Firmware Version: XCV10110
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 16 14:54:42 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Output for plugin

./check_smart.pl -l -i auto -g '/dev/sd*[a-z]'
OK: [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean|
./check_smart.pl -v
check_smart.pl v6.13.0

Unable to check "virtual blockdevices" eg /dev/bus/0

Hi,
smartctl allows checking SMART values on MegaRAID through non-existent (virtual) block devices such as /dev/bus/X instead of /dev/sda (which may change), but doing the same with check_smart fails.

# smartctl  --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
/dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device

These devices are not allowed for check_smart at the moment:

# /usr/local/sbin/check_smart -d /dev/bus/0 -i 'megaraid,0' 
Could not find any valid block/character special device for device /dev/bus/0  !

After disabling the block device check ( if (-b $opt_dl || -c $opt_dl) ), everything works again.
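
Rather than dropping the check entirely, one option would be to keep it and add an exception for the virtual bus paths. This is a Python sketch of that idea, not the plugin's Perl code. The function name is_checkable_device and the rule "only for megaraid interfaces" are assumptions.

```python
import os
import re
import stat

# Virtual MegaRAID bus paths as printed by "smartctl --scan" above.
VIRTUAL_BUS_RE = re.compile(r"/dev/bus/\d+")


def is_checkable_device(path, interface):
    """Accept real block/character devices, plus /dev/bus/N for megaraid."""
    if interface.startswith("megaraid,") and VIRTUAL_BUS_RE.fullmatch(path):
        return True
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False  # path does not exist or cannot be inspected
    return stat.S_ISBLK(mode) or stat.S_ISCHR(mode)


print(is_checkable_device("/dev/bus/0", "megaraid,0"))
print(is_checkable_device("/nonexistent/disk", "sat"))
```

This keeps the protection against typos in ordinary device paths while letting the stable /dev/bus/N names through.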

Percent_Lifetime_Remain usage

Hello,

I'm not sure I understand how to deal with Percent_Lifetime_Remain. It seems the threshold is working upside down:

./check_smart.pl --device=/dev/sdb --interface=megaraid,14 --selftest --ssd-lifetime --warn Percent_Lifetime_Remain=2
WARNING: Drive  CT120BX300SSD1 S/N 1745E10657F2:  Percent_Lifetime_Remain is non-zero (4), |Raw_Read_Error_Rate=0 Reallocate_NAND_Blk_Cnt=0 Power_On_Hours=31882 Power_Cycle_Count=201 Program_Fail_Count=0 Erase_Fail_Count=0 Ave_Block-Erase_Count=137 Unexpect_Power_Loss_Ct=89 Unused_Reserve_NAND_Blk=44 SATA_Interfac_Downshift=0 Error_Correction_Count=0 Reported_Uncorrect=0 Temperature_Celsius=24 Reallocated_Event_Count=0 Current_Pending_ECC_Cnt=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=154 Percent_Lifetime_Remain=4 Write_Error_Rate=0 Success_RAIN_Recov_Cnt=0 Total_LBAs_Written=7218163800 Host_Program_Page_Count=85117225 FTL_Program_Page_Count=131236333

 ./check_smart.pl --device=/dev/sdb --interface=megaraid,14 --selftest --ssd-lifetime --warn Percent_Lifetime_Remain=10
OK: Drive  CT120BX300SSD1 S/N 1745E10657F2: no SMART errors detected.  Percent_Lifetime_Remain is non-zero (4) (but less than threshold 10)|Raw_Read_Error_Rate=0 Reallocate_NAND_Blk_Cnt=0 Power_On_Hours=31882 Power_Cycle_Count=201 Program_Fail_Count=0 Erase_Fail_Count=0 Ave_Block-Erase_Count=137 Unexpect_Power_Loss_Ct=89 Unused_Reserve_NAND_Blk=44 SATA_Interfac_Downshift=0 Error_Correction_Count=0 Reported_Uncorrect=0 Temperature_Celsius=24 Reallocated_Event_Count=0 Current_Pending_ECC_Cnt=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=154 Percent_Lifetime_Remain=4 Write_Error_Rate=0 Success_RAIN_Recov_Cnt=0 Total_LBAs_Written=7218163800 Host_Program_Page_Count=85117225 FTL_Program_Page_Count=131236338

It also seems there is no support for regular Nagios thresholds using a colon. Can you help me with that?

Regards, Adam.
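
For reference, the colon notation asked about is the range format from the Monitoring Plugins guidelines, where "10:" alerts when the value drops below 10 and a bare "10" alerts when it is above 10 or below 0. The Python sketch below shows how such a range could be evaluated. It is an illustration only, and the function names are hypothetical rather than part of check_smart.pl.

```python
def parse_range(spec):
    """Parse '[@]start:end' into (start, end, alert_inside)."""
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        start_s, end_s = spec.split(":", 1)
    else:
        start_s, end_s = "0", spec  # bare number means the range 0..N
    start = float("-inf") if start_s == "~" else float(start_s or 0)
    end = float("inf") if end_s == "" else float(end_s)
    return start, end, inside


def should_alert(value, spec):
    """Alert when value is outside the range, or inside it for '@' ranges."""
    start, end, inside = parse_range(spec)
    within = start <= value <= end
    return within if inside else not within


print(should_alert(4, "10:"))  # remaining life 4 is below 10
print(should_alert(4, "10"))   # 4 lies inside 0..10
```

With this notation, a "remaining" style attribute would use "10:" while a counter that should stay low would use a bare upper bound.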

-w warning option never results in a warning

The warning option never seems to produce an actual WARNING result.
In the example below, Temperature_Celsius and Power-Off_Retract_Count both have higher values than the supplied warning thresholds, but the result is still: OK: no SMART errors detected.

./check_smart.pl -d /dev/sda -i auto -w 'Temperature_Celsius=40,Power-Off_Retract_Count=45' --debug

Found /dev/sda
###########################################################
(debug) CHECK 1: getting overall SMART health status for /dev/sda
###########################################################

(debug) executing:
sudo smartctl -d auto -H /dev/sda

(debug) output:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-18-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK

###########################################################
(debug) CHECK 2: getting silent SMART health check
###########################################################

(debug) executing:
sudo smartctl -d auto -q silent -A /dev/sda

(debug) exit code:
0

(debug) zero exit code, status OK

###########################################################
(debug) CHECK 3: getting detailed statistics
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing
###########################################################

(debug) executing:
sudo smartctl -d auto -A /dev/sda

(debug) output:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-18-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 179 138 021 Pre-fail Always - 10008
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 85
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 068 068 000 Old_age Always - 23908
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 82
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 48
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 37
194 Temperature_Celsius 0x0022 105 091 000 Old_age Always - 47
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

(debug) Raw Check List: Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block
(debug) Exclude List:
(debug) Warning Thresholds:
Power-Off_Retract_Count=45
Temperature_Celsius=40

(debug) Raw_Read_Error_Rate not in raw check list (raw value: 0)

(debug) Spin_Up_Time not in raw check list (raw value: 10008)

(debug) Start_Stop_Count not in raw check list (raw value: 85)

(debug) Reallocated_Sector_Ct is OK (0)

(debug) Seek_Error_Rate not in raw check list (raw value: 0)

(debug) Power_On_Hours not in raw check list (raw value: 23908)

(debug) Spin_Retry_Count not in raw check list (raw value: 0)

(debug) Calibration_Retry_Count not in raw check list (raw value: 0)

(debug) Power_Cycle_Count not in raw check list (raw value: 82)

(debug) Power-Off_Retract_Count not in raw check list (raw value: 48)

(debug) Load_Cycle_Count not in raw check list (raw value: 37)

(debug) Temperature_Celsius not in raw check list (raw value: 47)

(debug) Reallocated_Event_Count not in raw check list (raw value: 0)

(debug) Current_Pending_Sector is OK (0)

(debug) Offline_Uncorrectable is OK (0)

(debug) UDMA_CRC_Error_Count not in raw check list (raw value: 0)

(debug) Multi_Zone_Error_Rate not in raw check list (raw value: 0)

(debug) gathered perfdata:
Raw_Read_Error_Rate=0 Spin_Up_Time=10008 Start_Stop_Count=85 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=23908 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=82 Power-Off_Retract_Count=48 Load_Cycle_Count=37 Temperature_Celsius=47 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0

###########################################################
(debug) LOCAL STATUS: OK, FINAL STATUS: OK
###########################################################

(debug) final status/output: OK
OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Spin_Up_Time=10008 Start_Stop_Count=85 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=23908 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=82 Power-Off_Retract_Count=48 Load_Cycle_Count=37 Temperature_Celsius=47 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0
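
In the debug output the thresholds are parsed, yet both attributes are reported as "not in raw check list", which suggests thresholds are only consulted for attributes on that list. The Python sketch below shows the behaviour the reporter expects, where thresholds apply to any attribute. It is not the plugin's Perl code. The function names are hypothetical, and the "greater than or equal" comparison is an assumption.

```python
def parse_thresholds(spec):
    """Turn 'Name=limit,Name=limit' into a dict of integer limits."""
    thresholds = {}
    for item in filter(None, (part.strip() for part in spec.split(","))):
        name, _, limit = item.partition("=")
        thresholds[name] = int(limit)
    return thresholds


def exceeded(raw_values, thresholds):
    """Return sorted (name, value, limit) for every attribute at or above its limit."""
    return sorted(
        (name, raw_values[name], limit)
        for name, limit in thresholds.items()
        if name in raw_values and raw_values[name] >= limit
    )


# Raw values and thresholds taken from the debug output in this issue.
raw = {"Temperature_Celsius": 47, "Power-Off_Retract_Count": 48, "Reallocated_Sector_Ct": 0}
limits = parse_thresholds("Temperature_Celsius=40,Power-Off_Retract_Count=45")
print(exceeded(raw, limits))
```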

Add TBW calculations for end of life prediction in SSDs

Now having had a few experiences (https://www.claudiokuenzler.com/blog/1163/western-digital-green-ssd-dead-without-pre-fail-indication) with SSDs dying out of the blue, I think we need to add another sub-check to check_smart.
For Solid State Drives the TBW limit seems to be a good (?) indicator when the predicted "safe life time" has ended.

The bulk of the work is certainly creating something like a table with a general overview of the TBW warranty limits of the different vendors and SSD models, unless such a table already exists somewhere. Hints are welcome!
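
The arithmetic itself is simple once the written volume is known. The sketch below is Python, not plugin code. It converts two counters that appear in SMART reports elsewhere on this page (Host_Writes_32MiB=7005511 and Total_LBAs_Written=7218163800) into terabytes. The 512-byte sector size is an assumption, since the unit of Total_LBAs_Written varies by vendor, and rated_tbw is a hypothetical input that would have to come from the vendor table discussed above.

```python
MIB = 1024 ** 2


def terabytes_written_from_32mib_units(units):
    """Convert a counter in 32 MiB units to decimal terabytes."""
    return units * 32 * MIB / 1e12


def terabytes_written_from_lbas(lbas, sector_size=512):
    """Convert an LBA counter to decimal terabytes (sector size is an assumption)."""
    return lbas * sector_size / 1e12


def tbw_used_percent(tb_written, rated_tbw):
    """Share of the rated TBW already consumed, in percent."""
    return 100.0 * tb_written / rated_tbw


print(round(terabytes_written_from_32mib_units(7005511), 2))
print(round(terabytes_written_from_lbas(7218163800), 2))
```

A sub-check could then compare tbw_used_percent against warning and critical percentages, as long as a trustworthy rated value exists for the model.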

Attribute 190/194 with Seagate HDDs

Hey,

would it be possible to manually ignore certain attributes in check_smart.pl?
For Seagate HDDs, attributes 190 and 194 seem to stay flagged once they have failed in the past:

190 Airflow_Temperature_Cel 0x0022 058 044 045 Old_age Always In_the_past 42 (0 37 56 24 0)

This results in check_smart.pl returning a WARNING, though the value isn't bad anymore.
Instead of adding a separate parameter, perhaps ignoring the combination of attribute 190 + Old_age + In_the_past as a warning would suffice?

Thanks
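
The suggested rule can be sketched as a filter on parsed attribute lines. This is Python for illustration, not the plugin's Perl code. The function names and the default ID list are assumptions, and the sample line is the one quoted in this issue.

```python
def parse_attribute_line(line):
    """Split one smartctl -A attribute row into named fields."""
    keys = ("id", "name", "flag", "value", "worst", "thresh",
            "type", "updated", "when_failed", "raw_value")
    # RAW_VALUE may contain spaces, so split at most 9 times.
    return dict(zip(keys, line.split(None, 9)))


def is_past_only_failure(attr, ignored_ids=("190", "194")):
    """True for an Old_age attribute in ignored_ids that only failed in the past."""
    return (attr["id"] in ignored_ids
            and attr["type"] == "Old_age"
            and attr["when_failed"] == "In_the_past")


line = "190 Airflow_Temperature_Cel 0x0022 058 044 045 Old_age Always In_the_past 42 (0 37 56 24 0)"
attr = parse_attribute_line(line)
print(attr["raw_value"], is_past_only_failure(attr))
```

Attributes matching the filter could be skipped or downgraded, while Pre-fail attributes and currently failing ones would still raise a warning.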

Output issue

Hi,

I just ran the script on one of my machines and saw that the statuses of several disks were not shown. Any idea why this happens?

./check_smart.pl -g /dev/sd -i sat

WARNING: [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean --- [/dev/sdc] - Device is clean --- [/dev/sdd] - Attribute Airflow_Temperature_Cel failed at In_the_past --- [/dev/sde] - Attribute Airflow_Temperature_Cel failed at In_the_past --- [/dev/sdf] - --- [/dev/sdg] - --- [/dev/sdh] - --- [/dev/sdi] - |

Status of disks F through I is not shown for some reason.
Here's the debug output:

./check_smart.pl -g /dev/sd -i sat --debug

Found /dev/sda
Found /dev/sdb
Found /dev/sdc
Found /dev/sdd
Found /dev/sde
Found /dev/sdf
Found /dev/sdg
Found /dev/sdh
Found /dev/sdi

(debug) CHECK 1: getting overall SMART health status for a

(debug) executing:
sudo smartctl -d sat -H /dev/sda

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK

(debug) CHECK 2: getting silent SMART health check

(debug) executing:
sudo smartctl -d sat -q silent -A /dev/sda

(debug) exit code:
0

(debug) zero exit code, status OK

(debug) CHECK 3: getting detailed statistics
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing

(debug) executing:
sudo smartctl -d sat -A /dev/sda

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1280
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 109
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0007 032 100 000 Pre-fail Always - 0
229 Halt_System/Flash_ID 0x0002 100 --- 000 Old_age Always - 0x00ecd514b674ecd5
232 Firmware_Version_Info 0x0002 100 --- 000 Old_age Always - 0x3039303331300802
233 ECC_Fail_Record 0x0002 100 --- 000 Old_age Always - 0x000102a11f07
234 Avg/Max_Erase_Ct 0x0002 100 --- 000 Old_age Always - 638/755
235 Good/Sys_Block_Ct 0x0002 100 --- 000 Old_age Always - 8176/532

(debug) gathered perfdata:

(debug) FINAL STATUS: OK

(debug) final status/output:

(debug) CHECK 1: getting overall SMART health status for b

(debug) executing:
sudo smartctl -d sat -H /dev/sdb

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK

(debug) CHECK 2: getting silent SMART health check

(debug) executing:
sudo smartctl -d sat -q silent -A /dev/sdb

(debug) exit code:
0

(debug) zero exit code, status OK

(debug) CHECK 3: getting detailed statistics
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing

(debug) executing:
sudo smartctl -d sat -A /dev/sdb

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 187485489
3 Spin_Up_Time 0x0003 087 087 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 85
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 078 060 030 Pre-fail Always - 4366723369
9 Power_On_Hours 0x0032 079 079 000 Old_age Always - 18404
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 85
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 086 086 000 Old_age Always - 14
190 Airflow_Temperature_Cel 0x0022 060 050 045 Old_age Always - 40 (Min/Max 25/41)
194 Temperature_Celsius 0x0022 040 050 000 Old_age Always - 40 (0 11 0 0)
195 Hardware_ECC_Recovered 0x001a 048 046 000 Old_age Always - 187485489
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

(debug) gathered perfdata:

(debug) FINAL STATUS: OK

(debug) final status/output:

(debug) CHECK 1: getting overall SMART health status for c

(debug) executing:
sudo smartctl -d sat -H /dev/sdc

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK

(debug) CHECK 2: getting silent SMART health check

(debug) executing:
sudo smartctl -d sat -q silent -A /dev/sdc

(debug) exit code:
0

(debug) zero exit code, status OK

(debug) CHECK 3: getting detailed statistics
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing

(debug) executing:
sudo smartctl -d sat -A /dev/sdc

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 158049008
3 Spin_Up_Time 0x0003 094 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 86
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 14117928
9 Power_On_Hours 0x0032 086 086 000 Old_age Always - 12375
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 43
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 060 052 045 Old_age Always - 40 (Min/Max 28/41)
194 Temperature_Celsius 0x0022 040 048 000 Old_age Always - 40 (0 18 0 0)
195 Hardware_ECC_Recovered 0x001a 021 021 000 Old_age Always - 158049008
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 47614007455932
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 2594868242
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 252744892

(debug) gathered perfdata:

(debug) FINAL STATUS: OK

(debug) final status/output:

(debug) CHECK 1: getting overall SMART health status for d

(debug) executing:
sudo smartctl -d sat -H /dev/sdd

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022 056 044 045 Old_age Always In_the_past 44 (0 5 45 29)

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK

(debug) CHECK 2: getting silent SMART health check

(debug) executing:
sudo smartctl -d sat -q silent -A /dev/sdd

(debug) exit code:
0

(debug) zero exit code, status OK

(debug) CHECK 3: getting detailed statistics
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing

(debug) executing:
sudo smartctl -d sat -A /dev/sdd

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 190982012
3 Spin_Up_Time 0x0003 087 087 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 84
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 062 060 030 Pre-fail Always - 176164150598
9 Power_On_Hours 0x0032 078 078 000 Old_age Always - 19723
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 84
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 099 099 000 Old_age Always - 1
190 Airflow_Temperature_Cel 0x0022 056 044 045 Old_age Always In_the_past 44 (0 5 45 29)
194 Temperature_Celsius 0x0022 044 056 000 Old_age Always - 44 (0 13 0 0)
195 Hardware_ECC_Recovered 0x001a 055 049 000 Old_age Always - 190982012
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

(debug) parsed SMART attribute Airflow_Temperature_Cel with error condition:
In_the_past

(debug) gathered perfdata:

(debug) FINAL STATUS: OK

(debug) final status/output:

(debug) CHECK 1: getting overall SMART health status for e

(debug) executing:
sudo smartctl -d sat -H /dev/sde

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022 055 044 045 Old_age Always In_the_past 45 (0 6 45 29)

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK

(debug) CHECK 2: getting silent SMART health check

(debug) executing:
sudo smartctl -d sat -q silent -A /dev/sde

(debug) exit code:
0

(debug) zero exit code, status OK

(debug) CHECK 3: getting detailed statistics
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing

(debug) executing:
sudo smartctl -d sat -A /dev/sde

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 200544731
3 Spin_Up_Time 0x0003 087 087 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 81
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 17231581050
9 Power_On_Hours 0x0032 080 080 000 Old_age Always - 17939
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 1
12 Power_Cycle_Count 0x0032 100 037 020 Old_age Always - 81
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 055 044 045 Old_age Always In_the_past 45 (0 6 45 29)
194 Temperature_Celsius 0x0022 045 056 000 Old_age Always - 45 (0 13 0 0)
195 Hardware_ECC_Recovered 0x001a 057 049 000 Old_age Always - 200544731
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

(debug) parsed SMART attribute Airflow_Temperature_Cel with error condition:
In_the_past

(debug) gathered perfdata:

(debug) FINAL STATUS: OK

(debug) final status/output:

(debug) CHECK 1: getting overall SMART health status for f

(debug) executing:
sudo smartctl -d sat -H /dev/sdf

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK

(debug) CHECK 2: getting silent SMART health check

(debug) executing:
sudo smartctl -d sat -q silent -A /dev/sdf

(debug) exit code:
0

(debug) zero exit code, status OK

(debug) CHECK 3: getting detailed statistics
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing

(debug) executing:
sudo smartctl -d sat -A /dev/sdf

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 099 006 Pre-fail Always - 705104
3 Spin_Up_Time 0x0003 076 076 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 31
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 081 060 030 Pre-fail Always - 125637146
9 Power_On_Hours 0x0032 083 083 000 Old_age Always - 15180
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 31
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 2
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 065 057 045 Old_age Always - 35 (Min/Max 22/38)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 27
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 31
194 Temperature_Celsius 0x0022 035 043 000 Old_age Always - 35 (0 18 0 0)
195 Hardware_ECC_Recovered 0x001a 007 004 000 Old_age Always - 705104
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 164540197124940
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 391323627
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3691842656

(debug) gathered perfdata:

(debug) FINAL STATUS: OK

(debug) final status/output:

(debug) CHECK 1: getting overall SMART health status for g

(debug) executing:
sudo smartctl -d sat -H /dev/sdg

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK

(debug) CHECK 2: getting silent SMART health check

(debug) executing:
sudo smartctl -d sat -q silent -A /dev/sdg

(debug) exit code:
0

(debug) zero exit code, status OK

(debug) CHECK 3: getting detailed statistics
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing

(debug) executing:
sudo smartctl -d sat -A /dev/sdg

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 101 099 006 Pre-fail Always - 3386296
3 Spin_Up_Time 0x0003 076 076 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 15
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 081 060 030 Pre-fail Always - 138740700
9 Power_On_Hours 0x0032 083 083 000 Old_age Always - 15162
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 15
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 1
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 063 057 045 Old_age Always - 37 (Min/Max 24/40)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 11
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 15
194 Temperature_Celsius 0x0022 037 043 000 Old_age Always - 37 (0 20 0 0)
195 Hardware_ECC_Recovered 0x001a 006 004 000 Old_age Always - 3386296
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 39067022539578
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 1827489655
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3772235559

(debug) gathered perfdata:

(debug) FINAL STATUS: OK

(debug) final status/output:

(debug) CHECK 1: getting overall SMART health status for h

(debug) executing:
sudo smartctl -d sat -H /dev/sdh

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK

(debug) CHECK 2: getting silent SMART health check

(debug) executing:
sudo smartctl -d sat -q silent -A /dev/sdh

(debug) exit code:
0

(debug) zero exit code, status OK

(debug) CHECK 3: getting detailed statistics
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing

(debug) executing:
sudo smartctl -d sat -A /dev/sdh

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 099 099 006 Pre-fail Always - 318992
3 Spin_Up_Time 0x0003 075 075 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 23
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 082 060 030 Pre-fail Always - 177484905
9 Power_On_Hours 0x0032 083 083 000 Old_age Always - 15261
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 23
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 1
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 062 056 045 Old_age Always - 38 (Min/Max 25/40)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 19
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 23
194 Temperature_Celsius 0x0022 038 044 000 Old_age Always - 38 (0 18 0 0)
195 Hardware_ECC_Recovered 0x001a 006 004 000 Old_age Always - 318992
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 149958783155101
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 2884127008
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3349212466

(debug) gathered perfdata:

(debug) FINAL STATUS: OK

(debug) final status/output:

(debug) CHECK 1: getting overall SMART health status for i

(debug) executing:
sudo smartctl -d sat -H /dev/sdi

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK

(debug) CHECK 2: getting silent SMART health check

(debug) executing:
sudo smartctl -d sat -q silent -A /dev/sdi

(debug) exit code:
0

(debug) zero exit code, status OK

(debug) CHECK 3: getting detailed statistics
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing

(debug) executing:
sudo smartctl -d sat -A /dev/sdi

(debug) output:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64](local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 102 099 006 Pre-fail Always - 4723120
3 Spin_Up_Time 0x0003 076 076 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 20
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 082 060 030 Pre-fail Always - 173274118
9 Power_On_Hours 0x0032 083 083 000 Old_age Always - 15165
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 20
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 099 099 000 Old_age Always - 1
190 Airflow_Temperature_Cel 0x0022 061 056 045 Old_age Always - 39 (Min/Max 26/41)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 16
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 20
194 Temperature_Celsius 0x0022 039 044 000 Old_age Always - 39 (0 19 0 0)
195 Hardware_ECC_Recovered 0x001a 007 003 000 Old_age Always - 4723120
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 4831838223164
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 2215541345
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3072043202

(debug) gathered perfdata:

(debug) FINAL STATUS: OK

(debug) final status/output:
WARNING: [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean --- [/dev/sdc] - Device is clean --- [/dev/sdd] - Attribute Airflow_Temperature_Cel failed at In_the_past --- [/dev/sde] - Attribute Airflow_Temperature_Cel failed at In_the_past --- [/dev/sdf] - --- [/dev/sdg] - --- [/dev/sdh] - --- [/dev/sdi] - |
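
The attribute tables printed by CHECK 3 in the debug output above have ten whitespace-separated columns, with the raw value last. As a rough illustration only (this is a Python sketch, not the plugin's Perl code, and the function name `parse_attributes` is hypothetical), rows like these could be turned into a lookup table as follows:

```python
def parse_attributes(output):
    """Parse `smartctl -A` table rows into a dict keyed by attribute name."""
    attrs = {}
    for line in output.splitlines():
        # Split into at most 10 fields so that a raw value such as
        # "35 (Min/Max 22/38)" stays in one piece.
        fields = line.split(None, 9)
        # Attribute rows have 10 columns and start with a numeric ID;
        # banner, header and blank lines are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "when_failed": fields[8],
            "raw": fields[9],
        }
    return attrs


# Two rows copied from the /dev/sdf output above, plus the header line
sample = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 2
190 Airflow_Temperature_Cel 0x0022 065 057 045 Old_age Always - 35 (Min/Max 22/38)
"""

attrs = parse_attributes(sample)
for name, a in attrs.items():
    print(name, a["value"], a["thresh"], a["when_failed"], a["raw"])
```

Splitting on runs of whitespace also works for the column-aligned layout that smartctl prints on a terminal.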

Add attribute 188 Command_Timeout to raw check list

According to Backblaze's statistics (https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/), the following five SMART attributes are helpful indicators of a failing drive:

5 (Reallocated Sectors Count)
187 (Reported Uncorrectable Errors)
188 (Command Timeout)
197 (Current Pending Sector Count)
198 (Uncorrectable Sector Count)

The current raw list (hard-coded in the plugin; it can be overridden with the -r parameter) checks the raw values of the following attributes:

https://github.com/Napsty/check_smart/blob/master/check_smart.pl#L183

my $raw_check_list = $opt_r // 'Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count';

Attribute 188 Command_Timeout is missing here and should be added.
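
To illustrate what the change would mean in practice, here is a minimal Python sketch (not the plugin's Perl code). It assumes that an attribute on the raw list is flagged whenever its raw value exceeds a limit of 0; the helper `raw_warnings` and its `limit` parameter are hypothetical names used only for this example. The raw values are taken from the /dev/sdf debug output above.

```python
# Default list quoted above, with the proposed Command_Timeout entry appended
RAW_CHECK_LIST = (
    "Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,"
    "Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,"
    "Reported_Uncorrect,Reallocated_Event_Count,Command_Timeout"
).split(",")


def raw_warnings(raw_values, check_list=RAW_CHECK_LIST, limit=0):
    """Return a message for each listed attribute whose raw value exceeds `limit`.

    Assumption: attributes missing from `raw_values` are treated as 0.
    """
    return [
        f"{name} raw value {raw_values[name]} exceeds {limit}"
        for name in check_list
        if raw_values.get(name, 0) > limit
    ]


# Raw values from the /dev/sdf attribute table in the debug output
sdf = {
    "Reallocated_Sector_Ct": 0,
    "Runtime_Bad_Block": 0,
    "Reported_Uncorrect": 0,
    "Command_Timeout": 2,
    "Current_Pending_Sector": 0,
    "Offline_Uncorrectable": 0,
}

print(raw_warnings(sdf))
```

Under that assumption, /dev/sdf (raw value 2) would be flagged once Command_Timeout is on the list, as would /dev/sdg and /dev/sdh (raw value 1 each in the output above), so a per-attribute threshold may be worth considering alongside the change.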
