We found some disks that are almost dead because the Wear_Leveling_Count normalized value is

Wear_Leveling_Count is not reported as CRIT when disk is almost dead (check_smart issue, closed)

napsty commented on May 30, 2024

from check_smart.

Comments (7)

Napsty commented on May 30, 2024

I found an interesting Samsung (official) document. https://image-us.samsung.com/SamsungUS/b2b/resource/2016/05/31/WHP-SSD-SSDSMARTATTRIBUTES-APR16J.pdf

The raw value of Wear Leveling Count reports the amount of NAND writes as a function of consumed P/E cycles, meaning that an increment of 1 corresponds to one full drive write. It should be noted that one full drive write in this context means the physical, raw NAND capacity of the drive, so in case of a 960GB SM863 for example, an increase of 1 in Wear Leveling Count translates to 1,024GiB of NAND writes.

This indicates that the Wear_Leveling_Count raw value is the number of full drive writes. So if you have a 500GB drive and the Wear_Leveling_Count raw value is at 19 (as in my case), this would mean that (roughly) 19 * 500GB has been written to the drive.

This can help you estimate the remaining lifetime (see the Samsung document for the formula), but it does not indicate a pending failure.
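The rough estimate above can be written as a small calculation. This is only a sketch: the function name is made up for illustration, and it uses the advertised capacity (500GB) rather than the raw NAND capacity that the Samsung document refers to, so the result is approximate.

```python
def estimated_nand_writes_gb(wear_leveling_raw: int, capacity_gb: int) -> int:
    """Approximate total NAND writes in GB: full drive writes * capacity.

    Assumption: capacity_gb is the advertised capacity, which is slightly
    lower than the raw NAND capacity the Samsung document is based on.
    """
    return wear_leveling_raw * capacity_gb


# Values from this thread: raw value 19 on a 500GB Samsung SSD 850 EVO.
total_gb = estimated_nand_writes_gb(19, 500)
print(f"~{total_gb / 1000:.1f} TB written")  # prints: ~9.5 TB written
```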

The document mentions the following attributes to be considered critical for drive health:

The four SMART attributes listed in the table below are the most important indicators of drive health. If any of the normalized values drop below the 10% threshold, it’s recommended to replace the drive as soon as possible because it’s approaching the end of its life and may become unreliable if used longer.

179 Unused Reserved Block Count (Used_Rsvd_Blk_Cnt_Tot)
181 Program Fail Count (Program_Fail_Cnt_Total) -> already part of default raw list
182 Erase Fail Count (Erase_Fail_Count_Total)
183 Runtime Bad Count (Runtime_Bad_Block) -> already part of default raw list

So I suggest adding Erase_Fail_Count_Total to the default raw list.
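The 10% rule quoted from the Samsung document could be sketched as follows. This is not how check_smart works (it evaluates raw values); the function name and the input dictionary of normalized values are assumptions for illustration.

```python
from typing import Dict, List

# The four attributes the Samsung document names as key health indicators.
CRITICAL_ATTRIBUTES = {
    179: "Used_Rsvd_Blk_Cnt_Tot",
    181: "Program_Fail_Cnt_Total",
    182: "Erase_Fail_Count_Total",
    183: "Runtime_Bad_Block",
}


def attributes_below_threshold(normalized: Dict[int, int], threshold: int = 10) -> List[str]:
    """Return the names of key attributes whose normalized value is below threshold.

    `normalized` maps attribute ID to its normalized value (hypothetical input);
    attributes that the drive does not report are skipped.
    """
    return [
        name
        for attr_id, name in CRITICAL_ATTRIBUTES.items()
        if attr_id in normalized and normalized[attr_id] < threshold
    ]


# Made-up values: only attribute 181 has dropped below 10.
print(attributes_below_threshold({179: 100, 181: 9, 182: 100, 183: 10}))
```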

Napsty commented on May 30, 2024

@pschonmann https://raw.githubusercontent.com/Napsty/check_smart/6.11.1/check_smart.pl now contains Erase_Fail_Count_Total in the default raw list.
This will be released in the next version, 6.12.0.

Napsty commented on May 30, 2024

Thanks for reporting this. In #36 I tried to determine which attributes should be added to the raw (check) list by default.
It seems that this attribute 177 Wear_Leveling_Count is not used by all SSD models. I can see this attribute on my Samsung (Samsung SSD 850 EVO 500GB) SSDs, but not on SanDisk or Western Digital SSDs.

Now the big question is whether this Wear_Leveling_Count attribute is really a strong/important indicator of pending drive failure. Do you have any official Samsung documentation at hand?

(A good comparison is the Total_Bad_Block attribute, which is used by some SSD models. The name itself sounds alarming yet the values can vary a lot, even for brand new drives, and they don't really show a pending failure).

Also in the superuser link you posted, someone mentions:

All of your drives are at between 95 and 100, and will eventually drop to 0.

I'm not sure that this is correct. The attribute counters shown by smartctl usually start at 0 and increase to 100. This can be seen in Crucial MX SSDs with the Percent_Lifetime_Remain attribute (see https://www.claudiokuenzler.com/blog/1077/when-is-solid-state-drive-ssd-dead-analysis-crucial-mx500-1tb for a detailed analysis). Although the name indicates "remain", the counter actually starts at 0. This could also be the case for the Wear_Leveling_Count attribute (TBV!).

In my own Samsung SSDs, I can see the following values:

ckadm@mintp ~ $ sudo smartctl -a /dev/sda | grep "Wear_Leveling_Count"
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       19
ckadm@mintp ~ $ sudo smartctl -a /dev/sdb | grep "Wear_Leveling_Count"
177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       22

I personally interpret this as 19% and 22% - which would still be largely OK if 100% is the assumed MAX value.
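For reference when reading these lines: in smartctl's attribute table the columns are ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED and RAW_VALUE, so the rows above carry both a normalized value (099, 098) and a raw value (19, 22). A minimal sketch for splitting such a row; the class and function names are hypothetical and not part of check_smart.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized value (VALUE column)
    worst: int   # lowest normalized value seen so far (WORST column)
    thresh: int  # vendor threshold (THRESH column)
    raw: str     # RAW_VALUE column, kept as text since some drives add extra fields


def parse_attribute_row(line: str) -> SmartAttribute:
    """Split one row of the smartctl attribute table into its columns."""
    fields = line.split(None, 9)
    if len(fields) < 10:
        raise ValueError(f"not a SMART attribute row: {line!r}")
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=fields[9].strip(),
    )


# Row copied from the /dev/sda output in this comment.
row = "177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       19"
attr = parse_attribute_row(row)
print(attr.value, attr.raw)  # prints: 99 19
```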

Now the big question is the following: Can we find proof somewhere that Wear_Leveling_Count is really an important pre-failure indicator? If yes, from which value on is it considered CRITICAL? Above 90? I see that your drive has a value of 2981, whatever this means.

Until this is discussed and solved, you can use the following workaround (append the attribute to the raw list):

$ ./check_smart.pl -r "Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Wear_Leveling_Count" -d /dev/sda -i ata
WARNING: Drive  Samsung SSD 850 EVO 500GB S/N XXX:  Wear_Leveling_Count is non-zero (19), |Reallocated_Sector_Ct=0 Power_On_Hours=19456 Power_Cycle_Count=435 Wear_Leveling_Count=19 Used_Rsvd_Blk_Cnt_Tot=0 Program_Fail_Cnt_Total=0 Erase_Fail_Count_Total=0 Runtime_Bad_Block=0 Uncorrectable_Error_Cnt=0 Airflow_Temperature_Cel=34 ECC_Error_Rate=0 CRC_Error_Count=0 POR_Recovery_Count=8 Total_LBAs_Written=24168613381

pschonmann commented on May 30, 2024

I found some info
https://web.archive.org/web/20150310051031/http://www.samsung.com/global/business/semiconductor/minisite/SSD/global/html/whitepaper/whitepaper07.html

This attribute represents the number of media program and erase operations (the number of times a block has been erased). This value is directly related to the lifetime of the SSD. The raw value of this attribute shows the total count of P/E Cycles.

SRC: https://newbedev.com/how-to-check-the-life-left-in-ssd-or-the-medium-s-wear-level

pschonmann commented on May 30, 2024

OK, now I run:

check_smart.pl -g '/dev/sd[a-z] /dev/sd[abc][a-z]' -i 'auto' -E Airflow_Temperature_Cel -w 'Reallocated_Sector_Ct=15,Current_Pending_Sector=100,Reallocated_Event_Count=100,Runtime_Bad_Block=100,Uncorrectable_Error_Cnt=100,Wear_Leveling_Count=300,Erase_Fail_Count_Total=1' --debug

But in the debug output I see:
(debug) Erase_Fail_Count_Total not in raw check list (raw value: 0)
Is that value monitored? I have no disk with a value > 0 to test (unfortunately :) )

EDIT:
Oh, I had the old version 6.9.0. I updated and it seems OK.

pschonmann commented on May 30, 2024

Thanks.
Would it be possible to monitor the normalized value of Wear_Leveling_Count? The normalized value decrements from 100 to 0. It would be good to be informed when only the last 10% remains and replacing the disk is recommended.

Napsty commented on May 30, 2024

Unfortunately we cannot. We can only read the raw values from smartctl. Unless you know how to?
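One possible approach, offered as a sketch rather than a tested patch: the attribute rows printed by smartctl include the normalized value in the VALUE column (the fourth field), so it could be compared against a threshold. The function name, the threshold of 10 (taken from the "last 10%" request above) and the second sample row are assumptions for illustration.

```python
def wear_level_status(attribute_row: str, warn_at: int = 10) -> str:
    """Return a Nagios-style status line based on the normalized VALUE column.

    Assumes `attribute_row` is one row of the smartctl attribute table
    (ID#, ATTRIBUTE_NAME, FLAG, VALUE, ...); warn_at is a hypothetical threshold.
    """
    fields = attribute_row.split()
    name = fields[1]
    normalized = int(fields[3])
    if normalized <= warn_at:
        return f"WARNING: {name} normalized value is {normalized}"
    return f"OK: {name} normalized value is {normalized}"


# First row is taken from this thread; the second is made up to show a worn drive.
row_ok = "177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       19"
row_worn = "177 Wear_Leveling_Count     0x0013   008   008   000    Pre-fail  Always       -       2981"

print(wear_level_status(row_ok))
print(wear_level_status(row_worn))
```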
