Hard-drive errors

My /home file system is JFS. It has switched to read-only mode several times already, so I had to reboot or remount it. I saw this in /var/log/messages:

Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925711] ata2.00: configured for UDMA/133
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925755] sd 1:0:0:0: [sda] Unhandled sense code
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925759] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925763] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925770]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925778]         0e 5a b2 b8 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925782] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925785] sd 1:0:0:0: [sda] CDB: 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925815] sd 1:0:0:0: [sda] Unhandled sense code
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925817] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925820] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925825]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925833]         00 00 00 00 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925836] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925839] sd 1:0:0:0: [sda] CDB: 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925863] sd 1:0:0:0: [sda] Unhandled sense code
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925865] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925868] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925872]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925879]         00 00 00 00 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925882] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925885] sd 1:0:0:0: [sda] CDB: 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925908] ata2: EH complete

And smartctl -a /dev/sda gave me this:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   174   021    Pre-fail  Always       -       2008
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1005
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13675
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       998
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       810861
194 Temperature_Celsius     0x0022   106   091   000    Old_age   Always       -       41
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Hard-drive model:

Model Family:     Western Digital Scorpio Blue Serial ATA (Adv. Format)
Device Model:     WDC WD7500BPVT-24HXZT3
Serial Number:    WD-WX91A91R4010
LU WWN Device Id: 5 0014ee 601b831c9
Firmware Version: 03.01A03

Upd: I started another self-test (the first one I did several months ago) and got some updates:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     13680         229857912
# 2  Extended offline    Completed without error       00%      9661         -
# 3  Extended offline    Completed: read failure       90%      9654         96004576
# 4  Extended offline    Completed: read failure       90%      9653         96004576

Lines #2 to #4 were already there before.
I followed these guides: Badblock HOWTO and Debug the Filesystem. It seems the block is no longer reported as erroneous, but the reallocated sector counts have not increased either. The only thing that has increased is Raw_Read_Error_Rate, after I wrote zeros to the bad block.
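
For reference, the LBA-to-block conversion from the Badblock HOWTO, as I understand it, is b = (int)((L - S) * 512 / B). Below is a minimal sketch of that arithmetic. The partition start sector is a hypothetical placeholder (the real one comes from fdisk -lu), and the 4096-byte file system block size is an assumption.

```python
def lba_to_fs_block(lba, partition_start, fs_block_size=4096, sector_size=512):
    """Map a disk LBA to a file system block number inside one partition.

    lba             -- failing LBA as reported by smartctl (512-byte sectors)
    partition_start -- first sector of the partition (from fdisk -lu)
    fs_block_size   -- file system block size in bytes (assumed 4096 here)
    """
    if lba < partition_start:
        raise ValueError("LBA lies before the start of the partition")
    # Integer division mirrors the (int) cast in the HOWTO formula.
    return (lba - partition_start) * sector_size // fs_block_size


FAILING_LBA = 229857912        # LBA_of_first_error from self-test #1
PARTITION_START = 2048         # hypothetical value, NOT taken from my disk

print(lba_to_fs_block(FAILING_LBA, PARTITION_START))
```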

The question is: should I consider ordering a new hard drive?

Best Answer

From the smartctl man page:

The Attribute table printed out by smartctl also shows the "TYPE" of the Attribute. Attributes are one of two possible types: Pre-failure or Old age. Pre-failure Attributes are ones which, if less than or equal to their threshold values, indicate pending disk failure. Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wearout, if the Attribute value is less than or equal to the threshold. Please note: the fact that an Attribute is of type 'Pre-fail' does not mean that your disk is about to fail! It only has this meaning if the Attribute's current Normalized value is less than or equal to the threshold value.

If the Attribute's current Normalized value is less than or equal to the threshold value, then the "WHEN_FAILED" column will display "FAILING_NOW". If not, but the worst recorded value is less than or equal to the threshold value, then this column will display "In_the_past". If the "WHEN_FAILED" column has no entry (indicated by a dash: '-') then this Attribute is OK now (not failing) and has also never failed in the past.
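
The rule in the two quoted paragraphs can be sketched in a few lines. This is only an illustration of the comparison, not how smartctl itself is implemented. It assumes the column layout shown in the question, and the helper name classify is made up for the example.

```python
def classify(attribute_line):
    """Return (name, when_failed) for one row of the smartctl attribute table.

    Columns assumed: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
    """
    fields = attribute_line.split()
    name = fields[1]
    value, worst, thresh = (int(x) for x in fields[3:6])
    if value <= thresh:
        when_failed = "FAILING_NOW"
    elif worst <= thresh:
        when_failed = "In_the_past"
    else:
        when_failed = "-"
    return name, when_failed


# Rows copied from the question's output.
ROWS = [
    "  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0",
    "193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       810861",
    "197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1",
]

for row in ROWS:
    print(classify(row))
```

Note that the comparison only looks at the normalized values. A non-zero raw value, such as the 1 for Current_Pending_Sector, does not show up in it.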

So according to the smartctl output section you have posted, your drive actually looks in good shape. However, that doesn't necessarily mean that there is not another problem.

Unfortunately the Unhandled sense code message does mean that something went wrong, but the kernel doesn't know what. You could try looking at the rest of the smartctl output to see if there is anything wrong. There should be a part that summarises the drive's overall health. You can get it on its own with the -H option.
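
That said, the hex dump printed next to the message can be decoded by hand. Here is a minimal sketch, assuming the 20 bytes in the log are SCSI descriptor-format sense data (response code 0x72) followed by a single information descriptor. The lookup tables are deliberately tiny and only cover what is needed here.

```python
# Sense bytes copied from the kernel log (the two dump lines joined).
SENSE_HEX = "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 0e 5a b2 b8"

# Partial tables, just enough for this example.
SENSE_KEYS = {0x1: "Recovered Error", 0x3: "Medium Error", 0x4: "Hardware Error"}
ASC_ASCQ = {(0x11, 0x04): "Unrecovered read error - auto reallocate failed"}


def decode_descriptor_sense(hex_dump):
    """Decode descriptor-format sense data into key, ASC/ASCQ and LBA."""
    data = bytes.fromhex(hex_dump)
    if data[0] & 0x7F not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    result = {"sense_key": data[1] & 0x0F, "asc": data[2], "ascq": data[3], "lba": None}
    pos, end = 8, min(8 + data[7], len(data))   # byte 7 = additional sense length
    while pos + 2 <= end:
        dtype, dlen = data[pos], data[pos + 1]
        # Type 0x00 is the information descriptor; bit 0x80 marks it as valid.
        if dtype == 0x00 and dlen == 0x0A and data[pos + 2] & 0x80:
            result["lba"] = int.from_bytes(data[pos + 4:pos + 12], "big")
        pos += 2 + dlen
    return result


sense = decode_descriptor_sense(SENSE_HEX)
print(SENSE_KEYS.get(sense["sense_key"], "unknown"),
      ASC_ASCQ.get((sense["asc"], sense["ascq"]), "unknown"),
      sense["lba"])
```

For the bytes in the question this gives sense key 3 (Medium Error) with ASC/ASCQ 11h/04h, which would be consistent with a failed read from the medium rather than a loose cable.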

If the drive supports self testing, you can start one with:

smartctl -t long /dev/sda

This starts the test in the background, so you will have to keep checking for results. If the drive is not mounted, you can add the -C option to enable captive mode, which should take less time. A short test is also possible, but less thorough.

It is also a good idea to check the physical connectors etc. to make sure nothing has come loose. It's an easy fix if it has.

Update

Wikipedia has a good reference for SMART attributes. Note that the 'Better' column refers to the raw values in the rightmost column of the output and not the normalised value at the start. Here is the part on 'Current Pending Sector' mentioned by frostschutz:

Count of "unstable" sectors (waiting to be remapped, because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and also it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it's written. However some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector and if the write operation is successful then the sector will be marked good (in this case, the "Reallocation Event Count" (0xC4) will not be increased). This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors.

Related Solutions

Does the hard-drive have bad sectors or not

Your disk had some problems reading data from the surface, but it seems that the disk dealt with it. I had a similar situation:

Error 29 occurred at disk power-on lifetime: 18836 hours (784 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 00 40 37 e6  Error: UNC 8 sectors at LBA = 0x06374000 = 104284160

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 00 40 37 e6 08      03:39:32.447  READ DMA
  c8 00 08 f8 3f 37 e6 08      03:39:32.447  READ DMA
  c8 00 08 f0 3f 37 e6 08      03:39:32.447  READ DMA
  c8 00 08 e8 3f 37 e6 08      03:39:32.447  READ DMA
  c8 00 08 e0 3f 37 e6 08      03:39:32.447  READ DMA

And when I ran a self-test, I got:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 7  Short offline       Completed: read failure       90%     18845         104284160

Ultimately, I managed to unblock the sectors, and after running the extended test, which scans the whole surface, I got the following result:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 3  Extended offline    Completed without error       00%     18858         -

If there were bad blocks, they could be observed in the table under:

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

In your case, there's no indication of bad sectors because the extended test was performed (11746 h) after the last error occurred (11706 h). So, you can sleep peacefully. :)
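
The check I am describing (was there a clean extended test after the last error?) can be sketched as below. This is only an illustration under the assumption that the self-test log lines look like the ones quoted above. The function names are made up for the example.

```python
import re

# One row of the smartctl self-test log, e.g.
# "# 3  Extended offline    Completed without error       00%     18858         -"
ROW = re.compile(r"#\s*(\d+)\s+(.*?)\s{2,}(.*?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)")


def parse_selftest(line):
    """Return (description, status, lifetime_hours) for one log row."""
    match = ROW.search(line)
    if match is None:
        raise ValueError("unrecognised self-test row: %r" % line)
    _num, description, status, _remaining, hours, _lba = match.groups()
    return description, status, int(hours)


def clean_extended_after(log_lines, last_error_hour):
    """True if an extended test finished without error after last_error_hour."""
    for line in log_lines:
        description, status, hours = parse_selftest(line)
        if (description == "Extended offline"
                and status == "Completed without error"
                and hours > last_error_hour):
            return True
    return False


MY_LOG = [
    "# 3  Extended offline    Completed without error       00%     18858         -",
    "# 7  Short offline       Completed: read failure       90%     18845         104284160",
]
# The UNC error in my log happened at 18836 power-on hours.
print(clean_extended_after(MY_LOG, 18836))
```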

As I mentioned in the comments, there are two types of bad blocks. Here's a short summary of the difference between the two:

There are two types of bad sectors — often divided into “physical” and “logical” bad sectors or “hard” and “soft” bad sectors.

A physical — or hard — bad sector is a cluster of storage on the hard drive that’s physically damaged. The hard drive’s head may have touched that part of the hard drive and damaged it, some dust may have settled on that sector and ruined it, a solid-state drive’s flash memory cell may have worn out, or the hard drive may have had other defects or wear issues that caused the sector to become physically damaged. This type of sector cannot be repaired.

A logical — or soft — bad sector is a cluster of storage on the hard drive that appears to not be working properly. The operating system may have tried to read data on the hard drive from this sector and found that the error-correcting code (ECC) didn’t match the contents of the sector, which suggests that something is wrong. These may be marked as bad sectors, but can be repaired by overwriting the drive with zeros — or, in the old days, performing a low-level format. Windows’ Disk Check tool can also repair such bad sectors.

S.M.A.R.T. shows high Load_Cycle_Count | Why, and how to prevent the number from increasing

My findings so far:

The Cause

According to Western Digital and various websites 1, 2, 3, 4, 5, 6, the high number in S.M.A.R.T. attribute 193 Load_Cycle_Count is related to a technique introduced by Western Digital named Intellipark.
Intellipark is implemented in some of their hard drive models, especially in their Green series.
It is designed to reduce power consumption when the drive is not being used.
In some use cases, especially when combined with a Linux operating system, this Intellipark feature tends to shorten the HDD's life.
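
To get a feeling for how aggressive the parking is, the raw counters can be turned into a rate. This is a rough sketch using the Load_Cycle_Count and Power_On_Hours raw values from the smartctl table at the top of this page. It assumes the drive parked at a steady rate over its whole life, which is a simplification.

```python
def load_cycle_rate(load_cycles, power_on_hours):
    """Return (cycles per hour, average seconds between load cycles)."""
    if load_cycles <= 0 or power_on_hours <= 0:
        raise ValueError("both counters must be positive")
    per_hour = load_cycles / power_on_hours
    return per_hour, 3600.0 / per_hour


# Raw values of attributes 193 and 9 from the table above.
LOAD_CYCLE_COUNT = 810861
POWER_ON_HOURS = 13675

per_hour, seconds_between = load_cycle_rate(LOAD_CYCLE_COUNT, POWER_ON_HOURS)
print("%.1f cycles/hour, one every %.0f s" % (per_hour, seconds_between))
```

For those numbers the heads parked roughly once a minute on average.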

Solutions

Western Digital explains that it's not their feature's fault but a badly configured operating system, and they give some advice on how to properly configure Linux.
Western Digital also released a DOS tool to modify the Intellipark feature on some devices.
For the Linux platform, Christophe Bothamy released idle3-tools to modify the Intellipark feature. A big thank you from my side.
As mentioned in the comments below, hdparm -J can also modify the WD idle3 timer.

What I've done

I have now downloaded idle3ctl and turned off Intellipark completely. Hopefully this will keep the disks from failing quickly. In any case, at least one disk is almost dead, according to S.M.A.R.T.

To disable the Intellipark feature I followed the idle3-tools instructions.

First I read out the idle3 timer value of this Intellipark feature: sudo ./idle3ctl -g /dev/sdx

Then I disabled the timer: sudo ./idle3ctl -d /dev/sdx

A power off/on cycle is necessary for the change to take effect: sudo hdparm -Y /dev/sdx

After that I rechecked the idle3 timer and did the same after a reboot:

alex@silent-ssd:~/idle3tools/idle3-tools-0.9.1$ sudo ./idle3ctl -g /dev/sdd
Idle3 timer is disabled