Hard-drive errors

Tags: hard-disk, logs, smartctl

My /home file system is JFS. It has switched to read-only mode several times already, so I had to reboot or remount it. I saw this in /var/log/messages:

Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925711] ata2.00: configured for UDMA/133
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925755] sd 1:0:0:0: [sda] Unhandled sense code
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925759] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925763] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925770]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925778]         0e 5a b2 b8 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925782] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925785] sd 1:0:0:0: [sda] CDB: 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925815] sd 1:0:0:0: [sda] Unhandled sense code
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925817] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925820] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925825]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925833]         00 00 00 00 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925836] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925839] sd 1:0:0:0: [sda] CDB: 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925863] sd 1:0:0:0: [sda] Unhandled sense code
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925865] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925868] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925872]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925879]         00 00 00 00 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925882] sd 1:0:0:0: [sda]  
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925885] sd 1:0:0:0: [sda] CDB: 
Dec 31 10:12:49 uvv-laptop-y570 kernel: [  983.925908] ata2: EH complete

And smartctl -a /dev/sda gave me this:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   174   021    Pre-fail  Always       -       2008
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1005
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13675
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       998
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       810861
194 Temperature_Celsius     0x0022   106   091   000    Old_age   Always       -       41
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Hard-drive model:

Model Family:     Western Digital Scorpio Blue Serial ATA (Adv. Format)
Device Model:     WDC WD7500BPVT-24HXZT3
Serial Number:    WD-WX91A91R4010
LU WWN Device Id: 5 0014ee 601b831c9
Firmware Version: 03.01A03

Update: I started another self-test (the earlier ones I ran several months ago) and got a new entry:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     13680         229857912
# 2  Extended offline    Completed without error       00%      9661         -
# 3  Extended offline    Completed: read failure       90%      9654         96004576
# 4  Extended offline    Completed: read failure       90%      9653         96004576

Lines #2 to #4 were already there before.
I followed these guides: Badblock HOWTO and Debug the Filesystem. The block is no longer reported as bad, but the reallocated sector count has not increased either. The only attribute that has increased is Raw_Read_Error_Rate, after I wrote zeros to the bad block.

The question is: should I consider ordering a new hard drive?

Best Answer

From the smartctl man page:

The Attribute table printed out by smartctl also shows the "TYPE" of the Attribute. Attributes are one of two possible types: Pre-failure or Old age. Pre-failure Attributes are ones which, if less than or equal to their threshold values, indicate pending disk failure. Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wearout, if the Attribute value is less than or equal to the threshold. Please note: the fact that an Attribute is of type 'Pre-fail' does not mean that your disk is about to fail! It only has this meaning if the Attribute´s current Normalized value is less than or equal to the threshold value.

If the Attribute´s current Normalized value is less than or equal to the threshold value, then the "WHEN_FAILED" column will display "FAILING_NOW". If not, but the worst recorded value is less than or equal to the threshold value, then this column will display "In_the_past". If the "WHEN_FAILED" column has no entry (indicated by a dash: ´-´) then this Attribute is OK now (not failing) and has also never failed in the past.
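The rule quoted above is simple enough to express in a few lines. The following is only a sketch of that rule as the man page describes it, not smartctl's actual implementation; the function name when_failed is made up for illustration, and the sample rows are copied from the table in the question.

```python
def when_failed(value: int, worst: int, thresh: int) -> str:
    """Apply the WHEN_FAILED rule as described in the quoted man page text."""
    if value <= thresh:
        return "FAILING_NOW"   # current normalized value is at or below threshold
    if worst <= thresh:
        return "In_the_past"   # only the worst recorded value crossed the threshold
    return "-"                 # never failed


# (name, VALUE, WORST, THRESH) taken from the smartctl output in the question
rows = [
    ("Raw_Read_Error_Rate", 200, 200, 51),
    ("Reallocated_Sector_Ct", 200, 200, 140),
    ("Load_Cycle_Count", 1, 1, 0),
]

for name, value, worst, thresh in rows:
    print(f"{name:24} {when_failed(value, worst, thresh)}")
```

Every row from the question comes out as a dash, which matches the WHEN_FAILED column that smartctl printed.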

So according to the smartctl output section you have posted, your drive actually looks to be in good shape. However, that doesn't necessarily mean there isn't another problem.

Unfortunately, the Unhandled sense code message does mean that something went wrong, but the kernel doesn't know what. You could try looking at the rest of the smartctl output to see if there is anything wrong. There should be a part that summarises the drive's overall health. You can get it on its own with the -H option.
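Even though the kernel did not decode the sense data, the hex bytes it dumped can be read by hand. The dump starts with 0x72, which marks descriptor-format SCSI sense data; in that format byte 1 carries the sense key and bytes 2 and 3 carry the additional sense code and qualifier. Below is a rough sketch of a decoder. The function name and the two lookup tables are my own and deliberately incomplete (they hold only the codes that appear in this log), and the information field is usually, but not necessarily, the LBA of the failing sector.

```python
# Partial lookup tables: only the codes seen in the posted log are listed.
SENSE_KEYS = {0x3: "Medium Error"}
ASC_ASCQ = {(0x11, 0x04): "Unrecovered read error - auto reallocate failed"}


def decode_descriptor_sense(hex_dump: str) -> dict:
    """Decode the header and information descriptor of descriptor-format sense data."""
    data = bytes.fromhex(hex_dump)
    if len(data) < 8 or (data[0] & 0x7F) not in (0x72, 0x73):
        raise ValueError("expected descriptor-format sense data (0x72/0x73)")
    key, asc, ascq = data[1] & 0x0F, data[2], data[3]
    result = {
        "sense_key": SENSE_KEYS.get(key, f"unknown (0x{key:x})"),
        "additional": ASC_ASCQ.get((asc, ascq), f"unknown (0x{asc:02x}/0x{ascq:02x})"),
        "information": None,
    }
    # Byte 7 is the additional sense length; descriptors start at byte 8.
    end = min(len(data), 8 + data[7])
    pos = 8
    while pos + 2 <= end:
        desc_type, desc_len = data[pos], data[pos + 1]
        # Type 0x00 is the information descriptor; bit 0x80 marks it as valid.
        if desc_type == 0x00 and desc_len == 0x0A and pos + 12 <= end and data[pos + 2] & 0x80:
            result["information"] = int.from_bytes(data[pos + 4:pos + 12], "big")
        pos += 2 + desc_len
    return result


# First dump from the log in the question (two kernel lines joined together).
first = "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 0e 5a b2 b8"
print(decode_descriptor_sense(first))
```

For the first dump this reports a medium error with an unrecovered read, which points at the disk surface rather than at cabling.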

If the drive supports self testing, you can start one with:

smartctl -t long /dev/sda

This starts a test in the background, so you will have to keep checking for results. If the drive is not mounted, you can add the -C option to enable captive mode, which should take less time. A short test is also possible, but less thorough.
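Once a test has finished, its result shows up in the self-test log (smartctl -l selftest /dev/sda). If you want to pull the failing LBAs out of that log automatically, something like the sketch below may help. It assumes the column layout shown in the question; the regular expression and the function name failed_tests are my own and may need adjusting for other drives or smartctl versions.

```python
import re

# Matches one row of the self-test log, e.g.
# "# 1  Extended offline    Completed: read failure       90%     13680         229857912"
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<desc>.+?)\s{2,}(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def failed_tests(log_text: str) -> list:
    """Return (test number, lifetime hours, LBA) for every row that reports an LBA."""
    failures = []
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match and match.group("lba").isdigit():
            failures.append(
                (int(match.group("num")), int(match.group("hours")), int(match.group("lba")))
            )
    return failures


# Rows copied from the self-test log in the question.
SAMPLE_LOG = """\
# 1  Extended offline    Completed: read failure       90%     13680         229857912
# 2  Extended offline    Completed without error       00%      9661         -
# 3  Extended offline    Completed: read failure       90%      9654         96004576
# 4  Extended offline    Completed: read failure       90%      9653         96004576
"""

for num, hours, lba in failed_tests(SAMPLE_LOG):
    print(f"test #{num} at {hours} h failed at LBA {lba}")
```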

It is also a good idea to check the physical connectors etc. to make sure nothing has come loose; it's an easy fix if it has.

Update

Wikipedia has a good reference for SMART attributes. Note that the 'Better' column refers to the raw values in the rightmost column of the output and not the normalised value at the start. Here is the part on 'Current Pending Sector' mentioned by frostschutz:

Count of "unstable" sectors (waiting to be remapped, because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and also it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it's written. However some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector and if the write operation is successful then the sector will be marked good (in this case, the "Reallocation Event Count" (0xC4) will not be increased). This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors.
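Since it is the raw values that matter for these sector counters, a small script can watch them for you. This is only a sketch: the choice of attribute IDs to watch (5, 196, 197, 198) and the function name are my own, and it assumes the ten-column table layout shown in the question, with the raw value in the last column.

```python
# Attribute IDs whose raw value should normally stay at zero (my own selection).
WATCHED_IDS = {5, 196, 197, 198}


def nonzero_sector_counters(table_text: str) -> dict:
    """Return {attribute name: raw value} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in table_text.splitlines():
        parts = line.split()
        # Data rows start with a numeric ID and have at least ten columns.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        attr_id, name, raw = int(parts[0]), parts[1], parts[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Rows copied from the smartctl output in the question.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

print(nonzero_sector_counters(SAMPLE))
```

On the posted output only Current_Pending_Sector is flagged, with a raw value of 1, even though its normalised value is still 200.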
