Smartctl – How to Check Life Left in an SSD

Tags: hard-disk, hardware, smartctl, ssd

We all know that SSDs have a limited, predetermined life span. How do I check the current health status of an SSD in Linux?

Most Google search results tell you to look in the S.M.A.R.T. information for a percentage field called Media_Wearout_Indicator, or for other jargon-laden indicators like Longterm Data Endurance, but these don't exist on my drives. Yes, I did check two SSDs; both lack these fields. I could go on to find a third SSD, but I suspect the fields simply are not standardized.
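For reference, this is roughly how I searched for those fields (just a sketch; the device path is an example, and the pattern is a guess, since attribute names vary from vendor to vendor):

    $ # Search the attribute table for the fields those posts mention
    $ sudo smartctl -A /dev/sda | grep -i -E 'wearout|endurance'

On both of my drives it prints nothing.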

To demonstrate the problem, here are the two examples.


With the first SSD, it is not clear which field indicates the wear-out level. However, there is only one Unknown_Attribute whose RAW_VALUE is between 1 and 100, so I can only assume that is what we are looking for:

    $ sudo smartctl -A /dev/sda                                             
    smartctl 6.2 2013-04-20 r3812 [x86_64-linux-3.11.0-14-generic] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF READ SMART DATA SECTION ===                                 
    SMART Attributes Data Structure revision number: 1                       
    Vendor Specific SMART Attributes with Thresholds:                        
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      5 Reallocated_Sector_Ct   0x0002   100   100   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0002   100   100   000    Old_age   Always       -       6568
     12 Power_Cycle_Count       0x0002   100   100   000    Old_age   Always       -       1555
    171 Unknown_Attribute       0x0002   100   100   000    Old_age   Always       -       0
    172 Unknown_Attribute       0x0002   100   100   000    Old_age   Always       -       0
    173 Unknown_Attribute       0x0002   100   100   000    Old_age   Always       -       57
    174 Unknown_Attribute       0x0002   100   100   000    Old_age   Always       -       296
    187 Reported_Uncorrect      0x0002   100   100   000    Old_age   Always       -       0
    230 Unknown_SSD_Attribute   0x0002   100   100   000    Old_age   Always       -       190
    232 Available_Reservd_Space 0x0003   100   100   005    Pre-fail  Always       -       0
    234 Unknown_Attribute       0x0002   100   100   000    Old_age   Always       -       350
    241 Total_LBAs_Written      0x0002   100   100   000    Old_age   Always       -       742687258
    242 Total_LBAs_Read         0x0002   100   100   000    Old_age   Always       -       1240775277

So has this SSD used 57% of its rewrite life span? Is that assumption correct?


With the other disk, the SSD_Life_Left attribute stands out, but its RAW value of 0 is puzzling. If it means 0% life left, that is unlikely for an apparently healthy SSD, unless it happens to be in peril (we will see in a few days); if it means 0% of its life has been used, that is equally implausible for a worn drive (worn = used for more than a year).

    > sudo /usr/sbin/smartctl -A /dev/sda
    smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.6-4-desktop] (SUSE RPM)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF READ SMART DATA SECTION ===
    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   104   100   050    Pre-fail  Always       -       0/8415644
      5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
      9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       4757h+02m+17.130s
     12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1371
    171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
    172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
    174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       52
    177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       2
    181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
    182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    194 Temperature_Celsius     0x0022   030   030   000    Old_age   Always       -       30 (Min/Max 30/30)
    195 ECC_Uncorr_Error_Count  0x001c   104   100   000    Old_age   Offline      -       0/8415644
    196 Reallocated_Event_Count 0x0033   100   100   000    Pre-fail  Always       -       0
    231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
    233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       3712
    234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       1152
    241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       1152
    242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       3072

Best Answer

In your first example, what I think you are referring to is the "Media Wearout Indicator" found on Intel drives, which is attribute 233. Yes, it has a range of 0-100, with 100 being a brand new, unused drive and 0 being completely worn out. According to your output, though, this field doesn't exist on your drive.
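If a drive does expose it, something like the following would pull it out (a sketch only; the device path is an example, and the attribute name depends on the firmware):

    $ # Print attribute 233, if present (Media_Wearout_Indicator on Intel SSDs)
    $ sudo smartctl -A /dev/sda | awk '$1 == 233 {print $2, "normalized VALUE:", $4}'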

In your second example, please read the official docs about SSD_Life_Left. Per that page:

The RAW value of this attribute is always 0 and has no meaning. Check the normalized VALUE instead. It starts at 100 and indicates the approximate percentage of SSD life left. It typically decreases when Flash blocks are marked as bad; see the RAW value of Retired_Block_Count.
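So on your second drive the number to watch is the VALUE column of attribute 231, which is currently 100, i.e. essentially all of its rated life remaining. A minimal sketch for pulling just that column out, assuming the same ten-column table layout as in your output:

    $ # Normalized VALUE (4th column) of attribute 231, SSD_Life_Left
    $ sudo smartctl -A /dev/sda | awk '$1 == 231 {print $4}'
    100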

It's really important that you fully understand what smartctl(8) is saying and not make assumptions. Unfortunately, the S.M.A.R.T. tools aren't always up to date with the latest SSDs and their attributes, so there isn't always a clean way to tell how many times the chips have been written to. The best you can do is look at Power_On_Hours, which in your case is 6568, determine your average disk utilization, and average it out.
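One way to get a handle on your average disk utilization is the sysstat tools; a rough sketch, keeping in mind that these counters only cover the time since boot, not the drive's whole life:

    $ # Per-device I/O since boot; the kB_wrtn column is the total data written
    $ iostat -d sda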

You should be able to look up your drive's specs and determine the process used to make the chips. 32 nm process chips will have a longer write endurance than 24 nm process chips. However, it seems that on average you can probably expect about 3,000 to 4,000 write cycles per cell, with a minimum of 1,000 and a maximum of 6,000. So, if you have a 64 GB SSD, you should expect somewhere in the neighborhood of 192 TB to 256 TB of total writes to the SSD, assuming wear leveling.
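The budget itself is just multiplication; a quick check of those figures (the cycle counts are the rough averages above, not your drive's actual rating):

    $ # 64 GB of flash at 3,000 and 4,000 write cycles, assuming even wear leveling
    $ echo '64 * 3000' | bc     # 192000 GB, i.e. about 192 TB
    192000
    $ echo '64 * 4000' | bc     # 256000 GB, i.e. about 256 TB
    256000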

As an example, if you're sustaining a write rate of, say, 11 KB/s to your drive, then you can expect about 40 MB written per hour. At 6568 powered-on hours, you've written roughly 260 GB to disk. Knowing that you could probably sustain about 200 TB of total writes before failure, you have about 600 years before failure due to wearing out the chips. Your disk will more likely fail due to worn-out capacitors or voltage regulation.
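If you want to redo that estimate with your own numbers, and cross-check it against the Total_LBAs_Written counter your first drive reports (assuming 512-byte logical sectors, which smartctl -i can confirm), the arithmetic looks like this:

    $ # Hypothetical write rate: 11 KB/s over 6568 powered-on hours, in GB
    $ echo '11 * 1000 * 3600 * 6568 / 1000^3' | bc -l     # about 260 GB
    $ # What the drive itself has counted: Total_LBAs_Written x 512 bytes, in GB
    $ echo '742687258 * 512 / 1000^3' | bc -l             # about 380 GB
    $ # Years until a 200 TB budget is exhausted at 40 MB written per hour
    $ echo '200 * 1000 * 1000 / 40 / 8760' | bc -l        # about 570 years, the same ballpark as above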
