Linux – Why is this SSD returning inconsistent data, and why doesn’t the backup image file match the checksum?

backup, checksum, hard-drive, linux, ssd

This is about the SSD in a notebook. The SSD appears to be going bad already, possibly even corrupting data: it seems to return different data every time it is read, even while it is not in use (see below for details). Which tools can be used to confirm this suspicion?

When an HDD slowly begins to die, there's usually a clear indication in the SMART output: a graphical tool like GSmartControl would highlight a certain value in red, and a service like smartd might already have generated a warning. At that point, the user might still have some time to create a backup before the drive starts to corrupt data. Of course, if the drive has already begun to corrupt data, some files in that backup could be damaged.

It seems there's no clear warning in the SMART output for this SSD, and no kernel errors have been logged to dmesg (on the other hand, ecryptfs has logged errors). In other words, it was only by chance that I discovered this SSD might already be so bad that it's corrupting data even when it's not in use.
After making a backup (1:1 dd image) of this SSD (sda), I discovered that the checksum of the image file doesn't match the checksum of the SSD. Of course, this was done from a live system, so the SSD was not mounted, which means its contents could not have changed during the backup process.

This is what I did to make the backup copy. "BUTTER" is where I mounted an external drive, which is formatted with BTRFS so that I would be able to find out about data errors in case the backup drive is also bad (unlike most other filesystems, BTRFS has checksums).

[root@localhost mnt]# time dd if=/dev/sda of=BUTTER/SSD.dd.img bs=400M && echo OK
610+1 records in
610+1 records out
256060514304 bytes (256 GB, 238 GiB) copied, 5188.27 s, 49.4 MB/s

real    86m28.726s
user    0m0.008s
sys     7m3.597s
OK

I created an MD5 checksum of the image file and another one of the SSD and they didn't match. After repeating this procedure, I realized that the MD5 checksum of the SSD is different every single time.

[root@localhost mnt]# time dd if=/dev/sda bs=400M | md5sum >/tmp/MD5-again

610+1 records in
610+1 records out
256060514304 bytes (256 GB, 238 GiB) copied, 1059.87 s, 242 MB/s

real    17m39.904s
user    8m21.708s
sys     3m58.466s
[root@localhost mnt]# cat /tmp/MD5-again
24e71715359158f3ab38e748af93718c  -
[root@localhost mnt]# time dd if=/dev/sda bs=400M | md5sum >>/tmp/MD5-again
610+1 records in
610+1 records out
256060514304 bytes (256 GB, 238 GiB) copied, 1073.7 s, 238 MB/s

real    17m53.735s
user    8m28.494s
sys     4m23.993s
[root@localhost mnt]# cat /tmp/MD5-again
24e71715359158f3ab38e748af93718c  -
569d517626c1b7394acca0c4020c99bc  -

Again, the SSD was never mounted at any point during that process.

# mount | grep -c sda
0
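
The procedure above reads the SSD once for the image and again for each checksum, so the image and the hashes never come from the same pass. A minimal sketch of an alternative, assuming GNU coreutils (`status=none` is a GNU dd option) and a hypothetical helper name `image_with_checksum`, writes the image and hashes the very same byte stream in one read:

```shell
#!/bin/sh
# image_with_checksum SRC DEST  (hypothetical helper, not from the original post)
# Copies SRC to DEST and prints the MD5 of exactly the bytes that were read,
# so the image and its checksum are taken from one and the same pass.
image_with_checksum() {
    src=$1
    dest=$2
    # tee writes the stream to DEST while passing it on to md5sum unchanged.
    dd if="$src" bs=4M status=none | tee "$dest" | md5sum | cut -d' ' -f1
}

# Example (as root, with the backup drive mounted at BUTTER):
#   image_with_checksum /dev/sda BUTTER/SSD.dd.img > /tmp/MD5-image
```

A later `md5sum` of the image file should then reproduce that value; if a fresh read of the device does not, something changed on the device between the two reads. Note that in plain sh the pipeline's exit status is that of `cut`, so a dd read error is not reflected in the function's return code.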

I ran a SMART test, which did not find anything. No SMART error is logged.
SMART attributes:

Device Model: SanDisk SD8TN8U256G1001

[root@localhost mnt]# smartctl -A /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.16.3-301.fc28.x86_64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       3173
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       1117
170 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
171 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       37
174 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       41
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   ---    Old_age   Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   010    Pre-fail  Always       -       100
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   056   062   ---    Old_age   Always       -       44 (Min/Max 13/62)
199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       0
233 Media_Wearout_Indicator 0x0033   093   100   001    Pre-fail  Always       -       15484248
234 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       11127
241 Total_LBAs_Written      0x0030   253   253   ---    Old_age   Offline      -       3192
242 Total_LBAs_Read         0x0030   253   253   ---    Old_age   Offline      -       66461
249 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       9346

What's happening?

Best Answer

Right after posting this question, I found my mistake. I used Fedora XFCE as the live system, and it automatically enabled the swap partition that happens to be on the SSD in question. While the live system was actively using that swap partition, it was of course changing the contents of the SSD.

[root@localhost mnt]# swapon --show
NAME      TYPE      SIZE   USED PRIO
/dev/sda3 partition   8G 103.3M   -2

That's a bit awkward now that I've already posted the question. But I'll leave it there, hoping it will be useful for someone else who might be making the same mistake.

All I had to do was disable the swap partition that had been enabled automatically:

[root@localhost mnt]# swapoff -a

I'd like to point out that the swap partition was enabled automatically when I booted the live system; I never wanted it to be used. I wonder what would have happened if there had been a hibernation image on that swap partition.
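
Because `mount` does not list swap, a check for "is this disk in use?" has to look at active swap areas as well. A rough sketch, with a hypothetical function name `disk_in_use`, reads the first column of `/proc/mounts` and `/proc/swaps`. It is a simple prefix match, so it would not notice a partition that is used indirectly (for example through LVM or dm-crypt):

```shell
#!/bin/sh
# disk_in_use DISK [MOUNTS_FILE] [SWAPS_FILE]  (hypothetical helper)
# Succeeds if any entry in the mount table or the swap table starts with DISK.
# The two table paths default to the kernel's own files; they are parameters
# only so the function can be tried against sample files.
disk_in_use() {
    disk=$1                        # e.g. /dev/sda
    mounts=${2:-/proc/mounts}
    swaps=${3:-/proc/swaps}
    # The first column of both tables is the device path; the header line of
    # the swap table does not start with a device path and is ignored.
    awk -v d="$disk" 'index($1, d) == 1 { found = 1 } END { exit !found }' \
        "$mounts" "$swaps"
}

# Example:
#   if disk_in_use /dev/sda; then
#       echo "/dev/sda is still mounted or used as swap" >&2
#   fi
```

If only one partition is the problem, `swapoff /dev/sda3` disables just that swap area, whereas `swapoff -a` disables every active one.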

After disabling the unwanted swap partition, everything worked as expected. Using the commands shown above, the image file's checksum now matches the checksum of the SSD. In other words, this SSD isn't bad.
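
The final comparison can be scripted rather than done by eye. A minimal sketch, using a hypothetical helper `same_checksum` and the same `md5sum` tool as in the question:

```shell
#!/bin/sh
# same_checksum A B  (hypothetical helper)
# Succeeds only if both inputs hash identically. Arguments may be block
# devices or regular files; reading a device usually requires root.
# Returns 2 if either input cannot be read.
same_checksum() {
    sum_a=$(md5sum < "$1") || return 2
    sum_b=$(md5sum < "$2") || return 2
    [ "$sum_a" = "$sum_b" ]
}

# Example:
#   same_checksum /dev/sda BUTTER/SSD.dd.img && echo "image matches the SSD"
```

When the two do differ, `cmp /dev/sda BUTTER/SSD.dd.img` stops at the first differing byte and prints its offset, which helps to tell which partition the change falls in.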
