Encrypted ext3 damaged; how to proceed

data-recovery encryption fsck

My home partition on a Debian wheezy install is an encrypted LVM volume, formatted as ext3. Earlier today, a terminal window showed an odd message: an attempt to write to a file in my /home tree had failed because the file system was read-only. I rebooted and got a message saying /dev/sda1 is reported as clean. Then fsck.ext3, which runs automatically, reported that there is no such device as /dev/mapper/sda1_crypt and exited with code 8. I was dropped to a maintenance shell and told there had been an attempt to write a log to /var/log/fsck/checkfs.

That log reads:

[Timestamp]

fsck from util-linux 2.20.1
/dev/mapper/sda1_crypt: Superblock needs_recovery flag is clear, but journal has data.
/dev/mapper/sda1_crypt: Run journal anyway?

/dev/mapper/sda1_crypt: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY
     (i.e., without -a or -p options)
fsck died with exit status 4

I ran

$ fsck -vnM /dev/mapper/sda1_crypt

A bunch of illegal block #nnnn (mmmmmmmmm) in inode ppppppp IGNORED messages blew past, followed by

Too many illegal blocks in inode somenumberhere

Then running additional passes to resolve blocks claimed by more than one inode

It then output

Pass 1B: Rescanning for multiply claimed blocks

After a bit, I got a wall of

Illegal block number passed to ext2fs_test_block_bitmap somenumberhere for multiply claimed block map

These were followed by 2 Multiply claimed blocks in inode anothernumber: [lists of 5 and 8 block numbers]

Then I got a number of stanzas like

[ 3828.181915] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 3828.182462] ata1.01: BMDMA stat 0x64
[ 3828.183810] ata1.01: failed command: READ DMA EXT
[ 3828.185889] ata1.01: cmd 25/00:08:08:10:9c/00:00:29:00:00/f0 tag dma 4096 in
[ 3828.185891] res 51/40:00:09:10:9c/40:00:29:00:00/f0 Emask 0x9 (media error)
[ 3828.190071] ata1.01: status: { DRDY ERR }
[ 3828.192153] ata1.01: error: { UNC }

These were followed by

[ 3830.509338] end_request: I/O error, dev sda, sector 698093577
[ 3830.509841] Buffer I/O error on device dm-3, logical block 87261184
Error reading block 87261184 (Attempt to read block from filesystem resulted in short read) while reading inode and block bitmaps. Ignore error? no

fsck.ext3: Can't read an block bitmap while retrying to read bitmaps for /dev/mapper/sda1_crypt

/dev/mapper/sda1_crypt: ******* WARNING: Filesystem still has errors *******

e2fsck: aborted

/dev/mapper/sda1_crypt: ******* WARNING: Filesystem still has errors *******

And then it aborted with a warning that the filesystem still had errors.
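The ATA stanza and the end_request line describe the same failure, and that can be checked by hand. This is a small sketch, assuming the usual libata layout of the cmd/res lines (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device). The helper name lba48 is made up for illustration.

```shell
#!/bin/sh
# Sketch: rebuild the 48-bit LBA from the byte fields of a libata
# "cmd" or "res" line. lba48 is a hypothetical helper, not a standard tool.
# Arguments, in the order they appear in the log line (two hex digits each):
#   lbal lbam lbah hob_lbal hob_lbam hob_lbah
lba48() {
    # Most significant byte first: hob_lbah hob_lbam hob_lbal lbah lbam lbal
    printf '%d\n' "0x$6$5$4$3$2$1"
}

# cmd 25/00:08:08:10:9c/00:00:29:00:00/f0  -> first sector of the 8-sector read
lba48 08 10 9c 29 00 00    # prints 698093576

# res 51/40:00:09:10:9c/40:00:29:00:00/f0  -> sector the drive gave up on
lba48 09 10 9c 29 00 00    # prints 698093577
```

The second value is the sector named in the end_request line. The first is 87261184 × 8 + 4104, so it lines up with the 4 KiB block in the Buffer I/O error line if dm-3 starts 4104 sectors into the disk. That offset is inferred from the arithmetic, not something the logs state.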

My questions are:

  1. Is my data toasted? (My rigorous backup policy hasn't been rigorously followed of late; I am being punished by the universe, I am sure.)

  2. What can/ought I to do now?

  3. Did I do the wrong thing already?

  4. Will someone hold me until the shaking stops?

EDIT

I also asked on my local LUG mailing list. The advice I got there was to take an image of the drive with ddrescue and run fsck on a copy of that image. That seems sound and unlikely to make things worse. So, that is the present plan of attack, pending any better suggestions.
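As a rough sketch of that plan (not a tested procedure): image the partition with two ddrescue passes, then open and check a copy of the image, never the original. Everything concrete here is an assumption: the source is taken to be /dev/sda1, the container is taken to be LUKS, and /mnt/rescue, /dev/loop0 and rescue_crypt are placeholder names. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of "image with ddrescue, fsck a copy". All paths are placeholders.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}
SRC=${SRC:-/dev/sda1}        # failing partition (assumption)
DEST=${DEST:-/mnt/rescue}    # directory on a different, healthy disk
LOOP=${LOOP:-/dev/loop0}     # a free loop device (assumption)

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rescue_image() {
    # Pass 1: copy everything that reads easily, skip the slow bad areas.
    run ddrescue -n "$SRC" "$DEST/sda1.img" "$DEST/sda1.map" || return
    # Pass 2: return to the bad areas with direct access and 3 retry passes.
    run ddrescue -d -r3 "$SRC" "$DEST/sda1.img" "$DEST/sda1.map"
}

check_copy() {
    # Work on a copy so a bad repair can be thrown away and retried.
    run cp --sparse=always "$DEST/sda1.img" "$DEST/work.img" || return
    run losetup "$LOOP" "$DEST/work.img" || return
    # ASSUMES the volume is LUKS; asks for the passphrase in a real run.
    run cryptsetup luksOpen "$LOOP" rescue_crypt || return
    # -f force a full check, -n answer "no" to everything (read-only look).
    run e2fsck -fn /dev/mapper/rescue_crypt
}

rescue_image && check_copy
```

Once the read-only pass has shown what e2fsck wants to change, the repair run would drop -n. The destination needs at least as much free space as the partition, twice that if the working copy is not stored sparsely.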

Best Answer

It sounds like the hard disk itself is having problems: "short read," "media error," and so on. If so, dmesg | tail will probably show some I/O errors.
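If the kernel log is long, a filter helps more than tail. The sketch below pulls the distinct failing sectors out of log text on stdin. failed_sectors is a made-up helper name, and the pattern keys on the "I/O error, dev X, sector N" wording seen in the question; kernels that phrase the line differently would need a different pattern.

```shell
#!/bin/sh
# Sketch: list each (device, sector) pair the kernel failed to read.
# Reads kernel log text on stdin. failed_sectors is a hypothetical helper.
failed_sectors() {
    # Keep only lines of the form "... I/O error, dev sda, sector 12345 ..."
    # and reduce them to "sda 12345"; sort -u drops repeated reports.
    sed -n 's|.*I/O error, dev \([a-z0-9]*\), sector \([0-9]*\).*|\1 \2|p' |
        sort -u
}
```

Typical use would be dmesg | failed_sectors. A short list of sectors that keeps repeating suggests a localized bad patch; a list that grows between runs suggests the disk is getting worse.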

Another way to check this is to run badblocks -n on the problem partition or, better, on the entire disk. Whatever you test needs to be unmounted, and the test will take hours on a large modern disk. If there's anything you can't live without on the partitions that still mount, copy it off onto removable media or a network volume first.
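A minimal sketch of that check, assuming the disk is /dev/sda and that the bad-block list goes to /mnt/rescue on some other disk (both placeholders). It refuses to start if the device shows up in the mount table and, by default, only prints the badblocks command.

```shell
#!/bin/sh
# Sketch: guard against testing a mounted device, then run the
# non-destructive read-write test. DEV and the output path are placeholders.
# DRY_RUN=1 (the default) prints the command instead of running it.
DRY_RUN=${DRY_RUN:-1}
DEV=${DEV:-/dev/sda}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# True if the device, or a partition on it, appears in the mount table.
# This does NOT see through LVM or dm-crypt layers; check those by hand.
# MOUNTS may point at another file for testing; it defaults to /proc/mounts.
is_mounted() {
    grep -q "^$1" "${MOUNTS:-/proc/mounts}"
}

scan_disk() {
    if is_mounted "$DEV"; then
        echo "$DEV (or a partition on it) is mounted; unmount it first" >&2
        return 1
    fi
    # -n non-destructive read-write test, -s show progress, -v verbose,
    # -o write the list of bad blocks to a file
    run badblocks -nsv -o /mnt/rescue/badblocks.txt "$DEV"
}
```

Calling scan_disk with DRY_RUN=0 as root would start the real test. The mount check is only a convenience, since a volume mounted via /dev/mapper would slip past it.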

The suggestion to mirror the disk is also good. It's kind of a "lite" version of the badblocks -n check: by forcing the disk to read every sector, it can cause the disk to relocate problem blocks, as badblocks -n will. badblocks -n is more effective because a dodgy sector can be barely readable, and the disk may only decide it is bad enough to relocate when something attempts to write to it. Still, if the disk has enough life left in it to survive a rescue, the extra read pass won't be enough to finish it off.

I don't hold much hope that running fsck on the disk image will recover everything. You'll almost certainly lose sectors in this process, which means some files will be unreadable or corrupted beyond use. A JPEG will partially decode with corrupted data, for example, but a JPEG with the bottom ⅔ cropped off might not be useful to you.
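How much was lost can be estimated from the ddrescue mapfile (called a logfile in older releases) once imaging is done. The sketch below adds up the areas marked bad. It assumes the documented mapfile layout, in which data lines read "pos size status" with hexadecimal pos and size and a status of "-" for bad areas; bad_bytes is a made-up helper name.

```shell
#!/bin/sh
# Sketch: total the bytes ddrescue could not read, from its mapfile.
# bad_bytes is a hypothetical helper; $1 is the path to the mapfile.
bad_bytes() {
    total=0
    while read -r pos size status rest; do
        # Skip comments and blank lines.
        case $pos in '#'*|'') continue ;; esac
        # The current-position line has no third "size pos status" field
        # equal to "-", so only bad-area data lines are counted here.
        if [ "$status" = "-" ]; then
            total=$((total + size))   # shell arithmetic accepts 0x... values
        fi
    done < "$1"
    echo "$total"
}
```

For example, bad_bytes /mnt/rescue/sda1.map (a placeholder path) prints a byte count; dividing by 4096 gives a rough upper bound on the number of damaged filesystem blocks, assuming 4 KiB blocks.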

Is my data toasted?

Possibly, possibly not. The badblocks -n pass can sometimes fix the problem. If it does, you still need to replace the HDD, since a disk can only get into such a bad state by being nearly dead to begin with.

Did I do the wrong thing already?

Other than forgetting the meaning of the word "rigorous," no. :)
