Linux – Salvage files from ext3 filesystem with physical errors

data-recovery ext2 linux

I have a disk from a crashed Linux laptop with files on it that the unhappy owner would like to have back if at all possible (no backup solutions please). I have not had anything to do with it before. The disk is recognized by both OS X and Ubuntu 11.10:

root@ubuntu1110:~# fdisk -l /dev/sdc

Disk /dev/sdc: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x80d549b4

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *          63   953602334   476801136   83  Linux
/dev/sdc2       953602335   976768064    11582865    5  Extended
/dev/sdc5       953602398   976768064    11582833+  82  Linux swap / Solaris

This looks consistent with a stock installation of a Linux distribution with a swap partition.

Unfortunately some rather nasty messages show up in dmesg, after Ubuntu says it cannot mount the sdc1 partition:

[  181.228092] sd 6:0:0:0: [sdc] 976773168 512-byte logical blocks: (500 GB/465 GiB)
[  181.232176] sd 6:0:0:0: [sdc] Write Protect is off
[  181.232181] sd 6:0:0:0: [sdc] Mode Sense: 21 00 00 00
[  181.236359] sd 6:0:0:0: [sdc] No Caching mode page present
[  181.236364] sd 6:0:0:0: [sdc] Assuming drive cache: write through
[  181.246696] sd 6:0:0:0: [sdc] No Caching mode page present
[  181.246707] sd 6:0:0:0: [sdc] Assuming drive cache: write through
[  182.835915]  sdc: sdc1 sdc2 < sdc5 >
[  182.854199] sd 6:0:0:0: [sdc] No Caching mode page present
[  182.854204] sd 6:0:0:0: [sdc] Assuming drive cache: write through
[  182.854208] sd 6:0:0:0: [sdc] Attached SCSI disk
[  218.250174] sd 6:0:0:0: [sdc] Unhandled sense code
[  218.250179] sd 6:0:0:0: [sdc]  Result: hostbyte=DID_ERROR driverbyte=DRIVER_SENSE
[  218.250182] sd 6:0:0:0: [sdc]  Sense Key : Hardware Error [current] 
[  218.250187] Info fld=0x0
[  218.250188] sd 6:0:0:0: [sdc]  Add. Sense: No additional sense information
[  218.250193] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 01 08 00 00 08 00
[  218.250200] end_request: I/O error, dev sdc, sector 264
[  218.250206] Buffer I/O error on device sdc, logical block 33
[  255.398994] sd 6:0:0:0: [sdc] Unhandled sense code
[  255.399029] sd 6:0:0:0: [sdc]  Result: hostbyte=DID_ERROR driverbyte=DRIVER_SENSE
[  255.399032] sd 6:0:0:0: [sdc]  Sense Key : Hardware Error [current] 
[  255.399037] Info fld=0x0
[  255.399038] sd 6:0:0:0: [sdc]  Add. Sense: No additional sense information
[  255.399053] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 01 08 00 00 08 00
[  255.399061] end_request: I/O error, dev sdc, sector 264
[  255.399066] Buffer I/O error on device sdc, logical block 33
[  281.340599] sd 6:0:0:0: [sdc] Unhandled sense code
[  281.340609] sd 6:0:0:0: [sdc]  Result: hostbyte=DID_ERROR driverbyte=DRIVER_SENSE
[  281.340618] sd 6:0:0:0: [sdc]  Sense Key : Hardware Error [current] 
[  281.340653] Info fld=0x0
[  281.340655] sd 6:0:0:0: [sdc]  Add. Sense: No additional sense information
[  281.340659] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 67 00 00 08 00
[  281.340667] end_request: I/O error, dev sdc, sector 103
[  281.340739] EXT3-fs (sdc1): error: can't read group descriptor 4

My current theory is that the hard disk has run out of spare blocks, so a real bad block has now appeared, and it is in the area that is read when mounting the partition. dd appears to confirm this:

root@ubuntu1110:~# dd if=/dev/sdc1 of=/dev/null bs=10240 conv=noerror
dd: reading `/dev/sdc1': Input/output error
2+0 records in
2+0 records out
20480 bytes (20 kB) copied, 44.7084 s, 0.5 kB/s
dd: reading `/dev/sdc1': Input/output error
9+1 records in
9+1 records out
96256 bytes (96 kB) copied, 162.933 s, 0.6 kB/s
dd: reading `/dev/sdc1': Input/output error
9+1 records in
9+1 records out
96256 bytes (96 kB) copied, 180.083 s, 0.5 kB/s

There are bad blocks early on, and the transfer rate stays very slow even later in the process (not shown).
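The numbers above can be cross-checked with a little arithmetic. The sketch below uses only values from the fdisk, dmesg and dd output, plus one assumption that the output does not state: a 4 KiB filesystem block size.

```shell
# Values copied from the fdisk, dmesg and dd output above.
part_start=63      # first sector of /dev/sdc1
bad_sector=103     # absolute sector from "I/O error, dev sdc, sector 103"
sector_size=512    # logical sector size in bytes

# Offset of the failing sector inside the partition.
offset_sectors=$((bad_sector - part_start))
offset_bytes=$((offset_sectors * sector_size))

# Bytes dd copied before its first error: 2 records of 10240 bytes.
dd_bytes=$((2 * 10240))

# ASSUMPTION: 4096-byte filesystem blocks (not shown in the output).
fs_block=$((offset_bytes / 4096))

echo "bad sector is $offset_bytes bytes into sdc1 (fs block $fs_block under the 4 KiB assumption)"
echo "dd stopped after $dd_bytes bytes"
```

Both figures come out as 20480 bytes, so the first dd error and the "can't read group descriptor" message point at the same spot near the start of the partition.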

My problem now is how to proceed from here. I need something that can read from a broken ext2/ext3 filesystem so we can copy the files that are still there off the disk. I have not done much Linux system administration in the last 15 years, so I do not know the right terms to search for.

I could probably copy a disk image overnight, but then the "this block is bad" information is lost.

What kind of program would be useful in this situation?

Best Answer

First rule of disk recovery: Stop using the disk. If there are hardware issues (such as a head crash), any usage risks further damage; if the filesystem is corrupt, any mount or fsck has the potential to make it worse. (Even in ro mode! Note that mount -t ext3 -o ro will attempt to replay the journal and write to the disk!)

Use dd_rescue or ddrescue to copy as much of the disk image to another system as possible, put the disk away, and make copies of the image. Perform all attempts at recovery from one of the copies.
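A minimal sketch of what that imaging step might look like with GNU ddrescue. The paths under /mnt/rescue are hypothetical, and the `run` helper is only there so the script prints the commands instead of executing them until DRY_RUN=0 is set. The mapfile is what preserves the "this block is bad" information: it records which areas were read and which failed.

```shell
# Dry run by default: commands are only printed. Set DRY_RUN=0 to execute them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

src=/dev/sdc1                  # failing partition, from the fdisk output
img=/mnt/rescue/sdc1.img       # hypothetical path on a healthy disk with enough space
map=/mnt/rescue/sdc1.map       # hypothetical mapfile path (records good and bad areas)

# Pass 1: copy everything that reads easily and skip the slow work on bad areas.
run ddrescue -n "$src" "$img" "$map"

# Pass 2: return to the bad areas with direct disk access and up to 3 retry passes.
run ddrescue -d -r3 "$src" "$img" "$map"

# Afterwards, make a working copy and leave the original image untouched.
run cp --sparse=always "$img" "$img.work"
```

Because both passes share the same mapfile, an interrupted run can be resumed with the same command line rather than starting over. Option names have changed slightly between ddrescue releases, so it is worth checking them against the installed version's manual.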

Now, I gave some tips for ext data recovery here. In short,

Your partition layout appears to be still valid. If it weren't, you could use TestDisk or gpart to attempt recovery of the partition table.
e2fsck may be able to munge the filesystem back into a mountable state. It'll place dangling inodes into /lost+found and report errors.
ext4magic tries to recover data from journaled metadata. Whether files are recoverable from the journal is up to luck and chance, but it's possible there's stuff in there.
The Sleuth Kit can parse and output most filesystem structures. If you know a fair amount about the filesystem's internal layout and have a hex editor handy (to do stuff like "superblock is corrupt and backup superblock is out of date but I can pick enough data out to reconstruct it myself"), this is, IMO, by far the most useful tool for recovering the most data.
PhotoRec will attempt to find byte sequences that look like files. It is only guessing at file start/end, will not know anything about the filesystem structure such as directories and filenames, and will not find fragmented files.
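As a rough sketch of the least invasive first steps from the list above, the commands below inspect and mount a working copy of the partition image without writing to it. The image path and mount point are hypothetical, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
# Dry run by default: commands are only printed. Set DRY_RUN=0 to execute them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

work=/mnt/rescue/sdc1.img.work   # hypothetical working copy of the partition image
mnt=/mnt/recovered               # hypothetical empty mount point

# Print superblock details (block size, state, features); this only reads.
run dumpe2fs -h "$work"

# Check without changing anything: -f forces a full check, -n answers "no" to every fix.
run e2fsck -fn "$work"

# Try a read-only loop mount; noload skips the journal replay.
run mount -t ext3 -o ro,loop,noload "$work" "$mnt"

# If the mount succeeds, copy the files to a healthy disk, preserving attributes.
run cp -a "$mnt/." /mnt/rescue/files/
```

If the read-only check looks promising, a repairing run such as e2fsck -fy can be tried on a second copy of the image, so a bad repair never costs the only good copy.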

Related Solutions

Which sector size shall I choose to run ddrescue with direct access on an Advanced Format drive

I've exchanged emails with the author of ddrescue, Antonio Diaz, and he told me that the correct parameter to use with an "advanced format" drive (i.e., a drive with 4096-byte physical sectors, but 512-byte "logical sectors") is:

-b4096

If you wanted it to read just one 4096-byte sector at a time (slow!) then you would also specify:

-c1

Antonio is not active on StackExchange, but he supports ddrescue via this email mailing list:

https://www.mail-archive.com/bug-ddrescue@gnu.org/

If you send your email to bug-ddrescue@gnu.org then your email will appear on that summary page, as will his answer, in nicely organized form (but without your email address shown, of course). Additionally, you may search on that page to try to find previous discussions of your issue or question, before bothering Antonio. (He is a very busy man, so please don't waste his time!)

The reason that your ddrescue logfile contains a 512-byte "bad" area is that you initially ran ddrescue with the default sector size of 512 bytes. That's not disastrous, but if ddrescue thinks the drive has 512-byte sectors and a read returns 0 bytes of data due to a read error, then ddrescue assumes that only the first 512 bytes are unreadable and makes no assumption about the rest. So only 512 bytes are marked as bad in the logfile.
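To see which 4096-byte physical sector such a 512-byte entry falls in, the position can be rounded to physical-sector boundaries. This is only an illustrative sketch: the position below is a made-up value, not one taken from any real logfile.

```shell
# HYPOTHETICAL values for illustration; take pos and size from a bad-area line of the logfile.
pos=$((0x1F4A00))    # start of the bad area in bytes (made-up example)
size=512             # length recorded with the 512-byte default
phys=4096            # physical sector size of an Advanced Format drive

# Round the start down and the end up to physical-sector boundaries.
start=$(( pos / phys * phys ))
end=$(( (pos + size + phys - 1) / phys * phys ))

printf 'enclosing physical sector: 0x%X .. 0x%X (%d bytes)\n' "$start" "$end" "$((end - start))"
```

On an Advanced Format drive the whole enclosing physical sector is likely affected, which is why rerunning with -b4096 gives a more realistic picture of the damage.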

Linux – LVM: PV missing after reboot

Does the LV become mountable if you run sudo vgscan and then sudo vgchange -ay? If those commands result in errors, you probably have a different problem and should add those error messages to your original post.

But if the LV becomes ready for mounting after those commands, read on...

The LVM logical volume pathname (e.g. /dev/mapper/vgNAME-lvNAME) in /etc/fstab alone won't give the system a clue that this particular filesystem cannot be mounted until networking and iSCSI have been activated.

Without that clue, the system will assume that filesystem is on a local disk and will attempt to mount it as early as possible, normally before networking has been activated, which will obviously fail with an iSCSI LUN. So you'll need to supply that clue somehow.

One way would be to add _netdev to the mount options for that filesystem in /etc/fstab. From this Ubuntu help page it appears to be supported on Ubuntu. This might actually also trigger a vgscan or similar detection of new LVM PVs (+ possibly other helpful stuff) just before the attempt to mount any filesystems marked with _netdev.

Another way would be to use the systemd-specific mount option x-systemd.requires=<iSCSI initiator unit name>. That should achieve the same thing, by postponing any attempts to mount that filesystem until the iSCSI initiator has been successfully activated.
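A sketch of what the two variants of the /etc/fstab entry might look like. The mount point, the ext4 filesystem type and the unit name open-iscsi.service are assumptions for illustration; the script only prints the candidate lines and does not touch /etc/fstab.

```shell
# Print candidate /etc/fstab entries for review; nothing is written to /etc/fstab.
lv=/dev/mapper/vgNAME-lvNAME   # placeholder LV path, as in the text above
mnt=/mnt/data                  # HYPOTHETICAL mount point
fstype=ext4                    # ASSUMPTION: adjust to the real filesystem type
unit=open-iscsi.service        # ASSUMED unit name; verify with: systemctl list-units '*iscsi*'

# Variant 1: mark the filesystem as network-dependent.
fstab_netdev="$lv $mnt $fstype defaults,_netdev 0 0"

# Variant 2: make the mount depend on the iSCSI initiator unit.
fstab_systemd="$lv $mnt $fstype defaults,x-systemd.requires=$unit 0 0"

printf '%s\n' "$fstab_netdev" "$fstab_systemd"
```

Only one of the two lines would go into /etc/fstab; after editing, systemctl daemon-reload makes systemd regenerate its mount units from the file.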

When the iSCSI initiator activates, it will automatically make any configured LUNs available, and as they become available, LVM should auto-activate any VGs on them. So, once you get the mount attempt postponed, that should be enough.

The lack of PARTUUID is a clue that the disk/LUN does not have a GPT partition table. Since /dev/sdc is listed as TYPE="LVM2_member" it actually does not have any partition table at all. In theory, it should cause no problems for Linux, but I haven't personally tested an Ubuntu 18.04 system with iSCSI storage, so cannot be absolutely certain.

The problem with disks/LUNs with no partition table is that other operating systems won't recognize the Linux LVM header as a sign that the disk is in use, and will happily overwrite it with minimal prompting. If your iSCSI storage administrator has accidentally presented the storage LUN corresponding to your /dev/sdc to another system, this might have happened.

You should find the LVM configuration backup file that corresponds to your missing VG in the /etc/lvm/backup directory, and read it to find the expected UUID of the missing PV. If it matches what blkid reports, ask your storage administrator to double-check his/her recent work for mistakes like the one described above. If it turns out the PV has been overwritten by some other system, any remaining data on the LUN is likely to be more or less corrupted and it would be best to restore it from backup... once you get a new, guaranteed-unconflicted LUN from your iSCSI admin.
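Reading the expected PV UUID out of the backup file can be scripted. The helper below is a hypothetical sketch (the function name is not part of LVM); it assumes the usual backup layout in which each PV sits in a block inside physical_volumes { ... } with an id = "..." line followed by a device = "..." line.

```shell
# List "device uuid" pairs from the physical_volumes section of an LVM metadata backup.
# Usage: pv_uuids_from_backup /etc/lvm/backup/<your VG backup file>
pv_uuids_from_backup() {
    awk '
        # Start tracking when the physical_volumes section opens.
        /physical_volumes[ \t]*[{]/ { inside = 1; depth = 0 }
        inside {
            # Track brace depth on a copy of the line so $0 stays intact.
            line = $0
            opened = gsub(/[{]/, "", line)
            closed = gsub(/[}]/, "", line)
            depth += opened - closed

            # Remember the most recent id, then print it with its device hint.
            if ($1 == "id" && $2 == "=")     { uuid = $3; gsub(/"/, "", uuid) }
            if ($1 == "device" && $2 == "=") { dev = $3; gsub(/"/, "", dev); print dev, uuid }

            # Stop once the section has closed again.
            if (depth <= 0) inside = 0
        }
    ' "$1"
}
```

The printed UUID can then be compared with the UUID field that blkid /dev/sdc reports, or with the output of pvs -o pv_name,pv_uuid.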

If it turns out the actual UUID of /dev/sdc is different from expected, someone might have accidentally run a pvcreate -f /dev/sdc somehow. If that's the only thing that has been done, that's relatively easy to fix. (NOTE: check man vgcfgrestore chapter REPLACING PHYSICAL VOLUMES for updated instructions - your LVM tools may be newer than mine.) First restore the UUID:

pvcreate --restorefile /etc/lvm/backup/<your VG backup file> --uuid <the old UUID of /dev/sdc from the backup file> /dev/sdc

Then restore the VG configuration:

vgcfgrestore --file /etc/lvm/backup/<your VG backup file> <name of the missing VG>

After this, it should be possible to activate the VG, and if no other damage has been done, mount the filesystem after that.
