The problems you're having sound far more extensive than what I'd expect from mere loss of power (even during fairly heavy write activity) on a device. I have to wonder if you're really having more problems at the interface/driver level, or a corrupted partition table or something of that sort.
From the sounds of things you may have exacerbated the problem further with all the thrashing around you've done while trying to fix the issue.
I don't know if we can help with this case but don't give up yet.
For the future I'd suggest that you learn the following technique:
When you have trouble with a drive under Linux or UNIX you can usually use dd to make a bit-image copy of the whole device to some other location. Find a drive that's at least as large as the one in question and try a command like:

    dd if=$PROBLEMATIC of=$TARGET bs=4M

... be very careful about the if (input file) and of (output file) directives; swap them and you overwrite the drive you're trying to save. Let that run. It's a good idea to run tail -f /var/log/messages & (or a variant appropriate to your /etc/syslog.conf) while it does, either in the background or in another window, so you can watch for I/O errors as they're logged. There are enhanced versions of dd which can handle retries and continue past bad blocks more robustly (sdd is a name that comes to mind). But try just using the stock GNU dd command at first.
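To make the if=/of= caution concrete, here's a rough sketch of a wrapper that checks its arguments before dd touches anything. The name image_device and the path in the usage comment are made up for illustration, and I'm assuming GNU dd and that the target is a new image file on the spare drive rather than the raw drive itself:

```shell
#!/bin/sh
# Sketch only: image_device is a made-up helper name, not a standard tool.
# Assumes GNU dd (for bs=4M) and that the target is a NEW image file.
image_device() {
    img_src=$1   # the problematic device or partition, e.g. /dev/sdd
    img_dst=$2   # where the image goes, e.g. a file on the spare drive

    if [ -z "$img_src" ] || [ -z "$img_dst" ]; then
        echo "usage: image_device SOURCE TARGET" >&2
        return 2
    fi
    if [ ! -r "$img_src" ]; then
        echo "cannot read source: $img_src" >&2
        return 1
    fi
    # Refusing an existing target catches the classic mistake of
    # swapping source and target: a device node always exists.
    if [ -e "$img_dst" ]; then
        echo "refusing to overwrite existing target: $img_dst" >&2
        return 1
    fi

    dd if="$img_src" of="$img_dst" bs=4M
}

# Usage (hypothetical path):
#   image_device /dev/sdd /mnt/rescue/sdd.img
```

If you really do want to write straight onto a second drive, as in the plain dd command, you'd have to drop the "target must not exist" check and be that much more careful.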
You can make such a copy of the whole device (/dev/sdd, for example) or just the partition (/dev/sdd1). If you get "short read" or similar errors, it suggests that either the device has physical errors preventing reads past certain cylinders or, in the case of a partition, that the partition table is mangled in some way. You can even make two different dd images ... one of each.
Here's the trick: do all your fsck and mount attempts, and use your various other recovery tools such as TCT (The Coroner's Toolkit), on the copied image!
This minimizes the time spent running the drive (which is possibly degrading at the hardware level as you operate it) and minimizes the impact of failed and possibly misguided recovery attempts. (In some situations you make one image, then another based on that and always operate on the tertiary image ... depends on how much the data is worth).
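One way to follow the "image of the image" habit is sketched below. make_working_copy is a name I've invented here, and the file names in the usage comment are placeholders. The idea is simply to copy the master image, verify the copy byte for byte, and make the master read-only so a stray fsck can't touch it:

```shell
#!/bin/sh
# Sketch only: make_working_copy is a hypothetical helper name.
make_working_copy() {
    mwc_master=$1   # the image taken from the drive
    mwc_work=$2     # the copy you will actually experiment on

    if [ ! -f "$mwc_master" ]; then
        echo "no such master image: $mwc_master" >&2
        return 1
    fi
    if [ -e "$mwc_work" ]; then
        echo "refusing to overwrite: $mwc_work" >&2
        return 1
    fi

    cp -- "$mwc_master" "$mwc_work" || return 1

    # Verify the copy before trusting it.
    if ! cmp -s -- "$mwc_master" "$mwc_work"; then
        echo "working copy differs from master" >&2
        return 1
    fi

    # Protect the master from accidental writes.
    chmod a-w -- "$mwc_master"
}

# Usage (placeholder names):
#   make_working_copy sdd1.img sdd1.work.img
#   fsck -f sdd1.work.img
```

If a recovery attempt wrecks the working copy, delete it and cut a fresh one from the master instead of going back to the failing drive.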
I personally suggest that you run something like hexdump or strings to read through the image ... just let it scroll past for a long time and look for plain text that looks like it might be fragments of your data. I have used grep to recover useful (textual) data from otherwise completely mangled filesystems. In this case I'm not suggesting it as data recovery heroics ... but as a sanity check. If you scroll through tens of megabytes or a few gigabytes of data and don't see any recognizable text ... then you probably have a hopeless case or you've done something very wrong (were you really careful about those if= and of= options?).
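If you'd rather search than scroll, something along these lines works as a quick sanity check. scan_image is a made-up name, and I'm assuming GNU grep for the -a, -b and -o options. It prints the byte offset and text of every match for a pattern you expect to find in your own data:

```shell
#!/bin/sh
# Sketch only: scan_image is a hypothetical helper; assumes GNU grep.
scan_image() {
    scan_file=$1      # the image (or better, a copy of it)
    scan_pattern=$2   # extended regex for text you expect to find

    # -a  treat the binary image as text
    # -b  print the byte offset of each match
    # -o  print only the matching part, not the whole "line"
    # LC_ALL=C avoids multibyte decoding trouble on raw bytes.
    LC_ALL=C grep -a -b -o -E -e "$scan_pattern" -- "$scan_file"
}

# Usage (placeholder names):
#   scan_image sdd1.work.img 'Subject: [[:print:]]+'
```

No matches at all for text you know was on the disk is the "something is very wrong" signal; plenty of matches means the data is in there even if the filesystem can't find it.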
I don't know if any of this will help you with the current effort. But learn these tricks now and they will definitely make your next foray into data recovery much less scary. (Yes, practice on a healthy system once or twice: go use a hex editor and try adding your own creative corruption here and there, to the COPY of course! Then try to fix it.)
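If you don't have a hex editor handy, dd itself can do the "creative corruption" for practice. This is only a sketch: corrupt_copy is an invented name, and it assumes /dev/urandom exists. It copies the original first and scribbles random bytes over a chosen range of the copy only:

```shell
#!/bin/sh
# Sketch only: corrupt_copy is a hypothetical practice helper.
corrupt_copy() {
    cc_original=$1   # a healthy image; never modified
    cc_copy=$2       # new file that will receive the damage
    cc_offset=$3     # byte offset where the damage starts
    cc_count=$4      # number of bytes to overwrite

    if [ -e "$cc_copy" ]; then
        echo "refusing to overwrite: $cc_copy" >&2
        return 1
    fi
    cp -- "$cc_original" "$cc_copy" || return 1

    # bs=1 makes seek= and count= plain byte counts;
    # conv=notrunc keeps everything after the damaged range intact.
    dd if=/dev/urandom of="$cc_copy" bs=1 \
       seek="$cc_offset" count="$cc_count" conv=notrunc 2>/dev/null
}

# Usage (placeholder names): damage 64 bytes starting at offset 1024
#   corrupt_copy healthy.img practice.img 1024 64
```

Then point fsck and friends at the damaged copy and see how far you get.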
Oh, and this is a really good time to review your backup and data recovery plans and procedures (or provide better advice to your customer/colleague/client/friend/whatever).
Best Answer
First off, you're right about running fsck on the partition - fsck only works on filesystems, not entire disks. You can get a list of all partitions on the disk with fdisk -l /dev/sdd. Your filesystem type is probably ext3 (the default in most Linux distros), which means it will usually pass an fsck as long as its journal is clean.
fsck -f will, as mentioned above, force a full check. However, if you have read errors on the disk, no amount of fsck will help dd get past them - dd just copies raw blocks and really doesn't care about the content of the disk.
To get dd to read the disk and continue on read errors, use dd conv=noerror,sync, which carries on after a read error and pads the affected block with null bytes, so everything after it stays at the right offset in the image. After you have finished the backup, you should run fsck -f on the clone to get it up and running again.

Another tip: if you back up the partition to a file, you can loopback mount it with mount -o loop filename.ext3 /mountpoint. Also, say you are cloning a 200G partition to a 500G drive: you can then run resize2fs /dev/sdx1 (where sdx is your new drive, partitioned with a single 500G partition), and the filesystem will be resized to 500G.

Lastly, if the disk is in such a shape that it's giving you read errors, I would advise you to avoid turning the disk off and on until you're finished recovering data. In some failure modes, the disk will at some point simply no longer spin up, or will fail to be recognized by the OS, and at that point getting data out of the drive becomes quite expensive.
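Putting the conv=noerror,sync advice into a command, here's a minimal sketch. rescue_image is a name I've made up, the 4096-byte default block size is my own assumption (a smaller block means less data lost to padding per bad read), and the paths in the comments are placeholders:

```shell
#!/bin/sh
# Sketch only: rescue_image is a hypothetical helper; assumes GNU dd.
rescue_image() {
    ri_src=$1            # failing partition, e.g. /dev/sdd1
    ri_dst=$2            # new image file to create
    ri_bs=${3:-4096}     # block size; default is an assumption

    if [ -e "$ri_dst" ]; then
        echo "refusing to overwrite: $ri_dst" >&2
        return 1
    fi

    # noerror: keep going after a read error
    # sync:    pad every short or failed input block with NULs up to
    #          the block size, so later data keeps its offset
    dd if="$ri_src" of="$ri_dst" bs="$ri_bs" conv=noerror,sync
}

# Usage (placeholder names), followed by the checks from the answer:
#   rescue_image /dev/sdd1 sdd1.ext3
#   fsck -f sdd1.ext3
#   mount -o loop sdd1.ext3 /mountpoint
```

One side effect of sync worth knowing about: the last block is padded too, so the image can come out slightly larger than the source, rounded up to a multiple of the block size.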