Linux – Rebuild massively fragmented files with a partial image and a list of their sectors

Tags: command-line, data-recovery, linux, mft, ntfs

In an attempt to recover as much data as possible from a failing 3TB HDD, I proceeded like this:

  • I made a surface scan with HD Sentinel, which identified two small damaged areas and about 100 bad sectors (before that, the count was at 16).
  • Then I identified which files were affected by bad sectors using various methods.
  • I moved those files (six large video files) to a dedicated folder, then copied the rest of the files and folders in decreasing order of importance; everything was copied successfully, except one unimportant .eml file which happened to be located close to the already identified bad sectors.
  • Then I figured that the safest way to get the most out of the remaining files (TV broadcasts which are no longer online and for which I have no backup) would be to use ddrescue – but since the only empty HDD I had was a 500GB one, I couldn't image everything. Some of those files are massively fragmented (6000 to 12000 fragments each; they were downloaded simultaneously, which I guess is why they were written in an “interlaced” pattern causing that level of fragmentation, because otherwise the HDD had plenty of free space), so I could not recover them simply by extracting the runs of sectors they occupied. But I thought that by imaging the first 10GB, usually containing the whole MFT and all the other system files, plus the four areas where those files were located (see the sketch below), I would be able to extract them easily from the image, using WinHex or R-Studio.
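For that selective imaging step, ddrescue can be pointed at chosen regions one after another while reusing a single mapfile, so the rescued areas accumulate in one image (only the imaged regions contain data; the rest stays sparse). A minimal sketch, with a hypothetical device name and example region values:

# Image only selected regions of the disk; the shared mapfile records
# what has been rescued so far, and each run adds its region to it.
ddrescue -i 0 -s 10Gi /dev/sdb partial.img partial.map
ddrescue -i 2115843346432 -s 2Gi /dev/sdb partial.img partial.map
# ... one run per region of interest ...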

But unfortunately, I didn't get the whole MFT: as I later found out by examining the complete nfi.exe listing of that partition I had made earlier, part of it is located around the 200GB mark, and a third chunk sits at the very end of the partition, close to the 3TB mark. I didn't think the HDD's state would deteriorate so quickly during the recovery attempt (just a few hours later, it has more than 12000 reallocated sectors plus 9000 pending sectors!), and I didn't take the precaution of saving the MFT from WinHex while I still could. Now that ddrescue has become painfully slow, I probably won't get the whole MFT. Also, if I open that partial image with WinHex, it uses the same volume snapshot that was created when I examined the physical device: the files I want are listed with their correct sizes and dates, and if I click on one it displays the correct first sector, but it still can't extract them (only 0-byte files are extracted). Apparently the volume snapshot doesn't contain all the required data about the allocated sectors; WinHex seems to rely on the MFT at that point, so that won't work either.

But I have recovered a good portion of the chunks of data containing those six files, and I have for each of them a detailed list of the sectors/clusters they occupy (obtained with three different tools: nfi.exe, Recuva, HD Sentinel). Now, how can I rebuild those files from that information, using an automated script? (Doing it manually would be an impossible task.)

With ddrescue I could use the -i (input position), -o (output position) and -s (input size) switches, but how could I automate the process and run those thousands of commands all at once?
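One way to automate it is to reduce the fragment list to bare numbers and loop over them from the shell. A sketch, assuming a hypothetical fragments.tsv holding one “input offset, output offset, size” triple (in bytes) per line:

# Replay each fragment against the partial image; the .tsv and file names
# are placeholders for illustration.
while read -r in out size; do
    ddrescue -i "$in" -o "$out" -s "$size" ST3000DM001-2.dd 201707222358.mp4
done < fragments.tsv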

On Windows, I know a command-line tool called dsfo which can extract data from any source to a destination file, with a command like this:

dsfo [source] [offset] [size] [destination]

I could edit my list of sectors/clusters with a combination of Calc and TED Notepad to create a list of dsfo commands, but that would create thousands of chunks, which I would then have to join somehow. Is there a better way to do this in one step?
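On Linux, dd can do what dsfo does but also write each chunk directly at its final position in a single output file, so there is nothing to join afterwards. A sketch, assuming 4KiB clusters, an image that starts exactly at the partition boundary, and a hypothetical fragments.tsv with one “LCN VCN Length” triple per line:

# conv=notrunc stops dd from truncating the output between writes, so every
# fragment lands at its final place (seek) inside one growing file.
while read -r lcn vcn len; do
    dd if=ST3000DM001-2.dd of=201707222358.mp4 bs=4096 \
       skip="$lcn" seek="$vcn" count="$len" conv=notrunc status=none
done < fragments.tsv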



EDIT:

So I took the list of clusters/sectors for one of those files, generated by HD Sentinel, which is presented like this:

R:\fichiers corrompus\2017_07_2223_58 - Arte - Pink Floyd - The Dark Side of the Moon Live.mp4
Total Size: 883 787 365 bytes   Position: 0     Attributes: Arc
Number of file fragments: 6040
VCN: 0  LCN: 516530293  Length: 4288    sectors: 4132506536 - 4132540839
VCN: 4288   LCN: 516534613  Length: 16  sectors: 4132541096 - 4132541223
VCN: 4304   LCN: 516534645  Length: 64  sectors: 4132541352 - 4132541863
VCN: 4368   LCN: 516534725  Length: 16  sectors: 4132541992 - 4132542119
VCN: 4384   LCN: 516534757  Length: 48  sectors: 4132542248 - 4132542631
VCN: 4432   LCN: 516534853  Length: 32  sectors: 4132543016 - 4132543271
VCN: 4464   LCN: 516534901  Length: 16  sectors: 4132543400 - 4132543527
VCN: 4480   LCN: 516534933  Length: 48  sectors: 4132543656 - 4132544039
VCN: 4528   LCN: 516535013  Length: 16  sectors: 4132544296 - 4132544423
...
VCN: 215760 LCN: 568126709  Length: 9   sectors: 4545277864 - 4545277935

The first field presumably stands for “Virtual Cluster Number” (I haven't found a detailed description in the integrated help); in any case, this value obviously represents the cluster number relative to the beginning of the file. The second value must be the “Logical Cluster Number”, i.e. the cluster number relative to the beginning of the partition (see below – I had it wrong at first, thinking that this value was relative to the whole device). The third value is the length of each fragment, also measured in clusters. Those three values should suffice for my purposes.
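As a sanity check of that interpretation (assuming 4KiB clusters, i.e. 8 sectors of 512 bytes), converting the first fragment to bytes reproduces the numbers that appear in the generated commands further down:

echo $(( 516530293 * 4096 ))   # 2115708080128 → input offset (LCN × cluster size)
echo $(( 4288 * 4096 ))        # 17563648      → fragment size (Length × cluster size)
echo $(( 0 * 4096 ))           # 0             → output offset (VCN × cluster size)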

I imported the HD Sentinel list into TED Notepad and used the “Tools” > “Lines” > “Columns, numbers” function, selecting columns 2, 3, 1 with tabs as separators, which produced this output:

LCN: 516530293  Length: 4288    VCN: 0
LCN: 516534613  Length: 16  VCN: 4288
LCN: 516534645  Length: 64  VCN: 4304
LCN: 516534725  Length: 16  VCN: 4368
LCN: 516534757  Length: 48  VCN: 4384
LCN: 516534853  Length: 32  VCN: 4432
LCN: 516534901  Length: 16  VCN: 4464
LCN: 516534933  Length: 48  VCN: 4480
LCN: 516535013  Length: 16  VCN: 4528
...
LCN: 568126709  Length: 9   VCN: 215760 

Then I imported that into Calc with tabs and spaces as separators, added a column to compute the input offset from the cluster number (=LCN*8*512), another to compute the length in bytes from the length in clusters (=Length*8*512), and a third to get the output offset from the VCN value (=VCN*8*512). I pasted the formulas down to all the other rows, removed the extra columns, replaced “LCN:” with “ddrescue /media/sdb1/ST3000DM001-2.dd /media/sdb1/201707222358.mp4 -i”, replaced “Length:” with “-s”, and replaced “VCN:” with “-o”…
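(For reference, the whole transformation could be done in one pass with awk, reading HD Sentinel's listing directly – a sketch assuming the field layout shown above, 4KiB clusters, and a listing saved as fragment_list.txt; it places the options before the file names, the form that eventually worked:)

# Turn each "VCN: x  LCN: y  Length: z ..." line into one ddrescue command.
awk '/^VCN:/ { printf "ddrescue -i %d -s %d -o %d ST3000DM001-2.dd 201707222358.mp4\n",
               $4 * 4096, $6 * 4096, $2 * 4096 }' fragment_list.txt > extract.sh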
Now, via the spreadsheet route, I've got this (except there are 6000-12000 lines for each file):

ddrescue /media/sdb1/ST3000DM001-2.dd /media/sdb1/201707222358.mp4 -i 2115708080128 -s 17563648 -o 0
ddrescue /media/sdb1/ST3000DM001-2.dd /media/sdb1/201707222358.mp4 -i 2115725774848 -s 65536 -o 17563648
ddrescue /media/sdb1/ST3000DM001-2.dd /media/sdb1/201707222358.mp4 -i 2115725905920 -s 262144 -o 17629184
ddrescue /media/sdb1/ST3000DM001-2.dd /media/sdb1/201707222358.mp4 -i 2115726233600 -s 65536 -o 17891328
ddrescue /media/sdb1/ST3000DM001-2.dd /media/sdb1/201707222358.mp4 -i 2115726364672 -s 196608 -o 17956864
ddrescue /media/sdb1/ST3000DM001-2.dd /media/sdb1/201707222358.mp4 -i 2115726757888 -s 131072 -o 18153472
ddrescue /media/sdb1/ST3000DM001-2.dd /media/sdb1/201707222358.mp4 -i 2115726954496 -s 65536 -o 18284544
ddrescue /media/sdb1/ST3000DM001-2.dd /media/sdb1/201707222358.mp4 -i 2115727085568 -s 196608 -o 18350080
ddrescue /media/sdb1/ST3000DM001-2.dd /media/sdb1/201707222358.mp4 -i 2115727413248 -s 65536 -o 18546688
...
ddrescue /media/sdb1/ST3000DM001-2.dd /media/sdb1/201707222358.mp4 -i 2327047000064 -s 36864 -o 883752960

So, what is the simplest way to run this huge series of commands on a Knoppix live system? What is the Linux equivalent of a Windows batch script?

(I could find that particular file on a P2P network, so it will let me test whether this method works flawlessly and, if it does, assess the level of damage. No such luck for the five others. One of them is not fragmented, so I could extract it as a single chunk of data: there are many blank sectors near the end, but the rest is readable. So there remain four files to extract that way.)

Best Answer

So I did run those ddrescue scripts (first making them executable with the “chmod +x” command, then calling them with ./name_of_the_script):

– At first the commands didn't work; ddrescue only gave errors. I had to edit the scripts again so that the parameters would be placed before the names of the input and output files. The commands then looked like this:

ddrescue -P -i 2115843346432 -s 17563648 -o 0  ST3000DM001-2.dd 201707222358.mp4
ddrescue -P -i 2115861041152 -s 65536 -o 17563648  ST3000DM001-2.dd 201707222358.mp4
ddrescue -P -i 2115861172224 -s 262144 -o 17629184  ST3000DM001-2.dd 201707222358.mp4
ddrescue -P -i 2115861499904 -s 65536 -o 17891328  ST3000DM001-2.dd 201707222358.mp4
ddrescue -P -i 2115861630976 -s 196608 -o 17956864  ST3000DM001-2.dd 201707222358.mp4
ddrescue -P -i 2115862024192 -s 131072 -o 18153472  ST3000DM001-2.dd 201707222358.mp4
...
ddrescue -P -i 2327182266368 -s 36864 -o 883752960  ST3000DM001-2.dd 201707222358.mp4

(Total size of that file: 883787365 bytes, or 883789824 with the slack space.)
(“-P” stands for “data preview”, “-i” for “input position”, “-s” for “size”, “-o” for “output position”.)
(The paths could be omitted because the scripts, the image file and the expected output files were all in the same directory.)
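(Put together, each script is just a plain text file: a shebang line followed by one such command per fragment. The set -e line below is an optional addition that aborts at the first failing command:)

#!/bin/sh
set -e
ddrescue -P -i 2115843346432 -s 17563648 -o 0 ST3000DM001-2.dd 201707222358.mp4
# ... one ddrescue line per fragment ...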

– Then the first attempt produced an unreadable file, without a correct MP4 header. Why? Because in the list provided by Hard Disk Sentinel, the sector numbers are physical/absolute (relative to the whole device), but the logical cluster numbers are relative to the partition (I verified this by opening the image file with WinHex), so I had to add 264192×512 to the input offset calculation, the partition offset being 264192 sectors, or 129MB.
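The correction is easy to check with shell arithmetic on the first fragment:

echo $(( 516530293 * 4096 ))                  # 2115708080128 → partition-relative (the unreadable first attempt)
echo $(( 516530293 * 4096 + 264192 * 512 ))   # 2115843346432 → absolute offset used in the commands above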

– Then it worked. It took just a few minutes and produced five video files, which are mostly readable and seekable to the end, with their expected content – I haven't watched them completely, but it seems as flawless as can be.


(I did all this on a secondary computer running Knoppix live from a memory card, and used TeamViewer to control it from my primary computer on Windows 7, which also made it easy to transfer the script files. Maybe there's a simpler setup for such purposes, but, well, it works! :^p)

– But of course there are corrupted parts, since there were unreadable sectors in that partial image. How could I know where, quickly and reliably? Well...
I had the idea of using ddrescue's “generate” mode, which creates a logfile (or mapfile, as they're called now) by parsing the output file and treating completely empty sectors as unread sectors, marked “?”, the rest being marked “+”. Since ddrescue expects an input file and an output file, but only the output file is actually parsed in that mode, I created dummy input files with this command, which copies only 1MB but extends the dummy to the full size of the output file (just to save time and space):

ddrescue -s 1048576 -x 883789824 201707222358.mp4 201707222358copy.mp4

Then I ran the “generate” command:

ddrescue -G 201707222358copy.mp4 201707222358.mp4 201707222358-generate.log
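To process all the rebuilt files at once, both steps can be wrapped in a loop – a sketch, assuming the rebuilt videos are the only .mp4 files in the current directory:

for f in *.mp4; do
    case "$f" in *copy.mp4) continue ;; esac        # skip dummies from a previous run
    size=$(stat -c %s "$f")                         # -x extends the dummy to at least this size
    ddrescue -s 1048576 -x "$size" "$f" "${f%.mp4}copy.mp4"
    ddrescue -G "${f%.mp4}copy.mp4" "$f" "${f%.mp4}-generate.log"
done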

And then I opened those files with ddrescueview:

[ddrescueview maps of two of the rebuilt files: the first shows large unread areas, the second only a few damaged spots]

(Three of the six files are seriously damaged like the first one above, with large chunks of empty data; the other three have only a few corrupted sectors, like the second one. The second one is the file that was not fragmented – I extracted it with a single ddrescue command.)

And then I patted myself on the back with one hand, while slapping my face with the other for having used that 3TB HDD every day for months with no backup... (At first it was only supposed to store temporary stuff while I made room on other HDDs, but that took longer than anticipated; I ran out of space to store such videos, and at some point even my personal pictures and videos ended up there. It could have been a major disaster, but “it's only a glitch”, as Dick Jones would have said.)