Automatically ‘brute force’ a few bytes to recover a corrupt file

brute forcedata-recoveryrepair

Does anyone out there know of a way to brute force values at a particular offset in a file? It's 4 consecutive bytes which would need to be brute forced. I know the correct SHA-1 of the corrupt file. So, what I'd like to do is compare the complete file SHA-1, each time it changes the byte value.

I know the exact 4 bytes which were changed, because the file was given to me by a data recovery expert, as a recovery challenge. For those who are interested in knowing, the rar file has 4 bytes which were intentionally changed. I was told the offsets of the changed 4 bytes and the original SHA-1. The person said it's IMPOSSIBLE to recover the exact file in the archive once the 4 bytes were changed. Even if it was only a few bytes and you knew exactly where the corruption was located. Since it doesn't have a recovery record. I'm trying to see if there is a way for those particular 4 bytes to be filled in correctly so the file will decompress without error. The file size is around 5mb.

Example:

I uploaded photos so it's more clearly defined of exactly what I'm looking to do. I believe someone can post them here for me with more rep.

Screenshot One

Screenshot Two

The example offset I'm focusing on is 0x78 where the first pic shows the value as CA
I want the script to the take the value up by 1 so it becomes CB as shown in the second pic. I want it to keep increasing the value by 1 and then compare the whole file SHA-1 each time. Only making changes to those 4 bytes at the specified offset.

It will try CAC5C58A and compare the SHA-1. If doesn't match, then it will try CBC5C58A.Then once the first value reaches FF it will then go to 00C6C58A and so on. Basically, I would like it to be able to go from 00000000-FFFFFFFF but to also have the option to choose where you want it start and end. I know it could take some time but I would still like to try it. Keep in mind I know the exact offset of the bytes which are corrupt. I just need the correct values.

If you search on Google: "How to fix a corrupted file by brute force" There's a person that wrote a Linux program. However, it only works against the files included with the program. I'm looking for some way to use the same process with my file.

Best Answer

Here's a small Python program which does what you seem to be describing.

#!/usr/bin/env python3
from hashlib import sha1

with open('binaryfile', 'rb') as bin:
    binary = bin.read()

base = 0x0078
# ... is not valid Python; add more sequences, or take it out (or see below)
for seq in [[0xCA, 0xC5, 0xC5, 0x8A], [0xCB, 0xC5, 0xC5, 0x8A], ...]:
    copy = binary[0:base]
    copy += bytes(seq)
    copy += binary[base+len(seq):]
    if sha1(copy).hexdigest() == '9968733ce3ff0893bbb0a19e75faaf2fb0000e19':
        print('success with bytes {0}'.format(seq))
        break
else:
    print('no success')

UnOnly briefly tested; please ping me if you find typos.

The base specifies where to try to apply the four bytes, and the long string '996873... is the hex representation of the expected SHA1. The line for seq in... defines the bytes to try; and of course replace 'binaryfile' with the path to the file you want to attempt to salvage.

You can replace the literal list [[0xCA, 0xC5,...]] with something to actually loop over all possible values but it's basically just a placeholder for something more useful because I'm not really sure what exactly you want there.

Something like for seq in itertools.product(range(256), repeat=4)): will loop over all possible values from 0 to 232-1. (You will need to add import itertools near the top then.) Or perhaps you could simply add an offset; update the script to replace the current for seq in with the following (where again the import needs to go before the main program);

import struct

for n in range(2**32):
    val=(n+0x8AC5C5CA) % 2**32  # notice reverse order
    seq=list(reversed(struct.pack(">I", val)))
    copy = ...

I reversed the order of the bytes so that it naturally increments from 0x8AC5C5CA to 0x8AC5C5CB but then the next increment will be 0x8AC5C5CC etc. The struct magic is to convert this to a sequence of bytes (had to look it up from https://stackoverflow.com/a/26920983/874188). This will start at 0x8AC5C5CA and go to 0xFFFFFFFF, then wrap around to 0x00000000 and climb back up to 0x8AC5C5C9.

If you have multiple candidate ranges you would like to examine in a particular order, maybe something like

for rge in [(0x8AC5C5CA, 0x8AFFFFFF), (0x00C6C58A, 0x00FFFFFF),
        (0x00000000, 0x00C6C589), (0x01000000, 0x8AC5C5C9)]:
    for val in range(*rge):
        seq=list(reversed(struct.pack(">I", val)))
        copy = ...

but then you'll need to make sure yourself that the (start, end) pairs in rge cover all of the space between 0x00000000 and 0xFFFFFFFF if you really want to examine all of it. (And again, notice that the range increments the last byte and that seq applies the bytes of the value in reverse, in accordance with your stated requirements.)

If you wanted to use two different base addresses, you quickly run up against the limits of what's feasible to do in your lifetime with brute force; but you could, for example, split the 4-byte number into two 2-byte parts and apply those at different offsets.

base1 = 0x1234
base2 = 0x2345

for seq in range(whatever):
    copy = binary[0:base1]
    copy += bytes(seq[0:1])
    copy += binary[base1+2:base1+base2]
    copy += bytes(seq[2:3])
    copy += binary[base2+2:]
Related Question