Let's say I have a 4 GB file abc on my local computer. I uploaded it to a distant server via SFTP, which took a few hours.
Now I have slightly modified the file locally (probably 50 MB at most, but not consecutive bytes in the file) and saved it as abc2. I also kept the original file abc on my local computer.
How to compute a binary diff of abc and abc2?
Applications:
- I could send only a patch file (probably 100 MB at most) to the distant server instead of re-uploading the whole abc2 file (it would take a few hours again!), and recreate abc2 on the distant server from abc and patch only.
- Locally, instead of wasting 8 GB to back up both abc and abc2, I could save only abc + patch, so it would take less than 4100 MB.
How to do this?
PS: for text, I know diff, but here I'm looking for something that works for any raw binary format: zip files, executables, or any other type of file.
PS2: If possible, I don't want to use rsync. I know it can replicate changes between two computers efficiently (not resending data that has not changed), but here I really want a patch file, so that abc2 can be reproduced later if I have both abc and patch.
Best Answer
For the second application/issue, I would use a deduplicating backup program like restic or borgbackup, rather than trying to manually keep track of "patches" or diffs. The restic backup program allows you to back up directories from multiple machines to the same backup repository, deduplicating the backup data both among fragments of files from an individual machine and between machines. (I have no user experience with borgbackup, so I can't say anything about that program.)

Calculating and storing a diff of the abc and abc2 files can be done with rsync.

This is an example with abc and abc2 being 153 MB. The file abc2 has been modified by overwriting the first 2.3 MB of the file with some other data.

We create our patch for transforming abc into abc2 using rsync's batch mode and call it abc-diff.

The generated file abc-diff is the actual diff (your "patch file"), while abc-diff.sh is a short shell script that rsync creates for you. This script modifies abc so that it becomes identical to abc2, given the file abc-diff.

The file abc-diff could now be transferred to wherever else you have abc. With the command rsync --read-batch=abc-diff abc, you would apply the patch to the file abc, transforming its contents to be the same as the abc2 file on the system where you created the diff.

Re-applying the patch a second time seems safe. There are no error messages, nor do the file's contents change (the MD5 checksum does not change).
Note that unless you create an explicit "reverse patch", there is no way to easily undo the application of the patch.
I also tested writing the 2.3 MB modification to some other place in the abc2 data, a bit further in (at about 50 MB), as well as at the start. The generated "patch" was 4.6 MB in size (two modifications of 2.3 MB each), suggesting that only the modified bits were stored in the patch.