Rolling Diffs – Efficient Storage of Highly Similar Files

backup, command line, diff(), shell

At work we do a nightly dump of our MySQL databases. I would guesstimate that 90-95% of the data is unchanged from one day to the next, and that share increases as time goes on. (Heck, at this point some are probably 99%.)

In these dumps each line is a single MySQL INSERT statement, so the only differences are whole lines and the order in which they appear in the file. If I got them sorted, the actual difference from file to file would be very small.

I've been looking, and I haven't found any way to sort the output on dump. I could pipe it through the sort command, though. Then there would be long, long blocks of identical lines.

So I'm trying to figure out a way to store only the diffs. I could start with a master dump and diff against that each night, but the diffs would get larger each night. Or I could make rolling diffs, each against the previous night's file. Those would individually be very small, but it seems like the work would take longer and longer if I have to put together a master diff of the whole series each night.

Is this feasible? With what tools?

Edit: I'm not asking how to do MySQL backups. Forget MySQL for the moment; it's a red herring. What I want to know is how to make a series of rolling diffs from a series of files. Each night we get a file (which happens to be a mysqldump file) that is 99% similar to the one before it. Yes, we gzip them all. But it's redundant to have all that redundancy in the first place. All I really need is the differences from the night before… which is only 1% different from the night before… and so on. So what I'm after is how to make a series of diffs so that I need only store that 1% each night.

Best Answer

Two backup tools that can store binary diffs are rdiff-backup and duplicity. Both are based on librsync, but beyond that they behave quite differently. Rdiff-backup stores the latest copy and reverse diffs, while duplicity stores traditional incremental diffs. The two tools also offer different sets of peripheral features.
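The "latest copy plus reverse diffs" layout can be illustrated with nothing but sort and comm. This is only a sketch of the idea, not how rdiff-backup works internally (it stores librsync binary deltas). It assumes the dump lines can be sorted freely, as the question proposes, so what it restores is the sorted form of each dump. The function names `ingest` and `step_back`, the `.old-only`/`.new-only` suffixes and the demo data are all made up for illustration.

```shell
#!/bin/sh
# Sketch only: reverse line deltas for sorted dumps (hypothetical names throughout).
set -eu
export LC_ALL=C                 # sort and comm must agree on the ordering

store=$(mktemp -d)              # stand-in for the backup location

# ingest FILE N: FILE becomes the full "latest" copy; a reverse delta for
# night N records how to get back to the previous night's content.
ingest() {
    sort "$1" > "$store/new"
    if [ -f "$store/latest" ]; then
        comm -23 "$store/latest" "$store/new" > "$store/$2.old-only"  # lines that disappeared
        comm -13 "$store/latest" "$store/new" > "$store/$2.new-only"  # lines that appeared
    fi
    mv "$store/new" "$store/latest"
}

# step_back N: read the state after night N on stdin, write the state before it.
step_back() {
    comm -23 - "$store/$1.new-only" | sort -m - "$store/$1.old-only"
}

# Made-up demo data: three nightly "dumps".
printf '%s\n' "INSERT 1" "INSERT 2" "INSERT 3" > "$store/night1.sql"
printf '%s\n' "INSERT 1" "INSERT 3" "INSERT 4" > "$store/night2.sql"
printf '%s\n' "INSERT 1" "INSERT 4" "INSERT 5" > "$store/night3.sql"

ingest "$store/night1.sql" 1
ingest "$store/night2.sql" 2
ingest "$store/night3.sql" 3

# Restore night 1 by walking back from the latest copy.
step_back 3 < "$store/latest" | step_back 2 > "$store/restored1"
```

With this layout each night costs one comparison against the latest full copy, no matter how long the history is. The price is paid on restore, where an older night is reached by applying the reverse deltas one at a time.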

Related Solutions

Linux – Diff Entire Linux Systems

I think your idea is not far from a solution. To outline a possible way: I am using rsnapshot for backups. It creates a backup directory structure of all or a subset of your files, with entry points such as /backup/hourly.1/... and /backup/hourly.0/..., where each branch carries the whole data but uses hard links for files that have not changed. Doing a recursive ls or find on both structures and comparing the output (sorted, in the case of find) will show the missing files, and inspecting the link count (the second column in ls -l) will show new files, which have a link count of 1. For details of the changes within the identified files, you can use ordinary diff tools. As said, this is an outline: it will need some work to implement and may have non-apparent quirks, so take the proposal with a grain of salt.
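The comparison outlined above might look roughly like this. The script builds two tiny stand-in snapshot directories by hand rather than running rsnapshot, so the file names and the `list` helper are hypothetical. It also assumes that a changed file is stored as a fresh copy in the newer snapshot, which means changed files show a link count of 1 just like new ones. Separating the two takes one extra comparison against the old listing.

```shell
#!/bin/sh
# Sketch only: compare two hard-linked snapshots (demo data, hypothetical names).
set -eu
export LC_ALL=C

root=$(mktemp -d)
old="$root/hourly.1"; new="$root/hourly.0"
mkdir "$old" "$new"

echo same  > "$old/kept.txt"
ln "$old/kept.txt" "$new/kept.txt"      # unchanged: hard link, link count 2
echo gone  > "$old/removed.txt"         # only in the older snapshot
echo v1    > "$old/changed.txt"
echo v2    > "$new/changed.txt"         # changed: separate copy, link count 1
echo fresh > "$new/added.txt"           # new: link count 1

# Sorted list of regular files, relative to the snapshot root.
list() { (cd "$1" && find . -type f | sort); }
list "$old" > "$root/old.list"
list "$new" > "$root/new.list"

# Files in the newer snapshot that share no inode with any other snapshot.
(cd "$new" && find . -type f -links 1 | sort) > "$root/unlinked.list"

missing=$(comm -23 "$root/old.list" "$root/new.list")       # in old, not in new
added=$(comm -13 "$root/old.list" "$root/new.list")         # in new, not in old
changed=$(comm -12 "$root/old.list" "$root/unlinked.list")  # in both, but not linked

printf 'missing: %s\nadded: %s\nchanged: %s\n' "$missing" "$added" "$changed"
```

The link-count test only works this simply when the file is not also linked from other snapshots or from elsewhere on the filesystem, which is one of the quirks a real implementation would have to handle.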

Diff Binary Files – How to Compare Two Large Raw Binary Files

For the second application/issue, I would use a deduplicating backup program like restic or borgbackup, rather than trying to manually keep track of "patches" or diffs. The restic backup program allows you to back up directories from multiple machines to the same backup repository, deduplicating the backup data both among fragments of files from an individual machine and between machines. (I have no user experience with borgbackup, so I can't say anything about that program.)

Calculating and storing a diff of the abc and abc2 files can be done with rsync.

This is an example with abc and abc2 being 153 MB. The file abc2 has been modified by overwriting the first 2.3 MB of the file with some other data:

$ ls -lh
total 626208
-rw-r--r--  1 kk  wheel   153M Feb  3 16:55 abc
-rw-r--r--  1 kk  wheel   153M Feb  3 17:02 abc2

We create our patch for transforming abc into abc2 and call it abc-diff:

$ rsync --only-write-batch=abc-diff abc2 abc

$ ls -lh
total 631026
-rw-r--r--  1 kk  wheel   153M Feb  3 16:55 abc
-rw-------  1 kk  wheel   2.3M Feb  3 17:03 abc-diff
-rwx------  1 kk  wheel    38B Feb  3 17:03 abc-diff.sh
-rw-r--r--  1 kk  wheel   153M Feb  3 17:02 abc2

The generated file abc-diff is the actual diff (your "patch file"), while abc-diff.sh is a short shell script that rsync creates for you:

$ cat abc-diff.sh
rsync --read-batch=abc-diff ${1:-abc}

This script modifies abc so that it becomes identical to abc2, given the file abc-diff:

$ md5sum abc abc2
be00efe0a7a7d3b793e70e466cbc53c6  abc
3decbde2d3a87f3d954ccee9d60f249b  abc2
$ sh abc-diff.sh
$ md5sum abc abc2
3decbde2d3a87f3d954ccee9d60f249b  abc
3decbde2d3a87f3d954ccee9d60f249b  abc2

The file abc-diff could now be transferred to wherever else you have abc. With the command rsync --read-batch=abc-diff abc, you would apply the patch to the file abc, transforming its contents to be the same as the abc2 file on the system where you created the diff.

Re-applying the patch a second time seems safe. There are no error messages, nor do the file's contents change (the MD5 checksum does not change).

Note that unless you create an explicit "reverse patch", there is no way to easily undo the application of the patch.

I also tested writing the 2.3 MB modification to some other place in the abc2 data, a bit further in (at about 50 MB), as well as at the start. The generated "patch" was 4.6 MB in size, suggesting that only the modified bits were stored in the patch.
