Rolling Diffs – Efficient Storage of Highly Similar Files

backup, command line, diff(), shell

At work we do a nightly dump of our MySQL databases. From day to day, I would guesstimate that 90-95% of the data is duplicated, and that share increases as time goes on. (Heck, at this point some are probably 99%.)

In these dumps, each line is a single MySQL INSERT statement, so the only differences are whole lines and the order in which they appear in the file. If I got them sorted, the actual difference from file to file would be very small.

I've been looking, and I haven't found any way to sort the output on dump. I could pipe it through the sort command, though. Then there would be long, long blocks of identical lines.

So I'm trying to figure out a way to store only the diffs. I could start with a master dump and diff against that each night, but the diffs would get larger each night. Or I could make rolling diffs, which individually would be very small, but it seems like they would take longer and longer to compute if I have to put together a master diff of the whole series each night.
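To make the rolling-diff idea concrete, here is a minimal Python sketch. It assumes, as described above, that every line is a self-contained statement and that line order does not matter, so a nightly delta can be stored as "lines added" and "lines removed". The function names and the sample data are made up for illustration; this is not how any particular tool does it.

```python
from collections import Counter

def make_delta(old_lines, new_lines):
    """Return (added, removed) multisets that turn old_lines into new_lines."""
    old, new = Counter(old_lines), Counter(new_lines)
    return new - old, old - new

def apply_delta(lines, delta):
    """Apply one (added, removed) delta and return the result as sorted lines."""
    added, removed = delta
    return sorted((Counter(lines) - removed + added).elements())

# Three hypothetical nightly dumps, one statement per line.
night1 = ["INSERT a", "INSERT b", "INSERT c"]
night2 = ["INSERT b", "INSERT a", "INSERT c", "INSERT d"]  # reordered, one line added
night3 = ["INSERT a", "INSERT c", "INSERT d", "INSERT e"]  # one line removed, one added

# Store the first dump in full, then only the night-to-night deltas.
master = sorted(night1)
deltas = [make_delta(night1, night2), make_delta(night2, night3)]

# Restoring the latest night means replaying every delta in order.
restored = master
for delta in deltas:
    restored = apply_delta(restored, delta)
```

In this layout, computing tonight's delta only needs last night's dump and tonight's dump, not the whole series. The cost that grows with the length of the chain is restoring, since every delta since the master has to be replayed.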

Is this feasible? With what tools?


Edit: I'm not asking how to do MySQL backups. Forget MySQL for the moment. It's a red herring. What I want to know is how to make a series of rolling diffs from a series of files. Each night we get a file (which happens to be a mysqldump file) that is 99% similar to the one before it. Yes, we gzip them all. But it's redundant to have all that redundancy in the first place. All I really need is the differences from the night before… which is only 1% different from the night before… and so on. So what I'm after is how to make a series of diffs so that I need only store that 1% each night.

Best Answer

Two backup tools that can store binary diffs are rdiff-backup and duplicity. Both are based on librsync, but above that they behave quite differently. Rdiff-backup stores the latest copy and reverse diffs, while duplicity stores traditional incremental diffs. The two tools also offer a different set of peripheral features.
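The reverse-diff layout can be sketched in a few lines of Python. This only illustrates the storage scheme (newest copy kept in full, older nights reachable by stepping backwards); it uses simple line-level deltas rather than librsync's binary deltas, and `ReverseStore` and its methods are hypothetical names, not part of rdiff-backup.

```python
from collections import Counter

def reverse_delta(new_lines, old_lines):
    """Return (to_add, to_remove) multisets that turn new_lines back into old_lines."""
    new, old = Counter(new_lines), Counter(old_lines)
    return old - new, new - old

def step_back(lines, delta):
    """Apply one reverse delta and return the older version as sorted lines."""
    to_add, to_remove = delta
    return sorted((Counter(lines) - to_remove + to_add).elements())

class ReverseStore:
    """Keeps the latest dump in full plus one reverse delta per earlier night."""

    def __init__(self, first_dump):
        self.latest = sorted(first_dump)
        self.reverse_deltas = []  # oldest first, newest last

    def add(self, dump):
        # Only the previous full copy and the new dump are needed here.
        self.reverse_deltas.append(reverse_delta(dump, self.latest))
        self.latest = sorted(dump)

    def restore(self, nights_ago):
        # Walk backwards from the latest copy, newest delta first.
        lines = self.latest
        start = len(self.reverse_deltas) - nights_ago
        for delta in reversed(self.reverse_deltas[start:]):
            lines = step_back(lines, delta)
        return lines

store = ReverseStore(["INSERT a", "INSERT b"])
store.add(["INSERT b", "INSERT c"])
store.add(["INSERT c", "INSERT d"])
```

With this layout the most recent night is always available without replaying anything, and `store.restore(n)` pays for one delta per night it has to step back. A forward incremental chain has the opposite trade-off: the oldest copy is immediate and the newest one requires replaying the chain.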
