Is it possible and how to merge identical files between different snapshots in a btrfs file system

backup btrfs compression merge snapshot

I regularly use btrfs snapshots to back up the whole system, but identical files end up stored redundantly across different snapshots. For example, I took a snapshot of @ as @_without_install_nvidia_driver and then installed nvidia_driver. A few days later I updated the system, which changed a massive number of files. A few days after that, I found that the nvidia_driver package was unstable and reverted to @_without_install_nvidia_driver, but then I had to update the system again. At that point I realized that identical files now existed in both snapshots, because the same update had been applied in both.

That got me wondering: Is there a way to merge identical files between different snapshots?

Best Answer

You can use the bedup utility to de-duplicate the identical files. Once you've installed it, usage is fairly simple:

# bedup dedup /path/to/btrfs

You may need to set your snapshots writable (btrfs property set -ts /path/to/snapshot ro false) so it can de-duplicate them. You can change them back afterwards.
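As a rough sketch of that sequence, the function below clears the read-only flag, runs bedup, and sets the flag again. The function name dedup_with_snapshot and both path arguments are placeholders I made up for illustration; it assumes bedup is installed and that you run it as root.

```shell
# Sketch: make a read-only snapshot writable, de-duplicate, then restore it.
# dedup_with_snapshot and its two arguments are illustrative placeholders.
dedup_with_snapshot() {
    fs_root=$1      # mount point of the btrfs filesystem
    snapshot=$2     # read-only snapshot inside it

    # Temporarily clear the read-only flag on the snapshot.
    btrfs property set -ts "$snapshot" ro false || return 1

    # De-duplicate the whole filesystem and remember the exit status.
    bedup dedup "$fs_root"
    status=$?

    # Restore the read-only flag even if bedup failed.
    btrfs property set -ts "$snapshot" ro true || return 1
    return "$status"
}
```

You would call it as dedup_with_snapshot /path/to/btrfs /path/to/snapshot, once per snapshot that needs toggling.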

Note that depending on how many files you have, it could take a while (it first looks for files of the same size, then compares those files... so if you have a bunch of large files of the same size, that can take a bit.)
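To get a feel for the "same size first, then compare" idea, here is a small sketch that only lists identical files and does not de-duplicate anything. It is not bedup's code: it compares SHA-256 checksums as a stand-in for whatever comparison bedup does internally, and it assumes GNU find, sha256sum and uniq. The function name find_identical is my own.

```shell
# Sketch: list byte-identical files under a directory (no de-duplication).
# Assumes GNU find (-printf), sha256sum and GNU uniq (-w, -D).
find_identical() {
    dir=$1
    # Step 1: sizes (in bytes) shared by at least two files.
    find "$dir" -type f -printf '%s\n' | sort -n | uniq -d |
    while read -r size; do
        # Step 2: hash only the files of that size, then print the lines
        # whose 64-character SHA-256 digest occurs more than once.
        find "$dir" -type f -size "${size}c" -exec sha256sum {} + |
            sort | uniq -w 64 -D
    done
}
```

Files with a unique size are never read at all, which is why many large files of the same size are the slow case.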

Finally, you can run it again from time to time and the future runs will be much quicker as it keeps track of the btrfs generation and uses that to skip old files.

Related Solutions

How to clone btrfs filesystem into different medium preserving snapshots’ sharing data

I asked a similar question 2 years ago.

However in my case, I was only planning to copy a single device onto raid0.

I eventually found a solution. At the time you couldn't convert from raid0 to raid10, but it looks like you can since kernel 3.3, so that solution may work for you in the end.

A problem with that approach is that it copies the filesystem UUID (fsid), which means you can't mount both the FS and its copy on the same machine. At the time, there was no tool to change the fsid of a FS, but that might have changed since.

The idea is to add a copy-on-write layer on top of the original device so that it can be written to, but any modification is done somewhere else which you can discard later on. That means you need additional storage space (for instance on an external drive).

Then mount that COW'd FS instead of the original, add the devices for the FS copy and remove the COW's device.

For copy-on-write, you can use the device mapper.

For the disposable copy on write area, here I use a loop device.

Let's say you want to clone /dev/sda onto /dev/sd[bcde]:

Create the COW back store:

truncate -s 100G /media/STORE/snap-store
losetup /dev/loop0 /media/STORE/snap-store

Now unmount the original FS if it is mounted, and run modprobe -r btrfs so that the module forgets its device scan and doesn't interfere.

Then make the COW'd device:

echo "0 $(blockdev --getsize /dev/sda) snapshot /dev/sda /dev/loop0 N 8" | dmsetup create cowed

Now /dev/mapper/cowed is like /dev/sda except that anything written to it will end up in /dev/loop0 and /dev/sda will be untouched.
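For reference, the table line passed to dmsetup can be broken down as in the sketch below. The helper name snapshot_table is hypothetical; the field meanings are those of the device-mapper snapshot target as I understand them.

```shell
# Sketch: build the device-mapper table line for a non-persistent snapshot.
# Fields: start length "snapshot" origin cow-device persistence chunk-size
snapshot_table() {
    sectors=$1   # size of the origin in 512-byte sectors
                 # (blockdev --getsize, or --getsz, its non-deprecated form)
    origin=$2    # device to protect, e.g. /dev/sda
    cow=$3       # device that absorbs the writes, e.g. /dev/loop0
    # N = the snapshot does not survive a reboot
    # 8 = chunk size in 512-byte sectors (4 KiB)
    printf '0 %s snapshot %s %s N 8\n' "$sectors" "$origin" "$cow"
}
```

With that helper, the command above would read: snapshot_table "$(blockdev --getsize /dev/sda)" /dev/sda /dev/loop0 | dmsetup create cowed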

Now, you can mount it:

mount /dev/mapper/cowed /mnt

Add the other devices:

btrfs dev add /dev/sd[bcde] /mnt

And remove the old one:

btrfs dev del /dev/mapper/cowed /mnt

When that's over, you may want to shut down and unplug /dev/sda, or make it read-only: because it has the same filesystem UUID as the other devices, btrfs might still interfere with it.
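A possible cleanup, sketched under the assumption that the btrfs dev del step has finished successfully and nothing uses the mapping any more. The function name cleanup_cow is made up, and the arguments correspond to the device names used in this example. Note that the read-only flag set by blockdev does not survive a reboot.

```shell
# Sketch: release the temporary COW layer and protect the original disk.
# Only run this once "btrfs dev del" has finished successfully.
cleanup_cow() {
    origin=$1    # original device, e.g. /dev/sda
    loopdev=$2   # loop device holding the COW store, e.g. /dev/loop0
    mapping=$3   # device-mapper name, e.g. cowed

    dmsetup remove "$mapping" || return 1   # drop the snapshot mapping
    losetup -d "$loopdev" || return 1       # detach the loop device
    blockdev --setro "$origin"              # mark the original read-only
}
```

Here that would be cleanup_cow /dev/sda /dev/loop0 cowed, after which the snap-store file can be deleted.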

Now, if I understand correctly, assuming you've got a recent btrfs-progs, you should be able to run:

btrfs balance start -d convert=raid10 /mnt

That converts the data to raid10. In theory, it should make sure that every data chunk is copied onto at least 2 disks.

I would strongly recommend that you do tests on a dummy btrfs on loop devices first as all that is from memory and I might have gotten it wrong (see for instance my initial answer before my edit).
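One way to set up such a dummy environment is to back each fake disk with a sparse file, as in this sketch. The helper name make_loop_disk and the file paths are placeholders; attaching loop devices needs root.

```shell
# Sketch: back a throwaway block device with a sparse file.
make_loop_disk() {
    img=$1     # path of the backing file (placeholder)
    size=$2    # size as understood by truncate, e.g. 1G
    truncate -s "$size" "$img" || return 1
    # --find picks the first free loop device, --show prints its name.
    losetup --find --show "$img"
}

# Example use (as root), with hypothetical paths:
#   origin=$(make_loop_disk /tmp/origin.img 1G) && mkfs.btrfs "$origin"
#   target=$(make_loop_disk /tmp/target1.img 1G)
```

The returned device names can then stand in for /dev/sda and /dev/sd[bcde] in the steps above.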

Note that since kernel 3.6, btrfs implements send/receive a bit like in zfs. That might be an option for you.
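A minimal sketch of the send/receive route, assuming the destination is another mounted btrfs filesystem. The function name send_subvolume and the paths are illustrative only.

```shell
# Sketch: copy a subvolume to another btrfs filesystem with send/receive.
send_subvolume() {
    subvol=$1   # source subvolume, e.g. /mnt/@
    snap=$2     # path for a new read-only snapshot on the source filesystem
    dest=$3     # directory on the destination btrfs filesystem

    # btrfs send only accepts read-only snapshots, hence -r.
    btrfs subvolume snapshot -r "$subvol" "$snap" || return 1
    btrfs send "$snap" | btrfs receive "$dest"
}
```

For later snapshots, btrfs send -p with the previously sent snapshot as parent transfers only the differences, which should also keep the data shared on the receiving side.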

Btrfs – Understanding Snapshots and Space Usage

I think that most of your questions can be answered simply by remembering that in Btrfs, a snapshot is not really special, it's just a Btrfs subvolume. It just happens that when it's created, it has initial contents instead of being empty, and the storage space for those initial contents is shared with whatever subvolume the snapshot came from.

A snapshot is just like a (full) copy, except it's more economical because of the shared storage.
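To see the sharing for yourself, something like the following sketch should work on a reasonably recent btrfs-progs. The function name snapshot_and_report and the paths are placeholders.

```shell
# Sketch: take a writable snapshot and show how much storage it shares.
snapshot_and_report() {
    src=$1    # existing subvolume, e.g. /mnt/@
    snap=$2   # path of the snapshot to create, e.g. /mnt/@_copy

    btrfs subvolume snapshot "$src" "$snap" || return 1
    # Summarize total, exclusive and shared usage for both subvolumes.
    btrfs filesystem du -s "$src" "$snap"
}
```

Right after the snapshot is taken, nearly all of its data should show up as shared rather than exclusive.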

If I don't do snapshots, can you roll back a single file to several changes ago?

No. Just like with any regular filesystem, modifying files is destructive. You can't magically go back to an earlier version.

Can btrfs snapshots of root be used and thought of just like VMware/VirtualBox snapshots?

VM disk images are usually block devices, not filesystems or files on filesystems, so I think it's a little different.

You could use a Btrfs file as backing store for a VM virtual block device, I guess. In which case the answer to that question is yes. Except if you use the NOCOW option (which is actually recommended for disk images). Then probably not, because copy-on-write is the magic that makes snapshots work.

I label snapshot A, make changes and label it B. If I go back to snapshot A and make changes (even just by booting changing /var/log), are those changes made in a "detached" or "unlabeled" snapshot, so those changes would be invisible if going back to B?

Every subvolume (including snapshots) in Btrfs has a name, so you cannot have an "unlabeled" snapshot.

In general, any changes you make in one Btrfs subvolume (whether it was created as a snapshot or not) are absolutely not ever visible in another Btrfs subvolume. Just remember that a snapshot is just like a copy, but more economical.

When deleting a file, is there "this file is deleted" metadata written, so space is still taken by all the versions of the file?

When deleting a file, its directory entry is removed. That is a modification to the directory, and like all modifications, it will be private to the subvolume in which it occurred. Then after that, if and only if the storage space for the file is not used by any other part of the filesystem, it's freed.

Deleting a file whose storage is shared between multiple snapshots is a lot like deleting a file in any regular filesystem when it has multiple (hard) links. The storage [inode] is freed iff it is not referenced anymore.
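The hard-link analogy is easy to try on any regular filesystem. This sketch uses a temporary directory; the function name hardlink_demo is my own.

```shell
# Sketch: storage behind a hard-linked file is freed only with the last link.
hardlink_demo() {
    dir=$(mktemp -d) || return 1
    echo "payload" > "$dir/a"
    ln "$dir/a" "$dir/b"    # second name for the same inode
    rm "$dir/a"             # removes one directory entry only
    cat "$dir/b"            # the data is still reachable through b
    rm -r "$dir"            # last link gone: the storage can be freed
}
```

Running hardlink_demo prints payload even though the file was "deleted" under its first name, just as a file deleted in one snapshot stays available in another that still references its storage.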

If I build gcc from source, as an example, I think the build directory winds up being 5-8GB. If I build it periodically from source, I'm "chewing up" a bunch of hard drive space, right?

If you build gcc multiple times in multiple different directories, then yeah, it will use more and more space. If you delete copies in between builds or overwrite the same build directory each time, then, no, there's no particular reason why it would keep using more and more space.
