As block-level deduplication goes, I think ZFS is the best implementation currently available. It really isn't designed for after-the-fact optimization, because its deduplication (if turned on) is built directly into the read/write path. Because of this, it can be memory-hungry under load, as it tries to keep the most relevant portions of the deduplication table in memory. ZFS is good at restricting itself to not much more than 50% of memory, though depending on how much memory is installed that limit can seem quite arbitrary (50% of 2 GB vs. 50% of 64 GB, especially if few, if any, user tasks need memory).
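To get a feel for that memory cost, here is a rough back-of-the-envelope sketch. It assumes about 320 bytes of RAM per unique block in the dedup table, which is a commonly quoted rule of thumb and not a figure from this answer; the function name ddt_bytes is made up for illustration.

```shell
# Rough estimate of the ZFS dedup table (DDT) size in bytes.
# ASSUMPTION: ~320 bytes of memory per unique block (rule of thumb, not measured).
# $1 = unique data in bytes, $2 = average block size in bytes,
# $3 = bytes per table entry (optional, defaults to 320)
ddt_bytes() {
    # Round the block count up, then multiply by the per-entry cost.
    echo $(( ($1 + $2 - 1) / $2 * ${3:-320} ))
}
```

Under that assumption, `ddt_bytes 1099511627776 131072` (1 TiB of unique data in 128 KiB blocks) gives 2684354560, i.e. about 2.5 GiB. Smaller blocks push the number up quickly.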
Depending on what you're looking to use it in, you've got some options:
OpenIndiana, based on Solaris, appears to have some good desktop and server options.
FreeBSD (since 9.0) has a pretty advanced version of ZFS (which includes deduplication) built in. One notable FreeBSD-derived (by way of m0n0wall) distribution is NAS4Free, which makes building a NAS pretty easy.
Linux has a few options, some with dedup, others without. Since you're looking for dedup, the most notable I've seen is zfsonlinux. I'm not sure what their progress is, or how stable their project is, but it definitely looks promising.
As to anything with partial block deduplication, I have seen NOTHING so far that reports an ability to do that.
I asked a similar question 2 years ago.
However in my case, I was only planning to copy a single device onto raid0.
I eventually found a solution. At the time you couldn't convert from raid0 to raid10, but it looks like you can since kernel 3.3, so that solution may work for you in the end.
A problem with that approach is that it copies the fsuid, which means you can't mount both the FS and its copy on the same machine. At the time, there was no tool to change the fsuid of a FS, but that might have changed by now.
The idea is to add a copy-on-write layer on top of the original device so that it can be written to, but any modification is done somewhere else which you can discard later on. That means you need additional storage space (for instance on an external drive).
Then mount that COW'd FS instead of the original, add the devices for the FS copy and remove the COW's device.
For copy-on-write, you can use the device mapper.
For the disposable copy on write area, here I use a loop device.
Let's say you want to clone /dev/sda onto /dev/sd[bcde]:
Create the COW back store:
truncate -s 100G /media/STORE/snap-store
losetup /dev/loop0 /media/STORE/snap-store
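As far as I know, the file made by truncate is sparse, so the 100G is only an upper bound on what can be written to the COW'd device, not space used up front. A small sketch to check that on a scratch file (the function name sparse_demo is made up, and the file comes from mktemp rather than /media/STORE):

```shell
# Create a sparse scratch file and print "<apparent bytes> <allocated KiB>".
# $1 = size as accepted by truncate -s (e.g. 100M)
sparse_demo() {
    f=$(mktemp) || return 1
    truncate -s "$1" "$f"                   # sets the length, writes no data
    apparent=$(( $(wc -c < "$f") ))         # size as seen by readers
    allocated_kb=$(du -k "$f" | cut -f1)    # space actually allocated
    rm -f "$f"
    echo "$apparent $allocated_kb"
}
```

On a filesystem that supports sparse files, `sparse_demo 100M` should report the full apparent size next to a much smaller allocated figure. The store only grows as blocks on the COW'd device get modified.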
Now unmount the origin FS if mounted and modprobe -r btrfs to make sure it's not going to interfere and make it forget its device scan.
Then make the COW'd device:
echo "0 $(blockdev --getsize /dev/sda) snapshot /dev/sda /dev/loop0 N 8" | dmsetup create cowed
Now /dev/mapper/cowed is like /dev/sda except that anything written to it will end up in /dev/loop0 and /dev/sda will be untouched.
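If you want to look at the table line before feeding it to dmsetup, a small helper like this can build it without touching any device. The name snapshot_table is made up for illustration. The fields follow the device-mapper snapshot target: start sector, length in 512-byte sectors, origin, COW device, N for a non-persistent snapshot, and the chunk size in sectors (8 sectors = 4 KiB).

```shell
# Print a device-mapper "snapshot" table line; nothing is created or modified.
# $1 = origin size in 512-byte sectors (e.g. from blockdev --getsize)
# $2 = origin device, $3 = COW device, $4 = chunk size in sectors (default 8)
snapshot_table() {
    printf '0 %s snapshot %s %s N %s\n' "$1" "$2" "$3" "${4:-8}"
}
```

Once the printed line looks right, something like `snapshot_table "$(blockdev --getsize /dev/sda)" /dev/sda /dev/loop0 | dmsetup create cowed` would do the real thing (as root).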
Now, you can mount it:
mount /dev/mapper/cowed /mnt
Add the other devices:
btrfs dev add /dev/sd[bcde] /mnt
And remove the old one:
btrfs dev del /dev/mapper/cowed /mnt
When that's over, you may want to shut down and unplug /dev/sda, or make it read-only, because it has the same fsuid as the other devices and btrfs might still mess with it.
Now, if I understand correctly, assuming you've got a recent btrfs-progs, you should be able to do a:
btrfs balance start -dconvert=raid10 /mnt
To convert to raid10. In theory, that should make sure that every data chunk is copied on at least 2 disks.
I would strongly recommend that you do tests on a dummy btrfs on loop devices first as all that is from memory and I might have gotten it wrong (see for instance my initial answer before my edit).
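A rehearsal of the add/remove/convert part could look roughly like the sketch below. It only prints the commands unless DRY_RUN=0 is set (and then it needs root). The image directory, the fixed /dev/loop1 to /dev/loop5 names and the /mnt/test mount point are all placeholders I picked, so check that they are free on your machine. The convert filter is written attached to -d, which is the form the btrfs-progs documentation uses.

```shell
# Rehearse "1 device -> 4 devices -> raid10" on throwaway loop devices.
# ASSUMPTIONS: /dev/loop1..5 are unused, /mnt/test is a free mount point.
rehearse() {
    work=${BTRFS_TEST_DIR:-/tmp/btrfs-test}
    # Print each command in dry-run mode (the default), execute it otherwise.
    run() { if [ "${DRY_RUN:-1}" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi; }

    run mkdir -p "$work" /mnt/test
    for i in 1 2 3 4 5; do
        run truncate -s 1G "$work/disk$i.img"          # sparse 1G image
        run losetup "/dev/loop$i" "$work/disk$i.img"
    done
    run mkfs.btrfs /dev/loop1                          # stands in for the original FS
    run mount /dev/loop1 /mnt/test
    run btrfs device add /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5 /mnt/test
    run btrfs device delete /dev/loop1 /mnt/test       # migrates data off loop1
    run btrfs balance start -dconvert=raid10 /mnt/test
    run btrfs filesystem df /mnt/test                  # check the resulting profiles
}
```

Running `rehearse` with no settings just lists the steps, which is a cheap way to review them before running anything for real. The snapshot layer is left out here to keep the test small.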
Note that since kernel 3.6, btrfs implements send/receive a bit like in zfs. That might be an option for you.
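A minimal sketch of the send/receive route, assuming the destination is a separately created and mounted btrfs filesystem. The function name btrfs_clone and the snapshot name clone-src are made up for illustration.

```shell
# Copy a subvolume to another btrfs filesystem with send/receive.
# $1 = source subvolume (mounted), $2 = directory on the destination btrfs,
# $3 = snapshot name (optional placeholder, defaults to clone-src)
btrfs_clone() {
    src=$1; dest=$2; snap=${3:-clone-src}
    # send only works on read-only snapshots, hence -r
    btrfs subvolume snapshot -r "$src" "$src/$snap" &&
        btrfs send "$src/$snap" | btrfs receive "$dest"
}
```

For example, `btrfs_clone /mnt /mnt2` (as root) would leave a copy at /mnt2/clone-src. Since the destination is its own filesystem, this should sidestep the duplicate fsuid problem, at the cost of copying everything through a pipe.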
Best Answer
I wrote bedup for this purpose. It combines incremental btree scanning with CoW-deduplication. Best used with Linux 3.6, where you can run: