As far as block-level deduplication goes, I think ZFS is the uncontested best implementation currently available. It really isn't designed for after-the-fact optimization, because its deduplication (if turned on) is built directly into the read/write path. Because of this, it can be fairly memory-hungry under load, since it tries to keep the most relevant portions of the deduplication table in memory, but ZFS is good at restricting itself to consuming not much more than 50% of RAM. Depending on how much memory is installed, that limit can seem quite arbitrary (50% of 2 GB vs 50% of 64 GB, especially on a machine with few if any user tasks that need memory).
Depending on what you're looking to use it for, you've got some options:
OpenIndiana, based on Solaris, appears to have some good desktop and server options.
FreeBSD (since 9.0) has a pretty advanced version of ZFS (which includes deduplication) built into it. One notable FreeBSD-derived (formerly MonoWall-derived) distribution is NAS4Free, which makes setting up a NAS pretty easy.
Linux has a few options, some with dedup, others without. Since you're looking for dedup, the most notable I've seen is zfsonlinux. I'm not sure what their progress is, or how stable their project is, but it definitely looks promising.
As to anything with partial block deduplication, I have seen NOTHING so far that reports an ability to do that.
Generally, no. It would be possible to design a filesystem that provides this kind of deduplication, but it would be very costly, for very little practical benefit, so I doubt that it's been done. The issue is that deduplication only looks at aligned extents.
Deduplicating filesystems generally work at a block level. When the filesystem driver is about to store a block, it calculates a checksum of the block content and looks up this checksum in a table. If the table says that no block with this checksum exists, the block is stored and the checksum is added to the table. If the checksum is present in the table, the driver checks whether any of the blocks with that checksum is identical to the block that's about to be stored; if there is one, a new reference to that existing block is created, and if none matches then the new block is stored as well.
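To make that concrete, here is a toy model of that write path in Python -- fixed-size blocks, a checksum-to-address table, and a full comparison to guard against hash collisions. The names and structure are purely illustrative, not taken from any real filesystem:

```python
import hashlib

BLOCK_SIZE = 4096

class DedupStore:
    def __init__(self):
        self.blocks = []        # "physical" block storage, indexed by address
        self.by_checksum = {}   # checksum -> list of addresses with that checksum

    def write_block(self, data: bytes) -> int:
        checksum = hashlib.sha256(data).digest()
        for addr in self.by_checksum.get(checksum, []):
            if self.blocks[addr] == data:   # identical block already stored:
                return addr                 # just reference it (deduplicated)
        addr = len(self.blocks)             # no identical block: store a new one
        self.blocks.append(data)
        self.by_checksum.setdefault(checksum, []).append(addr)
        return addr

    def write_file(self, content: bytes) -> list[int]:
        # Files are split into *aligned* blocks -- this is the key limitation
        # discussed below.
        return [self.write_block(content[i:i + BLOCK_SIZE])
                for i in range(0, len(content), BLOCK_SIZE)]
```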
As you can see, there's a cost to be paid whenever writing a block. But at least this cost is only paid once per write of a block. If file 1 contains aaaabbbbcccc, file 2 contains aabbbbcccc and the block size is 4, then the files do not contain any identical block, so no deduplication will take place. Detecting that file 2 is included in file 1 would require computing checksums for blocks at any alignment, at a prohibitive cost.
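A few lines of Python illustrate the example above (a toy demonstration, not how a real filesystem slices data): the two files overlap heavily, but none of their aligned blocks are identical, so nothing gets deduplicated.

```python
def blocks(data: bytes, size: int = 4):
    # Split into aligned, fixed-size blocks, the way a dedup filesystem would.
    return [data[i:i + size] for i in range(0, len(data), size)]

print(blocks(b"aaaabbbbcccc"))   # [b'aaaa', b'bbbb', b'cccc']
print(blocks(b"aabbbbcccc"))     # [b'aabb', b'bbcc', b'cc']
print(set(blocks(b"aaaabbbbcccc")) & set(blocks(b"aabbbbcccc")))   # set()
```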
In general, the blocks of a file inside a tar archive are not aligned with the blocks of the filesystem. A file in a tar archive can start at any offset that's a multiple of 512 (the tar block size), but most filesystems use a larger block size. If the start of a file inside the archive happens to be aligned with the start of a filesystem block, then that file can be deduplicated if the opportunity presents itself. Typical filesystem block sizes are larger than 512 bytes, but since they are a multiple of 512, alignment will occasionally happen by chance: about 1 in 8 files for 4096-byte blocks, assuming a uniform distribution of file sizes modulo 4096 (which isn't quite true, so the probability is in fact somewhat lower).
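If you want a rough empirical check of that "about 1 in 8" estimate, a short Python snippet like the following counts how many members of an existing archive happen to start on a 4096-byte boundary. The archive name is a placeholder, and TarInfo.offset_data is the attribute tarfile keeps for each member's data offset within the archive (always a multiple of 512):

```python
import tarfile

with tarfile.open("example.tar") as tf:
    files = [m for m in tf.getmembers() if m.isfile()]

aligned = sum(m.offset_data % 4096 == 0 for m in files)
print(f"{aligned} of {len(files)} members start on a 4096-byte boundary")
```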
The typical use case for deduplication is files that are identical or mostly identical: backup copies, old versions of a file, etc. Transformed files are not typical. Uncompressed archives are especially not typical.
Best Answer
It can be done, in theory. But it's very ugly and essentially involves constructing our archive by hand.
What we're up against
The tar format operates on 512-byte blocks. This size is fixed, and is intended to match the traditional disk sector size. When storing a file in an archive, the first 512-byte block is a header that contains file metadata (name, size, type, etc.), and the following blocks contain the file contents. So our archived data is going to be misaligned by 512 bytes.

The block size ("--sectorsize") of btrfs is typically 4096 bytes. In theory we can choose this, but in practice it looks like it has to match the page size of our CPU. So we can't shrink btrfs' blocks.
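You can see both constraints at once by peeking at the start of any uncompressed archive: the first 512 bytes are the first member's ustar header, so that member's data begins at offset 512 rather than on a 4096-byte boundary. A small Python sketch (demo.tar is a stand-in name; the field offsets are from the ustar header layout):

```python
with open("demo.tar", "rb") as f:
    header = f.read(512)                                 # first member's header

name = header[0:100].rstrip(b"\0").decode()              # name field
size = int(header[124:136].rstrip(b" \0") or b"0", 8)    # size field, octal
print(f"first member: {name}, {size} bytes of data starting at offset 512")
print("data aligned to 4096?", 512 % 4096 == 0)          # False
```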
The tar program has a concept of a larger "record" size, defined as a multiple of the block size, which almost looks like it would be useful. It turns out that this is meant to specify the sector size of a given tape drive, so that tar will avoid writing partial tape records. However, the data is still constructed and packed in units of 512 bytes, so we can't use this to grow tar's blocks as you were hoping.

A last point of data to know is that tar's end-of-archive marker is two consecutive all-zeroes blocks, except when those blocks are inside file data. So any sort of naive padding blocks are probably not going to be accepted.

The Hack
What we can do is insert padding files. At the beginning of our archive, before we add the file we want to deduplicate (call it dup), we add a file pad, sized so that pad's 512-byte header, pad's data, and dup's 512-byte header together add up to a multiple of 4096 bytes. That way, dup's data starts at a block boundary and can be deduplicated.

Then, for each subsequent file, we also have to keep track of the previous file's size in order to calculate the correct padding. We also have to predict whether some sort of header extension is going to be needed: for instance, the basic tar header only has room for 100 bytes of file path, so longer paths are encoded using what is structurally a specially named file whose data is the full path. In general there's a lot of potential complexity in predicting the header size -- the tar file format has a lot of cruft from multiple historical implementations.

A small silver lining is that all of the padding files can share the same name, so when we untar we'll only end up with a single extra file of less than 4096 bytes in size.
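To make the bookkeeping concrete, here is the padding arithmetic as a tiny Python helper, assuming a 4096-byte filesystem block and plain 512-byte headers with no extensions (the function name is made up for illustration):

```python
FS_BLOCK = 4096   # btrfs block size
TAR_BLOCK = 512   # tar header/data block size

def pad_size(offset: int) -> int:
    """How much data the pad file needs, given the archive's current length,
    so that the *next* member's data starts on a FS_BLOCK boundary:
    offset + 512 (pad header) + pad data + 512 (next header) must be a
    multiple of FS_BLOCK. offset is a multiple of 512, so the result is too."""
    return (-(offset + 2 * TAR_BLOCK)) % FS_BLOCK

print(pad_size(0))       # 3072: 512 + 3072 + 512 = 4096, so dup's data is aligned
print(pad_size(10752))   # 512:  10752 + 512 + 512 + 512 = 12288 = 3 * 4096
```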
The cleanest way to reliably create an archive like this is probably to modify the GNU tar program. But if you want to be quick and dirty at the expense of CPU and I/O time, you could, for each file, do something like the following.
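Here is one way to sketch the idea with Python's standard tarfile module -- a minimal illustration under the same assumptions as above (plain ustar members whose headers are a single 512-byte block, and a 4096-byte filesystem block). The archive name and input files are placeholders:

```python
import io
import os
import tarfile

FS_BLOCK = 4096
TAR_BLOCK = 512

def padded(n: int) -> int:
    """Size of n bytes of member data once stored in 512-byte tar blocks."""
    return -(-n // TAR_BLOCK) * TAR_BLOCK

# USTAR_FORMAT keeps every header to a single 512-byte block (over-long names
# raise an error instead of silently emitting extension headers, which would
# throw off the offset bookkeeping).
with tarfile.open("aligned.tar", "w", format=tarfile.USTAR_FORMAT) as tf:
    offset = 0                                   # where the next header will go
    for path in ["file1.bin", "file2.bin"]:      # placeholder inputs
        # Pad so that offset + 512 (pad header) + pad data + 512 (file header)
        # is a multiple of FS_BLOCK, i.e. the file's data starts aligned.
        pad = tarfile.TarInfo("pad")             # every pad reuses the same name
        pad.size = (-(offset + 2 * TAR_BLOCK)) % FS_BLOCK
        # Zero-filled data is fine here: it sits inside a declared member,
        # not as bare blocks, so it won't look like an end-of-archive marker.
        tf.addfile(pad, io.BytesIO(bytes(pad.size)))
        offset += TAR_BLOCK + padded(pad.size)

        tf.add(path)
        offset += TAR_BLOCK + padded(os.path.getsize(path))
```

This only takes care of alignment; the actual sharing still has to be established by whatever out-of-band dedup tool you run over the archive and the original files afterwards.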