Are tars deduplicable at the block level?

Tags: btrfs, deduplication, tar

Quite simply: when a tar file is written to disk, could its extents be deduplicated against extents inside and/or outside the tar? I am asking in the theoretical sense: if the data extents inside the tar are identical to the originals (no shifting or splitting of extents to compact them), then in theory they should match the corresponding extents outside the tar and so be deduplicable.

For example, if I were to tar a directory and then run block-level deduplication, would the effective size of the tar be only the additional headers, metadata and the end-of-archive marker?

Obviously I am talking about an uncompressed tar, specifically GNU tar. I have looked at the GNU tar format and, from what I have read, it does seem to preserve the original block data, but maybe I have misinterpreted it.
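(For reference, the per-member data offsets in an uncompressed tar can be inspected with Python's tarfile module. This is a minimal sketch that relies on the long-standing but undocumented offset/offset_data attributes of TarInfo, so treat it as illustrative rather than a stable API:)

```python
import sys
import tarfile

# Print where each member's header and data start inside an uncompressed tar.
# TarInfo.offset and TarInfo.offset_data are undocumented CPython attributes.
with tarfile.open(sys.argv[1], "r:") as tf:  # "r:" refuses compressed archives
    for member in tf:
        if member.isfile():
            print(f"{member.name}: header at {member.offset}, "
                  f"data at {member.offset_data} "
                  f"(512-aligned: {member.offset_data % 512 == 0})")
```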

Best Answer

Generally, no. It would be possible to design a filesystem that provides this kind of deduplication, but it would be very costly, for very little practical benefit, so I doubt that it's been done. The issue is that deduplication only looks at aligned extents.

Deduplicating filesystems generally work at the block level. When the filesystem driver is about to store a block, it calculates a checksum of the block's content and looks that checksum up in a table. If no block with that checksum exists, the block is stored and its checksum is added to the table. If the checksum is already present, the driver checks whether any existing block with that checksum is byte-for-byte identical to the one about to be stored; if so, a new reference to the existing block is created, and if not, the new block is stored as well.
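As an illustration of that lookup loop, here is a minimal in-memory Python sketch of a block store with a checksum table; the block size and the use of SHA-256 are arbitrary choices for the example, not what any particular filesystem does:

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative; real filesystems pick their own block size

def dedup_store(data: bytes, table: dict[bytes, bytes]) -> list[bytes]:
    """Store data block by block, returning the list of block checksums.

    `table` maps checksum -> block content, standing in for the
    filesystem's checksum index; identical blocks are stored only once.
    """
    refs = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        # A real filesystem would also compare block contents here,
        # to guard against a checksum collision.
        if digest not in table:
            table[digest] = block
        refs.append(digest)
    return refs
```

Feeding two identical files through dedup_store with the same table keeps only one copy of each block, which is the whole point of the checksum index.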

As you can see, there's a cost to be paid on every block write, but at least it is paid only once per write. If file 1 contains aaaabbbbcccc, file 2 contains aabbbbcccc and the block size is 4, then file 1's blocks are aaaa, bbbb, cccc while file 2's are aabb, bbcc, cc: no block is shared, so no deduplication takes place. Detecting that file 2's content appears inside file 1 would require computing checksums for blocks at every possible alignment, at a prohibitive cost.
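The misalignment effect is easy to reproduce; a toy sketch of the example above, with 4-byte blocks:

```python
def blocks(data: bytes, size: int) -> list[bytes]:
    """Split data into fixed-size, aligned blocks (last one may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

file1 = b"aaaabbbbcccc"
file2 = b"aabbbbcccc"

b1, b2 = blocks(file1, 4), blocks(file2, 4)
print(b1)                 # [b'aaaa', b'bbbb', b'cccc']
print(b2)                 # [b'aabb', b'bbcc', b'cc']
print(set(b1) & set(b2))  # set() -> nothing to deduplicate
```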

In general, the blocks of a file stored in a tar archive are not aligned with the blocks of the filesystem. A file inside a tar archive can start at any offset that is a multiple of 512 (the tar block size), but most filesystems use a larger block size. If the start of a file inside the archive happens to coincide with the start of a filesystem block, that file can be deduplicated when the opportunity presents itself. Since typical filesystem block sizes are a multiple of 512, this will occasionally happen: roughly 1 time in 8 for 4096-byte blocks, assuming a uniform distribution of file sizes modulo 4096 (which isn't quite true, so the probability is in fact somewhat lower).
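The 1-in-8 figure can be checked with a rough simulation of a plain ustar-style layout (one 512-byte header per member, data padded to a multiple of 512 bytes, no extended header records, sizes drawn uniformly at random), which is a simplification of what GNU tar actually writes:

```python
import random

TAR_BLOCK = 512
FS_BLOCK = 4096

def simulate(num_files: int, max_size: int = 1 << 20) -> float:
    """Lay files out as a plain tar would and return the fraction whose
    data start offset is aligned to the filesystem block size."""
    offset = 0
    aligned = 0
    for _ in range(num_files):
        size = random.randrange(1, max_size)
        data_start = offset + TAR_BLOCK              # data follows the header
        if data_start % FS_BLOCK == 0:
            aligned += 1
        padded = (size + TAR_BLOCK - 1) // TAR_BLOCK * TAR_BLOCK
        offset = data_start + padded                 # next header goes here
    return aligned / num_files

print(simulate(100_000))  # hovers around 1/8 = 0.125; real archives skew lower
```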

The typical use case for deduplication is files that are identical or mostly identical: backup copies, old versions of a file, etc. Transformed files are not typical. Uncompressed archives are especially not typical.