Are tars deduplicable at the block level?

Tags: btrfs, deduplication, tar

Quite simply: when a tar file is written to disk, could its extents be deduplicated against extents inside and/or outside the tar? I am asking in the theoretical sense: if the data extents inside the tar are identical to the originals (no shifting or splitting of extents to compact them), then in theory they should match the corresponding extents outside the tar and so be deduplicable.

For example, if I were to tar a directory and then run block-level deduplication, would the effective size of the tar be only the additional headers, metadata and the end-of-archive marker?

Obviously I am talking about an uncompressed tar, specifically GNU tar. I have looked at the GNU tar format and, from what I have read, it does seem to preserve the original block data, but maybe I have misinterpreted it.
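(For reference, the per-member data offsets in an uncompressed tar can be inspected with Python's tarfile module. This is a minimal sketch that relies on the long-standing but undocumented offset/offset_data attributes of TarInfo, so treat it as illustrative rather than a stable API:)

```python
import sys
import tarfile

# Print where each member's header and data start inside an uncompressed tar.
# TarInfo.offset and TarInfo.offset_data are undocumented CPython attributes.
with tarfile.open(sys.argv[1], "r:") as tf:  # "r:" refuses compressed archives
    for member in tf:
        if member.isfile():
            print(f"{member.name}: header at {member.offset}, "
                  f"data at {member.offset_data} "
                  f"(512-aligned: {member.offset_data % 512 == 0})")
```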

Best Answer

Generally, no. It would be possible to design a filesystem that provides this kind of deduplication, but it would be very costly, for very little practical benefit, so I doubt that it's been done. The issue is that deduplication only looks at aligned extents.

Deduplicating filesystems generally work at the block level. When the filesystem driver is about to store a block, it calculates a checksum of the block's content and looks that checksum up in a table. If no block with that checksum exists, the block is stored and its checksum is added to the table. If the checksum is already present, the driver checks whether any existing block with that checksum is byte-for-byte identical to the one about to be stored; if so, a new reference to the existing block is created, and if not, the new block is stored as well.
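As an illustration of that lookup loop, here is a minimal in-memory Python sketch of a block store with a checksum table; the block size and the use of SHA-256 are arbitrary choices for the example, not what any particular filesystem does:

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative; real filesystems pick their own block size

def dedup_store(data: bytes, table: dict[bytes, bytes]) -> list[bytes]:
    """Store data block by block, returning the list of block checksums.

    `table` maps checksum -> block content, standing in for the
    filesystem's checksum index; identical blocks are stored only once.
    """
    refs = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        # A real filesystem would also compare block contents here,
        # to guard against a checksum collision.
        if digest not in table:
            table[digest] = block
        refs.append(digest)
    return refs
```

Feeding two identical files through dedup_store with the same table keeps only one copy of each block, which is the whole point of the checksum index.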

As you can see, there's a cost to be paid on every block write, but at least it is paid only once per write. If file 1 contains aaaabbbbcccc, file 2 contains aabbbbcccc and the block size is 4, then file 1's blocks are aaaa, bbbb, cccc while file 2's are aabb, bbcc, cc: no block is shared, so no deduplication takes place. Detecting that file 2's content appears inside file 1 would require computing checksums for blocks at every possible alignment, at a prohibitive cost.
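The misalignment effect is easy to reproduce; a toy sketch of the example above, with 4-byte blocks:

```python
def blocks(data: bytes, size: int) -> list[bytes]:
    """Split data into fixed-size, aligned blocks (last one may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

file1 = b"aaaabbbbcccc"
file2 = b"aabbbbcccc"

b1, b2 = blocks(file1, 4), blocks(file2, 4)
print(b1)                 # [b'aaaa', b'bbbb', b'cccc']
print(b2)                 # [b'aabb', b'bbcc', b'cc']
print(set(b1) & set(b2))  # set() -> nothing to deduplicate
```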

In general, the blocks of a file stored in a tar archive are not aligned with the blocks of the filesystem. A file inside a tar archive can start at any offset that is a multiple of 512 (the tar block size), but most filesystems use a larger block size. If the start of a file inside the archive happens to coincide with the start of a filesystem block, that file can be deduplicated when the opportunity presents itself. Since typical filesystem block sizes are a multiple of 512, this will occasionally happen: roughly 1 time in 8 for 4096-byte blocks, assuming a uniform distribution of file sizes modulo 4096 (which isn't quite true, so the probability is in fact somewhat lower).
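The 1-in-8 figure can be checked with a rough simulation of a plain ustar-style layout (one 512-byte header per member, data padded to a multiple of 512 bytes, no extended header records, sizes drawn uniformly at random), which is a simplification of what GNU tar actually writes:

```python
import random

TAR_BLOCK = 512
FS_BLOCK = 4096

def simulate(num_files: int, max_size: int = 1 << 20) -> float:
    """Lay files out as a plain tar would and return the fraction whose
    data start offset is aligned to the filesystem block size."""
    offset = 0
    aligned = 0
    for _ in range(num_files):
        size = random.randrange(1, max_size)
        data_start = offset + TAR_BLOCK              # data follows the header
        if data_start % FS_BLOCK == 0:
            aligned += 1
        padded = (size + TAR_BLOCK - 1) // TAR_BLOCK * TAR_BLOCK
        offset = data_start + padded                 # next header goes here
    return aligned / num_files

print(simulate(100_000))  # hovers around 1/8 = 0.125; real archives skew lower
```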

The typical use case for deduplication is files that are identical or mostly identical: backup copies, old versions of a file, etc. Transformed files are not typical. Uncompressed archives are especially not typical.