Make a tar (or other) archive with data block-aligned as in the original files, for better block-level deduplication

archive btrfs deduplication

How can one generate a tar file in which the contents of the archived files are block-aligned as in the original files, so that one can benefit from block-level deduplication ( https://unix.stackexchange.com/a/208847/9689 )?

(Am I correct that there is nothing intrinsic to the tar format that prevents us from getting such a benefit? If tar cannot do it, is there maybe another archiver that has such a feature built in?)

P.S. I mean an uncompressed tar (not tar+gz or similar); the question asks for some trick that aligns the files at block level.
As far as I recall, tar was designed for use with tape machines, so maybe adding some extra bits for alignment is possible and easy within the file format?
I hope there might even be a tool for it ;). As far as I recall, tar files can be concatenated, so maybe there is a trick for filling space to get alignment.

Best Answer

It can be done, in theory. But it's very ugly and essentially involves constructing our archive by hand.

What we're up against

The tar format operates on 512-byte blocks. This size is fixed, and is intended to match the traditional disk sector size. When storing a file in an archive, the first 512-byte block is a header that contains file metadata (name, size, type, etc.), and the following blocks contain the file contents. So our archived data is going to be misaligned by 512 bytes.
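To make the misalignment concrete, here is a small sketch of the layout arithmetic for a single archive member. It assumes a plain one-block header with no extensions, and the function names are made up for illustration.

```shell
# Layout arithmetic for one tar member, assuming a single 512-byte header
# (no long-name or other extension headers).
TAR_BLOCK=512

# Bytes the file's contents occupy in the archive: size rounded up to 512.
tar_data_bytes() {
    echo $(( ($1 + TAR_BLOCK - 1) / TAR_BLOCK * TAR_BLOCK ))
}

# Offset of the file's contents if its header starts at offset $1.
tar_data_offset() {
    echo $(( $1 + TAR_BLOCK ))
}

tar_data_bytes 1000     # prints 1024
tar_data_offset 0       # prints 512, which is not a multiple of 4096
```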

The block size ("--sectorsize") of btrfs is typically 4096 bytes. In theory we can choose this, but in practice it looks like it has to match the page size of our CPU. So we can't shrink btrfs' blocks.
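If you want to see which page size your own machine uses, something like the following should do. It only reports the CPU page size; it does not inspect the filesystem itself.

```shell
# Print the CPU page size; on most x86-64 systems this is 4096.
page_size=$(getconf PAGESIZE)
echo "page size: $page_size bytes"
```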

The tar program has a concept of a larger "record" size, defined as a multiple of the block size, which almost looks like it would be useful. It turns out that this is meant to specify the sector size of a given tape drive, so that tar will avoid writing partial tape records. However, the data is still constructed and packed in units of 512 bytes, so we can't use this to grow tar's blocks as you were hoping.
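A quick way to see what the record size does (and does not do) is to archive the same small file with two blocking factors. This is only a sketch; the sample file and temporary directory are throwaways created for the demonstration.

```shell
# Compare archive sizes for two blocking factors (-b N means N * 512 bytes
# per record). The sample file is a throwaway created just for this demo.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/sample"

size_b1=$(tar -c -b 1 -f - -C "$tmp" sample | wc -c)
size_b20=$(tar -c -b 20 -f - -C "$tmp" sample | wc -c)

echo "record = 1 block:   $((size_b1)) bytes"
echo "record = 20 blocks: $((size_b20)) bytes"
rm -rf "$tmp"
```

With GNU tar the two sizes typically differ only in trailing zero padding: the total is rounded up to a whole record, while the header and data still sit at the same 512-byte offsets.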

One last thing to know is that tar's end-of-archive marker is two consecutive all-zero blocks, except when those blocks are inside file data. So any sort of naive zero padding between members is probably going to be read as the end of the archive.
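The marker is easy to observe: the last 1024 bytes of a freshly written archive should contain no non-zero bytes. A minimal check, again using a throwaway sample file:

```shell
# Count non-zero bytes in the last 1024 bytes of a freshly made archive.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/sample"

nonzero=$(tar -c -b 1 -f - -C "$tmp" sample | tail -c 1024 | tr -d '\000' | wc -c)
echo "non-zero bytes in the trailing 1024: $((nonzero))"
rm -rf "$tmp"
```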

The Hack

What we can do is insert padding files. At the beginning of our archive, before we add the file we want to deduplicate (call it dup), we add a file pad, sized so that

pad's header + pad's data + dup's header = 4096 bytes.

That way, dup's data starts at a block boundary and can be deduplicated.
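As a worked example of that equation, here is the arithmetic as a tiny function. It assumes pad's own header is a single 512-byte block and that pad starts at a 4096-byte boundary (for example, at the very beginning of the archive); the function name is hypothetical.

```shell
# Size of pad's data so that pad header + pad data + dup header ends on a
# 4096-byte boundary. $1 is dup's header size in bytes (a multiple of 512).
pad_data_bytes() {
    echo $(( (4096 - 512 - $1 % 4096 + 4096) % 4096 ))
}

pad_data_bytes 512      # prints 3072: 512 + 3072 + 512  = 4096
pad_data_bytes 1536     # prints 2048: 512 + 2048 + 1536 = 4096
```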

Then, for each subsequent file, we also have to keep track of the previous file's size in order to calculate the correct padding. We also have to predict whether some sort of header extension is going to be needed: for instance, the basic tar header only has room for 100 bytes of file path, so longer paths are encoded using what is structurally a specially named file whose data is the full path. In general there's a lot of potential complexity in predicting the header size -- the tar file format has a lot of cruft from multiple historical implementations.
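Rather than predicting the header size, one can measure it by archiving the file on its own and subtracting the data blocks and the end-of-archive marker. The sketch below does that for a short and a 150-character file name; `header_overhead` is a made-up helper, and the exact numbers depend on your tar implementation and its default format.

```shell
# Measure per-member header overhead (everything that is not file data or
# the end-of-archive marker) for a short and a long file name.
header_overhead() {   # $1 = directory, $2 = file name inside it
    size=$(wc -c < "$1/$2")
    data=$(( (size + 511) / 512 * 512 ))
    total=$(tar -c -b 1 -f - -C "$1" "$2" | wc -c)
    echo $(( total - data - 1024 ))
}

tmp=$(mktemp -d)
long_name=$(head -c 150 /dev/zero | tr '\000' x)   # 150 'x' characters
printf 'hello\n' > "$tmp/short"
printf 'hello\n' > "$tmp/$long_name"

echo "short name: $(header_overhead "$tmp" short) bytes"
echo "long name:  $(header_overhead "$tmp" "$long_name") bytes"
rm -rf "$tmp"
```

On a typical GNU tar setup I would expect the long name to cost extra blocks beyond the basic 512-byte header, but treat that as something to measure, not assume.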

A small silver lining is that all of the padding files can share the same name, so when we untar we'll only end up with a single extra file of less than 4096 bytes in size.
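If even that one extra file is unwanted, it can be skipped at extraction time. A minimal sketch, assuming the padding member is named `_PADDING_` and that your tar supports `--exclude` (GNU tar and bsdtar both do); the function name is hypothetical.

```shell
# Extract an archive while skipping the padding member.
extract_without_padding() {   # $1 = archive, $2 = destination directory
    mkdir -p "$2" && tar -x -f "$1" --exclude=_PADDING_ -C "$2"
}
```

Usage would look like `extract_without_padding backup.tar ./restored`, where both names are placeholders.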

The cleanest way to reliably create an archive like this is probably to modify the GNU tar program. But if you want to be quick and dirty at the expense of CPU and I/O time, you could, for each file, do something like:

#!/bin/bash

# Proof of concept and probably buggy.
# If I ever find this script in a production environment,
# I don't know whether I'll laugh or cry.

my_file="$2"
my_archive="$1"

file_size="$(wc -c <"$my_file")"
# Space the file's data occupies in the archive: size rounded up to whole 512-byte blocks.
file_blocks_size="$(( ((file_size+511) / 512) * 512 ))"
# "-b 1": Remember that record size I mentioned?  Set it to equal the block size so we can measure usefully.
# "-f -": Write to stdout explicitly instead of relying on tar's compiled-in default archive.
arch_size="$(tar -c -b 1 -f - "$my_file" | wc -c)"
end_marker_size=1024  # End-of-archive marker: 2 blocks' worth of 0 bytes

# Header size = whatever is neither file data blocks nor end-of-archive marker.
# (Subtract the rounded-up block size, not the raw file size, or the file's
# own trailing padding gets counted as header.)
hdr_size="$(( (arch_size - file_blocks_size - end_marker_size) % 4096 ))"
pad_size="$(( (4096 - 512 - hdr_size) % 4096 ))"
(( pad_size < 512 )) && pad_size="$(( pad_size + 4096 ))"

# Assume the pre-existing archive is already a multiple of 4096 bytes long
# (not including the end-of-archive marker), and add extra padding to the end
# so that it stays that way.
end_pad_size="$(( 4096 - 512 - (file_blocks_size % 4096) ))"
(( end_pad_size < 512 )) && end_pad_size="$(( end_pad_size + 4096 ))"

head -c "$pad_size" /dev/zero > _PADDING_
tar rf "$my_archive" _PADDING_ "$my_file"
head -c "$end_pad_size" /dev/zero > _PADDING_
tar rf "$my_archive" _PADDING_
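To check whether an archive built this way actually has a file's data on a 4096-byte boundary, one can brute-force it: compare the original file against the archive at every aligned offset. This is a slow sketch for spot checks, not something tar provides, and `find_aligned_copy` is a made-up name. Note that an empty or all-zero file will match padding and give a false positive.

```shell
# Search an archive for a file's contents at 4096-byte-aligned offsets.
# Prints the first matching offset, or returns non-zero if there is none.
find_aligned_copy() {   # $1 = archive, $2 = original file
    size=$(( $(wc -c < "$2") ))
    total=$(( $(wc -c < "$1") ))
    blk=0
    while [ $(( blk * 4096 + size )) -le "$total" ]; do
        # Read from the aligned offset, keep only $size bytes, compare.
        if dd if="$1" bs=4096 skip="$blk" 2>/dev/null \
                | head -c "$size" | cmp -s - "$2"; then
            echo $(( blk * 4096 ))
            return 0
        fi
        blk=$(( blk + 1 ))
    done
    return 1
}
```

For example, `find_aligned_copy archive.tar some_file` (placeholder names) prints the aligned byte offset on success and exits non-zero if the data is not aligned anywhere.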