Is it possible to get file list, byte range from the head of a tar+bzip2 file

bzip2, large files, tar

Essentially, I'm hoping someone with an advanced knowledge of tar/bz2 can answer whether this is possible.

The situation is that we have a periodic 24GB feed of data from a vendor, as a .tbz file (tar+bzip2). This is downloaded from the vendor via curl. To dramatically speed up this slow process, I'd like to obtain:

A list of the files contained in the .tbz file.
The byte ranges of the specific files that we care about (a small subset of the whole archive).

Curl has the ability to specify a byte range when downloading a file, so my hope is that if we download the first x bytes of the file, they might contain an index telling us where to seek for the relevant files. From what I understand, tar itself has this information, but I'm not sure whether the bzip2 compression still allows for this.

Best Answer

No.

Tar is a concatenation of file data, interleaved with file metadata (tar headers). That alone wouldn't necessarily be a dead end, since one could read a header, find out the data length and (if the server allowed for that) skip to the next header, e.g. via the same range-request functionality that allows resuming HTTP transfers.
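The header-hopping idea can be sketched for an uncompressed tar as follows. This is only a minimal sketch in Python, assuming plain ustar headers (no pax or GNU long-name extension records, sizes stored as octal text); `walk_tar_headers` and `build_demo_tar` are hypothetical names made up for the illustration. Against a remote server, each seek/read pair would become one HTTP range request.

```python
import io
import tarfile

BLOCK = 512  # tar headers and file data are padded to 512-byte blocks


def walk_tar_headers(f):
    """Yield (name, data_offset, size) by reading each header and skipping the data."""
    offset = 0
    while True:
        f.seek(offset)
        header = f.read(BLOCK)
        # An all-zero block marks the end of the archive.
        if len(header) < BLOCK or header == bytes(BLOCK):
            return
        name = header[:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        # Assumption: the size field holds octal ASCII digits (plain ustar).
        size = int(header[124:136].rstrip(b"\0 ").decode("ascii") or "0", 8)
        yield name, offset + BLOCK, size
        # Jump over the data, rounded up to a whole number of blocks.
        offset += BLOCK + -(-size // BLOCK) * BLOCK


def build_demo_tar():
    """Build a small in-memory ustar archive to try the walker on."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, payload in [("a.txt", b"hello"), ("b.bin", b"x" * 1000)]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


entries = list(walk_tar_headers(build_demo_tar()))
for name, data_offset, size in entries:
    print(name, data_offset, size)
```

Only the 512-byte headers are read here; the file data in between is never touched, which is exactly what stops working once the stream is compressed.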

What really makes this difficult is the compression: decoding any part of the compressed data usually depends on the data that precedes it. bzip2 works on blocks of 100kB to 900kB of uncompressed data (in 100kB steps, IIUC). Thus your algorithm would have to:

1. get the beginning of the file;
2. read the decompressed chunk length L from the header;
3. decompress the block, which means downloading data as needed until the end of the bz2 block is reached;
4. check the tar header and get the lengths H of the first file's tar header and D of its data;
5. skip to the next file: either it is in the decoded block (H + D < L) or additional compressed data has to be fetched (H + D > L). And this is exactly where it breaks: if I understand the bzip2 format correctly, the header doesn't contain the compressed block length (only the uncompressed one). Hence if you need to fetch another block, you can't really seek in the stream even if the underlying medium allowed you to.
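What remains possible is listing members from a prefix of the download, at the cost of fetching everything that precedes each header. The sketch below illustrates that with Python's `bz2.BZ2Decompressor`, assuming plain ustar headers; `list_members_streaming`, `build_demo_tbz` and `chunked` are hypothetical helpers, and the chunks stand in for successive byte ranges fetched with curl.

```python
import bz2
import io
import os
import tarfile

BLOCK = 512  # tar block size


def parse_header(header):
    """Return (name, size) from a ustar header (assumes octal size text)."""
    name = header[:100].split(b"\0", 1)[0].decode("utf-8", "replace")
    size = int(header[124:136].rstrip(b"\0 ").decode("ascii") or "0", 8)
    return name, size


def list_members_streaming(chunks):
    """Yield (name, size, data_offset, compressed_bytes_fed) per tar member."""
    decomp = bz2.BZ2Decompressor()
    buf = bytearray()  # decompressed bytes not yet consumed
    pos = 0            # uncompressed offset of buf[0]
    skip = 0           # file data and padding still to discard
    fed = 0            # compressed bytes handed to the decompressor so far
    for chunk in chunks:
        if decomp.eof:
            return
        fed += len(chunk)
        buf += decomp.decompress(chunk)
        while True:
            if skip:
                n = min(skip, len(buf))
                del buf[:n]
                pos += n
                skip -= n
            if skip or len(buf) < BLOCK:
                break  # need more compressed input
            header = bytes(buf[:BLOCK])
            if header == bytes(BLOCK):
                return  # end-of-archive marker
            name, size = parse_header(header)
            yield name, size, pos + BLOCK, fed
            del buf[:BLOCK]
            pos += BLOCK
            skip = -(-size // BLOCK) * BLOCK


def build_demo_tbz():
    """Small demo archive: incompressible payloads, 100kB bzip2 blocks."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in ("one.dat", "two.dat", "three.dat"):
            payload = os.urandom(200_000)
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return bz2.compress(raw.getvalue(), compresslevel=1)


def chunked(data, n):
    for i in range(0, len(data), n):
        yield data[i:i + n]


archive = build_demo_tbz()
members = list(list_members_streaming(chunked(archive, 64 * 1024)))
for name, size, data_offset, fed in members:
    print(f"{name}: {size} bytes at uncompressed offset {data_offset}, "
          f"visible after {fed} of {len(archive)} compressed bytes")
```

The last column grows with each member: a header only becomes visible once all the compressed data in front of it has been fed in, so the file list can be obtained early only for files near the start of the archive.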

Summary: if you can negotiate a change of format to something that contains the compressed block size in its header, it is solvable. On the other hand, a single 24GB compressed tar file is a rather insane format for distributing anything. That is a whole single-layer BD, and I don't think a reasonable person would compress the contents of a disc into one file instead of splitting it into parts of at most 1-2 GB each. So if negotiation is possible, try asking for that (splitting into smaller pieces).

Another thing that could help you a little bit would be getting the list of files together with their sizes separately. That would allow you to make at least some guesses about what to download (and you could always fetch the neighbouring blocks if needed). Such a list can be produced easily when the archive is created: while tar writes the archive to stdout, its verbose listing goes to stderr, so it is enough to redirect stderr to a file:

tar cvvf - all_the_uncompressed_gigabytes 2>list.txt | bzip2 -9 > data.tar.bz2
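Given such a listing, the position of each file in the uncompressed tar stream can be estimated. The sketch below assumes the GNU tar `-vv` line layout (permissions, owner/group, size, date, time, name) and exactly one 512-byte header per entry (no long-name or pax extension records); `estimate_offsets` and the sample lines are made up for the illustration. The results are offsets in the uncompressed stream, so the matching position in the .tbz can only be guessed from them, e.g. by assuming a roughly uniform compression ratio.

```python
BLOCK = 512  # tar block size


def estimate_offsets(listing_lines):
    """Map each regular file to its (start, end) data offsets in the tar stream."""
    offset = 0
    ranges = {}
    for line in listing_lines:
        parts = line.rstrip("\n").split(None, 5)
        if len(parts) < 6:
            continue  # not a listing line in the assumed layout
        perms, _owner, size_field, _date, _time, name = parts
        is_regular = perms.startswith("-")
        # Only regular files carry data; directories, links etc. are header-only.
        size = int(size_field) if is_regular else 0
        data_start = offset + BLOCK
        if is_regular:
            ranges[name] = (data_start, data_start + size)
        offset = data_start + -(-size // BLOCK) * BLOCK
    return ranges, offset


sample = [
    "drwxr-xr-x alice/staff 0 2020-01-01 12:00 data/",
    "-rw-r--r-- alice/staff 5 2020-01-01 12:00 data/a.txt",
    "-rw-r--r-- alice/staff 1000 2020-01-01 12:00 data/b.bin",
]
ranges, total = estimate_offsets(sample)
for name, (start, end) in ranges.items():
    print(f"{name}: bytes {start}-{end} of about {total} uncompressed")
```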

Related Solutions

How to filter the contents of a tar file, producing another tar file in the pipe

bsdtar (based on libarchive) can filter tar (and some other archives) from stdin to stdout. It can for example pass through only filenames matching a pattern, and can do s/old/new/ renaming. It's already packaged for most distros, for example as bsdtar in Ubuntu.

sudo apt-get install bsdtar   # or aptitude, if you have it; on newer Debian/Ubuntu releases the package is libarchive-tools

# example from the man page:
bsdtar -c -f new.tar --include='*foo*' @old.tgz
# creates new.tar containing only the entries from old.tgz whose names contain the string 'foo'
bsdtar -czf - --include='*foo*' @-  # filter stdin to stdout, with gzip compression of the output

Note that bsdtar has a wide choice of compression formats for input/output, so you don't have to manually pipe through gunzip / lz4 yourself. You can use - for stdin with the @tarfile syntax, and/or - for stdout like normal.
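If bsdtar is not available, the same kind of stream-to-stream filtering can be approximated with Python's standard `tarfile` module in its non-seekable stream modes (`r|*` for reading, `w|gz` for writing). This is a different technique from bsdtar's `@archive` syntax, shown only as a sketch; `filter_tar` is a hypothetical helper name and the in-memory archive stands in for stdin/stdout.

```python
import fnmatch
import io
import tarfile


def filter_tar(src, dst, pattern):
    """Copy members whose names match `pattern` from src to dst (gzip output)."""
    with tarfile.open(fileobj=src, mode="r|*") as tin, \
            tarfile.open(fileobj=dst, mode="w|gz") as tout:
        for member in tin:
            if not fnmatch.fnmatch(member.name, pattern):
                continue
            # In stream mode the data must be read before moving to the next member.
            data = tin.extractfile(member) if member.isfile() else None
            tout.addfile(member, data)


# Build a small gzip-compressed tar in memory to filter.
old = io.BytesIO()
with tarfile.open(fileobj=old, mode="w:gz") as tar:
    for name in ("foo.txt", "bar.txt", "dir/foobar"):
        payload = name.encode()
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
old.seek(0)

new = io.BytesIO()
filter_tar(old, new, "*foo*")

with tarfile.open(fileobj=io.BytesIO(new.getvalue()), mode="r:gz") as tar:
    kept = tar.getnames()
print(kept)
```

To use it in a real pipe, `sys.stdin.buffer` and `sys.stdout.buffer` could be passed as `src` and `dst`.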

My searching also found tar-stream, a streaming tar tool which appears to want you to define the archive changes you want in JavaScript. (I think the whole thing is written in JS.)

https://github.com/mafintosh/tar-stream

Shell – Adding file to tbz files

tar can add files to an already existing archive, but only if that archive is not compressed. You will have to bunzip2 the compressed archive, which leaves a standard tarball. You can then use tar's ability to append files to an existing archive, and finally recompress with bzip2.

From the manual:

 -r      Like -c, but new entries are appended to the archive.  Note that this only
         works on uncompressed archives stored in regular files.  The -f option is
         required.
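The same decompress, append, recompress sequence can be sketched with Python's standard library, where `tarfile`'s append mode `"a"` is likewise limited to uncompressed archives. This is only an in-memory illustration; `append_to_tbz` and the demo file names are hypothetical, and a multi-gigabyte archive would need temporary files instead of byte strings.

```python
import bz2
import io
import tarfile


def append_to_tbz(tbz_bytes, name, payload):
    """Return a new .tbz (as bytes) with one extra member appended."""
    raw = io.BytesIO(bz2.decompress(tbz_bytes))      # step 1: bunzip2
    with tarfile.open(fileobj=raw, mode="a") as tar:  # step 2: append (uncompressed only)
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return bz2.compress(raw.getvalue())               # step 3: recompress


# Build a one-member .tbz in memory, then append a second member.
original = io.BytesIO()
with tarfile.open(fileobj=original, mode="w:bz2") as tar:
    info = tarfile.TarInfo("a.txt")
    info.size = 5
    tar.addfile(info, io.BytesIO(b"hello"))

updated = append_to_tbz(original.getvalue(), "new.txt", b"added")

with tarfile.open(fileobj=io.BytesIO(updated), mode="r:bz2") as tar:
    names = tar.getnames()
    added = tar.extractfile("new.txt").read()
print(names, added)
```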
