Is it possible to get the file list and byte ranges from the head of a tar+bzip2 file?

bzip2, large files, tar

Essentially, I'm hoping someone with an advanced knowledge of tar/bz2 can answer whether this is possible.

The situation is that we have a periodic 24GB feed of data from a vendor, as a .tbz file (tar+bzip2). This is downloaded from the vendor via curl. To dramatically speed up this slow process, I'd like to obtain:

  • A list of files contained in the .tbz file
  • The byte ranges of the specific files that we care about (a small subset of the whole archive).

Curl has the ability to specify a byte range when downloading a file, so my hope is that if we download the first x bytes of the file, it might contain an index telling us where to seek for the relevant files. From what I understand, tar itself has this information, but I'm not sure whether the bzip2 compression still allows access to it.

Best Answer

No.

Tar is a concatenation of file data, interleaved with file metadata (tar headers). That alone wouldn't necessarily be a dead end, since one could read a header, find out the data length and (if the server allowed it) skip to the next header, e.g. via the same functionality that allows resuming HTTP transfers.
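To illustrate the header-hopping idea on an uncompressed tar, here is a minimal sketch. The function name `walk_tar_headers` is made up for this example, and it assumes plain ustar headers only (no GNU long names, pax extended headers, or base-256 sizes).

```python
import io
import tarfile

BLOCK = 512  # tar is organised in 512-byte blocks


def walk_tar_headers(raw: bytes):
    """Yield (name, data_offset, size) for each member of an uncompressed tar.

    Simplified on purpose: plain ustar headers only.
    """
    pos = 0
    while pos + BLOCK <= len(raw):
        header = raw[pos:pos + BLOCK]
        if header == bytes(BLOCK):  # an all-zero block marks the end of the archive
            break
        name = header[0:100].rstrip(b"\0").decode("utf-8")
        # The size field is an octal string at bytes 124..135 of the header.
        size = int(header[124:136].rstrip(b"\0 ") or b"0", 8)
        yield name, pos + BLOCK, size
        # Member data is padded up to a multiple of 512 bytes.
        pos += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK


# Build a tiny archive in memory to try it on.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for member_name, payload in [("a.txt", b"hello"), ("b.bin", b"x" * 1000)]:
        info = tarfile.TarInfo(member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

members = list(walk_tar_headers(buf.getvalue()))
for name, offset, size in members:
    print(f"{name}: data at byte {offset}, {size} bytes")
```

Each header tells you where the next one starts, so on a plain tar served with range support you would only need to fetch 512 bytes per member to build the file list.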

What really makes this difficult is the compression: with most compressors, decoding a given byte depends on the data preceding it, and thus on everything that precedes it. bzip2 limits this to a block, where a block holds 100kB to 900kB of uncompressed data (in 100kB steps, IIUC). Thus your algorithm would have to:

  1. get the beginning of file;

  2. read decompressed chunk length L from the header;

  3. decompress the block - that means download the data as needed, until end of the bz2 block is reached;

  4. read the first file's tar header to get the header length H and the data length D;

  5. skip to the next file: either it is in the decoded block (H + D < L) or additional compressed data has to be fetched (H + D > L). And this is exactly where it breaks: if I understand the bzip2 format correctly, the header doesn't contain the compressed block length (only the uncompressed one). Hence if you need to fetch another block, you can't really seek in the stream, even if the underlying medium allowed you to.
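The steps above can be tried out on synthetic data. The sketch below uses Python's standard `bz2` module; the helper names `describe_bzip2_start` and `decompress_prefix` are invented for this example. It shows what the start of a stream does tell you (the block-size digit and the block magic) and that a truncated download can still be decoded from the beginning up to the last block that was fully received.

```python
import bz2
import os


def describe_bzip2_start(raw: bytes):
    """Return (nominal_block_size, has_block_magic) from the first 10 bytes."""
    if raw[:3] != b"BZh" or not b"1" <= raw[3:4] <= b"9":
        raise ValueError("not a bzip2 stream")
    level = raw[3] - ord("0")  # '1'..'9' means 100 kB .. 900 kB blocks
    # The first block follows the 4-byte stream header and starts with a
    # 48-bit magic number. A CRC comes next, but no compressed length.
    has_magic = raw[4:10] == bytes.fromhex("314159265359")
    return level * 100_000, has_magic


def decompress_prefix(prefix: bytes) -> bytes:
    """Decode as much as possible from the start of a truncated download."""
    return bz2.BZ2Decompressor().decompress(prefix)


# Incompressible stand-in data, compressed with small blocks so that the
# stream contains several of them.
payload = os.urandom(400_000)
compressed = bz2.compress(payload, compresslevel=1)

block_size, has_magic = describe_bzip2_start(compressed)
partial = decompress_prefix(compressed[: len(compressed) // 2])

print(f"nominal block size: {block_size} bytes, block magic found: {has_magic}")
print(f"decoded {len(partial)} of {len(payload)} bytes from half the download")
```

This only works from the front of the stream. Since the headers carry no compressed length, jumping straight to a later block means guessing an offset and searching for the next block start.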

Summary: if you can negotiate a change of format to something that contains the compressed block size in its header, it is solvable. On the other hand, a single 24GB compressed tar file is a rather insane format for distributing anything. That is a whole single-layer Blu-ray disc, and I don't think a reasonable person would compress the contents of a disc into a single file instead of splitting it into parts of at most 1-2GB. So if negotiation is possible, try asking about that (splitting into smaller pieces).

Another thing that could help you a little bit would be getting the list of files together with file sizes separately. That would allow you to make at least some guesses about what to download (and you could always get the surrounding blocks if needed). Such a list can be produced easily, just by redirecting tar's verbose listing (which goes to stderr when the archive itself is written to stdout) to a file:

tar cvvf - all_the_uncompressed_gigabytes 2>list.txt | bzip2 -9 > data.tar.bz2
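With such a list in hand, the guessing could look roughly like the sketch below. Both function names are hypothetical. It assumes one 512-byte header per member (no extra long-name or pax header blocks) and, more crudely, a uniform compression ratio over the whole archive, which real data will not have, hence the generous safety margin.

```python
BLOCK = 512  # tar block size


def uncompressed_ranges(entries):
    """Map name -> (start, end) offsets of each member's data in the tar stream.

    `entries` is a list of (name, size) pairs in archive order.
    Assumes exactly one 512-byte header block per member.
    """
    ranges, pos = {}, 0
    for name, size in entries:
        start = pos + BLOCK
        ranges[name] = (start, start + size)
        # Skip the data, padded up to a multiple of 512 bytes.
        pos = start + (size + BLOCK - 1) // BLOCK * BLOCK
    return ranges


def guess_compressed_range(start, end, total_uncompressed, total_compressed,
                           margin=1_000_000):
    """Scale offsets by the overall ratio and pad both sides by `margin` bytes."""
    ratio = total_compressed / total_uncompressed
    lo = max(0, int(start * ratio) - margin)
    hi = min(total_compressed, int(end * ratio) + margin)
    return lo, hi


ranges = uncompressed_ranges([("a.txt", 5), ("b.bin", 1000)])
guess = guess_compressed_range(*ranges["b.bin"], total_uncompressed=10_000,
                               total_compressed=5_000, margin=100)
print(ranges)
print(guess)
```

A range fetched this way starts in the middle of a block, so it still has to be searched for the next block start before anything can be decompressed. As far as I know, that kind of scan for the block magic is what the bzip2recover tool shipped with bzip2 does.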