Ssh – Extract a compressed file by its header as it is piped from stdout (locally or from a remote location)

compression, download, pipe, ssh

I am sending a compressed file through a pipe, either locally or from a network location. On the receiving end, I would like to detect the type of compression and use the appropriate decompression utility (gzip, bzip2, xz, etc.) to extract it. The commands look as follows:

Local:

cat misteryCompressedFile | [compressionUtility] -d -fc > /opt/files/uncompressedfile

Over network:

ssh user@ipaddr "cat misteryCompressedFile" | [compressionUtility] -d -fc > /opt/files/uncompressedfile

One can tell the type of compression used, even when there is no extension (e.g., .gz or .bz2), by looking at the first few bytes of the file. For example, if I use xxd to look at the first few bytes of two compressed files, I see 1f8b 0808 for gzip and 425a 6836 for bzip2.

However, while still using a pipe, how can I check the first incoming bytes to select the proper decompression utility for the rest of the file?

So if the unknown compressed file is gzip, the command will be this:

cat misteryCompressedFile | gzip -d -fc > /opt/files/uncompressedfile

and if the unknown compressed file is bzip2, the command will be this:

cat misteryCompressedFile | bzip2 -d -fc > /opt/files/uncompressedfile

Is it possible to make this decision on the fly while piping, without having to download the entire file first and then decide what to use for decompression?

Best Answer

Yes, you can do that in the pipeline, without having to read the whole file.

This first script fragment illustrates the mechanism by which we will intercept and inspect the header and pass it on. Notice that we print the header to stderr (>&2), yet it continues to appear in the output:

$ echo 0123456789ABCDEF |
(
    HEADER=$(dd bs=1 count=4);
    printf 'HEADER:%s\n' "$HEADER" >&2;
    printf '%s\n' "$HEADER";
    cat 
)
4+0 records in
4+0 records out
4 bytes (4 B) copied, 8.4293e-05 s, 47.5 kB/s
HEADER:0123
0123456789ABCDEF
$

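The key is using dd, the file conversion utility, with a block size of one byte (bs=1), so that it reads exactly the four header bytes from the pipe and leaves the rest of the stream for the next command.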
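Since real compression headers are binary, it helps to view them as hex rather than as text. The following is a minimal sketch of that idea, not part of the original answer; the function name split_header is made up for illustration. It prints the first 4 bytes of stdin as hex on one line and then passes the remaining bytes through untouched:

```shell
#!/bin/sh
# Hypothetical helper: show the first 4 bytes of stdin as hex,
# then copy whatever is left on the pipe to stdout unchanged.
split_header() {
    # dd consumes exactly 4 bytes; od renders them as hex pairs;
    # tr strips the spaces and the newline that od adds.
    dd bs=1 count=4 2>/dev/null | od -A n -t x1 | tr -d ' \n'
    echo
    # The remaining bytes are still waiting on stdin.
    cat
}

printf 'ABCDEFGH\n' | split_header
# prints:
# 41424344
# EFGH
```

The second output line shows that only the four header bytes were taken off the pipe.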

Expanding on that, here is a working solution. It uses a temporary file to store the binary header. If the header matches neither gzip nor bzip2, the script reports the unknown header and exits with status 2:

#!/bin/sh

trap "rm -f /tmp/$$; exit 1" 1 2 3 15

# grab the 1st 4 bytes off the input stream,
# store them in a file, convert to ascii,
# and store in variable:
HEADER=$(
    dd bs=1 count=4 2>/dev/null |
    tee /tmp/$$ |
    od -t x1 |
    sed '
        s/^00* //
        s/ //g
        q
    '
)

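case "$HEADER" in
    # gzip magic is 1f 8b; the bytes after it (method, flags) vary per file
    1f8b*)
        UNCOMPRESS='gzip -d -fc'
    ;;
    # bzip2 magic is "BZh" (42 5a 68); the 4th byte is the block-size digit
    425a68*)
        UNCOMPRESS='bzip2 -d -fc'
    ;;
    *)
        echo >&2 "$0: unknown stream type for header '$HEADER'"
        rm -f /tmp/$$
        exit 2
    ;;
esac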

echo >&2 "$0: File header is '$HEADER' using '$UNCOMPRESS' on stream."
cat /tmp/$$ - | $UNCOMPRESS
rm /tmp/$$
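
A variant that avoids the temporary file can be sketched as follows. This is an illustration under stated assumptions rather than a tested replacement: the function name detect_and_decompress is hypothetical, the xz branch (magic bytes fd 37 7a 58) is an addition that the script above does not have, and the matching decompression tool is assumed to be installed. The header is kept only as hex text and turned back into raw bytes with printf '%b' before the rest of the stream is appended:

```shell
#!/bin/sh
# Hypothetical function: pick a decompressor from the first 4 bytes of stdin.
detect_and_decompress() {
    # Read exactly 4 bytes and keep them as hex pairs, e.g. " 1f 8b 08 00 ".
    hex=$(dd bs=1 count=4 2>/dev/null | od -A n -t x1 | tr -s ' \n' ' ')

    # Match only the fixed part of each magic number.
    case "$(printf '%s' "$hex" | tr -d ' ')" in
        1f8b*)     tool='gzip -dc' ;;
        425a68*)   tool='bzip2 -dc' ;;
        fd377a58*) tool='xz -dc' ;;    # assumption: xz support is wanted
        *)
            echo "unknown stream type for header '$hex'" >&2
            return 2
        ;;
    esac

    {
        # Re-emit the header bytes: hex -> octal -> raw byte via %b.
        for b in $hex; do
            printf '%b' "\\0$(printf '%03o' "0x$b")"
        done
        # Then pass the untouched remainder of the stream.
        cat
    } | $tool
}
```

It would be used in the same position as the decompression utility in the question, for example: ssh user@ipaddr "cat misteryCompressedFile" | detect_and_decompress > /opt/files/uncompressedfile (after defining or sourcing the function). Going through hex matters because a shell variable cannot reliably hold NUL bytes, which do occur in these headers.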