Shell – Self-extracting scripts: tar -xO and dd

dd, shell-script, tar

I am working with a self-extracting script that installs packages on a QNAP NAS.

It has some scripting at the start which extracts the rest of the file. Here it goes:

script_len=102
/bin/dd if="${0}" bs=$script_len skip=1 | /bin/tar -xO | /bin/tar -xzv

This uses dd to skip the first 102 bytes of the script (one block of size bs=102) and pipe everything after that into tar, where it is extracted.
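As a sanity check of that reading of bs and skip, here is a minimal sketch with a made-up 6-byte header. The file name blob and the variable header_len are hypothetical and not taken from the QNAP script.

```shell
# Scratch directory so nothing outside it is touched.
work=$(mktemp -d)

# A 6-byte "header" followed by the data we actually want.
printf 'HEADERpayload' > "$work/blob"
header_len=6

# bs sets the block size and skip=1 drops exactly one such block,
# so dd starts copying right after the header.
rest=$(dd if="$work/blob" bs="$header_len" skip=1 2>/dev/null)
echo "$rest"
```

Setting bs to the header length is what makes skip=1 mean "skip the header"; dd counts skip in blocks, not bytes.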

What does -xO do? And why is the data extracted "twice" (two invocations of tar with -x)? I couldn't find much discussion of this online – the man page seems to suggest it's something to do with "drives". (Looks like I got my 0s and Os mixed up!)

Subsequently, the script does:

offset=$(/usr/bin/expr $script_len + 2042)
/bin/dd if="${0}" bs=$offset skip=1 | /bin/cat | /bin/dd bs=1024 count=7 of=$_EXTRACT_DIR/data.tar.gz

This appears to skip further into the file and copy the bytes found there into a new gzipped tar file. Presumably those bytes are already structured and encoded that way.

But didn't we already read those bytes through tar in the first command? I see no way in which dd was told to stop reading the file.

Best Answer

Let's take a look at a QNAP package, e.g. http://www.twonkyforum.com/downloads/8.3/TwonkyServerEU_8.3_arm-x41.qpkg

....
script_len=2467
/bin/dd if="${0}" bs=$script_len skip=1 | /bin/tar -xO | /bin/tar -xzv -C $_EXTRACT_DIR
....

Now let's copy the data with dd, and look at what's inside:

%dd if=TwonkyServerEU_8.3_arm-x41.qpkg bs=2467 skip=1 > first

That's a raw TAR archive, with a single tar.gz file inside it:

%file first 
first: POSIX tar archive (GNU)

%tar -tvf first 
-rw-r--r-- admin/administrators 7175 2017-01-06 17:49 control.tar.gz

The next pipeline step is /bin/tar -xO, and here is what the tar manual says about it:

To write the extracted files to the standard output, instead of creating the files on the file system, use '--to-stdout' ('-O') in conjunction with '--extract' ('--get', '-x').

This option is useful if you are extracting files to send them through a pipe, and do not need to preserve them in the file system. If you extract multiple members, they appear on standard output concatenated, in the order they are found in the archive.

As there is just one file, control.tar.gz, inside the archive, its contents are written to standard output and passed to the next pipeline step, which invokes tar again to extract what is inside it.

So, basically, there is a 'tar.gz' archive inside the 'tar' archive, which is why two sequential tar commands are necessary to extract it.
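The same layout can be reproduced locally to see both tar steps at work. This is only a sketch: hello.txt, header, outer.tar, package.bin and out are made-up names, and the header is a stand-in for the real installer script. It passes -f - explicitly, whereas the QNAP script relies on tar reading standard input by default, which not every tar build does.

```shell
# Work in a scratch directory.
work=$(mktemp -d)
cd "$work" || exit 1

# Inner gzip-compressed archive, wrapped in a plain outer tar.
echo "hello from the package" > hello.txt
tar -czf control.tar.gz hello.txt
tar -cf outer.tar control.tar.gz

# Fake "script" header of known length, followed by the outer archive.
printf '#!/bin/sh\n# header\n' > header
script_len=$(wc -c < header | tr -d ' ')
cat header outer.tar > package.bin

# Same shape as the installer pipeline: skip the header,
# unwrap the outer tar to stdout, then extract the inner tar.gz.
mkdir out
dd if=package.bin bs="$script_len" skip=1 2>/dev/null \
  | tar -xO -f - \
  | tar -xz -C out -f -

result=$(cat out/hello.txt)
echo "$result"
```

Dropping the middle tar -xO step would make the last tar try to gunzip a plain tar stream, which is why both invocations are needed.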

Note that tar is inherently designed to operate on stream data, so it can reliably detect the end of archive, even if it is followed by more data:

Physically, an archive consists of a series of file entries terminated by an end-of-archive entry, which consists of two 512 blocks of zero bytes.

So tar -xO stops processing at the end-of-archive marker of the first archive and ignores whatever follows it in the file, which I guess was the rationale for using this storage format in qpkg.
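A small experiment along these lines shows the effect. It is a sketch with made-up names (a.txt, first.tar, combined.bin), and the appended text merely stands in for the second archive that follows in a real qpkg file.

```shell
# Work in a scratch directory.
work=$(mktemp -d)
cd "$work" || exit 1

# A one-member archive; tar writes the end-of-archive marker itself.
echo "payload" > a.txt
tar -cf first.tar a.txt

# Append bytes that are not part of the archive.
cp first.tar combined.bin
printf 'TRAILING BYTES THAT ARE NOT PART OF THE ARCHIVE\n' >> combined.bin

# Only the member's contents come out; the trailing bytes are ignored.
extracted=$(tar -xO -f - < combined.bin)
echo "$extracted"
```

The trailing bytes never reach standard output because tar treats the two zero blocks as the end of its input.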
