How to change the first line of a big gzip file without decompressing all of it

compression csv gzip head

I currently have a compressed file, A.gz, which contains a lot of tabulated data, including a header in the first line. I want to create another file, B.gz, which has the same data as the previous file but with a different header.

The simple way to do this would be decompressing all of A.gz, tail-ing everything but the first line, and re-compressing everything. However, this seems terribly inefficient, especially because the concatenation of two gzip-ed files decompresses correctly to the concatenation of the decompressed versions.

I was wondering whether there is a way to do something like this:

zcat A.gz | head -n 1 | process_header | gzip > B.gz
cat A.gz | (remove compressed header) >> B.gz

Without having to decompress all of A.gz.

Best Answer

If you just wanted to insert another line on top, it would be simple.

echo some line | gzip > newfile.gz
cat newfile.gz oldfile.gz > result.gz

gzip allows concatenation, as long as you don't mind tools reporting a wrong uncompressed file size when they only inspect the file without decompressing it. Also, some programs cannot handle such files, WinRAR for example.
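As a quick check of the concatenation behaviour, here is a minimal sketch that builds two tiny gzip files in a scratch directory and confirms that the joined file decompresses to both payloads in order. The file contents and the workdir and combined variables are made up for illustration.

```shell
# Scratch directory so no existing files are touched.
workdir=$(mktemp -d)

# Two independent gzip files (illustrative contents).
printf 'new first line\n' | gzip > "$workdir/newfile.gz"
printf 'old line 1\nold line 2\n' | gzip > "$workdir/oldfile.gz"

# Byte-wise concatenation of the two compressed files.
cat "$workdir/newfile.gz" "$workdir/oldfile.gz" > "$workdir/result.gz"

# Decompress the joined file; both payloads come out back to back.
combined=$(gzip -dc "$workdir/result.gz")
printf '%s\n' "$combined"
```

gzip -dc is used instead of zcat only because zcat behaves differently on some systems.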

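To get closer to what you actually want, the question is whether your gzip file is made up of blocks that function entirely independently of one another, and if so, how to find the block boundary. In a typical single-stream gzip file they generally do not: DEFLATE blocks may reference up to 32 KiB of preceding data and need not start on a byte boundary.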

If you knew beforehand that you wanted to do this and created the gzip file by concatenating two independent gzip files in the first place, it would be easy to solve. On arbitrary gzip files, however, if it can be done at all, it would require more in-depth knowledge of the gzip file format.
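If you do control how A.gz is created, the "plan ahead" case could look roughly like the sketch below. It assumes the header was compressed as its own gzip file and that you recorded its compressed size (hdr_bytes here, a hypothetical name); the sample data is made up. The body is copied byte for byte and never decompressed.

```shell
workdir=$(mktemp -d)

# Create A.gz as two concatenated gzip files: header, then body.
printf 'id\tname\n' | gzip > "$workdir/header.gz"
printf '1\ta\n2\tb\n' | gzip > "$workdir/body.gz"
cat "$workdir/header.gz" "$workdir/body.gz" > "$workdir/A.gz"

# Assumption: the compressed size of the header part is known or stored.
hdr_bytes=$(wc -c < "$workdir/header.gz")

# Write the new header as a fresh gzip file ...
printf 'ID\tNAME\n' | gzip > "$workdir/B.gz"

# ... then append everything in A.gz after the old header's bytes.
# tail -c +N starts output at byte N, hence the +1.
tail -c +"$((hdr_bytes + 1))" "$workdir/A.gz" >> "$workdir/B.gz"

gzip -dc "$workdir/B.gz"
```

This only works because the boundary between the two parts is known exactly; on a file compressed in one go there is no such byte offset to cut at.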

I remember there was such a program for bzip2 (I forget its name); it created a bzip2 block map that allowed direct access to specific offsets without decompressing everything that came before them.

The bottom line, though, is that most people just recompress. You likely won't be able to avoid rewriting the entire file anyway, and writing files is usually slower than gzip can compress data, so even if you managed to pull it off, you'd probably save some CPU cycles but no time.
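For reference, the plain recompress route can be done as a single streaming pipeline, so the decompressed data never has to be stored on disk. This is only a sketch: the sample data is made up, and the fixed printf stands in for whatever process_header would produce.

```shell
workdir=$(mktemp -d)

# Sample input with an old header line (illustrative data).
printf 'old_a\told_b\n1\t2\n3\t4\n' | gzip > "$workdir/A.gz"

{
  # New header; stand-in for the output of process_header.
  printf 'new_a\tnew_b\n'
  # Stream the old data and drop its first line.
  gzip -dc "$workdir/A.gz" | sed 1d
} | gzip > "$workdir/B.gz"

result=$(gzip -dc "$workdir/B.gz")
printf '%s\n' "$result"
```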


Not a solution to your gzip question, but: don't use tail with a computed line count to get rid of the first line. That is probably very inefficient compared to sed 1d (or tail -n +2), which simply streams the input. There is no need to count all the lines of a file just to get rid of the first one.