How to change the first line of a big gzip file without decompressing all of it

compression csv gzip head

I currently have a compressed file, A.gz, which contains lots of tabulated data, including a header in the first line. I want to create another file, B.gz, which has the same data as the previous file, but with a different header.

The simple way to do this would be to decompress all of A.gz, use tail to keep everything but the first line, and recompress everything. However, this seems terribly inefficient, especially because the concatenation of two gzipped files decompresses correctly to the concatenation of the decompressed versions.

I was wondering whether there is a way to do something like this:

zcat A.gz | head -n 1 | process_header | gzip > B.gz
cat A.gz | (remove compressed header) >> B.gz

Without having to decompress all of A.gz.

Best Answer

If you just wanted to insert another line on top, it would be simple.

echo some line | gzip > newfile.gz
cat newfile.gz oldfile.gz > result.gz

gzip allows concatenation, as long as you don't mind the wrong uncompressed file size being reported when you just inspect the file without uncompressing it. Also, some programs cannot handle such files, WinRAR for example.
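
To see where the wrong size comes from, here is a small sketch (file contents are made up for illustration). Each gzip member ends with a 4-byte little-endian ISIZE field holding the uncompressed size of that member only, modulo 2^32, so the trailer of a concatenated file describes just the last member. The sketch reads that trailer by hand and compares it with the real decompressed size:

```shell
work=$(mktemp -d) && cd "$work" || exit 1

printf 'some line\n' | gzip > newfile.gz           # 10 bytes uncompressed
printf 'row1\nrow2\nrow3\n' | gzip > oldfile.gz      # 15 bytes uncompressed
cat newfile.gz oldfile.gz > result.gz

# Real size: decompress everything and count the bytes.
actual=$(( $(gzip -dc result.gz | wc -c) ))

# Trailer size: the last 4 bytes of the file, least significant byte first.
set -- $(tail -c 4 result.gz | od -An -tu1)
trailer=$(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))

echo "actual=$actual trailer=$trailer"
```

Here the decompressed stream is 25 bytes, while the trailer only accounts for the 15 bytes of the last member. Any tool that trusts the trailer instead of decompressing will report the smaller number.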

To get closer to what you actually want, the question is whether your gzip file is made up of blocks that function entirely independently of one another, and if so, how to find the block boundary.

If you knew beforehand that you wanted to do this, and created the gzip file by concatenating two independent gzip files in the first place, it would be easy to solve. On arbitrary gzip files, however, if it can be done at all, it would require more in-depth knowledge of the gzip file format.
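
A minimal sketch of that "planned ahead" case, assuming you control how A.gz is built and can record the compressed size of the header member (all file names and contents here are made up). The body member is copied byte for byte and never decompressed:

```shell
work=$(mktemp -d) && cd "$work" || exit 1

# Build step: compress header and body as two separate gzip members.
printf 'old_header\n' | gzip > header.gz
printf 'row1\nrow2\n' | gzip > body.gz
cat header.gz body.gz > A.gz
header_size=$(( $(wc -c < header.gz) ))    # compressed size of the first member

# Later: swap the header without touching the compressed body.
{
    printf 'new_header\n' | gzip
    tail -c +"$((header_size + 1))" A.gz   # everything after the first member
} > B.gz

gzip -dc B.gz
```

This only works because the boundary between the two members is known. For a file produced by a single gzip run there is no such boundary after the first line.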

I remember there was such a program for bzip2 (but I forgot its name). It created a bzip2 block map that would allow you direct access to specific offsets without uncompressing everything that came before them.

The bottom line, though, is that most people just recompress. You likely won't be able to avoid rewriting the entire file anyhow, and writing files is usually slower than gzip can compress data, so even if you managed to pull it off, you'd probably save some CPU cycles, but no time.

Not a solution to your gzip question, but: be careful how you use tail to get rid of the first line. Counting all the lines of a file just to ask tail for the last N-1 of them is very inefficient; sed 1d or tail -n +2 drops the first line without counting anything.
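
If you do end up recompressing, it can at least be done in a single decompression pass. The sketch below is one way to do it, not the only one: the shell's read takes just the first line off the pipe, the header is rewritten, and cat passes the rest through untouched. process_header is the placeholder name from the question; the tr body here is only a stand-in, and the sample data is made up.

```shell
work=$(mktemp -d) && cd "$work" || exit 1

# Hypothetical stand-in for the header transformation from the question.
process_header() { tr 'a-z' 'A-Z'; }

printf 'col_a\tcol_b\n1\t2\n3\t4\n' | gzip > A.gz

gzip -dc A.gz | {
    IFS= read -r header                        # consume only the first line
    printf '%s\n' "$header" | process_header   # emit the rewritten header
    cat                                        # copy the remaining lines as-is
} | gzip > B.gz

gzip -dc B.gz
```

A.gz is read once and B.gz is written once, with no temporary uncompressed copy on disk.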

Related Solutions

Fastest way of working out uncompressed size of large GZIPPED file

I believe the fastest way is to modify gzip so that testing in verbose mode outputs the number of bytes decompressed; on my system, with a 7761108684-byte file, I get

% time gzip -tv test.gz
test.gz:     OK (7761108684 bytes)
gzip -tv test.gz  44.19s user 0.79s system 100% cpu 44.919 total

% time zcat test.gz| wc -c
7761108684
zcat test.gz  45.51s user 1.54s system 100% cpu 46.987 total
wc -c  0.09s user 1.46s system 3% cpu 46.987 total

To modify gzip (1.6, as available in Debian), the patch is as follows:

--- a/gzip.c
+++ b/gzip.c
@@ -61,6 +61,7 @@
 #include <stdbool.h>
 #include <sys/stat.h>
 #include <errno.h>
+#include <inttypes.h>

 #include "closein.h"
 #include "tailor.h"
@@ -694,7 +695,7 @@

     if (verbose) {
         if (test) {
-            fprintf(stderr, " OK\n");
+            fprintf(stderr, " OK (%jd bytes)\n", (intmax_t) bytes_out);

         } else if (!decompress) {
             display_ratio(bytes_in-(bytes_out-header_bytes), bytes_in, stderr);
@@ -901,7 +902,7 @@
     /* Display statistics */
     if(verbose) {
         if (test) {
-            fprintf(stderr, " OK");
+            fprintf(stderr, " OK (%jd bytes)", (intmax_t) bytes_out);
         } else if (decompress) {
             display_ratio(bytes_out-(bytes_in-header_bytes), bytes_out,stderr);
         } else {

Shell – How to decompress and print the last few lines of a compressed text file

You can't, as has already been said, if the files have been compressed with standard gzip. If you have control over the compression, you can use dictzip to compress the files: it compresses them in separate blocks, so you can decompress just the last block (typically 64KB). It is also backward compatible with gzip, meaning a dictzipped file is a perfectly legal gzip file as well.

Another possibility: if you get the gzipped file as a concatenation of several already gzipped files, you could search for the last gzip signature and decompress everything from there on.
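
A rough sketch of the signature search, using only od, tr and awk (sample data is made up). It looks for the last occurrence of the bytes 1f 8b 08, which is how a deflate-compressed gzip member starts. This is a heuristic: the same three bytes can also occur by chance inside compressed data, in which case gzip will fail on the chosen offset and you would have to try an earlier match. Dumping every byte through od is also slow on very large files.

```shell
work=$(mktemp -d) && cd "$work" || exit 1

# Sample input: two gzip members concatenated (-n omits name and timestamp).
printf 'first part\n' | gzip -n > all.gz
printf 'line 1\nline 2\nline 3\n' | gzip -n >> all.gz

# Print the 0-based byte offset of the last 1f 8b 08 sequence in file $1.
last_signature_offset() {
    od -An -v -tx1 "$1" | tr -s ' ' '\n' | awk '
        $1 == "" { next }                 # skip the empty line left at the start
        {
            n++                           # n = number of bytes seen so far
            if (p2 == "1f" && p1 == "8b" && $1 == "08") last = n - 3
            p2 = p1; p1 = $1
        }
        END { if (last == "") exit 1; print last }'
}

offset=$(last_signature_offset all.gz) || exit 1
last_lines=$(tail -c +"$((offset + 1))" all.gz | gzip -dc | tail -n 2)
printf '%s\n' "$last_lines"
```

Only the final member is decompressed, so the cost depends on the size of that member rather than on the whole file.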
