Linux – unzip and merge sorted text files in a single operation

linux, sort

I have tens of thousands of gzipped text files, each of which has already been sorted internally.

I've written a script that unzips them (creating a copy of each) and then uses sort -m to merge them.
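For reference, a minimal sketch of that two-step approach (not the actual script; it assumes the inputs live under data/, GNU gzip 1.6+ for gunzip -k, and GNU sort for --batch-size) might look like this:

#!/usr/bin/env bash
set -euo pipefail

files=( data/*.gz )

# Pass 1: decompress each file to an uncompressed copy, keeping the .gz original.
for f in "${files[@]}"; do
    gunzip -k "$f"
done

# Remaining passes: merge the already-sorted copies, 16 inputs at a time.
# With tens of thousands of inputs, sort spills intermediate merges to temporary
# files and re-reads them, which is what produces the repeated full scans.
sort -m --batch-size=16 -o merged.txt "${files[@]%.gz}"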

It all works very well, but in trying to improve the performance I noted that the unzipping operation is about 16-20% of the total run time.

Basically, the merge operation has to scan through the full set of text files repeatedly, merging 16 files into one at a time. With the amount of data I have, that typically works out to about 4-5 full scans. The unzipping essentially adds one more scan through it all.

Saving 1 out of 5-6 scans through 50-100 GB of data is a substantial saving.

Thus I started wondering if I could feed the output of gunzip straight into sort (or some other tool) and basically combine the unzipping round with the first round of merging. The output would not be gzipped.

Does anyone know if this can be done? Ideally with the standard tools, but I'm also open to a third-party tool if it fits the bill.

Best Answer

You can: zcat (or gzip -cd) writes the decompressed data to stdout, and from there you can use bash's process substitution. In the example below, 1.gz contains the numbers 1 through 3, a.gz the letters a through c, and z.gz the number 4 and the letter d.
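If you want to reproduce the example, one way to create those sample files (assuming GNU coreutils and gzip) is:

$ seq 1 3 | gzip > 1.gz
$ printf 'a\nb\nc\n' | gzip > a.gz
$ printf '4\nd\n' | gzip > z.gz

Each <(zcat ...) below decompresses one file on the fly, and sort -m merges the already-sorted streams in a single pass: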

$ sort -m <(zcat 1.gz) <(zcat a.gz) <(zcat z.gz)
1
2
3
4
a
b
c
d

Hopefully your data is somewhat more interesting.
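With tens of thousands of inputs you won't type the process substitutions by hand, and a single command line with that many of them would also run into the per-process limit on open file descriptors. One way to apply the same idea to just the first merge pass (a sketch only, assuming bash, GNU sort, inputs under data/, and the 16-file batches from the question) is to build each batch's command and eval it:

#!/usr/bin/env bash
set -euo pipefail

files=( data/*.gz )
batch_size=16

for (( i = 0; i < ${#files[@]}; i += batch_size )); do
    batch=( "${files[@]:i:batch_size}" )

    # Build one "<(zcat file)" process substitution per input in this batch.
    # eval is needed because process substitutions cannot be expanded from a variable.
    args=""
    for f in "${batch[@]}"; do
        args+=" <(zcat $(printf '%q' "$f"))"
    done

    eval "sort -m$args" > "$(printf 'merged-%05d.txt' $(( i / batch_size )))"
done

The resulting merged-*.txt files can then go through the remaining merge passes exactly as before, so the decompression is folded into the first merge pass instead of being a separate scan.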