GNU sort --compress-program compressing only first temporary

compression, gnu, sort

I am sorting big files (>100 GB), and to reduce the time spent on disk writes, I am trying to use GNU sort's --compress-program parameter. (Related: How to sort big files?)

However, it appears in certain cases that only the first temporary is compressed. I would like to know why, and what I could do to compress all temporaries.

I am using:

  • sort (GNU coreutils) 8.25
  • lzop 1.03 / LZO library 2.09

Steps to reproduce the issue:

You will need roughly 15 GB of free space, about 10 GB of RAM, and some time.

First, create a 10 GB file (10^9 lines of nine digits plus a newline) with the following C code:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    unsigned long n;
    unsigned char i;
    srand(42);
    for(n = 0; n < 1000000000; n++) {
        for(i = 0; i < 3; i++) {
            printf("%03d", rand() % 1000);
        }
        printf("\n");
    }
    fflush(stdout);
    return 0;
}

And running it:

$ gcc -Wall -O3 -o generate generate.c
$ ./generate > test.input  # takes a few minutes
$ head -n 4 test.input
166740881
241012758
021940535
743874143

Then, start the sort process:

$ LC_ALL=C sort -T . -S 9G --compress-program=lzop test.input -o test.output

After some time, suspend the process, and list the temporaries created in the same folder (due to -T .):

$ ls -s sort*
890308 sortc1JZsR
890308 sorte7O878
378136 sortrK37RZ
$ file sort*
sortc1JZsR: ASCII text
sorte7O878: ASCII text
sortrK37RZ: lzop compressed data - version 1.030, LZO1X-1, os: Unix

It seems that only sortrK37RZ (the first temporary created) has been compressed.

[Edit] Running that same sort command with -S set to 7G is fine (i.e. all temporaries are compressed), while with 8G the issue is present.

[Edit] lzop is not called for the other temporaries.

I tried using the following script as a wrapper for lzop:

#!/bin/bash
set -e
echo "$$: start at $(date)" >> log
lzop "$@"
echo "$$: end at $(date)" >> log

Here is the content of the log file after several temporaries have been written to disk:

11109: start at Sun Apr 10 22:56:51 CEST 2016
11109: end at Sun Apr 10 22:57:17 CEST 2016

So my guess is that the compress program is not called at all.

Best Answer

Not reproduced here:

$ shuf -i1-10000000 > t.in
$ sort -S50M -T. t.in --compress-program=lzop  # ^z
$ file sort* | tee >(wc -l) > >(grep -v lzop)
7
$ fg   # ^c
$ sort --version | head -n1
sort (GNU coreutils) 8.25

My guess is that the issue is a failure to fork() the compression process due to the large memory size, after which sort falls back to a plain uncompressed write. I.e., sort(1) is using fork()/exec() when ideally it should be using posix_spawn() to launch the compression process more efficiently. Now fork() is copy-on-write, but there is still overhead in preparing the associated accounting structures for such a large process. Future versions of sort(1) will use posix_spawn() to avoid this overhead (glibc has only just got a usable implementation of posix_spawn() as of version 2.23).

In the meantime, I would suggest using a much smaller -S, perhaps -S1G or below.
