I am sorting big files (>100 GB), and to reduce the time spent on disk writes, I am trying to use GNU sort's --compress-program parameter. (Related: How to sort big files?)
However, it appears that in certain cases only the first temporary file is compressed. I would like to know why, and what I can do to get all the temporaries compressed.
I am using:
sort (GNU coreutils) 8.25
lzop 1.03
LZO library 2.09
Steps to reproduce the issue:
You will need something like ~15 GB of free space, ~10 GB of RAM, and some time.
First, create a 10 GB file with the following C code:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    unsigned long n;
    unsigned char i;
    srand(42);
    for (n = 0; n < 1000000000; n++) {
        for (i = 0; i < 3; i++) {
            printf("%03d", rand() % 1000);
        }
        printf("\n");
    }
    fflush(stdout);
    return 0;
}
And running it:
$ gcc -Wall -O3 -o generate generate.c
$ ./generate > test.input # takes a few minutes
$ head -n 4 test.input
166740881
241012758
021940535
743874143
Then, start the sort process:
$ LC_ALL=C sort -T . -S 9G --compress-program=lzop test.input -o test.output
After some time, suspend the process, and list the temporaries created in the same folder (due to -T .):
$ ls -s sort*
890308 sortc1JZsR
890308 sorte7O878
378136 sortrK37RZ
$ file sort*
sortc1JZsR: ASCII text
sorte7O878: ASCII text
sortrK37RZ: lzop compressed data - version 1.030, LZO1X-1, os: Unix
It seems that only sortrK37RZ (the first temporary created) has been compressed.
[Edit] Running that same sort command with -S set to 7G is fine (i.e. all temporaries are compressed), while with 8G the issue is present.
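For reference, those runs are the same command as above with only the -S value changed:
$ LC_ALL=C sort -T . -S 7G --compress-program=lzop test.input -o test.output   # all temporaries compressed
$ LC_ALL=C sort -T . -S 8G --compress-program=lzop test.input -o test.output   # only the first temporary compressed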
[Edit] lzop is not called for the other temporaries
I tried using the following script as a wrapper for lzop:
#!/bin/bash
set -e
echo "$$: start at $(date)" >> log
lzop "$@"
echo "$$: end at $(date)" >> log
Here is the content of the log file when several temporaries are written to disk:
11109: start at Sun Apr 10 22:56:51 CEST 2016
11109: end at Sun Apr 10 22:57:17 CEST 2016
So my guess is that the compression program is not called at all for the other temporaries.
Best Answer
I could not reproduce this here.
My guess is that the issue is a failure to fork() the compression process, due to the large memory size of the sort process, with sort then falling back to a plain uncompressed write. That is, sort(1) uses fork()/exec() when ideally it should use posix_spawn() to launch the compression process more efficiently. fork() is copy-on-write, but there is still overhead in preparing the associated accounting structures for such a large process. Future versions of sort(1) will use posix_spawn() to avoid this overhead (glibc has only just gained a usable posix_spawn() implementation, as of version 2.23).
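To illustrate the mechanism, here is a minimal standalone sketch (not sort's actual code; the 8 GiB allocation and the lzop --version call are arbitrary stand-ins): a parent with a huge heap can see fork() fail with ENOMEM under memory pressure or strict overcommit, while posix_spawn() can start the child without duplicating the parent's address space.

#include <errno.h>
#include <spawn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

int main(void) {
    /* Simulate a large sort buffer (adjust for your machine). */
    size_t big = (size_t)8 << 30;           /* ~8 GiB, as with -S 8G */
    char *buf = malloc(big);
    if (buf != NULL)
        memset(buf, 1, big);                /* touch the pages so they are really committed */

    /* fork()+exec(): accounting for a copy of such a large address space
       may fail with ENOMEM, depending on overcommit settings and on the
       available memory and swap. */
    pid_t pid = fork();
    if (pid < 0) {
        fprintf(stderr, "fork: %s\n", strerror(errno));
    } else if (pid == 0) {
        execlp("lzop", "lzop", "--version", (char *)NULL);
        _exit(127);
    } else {
        waitpid(pid, NULL, 0);
    }

    /* posix_spawn(): glibc can implement this without copying the
       parent's address space, so it does not hit the same limit. */
    char *argv[] = { "lzop", "--version", NULL };
    int err = posix_spawnp(&pid, "lzop", NULL, NULL, argv, environ);
    if (err != 0)
        fprintf(stderr, "posix_spawnp: %s\n", strerror(err));
    else
        waitpid(pid, NULL, 0);

    free(buf);
    return 0;
}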
In the meantime I would suggest using a much smaller -S, perhaps -S1G or below.
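That is, something like the original command with a reduced buffer:
$ LC_ALL=C sort -T . -S 1G --compress-program=lzop test.input -o test.output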