GNU parallel excessively slow

gnu-parallel grep xargs

I need to run grep on a couple of million files. To speed it up, I tried the two approaches mentioned here: xargs -P -n and GNU parallel. I tested both on a subset of my files (9026 files), with these results:

With xargs -P 8 -n 1000, very fast:

$ time find tex -maxdepth 1 -name "*.json" | \
                xargs -P 8 -n 1000 grep -ohP "'pattern'" > /dev/null

real    0m0.085s
user    0m0.333s
sys     0m0.058s

With parallel, very slow:

$ time find tex -maxdepth 1 -name "*.json" | \
                parallel -j 8 grep -ohP "'pattern'" > /dev/null

real    0m21.566s
user    0m22.021s
sys     0m18.505s

Even sequential xargs is faster than parallel:

$ time find tex -maxdepth 1 -name "*.json" | \
                xargs grep -ohP 'pattern' > /dev/null

real    0m0.242s
user    0m0.209s
sys     0m0.040s

xargs -P n does not work for me because the output from all the processes gets interleaved, which does not happen with parallel. So I would like to use parallel without incurring this huge slowdown.

Any ideas?

UPDATE

Following the answer by Ole Tange, I tried parallel -X. The results, for completeness:

$ time find tex -maxdepth 1 -name "*.json" | \
    parallel -X -j 8 grep -ohP "'pattern'" > /dev/null

real    0m0.563s
user    0m0.583s
sys     0m0.110s

Fastest solution: Following the comment by @cas, I tried grep with the -H option (to force printing the file names) and sorted the output. Results:

$ time find tex -maxdepth 1 -name '*.json' -print0 | \
    xargs -0r -P 9 -n 500 grep --line-buffered -oHP 'pattern' | \
    sort -t: -k1 | cut -d: -f2- > /dev/null

real    0m0.144s
user    0m0.417s
sys     0m0.095s
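
The -H, sort, cut idea can be tried on a throwaway directory. The sketch below is only an illustration: the file names, contents and pattern are made up, and it drops -P from grep and -r from xargs for portability. It also swaps in sort -s -k1,1 (a stable sort on the file-name field only, available in GNU and BSD sort but not required by POSIX), on the assumption that matches from one file should keep their original order. It assumes no file path contains a colon.

```shell
# Build a small throwaway corpus (hypothetical names and contents).
tmp=$(mktemp -d)
for i in 1 2 3 4 5 6; do
    printf 'id=%s-a\nnoise\nid=%s-b\n' "$i" "$i" > "$tmp/f$i.json"
done

# -H prefixes every match with its file name, so lines coming from
# several grep processes can be regrouped afterwards.
# sort -s -k1,1 compares the file name only and is stable, so the
# matches of each file stay in the order grep printed them.
# cut then strips the file-name prefix again.
result=$(find "$tmp" -maxdepth 1 -name '*.json' -print0 |
    xargs -0 -P 4 -n 2 grep --line-buffered -oH 'id=[0-9]*-[ab]' |
    LC_ALL=C sort -s -t: -k1,1 | cut -d: -f2-)

printf '%s\n' "$result"
rm -rf "$tmp"
```

Sorting on the whole line, as sort -t: -k1 does, also groups the output by file but may reorder the matches within a file.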

Best Answer

Try parallel -X. As noted in the comments, the likely cause is the overhead of starting a new shell and opening files for buffering for every single argument.

Be aware that GNU Parallel will never be as fast as xargs because of that. Expect an overhead of 10 ms per job. With -X this overhead is less significant as you process more arguments in one job.
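
Why batching helps can be seen by counting process launches. The sketch below uses xargs -n as a stand-in for the batching that parallel -X performs (parallel itself is not invoked), with a scaled-down list of 2000 made-up arguments rather than the 9026 files from the question.

```shell
# Count how many times the command is started for 2000 arguments.
# Every started command prints exactly one line, whatever its arguments.
count_launches() {
    seq 2000 | xargs "$@" sh -c 'echo started' sh | wc -l | tr -d ' '
}

one_per_job=$(count_launches -n 1)    # one argument per command
batched=$(count_launches -n 500)      # up to 500 arguments per command

echo "one argument per job: $one_per_job launches"
echo "500 arguments per job: $batched launches"
```

If the estimate of 10 ms per job holds, 2000 launches cost about 20 s of overhead, while 4 launches cost about 40 ms.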

Related Solutions

Linux – Can GNU Parallel execute more parallel processes

Not only is it possible; it is also recommended in some situations.

GNU Parallel takes around 10 ms to run a job. So if you have 8 cores and the jobs you run take less than 70 ms, then you will see GNU Parallel use 100% of a single core, and yet there will be idle time on other cores. Thus you will not use 100% of all cores.

The other situation where it is recommended is if you want to run more jobs than -j0 will do. Currently -j0 will run around 250 jobs in parallel unless you adjust some system limits. It makes perfect sense to run more than 250 jobs if the jobs are not limited by CPU and disk I/O. This is for example true if network latency is the limiting factor.

However, using 2 lists is not the recommended way to split up jobs. The recommended way is to use GNU Parallel to call GNU Parallel:

cat list0 | parallel -j20 --pipe parallel -j100

That will run 2000 jobs in parallel. To run more adjust -j. It is recommended that the outer (the 20) is at least the number of cores, so that there will be at least one GNU Parallel process on each core.

Using this technique you should have no problem starting 20000 jobs in parallel; when you get over 32000 processes things start acting up.

By first running:

echo 4194304 | sudo tee /proc/sys/kernel/pid_max

I was able to run:

seq 1000000 2000000000 |
  parallel -j16 --roundrobin --pipe parallel -j0 --pipe parallel -j0 sleep

which will start 1 million processes in parallel (it takes 300 GB of RAM on my system).
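
Before attempting very large job counts it may help to look at the limits currently in effect. The sketch below only reads values and changes nothing. Which limits actually cap -j0 on a given system is an assumption here (open file descriptors and the PID range are likely candidates), and /proc/sys/kernel/pid_max exists only on Linux.

```shell
# Per-process limit on open file descriptors for the current shell.
open_files=$(ulimit -n)
echo "open files limit: $open_files"

# Upper bound of the PID range the kernel hands out (Linux only).
if [ -r /proc/sys/kernel/pid_max ]; then
    pid_max=$(cat /proc/sys/kernel/pid_max)
else
    pid_max=unknown
fi
echo "pid_max: $pid_max"
```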

POSIX-compliant recursive grep with no errors for inaccessible directories

To make grep print only the file name, pass the -l option. To search for a substring rather than a regular expression, pass the -F option.

To search recursively for files whose name matches a certain pattern, use find with the -type f and -name PATTERN primaries. Use -exec to invoke grep.

find . -name '*.sas' -type f -exec grep -F -l 'Carhart' {} +

If you want to avoid errors from directories that you aren't allowed to traverse, you can either use -perm, -user and -group to analyze permissions (which is difficult to get right, and won't work if you have ACLs), or call test (which is slower because it's an external program, but is more reliable).

find . -type d ! -exec test -r {} -a -x {} \; -prune -o \
       -name '*.sas' -type f -exec test -r {} \; -exec grep -F -l 'Carhart' {} +
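
The pruning variant can be checked on a scratch tree. This is a minimal sketch with hypothetical directory and file names; it assumes a find that substitutes every {} inside -exec ... \; (GNU and BSD find do). When run as root the "locked" directory is still readable, so its file is listed as well, but no error is printed in either case.

```shell
# Scratch tree: one readable directory, one without any permissions.
tmp=$(mktemp -d)
mkdir "$tmp/ok" "$tmp/locked"
printf 'model: Carhart four-factor\n' > "$tmp/ok/a.sas"
printf 'no match here\n' > "$tmp/ok/b.sas"
printf 'Carhart\n' > "$tmp/locked/c.sas"
chmod 000 "$tmp/locked"

# Same command as above; stderr is captured to show that the
# inaccessible directory is pruned instead of causing an error.
matches=$(cd "$tmp" && find . -type d ! -exec test -r {} -a -x {} \; -prune -o \
    -name '*.sas' -type f -exec test -r {} \; -exec grep -F -l 'Carhart' {} + \
    2>"$tmp/errors.log")
errors=$(cat "$tmp/errors.log")

printf 'matches:\n%s\n' "$matches"
printf 'errors: [%s]\n' "$errors"

# Restore permissions so the scratch tree can be removed.
chmod 700 "$tmp/locked"
rm -rf "$tmp"
```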
