Shell – How to use GNU parallel effectively

Tags: gnu-parallel, shell

Suppose I want to find all the matches in a compressed text file:

$ gzcat file.txt.gz | pv --rate -i 5 | grep some-pattern

pv --rate is used here to measure pipe throughput. On my machine it's about 420 MB/s (after decompression).

Now I'm trying to do parallel grep using GNU parallel.

$ gzcat file.txt.gz | pv --rate -i 5 | parallel --pipe -j4 --round-robin grep some-pattern

Now the throughput has dropped to ~260 MB/s. And what is more interesting, the parallel process itself is using a lot of CPU: more than the grep processes (but less than gzcat).

EDIT 1: I've tried different block sizes (--block), as well as different values for the -N/-L options. Nothing has helped so far.
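For reference, a block-size tuning attempt like the one mentioned above might look like this (a sketch; `file.txt.gz` and `some-pattern` are the question's placeholders, and `gzip -dc` is the portable equivalent of `gzcat`):

```shell
# Larger --block values amortize GNU parallel's per-chunk bookkeeping,
# at the cost of coarser load balancing across the grep workers.
gzip -dc file.txt.gz |
  parallel --pipe --block 10M -j4 --round-robin grep some-pattern
```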

What am I doing wrong?

Best Answer

I am really surprised you get ~260 MB/s using GNU Parallel's --pipe. My tests usually max out at around 100 MB/s.

Your bottleneck is most likely in GNU Parallel: --pipe is not very efficient. --pipepart, however, is: here I can get on the order of 1 GB/s per CPU core.

Unfortunately there are a few limitations on using --pipepart:

  • The file must be seekable (i.e. no pipe)
  • You must be able to find the start of a record with --recstart/--recend (i.e. no compressed file)
  • The line number is unknown (so you cannot have a 4-line record).

Example:

parallel --pipepart -a bigfile --block 100M grep somepattern
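Since --pipepart cannot read a compressed file directly, one workaround is to decompress once to a seekable temporary file and let --pipepart split that (a sketch; `file.txt.gz` and `some-pattern` are the question's placeholders):

```shell
# Decompress once to a seekable temp file, then let --pipepart
# split it into blocks; this trades disk space for throughput.
tmp=$(mktemp)
gzip -dc file.txt.gz > "$tmp"
parallel --pipepart -a "$tmp" --block 100M grep some-pattern
rm -f "$tmp"
```

The decompression step runs at single-core speed, but every subsequent pass over the data can then use --pipepart's faster splitting.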