GNU Parallel – grepping n lines for m regular expressions

Tags: gnu-parallel, grep, large-files

The GNU Parallel "grepping n lines for m regular expressions" example states the following:

If the CPU is the limiting factor parallelization should be done on
the regexps:

cat regexp.txt | parallel --pipe -L1000 --round-robin grep -f - bigfile

This will start one grep per CPU and read bigfile one time per CPU,
but as that is done in parallel, all reads except the first will be
cached in RAM.

So in this instance GNU Parallel round-robins the regular expressions from regexp.txt across parallel grep instances, with each grep instance reading bigfile separately. And as the documentation states above, disk caching probably ensures that bigfile is read from disk only once.
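
As a rough illustration of the data flow, here is a manual equivalent (a sketch only, not how GNU Parallel works internally: the real command starts exactly one grep per CPU and deals the 1000-line pattern chunks out to them, whereas this starts one grep per chunk, and the /tmp/rx_ file names are made up):

split -l 1000 regexp.txt /tmp/rx_   # one 1000-line pattern chunk per file
for f in /tmp/rx_*; do
  grep -f "$f" bigfile &            # each grep scans all of bigfile; only the
done                                # first scan actually hits the disk
wait                                # matches arrive in no particular order
rm /tmp/rx_*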

My question is this – the approach above is apparently seen as better, performance-wise, than an alternative that has GNU Parallel round-robin records from bigfile over parallel grep instances that each read regexp.txt, something like

cat bigfile | parallel --pipe -L1000 --round-robin grep -f regexp.txt -

Why would that be? As I see it, assuming disk caching is in play, bigfile and regexp.txt would each be read from disk once in either case. The one major difference I can think of is that the second approach passes significantly more data through the pipes: all of bigfile, rather than just regexp.txt.

Best Answer

It is due to GNU Parallel's --pipe being slow.

cat bigfile | parallel --pipe -L1000 --round-robin grep -f regexp.txt -

maxes out at around 100 MB/s.
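
One way to see that ceiling for yourself (assuming pv is installed; it simply copies its input to stdout while reporting throughput on stderr) is:

pv bigfile | parallel --pipe -L1000 --round-robin grep -f regexp.txt -

The rate pv prints is the rate at which GNU Parallel is able to drain the pipe.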

In the man page example you will also find:

parallel --pipepart --block 100M -a bigfile grep -f regexp.txt

which does close to the same thing, but maxes out at around 20 GB/s on a 64-core system. The difference is that with --pipepart the data never passes through GNU Parallel's own process: each grep reads its block directly from the file, which is why --pipepart takes a seekable file via -a bigfile rather than stdin.

parallel --pipepart --block 100M -a bigfile -k grep -f regexp.txt

should give exactly the same result as grep -f regexp.txt bigfile.
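
A quick way to check that claim (a sketch, assuming a bash shell for the process substitution):

diff <(grep -f regexp.txt bigfile) \
     <(parallel --pipepart --block 100M -a bigfile -k grep -f regexp.txt)

If diff prints nothing, the outputs are identical; -k is what makes the output blocks come back in input order.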
