GNU Parallel – grepping n lines for m regular expressions

Tags: gnu-parallel, grep, large-files

The GNU Parallel "grepping n lines for m regular expressions" example states the following:

If the CPU is the limiting factor parallelization should be done on
the regexps:

cat regexp.txt | parallel --pipe -L1000 --round-robin grep -f - bigfile

This will start one grep per CPU and read bigfile one time per CPU,
but as that is done in parallel, all reads except the first will be
cached in RAM.

So in this instance GNU Parallel round-robins the regular expressions from regexp.txt across parallel grep instances, with each grep instance reading bigfile separately. And as the documentation states above, disk caching probably ensures that bigfile is read from disk only once.
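
As a rough illustration of the data flow, here is a manual equivalent (a sketch only, not how GNU Parallel works internally: the real command starts exactly one grep per CPU and deals the 1000-line pattern chunks out to them, whereas this starts one grep per chunk, and the /tmp/rx_ file names are made up):

split -l 1000 regexp.txt /tmp/rx_   # one 1000-line pattern chunk per file
for f in /tmp/rx_*; do
  grep -f "$f" bigfile &            # each grep scans all of bigfile; only the
done                                # first scan actually hits the disk
wait                                # matches arrive in no particular order
rm /tmp/rx_*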

My question is this – the approach above is apparently seen as better, performance-wise, than an alternative that has GNU Parallel round-robin records from bigfile over parallel grep instances that each read regexp.txt, something like

cat bigfile | parallel --pipe -L1000 --round-robin grep -f regexp.txt -

Why would that be? As I see it, assuming disk caching is in play, bigfile and regexp.txt would each be read from disk once in either case. The one major difference I can think of is that the second approach passes significantly more data through the pipes: all of bigfile, rather than just regexp.txt.

Best Answer

It is due to GNU Parallel's --pipe being slow.

cat bigfile | parallel --pipe -L1000 --round-robin grep -f regexp.txt -

maxes out at around 100 MB/s.
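
One way to see that ceiling for yourself (assuming pv is installed; it simply copies its input to stdout while reporting throughput on stderr) is:

pv bigfile | parallel --pipe -L1000 --round-robin grep -f regexp.txt -

The rate pv prints is the rate at which GNU Parallel is able to drain the pipe.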

In the man page example you will also find:

parallel --pipepart --block 100M -a bigfile grep -f regexp.txt

which does close to the same thing, but maxes out at around 20 GB/s on a 64-core system. The difference is that with --pipepart the data never passes through GNU Parallel's own process: each grep reads its block directly from the file, which is why --pipepart takes a seekable file via -a bigfile rather than stdin.

parallel --pipepart --block 100M -a bigfile -k grep -f regexp.txt

should give exactly the same result as grep -f regexp.txt bigfile.
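
A quick way to check that claim (a sketch, assuming a bash shell for the process substitution):

diff <(grep -f regexp.txt bigfile) \
     <(parallel --pipepart --block 100M -a bigfile -k grep -f regexp.txt)

If diff prints nothing, the outputs are identical; -k is what makes the output blocks come back in input order.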
