Shell – How to use GNU parallel effectively

Tags: gnu-parallel, shell

Suppose I want to find all the matches in a compressed text file:

$ gzcat file.txt.gz | pv --rate -i 5 | grep some-pattern

pv --rate is used here to measure pipe throughput. On my machine it's about 420 MB/s (after decompression).

Now I'm trying to do parallel grep using GNU parallel.

$ gzcat file.txt.gz | pv --rate -i 5 | parallel --pipe -j4 --round-robin grep some-pattern

Now the throughput has dropped to ~260 MB/s. And what is more interesting, the parallel process itself is using a lot of CPU: more than the grep processes (but less than gzcat).

EDIT 1: I've tried different block sizes (--block), as well as different values for the -N/-L options. Nothing has helped so far.
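For reference, a block-size tuning attempt like the one mentioned above might look like this (a sketch; `file.txt.gz` and `some-pattern` are the question's placeholders, and `gzip -dc` is the portable equivalent of `gzcat`):

```shell
# Larger --block values amortize GNU parallel's per-chunk bookkeeping,
# at the cost of coarser load balancing across the grep workers.
gzip -dc file.txt.gz |
  parallel --pipe --block 10M -j4 --round-robin grep some-pattern
```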

What am I doing wrong?

Best Answer

I am really surprised you get ~260 MB/s using GNU Parallel's --pipe. My tests usually max out at around 100 MB/s.

Your bottleneck is most likely in GNU Parallel: --pipe is not very efficient. --pipepart, however, is: here I can get on the order of 1 GB/s per CPU core.

Unfortunately there are a few limitations on using --pipepart:

  • The file must be seekable (i.e. no pipe)
  • You must be able to find the start of a record with --recstart/--recend (i.e. no compressed file)
  • The line number is unknown (so you cannot have a 4-line record).

Example:

parallel --pipepart -a bigfile --block 100M grep somepattern
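Since --pipepart cannot read a compressed file directly, one workaround is to decompress once to a seekable temporary file and let --pipepart split that (a sketch; `file.txt.gz` and `some-pattern` are the question's placeholders):

```shell
# Decompress once to a seekable temp file, then let --pipepart
# split it into blocks; this trades disk space for throughput.
tmp=$(mktemp)
gzip -dc file.txt.gz > "$tmp"
parallel --pipepart -a "$tmp" --block 100M grep some-pattern
rm -f "$tmp"
```

The decompression step runs at single-core speed, but every subsequent pass over the data can then use --pipepart's faster splitting.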