Consider the following scenario. I have two programs, A and B. Program A outputs lines of strings to stdout, while program B processes lines from stdin. The way to use these two programs is of course:
foo@bar:~$ A | B
Now I've noticed that this eats up only one core; hence I am wondering:
Are programs A and B sharing the same computational resources? If so, is there a way to run A and B concurrently?
Another thing that I've noticed is that A runs much, much faster than B, hence I am wondering if I could somehow run more B programs and let them process the lines that A outputs in parallel.
That is, A would output its lines, and there would be N instances of program B that would read these lines (whichever reads them first), process them, and output them on stdout.
So my final question is:
Is there a way to pipe the output of A among several B processes without having to take care of race conditions and other inconsistencies that could potentially arise?
Best Answer
A problem with split --filter is that the output can be mixed up, so you get half a line from process 1 followed by half a line from process 2. GNU Parallel guarantees there will be no such mixup.
So assume you want to do:
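The original command here was elided, so the following is a runnable sketch of the serial pipeline, where printf stands in for A and tr a-z A-Z stands in for the slow program B (both are assumptions for illustration):

```shell
# Serial pipeline: all of A's output (produced here by printf)
# flows through a single instance of B (played here by tr),
# so only one core does the heavy work.
printf 'one\ntwo\nthree\n' | tr a-z A-Z
```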
But B is terribly slow, and thus you want to parallelize it. Then you can do:
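Again the exact command was elided; a runnable sketch of the parallelized version, with tr a-z A-Z standing in for B (an assumption), would be:

```shell
# Parallel pipeline: --pipe splits stdin into blocks of complete
# records and feeds each block to its own instance of B on stdin.
# Each job's output is printed whole, so lines are never interleaved.
printf 'one\ntwo\nthree\n' | parallel --pipe tr a-z A-Z
```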
GNU Parallel by default splits on \n with a block size of 1 MB. This can be adjusted with --recend and --block.
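For example, to shrink the block size while still splitting input on newlines, one might write (tr a-z A-Z again standing in for B, as an assumption):

```shell
# --block sets the approximate chunk size handed to each job;
# --recend '\n' ensures chunks are cut only at line boundaries,
# so no instance of B ever receives a partial line.
printf 'one\ntwo\nthree\n' | parallel --pipe --block 1k --recend '\n' tr a-z A-Z
```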
You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/
You can install GNU Parallel in just 10 seconds with:
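The install one-liner from the GNU Parallel documentation fetches and runs a script from pi.dk (as with any curl-to-shell install, inspect the script before running it):

```shell
# Download the GNU Parallel installer with whichever fetcher is
# available and run it; installs into your home directory if you
# lack root.
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
```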
Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1