Bash – Make GNU Parallel not delay before executing arguments from STDIN

bash, fifo, gnu-parallel, parallelism, pipe

GNU Parallel, without any command line options, allows you to easily parallelize a command whose last argument is determined by a line of STDIN:

$ seq 3 | parallel echo
2
1
3

Note that parallel does not wait for EOF on STDIN before it begins executing jobs — running yes | parallel echo will begin printing infinitely many copies of y right away.

This behavior appears to change, however, if STDIN is relatively short:

$ { yes | ghead -n5; sleep 10; } | parallel echo

In this case, no output will be returned before sleep 10 completes.

This is just an illustration — in reality I'm attempting to read from a series of continually generated FIFO pipes where the FIFO-generating process will not continue until the existing pipes start to be consumed. For example, my command will produce a STDOUT stream like:

/var/folders/2b/1g_lwstd5770s29xrzt0bw1m0000gn/T/tmp.PFcggGR55i
/var/folders/2b/1g_lwstd5770s29xrzt0bw1m0000gn/T/tmp.UCpTBzI3J6
/var/folders/2b/1g_lwstd5770s29xrzt0bw1m0000gn/T/tmp.r2EmSLW0t9
/var/folders/2b/1g_lwstd5770s29xrzt0bw1m0000gn/T/tmp.5TRNeeZLmt

Manually cat-ing each of these files one at a time in a new terminal causes the FIFO-generating process to complete successfully. However, running printfifos | parallel cat does not work. Instead, parallel seems to block forever waiting for input on STDIN — if I modify the pipeline to printfifos | head -n4 | parallel cat, the deadlock disappears and the first four pipes are printed successfully.
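
To illustrate the situation, here is a minimal sketch of the kind of generator described (the names and data are assumed, not taken from the real printfifos):

# Create FIFOs one at a time; announce each path, then block until a reader
# consumes it before moving on to the next one.
for i in 1 2 3 4 5; do
    fifo=$(mktemp -u)                     # reserve a unique path
    mkfifo "$fifo"                        # turn it into a named pipe
    echo "$fifo"                          # announce the path on STDOUT
    echo "data for pipe $i" > "$fifo"     # blocks until something reads the FIFO
    rm "$fifo"
done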

This behavior seems to be connected to the --jobs|-j parameter. Whereas { yes | ghead -n5; sleep 10; } | parallel cat produces no output for 10 seconds, adding a -j1 option yields four lines of y almost immediately followed by a 10 second wait for the final y. Unfortunately this does not solve my problem — I need every argument to be processed before parallel can get EOF from reading STDIN. Is there any way to achieve this?
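
For reference, the -j1 variant described above (a restatement of the command from the question, not new behavior):

$ { yes | ghead -n5; sleep 10; } | parallel -j1 cat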

Best Answer

There is a bug in GNU Parallel: it only starts processing after having read one job for each jobslot. After that it reads one job at a time.

In older versions the output will also be delayed by the number of jobslots. Newer versions only delay output by a single job.

So if you sent one job per second to parallel -j10, it would read 10 jobs before starting any of them. With older versions you would then have to wait an additional 10 seconds before seeing the output from job 3.
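
As an illustration of the read-ahead (the command here is an assumed example, not from the answer), feeding one job per second to parallel -j4 produces no output for roughly four seconds, because four lines must be read before the first job starts:

for i in $(seq 8); do echo "$i"; sleep 1; done | parallel -j4 echo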

A workaround for the start-up limitation is to feed one dummy job per jobslot to parallel:

# Start a queue reader: parallel runs each full line appended to jobqueue.
true >jobqueue; tail -n+0 -f jobqueue | parallel &
# Seed one dummy job ('true') per jobslot so parallel starts immediately.
seq $(parallel --number-of-threads) | parallel -N0 echo true >> jobqueue
# now add the real jobs to jobqueue
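
For the FIFO case above, the real jobs have to be complete command lines, since the queue-reading parallel executes each line as a command. A sketch of an assumed adaptation (printfifos and the sed transformation are illustrative):

# Turn each announced FIFO path into a 'cat <path>' job line.
printfifos | sed 's/^/cat /' >> jobqueue
# The queue only ends when the background tail -f is terminated, e.g. kill %1.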

A workaround for the output delay is to use --linebuffer (but this will mix full lines from different jobs).
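
A short illustration of --linebuffer (assumed example): without it, each job's output is held until the job finishes; with it, lines are passed through as they are produced, so output from different jobs can interleave line by line:

seq 3 | parallel --linebuffer 'echo start {}; sleep 1; echo end {}'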
