Bash – Parallel processes: appending outputs to an array in a bash script

bash, linux, shell-script

I have a for loop that calls a function, task. Each call returns a string that is appended to an array. I would like to parallelise this loop. I tried using &, but it does not work.

Here is the non-parallelised code:

task (){ sleep 1;echo "hello $1"; }
arr=()

for i in {1..3}; do
    arr+=("$(task "$i")")
done

for i in "${arr[@]}"; do
    echo "$i x";
done

The output is:

hello 1 x
hello 2 x
hello 3 x

Great! But now, when I try to parallelise it with

[...]
for i in {1..3}; do
    arr+=("$(task $i)")&
done
wait
[...]

the output is empty.

UPDATE #1

Regarding the task function:

  • The function task takes some time to run and then outputs one string. Once all the strings have been gathered, another for loop iterates over them and performs some other task.
  • The order does not matter. Each output string is a single line, possibly containing several words separated by whitespace.

Best Answer

You can't send an assignment to the background: the background process is a fork of the shell, so any changes it makes to the variable are not visible back in the main shell.
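A minimal sketch of that effect, assuming bash; the strings stored in the array are made up for illustration. The element appended with & is added to the forked copy's array only, so the main shell ends up with a single element:

```shell
#!/usr/bin/env bash
# Sketch: an assignment followed by & runs in a forked copy of the shell.
arr=()

arr+=("set in background") &   # changes the child's copy of arr only
wait                           # the child has exited; its copy is gone

arr+=("set in foreground")     # changes arr in the main shell

echo "${#arr[@]}"   # prints: 1
echo "${arr[0]}"    # prints: set in foreground
```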

But you can run a bunch of tasks in parallel, have them all write to a pipe, and then read whatever comes out. Better still, use process substitution, which avoids the problem of the commands in a pipeline being executed in subshells (see Why is my variable local in one 'while read' loop, but not in another seemingly similar loop?).

As long as each output is a single line written atomically, the lines won't get intermixed, but they might be reordered:

$ task() { sleep 1; echo "$1"; }
$ time while read -r line; do arr+=("$line"); done < <(for x in 1 2 3 ; do task "$x" & done)
real    0m1.006s
$ declare -p arr
declare -a arr=([0]="2" [1]="1" [2]="3")
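Applied to the script from the question, the same idea might look like the sketch below. It assumes bash and that each task prints exactly one short line; the loop variable names are arbitrary. The order of the three output lines can differ between runs.

```shell
#!/usr/bin/env bash
# Sketch: collect the output of background tasks into an array.
task() { sleep 1; echo "hello $1"; }

arr=()
# The tasks run in the background inside the process substitution.
# The while loop itself runs in the main shell, so arr is updated there.
while IFS= read -r line; do
    arr+=("$line")
done < <(for i in {1..3}; do task "$i" & done)

# Second loop from the question, working on the gathered strings.
for s in "${arr[@]}"; do
    echo "$s x"
done
```

The read loop ends when every background task has finished and closed its end of the pipe, so no explicit wait is needed in the main shell.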

The above runs all the tasks at the same time. There's also GNU parallel (and -P in GNU xargs), which is meant exactly for running tasks in parallel and limits how many run at the same time. Parallel also buffers the output of each task, so you don't get intermixed data even if a task writes its lines in parts.

$ mapfile -t arr < <(parallel -j4 bash ./task.sh ::: {a,b,c})
$ declare -p arr
declare -a arr=([0]="a" [1]="b" [2]="c")

(Bash's mapfile here reads the input lines into the array, similarly to the while read .. arr+=() loop above.)

Running an external script as above is straightforward, but you can also have it run an exported function. Of course, all the tasks run in independent copies of the shell, so each has its own copy of every variable.

$ export -f task
$ mapfile -t arr < <(parallel task ::: {a,b,c})

The above example happened to keep a, b, and c in order, but that's a coincidence. Use parallel -k to have it make sure the outputs are kept in order.
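For the -P option of GNU xargs mentioned earlier, a rough equivalent could look like this. It is a sketch, assuming bash and an xargs that supports -P; the inline sh -c command stands in for the real task and the limit of two jobs is arbitrary. Unlike parallel, xargs does not group the output of each job, so this relies on every task writing one short line at a time, and the order of the lines is not guaranteed.

```shell
#!/usr/bin/env bash
# Sketch: run at most two tasks at a time with xargs -P.
# -I{} runs one command per input line and substitutes the line for {};
# the item is passed to sh as the positional parameter $1.
mapfile -t arr < <(
    printf '%s\n' 1 2 3 |
        xargs -P 2 -I{} sh -c 'sleep 1; echo "hello $1"' _ {}
)

declare -p arr
```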
