Creating a single output stream out of three other streams produced in parallel

parallelism, pipe, split, text-processing

I have three kinds of data that are in different formats; for each data type, there is a Python script that transforms it into a single unified format.

This Python script is slow and CPU-bound (to a single core on a multi-core machine), so I want to run three instances of it – one for each data type – and combine their output to pass it into sort. Basically, equivalent to this:

{ ./handle_1.py; ./handle_2.py; ./handle_3.py; } | sort -n

But with the three scripts running in parallel.
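For reference, a plain shell construct already comes close: run the three handlers as background jobs inside one group, so their combined stdout feeds a single pipe into sort. This is a sketch with a hypothetical stand-in `handle` function in place of the real scripts; it only keeps lines intact as long as each handler writes one whole line per write and lines stay under PIPE_BUF bytes.

```shell
#!/bin/bash
# Hypothetical stand-in for the real ./handle_N.py scripts.
handle() { printf '%s-sample\n' "$1"; }

# Run all three concurrently inside one group; their combined stdout
# flows into a single pipe, and `wait` keeps the group open until
# every background job has finished.
{
  handle 1 &
  handle 2 &
  handle 3 &
  wait
} | sort -n
```

With short, newline-terminated writes this usually works in practice, but POSIX only guarantees atomic pipe writes up to PIPE_BUF bytes, so very long lines could still be spliced together.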

I found this question where GNU split was being used to round-robin some stdout stream between n instances of a script that handles the stream.

From the split man page:

-n, --number=CHUNKS
        generate CHUNKS output files.  See below

CHUNKS may be:
 N       split into N files based on size of input
 K/N     output Kth of N to stdout
 l/N     split into N files without splitting lines
 l/K/N   output Kth of N to stdout without splitting lines
 r/N     like 'l' but use round robin distribution

So the r/N chunk spec, being "like 'l'", implies "without splitting lines".
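That round-robin, line-preserving distribution is easy to check on a toy input (my demo, not from the question; it needs a GNU coreutils split new enough to support -n, i.e. version 8.8 or later):

```shell
# seq emits six lines; r/3 deals whole lines round-robin to three
# concurrent --filter processes, each of which tags what it received.
# Filter 1 gets lines 1 and 4, filter 2 gets 2 and 5, filter 3 gets 3 and 6.
seq 6 | split -n r/3 --filter='sed "s/^/chunk:/"'
```

Every input line comes out exactly once with a chunk: prefix, though the relative order of the three filters' output is not guaranteed.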

Based on this, it seems like the following solution should be feasible:

split -n r/3 -u --filter="./choose_script" << EOF
> 1
> 2
> 3
> EOF

Where choose_script does this:

#!/bin/bash
# Read the first line of this chunk (1, 2 or 3) and run the matching handler.
read -r x
./handle_"$x".py

Unfortunately, I see some intermingling of lines – and lots of newlines that shouldn't be there.

For example, if I replace my Python scripts with some simple bash scripts that do this:

#!/bin/bash
# ./handle_1.sh
while true; do echo "1-$RANDOM"; done


#!/bin/bash
# ./handle_2.sh
while true; do echo "2-$RANDOM"; done


#!/bin/bash
# ./handle_3.sh
while true; do echo "3-$RANDOM"; done

I see this output:

1-8394

2-11238
2-22757
1-723
2-6669
3-3690
2-892
2-312511-24152
2-9317
3-5981

This is annoying – based on the man page extract I pasted above, it should maintain line integrity.

Obviously it works if I remove the -u argument, but then the output is buffered and I'll run out of memory as split buffers the output of all but one of the scripts.

If anyone has some insight here it'd be greatly appreciated. I'm out of my depth here.

Best Answer

Try using the -u option of GNU parallel.

printf '%s\n' 1 2 3 | parallel -u -IX ./handle_X.sh

This runs the three handlers in parallel, without buffering the entirety of any one process's output.
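If -u still intermingles lines for you, recent versions of GNU parallel also offer --line-buffer, which passes output on as soon as full lines are available: lines from different jobs are never spliced together, yet no job's output is held back until it exits. A sketch, assuming your parallel is new enough to support the option:

```shell
# --line-buffer forwards each job's output line by line, instead of
# either interleaving raw chunks (-u) or buffering a job to completion.
printf '%s\n' 1 2 3 | parallel --line-buffer -IX ./handle_X.sh | sort -n
```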