How to wait for execution of parallelized processes and stitch together the outputs

sed split

I'm quite new to doing things on Unix and am looking to write a script that does the following, in order:

  • Take main .tsv file, split into X number of files with Y lines each
  • Run each split file through a program, which outputs a new .tsv file upon completion
  • Wait until ALL split files have completed processing, then stitch output files together into one.

I know about using split and sed for splitting files, and I can't imagine getting the split files to run through a Python script is hard either, but the problem is finding out when all executions of the parallelized programs are complete, and THEN stitching their outputs together into one.

I know that split auto-increments the output file names and that the processing can be mass-parallelized, as seen in this SO question, so I could figure that part out. Is there a way to check the execution status of a group of parallelized Python scripts? How could I accomplish what I'd like to do?

Best Answer

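# Split main.tsv into pieces of Y lines each: main_part_aa, main_part_ab, ...
split -l "$Y" main.tsv main_part_
for part in main_part_*; do
    # Start each job in the background so the pieces are processed in parallel.
    program "$part" &
done
# Block until every background job started by this shell has exited.
wait
echo "all done"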

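wait is a shell builtin. Called with no arguments, it blocks until all background jobs started by the current shell have finished. Run help wait in bash, or check the bash man page, for details.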
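To also cover the stitching step from the question, here is a minimal sketch that builds on the same split / background / wait pattern. It is not a definitive implementation. The names process_part and split_run_join are hypothetical, and process_part is only a placeholder (it upper-cases its input) standing in for the real program. The sketch assumes that program can be told where to write its output, and that the input is non-empty.

```shell
#!/usr/bin/env bash
# Sketch only: process_part is a hypothetical stand-in for the real program.
process_part() {
    # Placeholder work: upper-case the part ($1) and write the result to $2.
    tr 'a-z' 'A-Z' < "$1" > "$2"
}

# Usage: split_run_join INPUT LINES_PER_PART OUTPUT
split_run_join() {
    local input=$1 lines=$2 output=$3
    local workdir part pid failed=0
    local pids=()
    workdir=$(mktemp -d) || return 1

    # split names the pieces part_aa, part_ab, ... so a glob lists them in order.
    split -l "$lines" "$input" "$workdir/part_"

    # The glob is expanded once, before any .out file exists.
    for part in "$workdir"/part_*; do
        process_part "$part" "$part.out" &
        pids+=("$!")            # remember the PID of each background job
    done

    # Waiting on each PID individually exposes that job's exit status.
    for pid in "${pids[@]}"; do
        wait "$pid" || failed=1
    done

    # Stitch the outputs together only if every job succeeded.
    if [ "$failed" -eq 0 ]; then
        cat "$workdir"/part_*.out > "$output"
    fi
    rm -rf "$workdir"
    return "$failed"
}
```

A call such as split_run_join main.tsv "$Y" combined.tsv would then produce one combined file. Waiting on each PID rather than calling a bare wait is a design choice: a bare wait returns success even if one of the jobs failed, whereas wait PID returns that job's exit status. If the .tsv has a header row, note that only the first part will contain it.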
