Bash – Executing commands consecutively on multiple folders

bash gnu-parallel parallelism

I have a parent folder named "parent". Inside this folder I have subfolders and a file named "names.txt". The layout is as follows:

Parent_folder
folder1
folder2
folder3
folder4
.
.
.
.
names.txt

The content of the file "names.txt" is as follows:

folder1
folder2
folder3
folder4
.
.
.

Inside every folder I have images, and I want to apply 10 scripts consecutively to every image (each script must finish its job in every folder before the next script is run). The scripts have different names and they all live in one folder; I set up an environment by sourcing a file, after which I can call the scripts by name from the terminal. At the same time I want to apply this process to all the folders at once, i.e. when script #1 is running, I want it to be running on all the folders at the same time; when it is done, script #2 should start in all the folders at once, and so on…
In order to achieve this I wrote the following code:

#!/bin/bash
path=PATH/TO/THE/PARENT/FOLDER
for i in $(cat $path/names.txt); do
    {
    script#1
    } &
    {
    script#2
    } &
    .
    .
    .
done

This code does not work as intended: all the commands run at once. I want each command to run on all the folders at once, but the commands themselves to run consecutively.
What am I doing wrong?

Best Answer

First, create a wrapper script that changes to the directory given in the first (and only) command-line argument, performs whatever setup/variable-initialisation/etc it needs, and then runs your 10 scripts in sequence with whatever args they need.

For example, if each script processes all .jpg, .png, and .gif files in the directory:

#! /bin/bash
# example-wrapper.sh

cd "$1"

script1 *.{jpg,png,gif}
script2 *.{jpg,png,gif}
script3 *.{jpg,png,gif}
script4 *.{jpg,png,gif}
script5 *.{jpg,png,gif}
script6 *.{jpg,png,gif}
script7 *.{jpg,png,gif}
script8 *.{jpg,png,gif}
script9 *.{jpg,png,gif}
script10 *.{jpg,png,gif}
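
If you'd rather not repeat that line ten times, and you want a folder's processing to stop as soon as one script fails, a loop variant of the wrapper works too (a sketch, assuming the scripts really are named script1 through script10):

#! /bin/bash
# example-wrapper-loop.sh (sketch)

# enter the folder passed as the only argument; give up if that fails
cd "$1" || exit 1

# run the ten scripts in order, aborting this folder if any of them fails
for n in $(seq 1 10); do
    "script$n" *.{jpg,png,gif} || exit 1
done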

Next, use find to pipe a list of directories into parallel.

find /path/to/parent/ -mindepth 1 -type d -print0 | 
  parallel -0 -n 1 ./example-wrapper.sh

(the -mindepth 1 option in find excludes the top level directory, i.e. the parent directory itself)
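
Since you already have a names.txt listing the subfolders, you could instead drive parallel from that file (a sketch, assuming one folder name per line as shown in the question; the path is a placeholder):

path=/path/to/parent
parallel -a "$path/names.txt" ./example-wrapper.sh "$path/{}"

Here -a (--arg-file) reads one argument per line and {} is parallel's default replacement string.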

By default, parallel will run one instance (a "job") of ./example-wrapper.sh for each CPU core you have. Each instance will get ONE (-n 1) directory name. As soon as a job has finished, another is started (if there are any remaining jobs to run).

This makes maximal use of available CPU power, without letting jobs compete with each other for CPU time.

You can use parallel's -j option to tune the number of jobs to run at once. For CPU-intensive tasks, the default of one job per system core is probably what you want.

If your jobs aren't very CPU-intensive but tend to be more I/O-bound, you may want to run 2 or 3 jobs per core, depending on how large your input files are, how fast your storage is, and what kind of devices make up that storage. SSDs don't suffer from seek latency, so they won't be slowed down by multiple processes reading data from all over the disk; hard disks do suffer from seek times and WILL slow down if made to seek randomly all over the place (Linux's disk buffering/caching helps, but won't eliminate the problem).

If you want to get other work done (e.g. normal desktop usage) while these jobs are running, use -j to tell parallel to use one or two fewer cores than your system has (e.g. -j 6 on an 8-core system).
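
For example (illustrative values only; check man parallel for the exact -j forms your version supports):

# roughly two jobs per core, for I/O-bound scripts
find /path/to/parent/ -mindepth 1 -type d -print0 |
  parallel -0 -n 1 -j 200% ./example-wrapper.sh

# leave two cores free for interactive desktop use
find /path/to/parent/ -mindepth 1 -type d -print0 |
  parallel -0 -n 1 -j -2 ./example-wrapper.sh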

NOTE: Tuning parallel processes is a fine art and can take some experimenting to get the best results.

Anyway, from man parallel:

--jobs N, -j N, --max-procs N, -P N

Number of jobslots. Run up to N jobs in parallel. 0 means as many as possible. Default is 100% which will run one job per CPU core.

If --semaphore is set default is 1 thus making a mutex.

This is really basic and primitive use of parallel. It can do a lot more. See the man page for details.

BTW, xargs also has a -P option for running jobs in parallel. For simple usage like this, it makes little difference whether you use xargs -P or parallel. But if your requirements are more complicated, use parallel.
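
For the record, a rough xargs equivalent of the pipeline above (assuming GNU xargs, which supports -0 and -P) would be:

find /path/to/parent/ -mindepth 1 -type d -print0 |
  xargs -0 -n 1 -P 4 ./example-wrapper.sh

Note that xargs has no per-core default for -P, so you have to pick the job count yourself.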

parallel should be packaged for most Linux distros; otherwise, it's available from https://www.gnu.org/software/parallel/
