I have a parent folder, "parent". Inside this folder I have subfolders and a file named "names.txt" that lists the names of these subfolders. The layout is as follows:
Parent_folder
folder1
folder2
folder3
folder4
.
.
.
.
names.txt
The content of the file "names.txt" is as follows:
folder1
folder2
folder3
folder4
.
.
.
Inside every folder I have images, and I want to apply 10 scripts consecutively to every image (every script must finish its job inside every folder before the next script runs). These scripts have different names and all live in one folder; I set up an environment by sourcing a file, after which I can call the scripts by name from the terminal.
At the same time, I want to apply this process to all the folders at once. I.e., while script #1 is running, I want it to be running on all the folders at the same time; when it is done, script #2 should start, again in all the folders at once, and so on…
In order to achieve this I wrote the following code:
#!/bin/bash
path=PATH/TO/THE/PARENT/FOLDER
for i in $(cat $path/names.txt); do
    {
        script#1
    } &
    {
        script#2
    } &
    .
    .
    .
done
This code is not working as intended: all the commands run at once. I want each script to run on all the folders simultaneously, but the scripts themselves to run consecutively (script #1 everywhere, then script #2 everywhere, and so on).
What am I doing wrong?
Best Answer
First, create a wrapper script that changes to the directory given in the first (and only) command-line argument, performs whatever setup/variable-initialisation/etc it needs, and then runs your 10 scripts in sequence with whatever args they need.
For example, if each script processes all .jpg, .png, and .gif files in the directory:
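The example script itself wasn't preserved here, so the following is only a sketch of such a wrapper. Every name in it (the environment file, script1, script2, …) is an illustrative placeholder — substitute your own:

```shell
#!/bin/bash
# example-wrapper.sh — run the 10 scripts, in order, inside the folder
# given as the first command-line argument.

run_all_scripts() (
    # Subshell body, so the cd doesn't affect the caller.
    cd "$1" || return 1

    # Source the environment file so the scripts can be called by name.
    # (Hypothetical default — replace with the file you normally source.)
    . "${ENV_SETUP:-/dev/null}"

    # Each script must finish before the next one starts, so a plain
    # sequential loop is all the ordering logic that's needed.
    for script in script1 script2; do    # …continue through script10
        # Assuming each script finds the .jpg/.png/.gif files in the
        # current directory itself; pass filenames here if yours
        # expect them as arguments instead.
        "$script"
    done
)

# When invoked as a script (rather than sourced), process "$1":
if [ -n "${1:-}" ]; then
    run_all_scripts "$1"
fi
```

Save it as, say, example-wrapper.sh, make it executable with chmod +x, and test it by hand on one folder before parallelising.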
Next, use find to pipe a list of directories into parallel. (The -mindepth 1 option to find excludes the top-level directory, i.e. the parent directory itself.)
By default, parallel will run one instance (a "job") of ./example-wrapper.sh for each CPU core you have. Each instance will get ONE (-n 1) directory name. As soon as a job has finished, another is started (if there are any remaining jobs to run). This makes maximal use of the available CPU power, without letting jobs compete with each other for CPU time.
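The original pipeline wasn't preserved above; a sketch of it (assuming the wrapper is saved as ./example-wrapper.sh, and adding -maxdepth 1 — an assumption beyond what's stated — so find doesn't descend into nested subdirectories) could be:

```shell
# Pipe every first-level subdirectory of the current directory into
# parallel, one directory name (-n 1) per job. -mindepth 1 skips the
# parent directory itself; -print0/-0 protect names with spaces.
process_all_folders() {
    find . -mindepth 1 -maxdepth 1 -type d -print0 |
        parallel -0 -n 1 ./example-wrapper.sh
}
```

It's wrapped in a function here only so it's easy to paste and reuse; the pipeline alone works the same typed directly at the command line (run from the parent directory).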
You can use parallel's -j option to tune the number of jobs to run at once. For CPU-intensive tasks, the default of one job per core is probably what you want.
If your jobs aren't very CPU-intensive but tend to be more I/O-bound, you may want to run 2 or 3 jobs for every core you have (depending on how large your input files are, how fast your storage is, and what kind of devices make up that storage; e.g. SSDs don't suffer from seek latency, so they won't be slowed down by multiple processes seeking data from all over the disk. Hard disks do suffer from seek times and WILL slow down if made to seek randomly all over the place. Linux's disk buffering/caching will help, but won't eliminate the problem).
If you want to get other work done (e.g. normal desktop usage) while these jobs are running, use -j to tell parallel to use one or two fewer cores than your system has (e.g. -j 6 on an 8-core system).
NOTE: Tuning parallel processes is a fine art, and it can take some experimenting to get the best results.
This is really basic and primitive use of parallel. It can do a lot more; see man parallel for details.
BTW, xargs also has a -P option for running jobs in parallel. For simple usage like this, it makes little difference whether you use xargs -P or parallel. But if your requirements are more complicated, use parallel.
parallel should be packaged for most Linux distros; otherwise it's available from https://www.gnu.org/software/parallel/
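A sketch of the xargs variant, under the same assumptions (the wrapper saved as ./example-wrapper.sh in the parent directory; the job count of 4 is an arbitrary example — xargs has no "one job per core" default, so -P must be given explicitly):

```shell
# Run up to 4 wrapper instances at once (-P 4), one directory name per
# instance (-n 1); -print0/-0 keep names with spaces intact.
run_with_xargs() {
    find . -mindepth 1 -maxdepth 1 -type d -print0 |
        xargs -0 -n 1 -P 4 ./example-wrapper.sh
}
```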