Shell – query re: running for-loop scripts in parallel

parallelism shell-script

I have the following in a shell script:

for file in "$local_dir"/myfile.log.*;
    do
        file_name=$(basename "$file");
        server_name=$(echo "$file_name" | cut -f 3 -d '.');

        mv "$file" "$local_dir/in_progress1.log"

        mysql -hxxx -P3306 -uxxx -pxxx -e "set @server_name='${server_name}'; source ${sql_script};"

        rm "$local_dir/in_progress1.log"
    done

It basically gets all files in a directory that match the pattern, extracts a server name from each filename, and passes it across to a MySQL script for processing.

What I am wondering is if I have 10 files that take 60 seconds each to complete, and after 5 minutes I then start a second instance of the shell script:

  • a) will the second instance still see the files that haven't been processed?
  • b) will it cause problems for the first instance if the second deletes files?

or will I be able to run them in parallel without issue?

Best Answer

One would assume that "60 seconds" (and even "5 minutes") is just an estimate, and that there is a risk the first batch is still in progress when the second batch starts. If you want to separate the batches (and if an occasional overlap causes no problem beyond the in-progress log files), a better approach would be to make a batch number part of the in-progress file-naming convention.

Something like this:

[[ -s $local_dir/batch ]] || echo 0 > "$local_dir/batch"
batch=$(cat "$local_dir/batch")
expr "$batch" + 1 > "$local_dir/batch"
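One caveat worth noting: the read-and-increment above is not atomic, so two instances launched at nearly the same moment could still read the same number. If that matters, a sketch along these lines, assuming util-linux `flock(1)` is available, serializes the counter update (`local_dir` is pointed at a throwaway directory here purely for illustration):

```shell
# Sketch only: serialize the counter update so two concurrent instances
# never read the same batch number. Assumes bash and util-linux flock(1).
local_dir=$(mktemp -d)          # stand-in for the real directory

exec 9> "$local_dir/batch.lock" # open a dedicated lock file on fd 9
flock 9                         # block until this instance holds the lock
[[ -s $local_dir/batch ]] || echo 0 > "$local_dir/batch"
batch=$(cat "$local_dir/batch")
expr "$batch" + 1 > "$local_dir/batch"
flock -u 9                      # release the lock for the next instance

echo "this instance is batch $batch"
```

The lock is held only around the counter update, not the whole run, so the instances still process files concurrently.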

before the for-loop, and then at the start of the loop, check that your pattern matches an actual file

[[ -f "$file" ]] || continue

and use the batch number in the filename:

mv $file_location $local_dir/in_progress$batch.log

and so forth. That reduces the risk of collision.
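Putting the pieces together, the whole loop might look like the sketch below. It is only an illustration: the `mysql` call is left as a comment (replaced by an `echo`) so the sketch is self-contained, and `local_dir` plus the sample file names are made up.

```shell
#!/bin/bash
# Sketch of the revised script: batch-numbered in-progress files plus a
# guard against an empty glob. Sample data stands in for the real logs.
local_dir=$(mktemp -d)
touch "$local_dir/myfile.log.serverA" "$local_dir/myfile.log.serverB"

# Pick up (and bump) the batch number once, before the loop.
[[ -s $local_dir/batch ]] || echo 0 > "$local_dir/batch"
batch=$(cat "$local_dir/batch")
expr "$batch" + 1 > "$local_dir/batch"

for file in "$local_dir"/myfile.log.*; do
    [[ -f $file ]] || continue              # glob matched nothing, or lost a race
    file_name=$(basename "$file")
    server_name=$(echo "$file_name" | cut -f 3 -d '.')

    in_progress=$local_dir/in_progress$batch.log
    mv "$file" "$in_progress" || continue   # another instance may have grabbed it

    # Real script would run the MySQL step here:
    # mysql -hxxx -P3306 -uxxx -pxxx -e "set @server_name='${server_name}'; source ${sql_script};"
    echo "batch $batch processing $server_name"

    rm "$in_progress"
done
```

The `mv || continue` line also helps with question (b): if another instance renames a file first, this instance simply skips it rather than failing mid-loop.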
