Shell – query re: running for-loop scripts in parallel

parallelism shell-script

I have the following in a shell script:

for file in "$local_dir"/myfile.log.*;
    do
        file_name=$(basename "$file");
        server_name=$(echo "$file_name" | cut -f 3 -d '.');

        mv "$file" "$local_dir/in_progress1.log"

        mysql -hxxx -P3306 -uxxx -pxxx -e "set @server_name='${server_name}'; source ${sql_script};"

        rm "$local_dir/in_progress1.log"
    done

It basically gets all files in a directory that match the pattern, extracts a server name from each filename, and passes it across to a MySQL script for processing.

What I am wondering is if I have 10 files that take 60 seconds each to complete, and after 5 minutes I then start a second instance of the shell script:

  • a) will the second instance still see the files that haven't been processed?
  • b) will it cause problems for the first instance if the second deletes files?

or will I be able to run them in parallel without issue?

Best Answer

One would assume that "60 seconds" (and even "5 minutes") is just an estimate, and that there is a risk the first batch is still in progress when the second batch starts. If you want to separate the batches (and if an occasional overlap causes no problem beyond the in-progress log files), a better approach would be to make a batch number part of the in-progress file-naming convention.

Something like this:

[[ -s $local_dir/batch ]] || echo 0 > "$local_dir/batch"
batch=$(cat "$local_dir/batch")
expr "$batch" + 1 > "$local_dir/batch"
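One caveat worth noting: the read-and-increment above is not atomic, so two instances launched at nearly the same moment could still read the same number. If that matters, a sketch along these lines, assuming util-linux `flock(1)` is available, serializes the counter update (`local_dir` is pointed at a throwaway directory here purely for illustration):

```shell
# Sketch only: serialize the counter update so two concurrent instances
# never read the same batch number. Assumes bash and util-linux flock(1).
local_dir=$(mktemp -d)          # stand-in for the real directory

exec 9> "$local_dir/batch.lock" # open a dedicated lock file on fd 9
flock 9                         # block until this instance holds the lock
[[ -s $local_dir/batch ]] || echo 0 > "$local_dir/batch"
batch=$(cat "$local_dir/batch")
expr "$batch" + 1 > "$local_dir/batch"
flock -u 9                      # release the lock for the next instance

echo "this instance is batch $batch"
```

The lock is held only around the counter update, not the whole run, so the instances still process files concurrently.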

before the for-loop, and then at the start of the loop, check that your pattern matches an actual file

[[ -f "$file" ]] || continue

and use the batch number in the filename:

mv $file_location $local_dir/in_progress$batch.log

and so forth. That reduces the risk of collision.
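Putting the pieces together, the whole loop might look like the sketch below. It is only an illustration: the `mysql` call is left as a comment (replaced by an `echo`) so the sketch is self-contained, and `local_dir` plus the sample file names are made up.

```shell
#!/bin/bash
# Sketch of the revised script: batch-numbered in-progress files plus a
# guard against an empty glob. Sample data stands in for the real logs.
local_dir=$(mktemp -d)
touch "$local_dir/myfile.log.serverA" "$local_dir/myfile.log.serverB"

# Pick up (and bump) the batch number once, before the loop.
[[ -s $local_dir/batch ]] || echo 0 > "$local_dir/batch"
batch=$(cat "$local_dir/batch")
expr "$batch" + 1 > "$local_dir/batch"

for file in "$local_dir"/myfile.log.*; do
    [[ -f $file ]] || continue              # glob matched nothing, or lost a race
    file_name=$(basename "$file")
    server_name=$(echo "$file_name" | cut -f 3 -d '.')

    in_progress=$local_dir/in_progress$batch.log
    mv "$file" "$in_progress" || continue   # another instance may have grabbed it

    # Real script would run the MySQL step here:
    # mysql -hxxx -P3306 -uxxx -pxxx -e "set @server_name='${server_name}'; source ${sql_script};"
    echo "batch $batch processing $server_name"

    rm "$in_progress"
done
```

The `mv || continue` line also helps with question (b): if another instance renames a file first, this instance simply skips it rather than failing mid-loop.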
