I have a bash script that takes as input three arrays with equal length: METHODS, INFILES and OUTFILES.

The script lets METHODS[i] solve problem INFILES[i] and save the result to OUTFILES[i], for all indices i (0 <= i <= length-1).

Each element in METHODS is a string of the form:

$HOME/program/solver -a <method>

where solver is a program that can be called as follows:

$HOME/program/solver -a <method> -m <input file> -o <output file> --timeout <timeout in seconds>

The script solves all the problems in parallel and sets the runtime limit for each instance to 1 hour (some methods can solve some problems very quickly, though), as follows:

source METHODS
source INFILES

start=`date +%s`

## Solve in PARALLEL
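for index in ${!OUTFILES[*]}; do
    # Run each instance in its own background subshell
    (alg=${METHODS[$index]}
    infile=${INFILES[$index]}
    outfile=${OUTFILES[$index]}
    # ${!alg} is indirect expansion: alg must hold the NAME of a variable that contains the command
    ${!alg}  -m "$infile" -o "$outfile" --timeout 3600) &
done

# Wait for every background job before stopping the clock
wait

end=`date +%s`
runtime=$((end-start))

echo "Total runtime = $runtime (s)"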
echo "Total number of processes = ${#OUTFILES[@]}"

In the above, length = 619. I submitted this bash script to a cluster with 70 available processors, so it should take at most 9 hours to finish all the tasks. This is not the case in reality, however. When investigating with the top command, I found that only two or three processes are running (state = R) while all the others are in uninterruptible sleep (state = D).
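The 9-hour figure can be checked with a quick calculation. This is only a sketch of the arithmetic, assuming the jobs run in batches of 70, that every job uses its full 1-hour timeout, and that CPU is the only limiting resource (the variable names are just for illustration):

```shell
#!/usr/bin/env bash
tasks=619          # number of (method, input) pairs
procs=70           # available processors
timeout_hours=1    # --timeout 3600

# Ceiling division: number of batches needed when 70 jobs run at a time
batches=$(( (tasks + procs - 1) / procs ))
max_hours=$(( batches * timeout_hours ))

echo "batches = $batches, worst case = $max_hours hours"
```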

What am I doing wrong please?

Furthermore, I have learnt that GNU parallel would be much better for running parallel jobs. How can I use it for the above task?

Thank you very much for your help!

Update: My first try with GNU parallel:

The idea is to write all the commands to a file and then use GNU parallel to execute them:

source METHODS
source INFILES

start=`date +%s`    

## Write to file
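firstline=true
for index in ${!OUTFILES[*]}; do
    alg=${METHODS[$index]}
    infile=${INFILES[$index]}
    outfile=${OUTFILES[$index]}
    if [ "$firstline" = true ] ; then
        # First command: create (or truncate) the file
        echo "${!alg}  -m $infile -o $outfile --timeout 3600" > commands.txt
        firstline=false
    else
        # Remaining commands: append
        echo "${!alg}  -m $infile -o $outfile --timeout 3600" >> commands.txt
    fi
done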

## Solve in PARALLEL
time parallel :::: commands.txt

end=`date +%s`
runtime=$((end-start))

echo "Total runtime = $runtime (s)"
echo "Total number of processes = ${#OUTFILES[@]}"

What do you think?

Update 2: I'm using GNU parallel and having the same problem. Here's the output of top:

top - 02:05:25 up 178 days,  8:16,  2 users,  load average: 62.59, 59.90, 53.29
Tasks: 596 total,   7 running, 589 sleeping,   0 stopped,   0 zombie
Cpu(s): 12.9%us,  0.9%sy,  0.0%ni, 63.3%id, 22.9%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  264139632k total, 260564864k used,  3574768k free,     4564k buffers
Swap: 268420092k total, 80593460k used, 187826632k free,    53392k cached

28542 khue     20   0 7012m 5.6g 1816 R  100  2.2  12:50.22 opengm_min_sum
28553 khue     20   0 11.6g  11g 1668 R  100  4.4  17:37.37 opengm_min_sum
28544 khue     20   0 13.6g 8.6g 2004 R  100  3.4  12:41.67 opengm_min_sum
28549 khue     20   0 13.6g 8.7g 2000 R  100  3.5   2:54.36 opengm_min_sum
28551 khue     20   0 11.6g  11g 1668 R  100  4.4  19:48.36 opengm_min_sum
28528 khue     20   0 6934m 4.9g 1732 R   29  1.9   1:01.13 opengm_min_sum
28563 khue     20   0 7722m 6.7g 1680 D    2  2.7   0:56.74 opengm_min_sum
28566 khue     20   0 8764m 7.9g 1680 D    2  3.1   1:00.13 opengm_min_sum
28530 khue     20   0 5686m 4.8g 1732 D    1  1.9   0:56.23 opengm_min_sum
28534 khue     20   0 5776m 4.6g 1744 D    1  1.8   0:53.46 opengm_min_sum
28539 khue     20   0 6742m 5.0g 1732 D    1  2.0   0:58.95 opengm_min_sum
28548 khue     20   0 5776m 4.7g 1744 D    1  1.9   0:55.67 opengm_min_sum
28559 khue     20   0 8258m 7.1g 1680 D    1  2.8   0:57.90 opengm_min_sum
28564 khue     20   0 10.6g  10g 1680 D    1  4.0   1:08.75 opengm_min_sum
28529 khue     20   0 5686m 4.4g 1732 D    1  1.7   1:05.55 opengm_min_sum
28531 khue     20   0 4338m 3.6g 1724 D    1  1.4   0:57.72 opengm_min_sum
28533 khue     20   0 6064m 5.2g 1744 D    1  2.1   1:05.19 opengm_min_sum

(opengm_min_sum is the solver above)

I guess that some processes consume so many resources that the others have nothing left and enter the D state?

Best Answer

Summary of the comments: The machine is fast but doesn't have enough memory to run everything in parallel (the top output shows almost all of the RAM used and roughly 80 GB of swap in use). In addition, the problem needs to read a lot of data and the disk bandwidth is insufficient (22.9% I/O wait in top), so the CPUs are idle most of the time waiting for data.
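To get a feel for the memory limit, here is a rough estimate of how many solver processes fit in RAM at once. It is a sketch, not a measurement: the total comes from the Mem line of the top output in the question, and the per-job peak of 11 GiB is an assumption taken from the largest RES value shown there.

```shell
#!/usr/bin/env bash
# Total RAM in KiB, from the "Mem: ... total" line of the top output
mem_total_kb=264139632

# ASSUMPTION: each job peaks at about 11 GiB resident (largest RES in the listing)
per_job_kb=$(( 11 * 1024 * 1024 ))

# Integer division: how many such jobs fit without swapping
max_jobs=$(( mem_total_kb / per_job_kb ))

echo "jobs that fit in RAM at the assumed peak: $max_jobs"
```

Under that assumption only about 22 jobs fit, far fewer than the 70 processors, which is consistent with most processes sitting in the D state while the machine swaps.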

Rearranging the tasks helps.
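One possible form of rearranging, offered only as a sketch, is to cap how many solvers run at the same time so that their combined memory fits in RAM. The function below assumes a file with one command per line, like the commands.txt from the question; the name run_limited and the cap of 20 are hypothetical, and it needs bash 4.3 or later for wait -n. GNU parallel expresses the same cap with its -j option, e.g. parallel -j 20 :::: commands.txt.

```shell
#!/usr/bin/env bash
# Run every line of a command file, keeping at most max_jobs running at once.
# Requires bash >= 4.3 (for `wait -n`).
run_limited() {
    local max_jobs=$1 cmdfile=$2 cmd running=0
    while IFS= read -r cmd; do
        # Skip empty lines
        [ -z "$cmd" ] && continue
        # Pool is full: block until any one background job exits
        if [ "$running" -ge "$max_jobs" ]; then
            wait -n
            running=$(( running - 1 ))
        fi
        # Start the next command; detach stdin so it cannot eat the command file
        bash -c "$cmd" < /dev/null &
        running=$(( running + 1 ))
    done < "$cmdfile"
    # Wait for the jobs still running after the last line
    wait
}

# Example (hypothetical cap of 20 concurrent solvers):
# run_limited 20 commands.txt
```

The cap trades CPU utilisation for less swapping; the right value depends on how much memory each instance really needs.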

Whether compressing the data would improve the effective disk I/O bandwidth has not yet been investigated.

