Shell – pipeline with non-deterministic output

awk, concurrency, io-redirection, shell-script

I have happened upon a command that sometimes works and sometimes does not, even when executed multiple times in rapid succession in a bash shell (I have not tested the behavior in other shells). The problem has been localized to the reading of a variable in the BEGIN block of the awk command at the end of the pipeline. During some executions the variable is read correctly in the BEGIN block, and during others the operation fails. Supposing this aberrant behavior can be reproduced by others (and is not a consequence of some problem with my system), can its inconsistency be explained?

Take as input the following file called tmp:

cat > tmp <<EOF
a   a
b   *
aa  a
aaa a
aa  a
a   a
c   *
aaa a
aaaa    a
d   *
aaa a
a   a
aaaaa   a
e   *
aaaa    a
aaa a
f   *
aa  a
a   a
g   *
EOF

On my system, the pipeline

 awk '{if($2!~/\*/) print $1}' tmp | tee >(wc -l | awk '{print $1}' > n.txt) | sort | uniq -c | sort -k 1,1nr | awk 'BEGIN{getline n < "n.txt"}{print $1 "\t" $1/n*100 "\t" $2}'

will either produce the correct output:

4   28.5714 a
4   28.5714 aaa
3   21.4286 aa
2   14.2857 aaaa
1   7.14286 aaaaa

or the error message:

awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted

How can a command possibly give different output on consecutive runs when no random number generation is involved and no change is made to the environment in the interim?

To demonstrate how absurd the behavior is, consider the output generated by executing the above pipeline ten times consecutively in a loop:

for x in {1..10}; do echo "Iteration ${x}"; awk '{if($2!~/\*/) print $1}' tmp | tee >(wc -l | awk '{print $1}' > n.txt) | sort | uniq -c | sort -k 1,1nr | awk 'BEGIN{getline n < "n.txt"}{print $1 "\t" $1/n*100 "\t" $2}'; done
Iteration 1
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted
Iteration 2
4   28.5714 a
4   28.5714 aaa
3   21.4286 aa
2   14.2857 aaaa
1   7.14286 aaaaa
Iteration 3
4   28.5714 a
4   28.5714 aaa
3   21.4286 aa
2   14.2857 aaaa
1   7.14286 aaaaa
Iteration 4
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted
Iteration 5
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted
Iteration 6
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted
Iteration 7
4   28.5714 a
4   28.5714 aaa
3   21.4286 aa
2   14.2857 aaaa
1   7.14286 aaaaa
Iteration 8
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted
Iteration 9
4   28.5714 a
4   28.5714 aaa
3   21.4286 aa
2   14.2857 aaaa
1   7.14286 aaaaa
Iteration 10
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted

Note: I have also tried closing the file (with awk's close() function) after reading the variable, in case the problem relates to the file being left open. However, the inconsistent output remains.

Best Answer

Your redirections have a race condition. This:

>(wc -l | awk '{print $1}' > n.txt)

runs in parallel with:

awk 'BEGIN{getline n < "n.txt"}...'

later in the pipeline. Sometimes, n.txt is still empty when the awk program starts running.

This is (obliquely) documented in the Bash Reference Manual. In a pipeline:

The output of each command in the pipeline is connected via a pipe to the input of the next command. That is, each command reads the previous command’s output. This connection is performed before any redirections specified by the command.

and then:

Each command in a pipeline is executed in its own subshell

(emphasis added). All the processes in the pipeline are started, with their input and output connected together, without waiting for any of the earlier programs to finish or even start doing anything. Before that, process substitution with >(...) is:

performed simultaneously with parameter and variable expansion, command substitution, and arithmetic expansion.

What that means is that the subprocess running the wc -l | awk ... command starts early on, and the > n.txt redirection inside it truncates n.txt as soon as that subprocess is set up, while the awk process that causes the error is started only slightly later. Both of those commands execute in parallel: you'll have several processes going at once here.

The error occurs when awk runs its BEGIN block before the wc command's output has been written into n.txt. In that case, the n variable is empty, and so is zero when used as a number. If the BEGIN runs after the file is filled in, everything works.
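
You can see the empty-n case in isolation with a minimal sketch (the file name empty.txt is just for illustration): getline from a file that exists but has no content yet returns 0, leaves n empty, and an empty value is zero in a numeric context.

: > empty.txt    # create/truncate an empty file
awk 'BEGIN{
    ret = (getline n < "empty.txt")              # returns 0: nothing to read
    print "getline returned", ret
    print "n is [" n "], numerically it is", n+0 # prints: n is [], numerically it is 0
}'

That zero is exactly what $1/n*100 divides by in the failing runs.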

Which outcome you get depends on the operating-system scheduler and on which process gets a slot first, which is essentially random from the user's perspective. If the final awk gets to run early, or the wc pipeline gets scheduled a little later, the file will still be empty when awk starts its work and the whole thing will break. In all likelihood the processes will run on different cores, genuinely simultaneously, and it comes down to which one reaches the point of contention first. The effect is that the command probably works more often than not, but sometimes fails with the error you posted.


In general, pipelines are only safe insofar as they really are just pipelines: standard output into standard input is fine, but because the processes execute in parallel, you cannot rely on the sequencing of any other communication channel, like files, or on any part of one process running before or after any part of another, unless they are locked together by reading standard input.
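
The parallel start-up is easy to observe with a trivial pipeline (purely an illustrative sketch): the command on the right of the pipe does not wait for the one on the left.

date +%T             # start time
sleep 3 | date +%T   # prints immediately: the right-hand command does not wait for sleep
date +%T             # prints about 3 seconds later: the shell waits for the whole pipeline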

The workaround here is probably to do all of your file writing before you need the files: once a complete command line has finished, it is guaranteed that the entire pipeline and all of its redirections have completed before the next command runs (one way of doing that is sketched below). The command as written will never be reliable, but if you really do need it to work in this sort of structure you can insert a delay (sleep) or loop until n.txt is non-empty before running the final awk command to increase the chances of things working how you want.
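
As a sketch only (it assumes the same tmp input file as above), the count can be computed in its own command first and handed to the final awk with -v, so nothing races against the writing of n.txt:

n=$(awk '$2 !~ /\*/ { c++ } END { print c+0 }' tmp)   # count the non-* lines up front
awk '$2 !~ /\*/ { print $1 }' tmp |
    sort | uniq -c | sort -k 1,1nr |
    awk -v n="$n" '{ printf "%d\t%g\t%s\n", $1, $1/n*100, $2 }'

Because the count is fully computed before the pipeline starts, the division can no longer run ahead of the file write.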