Bash – Multiple Processes Redirecting to the Same File

bashconcurrencyio-redirection

This is not a "how to append and not overwrite" question. I'm not looking for a file that combines the output of two commands. It's just a mistake I made and I would like to understand why the system did what it did

I use a command (on a remote ssh command line) that takes a long time to complete and outputs data (line by line every few seconds) to stdout so I redirect it to a file:

command > file.out &

Sometimes the remote session disconnects but the command keeps on running in the background. I didn't know this so I run the same command again, before the first one had finished:

command > file.out &

When both processes have finished I would expect to have (after reading some answers at this site) a single file with the lines from both commands messed up, but the output file only has the output from one of the 2 executions.

Why doesn't the file have both outputs intertwined (as warned in the comments here)? Which one of the 2 outputs does the final file belong to?

EDIT:

Removed one of the questions (why is the output file not locked for writing?) as it's explained here

Best Answer

When you open a file for writing using the > redirection, the file is truncated, i.e. it is completely emptied. It is however not deleted and recreated.

If one command starts by truncating the file and then writes something to it, and if another command then does the same, the first command's position within the file will not change. This means that you have two commands writing to the same file at two independent positions, one possibly overwriting the output of the other, depending on the order of writing and the amount of data being written.

So, yes, the data in the file may well be an intertwined mess of the output from both programs, but it will depend on the order of writes into the file, as well as the amount of data written and the timing of the truncations of the file.

Here's an example of intertwining the data from two commands:

#!/bin/sh

( { echo hello; sleep 2; echo world; } | cat >file ) &
sleep 1
echo 123 >file &

wait

This is what happens in this script:

The first command opens the file for writing and truncates it. It writes hello\n to it.
After one second, the second command truncates the file and writes 123\n to it. At this point, the first command's file pointer is still pointing into the file at some offset.
The first command continues writing world\n to the file.

The result is a file with a stretch of nul characters in the middle:

$ hexdump -C file
00000000  31 32 33 0a 00 00 77 6f  72 6c 64 0a              |123...world.|
0000000c

The nuls (00 in the output above) comes from the fact that the first command's file pointer wasn't reset by the second command's truncating of the file, so there was a "hole" created. The second command only wrote 123\n but would have overwritten the nuls if it had written more data:

$ hexdump -C file
00000000  31 32 33 35 36 37 77 6f  72 6c 64 0a              |123567world.|
0000000c

Here I've made the second command echo 1234567890, but only the 1234567 is left in the file. This is due to the first command continuing to write world\n at the point where its file pointer was after the second command had finished writing.

Related Solutions

Shell – pipe line with non-deterministic output

Your redirections have a race condition. This:

>(wc -l | awk '{print $1}' > n.txt)

runs in parallel with:

awk 'BEGIN{getline n < "n.txt"}...'

later in the pipeline. Sometimes, n.txt is still empty when the awk program starts running.

This is (obliquely) documented in the Bash Reference Manual. In a pipeline:

The output of each command in the pipeline is connected via a pipe to the input of the next command. That is, each command reads the previous command’s output. This connection is performed before any redirections specified by the command.

and then:

Each command in a pipeline is executed in its own subshell

(emphasis added). All the processes in the pipeline are started, with their input and output connected together, without waiting for any of the earlier programs to finish or even start doing anything. Before that, process substitution with >(...) is:

performed simultaneously with parameter and variable expansion, command substitution, and arithmetic expansion.

What that means is that the subprocess running the wc -l | awk ... command starts early on, and the redirection empties n.txt just before that, but the awk process that causes the error is started shortly after. Both of those commands execute in parallel - you'll have several processes going at once here.

The error occurs when awk runs its BEGIN block before the wc command's output has been written into n.txt. In that case, the n variable is empty, and so is zero when used as a number. If the BEGIN runs after the file is filled in, everything works.

When that happens depends on the operating system scheduler, and which process gets a slot first, which is essentially random from the user perspective. If the final awk gets to run early, or the wc pipeline gets scheduled a little later, the file will still be empty when awk starts doing its work and the whole thing will break. In all likelihood the processes will run on different cores actually simultaneously, and it's down to which one gets to the point of contention first. The effect you'll get is probably of the command working more often than not, but sometimes failing with the error you post.

In general, pipelines are only safe in so far as they're just pipelines - standard output into standard input is fine, but because the processes execute in parallel it's not reliable to rely on the sequencing of any other communication channels, like files, or of any part of any one process executing before or after any part of another unless they're locked together by reading standard input.

The workaround here is probably to do all your file writing in advance of needing them: at the end of a line, it's guaranteed that an entire pipeline and all of its redirections have completed before the next command runs. This command will never be reliable, but if you really do need it to work in this sort of a structure you can insert a delay (sleep) or loop until n.txt is non-empty before running the final awk command to increase the chances of things working how you want.

bash – How to Make Reading and Writing the Same File in Pipeline Fail

Gilles' answer explains the race condition. I'm just going to answer this part:

Is there any way I can force this script to output always 0 lines (so the I/O redirection to tmp is always prepared first and so the data is always destroyed)? To be clear, I mean changing the system settings

IDK if a tool for this already exists, but I have an idea for how one could be implemented. (But note this wouldn't be always 0 lines, just a useful tester that catches simple races like this easily, and some more complicated races. See @Gilles' comment.) It wouldn't guarantee that a script was safe, but might be a useful tool in testing, similar to testing a multi-threaded program on different CPUs, including weakly-ordered non-x86 CPUs like ARM.

You'd run it as racechecker bash foo.sh

Use the same system-call tracing / intercepting facilities that strace -f and ltrace -f use to attach to every child process. (On Linux, this is the same ptrace system call used by GDB and other debuggers to set breakpoints, single step, and modify memory / registers of another process.)

Instrument the open and openat system calls: when any process running under this tool makes a an open(2) system call (or openat) with O_RDONLY, sleep for maybe 1/2 or 1 second. Let other open system calls (especially ones including O_TRUNC) execute without delay.

This should allow the writer to win the race in nearly every race condition, unless system load was also high, or it was a complicated race condition where the truncation didn't happen until after some other read. So random variation of which open()s (and maybe read()s or writes) are delayed would increase the detection power of this tool, but of course without testing for an infinite amount of time with a delay simulator that will eventually cover all possible situations you can encounter in the real world, you can't be sure your scripts are free from races unless you read them carefully and prove they're not.

You would probably need it to whitelist (not delay open) for files in /usr/bin and /usr/lib so process-startup doesn't take forever. (Runtime dynamic linking has to open() multiple files (look at strace -eopen /bin/true or /bin/ls sometime), although if the parent shell itself is doing the truncation, that will be ok. But it will still be good for this tool to not make scripts unreasonably slow).

Or maybe whitelist every file the calling process doesn't have permission to truncate in the first place. i.e. the tracing process can make an access(2) system call before actually suspending the process that wanted to open() a file.

racechecker itself would have to be written in C, not in shell, but could maybe use strace's code as a starting point and might not take much work to implement.

You could maybe get the same functionality with a FUSE filesystem. There's probably a FUSE example of a pure passthrough filesystem, so you could add checks to the open() function in that which make it sleep for read-only opens but let truncation happen right away.

Best Answer

Related Solutions

Shell – pipe line with non-deterministic output

bash – How to Make Reading and Writing the Same File in Pipeline Fail

Related Question