Shell – Behaviour of “|” pipe in Linux with “>” output redirection

io-redirection pipe shell

I am executing the command

ls > a.txt | sort > b.txt

This command does the following:

executing ls
sorting it
creating a.txt and storing sorted output to a.txt
creating b.txt, but it's empty.

Can anyone explain this?

I am implementing my own shell, for which I need to understand this behavior and simulate it.

Best Answer

The | connects the standard output of the command on the left to the standard input of the command on the right. The > operator sends a command's standard output to a file instead. In your example the shell sets up the pipe first and then applies > a.txt, which replaces the pipe as the standard output of ls. Everything ls prints therefore goes into a.txt and nothing is ever written to the pipe, so the sort on the right reads empty input and saves that to b.txt. (a.txt only looks sorted because ls sorts its listing by default; sort never sees it.)

What you probably want is the tee command, which writes its input both to a file and to stdout:
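A minimal sketch of this behaviour, assuming bash or a POSIX sh. Here printf with two unsorted lines is a stand-in for ls (an assumption made so the result is easy to check), and the file names are the ones from the question:

```shell
# Two unsorted lines stand in for the output of ls.
# The pipe is connected first, then "> a.txt" replaces it as printf's stdout.
printf 'b\na\n' > a.txt | sort > b.txt

cat a.txt        # the two lines, still unsorted: sort never saw them
wc -c < b.txt    # 0 bytes: sort read end-of-file immediately
```

For a shell implementation this suggests the order: create the pipe, fork, connect the pipe ends, and only then apply each command's own redirections.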

ls | tee a.txt | sort > b.txt

Though I'm really curious what you're trying to do, since ls can/will sort things for you as well.
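The tee variant can be checked the same way. This is only a sketch, again using printf with two unsorted lines as a stand-in for ls:

```shell
# tee copies its input to a.txt and also passes it on to sort.
printf 'b\na\n' | tee a.txt | sort > b.txt

cat a.txt   # unsorted copy written by tee
cat b.txt   # sorted output written by sort
```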

Related Solutions

Shell – pipe line with non-deterministic output

Your redirections have a race condition. This:

>(wc -l | awk '{print $1}' > n.txt)

runs in parallel with:

awk 'BEGIN{getline n < "n.txt"}...'

later in the pipeline. Sometimes, n.txt is still empty when the awk program starts running.

This is (obliquely) documented in the Bash Reference Manual. In a pipeline:

The output of each command in the pipeline is connected via a pipe to the input of the next command. That is, each command reads the previous command’s output. This connection is performed before any redirections specified by the command.

and then:

Each command in a pipeline is executed in its own subshell

(emphasis added). All the processes in the pipeline are started, with their input and output connected together, without waiting for any of the earlier programs to finish or even start doing anything. Before that, process substitution with >(...) is:

performed simultaneously with parameter and variable expansion, command substitution, and arithmetic expansion.

What that means is that the subprocess running the wc -l | awk ... command starts early on, and the redirection empties n.txt just before that, but the awk process that causes the error is started shortly after. Both of those commands execute in parallel - you'll have several processes going at once here.

The error occurs when awk runs its BEGIN block before the wc command's output has been written into n.txt. In that case, the n variable is empty, and so is zero when used as a number. If the BEGIN runs after the file is filled in, everything works.

When that happens depends on the operating system scheduler, and which process gets a slot first, which is essentially random from the user's perspective. If the final awk gets to run early, or the wc pipeline gets scheduled a little later, the file will still be empty when awk starts doing its work and the whole thing will break. In all likelihood the processes will run on different cores truly simultaneously, and it's down to which one gets to the point of contention first. The effect you'll get is probably of the command working more often than not, but sometimes failing with the error you posted.

In general, pipelines are only safe insofar as they're just pipelines: standard output into standard input is fine, but because the processes execute in parallel you can't rely on the sequencing of any other communication channel, such as files, or on any part of one process executing before or after any part of another, unless they're locked together by reading standard input.
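The parallel start can be made visible with a small sketch. The file name order.txt and the one-second sleep are assumptions chosen for illustration; both sides log to the file rather than to the pipe:

```shell
: > order.txt   # start with an empty log file

# The left side sleeps before logging. The right side logs at once, then
# cat waits for end-of-file on the pipe, which arrives when the left side exits.
{ sleep 1; echo "left finished" >> order.txt; } |
  { echo "right started" >> order.txt; cat; }

cat order.txt
```

With the sleep in place the right-hand side should log first, even though it is written second in the pipeline. Without the sleep, the order would be up to the scheduler.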

The workaround here is probably to do all your file writing in advance of needing them: at the end of a line, it's guaranteed that an entire pipeline and all of its redirections have completed before the next command runs. This command will never be reliable, but if you really do need it to work in this sort of a structure you can insert a delay (sleep) or loop until n.txt is non-empty before running the final awk command to increase the chances of things working how you want.
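A sketch of the "write the file first" approach. The original command is not shown in full, so data.txt and the final awk program here are hypothetical; only n.txt and the wc -l | awk '{print $1}' step come from the text above:

```shell
printf 'x\ny\nz\n' > data.txt                  # hypothetical input file

# Step 1: this line finishes completely before the next one starts.
wc -l < data.txt | awk '{print $1}' > n.txt

# Step 2: n.txt is guaranteed to be filled in by now.
result=$(awk 'BEGIN { getline n < "n.txt" } { print NR "/" n }' data.txt)
echo "$result"
```

Reading the input twice costs a little, but it removes the dependency on scheduling.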

Command Execution – Difference Between () and $() in Shell

Okay, let's break this down. A subshell executes its contents in a chain (i.e., it groups them). This actually makes intuitive sense as a subshell is created simply by surrounding the chain of commands with (). But, aside from the contents of the subshell being grouped together in execution, you can still use a subshell as if it were a single command. That is, a subshell still has an stdin, stdout and stderr so you can pipe things to and from a subshell.

On the other hand, command substitution is not the same thing as simply chaining commands together. Rather, command substitution is meant to act a bit like a variable access but with a function call. Variables, unlike commands, do not have the standard file descriptors so you cannot pipe anything to or from a variable (generally speaking), and the same is true of command substitutions.

To try to make this more clear, what follows are a set of maybe-unclear (but accurate) examples and a set of, what I think may be, more easily-understood examples.

Let's say the date -u command gives the following:

Thu Jul  2 13:42:27 UTC 2015

But, we want to manipulate the output of this command. So, let's pipe it into something like sed:

user@host~> date -u | sed -e 's/ /    /g'
Thu    Jul        2    13:42:27    UTC    2015

Wow, that was fun! The following is completely equivalent to above (barring some environment differences that you can read about in the man pages about your shell):

user@host~> (date -u) | sed -e 's/ /    /g'
Thu    Jul        2    13:42:27    UTC    2015

That should be no surprise since all we did was group date -u. However, if we do the following, we are going to get something that may seem a bit odd at first:

user@host~> $(date -u) | sed -e 's/ /    /g'
command not found: Thu

This is because $(date -u) is equivalent to typing out exactly what date -u outputs. So the above is equivalent to the following:

user@host~> Thu Jul  2 13:42:27 UTC 2015 | sed -e 's/ /    /g'

Which will, of course, error out because Thu is not a command (at least not one I know of); and it certainly doesn't pipe anything to stdout (so sed will never get any input).

But, since we know that command substitutions act like variables, we can easily fix this problem because we know how to pipe the value of a variable into another command:

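user@host~> echo $(date -u) | sed -e 's/ /    /g'
Thu    Jul    2    13:42:27    UTC    2015

Notice that the double space before the 2 has collapsed into a single one: an unquoted command substitution is split into words, and echo joins them back with single spaces. So, as with any variable in bash, you should probably quote command substitutions with "".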
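The effect of quoting can be sketched with a fixed string in place of date -u (an assumption, so that the result does not depend on the clock):

```shell
stamp='Thu Jul  2 13:42:27 UTC 2015'   # fixed text with a double space

# Unquoted: the substituted text is split into words, echo rejoins them.
unquoted=$(echo $(printf '%s' "$stamp"))
# Quoted: the substituted text is passed to echo as one unchanged word.
quoted=$(echo "$(printf '%s' "$stamp")")

echo "$unquoted"   # Thu Jul 2 13:42:27 UTC 2015
echo "$quoted"     # Thu Jul  2 13:42:27 UTC 2015
```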

Now, for the perhaps-simpler example; consider the following:

user@host~> pwd
/home/hypothetical
user@host~> echo pwd
pwd
user@host~> echo "$(pwd)"
/home/hypothetical
user@host~> echo "$HOME"
/home/hypothetical
user@host~> echo (pwd)
error: your shell will tell you something weird that roughly means “Whoa! you tried to have me echo something that isn't text!”
user@host~> (pwd)
/home/hypothetical

I am not sure how to describe it any simpler than that. The command substitution works just like a variable access where the subshell still operates like a command.
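Both constructs can appear side by side. In this small sketch the file name grouped.txt and the variable first are made up for illustration:

```shell
# A subshell behaves like one command: its combined stdout can be piped.
(echo banana; echo apple) | sort > grouped.txt

# A command substitution behaves like a value: it expands to text.
first=$(head -n 1 grouped.txt)
echo "first item: $first"   # first item: apple
```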
