Bash – How Process Substitution is Implemented

bash process-substitution

While researching another question, I realized that I don't understand what's happening under the hood: what are those /dev/fd/* files, and how come child processes can open them?

Best Answer

Well, there are many aspects to it.

File descriptors

For each process, the kernel maintains a table of open files (well, it might be implemented differently, but since you are not able to see it anyway, you can just assume it's a simple table). That table contains information about which file it is and where it can be found, in which mode you opened it, at which position you are currently reading or writing, and whatever else is needed to actually perform I/O operations on that file. The process never gets to read (or write) that table directly. When the process opens a file, it gets back a so-called file descriptor, which is simply an index into the table.
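As a rough illustration, a shell can open a file on a numbered descriptor and then do all further I/O through that number alone. This is only a sketch: the temporary file and the choice of descriptor 3 are arbitrary assumptions for the demo.

```shell
# Create a small file to open (the file name is arbitrary).
tmp=$(mktemp)
printf 'first\nsecond\n' > "$tmp"

# Open the file for reading; the shell now holds it as file descriptor 3.
exec 3< "$tmp"

# Each read goes through the same table entry, so the position advances.
read -r line1 <&3
read -r line2 <&3

# Close descriptor 3 and clean up.
exec 3<&-
rm -f "$tmp"

echo "$line1 / $line2"
```

The second read continues where the first one stopped because the read position is part of the table entry, not of the command that does the reading.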

The directory /dev/fd and its content

On Linux, /dev/fd is actually a symbolic link to /proc/self/fd. /proc is a pseudo file system in which the kernel maps several internal data structures so that they can be accessed with the file API (so they just look like regular files, directories or symlinks to programs). In particular, it holds information about all processes (which is what gave it the name). The symbolic link /proc/self always refers to the directory associated with the currently running process (that is, the process requesting it; different processes will therefore see different values). In the process's directory, there's a subdirectory fd which, for each open file, contains a symbolic link whose name is just the decimal representation of the file descriptor (the index into the process's file table, see previous section), and whose target is the file it corresponds to.
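A small experiment along these lines, assuming a Linux-style /dev/fd (the temporary file and descriptor 3 are again arbitrary choices for the demo):

```shell
tmp=$(mktemp)
echo 'hello from fd 3' > "$tmp"

# Open the file as descriptor 3 in the current shell.
exec 3< "$tmp"

# cat runs as a child process; it inherits descriptor 3,
# so the name /dev/fd/3 resolves for it as well.
via_devfd=$(cat /dev/fd/3)

# On Linux the entry is a symlink; show its target (may fail on other systems).
readlink /dev/fd/3 || true

exec 3<&-
rm -f "$tmp"

echo "$via_devfd"
```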

File descriptors when creating child processes

A child process is created by a fork. A fork makes a copy of the file descriptors, which means that the child process created has the very same list of open files as the parent process does. So unless one of the open files is closed by the child, accessing an inherited file descriptor in the child will access the very same file as accessing the original file descriptor in the parent process.

Note that after a fork, you initially have two copies of the same process which differ only in the return value from the fork call (the parent gets the PID of the child, the child gets 0). Normally, a fork is followed by an exec to replace one of the copies by another executable. The open file descriptors survive that exec. Note also that before the exec, the process can do other manipulations (like closing files that the new process should not get, or opening other files).
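A minimal sketch of that inheritance, assuming nothing more than a POSIX-style sh on the PATH (the file name and descriptor number are arbitrary): the parent opens descriptor 3, and a forked and exec'ed child writes through the very same descriptor.

```shell
tmp=$(mktemp)

# The parent shell opens the file for writing as descriptor 3.
exec 3> "$tmp"
echo 'parent line' >&3

# sh -c means fork followed by exec; descriptor 3 survives both,
# so the child writes to the same open file.
sh -c 'echo "child line" >&3'

exec 3>&-
result=$(cat "$tmp")
rm -f "$tmp"

echo "$result"
```

The child's line lands after the parent's line rather than overwriting it, because both processes share the same open file, including its current position.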

Unnamed pipes

An unnamed pipe is just a pair of file descriptors created on request by the kernel, so that everything written to one of them (the write end) can be read from the other (the read end). The most common use is the piping construct foo | bar of bash, where the standard output of foo is replaced by the write end of the pipe, and the standard input of bar is replaced by the read end. Standard input and standard output are just the first two entries in the file table (entries 0 and 1; 2 is standard error), and therefore replacing one of them just means rewriting that table entry with the data corresponding to the other file descriptor (again, the actual implementation may differ). Since the process cannot access the table directly, there's a system call to do that (dup2).
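The following sketch shows a pipe at work. The readlink line is a Linux-only peek and is an assumption about the platform; elsewhere it may simply print nothing.

```shell
# echo's standard output is the write end of the pipe,
# tr's standard input is the read end.
upper=$(echo hello | tr 'a-z' 'A-Z')
echo "$upper"

# Linux-only: show what descriptor 0 of the right-hand command refers to
# (typically printed in the form pipe:[inode]).
true | readlink /proc/self/fd/0 || true
```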

Process substitution

Now we have everything together to understand how the process substitution works:

1. The bash process creates an unnamed pipe for communication between the two processes created later.
2. Bash forks for the echo process. The child process (which is an exact copy of the original bash process) closes the reading end of the pipe and replaces its own standard output with the writing end of the pipe. Given that echo is a shell builtin, bash might spare itself the exec call, but it doesn't matter anyway (the shell builtin might also be disabled, in which case it execs /bin/echo).
3. Bash (the original, parent one) replaces the expression <(echo 1) by the pseudo file link in /dev/fd referring to the reading end of the unnamed pipe.
4. Bash forks for the PHP process (note that after the fork, we are still inside [a copy of] bash). The new process closes the inherited write end of the unnamed pipe (and does some other preparatory steps), but leaves the read end open. Then it executes PHP.
5. The PHP program receives the name in /dev/fd/. Since the corresponding file descriptor is still open, it still corresponds to the reading end of the pipe. Therefore, if the PHP program opens the given file for reading, what it actually does is create a second file descriptor for the reading end of the unnamed pipe. But that's no problem, it could read from either.
6. Now the PHP program can read the reading end of the pipe through the new file descriptor, and thus receive the standard output of the echo command, which goes to the writing end of the same pipe.
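The effect of these steps can be observed from the command line. This sketch uses echo and cat as stand-ins for the consuming program (an assumption made so that it runs without PHP installed); the exact descriptor number in the substituted name varies.

```shell
# Needs bash (or another shell with <(...)); a plain sh may reject the syntax.

# echo does not open its argument, so it only shows the name bash substituted.
name=$(echo <(echo 1))

# cat opens that name and reads the inner command's output through the pipe.
content=$(cat <(echo 1))

echo "substituted name: $name"
echo "content read through it: $content"
```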

Related Solutions

Shell – Process Substitution and Pipe

A good way to grok the difference between them is to do a little experimenting on the command line. In spite of the visual similarity in its use of the < character, process substitution does something very different from a redirect or a pipe.

Let's use the date command for testing.

$ date | cat
Thu Jul 21 12:39:18 EEST 2011

This is a pointless example but it shows that cat accepted the output of date on STDIN and spit it back out. The same results can be achieved by process substitution:

$ cat <(date)
Thu Jul 21 12:40:53 EEST 2011

However what just happened behind the scenes was different. Instead of being given a STDIN stream, cat was actually passed the name of a file that it needed to go open and read. You can see this step by using echo instead of cat.

$ echo <(date)
/proc/self/fd/11

When cat received the file name, it read the file's content for us. On the other hand, echo just showed us the file's name that it was passed. This difference becomes more obvious if you add more substitutions:

$ cat <(date) <(date) <(date)
Thu Jul 21 12:44:45 EEST 2011
Thu Jul 21 12:44:45 EEST 2011
Thu Jul 21 12:44:45 EEST 2011

$ echo <(date) <(date) <(date)
/proc/self/fd/11 /proc/self/fd/12 /proc/self/fd/13

It is possible to combine process substitution (which generates a file) and input redirection (which connects a file to STDIN):

$ cat < <(date)
Thu Jul 21 12:46:22 EEST 2011

It looks pretty much the same, but this time cat was passed a STDIN stream instead of a file name. You can see this by trying it with echo:

$ echo < <(date)
<blank>

Since echo doesn't read STDIN and no argument was passed, we get nothing.

Pipes and input redirects shove content onto the STDIN stream. Process substitution runs the commands, connects their output to a special file, and then passes that file's name in place of the command. Whatever command you are using treats it as a file name. Note that this file is not a regular file but a pipe (a /dev/fd entry where available, a named pipe otherwise), so nothing is saved on disk and it goes away automatically once it is no longer needed.
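One way to check what kind of file the command receives is the -p test, which succeeds for pipes. In this sketch is_pipe is a hypothetical helper name introduced only for the demo.

```shell
# Needs bash (or another shell with <(...)).

# Hypothetical helper: succeeds if its argument names a pipe (FIFO).
is_pipe() { [ -p "$1" ]; }

# The function receives a file name, exactly as cat or echo would.
if is_pipe <(date); then
  kind="pipe"
else
  kind="not a pipe"
fi

echo "$kind"
```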

Shell – Subshell and process substitution

Process substitution is a feature that originated in the Korn shell in the 80s (in ksh86). At the time, it was only available on systems that had support for /dev/fd/<n> files.

Later, the feature was added to zsh (from the start: 1990) and bash (in 1993). zsh was using temporary named pipes to implement it, while bash was using /dev/fd/<n> where available and named pipes otherwise. zsh switched to using /dev/fd/<n> where available in 2.6-beta17 in 1996.
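The named-pipe approach can be approximated by hand, roughly as follows. This is a sketch of the general idea, not of what any particular shell does internally; the directory and file names are arbitrary.

```shell
# Create a private directory holding a named pipe.
dir=$(mktemp -d)
fifo=$dir/procsub
mkfifo "$fifo"

# Producer: plays the role of the command inside <(...).
echo 1 > "$fifo" &

# Consumer: is handed a file name and opens it like any other file.
out=$(cat "$fifo")

# Wait for the producer, then remove the pipe.
wait
rm -rf "$dir"

echo "$out"
```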

Support for process substitution via named pipes on systems without /dev/fd was only added to ksh in ksh93u+ in 2012. The public domain clone of ksh doesn't support it.

To my knowledge, no other Bourne-like shell supports it (rc, es and fish, which are not Bourne-like shells, support it but with a different syntax). yash has a <(...) construct, but that's for process redirection.

While quite useful, the feature was never standardized by POSIX. So one can't expect to find it in sh, and shouldn't use it in an sh script.

Though the behaviour for <(...) is unspecified in POSIX (so there would be no harm in retaining it), bash disables the feature when called as sh or when called with POSIXLY_CORRECT=1 in its environment.

So, if you have a script that uses <(...), you should interpret it with a shell that supports the feature, such as zsh, bash or AT&T ksh (of course, you need to make sure the rest of the script's syntax is also compatible with that shell).

In any case:

cat <(cmd)

Can be written:

cmd | cat

Or just

cmd

For a command other than cat (that needs to be passed data via a file given as argument), on systems with /dev/fd/x, you can always do:

something | that-cmd /dev/stdin

Or if you need that-cmd's stdin to be preserved:

{ something 3<&- | that-cmd /dev/fd/4 4<&0 <&3 3<&-; } 3<&0
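Both forms can be tried with stand-in commands. In this sketch grep and a small sh -c script play the role of that-cmd (assumptions made for the demo), and a system with /dev/fd is assumed.

```shell
# Simple form: grep is told to read the file /dev/stdin, which is the pipe.
n=$(printf 'a\nb\nc\n' | grep -c b /dev/stdin)
echo "$n"

# Stdin-preserving form: descriptor 3 saves the outer stdin, descriptor 4
# receives the inner pipe, and descriptor 0 is restored from descriptor 3.
result=$(
  echo 'outer stdin' | {
    printf 'piped data\n' 3<&- |
      sh -c 'cat /dev/fd/4; cat' 4<&0 <&3 3<&-
  } 3<&0
)
echo "$result"
```

The first cat reads the piped data through /dev/fd/4, while the second cat still sees the original standard input.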
