Bash – Split output and rejoin again with named pipes on linux

bash fifo linux

My question is related to https://serverfault.com/questions/171095/how-do-i-join-two-named-pipes-into-single-input-stream-in-linux but with a slightly more convoluted setup.

I have three programs, cmd1, cmd2 and cmd3:

cmd1 takes no input and writes to stdout

cmd2 reads stdin or a given file and writes to stdout

cmd3 reads two files

The dataflow for these programs is the following: cmd2 consumes data produced by cmd1, and cmd3 consumes data produced by both cmd1 and cmd2:

cmd1 ---+-----> cmd2 --->
        |                  cmd3
        +--------------->

How can I achieve this dataflow with a single command line using >(), pipes and tee?

My best guess is cmd1 | tee >(cmd2) > >(cmd3).

Best Answer

mkfifo thepipe
cmd3 <( cmd1 | tee thepipe ) <( cmd2 thepipe )

This uses a named pipe, thepipe, to transfer data between tee and cmd2.

Using your diagram:

cmd1 ---(tee)---(thepipe)--- cmd2 --->
          |                            cmd3
          +-------------------------->

An example with the following commands:

cmd1 = echo 'hello world', writes a string to standard output.
cmd2 = rev, reverses the order of characters on each line, reads a file or from standard input.
cmd3 = paste, takes input from two files (in this case) and produces two columns.

mkfifo thepipe
paste <( echo 'hello world' | tee thepipe ) <( rev thepipe )

Result:

hello world     dlrow olleh

The same thing, but putting the named pipe on the other branch in your diagram:

cmd1 ---(tee)--------------- cmd2 --->
          |                            cmd3
          +-----(thepipe)------------>

cmd3 thepipe <( cmd1 | tee thepipe | cmd2 )

With our example commands:

paste thepipe <( echo 'hello world' | tee thepipe | rev )

This produces the same output as above.
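Process substitution is itself built on named pipes or /dev/fd, so the example above can also be spelled out with explicit FIFOs only. The following is a minimal sketch under assumptions, not a replacement for the commands above: the FIFO names raw, left and right and the temporary directory are illustrative, and it relies on paste opening both of its files before it starts reading (as GNU paste does).

```shell
# Sketch only: raw, left and right are illustrative FIFO names.
tmpdir=$(mktemp -d)                       # private directory for the FIFOs
mkfifo "$tmpdir/raw" "$tmpdir/left" "$tmpdir/right"

# cmd1 | tee: one copy goes to cmd2 via "raw", the other straight to cmd3 via "left"
echo 'hello world' | tee "$tmpdir/raw" > "$tmpdir/left" &
# cmd2: reads the copy from "raw" and feeds cmd3 via "right"
rev "$tmpdir/raw" > "$tmpdir/right" &

# cmd3: reads both branches
result=$(paste "$tmpdir/left" "$tmpdir/right")
wait                                      # reap the two background jobs

printf '%s\n' "$result"
rm -rf "$tmpdir"                          # remove the FIFOs again
```

Creating the FIFOs in a directory from mktemp -d avoids name clashes with an existing thepipe, and the final rm -rf removes them once the pipeline is done.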

There are obviously other possibilities, such as

cmd1 | tee >( cmd2 >thepipe ) | cmd3 /dev/stdin thepipe

but I don't think you can get away from having to use a named pipe unless you're happy writing intermediate results to a temporary file and breaking it down into two sets of commands.
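For comparison, the temporary-file route mentioned above might look like the following sketch, using the same example commands. The variable name tmpfile is illustrative. Note the trade-off: cmd1 has to finish writing before cmd2 and cmd3 start, so nothing is streamed.

```shell
tmpfile=$(mktemp)                     # intermediate file instead of a FIFO

# First set of commands: cmd1 writes its complete output to the file.
echo 'hello world' > "$tmpfile"

# Second set: cmd2 reads the file; cmd3 reads the file and cmd2's output ("-" is stdin).
result=$(rev "$tmpfile" | paste "$tmpfile" -)

printf '%s\n' "$result"
rm -f "$tmpfile"                      # clean up the intermediate file
```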

Related Solutions

Shell – How to forward between processes with named pipes

If you get rid of the killing and shutdown stuff, child.py won't terminate, because your parent.py isn't behaving like a good UNIX citizen. (The killing is unsafe: in an extreme but not unfathomable case, where child.py dies before the (head -n 1 shutdown; kill -9 $parent) & subshell does, you may end up kill -9ing some innocent process.)

The cat std_out & subprocess will have finished by the time you send the quit message, because the writer to std_out is child_original.py, which exits upon receiving quit. At that moment its stdout, which is the std_out pipe, is closed, and that close makes the cat subprocess finish.

The cat > std_in isn't finishing because it's reading from a pipe originating in the parent.py process, and the parent.py process didn't bother to close that pipe. If it did, cat > std_in, and consequently the whole child.py, would finish by itself, and you wouldn't need the shutdown pipe or the killing part (killing a process that isn't your child on UNIX is always a potential security hole, should a race condition caused by rapid PID recycling occur).

Processes at the right end of a pipeline generally only finish once they're done reading their stdin, but since you're not closing that (child.stdin), you're implicitly telling the child process "wait, I have more input for you" and then you go kill it because it does wait for more input from you as it should.

In short, make parent.py behave reasonably:

from __future__ import print_function
from subprocess import Popen, PIPE
import os

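# universal_newlines=True opens the pipes in text mode, so the str writes below also work on Python 3
child = Popen('./child.py', stdin=PIPE, stdout=PIPE, universal_newlines=True)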

for letter in 'abcde':
    print('Parent writes to child: ', letter)
    child.stdin.write(letter+'\n')
    child.stdin.flush()
    response = child.stdout.readline()
    print('Response from the child:', response)
    assert response.rstrip() == letter.upper(), 'Wrong response'

child.stdin.write('quit\n')
child.stdin.flush()
child.stdin.close()
print('Waiting for the child to terminate...')
child.wait()
print('Done!')

And your child.py can be as simple as

#!/bin/sh
cat std_out &
cat > std_in
wait #basically to assert that cat std_out has finished at this point

(Note that I got rid of the fd dup calls, because otherwise you'd need to close both child.stdin and the child_stdin duplicate.)

Since parent.py operates in a line-oriented fashion, GNU cat is unbuffered (as mikeserv pointed out), and child_original.py operates in a line-oriented fashion, you've effectively got the whole thing line-buffered.

Note on cat: "unbuffered" might not be the luckiest term, as GNU cat does use a buffer. What it doesn't do is try to fill the whole buffer before writing things out (unlike stdio). Basically, it makes read requests to the OS for a specific size (its buffer size) and writes whatever it receives, without waiting for a whole line or a full buffer. (read(2) can be lazy and give you only what it has at the moment rather than the whole buffer you've asked for.)

(You can inspect the source code at http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/cat.c ; safe_read (used instead of plain read) is in the gnulib submodule and it's a very simple wrapper around read(2) that abstracts away EINTR (see the man page)).

Named pipes, file descriptors and EOF

It has to do with the closing of the file descriptor.

In your first example, echo writes to its standard output stream which the shell opens to connect it with f, and when it terminates, its descriptor is closed (by the shell). On the receiving end, the shell, which reads input from its standard input stream (connected to f) reads ls, runs ls and then terminates due to the end-of-file condition on its standard input.

The end-of-file condition occurs because all writers to the named pipe (only one in this example) have closed their end of the pipe.
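A minimal sketch of this first case, assuming a throw-away FIFO (the path $tmpdir/f and the variable names are illustrative) and using cat in place of the reading shell so that the received text can be captured:

```shell
tmpdir=$(mktemp -d)          # private directory for the FIFO
f=$tmpdir/f
mkfifo "$f"

echo ls > "$f" &             # writer: opens f, writes one line, then its descriptor is closed
received=$(cat "$f")         # reader: cat returns as soon as the last writer has closed f
wait                         # reap the background writer

printf 'reader got: %s\n' "$received"
rm -rf "$tmpdir"
```

The reader terminates on its own here because the only writer closes the pipe as soon as echo is done.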

In your second example, exec 3>f opens file descriptor 3 for writing to f, then echo writes ls to it. It's the shell that now has the file descriptor opened, not the echo command. The descriptor remains open until you do exec 3>&-. On the receiving end, the shell, which reads input from its standard input stream (connected to f) reads ls, runs ls and then waits for more input (since the stream is still open).

The stream remains open because not every writer has closed its end of the pipe: echo has finished, but the shell still holds descriptor 3 open (exec 3>f is still in effect).

I have written about echo above as if it were an external command. It is most likely built into the shell. The effect is the same nonetheless.
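The second case can be sketched the same way. This is an illustration under assumptions rather than the original commands: cat stands in for the reading shell, the FIFO path and variable names are made up, and the one-second sleep is only there to give the reader time to consume the line before its state is checked.

```shell
tmpdir=$(mktemp -d)          # private directory for the FIFO
f=$tmpdir/f
mkfifo "$f"

cat "$f" > "$tmpdir/out" &   # reader, in the background
reader=$!

exec 3>"$f"                  # the shell itself now holds f open for writing
echo ls >&3                  # echo writes through descriptor 3 and returns
sleep 1                      # give the reader time to consume the line

# The reader has seen "ls" but no end-of-file, so it is still running.
if kill -0 "$reader" 2>/dev/null; then state=running; else state=finished; fi

exec 3>&-                    # close the last writer: the reader now sees end-of-file
wait "$reader"
received=$(cat "$tmpdir/out")

printf 'before close: %s, received: %s\n' "$state" "$received"
rm -rf "$tmpdir"
```

The reader should be reported as still running until exec 3>&- closes the descriptor, at which point wait returns.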
