If I call some command, for instance an echo, I can use the results from that command in several other commands with tee. Example:
echo "Hello world!" | tee >(command1) >(command2) >(command3)
With cat I can collect the results of several commands. Example:
cat <(command1) <(command2) <(command3)
I would like to be able to do both things at the same time, so that I can use tee to call those commands on the output of something else (for instance the echo I've written) and then collect all their results on a single output with cat.
It's important to keep the results in order: the lines in the output of command1, command2 and command3 should not be interleaved, but ordered as the commands are (as happens with cat).
There may be better options than cat and tee, but those are the ones I know so far.
I want to avoid using temporary files because the size of the input and output may be large.
How could I do this?
PS: another problem is that this happens in a loop, which makes handling temporary files harder. This is the code I currently have; it works for small test cases, but it creates infinite loops when reading from and writing to auxfile, in a way I don't understand.
somefunction()
{
    if [ $1 -eq 1 ]
    then
        echo "Hello world!"
    else
        somefunction $(( $1 - 1 )) > auxfile
        cat <(command1 < auxfile) \
            <(command2 < auxfile) \
            <(command3 < auxfile)
    fi
}
Readings and writings in auxfile seem to be overlapping, causing everything to explode.
Best Answer
You could use a combination of GNU stdbuf and pee from moreutils.
pee popen(3)s the 3 shell command lines it is given, then freads the input and fwrites it to all three, which will be buffered up to 1M.
The idea is to have a buffer at least as big as the input. This way, even though the three commands are started at the same time, they will only see input coming in when pee pcloses the three commands sequentially.
Upon each pclose, pee flushes the buffer to the command and waits for its termination. That guarantees that as long as those cmdx commands don't start outputting anything before they've received any input (and don't fork a process that may continue outputting after their parent has returned), the output of the three commands won't be interleaved.
In effect, that's a bit like using a temp file in memory, with the drawback that the 3 commands are started concurrently.
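As a rough illustration of the stdbuf and pee combination described above, the invocation could look something like the sketch below. It assumes pee (moreutils) and GNU stdbuf are installed and that the input fits in the 1M buffer; the -o 1M option is an assumption based on the buffer size mentioned in the text, hello_pee is a hypothetical wrapper name, and the three tr commands merely stand in for command1, command2 and command3.

```shell
# Sketch only: needs pee (moreutils) and stdbuf (GNU coreutils).
# hello_pee is a hypothetical name; the tr commands are placeholders
# for command1, command2 and command3.
hello_pee() {
  echo "Hello world!" |
    stdbuf -o 1M pee 'tr a-z A-Z' 'tr o 0' 'tr l L'
}

# Only try it when both tools are available.
if command -v pee >/dev/null 2>&1 && command -v stdbuf >/dev/null 2>&1; then
  hello_pee
fi
```

With an input this small, the three outputs should come out one after the other, in the order the commands were given.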
To avoid starting the commands concurrently, you could write pee as a shell function. But beware that shells other than zsh would fail for binary input with NUL characters.
That avoids using temporary files, but that means the whole input is stored in memory.
In any case, you'll have to store the input somewhere, in memory or a temp file.
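A minimal sketch of such a shell function follows. The name pee_seq is hypothetical (chosen here so it does not shadow the real pee), and the tr commands in the usage line are placeholders. It reads all of stdin into a variable and then runs each command line in turn on a copy of it, so the outputs cannot interleave; as noted above, NUL bytes in the input will not survive in most shells.

```shell
# pee_seq CMD...  (hypothetical name)
# Run each shell command line in turn on a copy of standard input.
pee_seq() (
  # Slurp stdin. The extra "." keeps trailing newlines from being
  # stripped by the command substitution.
  input=$(cat; echo .)
  for cmd do
    # Drop the "." sentinel and feed the saved input to the command.
    printf %s "${input%.}" | eval "$cmd"
  done
)
```

For example, `echo "Hello world!" | pee_seq 'tr a-z A-Z' 'tr o 0' 'tr l L'` prints the three transformed lines in that order. The body runs in a subshell, so the input variable does not leak into the caller.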
Actually, it's quite an interesting question, as it shows us the limits of the Unix idea of having several simple tools cooperate on a single task.
Here, we'd like to have several tools cooperate on the task: a source command (echo), a dispatcher command (tee), some filter commands (cmd1, cmd2, cmd3), and an aggregation command (cat).
It would be nice if they could all run together at the same time and do their hard work on the data that they're meant to process as soon as it's available.
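In the case of one filter command, it's easy: all commands are run concurrently, and cmd1 starts to munch data from src as soon as it's available.
Now, with three filter commands, we can still do the same: start them concurrently and connect them with pipes. In the pipe numbering used below, pipes 2, 3 and 4 carry data from tee to cmd1, cmd2 and cmd3, and pipes 5, 6 and 7 carry the output of cmd1, cmd2 and cmd3 to cat.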
Which we can do relatively easily with named pipes. (In such code, the } 3<&0 is to work around the fact that & redirects stdin from /dev/null, and <> is used so that opening the pipes does not block until the other end (cat) has opened them as well.)
Or, to avoid named pipes, a bit more painfully with a zsh coproc.
Now, the question is: once all the programs are started and connected, will the data flow?
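One way to experiment with that question is a small named-pipe version of pee. The following is only a sketch under stated assumptions: pee_fifo and the FIFO names in1..in3 and out1..out3 are hypothetical, mktemp -d is assumed to be available, and the redirections are appended to each command line as a whole (so a command line that is itself a pipeline would need extra care). It keeps the } 3<&0 workaround, but instead of opening the output pipes with <> it has the function open the three read ends itself before tee starts, so that the output of a filter that finishes early is not lost.

```shell
# pee_fifo CMD1 CMD2 CMD3  (hypothetical name)
# Feed stdin to three shell command lines through FIFOs and print their
# outputs in order. It can still deadlock once the filters' output no
# longer fits in the pipes.
pee_fifo() (
  dir=$(mktemp -d) || exit
  mkfifo "$dir/in1" "$dir/in2" "$dir/in3" \
         "$dir/out1" "$dir/out2" "$dir/out3" || exit

  # Start the filters. Each one blocks opening its output FIFO until the
  # read end is opened below, then blocks on its input until tee starts.
  eval "$1"' > "$dir/out1" < "$dir/in1" &'
  eval "$2"' > "$dir/out2" < "$dir/in2" &'
  eval "$3"' > "$dir/out3" < "$dir/in3" &'

  # Hold the three read ends open in this shell.
  exec 4< "$dir/out1" 5< "$dir/out2" 6< "$dir/out3"

  # A background command gets /dev/null as stdin, so the real stdin is
  # handed to tee on fd 3.
  { tee "$dir/in1" "$dir/in2" "$dir/in3" > /dev/null <&3 3<&- & } 3<&0

  # Aggregate in order: all of CMD1's output, then CMD2's, then CMD3's.
  cat <&4
  cat <&5
  cat <&6
  wait
  rm -rf "$dir"
)
```

For instance, `printf 'abc\n' | pee_fifo 'tr a A' 'tr b B' 'tr c C'` prints Abc, aBc and abC on three lines. Feeding it progressively larger inputs is a simple way to observe where the data stops flowing.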
We've got two constraints:
- tee feeds all its outputs at the same rate, so it can only dispatch data at the rate of its slowest output pipe.
- cat will only start reading from the second pipe (pipe 6, fed by cmd2) when all data has been read from the first (pipe 5, fed by cmd1).
What that means is that data will not flow in pipe 6 until cmd1 has finished. And, like in the case of the tr b B above, that may mean that data will not flow in pipe 3 either, which means it will not flow in any of pipes 2, 3 or 4 since tee feeds at the slowest rate of all 3.
In practice those pipes have a non-zero size, so some data will manage to get through, and on my system at least, I can get it to work up to:
Beyond that, with
We've got a deadlock, where we're in this situation:
We've filled pipes 3 and 6 (64 KiB each). tee has read that extra byte and has fed it to cmd1, but:
- tee is now blocked writing on pipe 3, waiting for cmd2 to empty it;
- cmd2 can't empty it because it's blocked writing on pipe 6, waiting for cat to empty it;
- cat can't empty it because it's waiting until there's no more input on pipe 5;
- cmd1 can't tell cat there's no more input because it is itself waiting for more input from tee;
- tee can't tell cmd1 there's no more input because it's blocked... and so on.
We've got a dependency loop and thus a deadlock.
Now, what's the solution? Bigger pipes 3 and 4 (big enough to contain all of src's output) would do it. We could do that for instance by inserting pv -qB 1G between tee and cmd2/3, where pv could store up to 1G of data waiting for cmd2 and cmd3 to read them. That would mean two things though:
- that potentially uses a lot of memory and, moreover, duplicates the data;
- cmd2 would in reality only start to process data when cmd1 has finished.
A solution to the second problem would be to make pipes 6 and 7 bigger as well. Assuming that cmd2 and cmd3 produce as much output as they consume, that would not consume more memory.
The only way to avoid duplicating the data (in the first problem) would be to implement the retention of data in the dispatcher itself, that is, to implement a variation on tee that can feed data at the rate of the fastest output (holding data to feed the slower ones at their own pace). Not really trivial.
So, in the end, the best we can reasonably get without programming is probably something like (Zsh syntax):