The data doesn’t need to be stored in RAM. Pipes block their writers if the readers aren’t there or can’t keep up; under Linux (and most other implementations, I imagine) there’s some buffering but that’s not required.

As mentioned by mtraceur and JdeBP (see the latter’s answer), early versions of Unix buffered pipes to disk, and this is how they helped limit memory usage: a processing pipeline could be split up into small programs, each of which would process some data, within the limits of the disk buffers. Small programs take less memory, and the use of pipes meant that processing could be serialised: the first program would run, fill its output buffer, be suspended, then the second program would be scheduled, process the buffer, etc.

Modern systems are orders of magnitude larger than the early Unix systems, and can run many pipes in parallel; but for huge amounts of data you’d still see a similar effect (and variants of this kind of technique are used for “big data” processing).
In your example,

sed 'simplesubstitution' file | sort | uniq > file2

`sed` reads data from `file` as necessary, then writes it as long as `sort` is ready to read it; if `sort` isn't ready, the write blocks. The data does indeed live in memory eventually, but that's specific to `sort`, and `sort` is prepared to deal with any issues (it will use temporary files if the amount of data to sort is too large).
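You can provoke that spill-to-disk behaviour on purpose; a small sketch, assuming GNU coreutils (`shuf` and sort's `-S` buffer-size option are GNU extensions):

```shell
# cap GNU sort's in-memory buffer at 1 MiB; sorting ~7 MB of shuffled
# numbers then forces it to write intermediate runs to temporary files
seq 1000000 | shuf | sort -n -S 1M | tail -n 1
# -> 1000000
```

The result is identical to an all-in-memory sort; only the working-set size changes.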
You can see the blocking behaviour by running

strace seq 1000000 -1 1 | (sleep 120; sort -n)

This produces a fair amount of data and pipes it to a process which isn't ready to read anything for the first two minutes. You'll see a number of `write` operations go through, but very quickly `seq` will stop and wait for the two minutes to elapse, blocked by the kernel (the `write` system call waits).
Since the accepted answer uses `perl`, you can just as well do the whole thing in `perl`, without other non-standard tools and non-standard shell features, and without loading unpredictably long chunks of data into memory, or other such horrible misfeatures.
The `ytee` script from the end of this answer, when used in this manner:

ytee command filter1 filter2 filter3 ...

will work just like

command <(filter1) <(filter2) <(filter3) ...

with its standard input piped to `filter1`, `filter2`, `filter3`, ... in parallel, as if it were with

tee >(filter1) >(filter2) >(filter3) ...
Example:
echo 'Line 1
Line B
Line iii' | ytee 'paste' 'sed s/B/b/g | nl' 'sed s/iii/III/ | nl'
1 Line 1 1 Line 1
2 Line b 2 Line B
3 Line iii 3 Line III
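For comparison, a sketch of doing the same without `ytee` in bash: each process substitution reads the stream independently, so the input has to be duplicated by hand for every filter (which is part of what the script saves you from):

```shell
input='Line 1
Line B
Line iii'
# each filter gets its own copy of the input; paste waits for both,
# producing the same two-column table as the ytee example above
paste <(printf '%s\n' "$input" | sed s/B/b/g | nl) \
      <(printf '%s\n' "$input" | sed s/iii/III/ | nl)
```

With `ytee`, the input is read once and fanned out to the filters instead.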
This is also an answer for the two very similar questions: here and here.
ytee:
#! /usr/bin/perl
# usage: ytee [-r irs] { command | - } [filter ..]
use strict;
use IPC::Open2;
use Fcntl;

# -r sets the input record separator (evaluated as perl code);
# otherwise, when stdin is not a terminal, read 32 KB blocks
if($ARGV[0] =~ /^-r(.+)?/){ shift; $/ = eval($1 // shift); die $@ if $@ }
elsif(! -t STDIN){ $/ = \0x8000 }

my $cmd = shift;
my @cl;    # one [read-handle, write-handle, pid] triple per filter
for(@ARGV){
    my $pid = open2 my $from, my $to, $_;
    push @cl, [$from, $to, $pid];
}

defined(my $pid = fork) or die "fork: $!";
if($pid){
    # parent: feed stdin to every filter still accepting input
    delete $$_[0] for @cl;    # close the read ends
    $SIG{PIPE} = 'IGNORE';    # a dead filter is a failed syswrite, not a fatal signal
    my $s = 0;
    while(<STDIN>){
        my $n = 0;            # filters successfully written to in this round
        for my $c (@cl){
            next unless exists $$c[1];
            syswrite($$c[1], $_) ? $n++ : delete $$c[1]
        }
        last unless $n;       # every filter is gone: stop reading
    }
    delete $$_[1] for @cl;    # close the write ends so the filters see EOF
    # collect exit statuses: the command contributes 1, each failed filter 2
    while((my $p = wait) > 0){ $s += !!$? << ($p != $pid) }
    exit $s;
}

# child: run the command with the filters' outputs as arguments
delete $$_[1] for @cl;
if($cmd eq '-'){
    # no command: interleave the filters' outputs on stdout, record by record
    my $n; do {
        $n = 0; for my $c (@cl){
            next unless exists $$c[0];
            if(defined(my $d = readline $$c[0])){ print $d; $n++ }
            else{ delete $$c[0] }
        }
    } while $n;
}else{
    exec join ' ', $cmd, map {
        # clear close-on-exec so the descriptor survives the exec
        fcntl $$_[0], F_SETFD, fcntl($$_[0], F_GETFD, 0) & ~FD_CLOEXEC;
        '/dev/fd/'.fileno $$_[0]
    } @cl;
    die "exec $cmd: $!";
}
notes:

- code like `delete $$_[1] for @cl` will not only remove the file handles from the array, but will also close them immediately, because there's no other reference pointing to them; this is different from (properly) garbage-collected languages like JavaScript.
- the exit status of `ytee` will reflect the exit statuses of the command and filters; this could be changed/simplified.
Best Answer
I'm going to walk you through a somewhat complex example, based on a real life scenario.
Problem
Let's say the command `conky` stopped responding on my desktop, and I want to kill it manually. I know a little bit of Unix, so I know that what I need to do is execute the command `kill <PID>`. In order to retrieve the PID, I can use `ps` or `top` or whatever tool my Unix distribution has given me. But how can I do this in one command?

Answer
DISCLAIMER: This command only works in certain cases. Don't copy/paste it into your terminal and start using it; it could kill processes you didn't intend to. Rather, learn how to build it.
How it works
1-
ps aux
This command will output the list of running processes and some info about them. The interesting info is that it'll output the PID of each process in its 2nd column. Here's an extract from the output of the command on my box:
2-

grep conky

I'm only interested in one process, so I use `grep` to find the entry corresponding to my program `conky`.

3-

grep -v grep
As you can see in step 2, the command `ps` outputs the `grep conky` process in its list (it's a running process after all). In order to filter it out, I can run `grep -v grep`. The option `-v` tells `grep` to match all the lines excluding the ones containing the pattern.

NB: I would love to know a way to do steps 2 and 3 in a single `grep` call.

4-
awk '{print $2}'
Now that I have isolated my target process, I want to retrieve its PID. In other words, I want to retrieve the 2nd word of the output. Lucky for me, most (all?) modern unices provide some version of `awk`, a scripting language that does wonders with tabular data. Our task becomes as easy as `print $2`.
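On a made-up ps-style line (the user name and PID here are hypothetical), the field extraction looks like this:

```shell
# awk splits each line on runs of whitespace; $2 is the second field,
# which is where ps aux puts the PID
printf 'myuser  1948  0.0  0.1  conky\n' | awk '{print $2}'
# -> 1948
```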
5-
xargs kill
I have the PID. All I need is to pass it to `kill`. To do this, I will use `xargs`. `xargs kill` will read from its input (in our case, from the pipe), form a command consisting of `kill <items>` (where `<items>` is whatever it read from the input), and then execute the command it created. In our case it will execute `kill 1948`.
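Assembled, the whole pipeline is `ps aux | grep conky | grep -v grep | awk '{print $2}' | xargs kill`. You can rehearse the extraction part safely on canned, hypothetical ps-style extracts (no `kill` involved); as a bonus, the bracket trick `[c]onky` answers the NB from step 3 by folding steps 2 and 3 into one `grep`:

```shell
# canned, hypothetical extract standing in for `ps aux`
ps_out='myuser   1948  0.0  0.1 conky
myuser   2001  0.0  0.0 grep conky'

# steps 2-4: isolate the conky line, drop the grep line, print the PID
printf '%s\n' "$ps_out" | grep conky | grep -v grep | awk '{print $2}'
# -> 1948

# the bracket trick folds steps 2 and 3 into one grep: while it runs,
# ps shows its command line as "grep [c]onky", and the regex [c]onky
# matches "conky" but not "[c]onky" (c followed by ] is not "conky")
ps_out2='myuser   1948  0.0  0.1 conky
myuser   2001  0.0  0.0 grep [c]onky'
printf '%s\n' "$ps_out2" | grep '[c]onky' | awk '{print $2}'
# -> 1948
```

Pipe either result into `xargs kill` only once you have verified it prints exactly the PID you expect.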
Mission accomplished.

Final words
Note that depending on which version of Unix you're using, certain programs may behave a little differently (for example, `ps` might output the PID in column $3). If something seems wrong or different, read your vendor's documentation (or better, the `man` pages). Also be careful, as long pipes can be dangerous. Don't make any assumptions, especially when using commands like `kill` or `rm`. For example, if there were another user named 'conky' (or 'Aconkyous'), my command might kill all their running processes too! What I'm saying is: be careful, especially with long pipes. It's always better to build them interactively, as we did here, than to make assumptions and feel sorry later.