I ran sed on a large file, and used the pv utility to see how quickly it's reading input and writing output. Although pv showed that sed read the input and wrote the output within about 5 seconds, sed did not exit for another 20-30 seconds. Why is this?
Here's the output I saw:
pv -cN source input.txt | sed "24629045,24629162d" | pv -cN output > output.txt
source: 2.34GB 0:00:06 [ 388MB/s] [==========================================================================================================>] 100%
output: 2.34GB 0:00:05 [ 401MB/s] [ <=> ]
Best Answer
There are two reasons. In the first place, you don't tell it to quit. When a sed script only prints the first half of its input lines, it must still read the rest of them through to EOF. If you add a q (quit) command at the last line you want, it will quit right away there.
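As a small illustration of the difference, here is a sketch that uses seq to generate ten lines of stand-in input (an assumption for the example, not the asker's file):

```shell
# Prints lines 1-5, but sed still reads lines 6-10 through to EOF.
seq 10 | sed -n '1,5p'

# Prints the same five lines, then quits as soon as line 5 is handled.
seq 10 | sed -n '1,5p;5q'
```

Both commands print the same lines; only the second stops reading early, which matters once the input is gigabytes rather than ten lines.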
You're also working with a delay between each process. So if pv buffers at 4kb, and sed buffers 4kb, then the last pv is 8kb behind input all the while. It is quite likely that the numbers are higher than that.

You can try the
-u switch with a GNU, BSD, or AST sed, but that's almost definitely not going to help performance on large inputs. If you call a GNU sed with -u, it will read() for every byte of input. I haven't looked at what the others do in that situation, but I have no reason to believe they would do any differently. All three document -u to mean unbuffered, and that's a pretty generally understood concept where streams are concerned.

Another thing you might do is explicitly line-buffer
sed's output with the w (write) command and one or more named write files. It will still slow things a little, but it probably will be better than the alternative.

You can do this with any
sed. sed's w (write) command is always immediate: it is unbuffered output. And because (by default) sed applies commands once per line-cycle, sed can easily be used to effectively line-buffer i/o even in the middle of a pipeline. That way, at least, you can keep the second pv pretty much up to date with sed the whole time by naming one of the /dev/fd/[num] links as sed's write file, though that assumes a system which provides those links (which is to say: practically any Linux-based system, excepting Android, and many others besides). Failing said links' availability, to do the same thing you could just explicitly create your own pipe with mkfifo, use it as the last pv's stdin, and name it as sed's write file.
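The mkfifo variant might look roughly like the sketch below. It is an illustration under stated assumptions, not the exact command from this answer: seq stands in for the large input, cat stands in for the last pv so the sketch runs without pv installed, the range 4,6 stands in for the asker's 24629045,24629162, and the names dir and fifo are hypothetical.

```shell
# Hypothetical scratch directory holding the named pipe and the output.
dir=$(mktemp -d)
fifo=$dir/sed.fifo
mkfifo "$fifo"

# Start the reader first, in the background. In the real pipeline this
# would be the last pv, e.g.: pv -cN output <"$fifo" >output.txt &
cat <"$fifo" >"$dir/output.txt" &

# -n suppresses sed's normal (buffered) stdout. Every line outside the
# deleted range is passed to the w command, which writes it to the fifo.
seq 10 | sed -n "4,6!w $fifo"

# Let the reader drain the pipe and see EOF, then show the result.
wait
cat "$dir/output.txt"
```

When you are done, remove the scratch directory with rm -r "$dir". Whether this is actually faster than plain buffered output on a multi-gigabyte file is something to measure rather than assume.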