Why doesn’t sed exit immediately after writing the output?


I ran sed on a large file, and used the pv utility to see how quickly it's reading input and writing output. Although pv showed that sed read the input and wrote the output within about 5 seconds, sed did not exit for another 20-30 seconds. Why is this?

Here's the output I saw:

pv -cN source input.txt | sed "24629045,24629162d" | pv -cN output > output.txt
   source: 2.34GB 0:00:06 [ 388MB/s] [==========================================================================================================>] 100%            
   output: 2.34GB 0:00:05 [ 401MB/s] [              <=>                                                                                                           ]

Best Answer

There are two reasons. In the first place, you don't tell it to quit.

Consider:

seq 10 | sed -ne1,5p

In that case, though sed only prints the first five lines, it must still read the rest of the input through to EOF. Instead:

seq 10 | sed 5q

There it quits as soon as line 5 has been printed.
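As a small-scale sketch of the difference (the ranges 4,6 and 5 are stand-ins chosen for illustration, not the asker's line numbers): when a range is deleted from the middle, the lines after it still have to be printed, so sed has no choice but to read through to EOF. Quitting early only helps when nothing past a certain line is wanted.

```shell
# Delete a range from the middle: sed must read all 10 lines,
# because lines 7-10 still have to be copied to the output.
deleted=$(seq 10 | sed '4,6d' | paste -s -d ' ' -)
echo "$deleted"

# Only the leading lines are wanted: q lets sed stop at line 5
# instead of reading the remaining input.
head_only=$(seq 10 | sed 5q | paste -s -d ' ' -)
echo "$head_only"
```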

The second reason is buffering: each process in the pipeline adds a delay. If pv buffers 4 KB and sed buffers 4 KB, then the last pv is always 8 KB behind the input. The real numbers are quite likely higher than that.

You can try the -u switch with a GNU, BSD, or AST sed, but that almost certainly won't help performance on large inputs. If you call GNU sed with -u, it will read() for every byte of input. I haven't looked at what the others do in that situation, but I have no reason to believe they behave any differently. All three document -u to mean unbuffered, and that is a generally understood concept where streams are concerned.

Another thing you might do is explicitly line-buffer sed's output with the write command and one or more named write files. It will still slow things down a little, but it will probably be better than the alternative.

You can do this with any sed, like:

sed -n 'w outfile'

sed's write command is always immediate - its output is unbuffered. And because (by default) sed applies commands once per line cycle, it can easily be used to effectively line-buffer I/O even in the middle of a pipeline. That way, at least, you can keep the second pv pretty much up to date with sed the whole time, like:

pv ... | sed -n '24629045,24629162!w /dev/fd/1' | pv ...

...though that assumes a system which provides the /dev/fd/[num] links (which is to say: practically any Linux-based system, excepting Android, and many others besides). If those links aren't available, you can do the same thing by creating your own pipe with mkfifo, using it as the last pv's stdin, and naming it as sed's write file.
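A minimal sketch of the mkfifo variant, scaled down so it runs anywhere: seq 10 stands in for the input, the range 4,6 for the real line numbers, and a plain cat for the final pv. The scratch directory and the file names are made up for illustration.

```shell
# Create a named pipe in a scratch directory (hypothetical names).
dir=$(mktemp -d)
fifo=$dir/sed.fifo
mkfifo "$fifo"

# The reader stands in for the last pv. Start it in the background,
# since opening a fifo blocks until both ends are open.
cat "$fifo" > "$dir/out.txt" &

# -n suppresses sed's normal (buffered) stdout; every line outside
# the range 4,6 goes to the fifo through the w command instead.
seq 10 | sed -n '4,6!w '"$fifo"

# Let the reader drain the fifo and exit, then inspect the result.
wait
result=$(paste -s -d ' ' "$dir/out.txt")
echo "$result"
rm -r "$dir"
```

In a real pipeline the background reader would be the second pv, reading from the fifo and redirecting to the output file.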
