Why doesn’t sed exit immediately after writing the output?


I ran sed on a large file and used the pv utility to see how quickly it was reading input and writing output. Although pv showed that sed read the input and wrote the output within about 5 seconds, sed did not exit for another 20-30 seconds. Why is this?

Here's the output I saw:

pv -cN source input.txt | sed "24629045,24629162d" | pv -cN output > output.txt
   source: 2.34GB 0:00:06 [ 388MB/s] [==========================================================================================================>] 100%            
   output: 2.34GB 0:00:05 [ 401MB/s] [              <=>                                                                                                           ]

Best Answer

There are two reasons. In the first place, you don't tell it to quit.

Consider:

seq 10 | sed -ne1,5p

In that case, although sed prints only the first five lines of input, it must still read the rest of them through to EOF. Compare:

seq 10|sed 5q

There it quits as soon as it has printed line 5.
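To make the difference concrete, here is a minimal sketch on a ten-line input. The variable names are just for illustration. Both commands produce the same output; the only difference is whether sed keeps reading after line 5.

```shell
# Both commands print lines 1-5 of the input.
printed=$(seq 10 | sed -n '1,5p')   # no q: sed reads on through line 10
quit=$(seq 10 | sed '5q')           # q: sed exits once line 5 is printed

# The visible output is identical either way.
[ "$printed" = "$quit" ] && echo "same output"
```

Note that in the question's command the lines after the deleted range are still wanted in the output, so sed has to read the whole file there regardless.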

Second, you're working with a delay between each process in the pipeline. If the first pv buffers 4 KB and sed buffers 4 KB, then the last pv is 8 KB behind the input the whole time. It is quite likely that the real numbers are higher than that.

You can try the -u switch with a GNU, BSD, or AST sed, but that is almost certainly not going to help performance on large inputs. If you call GNU sed with -u, it will read() for every byte of input. I haven't looked at what the others do in that situation, but I have no reason to believe they behave any differently. All three document -u to mean unbuffered, and that's a pretty generally understood concept where streams are concerned.

Another thing you might do is explicitly line-buffer sed's output with the write command and one or more named write files. It will still slow things down a little, but it will probably be better than the alternative.

You can do this with any sed, like:

sed -n 'w outfile'

sed's write command is always immediate: it is unbuffered output. And because (by default) sed applies commands once per line cycle, sed can easily be used to effectively line-buffer I/O even in the middle of a pipeline. That way, at least, you can keep the second pv pretty much up to date with sed the whole time, like:

pv ... | sed -n '24629045,24629162!w /dev/fd/1' | pv ...

...though that assumes a system which provides the /dev/fd/[num] links (which is to say: practically any Linux-based system, excepting Android, and many others besides). If those links are not available, you could do the same thing by explicitly creating your own pipe with mkfifo, using it as the last pv's stdin, and naming it as sed's write file.
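As a rough small-scale sketch of both variants: the snippet below uses seq 10 as the input, deletes lines 3-5 in place of the real range, and lets cat stand in for the last pv. The scratch directory and file names are hypothetical; adjust them to taste.

```shell
# Variant 1: write through /dev/fd/1 (assumes the system provides /dev/fd).
# Lines 3-5 play the role of the deleted range.
via_fd=$(seq 10 | sed -n '3,5!w /dev/fd/1')

# Variant 2: the same thing through a named pipe.
dir=$(mktemp -d)                 # scratch directory (hypothetical location)
fifo=$dir/sed.fifo               # hypothetical FIFO name
mkfifo "$fifo"

# Reader: cat stands in for the last pv. It is started in the background
# before sed, because opening a FIFO blocks until the other end is opened.
cat "$fifo" > "$dir/output.txt" &

# Writer: sed names the FIFO as its write file.
seq 10 | sed -n "3,5!w $fifo"

wait                             # let the reader finish draining the pipe
via_fifo=$(cat "$dir/output.txt")
rm -r "$dir"

printf '%s\n' "$via_fifo"
```

In the real pipeline the background reader would be something like pv -cN output reading from the FIFO and writing to output.txt.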

Related Solutions

Shell – Best Way to Find Multiple Strings in Large Text File

grep can read its patterns from a file with -f, and it becomes more efficient if you also tell it to treat them as fixed strings (-F):

grep -F -f patterns.txt "//'MVS.DATASET'"
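For a self-contained illustration, here is a small sketch with made-up data. The file names and contents are hypothetical. It also shows that with -F a pattern such as b.c is matched literally, not as a regular expression.

```shell
dir=$(mktemp -d)                                   # scratch directory (hypothetical)
printf 'alpha\nb.c\n' > "$dir/patterns.txt"        # one fixed string per line
printf 'alpha one\nbxc two\nb.c three\ngamma\n' > "$dir/data.txt"

# -f reads the patterns from the file; -F treats each one as a literal string,
# so "bxc two" does not match the pattern "b.c".
matches=$(grep -F -f "$dir/patterns.txt" "$dir/data.txt")
printf '%s\n' "$matches"

rm -r "$dir"
```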

Python – case sensitive substitution; same target ids

You can't really do this kind of thing with sed; it's just a text stream editor. Try this Perl scriptlet:

#!/usr/bin/env perl 

## Set the record separator to \n\n to
## read multiple lines as a single record
$/="\n\n";
## This array will contain all lines of the file
my @lines=<>;

## The list of suffixes
@suffix=('a'..'z');

## For each line of the input file
foreach (@lines) {
    ## If the current line (lines are now the actual multiline records
    ## because we set $/ to consecutive newlines) is one we are interested in.
    if (/isoforms.*?Target=(\S+)/s){
        ## Keep a count of how often each target was seen
        $seen{$1}++;
    }
}
## Now that we have processed the entire file
## go back and print each line.
foreach (@lines) {

    ## If this line is one of the ones we're interested in
    if(/Name=(.+?);.*?isoforms=.*?Target=(\S+)/s){
        $name=$1; $target=$2;
        ## Count how many times we've seen this target so far
        ## in this second pass.
        $newseen{$target}++;
        ## If this target exists more than once in the input file
        if ($seen{$target}>1) {
            ## Use the %newseen hash to choose the right letter.
            ## The -1 is needed because the first element of an
            ## array is 0, not 1. \Q...\E makes the name match literally.
            s/\Q$name\E/$target.$suffix[$newseen{$target}-1]/;
        }
        else {
            s/\Q$name\E/$target/;
        }
    }
    print;
}

Save the script above as foo.pl, make it executable (chmod a+x foo.pl), and run it on your input file:

./foo.pl input.txt > output.txt
