I often have to edit a large file by removing a few lines from the middle of it. I know which lines I wish to remove and I typically do the following:
```shell
sed "linenum1,linenum2 d" input.txt > input.temp
```

or in place by adding the `-i` option. Since I know the line numbers, is there a command to avoid stream-editing and just remove the particular lines? `input.txt` can be as large as 50 GB.
Best Answer
What you could do to avoid writing a copy of the file is to write the file over itself like:
Dangerous as you've no backup copy there.
Or, avoiding `sed`, stealing part of manatwork's idea:

That could still be improved, because you're overwriting the first `l1 - 1` lines over themselves when you don't need to, but avoiding that would mean a bit more involved programming, for instance doing everything in `perl`, which may end up less efficient:

Some timings for removing lines 1000000 to 1000050 from the output of `seq 1e7`:

`sed -i "$l1,$l2 d" file`: 16.2s

They all work on the same principle: we open two file descriptors to the file, one in read-only mode (fd 0) using
`< file` (short for `0< file`), and one in read-write mode (fd 1) using `1<> file` (`<> file` would be `0<> file`). Those file descriptors point to two open file descriptions, each of which has a current cursor position within the file associated with it.

In the second solution, for instance, the first
`head -n "$(($l1 - 1))"` will read `$l1 - 1` lines' worth of data from fd 0 and write that data to fd 1. So at the end of that command, the cursor on both open file descriptions associated with fds 0 and 1 will be at the start of the `$l1`-th line.

Then, in
`head -n "$(($l2 - $l1 + 1))" > /dev/null`, `head` will read `$l2 - $l1 + 1` lines from the same open file description through its fd 0, which is still associated with it, so the cursor on fd 0 will move to the beginning of the line after the `$l2`-th one. But its fd 1 has been redirected to `/dev/null`, so upon writing to fd 1 it will not move the cursor in the open file description pointed to by `{...}`'s fd 1.

So, upon starting
`cat`, the cursor on the open file description pointed to by fd 0 will be at the start of the next line after `$l2`, while the cursor on fd 1 will still be at the beginning of the `$l1`-th line. In other words, that second `head` will have skipped the lines to remove on input but not on output. Now `cat` will overwrite the `$l1`-th line with the next line after `$l2`, and so on.

`cat` will return when it reaches end-of-file on fd 0. But fd 1 will then point to somewhere in the file that has not been overwritten yet. That part has to go away; it corresponds to the space occupied by the deleted lines, now shifted to the end of the file. What we need is to truncate the file at the exact location fd 1 points to now.

That's done with the
`ftruncate` system call. Unfortunately, there's no standard Unix utility to do that, so we resort to `perl`. `tell STDOUT` gives us the current cursor position associated with fd 1, and we truncate the file at that offset using `perl`'s interface to the `ftruncate` system call: `truncate`.

In the third solution, we replace the writing to fd 1 of the first `head` command with one `lseek` system call.