Fast way to extract lines from a large file based on line numbers stored in another file

awk, perl, sed, text processing

I have a large file with 80 billion lines. I want to extract a few lines (around 10,000) whose line numbers I already know. What is the fastest way to do this?

Is it possible to extract those lines using another file which contains the line numbers? The line numbers in that file will not always be consecutive.

For example, the original file is:

0.1
0.2
0.3
0.4
...

the line number file:

1
3
4

the output:

0.1
0.3
0.4

Best Answer

Here is an alternative method, plus a bit of benchmarking to complement what is in Weijun Zhou's answer.

join

Assuming you have a data file you want to extract rows from and a line_numbers file that lists the numbers of the rows you want to extract, if the sorting order of the output is not important you can use:

join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | cut -d ' ' -f 2-

This numbers the lines of your data file, joins the result with the padded_line_numbers file on the first field (the default) and prints the common lines, excluding the join field itself, which is cut away.

join needs its input files to be sorted lexically on the join field. That is why the padded_line_numbers file has to be prepared first, by left-padding each line of your line_numbers file with zeros so that lexical and numerical order coincide, e.g.:

while read rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers >padded_line_numbers

The -w 12 -n rz options tell nl to output 12-digit line numbers padded with leading zeros.
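
For illustration, here is the whole thing run on the four-line example from the question, with line_numbers containing 1, 3 and 4 (nl separates the number from the content with a tab by default; whitespace is shown schematically):

$ cat padded_line_numbers
000000000001
000000000003
000000000004

$ nl -w 12 -n rz data
000000000001    0.1
000000000002    0.2
000000000003    0.3
000000000004    0.4

$ join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | cut -d ' ' -f 2-
0.1
0.3
0.4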

If the sorting order of the output has to match that of your line_numbers file, you can use:

join -1 2 -2 1 <(nl padded_line_numbers | sort -k 2,2) \
    <(nl -w 12 -n rz data) |
    sort -k 2,2n |
    cut -d ' ' -f 3-

Here we number the padded_line_numbers file, sort the result lexically by its second field, join it with the numbered data file, numerically sort the outcome back into the original order of padded_line_numbers and finally cut away the two bookkeeping fields.
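
As a small illustration of the bookkeeping, suppose (hypothetically) that line_numbers contained 3, 1 and 4 in that order. Before the final sort and cut, the join output lists the data in padded-number order, with each row's original position in line_numbers carried along as the second field:

$ join -1 2 -2 1 <(nl padded_line_numbers | sort -k 2,2) <(nl -w 12 -n rz data)
000000000001 2 0.1
000000000003 1 0.3
000000000004 3 0.4

sort -k 2,2n then restores the requested order and cut -d ' ' -f 3- leaves only the data:

0.3
0.1
0.4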

Process substitution is used here for convenience. If you cannot or do not want to rely on it, and, as is likely, you are not willing to spend the storage needed for regular files holding the intermediate results, you can use named pipes instead:

mkfifo padded_line_numbers
mkfifo numbered_data

while read rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers | nl | sort -k 2,2 >padded_line_numbers &

nl -w 12 -n rz data >numbered_data &

join -1 2 -2 1 padded_line_numbers numbered_data | sort -k 2,2n | cut -d ' ' -f 3-

Benchmarking

Since the distinguishing feature of your question is the sheer number of rows in the data file, I thought it could be useful to test the alternative approaches on a comparable amount of data.

For my tests I used a 3.2-billion-line data file. Each line is just 2 bytes of garbage coming from openssl enc, hex-encoded with od -An -tx1 -w2 and stripped of spaces with tr -d ' ':

$ head -n 3 data
c15d
061d
5787

$ wc -l data
3221254963 data
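
The exact generation command is not shown above. Purely as a sketch, a pipeline that produces a file in the same format, assuming openssl enc in CTR mode as the pseudo-random byte source (the cipher and passphrase here are arbitrary choices, not the ones actually used), could be:

# -An: no address column; -tx1 -w2: two hex bytes per line;
# -v is added here so od does not collapse repeated lines
openssl enc -aes-256-ctr -pass pass:seed -nosalt </dev/zero 2>/dev/null |
    od -An -tx1 -w2 -v |
    tr -d ' ' |
    head -n 3221254963 >data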

The line_numbers file was created by randomly choosing 10,000 numbers between 1 and 3,221,254,963, without repetition, using shuf from GNU Coreutils:

shuf -i 1-"$(wc -l <data)" -n 10000 >line_numbers

The testing environment was a laptop with an Intel i7-2670QM quad-core processor, 16 GiB of memory, SSD storage, GNU/Linux, bash 5.0 and GNU tools.
The only dimension I measured was execution time, by means of the time shell builtin.

Here are the results:

perl seems to be the fastest:

$ time perl_script line_numbers data | wc -l
10000

real    14m51.597s
user    14m41.878s
sys     0m9.299s
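
perl_script is not reproduced in this excerpt; it is presumably the script from Weijun Zhou's answer. Purely as a sketch of that general approach, and not the code that was actually benchmarked, a one-liner with the same interface could read the wanted line numbers into a hash and then scan data once:

perl -e '
    open my $ln, "<", shift or die $!;   # first argument: the line_numbers file
    my %want;
    $want{$_ + 0} = 1 for <$ln>;         # numify to drop the trailing newline
    my $left = keys %want;
    while (<>) {                         # remaining argument: the data file
        next unless $want{$.};
        print;
        last unless --$left;             # stop after the last wanted line
    }
' line_numbers data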

awk's performance looks comparable to join's below, though at about twice perl's time:

$ time awk 'FNR==NR { seen[$0]++ }; FNR!=NR && FNR in seen' line_numbers data | wc -l
10000

real    29m3.808s
user    28m52.616s
sys     0m10.709s

join, too, appears to be comparable:

$ time join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | wc -l
10000

real    28m24.053s
user    27m52.857s
sys     0m28.958s

Note that the order-preserving version described above carries essentially no performance penalty compared to this one.

Finally, sed appears to be significantly slower: I killed it after approximately nine hours:

$ time sed -nf <(sed 's/$/p/' line_numbers) data | wc -l
^C

real    551m12.747s
user    550m53.390s
sys     0m15.624s
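
For reference, the inner sed turns each line number into a p (print) command; on the question's example the generated program would be:

$ sed 's/$/p/' line_numbers
1p
3p
4p

With -n, sed prints only the addressed lines, but it still evaluates all 10,000 addresses against every single line of data, which helps explain why this approach falls so far behind.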