Text Processing – How to cat Line X to Line Y in a Huge File

Tags: cat, head, large files, tail

Say I have a huge text file (>2GB) and I just want to cat the lines X to Y (e.g. 57890000 to 57890010).

From what I understand I can do this by piping head into tail or vice versa, i.e.

head -A /path/to/file | tail -B

or alternatively

tail -C /path/to/file | head -D

where A,B,C and D can be computed from the number of lines in the file, X and Y.
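As a rough sketch of that arithmetic, assuming 1-based inclusive line numbers and a tail that accepts -n +N (the function names below are made up for illustration), neither form needs the total line count:

```shell
#!/bin/sh
# Hypothetical helpers. Arguments: X Y FILE, where X and Y are 1-based,
# inclusive line numbers.

# head first: keep the first Y lines, then the last (Y - X + 1) of those.
# Note: this returns the wrong lines if FILE has fewer than Y lines.
lines_head_tail() {
    head -n "$2" "$3" | tail -n "$(( $2 - $1 + 1 ))"
}

# tail first: start output at line X, then keep (Y - X + 1) lines.
lines_tail_head() {
    tail -n "+$1" "$3" | head -n "$(( $2 - $1 + 1 ))"
}
```

For the example in the question this would be called as lines_tail_head 57890000 57890010 /path/to/file.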

But there are two problems with this approach:

1. You have to compute A, B, C and D.
2. The commands could pipe to each other many more lines than I am interested in reading (e.g. if I am reading just a few lines in the middle of a huge file).

Is there a way to have the shell just work with and output the lines I want (while providing only X and Y)?

Best Answer

I suggest the sed solution, but for the sake of completeness,

awk 'NR >= 57890000 && NR <= 57890010' /path/to/file

To stop reading the file after the last line:

awk 'NR < 57890000 { next } { print } NR == 57890010 { exit }' /path/to/file
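If X and Y live in shell variables, the same awk program can take them through -v instead of hard-coding the numbers. This is only a sketch; print_range, first and last are names chosen here for illustration:

```shell
#!/bin/sh
# print_range X Y FILE: print lines X..Y (1-based, inclusive), then stop reading.
print_range() {
    awk -v first="$1" -v last="$2" '
        NR < first  { next }   # skip everything before the range
                    { print }  # inside the range: print the line
        NR == last  { exit }   # stop right after the last wanted line
    ' "$3"
}
```

Usage would be print_range 57890000 57890010 /path/to/file.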

Speed test (here on macOS, YMMV on other systems):

100,000,000-line file generated by seq 100000000 > test.in
Reading lines 50,000,000-50,000,010
Tests in no particular order
real time as reported by bash's builtin time

 4.373  4.418  4.395    tail -n+50000000 test.in | head -n10
 5.210  5.179  6.181    sed -n '50000000,50000010p;57890010q' test.in
 5.525  5.475  5.488    head -n50000010 test.in | tail -n10
 8.497  8.352  8.438    sed -n '50000000,50000010p' test.in
22.826 23.154 23.195    tail -n50000001 test.in | head -n10
25.694 25.908 27.638    ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574    awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359 50.919 51.127    awk 'NR >= 57890000 && NR <= 57890010' test.in

These are by no means precise benchmarks, but the difference is clear and repeatable enough* to give a good sense of the relative speed of each of these commands.

*: Except between the first two, sed -n p;q and head|tail, which seem to be essentially the same.
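The sed form with an early quit can be parameterised in the same way. The sketch below (sed_range and tail_range are hypothetical names) also checks, on a small seq-generated file standing in for test.in, that both forms return the same lines before any timing is attempted:

```shell
#!/bin/sh
# sed_range X Y FILE: print lines X..Y (1-based, inclusive) and quit at line Y
# so the rest of the file is never read.
sed_range() {
    sed -n "${1},${2}p;${2}q" "$3"
}

# tail_range X Y FILE: the tail | head form, sized to return the same lines.
tail_range() {
    tail -n "+${1}" "$3" | head -n "$(( $2 - $1 + 1 ))"
}

# Sanity check on a small sample file (a stand-in for the large test file).
sample=$(mktemp)
seq 1 1000 > "$sample"
if [ "$(sed_range 500 510 "$sample")" = "$(tail_range 500 510 "$sample")" ]; then
    echo "outputs match"
else
    echo "outputs differ"
fi
rm -f "$sample"
```

In bash, either function can then be timed with the builtin, e.g. time sed_range 50000000 50000010 test.in.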

Related Solutions

Emacs: Open a buffer with all lines between lines X to Y from a huge file

If you want to open the whole file (which requires ), but show only part of it in the editor window, use narrowing. Select the part of the buffer you want to work on and press C-x n n (narrow-to-region). Say “yes” if you get a prompt about a disabled command. Press C-x n w (widen) to see the whole buffer again. If you save the buffer, the complete file is saved: all the data is still there, narrowing only restricts what you see.

If you want to view only a part of a file, you can insert it into the current buffer with shell-command with a prefix argument (M-1 M-!); run the appropriate command to extract the desired lines, e.g. <huge.txt tail -n +57890000 | head -n 11 for lines 57890000 to 57890010.

There is also a Lisp function insert-file-contents which can take a byte range. You can invoke it with M-: (eval-expression):

(insert-file-contents "huge.txt" nil 456789000 456791000)

Note that you may run into the integer size limit (version- and platform-dependent, check the value of most-positive-fixnum).
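Since insert-file-contents works with byte offsets rather than line numbers, the offsets for a given line range have to be found first. One possible way to do that from the shell is sketched below; byte_range is a hypothetical helper, and it assumes the offsets are counted from zero with the end offset pointing just past the last wanted byte:

```shell
#!/bin/sh
# byte_range X Y FILE: print "BEG END", the zero-based byte offsets that
# bracket lines X..Y (END is the offset just past the end of line Y).
byte_range() {
    if [ "$1" -le 1 ]; then
        beg=0
    else
        # bytes occupied by the X-1 lines before the range
        beg=$(head -n "$(( $1 - 1 ))" "$3" | wc -c | tr -d ' ')
    fi
    # bytes occupied by lines 1..Y (tr strips the padding some wc versions add)
    end=$(head -n "$2" "$3" | wc -c | tr -d ' ')
    echo "$beg $end"
}
```

The two numbers printed by byte_range 57890000 57890010 huge.txt could then be pasted into the insert-file-contents call. Note that this reads the start of the file twice, so it is not free on a multi-gigabyte file.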

In theory it would be possible to write an Emacs mode that loads and saves parts of files transparently as needed (though the limit on integer sizes would make using actual file offsets impossible on 32-bit machines). The only effort in that direction that I know of is VLF (GitHub link here).

Only cat from specific line X (with a pattern) to other specific line Y (with a pattern)

sed -n '/foo/,/goo/p;/goo/q' <bigfile

That would print only those lines. If you wanted the line numbers you'd add an =.

sed -n '/foo/=;/goo/=;//q' <bigfile

The q is important because it makes sed quit as soon as it is reached; otherwise sed will continue to read the input file through to the end.

If you don't want to print foo/goo lines you can do instead:

With GNU sed:

sed -n '/foo/,/goo/!d;//!p;/goo/q
' <<\DATA
line1
foo 
line3
line4
line5
goo 
line7
DATA

OUTPUT

line3
line4
line5

And with any other sed:

sed -n '/foo/G;/\n/,/goo/!d;//q;/\n/!p 
' <<\DATA
line1
foo 
line3
line4
line5
goo 
line7
DATA

OUTPUT

line3
line4
line5

Either way, though, this also quits its input as soon as it encounters the last line in your search.
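For comparison, a similar effect can be had from awk. This is a sketch rather than a drop-in replacement for the sed commands: between, start, stop and inside are names invented here, and the patterns are passed as awk dynamic regexes, so backslashes in them would need extra care.

```shell
#!/bin/sh
# between START STOP FILE: print the lines strictly between the first line
# matching START and the next line matching STOP, then stop reading.
between() {
    awk -v start="$1" -v stop="$2" '
        inside && $0 ~ stop { exit }         # closing line: quit without printing it
        inside              { print }        # a line inside the block
        $0 ~ start          { inside = 1 }   # opening line: print from the next line on
    ' "$3"
}
```

Called as between foo goo bigfile on the sample data above, it should print line3 through line5.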
