Text Processing – How to cat Line X to Line Y in a Huge File

catheadlarge filestail

Say I have a huge text file (>2GB) and I just want to cat the lines X to Y (e.g. 57890000 to 57890010).

From what I understand I can do this by piping head into tail or viceversa, i.e.

head -A /path/to/file | tail -B

or alternatively

tail -C /path/to/file | head -D

where A,B,C and D can be computed from the number of lines in the file, X and Y.

But there are two problems with this approach:

  1. You have to compute A,B,C and D.
  2. The commands could pipe to each other many more lines than I am interested in reading (e.g. if I am reading just a few lines in the middle of a huge file)

Is there a way to have the shell just work with and output the lines I want? (while providing only X and Y)?

Best Answer

I suggest the sed solution, but for the sake of completeness,

awk 'NR >= 57890000 && NR <= 57890010' /path/to/file

To cut out after the last line:

awk 'NR < 57890000 { next } { print } NR == 57890010 { exit }' /path/to/file

Speed test (here on macOS, YMMV on other systems):

  • 100,000,000-line file generated by seq 100000000 > test.in
  • Reading lines 50,000,000-50,000,010
  • Tests in no particular order
  • real time as reported by bash's builtin time
 4.373  4.418  4.395    tail -n+50000000 test.in | head -n10
 5.210  5.179  6.181    sed -n '50000000,50000010p;57890010q' test.in
 5.525  5.475  5.488    head -n50000010 test.in | tail -n10
 8.497  8.352  8.438    sed -n '50000000,50000010p' test.in
22.826 23.154 23.195    tail -n50000001 test.in | head -n10
25.694 25.908 27.638    ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574    awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359 50.919 51.127    awk 'NR >= 57890000 && NR <= 57890010' test.in

These are by no means precise benchmarks, but the difference is clear and repeatable enough* to give a good sense of the relative speed of each of these commands.

*: Except between the first two, sed -n p;q and head|tail, which seem to be essentially the same.

Related Question