Linux – How to truncate file by lines

bash-scripting, linux

I have a large number of files, some of which are very long. I would like to truncate them to a certain size if they are larger, by removing the end of the file, but I only want to remove whole lines. How can I do this? It feels like the kind of thing the Linux toolchain would handle, but I don't know the right command.

For example, say I have a 120,000 byte file with 300-byte lines and I'm trying to truncate it to 10,000 bytes. The first 33 lines should stay (9900 bytes) and the remainder should be cut. I don't want to cut at 10,000 bytes exactly, since that would leave a partial line.

Of course the files are of differing lengths and the lines are not all the same length.

Ideally the resulting files would be made slightly shorter rather than slightly longer (if the breakpoint is on a long line), but that's not too important; it could be a little longer if that's easier. I would like the changes to be made directly to the files (well, possibly the new file is written elsewhere, the original deleted, and the new file moved into place, but that's the same from the user's POV). A solution that redirects data to a bunch of places and then back invites the possibility of corrupting the file, and I'd like to avoid that…

Best Answer

Using awk avoids the sed/wc complexity of the earlier answers. With the example from the question, this prints only the complete lines that fit within the first 10000 bytes:

awk '{i += length() + 1; if (i > 10000) exit; print}' myfile.txt

To also print the complete line that contains the 10000th byte, when that byte is not at the end of a line:

awk '{i += length() + 1; print; if (i >= 10000) exit}' myfile.txt
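As a quick check of the two variants, the sketch below builds a throwaway file matching the question's example (300-byte lines) and counts the bytes each command emits. The temporary file and the variable names `sample`, `short` and `long` are made up for this illustration.

```shell
# Hypothetical sample: 40 lines of 299 'x' characters plus a newline (300 bytes each)
sample=$(mktemp)
line=$(printf '%299s' '' | tr ' ' 'x')
n=0
while [ "$n" -lt 40 ]; do
    printf '%s\n' "$line"
    n=$((n + 1))
done > "$sample"

# Variant 1: only whole lines that end at or before byte 10000
short=$(awk '{i += length() + 1; if (i > 10000) exit; print}' "$sample" | wc -c)

# Variant 2: also keep the line that straddles byte 10000
long=$(awk '{i += length() + 1; print; if (i >= 10000) exit}' "$sample" | wc -c)

echo "whole lines under the limit: $((short)) bytes"
echo "including the straddling line: $((long)) bytes"
rm -f "$sample"
```

With 300-byte lines the first variant keeps 33 lines (9900 bytes) and the second keeps 34 lines (10200 bytes).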

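The answer above assumes:

  1. The text files use Unix line terminators (\n). For DOS/Windows text files (\r\n), awk on Linux keeps the \r as part of the line, so length() already counts it and length() + 1 stays correct; change it to length() + 2 only if your awk strips the \r.
  2. The text files contain only single-byte characters. If they contain multibyte characters (for example in a UTF-8 locale), set LC_CTYPE=C in awk's environment (or LC_ALL=C, which overrides it) to force length() to count bytes.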
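The commands above only print to standard output, while the question asks for the files themselves to be shortened. One possible way to do that, sketched here rather than taken from the original answer, is to let awk compute the byte offset of the last whole line and hand it to `truncate` from GNU coreutils, which shrinks the file in place without a temporary copy. The function name `truncate_lines` is hypothetical, and the sketch assumes Unix line endings and that `truncate` is available.

```shell
# truncate_lines FILE LIMIT  (hypothetical helper, assumes GNU coreutils truncate)
# Shrinks FILE in place to the last whole line ending at or before LIMIT bytes.
truncate_lines() {
    file=$1
    limit=$2

    # Leave files that already fit untouched
    size=$(($(wc -c < "$file")))
    if [ "$size" -le "$limit" ]; then
        return 0
    fi

    # LC_ALL=C makes length() count bytes; +1 accounts for each newline.
    # i holds the offset just past the last line that still fits.
    keep=$(LC_ALL=C awk -v limit="$limit" '
        { n = i + length($0) + 1; if (n > limit) exit; i = n }
        END { printf "%d\n", i }' "$file")

    # If even the first line exceeds LIMIT, keep is 0 and the file becomes empty
    truncate -s "$keep" "$file"
}
```

Example use over many files: `for f in *.txt; do truncate_lines "$f" 10000; done`. Because the file is cut in place, try it on copies first.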