Linux – How to truncate file by lines

bash-scripting linux

I have a large number of files, some of which are very long. I would like to truncate them to a certain size if they are larger, by removing the end of the file, but I only want to remove whole lines. How can I do this? It feels like the kind of thing the Linux toolchain would handle, but I don't know the right command.

For example, say I have a 120,000 byte file with 300-byte lines and I'm trying to truncate it to 10,000 bytes. The first 33 lines should stay (9900 bytes) and the remainder should be cut. I don't want to cut at 10,000 bytes exactly, since that would leave a partial line.

Of course the files are of differing lengths and the lines are not all the same length.

Ideally the resulting files would be slightly shorter rather than slightly longer (if the breakpoint falls on a long line), but that's not too important; they could be a little longer if that's easier. I would like the changes to be made directly to the files (well, possibly the new file copied elsewhere, the original deleted, and the new file moved, but that's the same from the user's POV). A solution that redirects data to a bunch of places and then back invites the possibility of corrupting the file, and I'd like to avoid that…

Best Answer

The sed/wc complexity of the previous answers can be avoided by using awk. Using the example from the OP (printing only the complete lines that fit within the first 10000 bytes):

awk '{i += (length() + 1); if (i > 10000) exit; print}' myfile.txt

To also print the complete line containing the 10000th byte when that byte is not at the end of a line:

awk '{i += (length() + 1); print; if (i >= 10000) exit}' myfile.txt
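As a quick sanity check, the two variants can be run against a file shaped like the one in the question. This is only a sketch: the sample file is generated on the fly (400 lines of 300 bytes each, newline included), and the variable names are made up for illustration.

```shell
#!/bin/sh
# Build the 120,000-byte sample from the question: 400 lines of 299 characters
# plus a newline (300 bytes each). The temporary file is a stand-in for myfile.txt.
sample=$(mktemp)
awk 'BEGIN {
    line = sprintf("%299s", "")   # 299 spaces
    gsub(/ /, "x", line)          # turn them into visible characters
    for (n = 0; n < 400; n++) print line
}' > "$sample"

# Variant 1: keep only the lines that fit entirely within 10000 bytes.
short_bytes=$(awk '{ i += length() + 1; if (i > 10000) exit; print }' "$sample" | wc -c)

# Variant 2: also keep the line that crosses the 10000-byte mark.
long_bytes=$(awk '{ i += length() + 1; print; if (i >= 10000) exit }' "$sample" | wc -c)

echo "variant 1: $short_bytes bytes, variant 2: $long_bytes bytes"
rm -f "$sample"
```

With 300-byte lines, the first variant should keep 33 lines (9900 bytes) and the second 34 lines (10200 bytes).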

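The answer above assumes:

Text files use the Unix line terminator (\n), which is what the + 1 accounts for. For DOS/Windows text files (\r\n), awk on Linux normally keeps the \r as part of the line, so length() already counts it; change length() + 1 to length() + 2 only if your awk strips the carriage return.
Text files contain only single-byte characters. If there are multibyte characters (for example in a UTF-8 locale), set the environment variable LC_CTYPE=C (or LC_ALL=C, which takes precedence over it) to force byte-level interpretation.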
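The commands above only print the shortened text; the question also asks for the files themselves to be changed. One way to do that, sketched here under the same assumptions, is to write to a temporary file next to the original and rename it over the original only if awk succeeded. The function name truncate_lines and its argument order are hypothetical, not part of the original answer.

```shell
#!/bin/sh
# truncate_lines LIMIT FILE
# Hypothetical helper: keep the leading whole lines of FILE that fit within
# LIMIT bytes, then replace FILE with the shortened copy.
truncate_lines() {
    limit=$1
    file=$2
    # Create the temporary file beside the original so that mv is a plain rename.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    # LC_ALL=C makes length() count bytes rather than characters.
    if LC_ALL=C awk -v max="$limit" \
        '{ i += length() + 1; if (i > max) exit; print }' "$file" > "$tmp"
    then
        mv -- "$tmp" "$file"
    else
        # Leave the original untouched if anything went wrong.
        rm -f -- "$tmp"
        return 1
    fi
}
```

A call such as truncate_lines 10000 myfile.txt would then shorten one file; a loop over the files covers the rest. Note that the replacement file gets the permissions mktemp assigns, not those of the original, so they may need to be restored afterwards.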

Related Solutions

Linux – Remove lines matching string in grep

If the file is named foo.conf:

grep -E '^[^#].*' foo.conf should do it.

Explanation:

-E: Support extended regular expressions!

'^[^#].*': A regular expression surrounded by single quotes.

^[^#].*: The regular expression itself.

^ (at position 0 of the regular expression): Says "Match starting at the beginning of a line / the first character immediately following a newline, or the first character of the file itself."

[^#]: Says "Match exactly one character that is not the character #."

.*: Says "Match zero or more of any characters except a newline, for the rest of the line."

The net effect is that, if you have a file with contents like the following:

#foo
bar
#baz
fly

This regular expression fails to match the first and third lines, because each of them starts with a #: the part of the expression that requires exactly one non-# character ([^#]) cannot match, so grep excludes those lines.

The regular expression then matches the remaining lines, because the first character of lines 2 and 4 is indeed not a #.
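The walkthrough above can be reproduced with a throwaway file. This is a minimal sketch; the temporary file stands in for foo.conf.

```shell
#!/bin/sh
# Recreate the four-line sample from the text in a temporary file.
conf=$(mktemp)
printf '%s\n' '#foo' 'bar' '#baz' 'fly' > "$conf"

# Keep only the lines whose first character is not '#'.
kept=$(grep -E '^[^#].*' "$conf")
printf '%s\n' "$kept"
# prints:
# bar
# fly

rm -f "$conf"
```

Because [^#] has to consume one character, this pattern also drops completely empty lines.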

Building on our success so far, you can also exclude lines such as:

    #I am tricky!

(Notice that there is whitespace (tabs or spaces) in front of the comment, and since it's still a comment, we don't want it!)

by using the following:

grep -P '^\s*+[^#].*' foo.conf

(but the .* is not strictly required; I know, I know.)

So now we have:

-P: Support Perl-compatible regular expressions! (Hint: this option is not available in every version or implementation of grep, but GNU grep has supported it for a while now.)

\s*+: The little bit of added regex that says, "Match zero or more whitespace characters, meaning spaces or tabs, and if you do see them, you MUST eat them." This latter part is very important, because without the possessive quantifier *+, a space could match as part of the [^#], which would trick the regular expression. Possessive quantifiers do not exist in the POSIX-compatible (Basic or Extended) regular expression flavors, so we have to use PCRE here. There may be a way to do this without a possessive quantifier, but I am not aware of it.
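For systems where -P is unavailable, the same filtering appears to be achievable with POSIX character classes alone. The following is a sketch of two possible alternatives, not part of the original answer: either invert the match with -v, or exclude whitespace from the bracket expression so that backtracking cannot let a space stand in for the first real character.

```shell
#!/bin/sh
# Sample with an indented comment and an indented ordinary line.
conf=$(mktemp)
printf '%s\n' '#foo' 'bar' '    #I am tricky!' '  fly' > "$conf"

# Option 1: drop every line whose first non-whitespace character is '#'.
# Unlike the patterns above, this keeps empty lines.
inverted=$(grep -v '^[[:space:]]*#' "$conf")

# Option 2: require the first non-whitespace character to be something
# other than '#'. Empty and whitespace-only lines are dropped.
positive=$(grep -E '^[[:space:]]*[^#[:space:]]' "$conf")

printf '%s\n' "$inverted"
printf '%s\n' "$positive"
rm -f "$conf"
```

On this sample both options keep only the "bar" and "  fly" lines.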

Linux – Write n bytes from a file into another in Bash

Yes. Per the dd man page, you are looking for something like:

dd bs=1 count=60 if=_filename_1_ of=_filename_2_
dd bs=1 skip=60 count=40 if=_filename_1_ of=_filename_2_

where _filename_n_ is replaced with an actual filename.

bs=1 means that count and skip are byte counts. skip is how many bytes to skip; count is how many to copy. Edit: byte offsets start from 0, not 1. Therefore, to start at the first byte, use skip=0 (or leave skip unspecified).
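To make the zero-based offsets concrete, here is a small sketch that runs dd on a ten-byte file with known contents. The file and variable names are placeholders, and dd's status messages are sent to /dev/null.

```shell
#!/bin/sh
# A ten-byte file: offsets 0..9 hold the letters a..j.
src=$(mktemp)
printf 'abcdefghij' > "$src"

# Copy the first 4 bytes (offsets 0 through 3).
first=$(dd bs=1 count=4 if="$src" 2>/dev/null)

# Skip 4 bytes, then copy the next 3 (offsets 4 through 6).
middle=$(dd bs=1 skip=4 count=3 if="$src" 2>/dev/null)

echo "$first $middle"   # prints "abcd efg"
rm -f "$src"
```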

As a bash function, you could use:

# copy_nk(n, k, infile, outfile)
copy_nk() {
    dd bs=1 count="$1" skip="$2" ${3:+if="$3"} ${4:+of="$4"}
}

and then call it as

copy_nk 60 0 infile.txt outfile.txt

(with k=0 because byte numbers start at zero).

With the ${3:+...}, you can leave off the output file or the input file. E.g.,

cat infile.txt | copy_nk 60 0 > outfile.txt
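Since bs=1 makes dd read and write one byte at a time, large copies can be slow. A possible alternative, shown here as a sketch rather than a replacement for the answer's function, combines tail -c and head -c. The name copy_nk_fast is hypothetical, and head -c is widely available (GNU, BSD) but not required by POSIX.

```shell
#!/bin/sh
# copy_nk_fast N K INFILE
# Hypothetical variant of copy_nk: write N bytes of INFILE, starting at
# zero-based offset K, to standard output.
copy_nk_fast() {
    # tail -c +M starts output at byte number M counted from 1, hence K + 1.
    tail -c "+$(( $2 + 1 ))" "$3" | head -c "$1"
}
```

For example, copy_nk_fast 60 0 infile.txt > outfile.txt would copy the first 60 bytes, matching the copy_nk call above.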
