Select Lines Based on Length Using Text Processing


I would like to use grep or another suitable tool to find (and print) lines based not on a pattern but on their length.

Assume I have a file that contains two lines, where

Line 1: length = 300 characters
Line 2: length = 120 characters

I am looking for a command that would only output line 2.

Best Answer

exactly 120 characters

With grep:

grep -xE '.{120}' < your-file
grep -x '.\{120\}' < your-file # more portable

With awk:

awk 'length == 120' < your-file

from 0 to 120 characters

With grep:

grep -xE '.{0,120}' < your-file
grep -x '.\{0,120\}' < your-file # more portable

With awk:

awk 'length <= 120' < your-file

For strictly less than 120, replace 120 with 119 or <= with <.

120 characters or over

With grep:

grep -E '.{120}' < your-file # lines that contain a sequence of 120 characters
grep '.\{120\}' < your-file # more portable

And some more alternatives:

grep -E '^.{120}' < your-file # lines that start with a sequence of 120 characters
grep '^.\{120\}' < your-file # more portable

grep -xE '.{120,}' < your-file # lines that have 120 or more characters
                               # between start and end.
grep -x '.\{120,\}' < your-file # more portable

With awk:

awk 'length >= 120' < your-file

For strictly more than 120, replace 120 with 121 or >= with >.
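
The lower and upper bounds above can also be combined to select lines whose length falls within a range. The following is only a sketch: the function name lines_between and its two arguments are made up for illustration and are not part of the original answer.

```shell
# Hypothetical helper: print lines from standard input whose length
# (in characters, per the current locale) is between $1 and $2 inclusive.
lines_between() {
    awk -v min="$1" -v max="$2" 'length($0) >= min && length($0) <= max'
}

# The grep form of the same range test would be:
#   grep -xE '.{100,120}' < your-file
```

It would be called as, for example, lines_between 100 120 < your-file.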

These commands assume that the input is valid text, properly encoded as per the locale's charmap. If the input contains NUL characters, sequences of bytes that don't form valid characters, lines longer than LINE_MAX (in number of bytes), or a non-delimited last line (in the case of grep; awk would add the missing delimiter), your mileage may vary.

If you want to do that filtering based on the number of bytes instead of characters, set the locale to C or POSIX (LC_ALL=C grep...).
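
Spelled out in full, the byte-based variant might look like the sketch below. The function name lines_upto_bytes is hypothetical; the only technique used is the LC_ALL=C prefix described above, applied to the portable grep form.

```shell
# Hypothetical helper: print lines from standard input that are at most
# $1 bytes long. LC_ALL=C makes grep treat every byte as one character.
lines_upto_bytes() {
    LC_ALL=C grep -x ".\{0,$1\}"
}
```

For example, lines_upto_bytes 120 < your-file. For pure ASCII input the result is the same as the character-based commands; the two only differ when the input contains multibyte characters.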

To filter on the number of grapheme clusters instead of characters, and if your grep supports a -P option, replace the E option with P and . with \X in the commands above.

Compare:

$ locale charmap
UTF-8
$ echo $'e\u0301te\u0301' | grep -xP '\X{3}'
été
$ echo $'e\u0301te\u0301' | grep -xE '.{5}'
été
$ echo $'e\u0301te\u0301' | LC_ALL=C grep -xE '.{7}'
été

(Here, été is written with combining acute accents, so it is 3 grapheme clusters, 5 characters, and 7 bytes.)

Not all grep -P implementations support \X. Some only support the UTF-8 multibyte charmap.

Note that filtering based on display width is yet another matter, and display width for a given string of characters depends on the display device. See "Get the display width of a string of characters" for more on that.

Related Solutions

Grep for words of no more than a certain length

grep -o -w '\w\{1,3\}' data

Options are:

-o (a GNU extension) prints only matched words
-w (an extension from BSD, but now widely supported) matches only whole words.

It matches only words of 1 to 3 characters (specified by \{1,3\}). In grep, \w (a GNU extension) is short for the standard [[:alnum:]_], which is the same as [A-Za-z0-9_] in the C locale.
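
Where the -o and \w extensions are not available, a similar result can probably be obtained by splitting the input into one word per line and then filtering on length. This is a sketch under assumptions, not part of the original answer: the function name short_words is invented here, and giving tr a second set shorter than the first is handled the same way by GNU and BSD tr but is left unspecified by POSIX.

```shell
# Hypothetical helper: print every word of at most $1 characters found on
# standard input, one per line. A "word" is a run of [[:alnum:]_] characters.
short_words() {
    # Turn each run of non-word characters into a single newline,
    # then keep only the non-empty lines that are short enough.
    tr -cs '[:alnum:]_' '\n' |
        awk -v max="$1" 'length($0) >= 1 && length($0) <= max'
}
```

For example, short_words 3 < data.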

Search a pattern and print preceding lines starting with another pattern

Here's a solution in Perl:

perl -nlE '
    if    (/a/)   { @buffer = ($_) }
    elsif (/xyz/) { push @buffer,$_; say for @buffer }
    else          { push @buffer,$_}
' your_file

How this works

It reads through the file line-by-line and does one of three things:

If the current line matches the pattern a, it assigns the current line to the @buffer array.
If the current line matches the pattern xyz, it pushes the current line onto the buffer and prints the contents of the buffer.
If neither of the two cases above applies, it simply appends the current line to the @buffer array.

Thus, whenever a new line matches the pattern a, the contents of @buffer are erased and replaced by the current line only. This guarantees that the printed block starts at the closest line matching a that precedes the line matching xyz.

You should of course replace the regexes I used with the actual regexes relevant to your case.
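
The same buffering logic can be sketched with awk wrapped in a shell function. This is an illustrative translation rather than the original author's code: the function name print_block and the awk variables start, stop, buf and n are all made up here, and the two regexes are passed as arguments.

```shell
# Hypothetical helper: $1 is the regex that restarts the buffer,
# $2 is the regex that triggers printing the buffer.
print_block() {
    awk -v start="$1" -v stop="$2" '
        # A line matching the start regex resets the buffer to that line only.
        $0 ~ start { n = 0; buf[n++] = $0; next }
        # A line matching the stop regex is appended, then the buffer is printed.
        $0 ~ stop  { buf[n++] = $0; for (i = 0; i < n; i++) print buf[i]; next }
        # Any other line is simply appended to the buffer.
                   { buf[n++] = $0 }
    '
}
```

It would be called as print_block 'a' 'xyz' < your_file. As in the Perl version, the buffer is not emptied after printing. Note that awk interprets backslash escapes in values passed with -v, so regexes containing backslashes may need extra escaping.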
