Select Lines Based on Length Using Text Processing

grep, text-processing

I would like to use grep or another suitable tool to find (and print) lines based on their length rather than on a pattern.

Assume I have a file that contains two lines, where

  • Line 1: length = 300 characters
  • Line 2: length = 120 characters

I am looking for a command that would only output line 2.

Best Answer

exactly 120 characters

With grep:

grep -xE '.{120}' < your-file
grep -x '.\{120\}' < your-file # more portable

With awk:

awk 'length == 120' < your-file

from 0 to 120 characters

With grep:

grep -xE '.{0,120}' < your-file
grep -x '.\{0,120\}' < your-file # more portable

With awk:

awk 'length <= 120' < your-file

For strictly less than 120, replace 120 with 119 or, with awk, <= with <.

120 characters or over

With grep:

grep -E '.{120}' < your-file # lines that contain a sequence of 120 characters
grep '.\{120\}' < your-file # more portable

And some more alternatives:

grep -E '^.{120}' < your-file # lines that start with a sequence of 120 characters
grep '^.\{120\}' < your-file # more portable
grep -xE '.{120,}' < your-file # lines that have 120 or more characters
                               # between start and end.
grep -x '.\{120,\}' < your-file # more portable

With awk:

awk 'length >= 120' < your-file

For strictly more than 120, replace 120 with 121 or, with awk, >= with >.
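
As a quick sanity check, the three cases can be tried on a small sample. This is only a sketch: the threshold is 5 rather than 120 to keep the input readable, and the temporary file and the variable names (sample, n, exact, at_most, at_least) are made up for the illustration.

```shell
# Sample input: lines of 3, 5 and 8 characters (ASCII only).
sample=$(mktemp)
printf '%s\n' 'abc' 'abcde' 'abcdefgh' > "$sample"

n=5   # stands in for 120

# Exactly n characters: only the 5-character line.
exact=$(awk -v n="$n" 'length == n' < "$sample")

# From 0 to n characters: the 3- and 5-character lines.
at_most=$(awk -v n="$n" 'length <= n' < "$sample")

# n characters or over, using the portable grep form: the 5- and 8-character lines.
at_least=$(grep -x ".\{$n,\}" < "$sample")

printf 'exactly %s:\n%s\n' "$n" "$exact"
printf 'at most %s:\n%s\n' "$n" "$at_most"
printf 'at least %s:\n%s\n' "$n" "$at_least"

rm -f "$sample"
```

Passing the threshold with awk -v avoids splicing a shell variable into the awk program; in the grep call the pattern is double-quoted so that $n expands while \{ and \} reach grep unchanged.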


These commands assume that the input is valid text, properly encoded as per the locale's charmap. If the input contains NUL characters, sequences of bytes that don't form valid characters, lines longer than LINE_MAX (in number of bytes), or a non-delimited last line (in the case of grep; awk would add the missing delimiter), your mileage may vary.
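
Two of these preconditions are easy to check before filtering. The helpers below are a rough sketch, not part of grep or awk: the names ends_with_newline and longest_line_bytes are invented here, and they do not detect NUL bytes or invalid byte sequences.

```shell
# ends_with_newline FILE: succeeds if FILE is empty or its last byte is a newline.
# Command substitution strips a trailing newline, so an empty result means the
# last line is properly delimited.
ends_with_newline() {
  [ -z "$(tail -c 1 < "$1")" ]
}

# longest_line_bytes FILE: prints the length in bytes of the longest line.
# LC_ALL=C makes awk count bytes rather than characters.
longest_line_bytes() {
  LC_ALL=C awk 'length($0) > max { max = length($0) } END { print max + 0 }' < "$1"
}

# Example on a file whose last line lacks the trailing newline.
f=$(mktemp)
printf 'abc\nabcdef' > "$f"
if ends_with_newline "$f"; then echo "last line delimited"; else echo "last line not delimited"; fi
echo "longest line: $(longest_line_bytes "$f") bytes"
```

On systems that provide it, getconf LINE_MAX prints the limit to compare the second value against.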

If you want to do that filtering based on the number of bytes instead of characters, set the locale to C or POSIX (LC_ALL=C grep...).
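
For instance, the byte-based variant of the awk command could be wrapped as below. This is a minimal sketch: lines_max_bytes is a hypothetical helper name, and the sample input uses the precomposed é (UTF-8 bytes C3 A9), so été is 3 characters but 5 bytes.

```shell
# lines_max_bytes N: copy to stdout the lines of stdin that are at most N bytes long.
# (Hypothetical helper; LC_ALL=C makes awk's length count bytes.)
lines_max_bytes() {
  LC_ALL=C awk -v max="$1" 'length <= max'
}

# "abc" is 3 bytes; "été" written with precomposed é is 5 bytes.
printf 'abc\n\303\251t\303\251\n' | lines_max_bytes 4   # prints only abc
printf 'abc\n\303\251t\303\251\n' | lines_max_bytes 5   # prints both lines
```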

To filter on the number of grapheme clusters instead of characters, and if your grep supports the -P option, replace the E option with P in the commands above and . with \X.

Compare:

$ locale charmap
UTF-8
$ echo $'e\u0301te\u0301' | grep -xP '\X{3}'
été
$ echo $'e\u0301te\u0301' | grep -xE '.{5}'
été
$ echo $'e\u0301te\u0301' | LC_ALL=C grep -xE '.{7}'
été

(Here été is written with combining acute accents, so it is 3 grapheme clusters, 5 characters, and 7 bytes.)

Not all grep -P implementations support \X. Some only support the UTF-8 multibyte charmap.

Note that filtering based on display width is yet another matter, and the display width of a given string of characters depends on the display device. See "Get the display width of a string of characters" for more on that.