How to delete all lines in a text file which have fewer than ‘x’ characters

Tags: awk, sed, text-processing

How can I delete all lines in a text file which have fewer than 'x' letters OR numbers OR symbols? I can't use awk 'length($0)>' as it will include spaces.

Best Answer

Assuming you want to delete lines that contain fewer than n graphical characters:

awk -v n=5 '{ line = $0; gsub("[^[:graph:]]", "") } length >= n { print line }'

This saves a copy of the line, then deletes every character that does not match [[:graph:]]. If the length of the string that remains is greater than or equal to n, the saved (unmodified) line is printed.

The value of n is given on the command line.
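
As a quick check, the command can be run on a few made-up lines. This is only a sketch: the sample input and the variable names input and result are assumptions for illustration, and it assumes an awk that supports POSIX character classes.

```shell
# Hypothetical sample input: four lines, some padded with spaces.
input=$(printf '%s\n' 'abc' 'a b c d e' '   ab   ' 'hello')

# Keep only lines with at least n graphical characters (n=5 here).
result=$(printf '%s\n' "$input" |
    awk -v n=5 '{ line = $0; gsub("[^[:graph:]]", "") } length >= n { print line }')

printf '%s\n' "$result"
```

Only "a b c d e" and "hello" should survive, since each has five graphical characters once the spaces are ignored, and both are printed with their original spacing.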

[[:graph:]] is equivalent to [[:alnum:][:punct:]], which in turn is the same as [[:alpha:][:digit:][:punct:]]. It is roughly the same as [[:print:]] but does not match spaces.

Instead of [^[:graph:]], you could use [[:blank:]] to delete only tabs and spaces. Any other non-graphical characters (control characters, for example) would then still be counted.
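
One case where the two patterns might give different results is a line ending in a carriage return, as in a DOS-style text file. The sketch below uses a made-up one-line input, and the variable names strict and loose are purely illustrative.

```shell
# Hypothetical input: "abcd" followed by a carriage return (a control character).
# Deleting non-graph characters leaves 4 characters, so the line is dropped.
strict=$(printf 'abcd\r\n' |
    awk -v n=5 '{ line = $0; gsub("[^[:graph:]]", "") } length >= n { print line }')

# Deleting only blanks leaves 5 characters (the CR is counted), so the line is kept.
loose=$(printf 'abcd\r\n' |
    awk -v n=5 '{ line = $0; gsub("[[:blank:]]", "") } length >= n { print line }')

[ -z "$strict" ] && echo "graph variant: line dropped"
[ -n "$loose" ] && echo "blank variant: line kept"
```

Which behaviour is preferable depends on whether such characters should count towards the limit.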

With sed, following the above awk code almost literally,

sed -e 'h; s/[^[:graph:]]//g' \
    -e '/.\{5\}/!d; g'

or, simplified (only counting non-blank characters),

sed -e 'h; s/[[:blank:]]//g' \
    -e '/...../!d; g'

This first saves the current line into the hold space with h. It then deletes all non-graph characters (or blank characters in the second variation) on the line with s///g. If the line then contains fewer than 5 characters (change this to whatever number you want, or change the number of dots in the second variation), the line is deleted. Otherwise, the stored line is fetched from the hold space with g and (implicitly) printed.
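
If the limit should not be hard-coded, one possible approach is to splice a shell variable into the sed expression. This is a sketch, not part of the commands above: the variable n and the sample input are assumptions for illustration.

```shell
n=5   # minimum number of graphical characters a line must have (assumed value)

# Hypothetical sample input: four lines, some padded with spaces.
input=$(printf '%s\n' 'abc' 'a b c d e' '   ab   ' 'hello')

# The sed script stays single-quoted; only "$n" is expanded by the shell,
# which avoids the shell interpreting "!" or the backslashes.
result=$(printf '%s\n' "$input" |
    sed -e 'h; s/[^[:graph:]]//g' \
        -e '/.\{'"$n"'\}/!d; g')

printf '%s\n' "$result"
```

With this input, the lines "a b c d e" and "hello" should be printed unchanged, and the other two deleted.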
