Ubuntu – How to remove all lines in a file that are less than 6 characters

command-line, text-processing

I have a file containing approximately 10 million lines.

I want to remove all lines in the file that are less than six characters.

How do I do this?

Best Answer

There are many ways to do this.

Using grep:

grep -E '^.{6,}$' file.txt >out.txt

Now out.txt will contain lines having six or more characters.

The inverse, dropping lines of five or fewer characters:

grep -vE '^.{,5}$' file.txt >out.txt
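As a quick sanity check, the two grep commands can be run against a small throwaway file. This is only a sketch: the sample lines and the `sample`, `keep` and `drop` variable names are made up for illustration, and `{0,5}` is used as the portable spelling of GNU grep's `{,5}`.

```shell
# Hypothetical sample data: lines of length 3, 6, 0 and 11.
sample=$(mktemp)
printf '%s\n' 'abc' 'abcdef' '' 'hello world' > "$sample"

keep=$(grep -E '^.{6,}$' "$sample")     # keep lines with six or more characters
drop=$(grep -vE '^.{0,5}$' "$sample")   # drop lines with zero to five characters

# Both variables now hold the same two lines: "abcdef" and "hello world".
printf '%s\n' "$keep"
rm -f "$sample"
```

Note that the empty line is removed by both forms, since it is shorter than six characters.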

Using sed, removing lines of length 5 or less:

sed -r '/^.{,5}$/d' file.txt

Reverse way, printing lines of length six or more:

sed -nr '/^.{6,}$/p' file.txt

You can save the output to a different file with the > operator, as with grep, or edit the file in place with sed's -i option:

sed -ri.bak '/^.{,5}$/d' file.txt

The original file will be backed up as file.txt.bak and the modified file will be file.txt.

If you do not want to keep a backup:

sed -ri '/^.{,5}$/d' file.txt
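The in-place form is easy to try safely on a scratch copy before touching a real file. A minimal sketch, assuming GNU sed (for -r and -i); the temporary directory, the sample lines and the variable names are hypothetical.

```shell
# Hypothetical scratch directory and file, used only for this demonstration.
dir=$(mktemp -d)
printf '%s\n' 'abc' 'abcdef' 'hi' 'long enough' > "$dir/file.txt"

# Delete lines of five or fewer characters in place; the original is kept as file.txt.bak.
sed -r -i.bak '/^.{,5}$/d' "$dir/file.txt"

edited=$(cat "$dir/file.txt")                    # the two lines that survive
backup_lines=$(grep -c '' "$dir/file.txt.bak")   # the backup still has all four lines

printf '%s\n' "$edited"
rm -rf "$dir"
```

Comparing file.txt with file.txt.bak afterwards (for example with diff) shows exactly which lines were dropped.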

Using the shell. This is slower, so don't do it on a large file; it is shown only for the sake of another method:

while IFS= read -r line; do [ "${#line}" -ge 6 ] && printf '%s\n' "$line"; done <file.txt

Using Python, which is even slower than grep and sed:

#!/usr/bin/env python3
with open('file.txt') as f:
    for line in f:
        line = line.rstrip('\n')  # drop the newline so it is not counted
        if len(line) >= 6:
            print(line)

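A list comprehension is more Pythonic, though it reads the whole file into memory:

#!/usr/bin/env python3
with open('file.txt') as f:
    # Strip each newline first so it is neither counted nor doubled by join().
    lines = [line.rstrip('\n') for line in f]
    print('\n'.join(line for line in lines if len(line) >= 6))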

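Related Solutions

Ubuntu – How to remove lines from the text file containing specific words through terminal

`grep` approach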

To create a copy of the file without the lines matching "cat" or "rat", use grep with inverted matching (-v) and the whole-word option (-w).

grep -vwE "(cat|rat)" sourcefile > destinationfile

The whole-word option makes sure it won't match cats or grateful, for example. Your shell's output redirection (>) writes the result to a new file. The -E option enables extended regular expressions, which the (one|other) syntax needs.
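The effect of -w is easiest to see on a few sample lines. The following is a sketch only; the temporary file, its contents and the `kept` variable are invented for illustration.

```shell
# Hypothetical input: two lines contain the whole words, two only contain them as substrings.
sourcefile=$(mktemp)
printf '%s\n' 'the cat sat' 'two cats' 'grateful' 'a rat ran' 'dog' > "$sourcefile"

# -v inverts the match, -w requires whole words, -E enables (cat|rat).
kept=$(grep -vwE '(cat|rat)' "$sourcefile")

# Prints "two cats", "grateful" and "dog"; the lines with the words cat and rat are gone.
printf '%s\n' "$kept"
rm -f "$sourcefile"
```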

`sed` approach

Alternatively, to remove the lines in-place one can use sed -i:

sed -i "/\b\(cat\|rat\)\b/d" filename

The \b sets word boundaries and the d command deletes each line matching the expression between the forward slashes. cat and rat are matched with the same (one|other) alternation, which has to be escaped with backslashes here because sed uses basic regular expressions unless it is given -r or -E.

Tip: use sed without the -i option to test the output of the command before overwriting the file.
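The dry-run tip can be sketched as follows, assuming GNU sed (needed for \b, \| and -i). The temporary file, its contents and the `preview` and `after` variables are hypothetical.

```shell
# Hypothetical input file for the demonstration.
filename=$(mktemp)
printf '%s\n' 'the cat sat' 'two cats' 'a rat ran' 'dog' > "$filename"

# Dry run: without -i the result goes to standard output and the file is untouched.
preview=$(sed '/\b\(cat\|rat\)\b/d' "$filename")
printf '%s\n' "$preview"

# Once the preview looks right, repeat the same expression with -i.
sed -i '/\b\(cat\|rat\)\b/d' "$filename"
after=$(cat "$filename")

rm -f "$filename"
```

Here both the preview and the edited file contain only the lines "two cats" and "dog".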

(Based on Sed - Delete a line containing a specific string)

Ubuntu – Remove any trailing blank lines or lines with whitespaces from end of file

Your script should work if fixed like so:

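while
 last_line=$(tail -n 1 "./file.txt")
 # Stop once the file is empty; otherwise an all-blank file would loop forever.
 [[ -s "./file.txt" ]] && { [[ "$last_line" =~ ^$ ]] || [[ "$last_line" =~ ^[[:space:]]+$ ]]; }
do
 sed -i '$d' "./file.txt"
done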

Your script had two main problems: (1) you never updated $last_line, so the loop's guard would always evaluate the same thing; (2) your [[ "$last_line" =~ $ ]] test matched any line, since any line has an end. (This is the reason why your script emptied your file completely.) You probably want to match against ^$ instead, which matches only empty lines. Additionally, I simplified the sed command to delete the last line in the loop's body (simply $d does the job).
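The difference between the two patterns can be checked directly. This is a small sketch, not part of the original script: `matches` is a hypothetical helper that wraps bash's `[[ … =~ … ]]` test, and the sample strings are made up.

```shell
# matches STRING REGEX: succeeds when bash's [[ STRING =~ REGEX ]] does.
# Running the test through `bash -c` keeps the sketch usable from any POSIX shell.
matches() { bash -c '[[ "$1" =~ $2 ]]' _ "$1" "$2"; }

matches 'some text' '$'  && echo 'pattern $ matches a non-empty line'
matches ''          '$'  && echo 'pattern $ matches an empty line'
matches 'some text' '^$' || echo 'pattern ^$ rejects a non-empty line'
matches ''          '^$' && echo 'pattern ^$ matches an empty line'
matches '   '       '^[[:space:]]+$' && echo 'a whitespace-only line is caught by the second test'
```

All five messages are printed, which shows why a guard using a bare $ deletes every line while ^$ stops at the first line that has content.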

However, this script is unnecessarily complicated. sed is there for just that kind of thing! This one-liner will do the same thing as the above script:

sed -i ':a;/^[ \n]*$/{$d;N;ba}' ./file.txt

Roughly,

1. Match the current line against ^[ \n]*$ (i.e., it may contain only spaces and newlines).
2. If it doesn't match, just print it. Read in the next line and continue with step 1.
3. If it does match,
   - If we are at the end of the file, delete it.
   - If we are not at the end of the file, append the next line to the current line, inserting a newline character between the two, and go back to step 1 with this new, longer line.
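To see those steps in action, the one-liner can be tried on a scratch file first. A minimal sketch, assuming GNU sed; the temporary file and its contents (one inner blank line plus three trailing blank or space-only lines) are invented for the demonstration.

```shell
# Hypothetical test file: "a", blank, "b", then blank, two spaces, blank.
f=$(mktemp)
printf 'a\n\nb\n\n  \n\n' > "$f"

lines_before=$(grep -c '' "$f")          # counts every line, including empty ones
sed -i ':a;/^[ \n]*$/{$d;N;ba}' "$f"
lines_after=$(grep -c '' "$f")
content=$(cat "$f")

echo "$lines_before -> $lines_after"     # prints "6 -> 3"
rm -f "$f"
```

The blank line between a and b survives because the accumulated text stops matching as soon as b is appended, so it is printed unchanged; only the run of blank lines that reaches the end of the file is deleted.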

There are lots of awesome sed tutorials on the Internet. For example, I can recommend this one. Happy learning! :-)

Update: And of course, if you additionally want to remove the last (non-blank) line of the file after having truncated the trailing blank lines, you can just use another sed -i '$d' ./file.txt after either your script or the above one-liner. I intentionally did not want to include that in the sed one-liner since I thought that removing trailing blank lines is quite a reusable piece of code that may be interesting for other people; but removing the last non-blank line is really specific to your use case, and trivial anyway once you removed the trailing blank lines.
