Ubuntu – Using sed to remove all lines containing special characters, numbers, and spaces

bash, command line, scripts, sed

I'm fairly new to shell scripting (bash specifically), and I'm writing a script that will translate files containing DNA sequences into a more useful format. Unfortunately, many of these files contain extraneous lines used for labeling information, etc. I need a sed command that will exclude any line containing special characters, numbers, or spaces. I've found that removing lines with spaces is fairly straightforward using

sed '/ /d' infile

and I imagine that removing lines containing numbers would use a similar strategy with a regex. I just haven't found a way of approaching special characters in sed.

Thanks

Best Answer

To delete any line that is not composed entirely of alphabetic characters, you'd need to add start (^) and end ($) anchors

sed '/^[[:alpha:]]*$/!d' file

Instead, you could delete any line that contains at least one non-alphabetic character

sed '/[^[:alpha:]]/d' file

Note that the caret ^ is acting as a negation operator here rather than as an anchor as in the previous expression.
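Since the caret negates whatever set follows it inside the brackets, the same pattern can be narrowed from "any letter" to a specific alphabet. The following is a minimal sketch for the DNA case, assuming the sequences use only the letters A, C, G, T and N in either case; that alphabet and the sample lines are assumptions made for illustration, not something stated in the question.

```shell
# Hypothetical sample input: two valid sequence lines, one plain-text label,
# and one line containing a letter (U) outside the assumed DNA alphabet.
sample=$(printf '%s\n' 'ACGTN' 'acgt' 'HELLO' 'ACGU')

# Delete every line that contains at least one character outside the set.
# Inside [...] the leading ^ negates the set; it is not a line anchor here.
dna_only=$(printf '%s\n' "$sample" | sed '/[^ACGTNacgtn]/d')

printf '%s\n' "$dna_only"
```

Only the first two sample lines should survive. Widening or narrowing the bracket set is the only change needed for a different alphabet.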

Alternatively, using grep's whole-line (-x or --line-regexp) option

grep -x '[[:alpha:]]*' file

(equivalent of the first sed expression) or using an inverse match (-v)

grep -v '[^[:alpha:]]' file

(equivalent of the second sed expression).
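As a quick check that the four commands agree, here is a small sketch run against a made-up input file (the file contents and variable names are hypothetical). It also shows one detail worth knowing: because * matches zero characters too, all four forms keep empty lines. The last command, which swaps * for + under extended regex syntax, is one possible way to drop those as well.

```shell
# Hypothetical input: a label line, sequence lines, an empty line,
# a line with digits and '=', and a line with a space.
seqs=$(mktemp)
printf '%s\n' '>seq1 sample 42' 'ACGTACGT' '' 'length=15' 'ACG TTT' 'ttagg' > "$seqs"

sed_keep=$(sed '/^[[:alpha:]]*$/!d' "$seqs")   # keep only all-alphabetic lines
sed_drop=$(sed '/[^[:alpha:]]/d' "$seqs")      # drop lines with any non-letter
grep_x=$(grep -x '[[:alpha:]]*' "$seqs")       # whole-line match
grep_v=$(grep -v '[^[:alpha:]]' "$seqs")       # inverted match

# Variation: require at least one letter, so empty lines are removed too.
nonempty=$(grep -Ex '[[:alpha:]]+' "$seqs")

printf '%s\n' "$sed_keep"
printf '%s\n' '--- without empty lines ---'
printf '%s\n' "$nonempty"
rm -f "$seqs"
```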

Related Solutions

Ubuntu – Replacing dots (.) in sed

First, I would start by testing with echo and piping that into sed before using a real file. Second, you can use {n} in the extended regex syntax to specify how many times a group must repeat.

You were pretty much there but your regex expected a leading space.

$ echo 'cheese . . . muffins' | sed -r 's/(\s?\.){3}/ dot dot dot/g'
cheese dot dot dot muffins

Note that the \s? also consumes the space before the first dot, so I've added a leading space to the replacement text to put it back. You might not want that. I've also made the space optional, so it'll match all of the following:

...
. ..
.. .
. . .
 . . .

If you don't want that, just remove the ? quantifier so that the space is required.
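To see the difference between the optional and the required space, the sketch below runs both patterns over the five dot layouts listed above. It is written with -E (the portable spelling of -r) and [[:space:]] (the POSIX class corresponding to the \s extension); the surrounding x and y characters and the variable names are made up for the demonstration.

```shell
# The five layouts from the list, each wrapped in x...y so the edges are visible.
variants=$(printf '%s\n' 'x...y' 'x. ..y' 'x.. .y' 'x. . .y' 'x . . .y')

# Optional space before each dot: every layout is treated as an ellipsis.
optional=$(printf '%s\n' "$variants" | sed -E 's/([[:space:]]?\.){3}/ dot dot dot/g')

# Required space before each dot: only the fully spaced " . . ." layout matches.
required=$(printf '%s\n' "$variants" | sed -E 's/([[:space:]]\.){3}/ dot dot dot/g')

printf 'optional space:\n%s\n' "$optional"
printf 'required space:\n%s\n' "$required"
```

With the optional form all five lines should be rewritten; with the required form only the last one should change.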

Given your problem with Unicode (in the comments), you can force the data to its ASCII equivalent with iconv and then run sed on it:

$ iconv -f utf-8 -t ascii//translit sed-dots | sed -r 's/(\s?\.){3}/ dot dot dot/g'
Lorem ipsum dot dot dot
Some dot dot dot more text

Ubuntu – How to remove special ‘M-BM-‘ character with sed

The M-BM- characters are an ASCII representation of the byte sequence 0xc2 0xa0, which is the UTF-8 encoding of Unicode character U+00A0 - a non-breaking space character. This character can be inserted in both LibreOffice and Microsoft Word documents using the key sequence Ctrl+Shift+SPACE.

For example, if we create a new .odt document in LibreOffice, type ABC, press Ctrl+Shift+SPACE, type DEF, then choose Save As... Text (ignoring the warning that the document may contain features that cannot be saved in that format), and view the resulting .txt file with cat:

$ cat nbsp.txt 
ABC DEF

and then again with the -v switch to show non-printing characters

$ cat -v nbsp.txt 
M-oM-;M-?ABCM-BM- DEF

Note that we also get an initial sequence M-oM-;M-?, or hexadecimal 0xef 0xbb 0xbf, which is the UTF-8 byte order mark (BOM). This is consistent with the file type reported by the file command, i.e.

$ file nbsp.txt 
nbsp.txt: UTF-8 Unicode (with BOM) text

Using od to print the hexadecimal values in byte order we see

$ od -tx1 nbsp.txt
0000000 ef bb bf 41 42 43 c2 a0 44 45 46 0a
0000014

It is possible to manipulate these characters using standard tools like sed or tr by specifying the hex codes as escape sequences e.g. to replace the non-breaking space with a plain ASCII space

$ sed 's/\xc2\xa0/ /g' nbsp.txt
ABC DEF

Checking again with od confirms the replacement by an ordinary ASCII space 0x20 (decimal 32)

$ sed 's/\xc2\xa0/ /g' nbsp.txt | od -tx1
0000000 ef bb bf 41 42 43 20 44 45 46 0a
0000013
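The od output above still begins with ef bb bf, so the BOM survives that replacement. If it is unwanted as well, both fixes can be combined in one sed call. The following is a sketch only: it recreates the sample bytes with printf rather than LibreOffice, relies on the same \xHH escapes used above, and sets LC_ALL=C as a precaution so that sed treats the input as raw bytes (that setting is an assumption, not part of the original commands).

```shell
# Recreate the sample: BOM, "ABC", non-breaking space, "DEF", newline.
# Octal escapes are used because printf handles them portably.
nbsp_file=$(mktemp)
printf '\357\273\277ABC\302\240DEF\n' > "$nbsp_file"

# First expression: strip a BOM at the very start of line 1 only.
# Second expression: turn every UTF-8 non-breaking space into a plain space.
cleaned=$(LC_ALL=C sed -e '1s/^\xef\xbb\xbf//' -e 's/\xc2\xa0/ /g' "$nbsp_file")

# Inspect the result byte by byte, as in the od examples above.
printf '%s\n' "$cleaned" | od -An -tx1
rm -f "$nbsp_file"
```

The dump should no longer contain ef bb bf or c2 a0, with a single 20 between the bytes for ABC and DEF.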

In gnome-terminal (and maybe other UTF-8-aware terminal emulators), it's also possible to enter the Unicode code point value directly using the key sequence Ctrl+Shift+u followed by a hexadecimal value and then the Enter key - the sequence shows up initially as u̲.̲.̲.̲ but then the character should compose when you hit Enter, e.g. for the same non-breaking space replacement we can do

$ sed 's/Ctrl+Shift+ua0

which displays as

$ sed 's/̲/̲u̲a̲0̲

and then completes as

$ sed 's/ / /g' nbsp.txt
ABC DEF

Using cat -v we can confirm the M-BM- sequence has become an ordinary space

$ sed 's/ / /g' nbsp.txt | cat -v
M-oM-;M-?ABC DEF

You may want to look at more generic encoding converters such as iconv and uconv as well.
