I have a file that was created by copying the content of a DOCX document from LibreOffice into a text file. I modified the file with sed
to remove extra spaces and other clutter, but then I noticed a space that was immune to the usual command:
sed -r 's:some-text :some-text:g' -i file
After running cat -A file
I found that the line actually looks like this:
<p>M-BM- Lorem ipsum</p>
How can I remove it?
Best Answer
The M-BM- characters are an ASCII representation of the byte sequence 0xc2 0xa0, which is the UTF-8 encoding of the Unicode character U+00A0, a non-breaking space. This character can be inserted in both LibreOffice and Microsoft Word documents using the key sequence Ctrl+Shift+SPACE.

For example, if we create a new .odt document in LibreOffice, type ABC, Ctrl+Shift+SPACE, DEF, and then choose Save As... Text (ignoring the warning that the document may contain features that cannot be saved in that format), we can view the resulting .txt file with cat, and then again with the -v switch to show non-printing characters.

Note that we also get an initial sequence M-oM-;M-?, or hexadecimal 0xef 0xbb 0xbf, which is the UTF-8 byte order mark (BOM), consistent with the file type reported by the file command. Using od to print the hexadecimal values in byte order, we see both the BOM and the 0xc2 0xa0 pair.

It is possible to manipulate these characters using standard tools like sed or tr by specifying the hex codes as escape sequences, e.g. to replace the non-breaking space with a plain ASCII space using GNU sed:
sed -i 's/\xc2\xa0/ /g' file
Checking again with od confirms the replacement by an ordinary ASCII space, 0x20 (decimal 32).

In gnome-terminal (and maybe other UTF-8-aware terminal emulators), it's also possible to enter the Unicode code point value directly using the key sequence Ctrl+Shift+u followed by a hexadecimal value and then the Enter key. The sequence shows up initially as u̲.̲.̲.̲, but the character should compose when you hit Enter. For the same non-breaking space replacement, we can type the sed command as before and enter the code point a0 this way in place of the hex escapes in the pattern.
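Outside an interactive terminal, the same literal-character approach can be scripted. The following is only a sketch: it assumes a POSIX shell whose printf understands octal escapes, and the file name nbsp-demo.txt and the variable names are hypothetical, chosen for illustration.

```shell
# Hypothetical scratch file; the name is only for illustration.
tmpdir=$(mktemp -d)
sample="$tmpdir/nbsp-demo.txt"

# Write "<p>", a non-breaking space (bytes 0xc2 0xa0, octal 302 240), then the text.
printf '<p>\302\240Lorem ipsum</p>\n' > "$sample"

# Hold the literal character in a variable and splice it into the sed pattern,
# so no sed-specific hex escapes are needed.
nbsp=$(printf '\302\240')
cleaned=$(sed "s/$nbsp/ /g" "$sample")

printf '%s\n' "$cleaned" | cat -v   # prints <p> Lorem ipsum</p>
```

Using an empty replacement (s/$nbsp//g) would delete the character instead of turning it into a plain space.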
Using cat -v we can confirm that the M-BM- sequence has become an ordinary space.

You may also want to look at more generic encoding converters such as iconv and uconv.
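To reproduce the byte-level walkthrough above without LibreOffice, the sample file can be written byte by byte. This is a sketch under stated assumptions: it relies on GNU sed (for the \xHH escapes and the -i switch), and demo.txt is a hypothetical file name standing in for the exported .txt file.

```shell
# Assumes GNU sed; demo.txt is a hypothetical stand-in for the exported file.
workdir=$(mktemp -d)
demo="$workdir/demo.txt"

# BOM (ef bb bf) + "ABC" + non-breaking space (c2 a0) + "DEF", as octal escapes.
printf '\357\273\277ABC\302\240DEF\n' > "$demo"

cat -v "$demo"        # prints M-oM-;M-?ABCM-BM- DEF
od -An -tx1 "$demo"   # raw bytes in hexadecimal, including c2 a0

# Replace every non-breaking space with a plain ASCII space, in place.
sed -i 's/\xc2\xa0/ /g' "$demo"

od -An -tx1 "$demo"   # the c2 a0 pair is now a single 20
```

The BOM is left untouched here; only the 0xc2 0xa0 pair is rewritten.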