Octals 302 240 together seem to correspond to non-breaking space

character-encoding od whitespace

By looking at a particular line of a text file (say, the 1123rd; see below), it seems that there is a non-breaking space, but I am not sure:

$ cat myfile.csv | sed -n 1123p | cut -f2
Lisztes feher

$ cat myfile.csv | sed -n 1123p | cut -f2 | od -An -c -b
   L   i   s   z   t   e   s 302 240   f   e   h   e   r  \n
 114 151 163 172 164 145 163 302 240 146 145 150 145 162 012

However, the character code in octal for a non-breaking space is 240. So what does the 302 correspond to? Is it something particular to this given file?

I am asking the question in order to understand. I already know how to use sed to fix my problem, following this answer:

$ cat myfile.csv | sed -n 1123p | cut -f2 | sed 's/\xC2\xA0/ /g' | od -An -c -b
   L   i   s   z   t   e   s       f   e   h   e   r  \n
 114 151 163 172 164 145 163 040 146 145 150 145 162 012
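For reference, the same substitution can be applied to the whole file in place, a sketch assuming GNU sed (which understands the \xHH escapes and the -i option):

```shell
# Replace every UTF-8 non-breaking space (bytes 0xC2 0xA0) with a regular space.
# -i edits the file in place; GNU sed is assumed.
sed -i 's/\xC2\xA0/ /g' myfile.csv
```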

For information, the original file is in the .xlsx (Excel) format. As my computer runs Xubuntu, I opened it with LibreOffice Calc (v5.1). Then, I saved it as "Text CSV" with "Character set = Unicode (UTF-8)" and tab as field separator:

$ file myfile.csv
myfile.csv: UTF-8 Unicode text

Best Answer

It's the UTF-8 encoding of the U+00A0 Unicode character:

$ unicode U+00A0
U+00A0 NO-BREAK SPACE
UTF-8: c2 a0 UTF-16BE: 00a0 Decimal: &#160; Octal: \0240
 
Category: Zs (Separator, Space)
Bidi: CS (Common Number Separator)
Decomposition: <noBreak> 0020

$ locale charmap
UTF-8
$ printf '\ua0' | od -to1
0000000 302 240
0000002

UTF-8 is an encoding of Unicode with a variable number of bytes per character. Unicode as a charset is a superset of iso8859-1 (aka latin1), itself a superset of ASCII.

While in iso8859-1 the non-breaking-space character (code point 0xa0 in iso8859-1, as in Unicode) is expressed as a single 0xa0 byte, in UTF-8 only code points 0 to 127 are expressed as one byte (which makes UTF-8 a superset of ASCII; in other words, ASCII files are also valid UTF-8 files).
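You can see the two encodings of the same character side by side with iconv (assumed to be available, as it usually is on a GNU system):

```shell
# The NBSP as UTF-8: two bytes.
printf '\302\240' | od -An -to1
# 302 240

# The same character converted to ISO 8859-1: a single byte.
printf '\302\240' | iconv -f UTF-8 -t ISO-8859-1 | od -An -to1
# 240
```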

Code points above 127 are encoded with two to four bytes per character. See Wikipedia for details of the UTF-8 encoding algorithm.
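As a sketch of that algorithm for this particular character: U+00A0 needs two bytes, laid out as 110xxxxx 10xxxxxx, and a bit of shell arithmetic reproduces the bytes od showed:

```shell
# Encode code point U+00A0 (decimal 160) as two-byte UTF-8 by hand.
cp=$(( 0xA0 ))
b1=$(( 0xC0 | (cp >> 6) ))    # leading byte: 110 prefix + top 5 bits
b2=$(( 0x80 | (cp & 0x3F) ))  # continuation byte: 10 prefix + low 6 bits
printf '%o %o\n' "$b1" "$b2"
# 302 240
```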