By looking at a particular line of a text file (say, the 1123rd, see below), it seems that there is a non-breaking space, but I am not sure:
$ cat myfile.csv | sed -n 1123p | cut -f2
Lisztes feher
$ cat myfile.csv | sed -n 1123p | cut -f2 | od -An -c -b
   L   i   s   z   t   e   s 302 240   f   e   h   e   r  \n
 114 151 163 172 164 145 163 302 240 146 145 150 145 162 012
However, the ASCII code in octal indicates that a non-breaking space is 240. So what does the 302 correspond to? Is it something particular to this given file?
I am asking the question in order to understand. I already know how to use sed
to fix my problem, following this answer:
$ cat myfile.csv | sed -n 1123p | cut -f2 | sed 's/\xC2\xA0/ /g' | od -An -c -b
   L   i   s   z   t   e   s       f   e   h   e   r  \n
 114 151 163 172 164 145 163 040 146 145 150 145 162 012
For information, the original file is in the .xlsx (Excel) format. As my computer runs Xubuntu, I opened it with LibreOffice Calc (v5.1). Then, I saved it as "Text CSV" with "Character set = Unicode (UTF-8)" and tab as field separator:
$ file myfile.csv
myfile.csv: UTF-8 Unicode text
Best Answer
It's the UTF-8 encoding of the Unicode character U+00A0 (NO-BREAK SPACE): the two bytes 0xC2 0xA0, which od -b prints in octal as 302 240.
UTF-8 is an encoding of Unicode with a variable number of bytes per character. Unicode as a charset is a superset of ISO 8859-1 (aka latin1), itself a superset of ASCII.
In ISO 8859-1, the non-breaking space (code point 0xa0, as in Unicode) is expressed as a single 0xa0 byte. In UTF-8, however, only code points 0 to 127 are expressed as one byte, which is what makes UTF-8 a superset of ASCII: every ASCII file is also a valid UTF-8 file.
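To see the difference between the two encodings on your own machine, something like the following should work. It is only a sketch: it assumes iconv, od and xargs are installed, and the variable names are made up for illustration. It writes the single byte 0xa0 (octal 240), treats it as ISO 8859-1, and converts it to UTF-8.

```shell
# The non-breaking space as ISO 8859-1 stores it: one byte, octal 240.
# xargs is used here only to trim od's padding.
latin1_bytes=$(printf '\240' | od -An -b | xargs)

# The same character after conversion to UTF-8: two bytes.
utf8_bytes=$(printf '\240' | iconv -f ISO-8859-1 -t UTF-8 | od -An -b | xargs)

echo "ISO 8859-1: $latin1_bytes"
echo "UTF-8:      $utf8_bytes"
```

The second line of output should show the same 302 240 pair that od reported for your file.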
Code points from 128 upwards are encoded with two or more bytes per character. U+00A0 needs two, and the 302 (0xC2) you see is the lead byte of that pair. See Wikipedia for details of the UTF-8 encoding algorithm.
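For the two-byte case (U+0080 to U+07FF), the arithmetic is short enough to do in the shell. This is a minimal sketch, not a full encoder: the function name utf8_two_byte_octal is hypothetical, and it does no range checking. The lead byte has the form 110xxxxx and carries the top five bits of the code point; the continuation byte has the form 10xxxxxx and carries the low six bits.

```shell
# Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes
# and print them in octal, the notation od -b uses.
utf8_two_byte_octal() {
    cp=$(( $1 ))
    lead=$(( 0xC0 | (cp >> 6) ))     # 110xxxxx: top five bits
    cont=$(( 0x80 | (cp & 0x3F) ))   # 10xxxxxx: low six bits
    printf '%o %o\n' "$lead" "$cont"
}

utf8_two_byte_octal 0xA0   # prints: 302 240
```

So the 240 in your dump is not the code point itself but the continuation byte, which only happens to have the same value as the code point for U+00A0.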