Unix character set conversion

character encoding

I'm confused by character-sets in Unix. I have a CSV file downloaded via SFTP:

$ file -ib myfile
text/plain; charset=us-ascii

The reason I'm asking about character sets is that the data in the file shows up like this:

Flyers: Video Center

While I want:

Flyers: Video Center

I tried:

iconv -f us-ascii -t utf-8 myfile

Which is throwing the following error:

iconv: illegal input sequence at position 528666

Can someone clarify what's going on with the character sets? Can I request UTF-8 when fetching a file via SFTP? How do we usually decide what counts as junk within a character set?

$ locale  
LANG=en_US.UTF-8  
LC_CTYPE="en_US.UTF-8"  
LC_NUMERIC="en_US.UTF-8"  
LC_TIME="en_US.UTF-8"  
LC_COLLATE="en_US.UTF-8"  
LC_MONETARY="en_US.UTF-8"  
LC_MESSAGES="en_US.UTF-8"  
LC_PAPER="en_US.UTF-8"  
LC_NAME="en_US.UTF-8"  
LC_ADDRESS="en_US.UTF-8"  
LC_TELEPHONE="en_US.UTF-8"  
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=  

$  LC_ALL=C sed -n l  
Zimbabwe,175,Unknown Network,-1,Unknown,-1,Unknown,-1,US: Flyers: Video Center:,854088,Standard Display,-998,10/28/2014

$ iconv -f utf-8 -t l1   
iconv: illegal input sequence at position 1228354  

When I set the terminal's character set to UTF-8 (under Translation), I see clean data.
But when I read the file as UTF-8 with an ETL tool, the data comes through as junk.

When I grep my file for

"Flyers: Video Center" 

I get no result, because the data is actually stored as

"Flyers: Video Center"

Can the file's encoding be changed so that I see what I want?

Hexdump of the junk characters:

0000000: 4e42 4353 3a20 4e48 4c2e 636f 6d3a 2055  NBCS: NHL.com: U  
0000010: 533a 2046 6c79 6572 733a c2a0 5669 6465  S: Flyers:..Vide  
0000020: 6fc2 a043 656e 7465 723a 2057 6861 7427  o..Center: What'  
0000030: 7320 486f 740a                           s Hot.  


$ dd bs=1 skip=1228300 count=100 < temp1.csv | xxd  
100+0 records in  
100+0 records out  
100 bytes (100 B) copied, 0.000141 seconds, 709 kB/s  
0000000: 3031 342c 320a 556e 6b6e 6f77 6e20 436f  014,2.Unknown Co  
0000010: 756e 7472 792c 2d31 2c48 756c 7520 4c69  untry,-1,Hulu Li  
0000020: 7665 2c33 3738 3834 312c 4e42 433a 2041  ve,378841,NBC: A  
0000030: 6d65 7269 6361 e280 9973 2047 6f74 2054  merica...s Got T  
0000040: 616c 656e 743a 2053 686f 7274 666f 726d  alent: Shortform    
0000050: 2c33 3230 3631 3332 2c55 6e6b 6e6f 776e  ,3206132,Unknown  
0000060: 2053 6974                                 Sit  

Some garbled text:

Junk Americaâs   

should have been (note that the apostrophe is ’, not '):

America’s

And

BMW â Golden  

should have been (note that the dash is a long dash –, not the hyphen -):

BMW – Golden 

Best Answer

Issue #1: grepping for "Flyers: Video Center" returns no result:

In the hexadecimal dump of the file, notice the two bytes C2A0 between the words Flyers: and Video. This is the UTF-8 encoding of the non-breaking space (NBSP, U+00A0). A grep pattern containing an ordinary space (0x20) does not match NBSP, which is why the search finds nothing. For more information, read How to remove special 'M-BM-' character with sed and use sed to replace ...Hex c2a0. The short answer (the \x escapes require GNU sed) is:

sed -i.bak -e 's/\xc2\xa0/ /g' /path/to/file
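
To see the effect without touching the real file, here is a minimal sketch on a sample line. The sample text is made up to mimic the hexdump, and the variable names are only for illustration. It builds the NBSP bytes with octal escapes in printf, so it does not depend on GNU-specific \x support.

```shell
# Build the two-byte UTF-8 sequence for U+00A0 (octal 302 240 = hex C2 A0).
nbsp=$(printf '\302\240')

# Sample line mimicking the hexdump: NBSP where ordinary spaces were expected.
line=$(printf 'US: Flyers:\302\240Video\302\240Center')

# Count lines matching a plain-space pattern before the fix.
# "|| true" is needed because grep exits non-zero when nothing matches.
before=$(printf '%s\n' "$line" | grep -c 'Flyers: Video Center' || true)

# Replace every NBSP on the line with an ordinary space (note the g flag).
fixed=$(printf '%s\n' "$line" | sed "s/$nbsp/ /g")

# Count matching lines again after the replacement.
after=$(printf '%s\n' "$fixed" | grep -c 'Flyers: Video Center' || true)

printf 'matches before: %s, after: %s\n' "$before" "$after"
```

The g flag matters here because the sample line, like the one in the hexdump, contains more than one NBSP.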

Issue #2: 'America’s' shows as 'Americaâs' (??):

Here, the dump contains the three bytes e28099, which are the UTF-8 encoding of U+2019 RIGHT SINGLE QUOTATION MARK (’). There should actually be no problem here! You were probably distracted by the problem above (could you confirm?).
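
One plausible explanation for the 'Americaâs' display is that the reading tool decoded these UTF-8 bytes as ISO-8859-1. That is an assumption, not something the question confirms. The sketch below reproduces that misreading with iconv on a sample string; the variable names are illustrative.

```shell
# "America’s" as UTF-8: U+2019 is the three bytes E2 80 99 (octal 342 200 231).
utf8_text=$(printf 'America\342\200\231s')

# Treat those bytes as ISO-8859-1 and re-encode them as UTF-8. This mimics a
# tool that decodes with the wrong charset and then displays the result.
# od prints the resulting bytes in hex; tr strips the spacing.
mojibake_hex=$(printf '%s' "$utf8_text" | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

# E2 turns into U+00E2 (â, bytes c3 a2); 80 and 99 turn into the C1 control
# characters U+0080 and U+0099 (bytes c2 80 and c2 99), which are invisible.
printf '%s\n' "$mojibake_hex"
```

If the tool is configured to decode the file as UTF-8 instead, the three bytes come back as the single character ’.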

If you use grep, sed and other tools with expressions that respect your locale (UTF-8!), it will work:

printf 'America\xe2\x80\x99s\n' | grep --only-matching "[[:punct:]]"
printf 'America\xe2\x80\x99s\n' | sed -e "s/[[:punct:]]/?/"

If you want to get rid of all those UTF-8 "special" characters, you can use the tips above or iconv (but nowadays there are few excuses not to support UTF-8).

Replace all non-ASCII chars with ASCII approximations (//TRANSLIT transliterates rather than drops):

cat a.txt | iconv -f utf8 -t ASCII//TRANSLIT

Or, to preserve only the chars that exist in one charset (here ISO-8859-15):

cat a.txt | iconv -f utf8 -t iso8859-15//TRANSLIT | iconv -f iso8859-15 -t utf8
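
Before converting, it can help to confirm that the input really is valid UTF-8, and to decide whether non-ASCII characters should be dropped rather than transliterated. The following is a minimal sketch on a sample string, assuming an iconv that supports the -c option (GNU iconv does); the variable names are illustrative.

```shell
# Sample text: "America’s" with U+2019 encoded as UTF-8 (octal 342 200 231).
sample=$(printf 'America\342\200\231s')

# A UTF-8 to UTF-8 round trip succeeds only if the input is valid UTF-8.
if printf '%s\n' "$sample" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    validity="valid UTF-8"
else
    validity="not valid UTF-8"
fi
printf '%s\n' "$validity"

# -c omits characters that cannot be represented in the target charset.
# "|| true" guards against iconv versions that exit non-zero after omitting.
stripped=$(printf '%s\n' "$sample" | iconv -c -f UTF-8 -t ASCII || true)
printf '%s\n' "$stripped"
```

Unlike //TRANSLIT, whose substitutions can vary with the iconv implementation and locale, -c simply removes what the target charset cannot hold.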