I'm confused by character-sets in Unix. I have a CSV file downloaded via SFTP:
$ file -ib myfile
text/plain; charset=us-ascii
I'm asking because the data in the file shows up like this:
Flyers: Video Center
While I want:
Flyers: Video Center
I tried:
iconv -f us-ascii -t utf-8 myfile
Which is throwing the following error:
iconv: illegal input sequence at position 528666
Can someone clarify what is going on with the character sets? Can I get the file in UTF-8 when downloading it via SFTP? And how does one usually decide what counts as junk within a character set?
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
$ LC_ALL=C sed -n l
Zimbabwe,175,Unknown Network,-1,Unknown,-1,Unknown,-1,US: Flyers: Video Center:,854088,Standard Display,-998,10/28/2014
$ iconv -f utf-8 -t l1
iconv: illegal input sequence at position 1228354
When I set the terminal's character set to UTF-8 (under Translation), I can see clean data.
But when I read the file as UTF-8 with an ETL tool, the data comes out as junk.
When I grep my file for
"Flyers: Video Center"
I get no match, because the data is actually stored as
"Flyers: Video Center"
Can the file's encoding be changed so that I see what I want?
hexdump for junk characters:
0000000: 4e42 4353 3a20 4e48 4c2e 636f 6d3a 2055 NBCS: NHL.com: U
0000010: 533a 2046 6c79 6572 733a c2a0 5669 6465 S: Flyers:..Vide
0000020: 6fc2 a043 656e 7465 723a 2057 6861 7427 o..Center: What'
0000030: 7320 486f 740a s Hot.
$ dd bs=1 skip=1228300 count=100 < temp1.csv | xxd
100+0 records in
100+0 records out
100 bytes (100 B) copied, 0.000141 seconds, 709 kB/s
0000000: 3031 342c 320a 556e 6b6e 6f77 6e20 436f 014,2.Unknown Co
0000010: 756e 7472 792c 2d31 2c48 756c 7520 4c69 untry,-1,Hulu Li
0000020: 7665 2c33 3738 3834 312c 4e42 433a 2041 ve,378841,NBC: A
0000030: 6d65 7269 6361 e280 9973 2047 6f74 2054 merica...s Got T
0000040: 616c 656e 743a 2053 686f 7274 666f 726d alent: Shortform
0000050: 2c33 3230 3631 3332 2c55 6e6b 6e6f 776e ,3206132,Unknown
0000060: 2053 6974 Sit
Some garbled text:
Junk Americaâs
should have been (note that the apostrophe is ’, not the plain '):
America’s
And
BMW â Golden
should have been (note that the dash is the longer –, not the plain hyphen -):
BMW – Golden
Best Answer
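Issue #1: grepping "Flyers: Video Center"... I don't see the result:
In the hexadecimal dump of the file, notice the two bytes c2a0 between the words Flyers: and Video. This is the UTF-8 encoding of a non-breaking space (NBSP, U+00A0). Grepping for that text with an ordinary space in the pattern is known to fail, because an NBSP and a space are different characters. For more information, read "How to remove special 'M-BM-' character with sed" and "use sed to replace ...Hex c2a0". The short answer is to put the NBSP bytes into the pattern, or to replace them with ordinary spaces before searching.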
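A minimal sketch of both options, assuming a POSIX shell with grep and sed. The sample file and the variable names (`sample`, `nbsp`, `plain`, and so on) are made up for illustration; only the c2a0 bytes come from the hexdump in the question.

```shell
# Hypothetical sample line with the same C2 A0 bytes as in the hexdump.
sample=$(mktemp)
printf 'US: Flyers:\302\240Video\302\240Center: What'\''s Hot\n' > "$sample"

# NBSP as raw UTF-8 bytes (octal 302 240 is hex C2 A0).
nbsp=$(printf '\302\240')

# A pattern with ordinary spaces finds nothing (grep -c prints the match count).
plain=$(grep -c 'Flyers: Video Center' "$sample" || true)

# Option 1: put the NBSP into the pattern itself.
with_nbsp=$(grep -c "Flyers:${nbsp}Video${nbsp}Center" "$sample" || true)

# Option 2: turn every NBSP into an ordinary space, then grep as usual.
# LC_ALL=C makes sed treat the input as plain bytes.
normalized=$(LC_ALL=C sed "s/${nbsp}/ /g" "$sample" | grep -c 'Flyers: Video Center' || true)

echo "plain=$plain with_nbsp=$with_nbsp normalized=$normalized"
rm -f "$sample"
```

Option 2 is probably the more practical one for a CSV that goes on to an ETL tool, since it fixes the data once instead of every pattern.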
Issue #2: 'America’s' shows as 'Americaâs' (??):
Here, the dump contains the three bytes e28099, which are the UTF-8 encoding of U+2019 RIGHT SINGLE QUOTATION MARK (’). Actually, there should be no problem here! You probably got distracted by the problem above (could you confirm?).
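To illustrate, here is a small sketch, assuming `iconv`, `od` and `tr` are available. It checks that e28099 is valid UTF-8 and then reproduces the "â" junk by deliberately misreading the same bytes as ISO-8859-1. That a reader doing the latter is what happens in your ETL tool is only my guess, not something the question confirms; the variable names are made up.

```shell
# The three bytes from the dump: E2 80 99 (octal 342 200 231).
quote=$(printf '\342\200\231')

# Decoding them as UTF-8 succeeds, so the file content itself is fine.
if printf 'America%ss\n' "$quote" | iconv -f UTF-8 -t UTF-8 > /dev/null; then
    valid=yes
else
    valid=no
fi

# Misreading the same bytes as ISO-8859-1 turns each byte into its own
# character: E2 becomes "â", while 80 and 99 become invisible control codes.
mojibake_hex=$(printf '%s' "$quote" | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "valid=$valid mojibake=$mojibake_hex"
```

The hex output starts with c3a2, which is "â" in UTF-8, followed by the encodings of the two control characters.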
If you use grep, sed and other tools with expressions that respect your locale (UTF-8!), then it will work.
If you want to get rid of all those UTF-8 "special" characters, you can use the tips above or iconv (but nowadays, there are few excuses not to support UTF-8). With iconv you can either drop all non-ASCII characters, or preserve only the characters that exist in one target character set such as Latin-1.
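A hedged sketch of these three cases, assuming an `iconv` that supports the `-c` option (omit characters that cannot be converted). The sample line and variable names are hypothetical; they only combine the NBSP and the ’ from the dumps in the question.

```shell
# Hypothetical sample with an NBSP (C2 A0) and a right single quote (E2 80 99).
tmp=$(mktemp)
quote=$(printf '\342\200\231')
printf 'Flyers:\302\240Video, America%ss\n' "$quote" > "$tmp"

# Searching with the real character in the pattern matches.
found=$(grep -c "America${quote}s" "$tmp" || true)

# Drop all non-ASCII characters. Some iconv versions exit non-zero when -c
# had to omit something, hence the "|| true".
ascii=$(iconv -c -f UTF-8 -t ASCII "$tmp" || true)

# Preserve what Latin-1 can represent: the NBSP survives as the single byte
# A0, while U+2019 has no Latin-1 equivalent and is dropped.
latin1_hex=$(iconv -c -f UTF-8 -t ISO-8859-1 "$tmp" | od -An -tx1 | tr -d ' \n')

echo "found=$found"
echo "ascii=$ascii"
echo "latin1=$latin1_hex"
rm -f "$tmp"
```

Note that dropping characters loses information ("America’s" becomes "Americas"). glibc's iconv also accepts a `//TRANSLIT` suffix on the target encoding to substitute look-alike characters instead, but that is implementation-specific, so check your iconv before relying on it.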