Unix command to convert character encoding in a .csv file

character encoding, command

I need a Unix command to convert a .csv file that is in UNICODE format to ANSI format.

The file is exported from the Cognos environment, and I am unable to change the format in Cognos.

Best Answer

You can use iconv to convert between encodings:

iconv -f utf-8 -t ascii oldfile > newfile
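The encoding names in that command depend on what the export actually contains. In Windows-oriented tools, "Unicode" often means UTF-16 with a byte-order mark and "ANSI" often means Windows-1252; whether that holds for a Cognos export is an assumption here, so it is worth checking the first bytes of the file before picking names. A minimal sketch under that assumption, with export.csv and export_ansi.csv as placeholder file names:

```shell
# Build a small UTF-16 sample to stand in for the real export
# (hypothetical data; skip this step when you have the actual file).
printf 'name,city\nJos\303\251,Z\303\274rich\n' | iconv -f UTF-8 -t UTF-16 > export.csv

# Inspect the first two bytes: ff fe or fe ff is a UTF-16 byte-order mark.
od -An -tx1 -N2 export.csv

# Convert to Windows-1252. iconv stops with an error if a character
# has no equivalent in the target encoding.
iconv -f UTF-16 -t WINDOWS-1252 export.csv > export_ansi.csv
```

Running `iconv -l` lists the encoding names your iconv build accepts, which can differ between systems.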

Related Solutions

Linux Filesystems – Questions About Character Encoding

1. Is it possible to configure a Linux filesystem to use a fixed character encoding to store file names, regardless of the LANG/LC_ALL environment?

No, this is not possible: as you mention in your question, a UNIX file name is just a sequence of bytes; the kernel knows nothing about the encoding, which is entirely a user-space (i.e., application-level) concept.

In other words, the kernel knows nothing about LANG/LC_*, so it cannot translate.
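That a name is just bytes can be seen from the shell. The sketch below assumes a Linux filesystem such as ext4 that accepts arbitrary bytes in names. It creates one file whose name is the GBK encoding of 中文 and one whose name is the UTF-8 encoding of the same text; to the kernel they are two unrelated names. The directory and variable names are made up for illustration.

```shell
dir=$(mktemp -d)

# "中文" as raw bytes: GBK is d6 d0 ce c4, UTF-8 is e4 b8 ad e6 96 87.
gbk_name=$(printf '\326\320\316\304')
utf8_name=$(printf '\344\270\255\346\226\207')

: > "$dir/$gbk_name.txt"
: > "$dir/$utf8_name.txt"

# Dump the directory listing as hex; both byte sequences are stored verbatim.
ls "$dir" | od -An -tx1
```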

2. Is it possible to let different file names refer to the same file?

You can have multiple directory entries referring to the same file; you can do that with hard links or symbolic links.

Be aware, however, that the file names that are not valid in the current encoding (e.g., your GBK character string when you're working in a UTF-8 locale) will display badly, if at all.
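As an illustration, a hard link or a symbolic link can give one file both a UTF-8 name and a GBK name. This is a small sketch that assumes a filesystem accepting arbitrary bytes in names; the paths and the alias.txt name are placeholders.

```shell
dir=$(mktemp -d)
utf8_name=$(printf '\344\270\255\346\226\207')   # "中文" in UTF-8
gbk_name=$(printf '\326\320\316\304')            # "中文" in GBK

printf 'hello\n' > "$dir/$utf8_name.txt"

# Hard link: a second directory entry for the same inode.
ln "$dir/$utf8_name.txt" "$dir/$gbk_name.txt"

# Symbolic link: a separate small file that stores the target's name.
ln -s "$utf8_name.txt" "$dir/alias.txt"

# All three names lead to the same content.
cat "$dir/$gbk_name.txt" "$dir/alias.txt"
```

Note that the links have to be created and maintained per file; nothing keeps the two names in sync when a file is renamed or a new one is added.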

3. Is it possible to patch the kernel to translate character encoding between file-system and current environment?

You cannot patch the kernel to do this (see 1.), but you could, in theory, patch the C library (e.g., glibc) to perform this translation: always convert file names to UTF-8 when it calls the kernel, and convert them back to the current encoding when it reads a file name from the kernel.

A simpler approach could be to write an overlay filesystem with FUSE that redirects every filesystem request to another location after converting the file name to or from UTF-8. Ideally you could mount this filesystem at ~/trans, so that an access to ~/trans/a/GBK/encoded/path makes the FUSE filesystem really access /a/UTF-8/encoded/path.

However, the problem with these approaches is: what do you do with files that already exist on your filesystem and are not UTF-8 encoded? You cannot simply pass their names through untranslated, because then you don't know how to convert them; and you cannot mangle them by translating invalid character sequences to "?", because that could create conflicts...
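If the legacy encoding of the existing names is known, one way around this is a one-off rename instead of a translation layer. The loop below is a sketch that assumes every name that is not valid UTF-8 is GBK, which is exactly the kind of guess that can go wrong on a mixed tree, so try it on a copy first. The directory here is a temporary stand-in for the real one.

```shell
dir=$(mktemp -d)                                  # stand-in for the real directory
: > "$dir/$(printf '\326\320\316\304').txt"       # sample name: "中文" in GBK

for path in "$dir"/*; do
    old=${path##*/}
    # Names that already decode as UTF-8 (including plain ASCII) are left alone.
    if printf '%s' "$old" | iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1; then
        continue
    fi
    # Skip names that do not decode as GBK either.
    new=$(printf '%s' "$old" | iconv -f GBK -t UTF-8) || continue
    # Refuse to overwrite an existing entry.
    [ -e "$dir/$new" ] || mv -- "$path" "$dir/$new"
done
```

The loop handles a single directory only; a recursive version would need to rename the deepest entries first so that parent paths stay valid while it runs.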

Change Text File Encoding without knowing the source encoding

You can use iconv or recode to convert the file. But you will need to specify the source encoding.

The information about the source encoding has to come from somewhere. A plain text file doesn't contain any information about its encoding. Some types of formatted text contain an indication (for example headers in HTML or in LaTeX), but in general, you're on your own. It's up to the environment to know what encoding it uses for its text files.

You can try to guess the source encoding. This only has a chance of working if you have some information about the file: either you know what language it's in (e.g. you know it's in Polish or English), or there's only a small number of potential encodings (e.g. it's either UTF-8 or Latin-1). See "How can I test the encoding of a text file... Is it valid, and what is it?" and "How do I re-encode a mixed encoded text file" for some possibilities, including Enca and Perl's Encode::Guess. You'll need to work out, based on your data set, whether one of these tools can work for you.
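When the candidates are narrowed to two, as in the UTF-8 or Latin-1 case, a validity check is often enough, because Latin-1 text containing accented characters is almost never valid UTF-8. A sketch, with to_utf8 as a made-up helper name and the assumption that anything that is not valid UTF-8 is Latin-1:

```shell
to_utf8() {
    # A file that decodes cleanly as UTF-8 is passed through unchanged.
    if iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
        cat "$1"
    else
        # Assumption: the only other candidate is Latin-1.
        iconv -f ISO-8859-1 -t UTF-8 "$1"
    fi
}

# Two sample files holding the same word in different encodings.
printf 'caf\351\n' > latin1.txt        # Latin-1: é is the single byte e9
printf 'caf\303\251\n' > utf8.txt      # UTF-8: é is the bytes c3 a9

to_utf8 latin1.txt > latin1_as_utf8.txt
to_utf8 utf8.txt > utf8_as_utf8.txt
cmp latin1_as_utf8.txt utf8_as_utf8.txt && echo "same bytes after conversion"
```

Pure ASCII files pass the UTF-8 check and are left unchanged, which is harmless since ASCII is a subset of both encodings.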
