Remove all types of special characters from a Unix .csv file

escape-characters, special-characters

I am processing a huge stream of CSV data coming from a source which includes special characters, such as the following:

`÷ Þ Ÿ ³ Ù ÷`

Here is an example row from a data set which includes these characters:

'÷ÞW' , 'ŸŸŸŸŸŸŸ', '³ŸŸÙ÷'

Here is another example taken from a different data set:

WCP16,2013-06-04 20:06:24,2013-06-04,CPU,PrimeNumberGenerationTest,PASS,USA,HF0SXV1,,,N,9999
WCP06,2013-06-04 20:06:24,2013-06-04,CPU,RegisterTest,PASS,USA,HF0SXV1,,,N,9999
WCD42,2013-06-04 20:06:24,2013-06-04,DVDMINUSRW,MainICTest,PASS,USA,HF0SXV1,,,N,9999
WCP09,2013-06-05 01:52:53,2013-06-05,CPU,SSE3Test,PASS,,?÷ÞQ»,,,N,9999
WCP10,2013-06-05 01:52:53,2013-06-05,CPU,SSE4_1Test,PASS,,?÷ÞQ»,,,N,9999

If I knew what type of characters to expect then I could handle that in Informatica when I read the file.

But in my situation I am not sure what type of data I will get on any given day, and as a result my jobs are failing. So I need a way to remove all special characters from the data.
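Before stripping anything, it can help to first find out which rows actually contain non-ASCII bytes on a given day. One way to do that (a sketch of my own, assuming GNU grep built with PCRE support for the -P option, and using the same input_file.csv name as the examples below) is:

```shell
# Print every line that contains at least one byte outside the ASCII
# range (0x00-0x7F), with its line number, so you can inspect what the
# source is actually sending before deciding how to clean it.
grep -nP '[^\x00-\x7F]' input_file.csv
```

This only reports the problem lines; the answers below show how to actually remove the characters.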

Best Answer

I'm not sure exactly what you mean by "special characters", so I'm going to assume that you want to get rid of non-ASCII characters. There are a few different tools that might work for you. The first few that come to mind for me are:

  1. iconv (internationalization conversion)
  2. tr (translate)
  3. sed (stream editor)

iconv (internationalization conversion)

Here is a solution using iconv:

iconv -c -f utf-8 -t ascii input_file.csv

The -f flag (from) specifies an input format, the -t flag (to) specifies an output format, and the -c flag tells iconv to discard characters that cannot be converted to the target. This writes the results to standard output (i.e. to your console). If you want to write the results to a new file you would do something like this instead:

iconv -c -f utf-8 -t ascii input_file.csv -o output_file.csv

Then, if you want, you can replace the original file with the new file:

mv -i output_file.csv input_file.csv

Here is how iconv handles your first example string:

$ echo "'÷ÞW' , 'ŸŸŸŸŸŸŸ', '³ŸŸÙ÷'" | iconv -c -f utf8 -t ascii
'W' , '', ''

tr (translate)

Here is a solution using the tr (translate) command:

tr -cd '\000-\177' < input_file.csv

The \000-\177 pattern specifies the numerical range 0-127 in octal notation, which is the range of values for ASCII characters. The -c flag tells tr to match values in the complement of this range (i.e. to match non-ASCII characters) and the -d flag tells tr to perform deletion (instead of translation).

To write the results to a file you would use output redirection:

tr -cd '\000-\177' < input_file.csv > output_file.csv

Here is how tr handles your first example string:

$ echo "'÷ÞW' , 'ŸŸŸŸŸŸŸ', '³ŸŸÙ÷'" | tr -cd '\000-\177'
'W' , '', ''
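One practical difference from iconv: tr operates on raw bytes and never parses the encoding, so it also copes with input that is not valid UTF-8. The lone 0x80 byte below (written as the octal escape \200) is an invalid UTF-8 sequence I made up for illustration:

```shell
# A lone 0x80 byte is not valid UTF-8, but tr deletes it all the same,
# because it only looks at byte values, not at character encoding.
printf 'a\200b\n' | tr -cd '\000-\177'
# prints: ab
```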

sed (stream editor)

Here is a solution using sed:

sed 's/[\d128-\d255]//g' input_file.csv

The s prefix tells sed to perform substitution, the g suffix makes the substitution global (by default only the first occurrence on each line is replaced), the pattern [\d128-\d255] matches characters with decimal values in the range 128-255 (i.e. non-ASCII characters), and the empty string between the second and third forward slashes replaces each match with nothing (i.e. removes it). Note that the \dNNN decimal escape is a GNU sed extension, so this will not work with other sed implementations.

Unlike many other programs, sed has an option to update the file in-place (instead of manually writing to a different file and then replacing the original):

sed -i 's/[\d128-\d255]//g' input_file.csv

Here is how sed handles your first example string:

$ echo "'÷ÞW' , 'ŸŸŸŸŸŸŸ', '³ŸŸÙ÷'" | sed 's/[\d128-\d255]//g'
'W' , '', ''
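As an alternative to the decimal escapes, hexadecimal escapes can express the same range, and forcing the C locale makes sed operate byte-by-byte instead of on multibyte characters (a variant of my own, still assuming GNU sed):

```shell
# Delete every byte with the high bit set (0x80-0xFF), i.e. every
# non-ASCII byte; LC_ALL=C makes sed treat the input as single bytes.
LC_ALL=C sed 's/[\x80-\xFF]//g' input_file.csv
```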