Remove all types of special characters from a Unix .csv file

escape-characters, special-characters

I am processing a huge stream of CSV data coming from a source which includes special characters, such as the following:

`÷ Þ Ÿ ³ Ù ÷`

Here is an example row from a data set which includes these characters:

'÷ÞW' , 'ŸŸŸŸŸŸŸ', '³ŸŸÙ÷'

Here is another example taken from a different data set:

WCP16,2013-06-04 20:06:24,2013-06-04,CPU,PrimeNumberGenerationTest,PASS,USA,HF0SXV1,,,N,9999
WCP06,2013-06-04 20:06:24,2013-06-04,CPU,RegisterTest,PASS,USA,HF0SXV1,,,N,9999
WCD42,2013-06-04 20:06:24,2013-06-04,DVDMINUSRW,MainICTest,PASS,USA,HF0SXV1,,,N,9999
WCP09,2013-06-05 01:52:53,2013-06-05,CPU,SSE3Test,PASS,,?÷ÞQ»,,,N,9999
WCP10,2013-06-05 01:52:53,2013-06-05,CPU,SSE4_1Test,PASS,,?÷ÞQ»,,,N,9999

If I knew what type of characters to expect then I could handle that in Informatica when I read the file.

But in my situation I am not sure what type of data I will get on any given day, and as a result my jobs are failing. So I need a way to remove all special characters from the data.
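Before stripping anything, it can help to first find out which rows actually contain non-ASCII bytes on a given day. One way to do that (a sketch of my own, assuming GNU grep built with PCRE support for the -P option, and using the same input_file.csv name as the examples below) is:

```shell
# Print every line that contains at least one byte outside the ASCII
# range (0x00-0x7F), with its line number, so you can inspect what the
# source is actually sending before deciding how to clean it.
grep -nP '[^\x00-\x7F]' input_file.csv
```

This only reports the problem lines; the answers below show how to actually remove the characters.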

Best Answer

I'm not sure exactly what you mean by "special characters", so I'm going to assume that you want to get rid of non-ASCII characters. There are a few different tools that might work for you. The first few that come to mind for me are:

  1. iconv (internationalization conversion)
  2. tr (translate)
  3. sed (stream editor)

iconv (internationalization conversion)

Here is a solution using iconv:

iconv -c -f utf-8 -t ascii input_file.csv

The -f flag (from) specifies an input format, the -t flag (to) specifies an output format, and the -c flag tells iconv to discard characters that cannot be converted to the target. This writes the results to standard output (i.e. to your console). If you want to write the results to a new file you would do something like this instead:

iconv -c -f utf-8 -t ascii input_file.csv -o output_file.csv

Then, if you want, you can replace the original file with the new file:

mv -i output_file.csv input_file.csv

Here is how iconv handles your first example string:

$ echo "'÷ÞW' , 'ŸŸŸŸŸŸŸ', '³ŸŸÙ÷'" | iconv -c -f utf8 -t ascii
'W' , '', ''

tr (translate)

Here is a solution using the tr (translate) command:

tr -cd '\000-\177' < input_file.csv

The \000-\177 pattern specifies the numerical range 0-127 in octal notation, which is the range of values for ASCII characters. The -c flag tells tr to match values in the complement of this range (i.e. to match non-ASCII characters) and the -d flag tells tr to perform deletion (instead of translation).

To write the results to a file you would use output redirection:

tr -cd '\000-\177' < input_file.csv > output_file.csv

Here is how tr handles your first example string:

$ echo "'÷ÞW' , 'ŸŸŸŸŸŸŸ', '³ŸŸÙ÷'" | tr -cd '\000-\177'
'W' , '', ''
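One practical difference from iconv: tr operates on raw bytes and never parses the encoding, so it also copes with input that is not valid UTF-8. The lone 0x80 byte below (written as the octal escape \200) is an invalid UTF-8 sequence I made up for illustration:

```shell
# A lone 0x80 byte is not valid UTF-8, but tr deletes it all the same,
# because it only looks at byte values, not at character encoding.
printf 'a\200b\n' | tr -cd '\000-\177'
# prints: ab
```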

sed (stream editor)

Here is a solution using sed:

sed 's/[\d128-\d255]//g' input_file.csv

The s prefix tells sed to perform substitution, the g suffix makes the substitution global (by default only the first occurrence on each line is replaced), the pattern [\d128-\d255] matches characters with decimal values in the range 128-255 (i.e. non-ASCII characters), and the empty string between the second and third forward slashes replaces each match with nothing (i.e. removes it). Note that the \dNNN decimal escape is a GNU sed extension, so this will not work with other sed implementations.

Unlike many other programs, sed has an option to update the file in-place (instead of manually writing to a different file and then replacing the original):

sed -i 's/[\d128-\d255]//g' input_file.csv

Here is how sed handles your first example string:

$ echo "'÷ÞW' , 'ŸŸŸŸŸŸŸ', '³ŸŸÙ÷'" | sed 's/[\d128-\d255]//g'
'W' , '', ''
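As an alternative to the decimal escapes, hexadecimal escapes can express the same range, and forcing the C locale makes sed operate byte-by-byte instead of on multibyte characters (a variant of my own, still assuming GNU sed):

```shell
# Delete every byte with the high bit set (0x80-0xFF), i.e. every
# non-ASCII byte; LC_ALL=C makes sed treat the input as single bytes.
LC_ALL=C sed 's/[\x80-\xFF]//g' input_file.csv
```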