I am processing a huge stream of CSV data coming from a source which includes special characters, such as the following:
`÷ Þ Ÿ ³ Ù ÷`
Here is an example row from a data set which includes these characters:
'÷ÞW' , 'ŸŸŸŸŸŸŸ', '³ŸŸÙ÷'
Here is another example taken from a different data set:
WCP16,2013-06-04 20:06:24,2013-06-04,CPU,PrimeNumberGenerationTest,PASS,USA,HF0SXV1,,,N,9999
WCP06,2013-06-04 20:06:24,2013-06-04,CPU,RegisterTest,PASS,USA,HF0SXV1,,,N,9999
WCD42,2013-06-04 20:06:24,2013-06-04,DVDMINUSRW,MainICTest,PASS,USA,HF0SXV1,,,N,9999
WCP09,2013-06-05 01:52:53,2013-06-05,CPU,SSE3Test,PASS,,?÷ÞQ»,,,N,9999
WCP10,2013-06-05 01:52:53,2013-06-05,CPU,SSE4_1Test,PASS,,?÷ÞQ»,,,N,9999
If I knew what type of characters to expect then I could handle that in Informatica when I the read file.
But in my situation I am not sure what type of data I will get on any given day, and as a result my jobs are failing. So I need a way to remove all special characters from the data.
Best Answer
I'm not sure exactly what you mean by "special characters", so I'm going to assume that you want to get rid of non-ASCII characters. There are a few different tools that might work for you. The first few that come to mind for me are:
iconv
(internationalization conversion)tr
(translate)sed
(stream editor)iconv
(internationalization conversion)Here is a solution using iconv:
The
-f
flag (from) specifies an input format, the-t
flag (to) specifies an output format, and the-c
flag tellsiconv
to discard characters that cannot be converted to the target. This writes the results to standard output (i.e. to your console). If you want to write the results to a new file you would do something like this instead:Then, if you want, you can replace the original file with the new file:
Here is how
iconv
handles your first example string:tr
(translate)Here is a solution using the tr (translate) command:
The
\000-\177
pattern specifies the numerical range 0-127 using octal notation. This is the range of values for ASCII characters. The-c
flag tellstr
to match values in the complement of this range (i.e. to match non-ASCII characters) and the-d
flag tellstr
perform deletion (instead of translation).To write the results to a file you would use output redirection:
Here is how
tr
handles your first example string:sed
(stream editor)Here is a solution using sed:
The
s
prefix tellssed
to perform substitution, theg
suffix tellssed
to match patterns globally (by default only the first occurrence is matched), the pattern[\d128-\d255]
tellssed
to match characters with decimal values in the range 128-255 (i.e. non-ASCII characters), and the empty string between the second and third forward-slashes tellssed
to replace matched patterns with the empty string (i.e. to remove them).Unlike many other programs,
sed
has an option to update the file in-place (instead of manually writing to a different file and then replacing the original):Here is how
sed
handles your first example string: