Sql-server – Replace special characters in a column with space

replacesql serversql-server-2008-r2t-sql

I'm trying to write a query which replaces the special characters with space. Below code helps to identify the rows. (alpha-numeric characters, comma and space is valid):

SELECT columnA
FROM tableA
WHERE columnA like '%[^a-Z0-9, ]%'

How can I integrate the replace function into the select statement so that all characters other than alphanumeric, comma and space in the result set are replaced by ' ' (space). This one won't work:

SELECT replace(columnA,'%[^a-Z0-9, ]%',' ')
FROM tableA
WHERE columnA like '%[^a-Z0-9, ]%'

Best Answer

If you are guaranteed to only ever use the 26 letters of the US English alphabet (both upper-case and lower-case versions) then sure, you can get away with using LIKE and/or PATINDEX with the simple range notation of [a-z] (you wouldn't need to use an upper-case "Z" when using a case-insensitive Collation).

But, if you might get characters not found in the en-US alphabet yet available in various Code Pages / Collations for VARCHAR data (e.g. Þ = Latin capital "Thorn" = SELECT CHAR(0xDE)), then you might need to include those in the character class: [a-z0-9, Þ]. Of course, what those extra characters would be is on a per-Code Page basis.

Also, please be aware that both Collation type (SQL Server vs Windows) and sensitivity settings (case, accent, etc sensitive vs insensitive) will affect which characters are included in a particular range. For example, the SQL Server Collations sort upper-case and lower-case letters in the opposite order as the Windows Collations. Meaning, assuming a case-sensitive Collation for both types of Collations, one will do AaBb... and the other will do aAbB.... The effect will be that a will be within the range of A-Z for one of them, but not the other. And the range of a-Z won't match any characters in a binary Collation (one ending in either _BIN or _BIN2, but don't use _BIN) given that the value of A is 65 and a is 97, hence it is an invalid range of 97 through 65 ;-). There are far too many variations to give examples for here so I will try to post a detailed explanation on my blog sometime soon (and then will update this with the link to it). However, if you are going to be strict about only accepting US English characters (even if you might get valid letters from other languages) then your best option will probably be to use the following pattern and Collation:

LIKE '%[^A-Za-z0-9, ]%' COLLATE Latin1_General_100_BIN2

Now, if you are supporting NVARCHAR data and can get "word" characters from various languages, then T-SQL will not be of much help as it has no real way to differentiate these things. In this case, you should use a Regular Expression (RegEx) -- specifically the Replace method / function -- and those are only available through SQLCLR. The following shows an example of replacing several "special" characters, yet leaving all that are valid letters in at least one language:

DECLARE @Test NVARCHAR(500);
SET @Test = N'this$is%a<>TEST,;to}⌡↕strip╞╟╚══¶out_ç_ƒ▀ special-ij-೫-chars-舛-დ-א-B';
SELECT SQL#.RegEx_Replace4k(@Test, N'[\W\p{Pc}-[,]]', N' ', -1, 1, NULL); 

Returns:

this is a  TEST, to   strip      out ç ƒ  special ij ೫ chars 舛 დ א B

The RegEx expression means:

  • \W = a RegEx "escape" meaning "any non-word character"
  • \p{Pc} = a Unicode "category" of "Punctuation, Connector" (this is needed for the match only because this "category" is specifically excluded by the \W escape)
  • -[,] = class subtraction (this is needed to exclude commas from matching as "special" since they are included in the \W escape)

You can do an update of a table simply by issuing:

UPDATE tbl
SET    tbl.field = SQL#.RegEx_Replace4k(tbl.field, N'[\W\p{Pc}-[,]]', N' ', -1, 1, NULL)
FROM   tbl
WHERE  SQL#.RegEx_IsMatch4k(tbl.field, N'[\W\p{Pc}-[,]]', 1, NULL) = 1;

Please note that for these examples, I used two functions available in the Free version SQL# library of SQLCLR functions, which I created (but again, these are free). Also note that I used the "4k" versions which are faster due to using NVARCHAR(4000) instead of NVARCHAR(MAX) parameter types. If your data is using NVARCHAR(MAX), then just remove the "4k" from the function names.

Please also see: