Postgresql – Cleaning up bad data with REGEXP_REPLACE too aggressive with non latin chars

collationdata-cleansingpostgresqlregexreplace

Trying to use REGEXP_REPLACE to clean up a text field in a table.

Here's my query:

select title original ,REGEXP_REPLACE(title, '[^a-zA-Z0-9]+', ' ') clean
from tmp_table2
WHERE txn_title like '%?%'

Here are my rows:

original                                              clean
�� ���� ?���?� ����� Lionel Casson ��    Lionel Casson 
��  �?� ?���?� ?���� �?� �. �?����   
� Les phobies � l'enfant : impasse ou passage ?      Les phobies chez l enfant impasse ou passage 
- Snälla du! Kan du sätta? : om vägledni i förskolan   Sn lla du Kan du s tta dig om v gledning i f rskolan

The first three rows above are ok for my needs, but the 4th row is not, I want to keep those non-latin characters (words with umlauts, etc…)

How can I keep those character types (non-latin) by using one regex in the replace function above, while keeping the rest of the results as shown in the 'clean' column?

Best Answer

Given proper lc_ctype (not C) and encoding (especially not SQL_ASCII), character classes could be used in regular expressions to include accents without having to specify ranges.

Example:

=# show lc_ctype;
  lc_ctype   
-------------
 fr_FR.UTF-8

=# show server_encoding ;
 server_encoding 
-----------------
 UTF8


=# \set raw_text '- Snälla du! Kan du sätta? : om vägledni i förskolan'

=# select regexp_replace(:'raw_text', '\w', ' ', 'g')
                    regexp_replace                    
------------------------------------------------------
   Snälla du  Kan du sätta    om vägledni i förskolan

If you wanted to keep punctuation signs:

=# select regexp_replace(:'raw_text', '[^\w[:punct:]]', ' ', 'g');
                    regexp_replace                    
------------------------------------------------------
 - Snälla du! Kan du sätta? : om vägledni i förskolan

Another possilibity to clean up could be to filter out non-printable characters:

=# select regexp_replace(:'raw_text', '[^[:print:]]', ' ', 'g');
                    regexp_replace                    
------------------------------------------------------
 - Snälla du! Kan du sätta? : om vägledni i förskolan