PostgreSQL encoding conversion error

encodingpostgresql

I have a Postgresql database in LATIN9 encoding.

I recently migrated my web applications (PHP) from iso_8859_15 (latin9) to UTF8.

So, now, customers are able to specify UTF-8 characters in web forms like comment form or contact form.

I open PG connections in UTF-8 encoding from PHP (PHP function pg_set_client_encoding('UTF8') or so…)

That works well for standard characters like "éàù"… PG is able to convert them from UTF-8 to LATIN9 (the database encoding). But if someone inserts some exotic character like japanese characters "日本語" PostgreSQL throws an error like this:

ERROR:  character with byte sequence 0xe6 0x97 0xa5 in encoding "UTF8" has no equivalent in encoding "LATIN9"

I know that I should migrate my database to utf8 to solve this problem, but for some reasons, I can not do that for the moment.

In my case, I'd rather PostgreSQL saves my string removing characters it can not convert or for example replacing them with some symbol like "?" rather than throwing an error…

The only way I figured to make things work like this is to open my connections in LATIN9 and use utf8_decode() PHP function that works like this (replacing unknown characters with "?" symbol)

Is it possible to do it in a better way? Maybe a PG parameter (but I don't see such parameter in postgresql.conf)

Or anyone have another idea on how to handle this?

Thanks,

Best Answer

know that I should migrate my database to utf8 to solve this problem, but for some reasons, I can not do that for the moment.

In my case, I'd rather PostgreSQL saves my string removing characters it can not convert or for example replacing them with some symbol like "?" rather than throwing an error...

PostgreSQL does not support this. It's requested periodically, but nobody who requests it does the work to actually implement it in the system and convince the dev team it's an appropriate option to offer.

You will need to do your text-mangling client-side. In PHP, before you send the text to PostgreSQL, you will need to filter out characters that doesn't match the database encoding. How to do that is entirely PHP-specific (start with iconv support, probably). You have described one way to do this, using utf8_decode, already.

Using utf8_decode is actually incorrect, because the function (per the docs) actually assumes the input is ISO-8859-1, i.e. Latin-1. You're using latin-9, i.e. ISO-8859-15. So it'll mangle some of your input characters, in particular the Euro sign. See changes from ISO-8859-1. Instead, use the iconv function. See the surprisingly useful comments on the utf8_decode function documentation.

If in the process of filtering the text you convert it to LATIN9 inside PHP, remember that you must set your client_encoding to latin9, since that's the encoding of the text you'll be sending to PostgreSQL. That means the results will be in latin-9 too, so you must convert all results from PostgreSQL from latin-9 back to PHP's native utf-8.

If you use utf8_encode to convert your latin-9 output from PostgreSQL for consumption in PHP, you'll have the same problem with latin-1 vs latin-9 as you do on utf8_decode.

For that reason, if possible, try to use a filter that replaces characters not supported in latin-9 without actually converting the string to latin-9. It'll save you a bunch of hassle if you can keep client_encoding set to utf-8 and just mangle your strings instead of converting them.

All this said, I strongly recommend upgrading the database to utf-8 instead. The only reason to keep it in latin-9 would be if you have other client applications that can't cope with chars outside the latin-9 range (i.e. they rely on a latin-9 client_encoding)