The encoding defines the basic rules for how characters are represented in binary form (like @a_horse explains in his comment). It should be mentioned that client and server have to agree on the encoding of the data they exchange for successful communication. Postgres can translate between the two if necessary; there is a dedicated setting, client_encoding, for this.
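A quick psql sketch of how that looks in practice:

    -- the encoding the data is stored in (fixed when the database is created)
    SHOW server_encoding;
    -- the encoding Postgres assumes for data sent and received by this session
    SHOW client_encoding;
    -- tell Postgres the client speaks LATIN9; the server converts on the fly
    SET client_encoding = 'LATIN9';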
The locale is a collection of settings, which can be split up for PostgreSQL into:
LC_COLLATE
LC_CTYPE
LC_MESSAGES
LC_MONETARY
LC_NUMERIC
LC_TIME
The settings of particular interest for you are LC_COLLATE (defines how strings are sorted) and LC_CTYPE (defines the classification of characters, e.g. which are letters and how they map between upper and lower case).
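For instance, you can inspect the per-database values of the two interesting ones in the pg_database catalog:

    -- lc_collate and lc_ctype are fixed per database at creation time
    SELECT datname, datcollate, datctype
    FROM pg_database;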
In older versions, these two settings could not be changed after a database had been initialized. Since Postgres 9.1 you can at least override the collation setting when needed.
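A sketch of such a per-query override (table, column and collation name are placeholders; the collation has to exist on your system, see pg_collation):

    -- sort with an explicit collation instead of the database default
    SELECT name
    FROM customers
    ORDER BY name COLLATE "de_DE";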
I know that I should migrate my database to UTF-8 to solve this problem, but for some reasons I cannot do that for the moment.
In my case, I'd rather have PostgreSQL save my string by removing the characters it cannot convert, or replacing them with some symbol like "?", than have it throw an error...
PostgreSQL does not support this. It's requested periodically, but nobody who requests it does the work to actually implement it in the system and convince the dev team it's an appropriate option to offer.
You will need to do your text-mangling client-side. In PHP, before you send the text to PostgreSQL, you will need to filter out characters that don't match the database encoding. How to do that is entirely PHP-specific (start with iconv support, probably). You have already described one way to do this, using utf8_decode.
Using utf8_decode is actually incorrect, because the function (per the docs) converts to ISO-8859-1, i.e. Latin-1, whereas you're using latin-9, i.e. ISO-8859-15. So it'll mangle some of your input characters, in particular the Euro sign. See the changes from ISO-8859-1. Instead, use the iconv function. See the surprisingly useful comments on the utf8_decode function documentation.
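You can also verify that particular latin-1 vs latin-9 difference from the server side (the Euro sign exists in LATIN9 but has no counterpart in LATIN1):

    -- works: the Euro sign is byte 0xA4 in LATIN9 / ISO-8859-15
    SELECT convert_to('€', 'LATIN9');
    -- raises an error: the Euro sign has no equivalent in LATIN1
    SELECT convert_to('€', 'LATIN1');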
If in the process of filtering the text you convert it to LATIN9 inside PHP, remember that you must set your client_encoding to latin9, since that's the encoding of the text you'll be sending to PostgreSQL. That means the results will be in latin-9 too, so you must convert all results from PostgreSQL from latin-9 back to PHP's native utf-8.
If you use utf8_encode to convert your latin-9 output from PostgreSQL for consumption in PHP, you'll have the same latin-1 vs latin-9 problem as with utf8_decode.
For that reason, if possible, try to use a filter that replaces characters not supported in latin-9 without actually converting the string to latin-9. It'll save you a bunch of hassle if you can keep client_encoding set to utf-8 and just mangle your strings instead of converting them.
All this said, I strongly recommend upgrading the database to utf-8 instead. The only reason to keep it in latin-9 would be if you have other client applications that can't cope with chars outside the latin-9 range (i.e. they rely on a latin-9 client_encoding).
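If/when you can migrate, the usual route is a dump-and-restore into a freshly created UTF-8 database, roughly along these lines (database name and locale are placeholders):

    -- template0 is needed to override the encoding/locale of the template
    CREATE DATABASE mydb_utf8
        ENCODING 'UTF8'
        LC_COLLATE 'en_US.UTF-8'
        LC_CTYPE 'en_US.UTF-8'
        TEMPLATE template0;
    -- then restore a pg_dump of the old latin-9 database into it;
    -- the dump carries its client_encoding, so the data is converted on restore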
Best Answer
Yes. Encoding is the character-to-byte conversion algorithm. For a mono-byte encoding such as LATIN1 it's trivially byte value = character number, but for UTF-8 it's more complicated.
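A small sketch of that difference, using convert_to to look at the raw bytes:

    -- 'é' is a single byte (0xE9) in LATIN1...
    SELECT convert_to('é', 'LATIN1'), octet_length(convert_to('é', 'LATIN1'));
    -- ...but two bytes (0xC3 0xA9) in UTF-8
    SELECT convert_to('é', 'UTF8'), octet_length(convert_to('é', 'UTF8'));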
Yes, collate specifies how strings are compared. The collation service is provided by the operating system or optionally by the ICU library for PostgreSQL 10 or newer. Sorting arbitrary strings with linguistic rules is a complicated business. The Unicode standard and ISO provide rules but they're heavily customizable and not all C libraries implement them fully anyway. See for instance https://en.wikipedia.org/wiki/Unicode_collation_algorithm for more.
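A classic illustration of how much the chosen collation changes comparisons (the "en_US" collation has to exist on your system; names differ between platforms):

    -- byte-order comparison: 'B' (0x42) sorts before 'a' (0x61)
    SELECT 'a' < 'B' COLLATE "C";      -- false
    -- linguistic comparison: upper and lower case are interleaved, a < b
    SELECT 'a' < 'B' COLLATE "en_US";  -- true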
LC_CTYPE in POSIX is related to the functions in ctype.h. Postgres is strongly influenced by POSIX in that matter and uses a per-database lc_ctype more or less like POSIX uses LC_CTYPE. Aside from upper() and lower(), it's also relevant for regular expressions and full text search (case-folding of tokens).
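A rough illustration of what depends on it (results vary with the database's lc_ctype; with a C/POSIX ctype the accented letter is typically neither uppercased nor recognized as alphabetic):

    -- case mapping of non-ASCII letters depends on lc_ctype
    SELECT upper('étude');
    -- so does character classification in regular expressions
    SELECT 'é' ~ '[[:alpha:]]';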
I think that Postgres could do without an independent lc_ctype (using the same value as lc_collate) except for one thing: it's interesting to be able to set lc_ctype to something.UTF-8, for instance, to have the full range of characters properly supported, and lc_collate to C, because C as a collation is much faster than any linguistic sort implied by lc_collate=lang_country.UTF-8. Robert Haas (Postgres committer) wrote an interesting post about essentially that subject when the COLLATE support was enhanced in Postgres: "The Perils of Collation-Aware Comparisons"
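One practical consequence discussed in that post: an index built with the default linguistic collation cannot serve LIKE 'abc%' prefix searches, whereas a "C"-collated index can, and its byte-wise comparisons are cheaper. A sketch (table and column names are placeholders):

    -- usable for ORDER BY name COLLATE "C" and for LIKE 'abc%' prefix scans
    CREATE INDEX items_name_c_idx ON items (name COLLATE "C");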
This is a Windows-specific thing, due to Windows not following the POSIX model with locales and collations. On Windows and with non-ICU collations, Postgres will convert the strings from the db encoding to UTF-16 (wchar_t) and call wcscoll_l(). This is why the encoding is decorrelated from the collations. With ICU collations, Postgres either passes UTF-8 contents directly if it can, or converts the strings to UTF-16, so again collations are not tied to a particular database encoding, contrary to the POSIX model and its family of strcoll functions.
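For completeness, a minimal sketch of defining and using an ICU collation (PostgreSQL 10 or newer; the locale string and query names are just examples):

    -- a collation backed by the ICU library rather than the OS locale
    CREATE COLLATION german_icu (provider = icu, locale = 'de-DE');
    -- use it explicitly in a query
    SELECT name FROM customers ORDER BY name COLLATE german_icu;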