Postgresql – Was using `set datcollate=’C’` a mistake

collationindexpostgresqlpostgresql-9.1

In a pre-production environment running django 1.6 on PostgreSQL 9.1 we had a problem with database records arriving in a different order from what the application was expecting. So we ran this command as the postgres user:

update pg_database set datcollate='C' where datname='appname';

About a week later it was suddenly impossible to log into the application as a user which did exist in the database and had been unchanged for years. Attempts at looking up the user by username fails but looking up the same user by primary key worked, and the returned username was in fact correct.

That problem was fixed by reindexing the database like this:

REINDEX DATABASE appname;

Is it possible that the update command has broken the indexes? If that update command is indeed unsafe, then what is the safe way to change collation on a PostgreSQL database?

In case it is relevant, these are the queries that django generates for looking up the user by username or primary key:

SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name", "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff", "auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user" WHERE "auth_user"."username" = username 
SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name", "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff", "auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user" WHERE "auth_user"."id" = 1 

Best Answer

The datcollate column of pg_database stores

LC_COLLATE for this database

An other page of the documentation about collations says:

The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation. This alleviates the restriction that the LC_COLLATE and LC_CTYPE settings of a database cannot be changed after its creation.

That means that by updating that column you managed to cross some river that should not be crossed under any circumstances.

On the other hand, as collations affect the ordering of text, varchar and varchar typed columns (and possibly some others, check the documentation page for details). This also means that the existing indexes on such columns will be ordered differently from what the LC_COLLATE dictates. This will result of some weird behaviour, and as you've seen, at least a REINDEX will be needed. I would expect that doing a

SET enable_indexscan TO off;

would help you finding the user by username, because it would do a sequential scan instead of an index scan, so not trying to use the unusable index.

I cannot tell if this is enough to fix your DB, but I would do the following instead:

  1. set up a new cluster with the desired collation
  2. take a dump from your DB
  3. restore it on the new cluster, watching for errors
  4. test it thoroughly and then start using it instead of the old one

For the future, you cannot change the DB level collation, but you can define collation on the column level.

One can even change the collation on a column like

ALTER TABLE t ALTER COLUMN col1 TYPE text COLLATE "en_US";

where you keep the type as it is, just set a different collation.

Just a final notice: changing the system catalogs directly should always be a last resort. Before doing that (probably except adding new enum values in pre-9.1 versions), one should carefully examine the possible outcomes from the documentation, blog or mailing list posts and testing.