Postgresql – ny performance issue with including the “COLLATE” as part of the query as opposed to setting it for the column/table

collationpostgresqlquery-performance

PostgreSQL has this "collate" concept. You can either tell a table's column to always have a "collate", or you can do it in the query.

If I do it in the query, does PG punish me somehow performance-wise? I mean, it makes sense that it can "prepare itself" if I tell the table/column to have a specific "collate", rather than it having to figure it out "on the fly", but is it the case?

Also, any general tips on whether or not I should be having the collate in the query or table/column definition?

Best Answer

While not an expert on PostgreSQL collations, I have used PostgreSQL a little bit (I work mostly with SQL Server) and do not see how this particular info would/could be different than what I am about to describe:

Short answer: It Depends (standard answer).

Actual answer: specifying a collation at the column level (assuming that the collation is completely independent of the encoding / character set, which is the case for PostgreSQL and others, but in SQL Server the encoding is part of the collation) is merely a default that will be used for all sorting and comparison operations that do not explicitly provide a collation via a COLLATE clause. The storage of the data is impacted by the encoding / charset (i.e. Windows-1252 vs UTF-8 vs UTF-16 vs etc), but collation has no impact on data at rest because collation is merely rules for working with the data.

That said, the "depends" part is that if you create an index, it will physically store that data (i.e. the index keys in a separate structure) using the rules of the column's collation (unless you specify the COLLATE clause when creating the index, and not all RDBMSs allow for that). If you then specify a COLLATE clause in the query that is different than the collation used for the index, you will have a performance penalty because you won't be using that index. But, to specify a collation in a query that is different than the column's collation when an index is not available for the column's collation is not a performance issue as the data was not sorted in either collation prior to the query.