Postgresql – Case Insensitive ORDER BY clause using COLLATE

collationorder-bypostgresql

I have spent a long time looking for this, and I am getting mixed messages.

In other DBMSs (tested in SQLite, Oracle, MariaDB, MSSQL) I can override the default sort order using the COLLATE clause:

SELECT *
FROM orderby
ORDER BY string COLLATE … ;

--  SQLite:     BINARY | NOCASE
--  MariaDB:    utf8mb4_bin | utf8mb4_general_ci
--  Oracle:     BINARY | BINARY_CI
--  MSSQL:      Latin1_General_BIN | Latin1_General_CI_AS

I have pored over the documentation and searched high and low, but I can’t find anything so straightforward for PostgreSQL.

Is there a COLLATE clause value that would sort Case Insensitive?

I know there are many questions regarding case sensitivity, but (a) most of them are old and (b) none that I have seen relate to the COLLATE clause.

FWIW, I am testing on PostgreSQL 11.8. I have a test fiddle on http://sqlfiddle.com/#!17/05cab/1, but it’s only for PostgreSQL 9.6.

MySQL/MariaDB and SQL Server default to case insensitive, and that would certainly make sense when sorting most text. Oracle and SQLite default to case sensitive, but have a relatively simple solution for a case insensitive sort. The default collation for my database is en_US.UTF-8. I’m trying to fill in a few gaps here.

Best Answer

If you were using PostgreSQL 12 or newer you would be able to create a new collation via the CREATE COLLATION command that specifies deterministic = false:

CREATE COLLATION ci (provider = icu, locale = 'en-US-u-ks-level2', deterministic = false);

Instead of sqlfiddle, I used db<>fiddle, which offers several versions of PostgreSQL, for the example:

https://dbfiddle.uk/?rdbms=postgres_12&fiddle=a8f4b330d04266947fb11e5c04fa4891&hide=2

This solution is similar to what worked for this related answer (also on DBA.SE):
PostgreSQL nondeterministic collations are not supported for LIKE

The first four db<>fiddle example queries with an ORDER BY clause show three different sort results:

  1. the ORDER BY "string" COLLATE "POSIX" query returns rows in closer to a binary / ordinal ordering with all upper-case (A - Z) grouped before all lower-case (a - z).
  2. the ORDER BY "string" and ORDER BY "string" COLLATE "en_US.utf8" queries return rows in what is effectively case-sensitive order with "A" and "a" grouped together, and lower-case of each letter coming before the upper-case of the same letter (aA bB ...).
  3. the ORDER BY "string" COLLATE "ci" query returns rows in a truly case-insensitive order with "A" and "a" grouped together, but this time the cases of each particular letter are not in a predefined order and the two cases can even be intermixed within a letter. Since it is not guarantee to produce the same order each time, here is a sample of the result set that I'm currently seeing for this query:
    -8    banana
    -15   Banana
    -9    bANana
    1116  Banana
    

The following statement from the Collation Support documentation for version 12 explains that deterministic needs to be set to false:

Note that while this system allows creating collations that “ignore case” or “ignore accents” or similar (using the ks key), in order for such collations to act in a truly case- or accent-insensitive manner, they also need to be declared as not deterministic in CREATE COLLATION;... Otherwise, any strings that compare equal according to the collation but are not byte-wise equal will be sorted according to their byte values.

But the same note from the same Collation Support documentation for version 11 explains that sorting is still effectively case-sensitive:

Note that while this system allows creating collations that “ignore case” or “ignore accents” or similar (using the ks key), PostgreSQL does not at the moment allow such collations to act in a truly case- or accent-insensitive manner. Any strings that compare equal according to the collation but are not byte-wise equal will be sorted according to their byte values.

Taking a closer look at the CREATE COLLATION documentation, it seems that the deterministic property (which defaults to true but needs to be set to false to do insensitive comparisons and sorting) was introduced in Version 12, which doesn't really help you. (@LaurenzAlbe recommended an alternate syntax in a comment that might work on older versions of libicu, and adjusting the db<>fiddle test linked above for that syntax and to run on PostgreSQL 11, it does not appear to work: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=c3c48e111ed1837987524fee2c54a183&hide=2 )

That leaves you with two options:

  1. lower() function
  2. citext datatype (as mentioned by @a_horse_with_no_name in a comment on the question)

The citext datatype calls the lower() function internally, but has the following benefits (according to the documentation, linked directly above):

  1. it applies to implicit indexes created by UNIQUE and PRIMARY KEY constraints
  2. don't need to remember to add lower() to both sides of a comparison

The two main considerations are:

  • citext is not as efficient as text because the operator functions and the B-tree comparison functions must make copies of the data and convert it to lower case for comparisons. It is, however, slightly more efficient than using lower to get case-insensitive matching.

  • citext doesn't help much if you need data to compare case-sensitively in some contexts and case-insensitively in other contexts. The standard answer is to use the text type and manually use the lower function when you need to compare case-insensitively; this works all right if case-insensitive comparison is needed only infrequently. If you need case-insensitive behavior most of the time and case-sensitive infrequently, consider storing the data as citext and explicitly casting the column to text when you want case-sensitive comparison. In either situation, you will need two indexes if you want both types of searches to be fast.

It sounds like citext creates a lower-case copy of the data that takes up more space but is slightly faster due to the pre-calculation.

Also, citext needs to be installed. I modified the earlier example queries to make use of citext to show what can be done on PostgreSQL 11, but it seems that citext is not installed on db<>fiddle. I will put in a request for that. But here is are the test queries that should show how it works on your system if you have citext installed:

https://dbfiddle.uk/?rdbms=postgres_11&fiddle=7dd95430aad8f646bfa1b7e79e21871c&hide=2

All of that being said, it is important to note that case-insensitive ordering, for the only or right-most / final column, isn't much different than case-sensitive ordering: the same letters across both cases are going to group together. It's just that with case-sensitive ordering, each case within each letter will group together (i.e. all "A"s before or after all "a"s), while case-insensitive ordering allows for "A"s and "a"s to be intermixed (since neither has priority). Allowing different cases of the same letter to intermix becomes important when there is at least one additional sort criteria in the ORDER BY list, specified to the right of (i.e. after) the string column in question. The final two example queries in the db<>fiddle linked at the top of this answer shows the difference.

With all of this in mind, the main question is: what precisely are you trying to accomplish? The request, as it is currently stated in the question, is for a case-insensitive sort. Does this mean that the comparisons should still be case-sensitive? That would seem like an odd combination given that it is fairly common to have/want a case-insensitive comparison, but the sort is less important as there is often little difference between case-sensitive and insensitive sorting (case-sensitive is consistent, but case-insensitive will have random ordering of strings that differ only in casing, and that randomness sometimes results in the same ordering as what case-sensitive would return), unless there are additional fields being sorted on.

Comment from the O.P.:

MySQL/MariaDB and SQL Server default to case insensitive, and that would certainly make sense when sorting most text. Oracle and SQLite default to case sensitive, but have a relatively simple solution for a case insensitive sort. The default collation for my database is en_US.UTF-8. I’m trying to fill in a few gaps here.

It's still not clear why any "gaps" need to be filled, but it sounds like there is a slight issue with terminology: the term "sort" / "sorting" is being used, but I suspect that the main issue is with comparisons, given that there's often no visible difference between case -sensitive and -insensitive when it comes to sorting. And, without overriding a predicate or ORDER BY field/expression with a COLLATE clause, the same sensitivity (or lack thereof) would be applied to both. This is important because, if I'm correct that O.P. truly means both, then:

  1. Performance-wise, I would guess that lower might actually be slightly better than casting to citext in this specific context because citext is going to call lower anyway but is also doing a datatype conversion. To be fair, I haven't done any testing around this so it's merely an edumacated guess, assuming that when used as a datatype for a column, citext is slightly better due to the values being pre-calculated, but in this case there's no opportunity to precalculate.
  2. Using either citext or lower in the ORDER BY will only solve half the issue (and the less important half given, again, that the distinction is barely noticable)
  3. Most likely you need to stop focusing on the ORDER BY clause and use citext as the datatype for the columns, which will take care of both comparisons and sorting.