I have spent a long time looking for this, and I am getting mixed messages.
In other DBMSs (tested in SQLite, Oracle, MariaDB, MSSQL) I can override the default sort order using the COLLATE
clause:
SELECT *
FROM orderby
ORDER BY string COLLATE … ;
-- SQLite: BINARY | NOCASE
-- MariaDB: utf8mb4_bin | utf8mb4_general_ci
-- Oracle: BINARY | BINARY_CI
-- MSSQL: Latin1_General_BIN | Latin1_General_CI_AS
I have pored over the documentation and searched high and low, but I can’t find anything so straightforward for PostgreSQL.
Is there a COLLATE
clause value that would sort Case Insensitive?
I know there are many questions regarding case sensitivity, but (a) most of them are old and (b) none that I have seen relate to the COLLATE
clause.
FWIW, I am testing on PostgreSQL 11.8. I have a test fiddle on http://sqlfiddle.com/#!17/05cab/1, but it’s only for PostgreSQL 9.6.
MySQL/MariaDB and SQL Server default to case insensitive, and that would certainly make sense when sorting most text. Oracle and SQLite default to case sensitive, but have a relatively simple solution for a case insensitive sort. The default collation for my database is en_US.UTF-8
. I’m trying to fill in a few gaps here.
Best Answer
If you were using PostgreSQL 12 or newer you would be able to create a new collation via the CREATE COLLATION command that specifies
deterministic = false
:Instead of sqlfiddle, I used db<>fiddle, which offers several versions of PostgreSQL, for the example:
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=a8f4b330d04266947fb11e5c04fa4891&hide=2
This solution is similar to what worked for this related answer (also on DBA.SE):
PostgreSQL nondeterministic collations are not supported for LIKE
The first four db<>fiddle example queries with an
ORDER BY
clause show three different sort results:ORDER BY "string" COLLATE "POSIX"
query returns rows in closer to a binary / ordinal ordering with all upper-case (A - Z) grouped before all lower-case (a - z).ORDER BY "string"
andORDER BY "string" COLLATE "en_US.utf8"
queries return rows in what is effectively case-sensitive order with "A" and "a" grouped together, and lower-case of each letter coming before the upper-case of the same letter (aA bB ...).ORDER BY "string" COLLATE "ci"
query returns rows in a truly case-insensitive order with "A" and "a" grouped together, but this time the cases of each particular letter are not in a predefined order and the two cases can even be intermixed within a letter. Since it is not guarantee to produce the same order each time, here is a sample of the result set that I'm currently seeing for this query:The following statement from the Collation Support documentation for version 12 explains that
deterministic
needs to be set tofalse
:But the same note from the same Collation Support documentation for version 11 explains that sorting is still effectively case-sensitive:
Taking a closer look at the
CREATE COLLATION
documentation, it seems that thedeterministic
property (which defaults totrue
but needs to be set tofalse
to do insensitive comparisons and sorting) was introduced in Version 12, which doesn't really help you. (@LaurenzAlbe recommended an alternate syntax in a comment that might work on older versions of libicu, and adjusting the db<>fiddle test linked above for that syntax and to run on PostgreSQL 11, it does not appear to work: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=c3c48e111ed1837987524fee2c54a183&hide=2 )That leaves you with two options:
lower()
functionThe
citext
datatype calls thelower()
function internally, but has the following benefits (according to the documentation, linked directly above):UNIQUE
andPRIMARY KEY
constraintslower()
to both sides of a comparisonThe two main considerations are:
It sounds like
citext
creates a lower-case copy of the data that takes up more space but is slightly faster due to the pre-calculation.Also,
citext
needs to be installed. I modified the earlier example queries to make use ofcitext
to show what can be done on PostgreSQL 11, but it seems thatcitext
is not installed on db<>fiddle. I will put in a request for that. But here is are the test queries that should show how it works on your system if you havecitext
installed:https://dbfiddle.uk/?rdbms=postgres_11&fiddle=7dd95430aad8f646bfa1b7e79e21871c&hide=2
All of that being said, it is important to note that case-insensitive ordering, for the only or right-most / final column, isn't much different than case-sensitive ordering: the same letters across both cases are going to group together. It's just that with case-sensitive ordering, each case within each letter will group together (i.e. all "A"s before or after all "a"s), while case-insensitive ordering allows for "A"s and "a"s to be intermixed (since neither has priority). Allowing different cases of the same letter to intermix becomes important when there is at least one additional sort criteria in the
ORDER BY
list, specified to the right of (i.e. after) the string column in question. The final two example queries in the db<>fiddle linked at the top of this answer shows the difference.With all of this in mind, the main question is: what precisely are you trying to accomplish? The request, as it is currently stated in the question, is for a case-insensitive sort. Does this mean that the comparisons should still be case-sensitive? That would seem like an odd combination given that it is fairly common to have/want a case-insensitive comparison, but the sort is less important as there is often little difference between case-sensitive and insensitive sorting (case-sensitive is consistent, but case-insensitive will have random ordering of strings that differ only in casing, and that randomness sometimes results in the same ordering as what case-sensitive would return), unless there are additional fields being sorted on.
Comment from the O.P.:
It's still not clear why any "gaps" need to be filled, but it sounds like there is a slight issue with terminology: the term "sort" / "sorting" is being used, but I suspect that the main issue is with comparisons, given that there's often no visible difference between case -sensitive and -insensitive when it comes to sorting. And, without overriding a predicate or
ORDER BY
field/expression with aCOLLATE
clause, the same sensitivity (or lack thereof) would be applied to both. This is important because, if I'm correct that O.P. truly means both, then:lower
might actually be slightly better than casting tocitext
in this specific context becausecitext
is going to calllower
anyway but is also doing a datatype conversion. To be fair, I haven't done any testing around this so it's merely an edumacated guess, assuming that when used as a datatype for a column,citext
is slightly better due to the values being pre-calculated, but in this case there's no opportunity to precalculate.citext
orlower
in theORDER BY
will only solve half the issue (and the less important half given, again, that the distinction is barely noticable)ORDER BY
clause and usecitext
as the datatype for the columns, which will take care of both comparisons and sorting.