I have a Postgres 11 table on RDS containing a column email
; some of the values in this column [and only that column] are clearly de facto duplicates but differ in case, i.e., different capitalization, such as:
foo@****.com
Foo@****.com
To be clear, none of the present rows are true duplicates, nor share precisely the same values in that column. My objective is to identify these records [and, once found, eliminate/merge the de-facto duplicates].
My initial inclination was to use a self-join, e.g.:
SELECT c.email
FROM schema.table c
INNER JOIN schema.table d ON lower(c.email) = lower(d.email)
ORDER BY c.email;
However, this returns all the email records rather than only those that are de-facto-duplications.
Using a subquery such as the following produces a similar [i.e., too-inclusive] result:
SELECT c.email, alias.email
FROM schema.table c
JOIN (SELECT email FROM schema.table) alias ON lower(c.email) = lower(alias.email);
Since I’m not looking for an aggregate, but rather a case-insensitive comparison, it seems to me that a window function is not the correct approach.
I think that this should be a straightforward query, but I’m having a difficult time seeing it clearly and am sure there is an error in the way I’m conceiving of the problem; it’s pretty frustrating.
In addition to searching here and on SO, I consulted Molinaro’s SQL Cookbook, but to no avail.
What is the correct way to structure the query so that it returns only those records whose email
values are the same, disregarding case?
edit note : my initial question formulation expressed a misguided inclination to use ILIKE for case-insensitive matching, but the use of lower() as suggested in the below answers is far more sensible
Best Answer
Another case for
EXISTS
:If you have a PK, use it instead of
ctid
. Related:This returns every qualifying row once. The added
Why not use a simple join?ORDER BY
helps to keep dupes together and in deterministic sort order (unless your locale is case insensitive).If you have, say, 10 variants of the same email, a simple join would give you 10 over 2 = 90 rows, and repeat every combination with reversed roles. Basically a limited Carthesian product of all dupes for each set of dupes.
Related:
A trigram index as suggested there should greatly help performance with tables of non-trivial size.
Also note that
lower(t.email) = lower(t1.email)
is slightly different fromt.email ILIKE t1.email
. The latter treats the right side as pattern, where some characters have special meaning unless you escape them. See: