PostgreSQL – Find Near-Duplicate Values in Single Table

amazon-rds, duplication, like, postgresql, query

I have a Postgres 11 table on RDS containing a column email; some of the values in this column [and only that column] are clearly de facto duplicates but differ in case, i.e., different capitalization, such as:

foo@****.com
Foo@****.com

To be clear, none of the present rows are true duplicates, nor share precisely the same values in that column. My objective is to identify these records [and, once found, eliminate/merge the de-facto duplicates].

My initial inclination was to use a self-join, e.g.:

SELECT c.email 
FROM schema.table c 
INNER JOIN schema.table d ON lower(c.email) = lower(d.email)
ORDER BY c.email;

However, this returns all the email records rather than only those that are de-facto-duplications.

Using a subquery such as the following produces a similar [i.e., too-inclusive] result:

SELECT c.email, alias.email 
FROM schema.table c 
  JOIN (SELECT email FROM schema.table) alias ON lower(c.email) = lower(alias.email);

Since I’m not looking for an aggregate, but rather a case-insensitive comparison, it seems to me that a window function is not the correct approach.

I think that this should be a straightforward query, but I’m having a difficult time seeing it clearly and am sure there is an error in the way I’m conceiving of the problem; it’s pretty frustrating.

In addition to searching here and on SO, I consulted Molinaro’s SQL Cookbook, but to no avail.

What is the correct way to structure the query so that it returns only those records whose email values are the same, disregarding case?

Edit note: my initial formulation of the question expressed a misguided inclination to use ILIKE for case-insensitive matching, but lower(), as suggested in the answers below, is far more sensible.

Best Answer

Another case for EXISTS:

SELECT *
FROM   schema.table t
WHERE  EXISTS (
   SELECT FROM schema.table t1
   WHERE  lower(t.email) = lower(t1.email)
   AND    t.ctid <> t1.ctid
   )
ORDER   BY lower(email), email;

If you have a PK, use it instead of ctid. Related:

How do I remove duplicate records in a join table in PostgreSQL?

This returns every qualifying row once. The added ORDER BY helps to keep dupes together and in deterministic sort order (unless your locale is case insensitive).
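A minimal sketch of the primary-key variant mentioned above. The table tbl, its id column and the sample rows are hypothetical stand-ins, not taken from the question:

```sql
-- Hypothetical demo table; "id" stands in for whatever primary key the real table has.
CREATE TEMP TABLE tbl (
   id    int PRIMARY KEY
 , email text NOT NULL
);

INSERT INTO tbl (id, email) VALUES
   (1, 'foo@example.com')
 , (2, 'Foo@example.com')
 , (3, 'bar@example.com');

-- Same EXISTS pattern as above, comparing the primary key instead of ctid.
SELECT *
FROM   tbl t
WHERE  EXISTS (
   SELECT FROM tbl t1
   WHERE  lower(t1.email) = lower(t.email)
   AND    t1.id <> t.id          -- a row must not match itself
   )
ORDER  BY lower(email), email;
```

With this sample data the query should return the rows with id 1 and 2, each once, and skip id 3, which has no case-insensitive twin.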

Why not use a simple join?

If you have, say, 10 variants of the same email, a simple join would give you 10 × 9 = 90 rows (every pair of distinct rows, with each combination repeated in reversed roles), or 100 rows if each row is also allowed to match itself. Basically a limited Cartesian product of all dupes for each set of dupes.
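The row multiplication can be sketched on a smaller scale. The following uses three hypothetical variants of one address instead of ten; the ids and addresses are made up for illustration:

```sql
-- Three case variants of the same (made-up) address.
WITH t(id, email) AS (
   VALUES (1, 'foo@example.com')
        , (2, 'Foo@example.com')
        , (3, 'FOO@example.com')
   )
SELECT (SELECT count(*)
        FROM   t a
        JOIN   t b ON lower(a.email) = lower(b.email)
                  AND a.id <> b.id) AS join_rows     -- 3 * 2 = 6 ordered pairs
     , (SELECT count(*)
        FROM   t a
        WHERE  EXISTS (
           SELECT FROM t b
           WHERE  lower(a.email) = lower(b.email)
           AND    a.id <> b.id)) AS exists_rows;     -- 3: each row once
```

The join produces n × (n - 1) rows per set of n variants, while EXISTS yields each of the n rows exactly once.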

LOWER LIKE vs iLIKE

A trigram index as suggested there should greatly help performance with tables of non-trivial size.
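As a rough sketch of what such indexes could look like, assuming a hypothetical table tbl(id, email) and made-up index names. The plain expression index on lower(email) is an addition of this sketch rather than something the answer states; it targets the lower(...) = lower(...) equality used in the query above, while the trigram index (which needs the pg_trgm extension) targets LIKE / ILIKE pattern matching:

```sql
-- Hypothetical stand-in for the real table.
CREATE TABLE IF NOT EXISTS tbl (
   id    int PRIMARY KEY
 , email text NOT NULL
);

-- Assumption of this sketch: B-tree expression index for equality on lower(email).
CREATE INDEX tbl_lower_email_idx ON tbl (lower(email));

-- Trigram index as mentioned in the text; pg_trgm must be installed first.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX tbl_email_trgm_idx ON tbl USING gin (email gin_trgm_ops);
```

Which index the planner actually uses depends on the query, the data and the Postgres version, so checking with EXPLAIN is advisable.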

Also note that lower(t.email) = lower(t1.email) is slightly different from t.email ILIKE t1.email. The latter treats the right side as pattern, where some characters have special meaning unless you escape them. See:

Escape function for regular expression or LIKE patterns
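A small illustration of that difference, using made-up addresses. The underscore on the right side of ILIKE is a wildcard for any single character, so two clearly different strings can match:

```sql
SELECT 'fooXbar@example.com' ILIKE 'foo_bar@example.com'
          AS ilike_unescaped     -- true:  _ matches any single character
     , 'fooXbar@example.com' ILIKE 'foo\_bar@example.com'
          AS ilike_escaped       -- false: \_ matches only a literal underscore
     , lower('fooXbar@example.com') = lower('foo_bar@example.com')
          AS lower_equal;        -- false: plain string comparison
```

The escaped form assumes the default escape character (backslash) and standard_conforming_strings = on, which is the default in modern Postgres.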

Related Solutions

Mysql – select rows where column contains same data in more than one record

HAVING is a great aggregate filter. (http://dev.mysql.com/doc/refman/5.6/en/group-by-extensions.html) For example, select the article_titles with more than one occurrence:

SELECT count(*), article_title
FROM articles
GROUP BY article_title
HAVING COUNT(*) > 1;

Adding columns to the SELECT and GROUP BY clauses allows you to locate duplicates based on a composite key of multiple columns.
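A sketch of the composite-key case. The articles_demo table, the author_id column and the sample rows are hypothetical and only serve as illustration:

```sql
-- Hypothetical demo table; author_id is not part of the original example.
CREATE TABLE articles_demo (
   article_title varchar(50)
 , author_id     int
);

INSERT INTO articles_demo (article_title, author_id) VALUES
   ('Intro', 1)
 , ('Intro', 1)
 , ('Intro', 2)
 , ('Outro', 1);

-- Duplicates are now defined by the pair (article_title, author_id).
SELECT count(*) AS occurrences, article_title, author_id
FROM   articles_demo
GROUP  BY article_title, author_id
HAVING count(*) > 1;
```

Only the pair ('Intro', 1) occurs more than once here, so the query should return a single row with occurrences = 2.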

Postgresql – List distinct column values where those rows share other column values

This query will do. The trick is to use COUNT(DISTINCT city_id):

SQL Fiddle

PostgreSQL 8.3.20 Schema Setup:

CREATE TABLE Table1
    ("id" int, "name" varchar(6), "language" varchar(1), "dialect" varchar(2), "city_id" int)
;

INSERT INTO Table1
    ("id", "name", "language", "dialect", "city_id")
VALUES
    (01, 'London', 'A', 'A1', 1),
    (02, 'London', 'A', 'A2', 1),
    (03, 'London', 'B', 'B1', 2),
    (04, 'London', 'B', 'B2', 3)
;

Query 1:

SELECT t.*, d.dups
FROM   table1 t
INNER  JOIN (
   SELECT name, language, count(DISTINCT city_id) AS dups
   FROM   table1
   GROUP  BY name, language
   HAVING count(DISTINCT city_id) > 1
) d ON t.name = d.name AND t.language = d.language;

Results:

| ID |   NAME | LANGUAGE | DIALECT | CITY_ID | DUPS |
|----|--------|----------|---------|---------|------|
|  3 | London |        B |      B1 |       2 |    2 |
|  4 | London |        B |      B2 |       3 |    2 |
