PostgreSQL – Locate Multiple Duplicate Columns in Table

duplicationpostgresqlsubquery

I am trying to report on duplicate records in a single table which has a unique key of app_cao_number. The duplicates occur if either: 1. The Passport field is duplicated; 2. The ID field is duplicated, or; 3. The Surname+FirstName are duplicated.

I can do this easily enough with three passes of the table using ORDER BY. But I am hoping to use a single SELECT statement, with subqueries, to do the job.

Starting with just finding duplicate IDs I have the following statement:

SELECT app_cao_number, app_id,
    (SELECT app_id FROM people p2 
        WHERE p2.app_id IS NOT null 
        AND p2.app_id <> ''
        AND p1.app_cao_number <> p2.app_cao_number 
        AND p1.app_id = p2.app_id 
        GROUP BY p2.app_id) AS DupId
FROM people p1
WHERE app_id IS NOT null
AND app_id <> ''

This appears to get me the results that I want, but also include rows that have a null DupId – despite my attempts to ignore blank and null values in the SELECT statement. Once this works I should be able to expand it to include the passport and name checks.

Please can someone explain why I have the following data output with nulls in the DupId column? Thank you.

enter image description here

Further:
I thought it might be the GROUP BY clause, but I replaced it with a DISTINCT clause (below), but this gave the same result.

(SELECT DISTINCT p2.app_id FROM people p2 
    WHERE p2.app_id IS NOT null 
    AND p2.app_id <> ''
    AND p1.app_cao_number <> p2.app_cao_number 
    AND p1.app_id = p2.app_id 
    ) AS DupId

UPDATE

sample fiddle

Best Answer

Look for the model - does you need something like this?

fiddle

create table test (id int, value1 int, value2 int)
insert into test values
(1,11,21),
(2,12,22),
(3,13,23),
(4,14,24),
(5,12,24),
(6,16,26),
(7,17,24),
(8,18,28)
8 rows affected
select t1.id id, 
       t2.id dup_id,
       case when t1.value1 = t2.value1 then 'value 1'
            when t1.value2 = t2.value2 then 'value 2'
            else 'some error'
            end dup_field,
       case when t1.value1 = t2.value1 then t1.value1 :: text
            when t1.value2 = t2.value2 then t1.value2 :: text
            else 'some error'
            end dup_value
from test t1, test t2
where t1.id < t2.id
and ( t1.value1 = t2.value1
      or
      t1.value2 = t2.value2 )
id | dup_id | dup_field | dup_value
-: | -----: | :-------- | :--------
 2 |      5 | value 1   | 12       
 4 |      5 | value 2   | 24       
 4 |      7 | value 2   | 24       
 5 |      7 | value 2   | 24