PostgreSQL – Locate Multiple Duplicate Columns in Table

Tags: duplication, postgresql, subquery

I am trying to report on duplicate records in a single table which has a unique key of app_cao_number. A record counts as a duplicate if any of the following holds: 1. the Passport field is duplicated; 2. the ID field is duplicated; or 3. the Surname+FirstName combination is duplicated.

I can do this easily enough with three passes of the table using ORDER BY. But I am hoping to use a single SELECT statement, with subqueries, to do the job.

Starting with just finding duplicate IDs I have the following statement:

SELECT app_cao_number, app_id,
    (SELECT app_id FROM people p2 
        WHERE p2.app_id IS NOT null 
        AND p2.app_id <> ''
        AND p1.app_cao_number <> p2.app_cao_number 
        AND p1.app_id = p2.app_id 
        GROUP BY p2.app_id) AS DupId
FROM people p1
WHERE app_id IS NOT null
AND app_id <> ''

This appears to get me the results that I want, but it also includes rows that have a null DupId, despite my attempts to ignore blank and null values in the SELECT statement. Once this works I should be able to expand it to include the passport and name checks.

Please can someone explain why the output has nulls in the DupId column? Thank you.

Further:
I thought the GROUP BY clause might be the cause, so I replaced it with DISTINCT (below), but this gave the same result.

(SELECT DISTINCT p2.app_id FROM people p2 
    WHERE p2.app_id IS NOT null 
    AND p2.app_id <> ''
    AND p1.app_cao_number <> p2.app_cao_number 
    AND p1.app_id = p2.app_id 
    ) AS DupId
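A probable cause of the NULLs: a scalar subquery in the SELECT list only computes a value for each outer row and never filters outer rows. A row of p1 with no matching p2 row therefore still appears, with DupId set to NULL, and the conditions inside the subquery restrict p2 only. The following is a minimal sketch that moves the test into the outer WHERE clause with EXISTS. It uses only the table and columns from the query above and has not been run against the real data.

```sql
-- Keep only rows whose app_id also appears on a different app_cao_number.
SELECT p1.app_cao_number,
       p1.app_id
FROM people p1
WHERE p1.app_id IS NOT NULL
  AND p1.app_id <> ''
  AND EXISTS (
        SELECT 1
        FROM people p2
        WHERE p2.app_id = p1.app_id
          AND p2.app_cao_number <> p1.app_cao_number
      )
ORDER BY p1.app_id, p1.app_cao_number;
```

Because the outer row is now filtered rather than decorated, rows without a duplicate drop out of the result instead of showing a NULL.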

Best Answer

Look at this model. Do you need something like this?

create table test (id int, value1 int, value2 int)

insert into test values
(1,11,21),
(2,12,22),
(3,13,23),
(4,14,24),
(5,12,24),
(6,16,26),
(7,17,24),
(8,18,28)

8 rows affected

select t1.id id, 
       t2.id dup_id,
       case when t1.value1 = t2.value1 then 'value 1'
            when t1.value2 = t2.value2 then 'value 2'
            else 'some error'
            end dup_field,
       case when t1.value1 = t2.value1 then t1.value1 :: text
            when t1.value2 = t2.value2 then t1.value2 :: text
            else 'some error'
            end dup_value
from test t1, test t2
where t1.id < t2.id
and ( t1.value1 = t2.value1
      or
      t1.value2 = t2.value2 )

id | dup_id | dup_field | dup_value
-: | -----: | :-------- | :--------
 2 |      5 | value 1   | 12       
 4 |      5 | value 2   | 24       
 4 |      7 | value 2   | 24       
 5 |      7 | value 2   | 24
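Applied to the people table from the question, the same self-join pattern might look like the sketch below. The column names app_passport, app_surname and app_firstname are assumptions, since the question does not give them; only app_cao_number and app_id come from the original query. Blank values are excluded so that two empty passports or IDs do not count as a match.

```sql
-- Each pair of duplicate records is reported once (lower key first).
-- app_passport, app_surname, app_firstname are hypothetical column names.
SELECT p1.app_cao_number AS cao_number,
       p2.app_cao_number AS dup_cao_number,
       CASE WHEN p1.app_passport <> '' AND p1.app_passport = p2.app_passport
                 THEN 'passport'
            WHEN p1.app_id <> '' AND p1.app_id = p2.app_id
                 THEN 'id'
            ELSE 'name'
       END AS dup_field
FROM people p1
JOIN people p2
  ON p1.app_cao_number < p2.app_cao_number
WHERE (p1.app_passport <> '' AND p1.app_passport = p2.app_passport)
   OR (p1.app_id <> '' AND p1.app_id = p2.app_id)
   OR (p1.app_surname = p2.app_surname
       AND p1.app_firstname = p2.app_firstname);
```

As in the model above, the CASE expression reports only the first matching criterion when a pair matches on more than one field.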

Related Solutions

MySQL – Remove duplicate rows in a MySQL table that does not contain a primary key

In the spirit of @yercube's answer, I have an answer that has an added twist.

CREATE TABLE stage
(
    id int not null auto_increment,
    name varchar(20),
    primary key (id)
);
CREATE TABLE stage2 LIKE stage;
INSERT INTO stage (name) SELECT name FROM item;
INSERT INTO stage2 (id) SELECT min_id FROM
(SELECT MIN(id) min_id,name FROM stage GROUP BY name) A;
UPDATE stage2 A INNER JOIN stage B USING (id) SET A.name=B.name;
TRUNCATE TABLE item;
INSERT INTO item (name) SELECT name FROM stage2;
DROP TABLE stage;
DROP TABLE stage2;

This will load stage2 with the first occurrence of each name from item, zap the item table, and load the unique occurrences back.
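One way to confirm the result, sketched here using only the item table and name column from the statements above, is to look for any name that still occurs more than once after the reload:

```sql
-- Lists every name that still has more than one row in item.
-- An empty result means no duplicates remain.
SELECT name,
       COUNT(*) AS copies
FROM item
GROUP BY name
HAVING COUNT(*) > 1;
```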

If you look back at @yercube's answer and compare it with mine, his is much simpler because:

@yercube uses one temp table, while I use two
I had to create a column for iteration control; @yercube did not need to
@yercube has fewer steps
both answers achieve the same thing

I do not expect my answer to be accepted. The sole purpose of my answer was to demonstrate that other answers lose the concise clarity needed to solve your problem. Again, hats off to @yercube.

MySQL GROUP_CONCAT – Fix Duplicate Data Issue When DISTINCT Can’t Be Used

You have identified the source of the problem: recipe is joined to two tables, recipe_detail and recipe_tagmap (and these to several other tables related, respectively, to "ingredients" and "tags"), and recipe has a one-to-many relationship with each of them.

One solution is to GROUP BY and aggregate each side separately first (one aggregation for the tables related to ingredients and another for the tables related to tags), and then join both results back to the main table (recipe):
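To see the multiplication directly, a diagnostic query along these lines may help. It is a sketch that uses only the table and column names appearing in this answer, and it assumes each ingredient and each tag is attached to a recipe at most once.

```sql
-- For each recipe, compare the joined row count with the two distinct counts.
-- A recipe with 2 ingredients and 3 tags produces 6 joined rows.
SELECT recipe.id,
       COUNT(*)                                    AS joined_rows,
       COUNT(DISTINCT recipe_detail.ingredient_id) AS ingredients,
       COUNT(DISTINCT recipe_tagmap.tag_id)        AS tags
FROM recipe
  LEFT JOIN recipe_detail
    ON recipe.id = recipe_detail.recipe_id
  LEFT JOIN recipe_tagmap
    ON recipe.id = recipe_tagmap.recipe_id
WHERE recipe.user_id = 1
GROUP BY recipe.id;
```

Wherever joined_rows exceeds both ingredients and tags, a GROUP_CONCAT over the combined join would repeat values.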

SELECT recipe.*, 
       iid,  
       iname, 
       mabbr, 
       tag
FROM  recipe

  LEFT JOIN 
    ( SELECT recipe_detail.recipe_id,
             GROUP_CONCAT(recipe_detail.ingredient_id) AS iid,  
             GROUP_CONCAT(ingredient.name) AS iname, 
             GROUP_CONCAT(ingredient_mfr.abbr) AS mabbr
      FROM recipe
        JOIN recipe_detail
          ON recipe.id = recipe_detail.recipe_id
        LEFT JOIN ingredient
          ON recipe_detail.ingredient_id = ingredient.id
        LEFT JOIN ingredient_mfr
          ON ingredient.mfr_id = ingredient_mfr.id
      WHERE recipe.user_id = 1
      GROUP BY recipe_detail.recipe_id
    ) AS details
        ON recipe.id = details.recipe_id

  LEFT JOIN
    ( SELECT recipe_tagmap.recipe_id,
             GROUP_CONCAT(recipe_tag.name) AS tag 
      FROM recipe
        JOIN recipe_tagmap
          ON recipe.id = recipe_tagmap.recipe_id
        LEFT JOIN recipe_tag
         ON recipe_tagmap.tag_id = recipe_tag.id
      WHERE recipe.user_id = 1
      GROUP BY recipe_tagmap.recipe_id
    ) AS tags
      ON recipe.id = tags.recipe_id

WHERE recipe.user_id = 1 ;

Tested at: SQL-Fiddle

(Using the recipe table inside the 2 aggregations is not strictly needed but since you only want the recipes of one user, it will help for efficiency, restricting the number of rows retrieved from several tables and aggregated.)
