PostgreSQL – Show Only Duplicate Values

duplicationpostgresql

How to select rows with duplicate values (non-distinct rows) and summarize with a count of the number of duplicate values in order of the most frequent duplicates?

Given a table of:

person_
- name_
- favorite_color_

…with this data:

Bob       Brown
Lucy      Lavender
Orin      Orange
Liam      Lavender
Betty     Brown
Blair     Brown

…show:

3         Brown
2         Lavender

We omit “Orange” from the list as it is distinct, with no duplicate siblings.

I am most interested in Postgres, but I expect this can be done in standard SQL.

Best Answer

That can be done with a simple SELECT statement.

SELECT count(*) AS count_ , favorite_color_ 
FROM person_ 
GROUP BY favorite_color_
HAVING count(*) > 1
ORDER BY count_ DESC
;

The count(*) command is an aggregate function supported by Postgres and gives the number of rows with duplicate values in a column specified by the GROUP BY clause.

To omit the values without any duplicates, add the HAVING clause. A HAVING is similar to a WHERE but applies to each row being generated to represent the grouping.

To get results to sort with most frequent duplicate values at the top, use the ORDER BY clause. We reference the column alias count_ that we created in the first line with the AS clause.

Related Solutions

PostgreSQL Index – Can It Refer to Only One or a Subset of Values?

You can create a partial index that includes all columns of the table:

CREATE UNIQUE INDEX not_processed
    ON transactions 
        (tran_id, tran_type, tran_note, created_at)
    WHERE process_id = 0 ;

Sql-server – Query that Searches Duplicate Values Based on a Specific Value

You can eliminate the duplicates from your query by using the ROW_NUMBER() aggregate:

IF OBJECT_ID('tempdb..#Table') IS NOT NULL
DROP TABLE #table;
CREATE TABLE #Table
(
      ID01 INT NOT NULL
    , ID02 INT NOT NULL
);

INSERT INTO #Table (ID01, ID02)
VALUES (1, 1)
     , (1, 2) --problematic
     , (1, 3) --problematic
     , (1, 4) --problematic
     , (2, 1)
     , (2, 1)
     , (2, 2) --problematic
     , (3, 1)
     , (3, 1)
     , (4, 1);

;WITH cte AS (
    SELECT DISTINCT 
          t1_ID01 = t1.ID01
        , t1_ID02 = t1.ID02
        , rn1 = ROW_NUMBER() OVER (PARTITION BY t1.ID01 ORDER BY t1.ID01, T1.ID02)
        , rn2 = ROW_NUMBER() OVER (PARTITION BY t1.ID01, t1.ID02 ORDER BY t1.ID01, T1.ID02)
    FROM #Table t1
    )
SELECT *
FROM cte
WHERE cte.rn1 > 1
    AND cte.rn2 = 1;

The first ROW_NUMBER() function, rn1, is used to select rows where there are multiple ID02 values for each individual ID01 value. The second ROW_NUMBER() function, rn2, is used to preserve the case where ID01 and ID02 have multiple duplicate values, "which our system already corrects for automatically".

That pattern can be leveraged to remove the invalid rows from the source table, by using the DELETE FROM <cte> syntax:

;WITH cte AS (
    SELECT t1.ID01
        , t1.ID02
        , rn1 = ROW_NUMBER() OVER (PARTITION BY t1.ID01 ORDER BY t1.ID01, T1.ID02)
        , rn2 = ROW_NUMBER() OVER (PARTITION BY t1.ID01, t1.ID02 ORDER BY t1.ID01, T1.ID02)
    FROM #Table t1
    )
DELETE 
FROM cte
WHERE cte.rn1 > 1
    AND cte.rn2 = 1;

SELECT *
FROM #Table;

The output; first the select, then the table after problematic rows have been removed:

Best Answer

Related Solutions

PostgreSQL Index – Can It Refer to Only One or a Subset of Values?

Sql-server – Query that Searches Duplicate Values Based on a Specific Value

Related Question