PostgreSQL – How to Remove Partial Duplicates from a Table

I have a database that contains some erroneous results that are essentially but not technically duplicates. The structure is like this:

id_page (PK), id_site, label, create_date

There should only be one entry for each site for each label but it turns out that I have some duplicates for site and label, where id_page and create_date are different. I would like to remove the rows with the lowest create_date.

I think I have come up with a solution for this but I'd appreciate feedback.

WITH duplicates as (
   SELECT id_page, id_site, count(id_site) over (partition by id_site) as ct, 
    min("create_date") over (partition by id_site) as dt
   from pages
   where label = '2018-12-15'
   )
DELETE from pages
where id_page in (
   select p.id_page
   from duplicates as d
   join pages as p on (p.id_page = d.id_page
                   and p.create_date = d.dt)
WHERE ct = 2
);

WITH duplicates as ( SELECT id_page, id_site, row_number() over (partition by id_site order by id_site, create_date) rn FROM pages WHERE label = '2018-12-15' ) DELETE FROM pages WHERE id_page IN (SELECT id_page FROM duplicates WHERE rn > 1);

PostgreSQL – How to Remove Partial Duplicates from a Table

Best Answer

Related Question

Best Answer

Related Solutions

Mysql – How to group by when querying duplicate records

DB2 – Remove All Records with Duplicates

Related Question