PostgreSQL – How to Remove Partial Duplicates from a Table

deleteduplicationpostgresql

I have a database that contains some erroneous results that are essentially but not technically duplicates. The structure is like this:

id_page (PK), id_site, label, create_date

There should only be one entry for each site for each label but it turns out that I have some duplicates for site and label, where id_page and create_date are different. I would like to remove the rows with the lowest create_date.

I think I have come up with a solution for this but I'd appreciate feedback.

WITH duplicates as (
   SELECT id_page, id_site, count(id_site) over (partition by id_site) as ct, 
    min("create_date") over (partition by id_site) as dt
   from pages
   where label = '2018-12-15'
   )
DELETE from pages
where id_page in (
   select p.id_page
   from duplicates as d
   join pages as p on (p.id_page = d.id_page
                   and p.create_date = d.dt)
WHERE ct = 2
);

Best Answer

WITH duplicates as 
(
    SELECT id_page, id_site,
           row_number() over (partition by id_site order by id_site, create_date) rn
    FROM   pages
    WHERE  label = '2018-12-15'
)
DELETE FROM pages 
WHERE  id_page IN (SELECT id_page 
                   FROM   duplicates
                   WHERE  rn > 1);