Postgresql – Delete duplicates and reroute referencing rows to new master

cteduplicationgreatest-n-per-grouppostgresql

Given three tables:

A users table with id's:

╔════╗
║ id ║
╠════╣
║ 1  ║
╟────╢
║ 2  ║
╟────╢
║ 3  ║
╚════╝

A purposes table with user_id's and name's:

╔════╤═════════╤══════════╗
║ id │ user_id │ name     ║
╠════╪═════════╪══════════╣
║ 1  │ 3       │ Business ║
╟────┼─────────┼──────────╢
║ 2  │ 2       │ Personal ║
╟────┼─────────┼──────────╢
║ 3  │ 2       │ Personal ║
╚════╧═════════╧══════════╝

And a journeys table with user_id's and purpose_id's.

╔════╤═════════╤════════════╗
║ id │ user_id │ purpose_id ║
╠════╪═════════╪════════════╣
║ 1  │ 2       │ 3          ║
╟────┼─────────┼────────────╢
║ 2  │ 1       │ 2          ║
╟────┼─────────┼────────────╢
║ 3  │ 1       │ 1          ║
╚════╧═════════╧════════════╝

We want to construct a query, in PostgreSQL, where we find all of the records in the purposes table that have the same name and user_id, deleting those with the least number of associated journeys while preserving the one record with the maxmium number of journeys while also updating the purpose_id of those journeys whose associated purpose has been deleted.

Right now I have a query to find those duplicate purposes using a window function, however the result only returns one of the duplicates and it's not certain they it has the maximum number of journeys (as I haven't specified that in any of the clauses).

select id from (
  select
    id,
    row_number()
  over(partition by user_id, name) as dupe_count
  from
    purposes
) purposes
where purposes.dupe_count > 1;

Best Answer

Can be a single statement with data-modifying CTEs:

WITH blacklist AS (
   SELECT *
   FROM  (
      SELECT id
           , first_value(id) OVER (PARTITION BY user_id, name
                                   ORDER BY j_ct DESC NULLS LAST, id) AS master_id
                                                               -- id as tiebreaker
      FROM   purposes
      LEFT   JOIN (  -- left join to preserve rows without any journeys
         SELECT purpose_id AS id, count(*) AS j_ct
         FROM   journeys
         GROUP  BY 1
         ) j USING (id)
      ) sub
   WHERE  master_id <> id  -- is not master_id
   )
 , del AS (
   DELETE FROM purposes p
   USING  blacklist b
   WHERE  b.id = p.id
   )
UPDATE journeys j
SET    purpose_id = b.master_id
FROM   blacklist b
WHERE  b.id = j.purpose_id;

This should do exactly what you need - even in cases of ties or where no journeys exist at all.

Step by step

The 1st CTE blacklist identifies all IDs in table purposes where another row exists with the same (user_id, name) but more associated journeys. Remember that other ID as master_id to reassign journeys further down the road. id serves as tiebreaker: the smallest id prevails.

The 2nd CTE del deletes the identified dupes.
The outer UPDATE reroutes journeys to their new master.

Note DESC NULLS LAST in the ORDER BY clause. Important for cases with no journeys at all, since that result in a NULL value for the journey count j_ct. See:

Use the window function first_value() to get the master_id for each group of peers - after forming groups with PARTITION BY and sorting properly with ORDER BY.