Given three tables:
A users
table with id
's:
╔════╗
║ id ║
╠════╣
║ 1 ║
╟────╢
║ 2 ║
╟────╢
║ 3 ║
╚════╝
A purposes
table with user_id
's and name
's:
╔════╤═════════╤══════════╗
║ id │ user_id │ name ║
╠════╪═════════╪══════════╣
║ 1 │ 3 │ Business ║
╟────┼─────────┼──────────╢
║ 2 │ 2 │ Personal ║
╟────┼─────────┼──────────╢
║ 3 │ 2 │ Personal ║
╚════╧═════════╧══════════╝
And a journeys
table with user_id
's and purpose_id
's.
╔════╤═════════╤════════════╗
║ id │ user_id │ purpose_id ║
╠════╪═════════╪════════════╣
║ 1 │ 2 │ 3 ║
╟────┼─────────┼────────────╢
║ 2 │ 1 │ 2 ║
╟────┼─────────┼────────────╢
║ 3 │ 1 │ 1 ║
╚════╧═════════╧════════════╝
We want to construct a query, in PostgreSQL, where we find all of the records in the purposes
table that have the same name
and user_id
, deleting those with the least number of associated journeys while preserving the one record with the maxmium number of journeys while also updating the purpose_id
of those journeys whose associated purpose has been deleted.
Right now I have a query to find those duplicate purposes using a window function, however the result only returns one of the duplicates and it's not certain they it has the maximum number of journeys (as I haven't specified that in any of the clauses).
select id from (
select
id,
row_number()
over(partition by user_id, name) as dupe_count
from
purposes
) purposes
where purposes.dupe_count > 1;
Best Answer
Can be a single statement with data-modifying CTEs:
This should do exactly what you need - even in cases of ties or where no journeys exist at all.
Step by step
The 1st CTE blacklist identifies all IDs in table
purposes
where another row exists with the same(user_id, name)
but more associated journeys. Remember that other ID asmaster_id
to reassign journeys further down the road.id
serves as tiebreaker: the smallestid
prevails.The 2nd CTE
del
deletes the identified dupes.The outer
UPDATE
reroutes journeys to their new master.Note
DESC NULLS LAST
in theORDER BY
clause. Important for cases with no journeys at all, since that result in aNULL
value for the journey countj_ct
. See:Use the window function
first_value()
to get themaster_id
for each group of peers - after forming groups withPARTITION BY
and sorting properly withORDER BY
.