The flavour in question is PostgreSQL, but it is a generic enough query pattern that your techniques for optimising against other RDBMs should help as well.
The purpose of the query is to show all records from a table that are duplicated along several columns (user selectable).
SELECT usr_id,
team_name, team_streetno, team_city,
usr_firstname, usr_lastname, usr_jobtitle, usr_department, usr_email
FROM team c
JOIN usr p on p.team_id=c.team_id
JOIN
(
select match_firstname, match_lastname, match_jobtitle, match_department, match_email ,
match_team_name, match_team_streetno, match_team_city
FROM usr
JOIN team on usr.team_id=team.team_id
GROUP BY match_firstname, match_lastname, match_jobtitle, match_department, match_email,
match_team_name, match_team_streetno, match_team_city
HAVING count(usr_id)>1
) AS sub
ON sub.match_team_name = c.match_team_name
AND sub.match_team_streetno = c.match_team_streetno
AND sub.match_team_city = c.match_team_city
AND sub.match_firstname = p.match_firstname
AND sub.match_lastname = p.match_lastname
AND sub.match_jobtitle = p.match_jobtitle
AND sub.match_department = p.match_department
AND sub.match_email = p.match_email
ORDER BY usr_firstname, usr_lastname, usr_jobtitle, usr_department, usr_email,
team_name, team_streetno, team_city
The EXPLAIN output looks like this:
I tried this variation to make arrays out of the duplicated usr_id's and JOIN back to the usr table using = ANY
; that increased the duration by about 15%.
SELECT u.usr_id
, t.team_name, t.team_streetno, t.team_city
, u.usr_firstname, u.usr_lastname, u.usr_jobtitle, u.usr_department, u.usr_email
FROM (
SELECT array_agg(u1.usr_id) dups
FROM usr u1
JOIN team t1 ON u1.team_id = t1.team_id
GROUP BY match_firstname, match_lastname, match_jobtitle, match_department, match_email
, match_team_name, match_team_streetno, match_team_city
HAVING count(usr_id)>1
) AS dup
JOIN usr u on u.usr_id = ANY (dup.dups)
JOIN team t on t.team_id = u.team_id
ORDER BY u.usr_firstname, u.usr_lastname, u.usr_jobtitle, u.usr_department, u.usr_email
, t.team_name, t.team_streetno, t.team_city
Look forward to your suggestions.
Consolidating comments
Tried WITH TMP AS (... ,count(*) over (partition by...) c FROM ..) WHERE c>1
, which fared worse than the arrag_agg/ANY approach. Seems like windowing functions in PostgreSQL is not that great, or it needs to be rewritten in another form.
For context, consider census data. If someone else filled it for you, he may fill in Tom one year, Thomas or Tomas the next. The "match" columns are derived from the base columns.
EXISTS
has been tried, but didn't perform as well as the code shown above.
Best Answer
Updated after OP supplied test case voiding much of this answer.
Postgres version
There is a good chance Postgres 9.2 will perform better. A lot of improvements for big data have gone into it.
Simplified query
First off, you can simplify your query quite a bit:
I removed the JOIN to the tableteam
from the subquery, guessing that all the columns stem from tableusr
(my request for table definitions went unanswered ...) That should help a bit.Turns out, some of the columns belong to table team.
I replace the verbose equijoin conditions with the much shorter
USING
syntax. That's also probably slightly faster.Replace
count(usr_id) > 1
withcount(*) > 1
. Does the same here (sinceusr_id
is the PK, ergoNOT NULL
), slightly faster.I took the test case @RichardTheKiwi posted in the comments and created an sqlfiddle from it.
I also ran a local test with temporary tables on PostgreSQL 9.1.6.
Only slightly faster than the query in the question:
1338 ms for 15137 rows (out of 70k).
EXISTS
semi-joins might be a hot contender if you have many duplicates. With only a few, not so much. You already tried and ruled that out. Are you positive you tried optimal syntax?Window function
count()
You could also try if
count()
as window function is faster - without CTE and without additional joins. But I doubt it. I sure does make the query shorter, though:50% slower in my local test.
Multi-column index?
This is largely obsolete after I have seen the test case.
Indexes are obviously hard to use here, since the columns to check for dupes vary. You might be able to profile user behavior and identify the most common combinations to support those with an index. There is a more radical possibility here, though:
Postgres can use multi-column indexes, even if not all index columns are used. Performance is much better if one or more leading columns of the index are used in the condition, but it still works if not. Compare:
Working of indexes in PostgreSQL
If we can assume that most use cases will check on, for instance,
(match_firstname, match_lastname)
, a multi-column index like the following might help. Not sure, you'll have to test:Columns of the index in order of decreasing usefulness. Order of columns in your
GROUP BY
/PARTITION BY
clause is not relevant.Index did not help in my local test - was used, but made the query ~ 20 % slower.
Partial indexes?
If the table is read-only or at least rarely written and many of your columns often have unique values, you could take this a step further. Rows with a unique value in one of the used match_* columns can be excluded right away. For columns with many unique values it would pay to either add a column
match_email_mult boolean
which you setTRUE
ifmatch_email
is not unique for this row. Or, if you have many columns, you could use abit(n)
column or use aninteger
columns for an even small footprint on disk. More under this related question on SO.Don't bother with columns with only few unique values.
Repeat for all qualifying columns. Then add conditions to your query:
Postgres can combine a number of bitmap index scans rather quickly. If that can weed out a large portion of the table to begin with, the query planner might switch from a sequential scan to bitmap-index scans ...
Not tested.