Updated after the OP supplied a test case, voiding much of this answer.
Postgres version
There is a good chance Postgres 9.2 will perform better. A lot of improvements for big data have gone into it.
Simplified query
First off, you can simplify your query quite a bit:
SELECT u.usr_id
      ,t.team_name, t.team_streetno, t.team_city
      ,u.usr_firstname, u.usr_lastname, u.usr_jobtitle, u.usr_department, u.usr_email
FROM   usr  u
JOIN   team t USING (team_id)
JOIN  (
   SELECT u.match_firstname, u.match_lastname, u.match_jobtitle, u.match_department
         ,u.match_email, t.match_team_name, t.match_team_streetno, t.match_team_city
   FROM   usr  u
   JOIN   team t USING (team_id)
   GROUP  BY u.match_firstname, u.match_lastname, u.match_jobtitle, u.match_department
            ,u.match_email, t.match_team_name, t.match_team_streetno, t.match_team_city
   HAVING count(usr_id) > 1
   ) AS sub USING (match_firstname, match_lastname, match_jobtitle, match_department
                  ,match_email, match_team_name, match_team_streetno, match_team_city)
ORDER  BY u.usr_firstname, u.usr_lastname, u.usr_jobtitle, u.usr_department, u.usr_email
         ,t.team_name, t.team_streetno, t.team_city;
I removed the JOIN to the table team from the subquery, guessing that all the columns stem from table usr (my request for table definitions went unanswered ...). That should have helped a bit, but it turns out some of the columns belong to table team after all, so the join stays.
I replaced the verbose equijoin conditions with the much shorter USING syntax. That's probably also slightly faster.
You can also replace count(usr_id) > 1 with count(*) > 1. It does the same here (since usr_id is the PK and therefore NOT NULL) and is slightly faster.
I took the test case @RichardTheKiwi posted in the comments and created an sqlfiddle from it.
I also ran a local test with temporary tables on PostgreSQL 9.1.6.
Only slightly faster than the query in the question:
1338 ms for 15137 rows (out of 70k).
EXISTS semi-joins might be a hot contender if you have many duplicates. With only a few, not so much. You already tried and ruled that out, but are you positive you tried an optimal syntax?
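For reference, a minimal sketch of the kind of EXISTS semi-join I have in mind, reusing the column names from above - not tested against your data, and if any of the match_* columns can be NULL, the plain equality checks would have to become IS NOT DISTINCT FROM:
SELECT u.usr_id
      ,t.team_name, t.team_streetno, t.team_city
      ,u.usr_firstname, u.usr_lastname, u.usr_jobtitle, u.usr_department, u.usr_email
FROM   usr  u
JOIN   team t USING (team_id)
WHERE  EXISTS (                          -- at least one *other* row with the same match values
   SELECT 1
   FROM   usr  u2
   JOIN   team t2 USING (team_id)
   WHERE  u2.usr_id <> u.usr_id
   AND    u2.match_firstname  = u.match_firstname
   AND    u2.match_lastname   = u.match_lastname
   AND    u2.match_jobtitle   = u.match_jobtitle
   AND    u2.match_department = u.match_department
   AND    u2.match_email      = u.match_email
   AND    t2.match_team_name     = t.match_team_name
   AND    t2.match_team_streetno = t.match_team_streetno
   AND    t2.match_team_city     = t.match_team_city
   );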
Window function count()
You could also test whether count() as a window function is faster - without a CTE and without additional joins. But I doubt it. It sure does make the query shorter, though:
SELECT *
FROM  (
   SELECT u.usr_id
         ,t.team_name, t.team_streetno, t.team_city
         ,u.usr_firstname, u.usr_lastname, u.usr_jobtitle, u.usr_department, u.usr_email
         ,count(*) OVER (PARTITION BY
             u.match_firstname, u.match_lastname, u.match_jobtitle, u.match_department
            ,u.match_email, t.match_team_name, t.match_team_streetno, t.match_team_city) AS ct
   FROM   usr  u
   JOIN   team t USING (team_id)
   ) x
WHERE  ct > 1
ORDER  BY usr_firstname, usr_lastname, usr_jobtitle, usr_department
         ,usr_email, team_name, team_streetno, team_city;
50% slower in my local test.
Multi-column index?
This is largely obsolete now that I have seen the test case.
Indexes are obviously hard to use here, since the columns to check for dupes vary. You might be able to profile user behavior and identify the most common combinations to support those with an index. There is a more radical possibility here, though:
Postgres can use multi-column indexes, even if not all index columns are used. Performance is much better if one or more leading columns of the index are used in the condition, but it still works if not. Compare:
Working of indexes in PostgreSQL
If we can assume that most use cases will check on, for instance, (match_firstname, match_lastname)
, a multi-column index like the following might help. Not sure, you'll have to test:
CREATE INDEX usr_match_multi_idx
ON usr (match_firstname, match_lastname, match_jobtitle, match_department, match_email);
Columns of the index in order of decreasing usefulness. Order of columns in your GROUP BY
/ PARTITION BY
clause is not relevant.
The index did not help in my local test - it was used, but made the query ~20% slower.
Partial indexes?
If the table is read-only or at least rarely written, and many of your columns often hold unique values, you could take this a step further. Rows with a unique value in one of the used match_* columns can be excluded right away. For columns with many unique values it would pay to add a column like match_email_mult boolean, which you set to TRUE if match_email is not unique for this row. Or, if you have many columns, you could use a bit(n) column or an integer column for an even smaller footprint on disk. More under this related question on SO.
Don't bother with columns that have only a few unique values.
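A minimal sketch of how such a flag column could be populated (the column name match_email_mult matches the example above; the ALTER TABLE and the exact update logic are my assumptions, and the flag would have to be maintained on future writes):
ALTER TABLE usr ADD COLUMN match_email_mult boolean NOT NULL DEFAULT false;

UPDATE usr u
SET    match_email_mult = true
WHERE  EXISTS (                    -- some other row shares this match_email
   SELECT 1
   FROM   usr u2
   WHERE  u2.match_email = u.match_email
   AND    u2.usr_id <> u.usr_id
   );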
CREATE INDEX usr_match_email_mult_idx ON usr (usr_id)
WHERE match_email_mult;
Repeat for all qualifying columns. Then add conditions to your query:
WHERE match_firstname_mult
AND match_lastname_mult
AND ...
Postgres can combine a number of bitmap index scans rather quickly. If that can weed out a large portion of the table to begin with, the query planner might switch from a sequential scan to bitmap-index scans ...
Not tested.
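Putting the pieces together, a sketch of how the window-function variant might look with such flags added (assuming flag columns match_firstname_mult, match_lastname_mult and match_email_mult exist as described above - rows that are unique in any flagged column cannot be part of a duplicate group, so filtering them out early does not change the result):
SELECT *
FROM  (
   SELECT u.usr_id
         ,t.team_name, t.team_streetno, t.team_city
         ,u.usr_firstname, u.usr_lastname, u.usr_jobtitle, u.usr_department, u.usr_email
         ,count(*) OVER (PARTITION BY
             u.match_firstname, u.match_lastname, u.match_jobtitle, u.match_department
            ,u.match_email, t.match_team_name, t.match_team_streetno, t.match_team_city) AS ct
   FROM   usr  u
   JOIN   team t USING (team_id)
   WHERE  u.match_firstname_mult       -- hypothetical flag columns, see above
   AND    u.match_lastname_mult
   AND    u.match_email_mult
   ) x
WHERE  ct > 1;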
Well, the issue is now resolved:
While it would seem logical to use filtered indexes (NOT NULL) to reduce database size and, as so many sources on the web say, increase performance, the reality, it seems, is something else entirely.
In layman's terms, the SQL Server query planner resolves even your basic inner joins without making any assumptions about the content of the columns. Even though NULL values do not form a join, they must be included in the column index for the query planner to use it, unless otherwise specified with predicates such as WHERE joinCol_ID IS NOT NULL. Basically, SQL Server does not use filtered indexes for joins at all, unless the queries themselves are modified to account for the filter value. Instead, it will create new statistics on these columns and/or use a clustered index scan or other indexes that include the column, whichever it deems most effective. Using filtered indexes on foreign keys is therefore an absolutely horrid idea.
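To illustrate the point with a hedged example (table and column names are invented for illustration, not taken from the original schema):
-- Filtered index on a nullable foreign key column
CREATE INDEX IX_orders_customer_id
    ON dbo.orders (customer_id)
    WHERE customer_id IS NOT NULL;

-- A plain join typically cannot use the filtered index:
SELECT o.order_id, c.customer_name
FROM   dbo.orders o
JOIN   dbo.customers c ON c.customer_id = o.customer_id;

-- Repeating the filter predicate in the query lets the planner consider it:
SELECT o.order_id, c.customer_name
FROM   dbo.orders o
JOIN   dbo.customers c ON c.customer_id = o.customer_id
WHERE  o.customer_id IS NOT NULL;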
We still have no idea how months' worth of testing this in multiple other environments never produced the same results outside of this one, single DB, but this is the way it's supposed to work. Apparently something that, as far as we know, is not related to cache, statistics or configuration caused the production DB to behave differently and correctly detect and use the filtered indexes, while all of the testing environments simply used the old indexes (seeing as the indexes were dropped and recreated with the same name, this seems a valid theory even if there is no real proof).
So the lesson of the story: The web is filled with examples of how underused filtered indexes are, how awesome they can be. But this serious downside never popped up except as a nagging thought in the back of my head saying "if these are so great, then why aren't NULL values filtered out of indexes by default, since they only take up space and only serve a purpose in special circumstances"? Well, now I know why. :)
Best Answer
Seems like a decent approach. Of course, one should apply some human verification before automatically dropping everything that seems unused. For example, it's conceivable that the statistics were recently reset, or that an index is only used for occasional batch tasks.
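If this is about Postgres (an assumption; the platform isn't stated here), a minimal sketch of such a check against the standard statistics views could look like this - note that idx_scan counts only go back to the last statistics reset, which is exactly why the human verification mentioned above matters:
SELECT s.schemaname
      ,s.relname       AS table_name
      ,s.indexrelname  AS index_name
      ,s.idx_scan
FROM   pg_stat_user_indexes s
JOIN   pg_index i USING (indexrelid)
WHERE  s.idx_scan = 0              -- never used since the last stats reset
AND    NOT i.indisunique           -- keep indexes that enforce constraints
ORDER  BY s.schemaname, s.relname, s.indexrelname;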