I recently heard the advice: "If you're using DISTINCT, I challenge you to fix your code." However, I'm wondering what considerations are behind this advice, and whether it means that I should approach the following problem differently.
Problem context
I have a denormalised date dimension table (Kimball star schema). In this table, I have collapsed several 3rd normal form tables: date, absolute week, absolute month and year. For each of these collapsed tables, I have a column that represents a unique identity, a textual name column and a numeric manual sort ordering index. Here's an example row to show the structure, the numbers are illustrative only (in vertical form for ease of reading):
- DateId: 20110507
- DateName: 7th May 2011
- DateOrder: 17813
- WeekId: 201118
- WeekName: Week 18, 2011
- WeekOrder: 617
- MonthId: 201105
- MonthName: May 2011
- MonthOrder: 74
- YearId: 2011
- YearName: 2011
- YearOrder: 41
I've chosen to use a date dimension table for ease of understanding, but the problem can be transposed to any other dimension table involving hierarchies.
Problem synopsis
I want to retrieve the distinct set of week and month combinations with the related textual fields.
Possible solutions
1. DISTINCT
SELECT DISTINCT MonthId, WeekId, YearName, MonthName, WeekName
FROM DimDate
Note that I cannot exclude MonthId or WeekId because I cannot assume that free text fields are unique.
2. GROUP BY
SELECT MIN(YearName), MIN(MonthName), MIN(WeekName)
FROM DimDate
GROUP BY MonthId, WeekId
3. ROWNUMBER
WITH grp AS (
SELECT YearName, MonthName, WeekName
, ROW_NUMBER() OVER (PARTITION BY MonthId, WeekId) AS r
FROM DimDate
)
SELECT YearName, MonthName, WeekName
FROM grp
WHERE grp.r = 1
4. ETL
In this approach, indexed views of every combination of non-leaf attributes are precalculated and queried directly. Obviously this is fastest at query time, but could require many duplications of the dimension table, depending on the complexity of the hierarchies in the dimension.
Comments
I would expect that the ROW_NUMBER
solution is the fastest (excluding the ETL approach), but it also requires a subquery or CTE in the query specification. Although this is not a serious problem, it does make the query harder to read. I would expect that the GROUP BY
and DISTINCT
methods to be almost equivalent, both suffering from the fact that every column has to be used in some form of sort operation, whereas the ROW_NUMBER
solution only requires the partitioning columns to be sorted.
Am I missing something? Can SQL Server make more effective use of non-unique indexing of denormalised column identities?
–edit – additional notes:
-
The week and month identifiers are absolute, so they imply the year in this scheme. I enforce that weeks are only in a single year. In SSAS, would be two hierarchies: date, week, year and date, month, year.
-
I believe DISTINCT gives the same output because the columns requested only vary by week, month or year and year is implied by both week and month.
-
Indexing is currently on the identifier columns only (single column non-unique)
In general, I'm performing an aggregate query on a fact table but reporting by a set of non-leaf dimension attributes which are chosen by an end-user and not known in advance. I was trying to reduce the problem space for this question.
Best Answer
In my experience, an aggregate (DISTINCT or GROUP BY) can be quicker then a ROW_NUMBER() approach. Saying that, ROW_NUMBER is better with SQL Server 2008 than SQL Server 2005.
However, you'll have to try for your situation.
Compare query plans, and use Profiler and SET to capture IO, CPU, Duration etc
For a lot of background, see these SO questions:
Finally, do you need the ROW_NUMBER approach? It looks like you're fixing a problem caused by de-normalisation.
And some notes: