Sql-server – Comparing DISTINCT, GROUP BY and ROW_NUMBER() in SQL Server 2008 with data warehouse workloads

querysql server

I recently heard the advice: "If you're using DISTINCT, I challenge you to fix your code." However, I'm wondering what considerations are behind this advice, and whether it means that I should approach the following problem differently.

Problem context

I have a denormalised date dimension table (Kimball star schema). In this table, I have collapsed several 3rd normal form tables: date, absolute week, absolute month and year. For each of these collapsed tables, I have a column that represents a unique identity, a textual name column and a numeric manual sort ordering index. Here's an example row to show the structure, the numbers are illustrative only (in vertical form for ease of reading):

  • DateId: 20110507
  • DateName: 7th May 2011
  • DateOrder: 17813
  • WeekId: 201118
  • WeekName: Week 18, 2011
  • WeekOrder: 617
  • MonthId: 201105
  • MonthName: May 2011
  • MonthOrder: 74
  • YearId: 2011
  • YearName: 2011
  • YearOrder: 41

I've chosen to use a date dimension table for ease of understanding, but the problem can be transposed to any other dimension table involving hierarchies.

Problem synopsis

I want to retrieve the distinct set of week and month combinations with the related textual fields.

Possible solutions

1. DISTINCT

SELECT DISTINCT MonthId, WeekId, YearName, MonthName, WeekName
  FROM DimDate

Note that I cannot exclude MonthId or WeekId because I cannot assume that free text fields are unique.

2. GROUP BY

SELECT MIN(YearName), MIN(MonthName), MIN(WeekName)
  FROM DimDate
 GROUP BY MonthId, WeekId

3. ROWNUMBER

WITH grp AS (
  SELECT YearName, MonthName, WeekName
       , ROW_NUMBER() OVER (PARTITION BY MonthId, WeekId) AS r
    FROM DimDate
)
SELECT YearName, MonthName, WeekName
  FROM grp
 WHERE grp.r = 1

4. ETL

In this approach, indexed views of every combination of non-leaf attributes are precalculated and queried directly. Obviously this is fastest at query time, but could require many duplications of the dimension table, depending on the complexity of the hierarchies in the dimension.

Comments

I would expect that the ROW_NUMBER solution is the fastest (excluding the ETL approach), but it also requires a subquery or CTE in the query specification. Although this is not a serious problem, it does make the query harder to read. I would expect that the GROUP BY and DISTINCT methods to be almost equivalent, both suffering from the fact that every column has to be used in some form of sort operation, whereas the ROW_NUMBER solution only requires the partitioning columns to be sorted.

Am I missing something? Can SQL Server make more effective use of non-unique indexing of denormalised column identities?

–edit – additional notes:

  1. The week and month identifiers are absolute, so they imply the year in this scheme. I enforce that weeks are only in a single year. In SSAS, would be two hierarchies: date, week, year and date, month, year.

  2. I believe DISTINCT gives the same output because the columns requested only vary by week, month or year and year is implied by both week and month.

  3. Indexing is currently on the identifier columns only (single column non-unique)

In general, I'm performing an aggregate query on a fact table but reporting by a set of non-leaf dimension attributes which are chosen by an end-user and not known in advance. I was trying to reduce the problem space for this question.

Best Answer

In my experience, an aggregate (DISTINCT or GROUP BY) can be quicker then a ROW_NUMBER() approach. Saying that, ROW_NUMBER is better with SQL Server 2008 than SQL Server 2005.

However, you'll have to try for your situation.
Compare query plans, and use Profiler and SET to capture IO, CPU, Duration etc

For a lot of background, see these SO questions:

Finally, do you need the ROW_NUMBER approach? It looks like you're fixing a problem caused by de-normalisation.

And some notes:

  • shouldn't YearID be in the GROUP BY or PARTITION BY?
  • Won't DISTINCT give different output?
  • Are these columns indexed?