SQL Server – the most performant way to group by and join high-cardinality columns in a columnstore table

columnstore, database-design, index, query-performance, sql-server

I currently have a 20-column SQL Server fact table that could grow even wider, into the hundreds of columns. The current row count is about 9 million records. The indexing strategy is currently row-based. The query joins to every lookup table, totaling about a dozen joins (most columns in the table are FKs). This is an analytical table that is regenerated nightly. Query filters are ad hoc, so row-indexing most permutations is not feasible. I would like to explore removing these joins and using a columnstore index, since for the most part only a handful of filter columns will be specified per query request. (A sketch of the kind of index I have in mind follows the table definitions below.)

Given I have three tables:

Table 1: SomeFactTable (~9 million records)

| Column          | Description         | Cardinality                            |
| --------------- | ------------------- | -------------------------------------- |
| Id              | int32 PK            | Unique                                 |
| Timestamp       | datetime, non-null  | Medium                                 |
| LookupTableA_ID | int32, non-null, FK | Medium-high (~40k distinct values)     |
| LookupTableB_ID | int32, non-null, FK | Low (dozens of distinct values)        |
| NumberA         | int32, non-null     | Very high (~8 million distinct values) |

Table 2: LookupTableA (~40,000 records)

| Column      | Description            | Cardinality    |
| --------    | --------------         | -------------- |
| Id          | int32 PK               | Unique         |
| Description | varchar(255), non-null | High           |

Table 3: LookupTableB (~50 records)

| Column      | Description            | Cardinality    |
| --------    | --------------         | -------------- |
| Id          | int32 PK               | Unique         |
| Description | varchar(255), non-null | High           |
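
For reference, this is roughly the kind of columnstore index I have in mind. The index name and column list below are placeholders only, and this assumes SQL Server 2016 or later so the nonclustered columnstore index stays updatable when the table is regenerated nightly:

```sql
-- Placeholder sketch only: the index name and column list are illustrative,
-- not final. Assumes SQL Server 2016+ (updatable nonclustered columnstore index).
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SomeFactTable
    ON dbo.SomeFactTable (Id, [Timestamp], LookupTableA_ID, LookupTableB_ID, NumberA);
```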

I have two questions.

  1. How can I keep the benefits of a columnstore index while also grouping on the high-cardinality NumberA column?

For example, this query runs in about 100 ms (which is great!). If I uncomment the GROUP BY on NumberA, performance degrades to ~900 ms (good, but I'd like to do better).

```sql
SELECT
  MAX(Id) AS Id,
  [Timestamp]
FROM
  [dbo].[SomeFactTable]
GROUP BY
  [Timestamp]
  --, NumberA
ORDER BY
  [Timestamp] ASC
OFFSET 500 ROWS FETCH NEXT 20 ROWS ONLY;
```
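
For anyone reproducing the comparison, the elapsed times can be captured with the session's time statistics. A minimal sketch (same query as above; run it once with the NumberA line commented out and once with it included):

```sql
-- Capture CPU and elapsed time for the comparison above.
SET STATISTICS TIME ON;

SELECT
  MAX(Id) AS Id,
  [Timestamp]
FROM
  [dbo].[SomeFactTable]
GROUP BY
  [Timestamp]
  --, NumberA   -- uncomment to reproduce the slower (~900 ms) variant
ORDER BY
  [Timestamp] ASC
OFFSET 500 ROWS FETCH NEXT 20 ROWS ONLY;

SET STATISTICS TIME OFF;
```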

  2. Most of the time I need to order by one or more columns from the lookup tables, which requires a join. Sometimes these tables have many records (e.g. LookupTableA) and sometimes only a handful (e.g. LookupTableB). When there are many records, performance degrades; when there are only a handful, it does not. How can I maintain high performance when the lookup table has many records?

As an example, this is a join against the medium-cardinality lookup table, also grouping on the high-cardinality column from question 1. The query returns in about 2,900 ms.

```sql
SELECT
  MAX(SomeFactTable.Id) AS Id,
  [Timestamp],
  [Description]
FROM
  dbo.SomeFactTable
  INNER JOIN dbo.LookupTableA ON SomeFactTable.LookupTableA_ID = LookupTableA.Id
GROUP BY
  [Timestamp],
  [Description],
  NumberA
ORDER BY
  [Description] ASC,
  [Timestamp] ASC
OFFSET 500 ROWS FETCH NEXT 20 ROWS ONLY;
```

This is a join against the low-cardinality lookup table, again grouping on the high-cardinality column from question 1. The query returns in about 1,000 ms (and returns much faster still if NumberA is not grouped).

```sql
SELECT
  MAX(SomeFactTable.Id) AS Id,
  [Timestamp],
  [Description]
FROM
  dbo.SomeFactTable
  INNER JOIN dbo.LookupTableB ON SomeFactTable.LookupTableB_ID = LookupTableB.Id
GROUP BY
  [Timestamp],
  [Description],
  NumberA
ORDER BY
  [Description] ASC,
  [Timestamp] ASC
OFFSET 500 ROWS FETCH NEXT 20 ROWS ONLY;
```

Best Answer

I only have a little experience with columnstore indexing, but that experience has been overwhelmingly positive so far. I'm assuming that when you say the cardinality of the SomeFactTable.NumberA column is ~8 million of the ~9 million records, you mean that those ~8 million records share the same value. If so, I think a columnstore index would work great for you on it, because it compresses the data of the columns it covers, and ~8 million out of ~9 million records sharing a value would give roughly a 90% compression ratio.
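
If you do build one, you can sanity-check the compression by looking at the row group stats for the table. A rough sketch, assuming SQL Server 2016+ (the DMV below only returns rows once a columnstore index actually exists on the table):

```sql
-- Rough check of columnstore row group sizes on the fact table
-- (assumes SQL Server 2016+ and that a columnstore index already exists).
SELECT i.name           AS index_name,
       rg.row_group_id,
       rg.state_desc,
       rg.total_rows,
       rg.size_in_bytes
FROM sys.dm_db_column_store_row_group_physical_stats AS rg
JOIN sys.indexes AS i
  ON i.object_id = rg.object_id
 AND i.index_id  = rg.index_id
WHERE rg.object_id = OBJECT_ID(N'dbo.SomeFactTable');
```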

Again, I'm no expert, but I'd recommend testing a columnstore index on just the (NumberA) column versus one on all of the columns you might group on, i.e. (Timestamp, Description, NumberA).
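
A minimal sketch of the two candidates follows. The index names are made up, and since Description lives on the lookup tables rather than the fact table, the wider variant substitutes the FK columns in its place. Note that SQL Server allows only one columnstore index per table, so drop one candidate before creating the other when testing:

```sql
-- Candidate 1: columnstore index on NumberA only (index names are made up).
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SomeFactTable_NumberA
    ON dbo.SomeFactTable (NumberA);

-- Candidate 2: columnstore index on the columns the queries group/join on.
-- Description is not on the fact table, so the FK columns stand in for it.
-- Only one columnstore index is allowed per table, so drop Candidate 1 first:
-- DROP INDEX NCCI_SomeFactTable_NumberA ON dbo.SomeFactTable;
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SomeFactTable_GroupBy
    ON dbo.SomeFactTable ([Timestamp], LookupTableA_ID, LookupTableB_ID, NumberA);
```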