SQL Server – the most performant way to group by and join high-cardinality columns in a columnstore table

columnstore, database-design, index, query-performance, sql-server

I currently have a 20-column SQL Server fact table that could grow even wider, into the hundreds of columns. The current row count is about 9 million records. The indexing strategy is currently row-based. The query joins to every lookup table, totaling about a dozen joins (most columns in the table are FKs). This is an analytical table that is regenerated nightly. Query filters are ad hoc, so row-indexing most permutations is not feasible. I would like to explore removing these joins and using a columnstore index, since for the most part only a handful of filter columns will be specified per query request. (A sketch of the kind of index I have in mind follows the table definitions below.)

Given I have three tables:

Table 1: SomeFactTable (~9 million records)

| Column          | Description         | Cardinality                            |
| --------------- | ------------------- | -------------------------------------- |
| Id              | int32 PK            | Unique                                 |
| Timestamp       | datetime, non-null  | Medium                                 |
| LookupTableA_ID | int32, non-null, FK | Medium-high (~40k distinct values)     |
| LookupTableB_ID | int32, non-null, FK | Low (dozens of distinct values)        |
| NumberA         | int32, non-null     | Very high (~8 million distinct values) |

Table 2: LookupTableA (~40,000 records)

| Column      | Description            | Cardinality    |
| --------    | --------------         | -------------- |
| Id          | int32 PK               | Unique         |
| Description | varchar(255), non-null | High           |

Table 3: LookupTableB (~50 records)

| Column      | Description            | Cardinality    |
| --------    | --------------         | -------------- |
| Id          | int32 PK               | Unique         |
| Description | varchar(255), non-null | High           |
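
For reference, this is roughly the kind of columnstore index I have in mind. The index name and column list below are placeholders only, and this assumes SQL Server 2016 or later so the nonclustered columnstore index stays updatable when the table is regenerated nightly:

```sql
-- Placeholder sketch only: the index name and column list are illustrative,
-- not final. Assumes SQL Server 2016+ (updatable nonclustered columnstore index).
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SomeFactTable
    ON dbo.SomeFactTable (Id, [Timestamp], LookupTableA_ID, LookupTableB_ID, NumberA);
```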

I have two questions.

  1. How can I keep the benefits of a columnstore index while also grouping on the high-cardinality NumberA column?

For example, this query runs in about 100 ms (which is great!). If I uncomment the GROUP BY on NumberA, performance degrades to ~900 ms (good, but I'd like to do better).

```sql
SELECT
  MAX(Id) AS Id,
  [Timestamp]
FROM
  [dbo].[SomeFactTable]
GROUP BY
  [Timestamp]
  --, NumberA
ORDER BY
  [Timestamp] ASC
OFFSET 500 ROWS FETCH NEXT 20 ROWS ONLY;
```
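
For anyone reproducing the comparison, the elapsed times can be captured with the session's time statistics. A minimal sketch (same query as above; run it once with the NumberA line commented out and once with it included):

```sql
-- Capture CPU and elapsed time for the comparison above.
SET STATISTICS TIME ON;

SELECT
  MAX(Id) AS Id,
  [Timestamp]
FROM
  [dbo].[SomeFactTable]
GROUP BY
  [Timestamp]
  --, NumberA   -- uncomment to reproduce the slower (~900 ms) variant
ORDER BY
  [Timestamp] ASC
OFFSET 500 ROWS FETCH NEXT 20 ROWS ONLY;

SET STATISTICS TIME OFF;
```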

  2. Most of the time I need to order by one or more columns from the lookup tables, which requires a join. Sometimes these tables have many records (e.g. LookupTableA) and sometimes only a handful (e.g. LookupTableB). When there are many records, performance degrades; when there are only a handful, it does not. How can I maintain high performance when the lookup table has many records?

As an example, this is a join against the medium-cardinality lookup table, also grouping on the high-cardinality column from question 1. The query returns in about 2,900 ms.

```sql
SELECT
  MAX(SomeFactTable.Id) AS Id,
  [Timestamp],
  [Description]
FROM
  dbo.SomeFactTable
  INNER JOIN dbo.LookupTableA ON SomeFactTable.LookupTableA_ID = LookupTableA.Id
GROUP BY
  [Timestamp],
  [Description],
  NumberA
ORDER BY
  [Description] ASC,
  [Timestamp] ASC
OFFSET 500 ROWS FETCH NEXT 20 ROWS ONLY;
```

This is a join against the low-cardinality lookup table, again grouping on the high-cardinality column from question 1. The query returns in about 1,000 ms (and returns much faster still if NumberA is not grouped).

```sql
SELECT
  MAX(SomeFactTable.Id) AS Id,
  [Timestamp],
  [Description]
FROM
  dbo.SomeFactTable
  INNER JOIN dbo.LookupTableB ON SomeFactTable.LookupTableB_ID = LookupTableB.Id
GROUP BY
  [Timestamp],
  [Description],
  NumberA
ORDER BY
  [Description] ASC,
  [Timestamp] ASC
OFFSET 500 ROWS FETCH NEXT 20 ROWS ONLY;
```

Best Answer

I only have a little experience with columnstore indexing, but that experience has been overwhelmingly positive so far. I'm assuming that when you say the cardinality of the SomeFactTable.NumberA column is ~8 million of the ~9 million records, you mean that those ~8 million records share the same value. If so, I think a columnstore index would work great for you on it, because it compresses the data of the columns it covers, and ~8 million out of ~9 million records sharing a value would give roughly a 90% compression ratio.
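
If you do build one, you can sanity-check the compression by looking at the row group stats for the table. A rough sketch, assuming SQL Server 2016+ (the DMV below only returns rows once a columnstore index actually exists on the table):

```sql
-- Rough check of columnstore row group sizes on the fact table
-- (assumes SQL Server 2016+ and that a columnstore index already exists).
SELECT i.name           AS index_name,
       rg.row_group_id,
       rg.state_desc,
       rg.total_rows,
       rg.size_in_bytes
FROM sys.dm_db_column_store_row_group_physical_stats AS rg
JOIN sys.indexes AS i
  ON i.object_id = rg.object_id
 AND i.index_id  = rg.index_id
WHERE rg.object_id = OBJECT_ID(N'dbo.SomeFactTable');
```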

Again, I'm no expert, but I'd recommend testing a columnstore index on just the (NumberA) column versus one on all of the columns you might group on, i.e. (Timestamp, Description, NumberA).
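
A minimal sketch of the two candidates follows. The index names are made up, and since Description lives on the lookup tables rather than the fact table, the wider variant substitutes the FK columns in its place. Note that SQL Server allows only one columnstore index per table, so drop one candidate before creating the other when testing:

```sql
-- Candidate 1: columnstore index on NumberA only (index names are made up).
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SomeFactTable_NumberA
    ON dbo.SomeFactTable (NumberA);

-- Candidate 2: columnstore index on the columns the queries group/join on.
-- Description is not on the fact table, so the FK columns stand in for it.
-- Only one columnstore index is allowed per table, so drop Candidate 1 first:
-- DROP INDEX NCCI_SomeFactTable_NumberA ON dbo.SomeFactTable;
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SomeFactTable_GroupBy
    ON dbo.SomeFactTable ([Timestamp], LookupTableA_ID, LookupTableB_ID, NumberA);
```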