Sql-server – Where should I put nvarchar(max) dimensions in the data warehouse

columnstoredata-warehousedatabase-designsql serversql-server-2016

I am designing the ERD for a new data warehouse.

  • SQL Server 2016
  • Clustered Columnstore Indexes for the "fact" tables
  • Loose star schema

I put "fact" in quotes and say "loose star schema" because, with clustered columnstore, I can put many dimensions right into the "fact" table without the typical concerns over row width. Many of my dimensions are going into the "fact" tables. I am creating some dimension tables and surrogate keys, but only if the dim has attributes in addition to the dim description itself.

This brings me to some very wide, high cardinality, fields that are nvarchar(max). Without going too in-depth, think of these fields as denormalized lists. I need the list denormalized for the grain of one of my data sources. I do have it normalized in another fact table, but not one I am surfacing in this data source.

Users need these fields to search for keywords in the data mart I am surfacing. In my current design they are in the clustered columnstore "fact" table. Users will frequently query the fact tables without touching the nvarchar(max) fields.

Is there a place to put wide dimensions into my data warehouse that is more correct than the clustered columnstore table?

Joe Obbish informed me that we cannot currently put nvarchar(max) into CCIs. Would it be the best practice for me to create a table of LOBs as an extension of my "fact" table?

We may add other languages in the future. Currently and for the foreseeable term, the nvarchar column contains just English.

Best Answer

Microsoft recommends using a CCI for large data warehouse tables with a few caveats including:

Don't use a clustered columnstore index when:

  • The table requires varchar(max), nvarchar(max), or varbinary(max) data types. Or, design the columnstore index so that it doesn't include these columns.

Simply put, your options are to forgo columnstore entirely, create a nonclustered columnstore index on the table that doesn't include the VARCHAR(MAX) column, or to move the LOB columns to a separate table. You said that end users will sometimes access the tables without querying the VARCHAR(MAX) columns so I would try to use columnstore when possible so those queries can get the full benefit.

If I was designing this my first attempt would be to test your workload with nonclustered columnstore indexes that include every column except the VARCHAR(MAX) ones. That is a separate index so you will incur extra storage for the columns, but if you see the typical CCI compression ratio it will only be 10% extra. It's the simplest design and will put you in a good position to take advantage of SQL Server 2017's availability to include VARCHAR(MAX) columns in clustered columnstore indexes. Niko Neugebauer wrote a blog post about some testing on CCIs with LOB data in SQL Server vNext which you can find here.