How to start indexing a new data warehouse

Tags: data-warehouse, index, index-tuning, performance, performance-tuning, sp-blitzindex

I am building a data warehouse (SQL Server 2012, Kimball dimensional modeling, 750 GB) that will supply data to reports (SSRS), cubes (SSAS) and various data feeds.

I plan to use an index tuning tool (probably sp_BlitzIndex), but how do I start? Do I start off with no indexes and then let sp_BlitzIndex tell me what indexes to use after a while? Or do I start with the obvious indexes and build on that with sp_BlitzIndex?

I was planning on adding a clustered index on every primary key and a non-clustered index on every foreign key -- and then let sp_BlitzIndex suggest the less obvious choices. But maybe I should leave it all up to sp_BlitzIndex.

Best Answer

Your plan of putting a clustered index on every primary key is a great start.

Ideally, your clustered indexes (to start with) should be built on an INT IDENTITY column serving as the DW's surrogate key, to:

  1. keep the clustered index as narrow as possible,
  2. and ensure that the Dimension and Fact tables are not heaps.

If, and only if, you expect a table to surpass 2,147,483,647 rows (the max INT value), use BIGINT IDENTITY instead of INT. (See Numeric Data Types).
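
For illustration, here's a minimal sketch of a dimension table built this way (the table and column names are made up, not from the question):

```sql
-- Hypothetical dimension table: the surrogate key is an INT IDENTITY,
-- and the PRIMARY KEY constraint creates a narrow clustered index on it,
-- so the table is not a heap.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey         INT IDENTITY(1,1) NOT NULL,
    CustomerBusinessKey NVARCHAR(20)      NOT NULL,  -- natural key from the source system
    CustomerName        NVARCHAR(100)     NOT NULL,
    CONSTRAINT PK_DimCustomer PRIMARY KEY CLUSTERED (CustomerKey)
);
```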

Think carefully before putting a non-clustered index (NCI) on EVERY foreign key. It might not be necessary, and may do more harm than good. Instead, create non-clustered indexes only on foreign keys that are actually used in JOINs or WHERE clauses in your queries. And don't assume one NCI per FK: if a query joins on multiple FKs, consider a single composite NCI to support those joins (see the sketch after the linked post below). Experiment with the order of the index's key columns to see which works best. TEST TEST TEST.

EDIT 9/26/2016 Here's a blog post by Kendra Little discussing indexes on Foreign Keys: http://www.littlekendra.com/2016/09/22/indexing-foreign-keys-guidelines/
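
As a sketch of the composite-NCI idea (again with made-up names): if a frequent report query joins a fact table on both its date and customer keys, one two-column index can support both predicates:

```sql
-- Hypothetical fact table with two dimension foreign keys.
-- Instead of one NCI per FK, a single composite NCI supports a query
-- that filters/joins on both keys; test different key-column orders.
CREATE NONCLUSTERED INDEX IX_FactSales_DateKey_CustomerKey
    ON dbo.FactSales (DateKey, CustomerKey);
```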

Heaps are bad for SELECT / UPDATE / DELETE performance (i.e. SSRS and SSAS queries), especially with any kind of JOIN or WHERE clause, because every query against a heap is a Table Scan -- there's no such thing as a Heap Seek. Fragmentation is also a common problem with HEAP tables.

Heaps are great for INSERT performance. Heaps are generally known to be the fastest for INSERTing new data, because SQL Server has no index b-tree to update and no sort order to maintain during the INSERT. This is most useful for STAGING tables. (See What are valid usage scenarios for heap tables). The degree to which HEAPS out-perform CLUSTERED INDEXES while INSERTing new data will vary.

EDIT: 9/7/2016 Here's a blog post by Daniel Hutmacher discussing additional use cases for heaps: https://sqlsunday.com/2016/09/01/compelling-case-for-heaps/
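
A small sketch of the staging-table case (the schema and column names are assumptions, not from the question):

```sql
-- Hypothetical staging table: no PRIMARY KEY and no indexes, so it is a heap.
-- Bulk INSERTs avoid all b-tree maintenance; the data is indexed later,
-- after it moves into the dimension/fact tables.
CREATE TABLE staging.SalesLoad
(
    SourceSaleID INT           NOT NULL,
    SaleDate     DATE          NOT NULL,
    Amount       DECIMAL(18,2) NOT NULL
);
```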

Be aware that clustered indexes (and non-clustered indexes) will reduce the performance of your (assumed) nightly loads, because any data change (INSERT, UPDATE, DELETE, MERGE) to a column an index uses must also be written to that index, which causes extra I/O. The hit might be small or it might be big -- it depends on the size of the table, how much data is being changed, and other factors. A "small" dimension table would probably be fine keeping its Clustered Index and Non-Clustered indexes in place during the ETL. A "large" dimension table might benefit from the disable/rebuild method below. TEST TEST TEST. (See Performance of Non Clustered Indexes on Heaps vs Clustered Indexes).

One method (not necessarily a recommendation, but an option) to combat this performance hit when loading a (large) table with existing indexes is to:

  1. disable or drop the table's indexes as part of your SSIS ETL jobs before loading the data, to optimize INSERT performance,
  2. then rebuild or re-create the indexes after the data load completes.

If you try this route, test and track performance to confirm this method is genuinely faster than loading the data into the table with its indexes in place -- it might not be faster. TEST TEST TEST.
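
Here's a minimal sketch of the pattern, reusing the hypothetical fact-table index from above. Note that only non-clustered indexes should be disabled this way:

```sql
-- Run BEFORE the load: disabling a non-clustered index stops SQL Server
-- from maintaining it during INSERTs. (Do NOT disable the clustered
-- index -- that makes the whole table inaccessible.)
ALTER INDEX IX_FactSales_DateKey_CustomerKey ON dbo.FactSales DISABLE;

-- ... SSIS data load runs here ...

-- Run AFTER the load: REBUILD re-enables the index and recreates its b-tree.
ALTER INDEX IX_FactSales_DateKey_CustomerKey ON dbo.FactSales REBUILD;
```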

While performance tuning queries, take measurements by running SET STATISTICS IO, TIME ON; before your SELECT queries. Focus on reducing "logical reads" with your non-clustered indexes. Document and track logical reads before and after each index change; keep a history. If you are successful, you can show these numbers to your team and/or boss.
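
A minimal example of taking those measurements (the query and object names reuse the hypothetical schema from the earlier sketches):

```sql
SET STATISTICS IO, TIME ON;

-- Hypothetical report query. The Messages tab now shows logical reads
-- per table plus CPU and elapsed time. Record the numbers, add the
-- candidate index, rerun, and compare.
SELECT d.CustomerName, SUM(f.Amount) AS TotalAmount
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS d ON d.CustomerKey = f.CustomerKey
WHERE f.DateKey >= 20160101
GROUP BY d.CustomerName;

SET STATISTICS IO, TIME OFF;
```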

Your data warehouse will feed SSRS reports and SSAS cube processing runs, so create non-clustered indexes that support and tune those queries (perhaps the queries SSRS and SSAS use are views or stored procedures?). In other words, tune your SSRS/SSAS queries via non-clustered indexes. You'll want the query code behind your SSRS report views or procedures to be static/stabilized (i.e. no code changes) for some time -- say, 1 week of steady executions -- so the plan cache reflects the real workload, and then proceed with tuning. You could do it sooner, but you might not get the same results.
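
One option (a suggestion of mine, not something from the tools mentioned above) for finding the most read-heavy statements once the workload has stabilized is a plan-cache DMV query like this sketch:

```sql
-- Top cached statements by total logical reads; good candidates for
-- index tuning. Plan cache contents are transient, so run this after
-- the workload has been steady for a while.
SELECT TOP (20)
       qs.total_logical_reads,
       qs.execution_count,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
           ((CASE qs.statement_end_offset
                 WHEN -1 THEN DATALENGTH(st.text)
                 ELSE qs.statement_end_offset
             END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;
```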

For optimizing SSAS, use Profiler (or Extended Events, if you prefer) during your SSAS cube processing runs to capture the SELECT queries the cube issues when pulling data from your DW -- they might not be what you expect. These are the queries to tune for improving SSAS cube processing time. Don't forget to turn Profiler off when done, as it does add some overhead.
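
If you go the Extended Events route, a sketch of a session that captures those batches might look like this (the session name and the app-name filter are assumptions -- check what application name your SSAS data source actually reports):

```sql
-- Capture T-SQL batches sent to the DW while the cube is processing,
-- filtered on the client application name, written to an event file.
CREATE EVENT SESSION CaptureSsasProcessing ON SERVER
ADD EVENT sqlserver.sql_batch_completed
    (ACTION (sqlserver.client_app_name, sqlserver.database_name)
     WHERE (sqlserver.like_i_sql_unicode_string(
               sqlserver.client_app_name, N'%Analysis Services%')))
ADD TARGET package0.event_file (SET filename = N'CaptureSsasProcessing');

ALTER EVENT SESSION CaptureSsasProcessing ON SERVER STATE = START;
-- ... process the cube, review the captured queries, then stop the session:
ALTER EVENT SESSION CaptureSsasProcessing ON SERVER STATE = STOP;
```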

The bonus of index tuning your DW is that you are also tuning your SSRS reports and SSAS cube processing runs at the same time.

Index performance tuning is an ongoing practice; you're never done. You might need to revisit your index strategy when:

  • New tables get created / joined on
  • New WHERE clauses or JOINs are added to the views / stored procedures
  • New columns get added to existing tables
  • Data volume changes (i.e. now there are 1 million, or 10 million, or 100 million more rows than when you first index tuned)

Index performance tuning can be challenging and tricky, yet highly rewarding when you get a big performance win. I encourage you to search for free webinars dedicated to index strategies. Learn about the different types of non-clustered indexes (e.g. "covering" indexes, filtered indexes, columnstore indexes, etc.), how and when they are useful, and when they might do more harm than good.
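
For a taste, here are sketches of two of those variants, again with made-up names (IsActive is a hypothetical flag column):

```sql
-- A "covering" index: INCLUDE carries CustomerKey and Amount in the leaf
-- pages so the query can be answered from the index alone, without key
-- lookups into the clustered index.
CREATE NONCLUSTERED INDEX IX_FactSales_DateKey_Covering
    ON dbo.FactSales (DateKey)
    INCLUDE (CustomerKey, Amount);

-- A filtered index: only rows matching the predicate are indexed, keeping
-- the index small when queries always target a known subset of rows.
CREATE NONCLUSTERED INDEX IX_DimCustomer_Active
    ON dbo.DimCustomer (CustomerName)
    WHERE IsActive = 1;
```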

When you're ready to step up from free webinars, consider buying books and/or paid training to further improve your indexing skills -- maybe your employer will pay for them. Since you mentioned the Brent Ozar team's sp_BlitzIndex: Kendra Little has done great in-person paid classes (which I've attended) and recorded webinars on index performance tuning. Plenty of other great MVPs and SQL pros offer index training too.

Good luck, I hope this was helpful and insightful.