SQL Server: ETL Stage Index

data-warehouse, etl, index, sql-server, ssis

We have a very large staging table (> 80 GB). We load invoice data from our source system into the staging table, then transform it and load it into the DWH fact table. Every day we delete the current month from the stage and reload it from the source. The stage holds the complete history over time.

Some DW loads need only the current month; others need the current year and the previous year.

What is a better index strategy:

  1. Clustered index on a date column (Fiscal Period)
  2. Primary key with IDENTITY as surrogate key
  3. Clustered index for natural key (some kind of line item e.g. invoice number)

All queries filter on the date column (Fiscal Period), and some also filter on additional columns such as Invoice type, which have non-clustered indexes. During the ETL we can disable the non-clustered indexes but not the clustered index.

Which of the three types has the best performance for:

  • Insert into Stage table
  • Query the Stage table

Best Answer

If it is possible for you to use partitioning, I would highly recommend it in this situation, since you know you are working with periodic data loads that fall into months or years. With partitioning, you can reserve your indexes for the other columns you need, such as InvoiceNo, and let partitioning take care of isolating the month or period you are working with.
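As a rough sketch of what that could look like (all object names are hypothetical, and Fiscal Period is assumed to be stored as an int in YYYYMM form, which the question does not state): partition the stage by fiscal period, cluster on the invoice number, and empty only the current month's partition before the daily reload. Partition-level TRUNCATE requires SQL Server 2016 or later, and table partitioning needs Enterprise Edition before SQL Server 2016 SP1.

```sql
-- Hypothetical names; FiscalPeriod assumed to be int in YYYYMM form.
-- RANGE RIGHT: each boundary value is the first value of its partition.
CREATE PARTITION FUNCTION pfFiscalPeriod (int)
AS RANGE RIGHT FOR VALUES (202401, 202402, 202403);

-- For simplicity every partition lives on the PRIMARY filegroup.
CREATE PARTITION SCHEME psFiscalPeriod
AS PARTITION pfFiscalPeriod ALL TO ([PRIMARY]);

CREATE TABLE dbo.StageInvoicePartitioned
(
    FiscalPeriod int           NOT NULL,
    InvoiceNo    varchar(20)   NOT NULL,
    InvoiceType  char(2)       NOT NULL,
    Amount       decimal(18,2) NOT NULL
) ON psFiscalPeriod (FiscalPeriod);

-- Clustered index on the natural key, aligned with the partition scheme.
CREATE CLUSTERED INDEX CIX_StageInvoicePartitioned
ON dbo.StageInvoicePartitioned (InvoiceNo, FiscalPeriod)
ON psFiscalPeriod (FiscalPeriod);

-- Look up which partition holds the month to be reloaded.
-- With the boundaries above, 202403 falls in partition 4.
SELECT $PARTITION.pfFiscalPeriod(202403) AS PartitionNumber;

-- Daily reload: empty only that partition instead of running a large DELETE.
TRUNCATE TABLE dbo.StageInvoicePartitioned
WITH (PARTITIONS (4));
```

Queries that filter on FiscalPeriod can then benefit from partition elimination, and a new boundary would have to be added each month (for example with ALTER PARTITION FUNCTION ... SPLIT RANGE) to keep one month per partition.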

If you are fairly sure that the date is an integral part of your joins or searches, then I would associate a "smartkey" with your loading strategy and place the clustered index on that instead.

If you are truly determined to use the date for index seeks, then I would place a non-clustered index on this field, as you'll likely have to scan rows in your use case anyway. Keep in mind that creating a clustered index on the date and then incurring updates or deletes will fragment your table.
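A minimal sketch of the non-clustered option, again with hypothetical table, column, and index names and an assumed int YYYYMM Fiscal Period. It also shows the disable and rebuild pattern the question mentions for the ETL window. Whether INCLUDE (InvoiceType) helps depends on the actual queries, so treat it as an assumption.

```sql
-- Hypothetical staging table, shown here only so the sketch is self-contained.
CREATE TABLE dbo.StageInvoice
(
    FiscalPeriod int           NOT NULL,  -- assumed YYYYMM
    InvoiceNo    varchar(20)   NOT NULL,
    InvoiceType  char(2)       NOT NULL,
    Amount       decimal(18,2) NOT NULL
);

-- Non-clustered index on the date column; InvoiceType is carried along
-- so that filters on both columns can be answered from the index.
CREATE NONCLUSTERED INDEX IX_StageInvoice_FiscalPeriod
ON dbo.StageInvoice (FiscalPeriod)
INCLUDE (InvoiceType);

-- Before the load: disable the index so inserts do not have to maintain it.
ALTER INDEX IX_StageInvoice_FiscalPeriod ON dbo.StageInvoice DISABLE;

-- ... delete the current month and reload it from the source here ...

-- After the load: rebuilding re-enables the index and removes fragmentation.
ALTER INDEX IX_StageInvoice_FiscalPeriod ON dbo.StageInvoice REBUILD;
```

Note that while the index is disabled, queries against the stage cannot use it, so the rebuild should finish before the DW loads start reading.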