Sql-server – Is it advisable to add indexes to a data warehouse’s staging area

data-warehouseindexsql-server-2012

We have a data warehouse built in MSSQL 2012, Staging area, EDW, cubes and reports.

In the staging area, the tables are a 1:1 copy of those supporting the systems used to fill the DWH with. The staging area tables are TRUNC'ed and refilled daily (nightly). Then views on those tables provides input for the ETL process towards the EDW.

Besides providing a copy for the EDW to work with, these staging tables are also queried using MSSQL Management Studio by some of our analists, as not to put pressure on the source systems with their queries.

For both the ETL towards EDW as for the analist's queries, I get the idea that having an index on (some of) these tables might aid performance (greatly). Is it wise to have indexes on (large) tables that get TRUNC'ed daily. If not, why? And what kind of index (Clustered vs Non-clusterd vs someting else entirely)?

Best Answer

We have a similar situation as the one you described, except that these copied tables are only used for ETL (by select queries). I'll describe how we approached it, but you have to evaluate if it works for your environment (see Jon Seigel's comment at your question).

The concept is straightforward:

  • We index these "copy" tables in the staging area to support queries as much as possible. Most of indexes (especially on the large tables which get hit by scans) are columnstore indexes (we're on 2012 as well), some are covering nonclustered indexes. As I said, these tables are (for the time being) only used for select queries in the ETL, so we can make sure every query is supported either by a columnstore index or a covering nonclustered index.
  • Before the data pump (copy) from source systems into the staging area, we drop the indexes, to make bulk loads as fast as possible and to ensure we're touching source systems for the minimal amount of time. After the data pump is complete, we recreate them - the logic is set up so that the recreation is done "dynamically" (as in, indexes are not hardcoded - every index that gets dropped is then recreated exactly as it was), which allows us to change indexing without having to maintain the drop/create code.

This approach works for us, drop/recreate is faster than loading into indexed tables; well in the case of a columnstore index you don't have many other options anyway apart from partitioning & using partition switch, but we're not doing that for these tables (EDW fact tables are a different story).