SQL Server – What options are there for rebuilding indexes on large tables in a data warehouse?

data-warehouse, index, sql-server-2008-r2, standard-edition

I have a bulk loading process that loads millions of records into a few fact tables within a warehouse. Those tables are clustered primarily on time. The non-clustered indexes are there to speed up the queries that consume the data.

I typically drop some of the non-clustered indexes both to speed up the inserts and to avoid index fragmentation after large data loads. However, this drop-and-rebuild cycle is taking longer and longer as the data grows.
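
For reference, the cycle looks roughly like this; a minimal sketch with hypothetical table and index names, written with ALTER INDEX ... DISABLE / REBUILD, which behaves like a drop and re-create but keeps the index definition in place:

    -- Disable the non-clustered index before the load. Unlike DROP INDEX,
    -- this keeps the index metadata, so no CREATE script is needed afterwards.
    ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales DISABLE;

    -- ... bulk load runs here ...

    -- REBUILD re-enables the index and leaves it freshly defragmented.
    ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales REBUILD;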

Example: one table took 2 hours to build a new non-clustered index on 100M+ rows.

Likewise, if I leave the non-clustered indexes in place, they slow the inserts by 3x to 10x in some cases, which forces my hand to drop and rebuild anyway.

Dropping and rebuilding indexes works well enough today, but it doesn't scale as the data in those tables grows. What options are available to me? Should I bulk up the server with more memory (currently 32 GB) and CPU (4 vCPUs)? Should I rethink my indexes? Should I keep some indexes in place and reorganize them instead of dropping and rebuilding?
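
For reference, a sketch of how fragmentation could drive that reorganize-versus-rebuild choice, using sys.dm_db_index_physical_stats; the 5%/30% cutoffs below are the commonly cited rules of thumb, not numbers measured on this system:

    -- List fragmentation per index in the current database and suggest an action.
    SELECT  OBJECT_NAME(ips.object_id)          AS table_name,
            i.name                              AS index_name,
            ips.avg_fragmentation_in_percent,
            CASE WHEN ips.avg_fragmentation_in_percent >= 30 THEN 'REBUILD'
                 WHEN ips.avg_fragmentation_in_percent >= 5  THEN 'REORGANIZE'
                 ELSE 'leave alone'
            END                                 AS suggested_action
    FROM    sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN    sys.indexes AS i
            ON i.object_id = ips.object_id AND i.index_id = ips.index_id
    WHERE   ips.index_id > 0;   -- skip heaps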

(Note: I don't have Enterprise Edition.)

I'm thinking my only option is Enterprise Edition with table partitioning, where I could rebuild indexes per partition instead of across the whole table.
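
Something like this is what I have in mind; a sketch, with a hypothetical index, table, and partition number:

    -- Enterprise Edition: rebuild only the partition the load touched,
    -- rather than the whole 100M+ row index. Note that in SQL Server 2008 R2
    -- a single-partition rebuild is an offline operation (ONLINE = ON for a
    -- single partition was only added in SQL Server 2014).
    ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales
        REBUILD PARTITION = 42;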

Best Answer

At 4 vCPUs you're at the smallest Enterprise Edition licensing size, so the cost isn't prohibitive. You will need to keep up to date with Software Assurance if you want HA on your VM.

I would recommend adding memory (assuming it's available in the hypervisor) and taking this VM to the maximum that edition of SQL Server can support (64 GB for 2008 R2 Standard). That's probably your lowest-cost option.
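
If you do add RAM, remember to raise SQL Server's own cap to match; a sketch, where the 58 GB figure assumes a 64 GB box and the usual rule of thumb of leaving a few GB of headroom for the OS:

    -- 'max server memory' is an advanced option, so expose it first.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Value is in MB; 59392 MB = 58 GB, leaving ~6 GB for the OS.
    EXEC sp_configure 'max server memory (MB)', 59392;
    RECONFIGURE;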

If that doesn't give you what you want or need, then realistically you are looking at Enterprise Edition and table partitioning to manage the data that isn't being manipulated by the ETL.

If the majority of the data isn't being changed and just gets reloaded time after time, then that needs to change, as it's not going to be sustainable. Only load the data that needs to be reloaded.
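
With partitioning in place, that incremental pattern usually means loading only the current period into a staging table and switching it in as a metadata-only operation; a sketch, with all object names and the partition number hypothetical:

    -- Enterprise Edition: the staging table must match the fact table's schema
    -- and indexes, sit on the target partition's filegroup, and carry a CHECK
    -- constraint matching that partition's boundary. The switch itself moves
    -- no data, so it is nearly instant regardless of row count.
    ALTER TABLE dbo.FactSales_Staging
        SWITCH TO dbo.FactSales PARTITION 42;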