SQL Server – Index optimization for a table that is truncated every day

clustered-index | nonclustered-index | optimization | query-performance | sql-server

I have a bunch of tables in SQL Server 2012.

These tables are fed by other tables. Every day, routines truncate (fully empty) them and run queries whose results are inserted into them. They are like static views, storing the processed data so that it can be queried more quickly.

There are also source tables, which receive millions of new records every day; these are never deleted.

All these tables have a window during which they receive heavy inserts; for the rest of the day they are only read.

Even with these 'static view' tables, I think I can improve overall performance. Is there a best practice for handling this kind of workload?

For the first situation, I've run a few benchmarks and found that the best performance comes from nonclustered indexes on the most heavily filtered columns. I drop these indexes before truncating and recreate them after the inserts finish. Dropping and recreating is faster than rebuilding, and the inserts themselves are faster too.
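A minimal sketch of that drop/truncate/load/recreate cycle; the table, index, and column names (dbo.Benchmark, IX_Benchmark_FilterCol, FilterCol, dbo.SourceTable) are placeholders, not names from the question:

IF EXISTS (SELECT 1 FROM sys.indexes
           WHERE object_id = OBJECT_ID(N'dbo.Benchmark')
             AND name = N'IX_Benchmark_FilterCol')
    DROP INDEX IX_Benchmark_FilterCol ON dbo.Benchmark;

TRUNCATE TABLE dbo.Benchmark;

INSERT INTO dbo.Benchmark (FilterCol, Payload)
SELECT FilterCol, Payload
FROM dbo.SourceTable;              -- the daily refresh query

CREATE NONCLUSTERED INDEX IX_Benchmark_FilterCol
    ON dbo.Benchmark (FilterCol);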

Should I keep the PKs (which have no semantic meaning) and the clustered indexes too? I suppose the inserts will be even faster with no indexes at all, but will I run into any issues if these tables have no PK? The keys are never referenced explicitly.

I ask this because when I create the table and set its PK with

idBenchmark int NOT NULL IDENTITY (1,1) PRIMARY KEY

the clustered index is created automatically with a random, system-generated name, and therefore I can't script its drop and recreation.
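If the table has to keep its inline PK, one workaround is to look the constraint name up at run time and drop it dynamically; a sketch, assuming a hypothetical table dbo.Benchmark:

DECLARE @pk sysname, @sql nvarchar(max);

SELECT @pk = kc.name
FROM sys.key_constraints AS kc
WHERE kc.parent_object_id = OBJECT_ID(N'dbo.Benchmark')
  AND kc.type = 'PK';

SET @sql = N'ALTER TABLE dbo.Benchmark DROP CONSTRAINT ' + QUOTENAME(@pk) + N';';
EXEC sys.sp_executesql @sql;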

Best Answer

Go ahead and drop the clustering key while you're importing data. When you've finished your INSERTs, create the clustering key first, then the PK if it's non-clustered, then any remaining indices. I'm running such scripts at this very moment, and it takes about half as long as inserting into a table which is fully indexed.
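In script form, that ordering might look like the following; the key, index, and column names are illustrative only:

-- 1. Clustering key first, so later builds read the final row layout
CREATE CLUSTERED INDEX CIX_Benchmark ON dbo.Benchmark (LoadDate);

-- 2. Then the PK, non-clustered in this example
ALTER TABLE dbo.Benchmark
    ADD CONSTRAINT PK_Benchmark PRIMARY KEY NONCLUSTERED (idBenchmark);

-- 3. Then any remaining indexes
CREATE NONCLUSTERED INDEX IX_Benchmark_FilterCol ON dbo.Benchmark (FilterCol);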

There's no problem in going without a clustering key while loading data. I would recommend that you import your data in the same order that it will be clustered, if possible; this will reduce the need to shuffle the data around when it's clustered. If you're using an arbitrary IDENTITY column, I suggest you reconsider; there may be a better candidate for clustering (or you may not even need a clustering key).
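If the feed can be pre-sorted, the usual way to express that is an ORDER BY on the insert itself (LoadDate here is a hypothetical future clustering key); as the answer notes, this reduces the shuffling needed when the clustered index is built afterwards:

INSERT INTO dbo.Benchmark (LoadDate, FilterCol, Payload)
SELECT LoadDate, FilterCol, Payload
FROM dbo.SourceTable
ORDER BY LoadDate;    -- feed rows in what will become clustered order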

There's no problem in going without a PK while loading data. It's most important in maintaining referential integrity and in giving your indices a narrow target to hit; neither applies when you're bulk-loading data, assuming that you trust your data to not contain duplicates.

Duplicates are not always the devil either, and it may be faster to remove them after bulk load than to build processes at the front end that de-dupe in other ways.
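A common set-based de-dupe after the load looks like this; it assumes, for illustration, that FilterCol plus Payload defines a duplicate (substitute your real duplicate key):

WITH Dupes AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY FilterCol, Payload
                              ORDER BY (SELECT NULL)) AS rn
    FROM dbo.Benchmark
)
DELETE FROM Dupes
WHERE rn > 1;         -- keep one arbitrary row per duplicate group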

My preference is to pull in data in all its ugly rawness first, often into a heap table, then do SQL cleanup before copying it to a new table. Maybe that's just because I'm a SQL guy and everything looks like a nail to me, but SQL is optimized for set-based operations. On the other hand, if the volume of data is huge, you may need to do conversions and cleanup on an RBAR (row-by-agonizing-row) basis as you import it via SSIS or whatnot.
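As a sketch of that raw-heap-then-clean flow; the staging table and the cleanup expressions are invented for illustration:

-- A heap with no indexes or constraints: the fastest target for a raw load
CREATE TABLE dbo.Staging_Raw
(
    FilterCol varchar(50)  NULL,   -- everything nullable, everything loose
    Payload   varchar(500) NULL
);

-- ... bulk load into dbo.Staging_Raw here (BULK INSERT, SSIS, etc.) ...

-- Set-based cleanup on the way into the real table
INSERT INTO dbo.Benchmark (FilterCol, Payload)
SELECT TRY_CONVERT(int, LTRIM(RTRIM(FilterCol))),
       NULLIF(LTRIM(RTRIM(Payload)), '')
FROM dbo.Staging_Raw
WHERE TRY_CONVERT(int, LTRIM(RTRIM(FilterCol))) IS NOT NULL;  -- discard unparseable rows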

I would also recommend explicitly defining your PK (if any) with an ALTER TABLE statement after your CREATE TABLE. This reminds you to make an explicit choice (including whether to cluster on the PK), puts it next to your other index declarations, and lets you give it a non-random name.
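Concretely, for the table from the question, that would look something like this (the constraint name PK_Benchmark is just an example):

CREATE TABLE dbo.Benchmark
(
    idBenchmark int NOT NULL IDENTITY (1, 1),
    FilterCol   int NOT NULL
    -- ...other columns...
);

-- Explicit name, explicit (non)clustering choice, next to the other index DDL
ALTER TABLE dbo.Benchmark
    ADD CONSTRAINT PK_Benchmark
    PRIMARY KEY NONCLUSTERED (idBenchmark);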