SQL Server – Clustered index in ETL (drop – insert – create?)

clustered-index, index, index-tuning, sql-server, t-sql

I have a table with 130 million rows. Every day I extract data from a source system and insert it into this table. The extract is incremental, i.e. I only get the new data from the source. The table is a heap with one nonclustered index.

So I am thinking about creating a clustered index. I have read that loading data into a clustered table causes page splits, and that it is better to drop the index and recreate it after the insert. Some also seem to say it is better to disable the index and then rebuild it.

I am not sure, but I don't think the above advice is solid. Even when I insert into a heap, new pages have to be allocated. What makes page splits so bad on a clustered index?

I think perhaps it is better to just insert into the clustered index. It might be slower than inserting into a heap, but afterwards joins and calculations will benefit from the clustered index.

Is the above advice about dropping and recreating the clustered index a myth?

I would say the overall goal is to improve everything: joins and calculations on the table after the new inserts, possibly maintaining the table and deleting the oldest rows in another job, and avoiding heap fragmentation. So I can live with inserts becoming slower.

The "inserts" are actually from a "data flow tap" from a ODBC data source in SSIS and they do not happen in a bulk (I think).

Thanks

Best Answer

The perfmon counter [\SQLServer:Databases(*)\Bulk Copy Rows/sec] can help determine whether the SSIS package is performing bulk inserts. You want bulk inserts, especially if the underlying table is ever converted to a clustered columnstore.
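
If setting up a perfmon trace is inconvenient, that counter can also be sampled from T-SQL. A minimal sketch, assuming the target database is named YourEtlDatabase (a placeholder): the value exposed by the DMV is cumulative, so take a sample before and after an ETL run; if it doesn't move, the load path probably isn't using bulk insert.

    SELECT instance_name, cntr_value
    FROM   sys.dm_os_performance_counters
    WHERE  object_name LIKE '%:Databases%'        -- SQLServer:Databases or MSSQL$<instance>:Databases
      AND  counter_name = 'Bulk Copy Rows/sec'    -- cumulative count; diff two samples for a rate
      AND  instance_name = 'YourEtlDatabase';     -- placeholder database name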

In your ETL, I'm assuming that each run involves X+Y rows, where X is the number of rows that are updates and thus are deleted from the ~130 million row target table, and Y is the number of net new rows. A delete is performed on the target table and then all X+Y rows are inserted.

(It's also possible that this ETL simply inserts Y new rows with each iteration and deletes X old rows past some cutoff point - e.g. deletes all rows more than 30 days old. That makes some of this designing/planning easier, especially if the ~130 million row target table stays fairly stable in size over time rather than consistently growing.)

It's very important to know whether this schema and process is to be optimized for the ETL, optimized for the non-ETL workloads, or balanced between them.

For some system and workload combinations, ETL completes within an acceptable amount of time just about however it's done. In such a case, optimization can focus almost completely on the non-ETL workloads. The maxims then are "write the data how it will be read" and "keep it simple". No need to worry about the page-split behavior (and transaction log ramifications) of a clustered index vs a heap, or about the fact that a clustered index insert will most likely always be a serialization point in the ETL. It all comes down to this: if there is a key - either a single column or a combination - that is meaningful for the reporting queries and is queried together with a significant number of the table columns, it can be considered a candidate for the clustered index key.

Fragmentation of both the clustered index (if adopted) and the nonclustered index can be managed according to the same considerations. (But index maintenance is usually more of a consideration for non-ETL workloads than for ETL workloads.) I usually focus on index reorganizations rather than rebuilds, because what I really care about is keeping the pages pretty full. One reason I like looking at it from that perspective: it is much less expensive to keep track of over time. I can just watch how many MB each million rows occupies on average, before and after a reorg. That can be measured fairly inexpensively, compared to using sys.dm_db_index_physical_stats. There are two inherent advantages of a rebuild over a reorg: a rebuild automatically includes a fullscan stats update, and a rebuild can be performed in parallel while a reorg is always serial.
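
For the "MB per million rows" check, something like the following works (a sketch only; dbo.FactTable is a placeholder for your table). Run it before and after a reorg and watch the last column drift:

    SELECT  i.name                                   AS index_name,   -- the heap shows up with NULL here
            SUM(ps.used_page_count) * 8 / 1024.0     AS used_mb,      -- 8 KB pages -> MB
            SUM(ps.row_count)                        AS row_count,
            (SUM(ps.used_page_count) * 8 / 1024.0)
              / NULLIF(SUM(ps.row_count) / 1000000.0, 0) AS mb_per_million_rows
    FROM    sys.dm_db_partition_stats AS ps
    JOIN    sys.indexes AS i
            ON i.object_id = ps.object_id AND i.index_id = ps.index_id
    WHERE   ps.object_id = OBJECT_ID(N'dbo.FactTable')
    GROUP BY i.name;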

Now, what if ETL performance optimization takes precedence over everything else? Then you don't want a non-clustered index on the heap, unless it's used to make the deletes faster by making a more efficient plan available. It's possible that the added cost of maintaining that non-clustered index during the ETL inserts is a significant drag on ETL performance, even if it speeds up the deletes. In that case, consider creating the non-clustered index before the deletes, and dropping it immediately after the deletes. That way it never has to be maintained with REORG or REBUILD. The deletes get the benefit while the inserts don't have to worry about the cost. But that has to be measured in order to know whether it's beneficial. And the measurement needs to be revisited as the table grows to 2x, 3x its size, etc., or if the number of rows deleted in each ETL changes significantly.
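
A rough sketch of that pattern (all object and column names below are hypothetical placeholders, and the join condition stands in for however your ETL identifies rows to remove):

    CREATE NONCLUSTERED INDEX IX_FactTable_SourceKey
        ON dbo.FactTable (SourceKey);                    -- gives the delete a seek-based plan

    DELETE ft
    FROM   dbo.FactTable AS ft
    INNER JOIN dbo.Staging_ChangedRows AS s
        ON s.SourceKey = ft.SourceKey;                   -- remove the rows being replaced

    DROP INDEX IX_FactTable_SourceKey ON dbo.FactTable;  -- the inserts never pay to maintain it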

If ETL performance takes precedence over everything else, you probably don't want a clustered index on that table, either. Here's one of the performance pitfalls of heavy insert activity into a clustered index: if the insert is estimated at more than 100 rows, the rows are going to be sorted before the insert. Depending on the size of the insert, the maximum memory grant, and the grant that insert query receives, that may mean a tempdb spill - and a big increase in elapsed time for the insert.

All right. Chances are, this system and workload hint toward performance optimizations needed for both ETL and non-ETL workloads. Maybe the non-clustered index is used both for the ETL deletes and by reporting queries? In that case, consider dropping it before the inserts, and creating it after the inserts.
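
That variant might look something like this (again, table, index, and column names are hypothetical; the TABLOCK hint on the insert is there because, assuming the other preconditions are met, it helps a heap load qualify as a bulk/minimally logged operation):

    DROP INDEX IX_FactTable_Report ON dbo.FactTable;

    INSERT INTO dbo.FactTable WITH (TABLOCK)             -- heap + TABLOCK can qualify for bulk load
        (SourceKey, LoadDate, Amount)
    SELECT SourceKey, LoadDate, Amount
    FROM   dbo.Staging_NewRows;

    CREATE NONCLUSTERED INDEX IX_FactTable_Report
        ON dbo.FactTable (SourceKey, LoadDate);          -- built once, after the load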

This blog post from Paul White is one I think about a lot. He describes wide, per-index plans and narrow, per-row plans here.

"Optimizing T-SQL queries that change data", 26 January 2013: https://www.sql.kiwi/2013/01/optimizing-t-sql-queries-that-change-data.html

Maybe keeping the non-clustered index in place for both the deletes and the inserts is the best choice for your needs. Maybe creating it before the deletes and dropping it after the deletes. Maybe dropping it before the deletes and creating it after the inserts. It depends on which query plans actually use the index, and on the relative cost of creating the index versus keeping it around for the activity.

Similarly, maybe a clustered index will be beneficial to the deletes and to the reporting queries (although if the clustered index is meaningful to the deletes, the non-clustered index is less likely to be). It will be a net cost to the inserts, though, compared to a heap.

And... yeah, I agree that, in general, creating a clustered index and then dropping it in order to manage fragmentation/average page fullness in the heap is not something to do on a regular basis. I'm much more in favor of making sure TABLOCK/TABLOCKX hints are used as needed in delete statements, so that a table-level lock is taken and pages emptied by the delete can be deallocated from the heap.
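
A minimal sketch of that hint in a cutoff-based delete (table and column names are placeholders):

    DELETE FROM dbo.FactTable WITH (TABLOCK)             -- table lock lets emptied pages be deallocated
    WHERE  LoadDate < DATEADD(DAY, -30, SYSDATETIME());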

If a clustered index will be created and dropped in order to manage fragmentation of a heap as a once-in-a-while, when-really-needed thing, I recommend taking such an action at a time when the NCI(s) can be dropped before the clustered index is created, and then created again after the clustered index is dropped. Creating a clustered index on a heap with NCIs will rebuild the NCI(s), and dropping the clustered index will rebuild the NCI(s) again. Dropping the NCI(s) beforehand and creating them again at the end reduces that rigamarole by about half.
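
Spelled out, that once-in-a-while sequence looks roughly like this (hypothetical names throughout):

    DROP INDEX IX_FactTable_Report ON dbo.FactTable;          -- drop NCI(s) first

    CREATE CLUSTERED INDEX CX_FactTable
        ON dbo.FactTable (LoadDate);                           -- compacts the former heap
    DROP INDEX CX_FactTable ON dbo.FactTable;                  -- back to a heap

    CREATE NONCLUSTERED INDEX IX_FactTable_Report
        ON dbo.FactTable (SourceKey, LoadDate);                -- recreate NCI(s) once, at the end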

Hopefully, rather than creating a clustered index just to drop it again, any once-in-a-while intervention needed on the heap can take the form of ALTER TABLE ... REBUILD. For that operation, too, I recommend dropping NCI(s) beforehand and creating NCI(s) afterward. I like having individual control over the NCI rebuilds rather than having them take place automatically due to an action taken on the underlying heap.
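
That heap rebuild is a one-liner (placeholder table name; note that any NCIs still in place at that point get rebuilt along with it, which is exactly why I'd rather drop them first):

    ALTER TABLE dbo.FactTable REBUILD;                         -- defragments the heap in place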

One important variable to keep in mind when evaluating these options: the recovery model of the database in question. If transaction log operations are a concern, either for performance or for transaction log space, and the database is in the simple recovery model, engaging minimally logged operations wherever possible may be a big gain. If the database is in the full recovery model and the txlog is a concern, page splits in a clustered index (or in the NCI, for that matter) become a bigger deal. Under the full recovery model, the benefit of SORT_IN_TEMPDB for index create or rebuild operations increases (unless tempdb is more of a sore spot for the instance than the txlog is for the database).
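
For completeness, a sketch of pushing an index build's sort into tempdb (placeholder names; only worth it if tempdb has the headroom):

    CREATE NONCLUSTERED INDEX IX_FactTable_Report
        ON dbo.FactTable (SourceKey, LoadDate)
        WITH (SORT_IN_TEMPDB = ON);                            -- sort runs in tempdb, not the user database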