SQL Server – Truncate and insert: load a heap and build the index, or insert directly into the clustered table?

bulk-insert, etl, index, sql-server, t-sql

In ETL, is it better to drop the index before inserting millions of rows and recreate it right afterwards, or to simply insert into the empty table with the index in place?

I know that I can test and measure it (I have not done that yet), but I want to understand the reason, that is, which is more expensive: sorting and inserting into a clustered index, or creating the index afterwards?

I have kept my index in place, and when I insert I see a sort at the end of the execution plan. The clustered index insert operator right before the root node is also quite expensive (basically all the cost is divided between the sort and the clustered index insert operators).

I use TABLOCK for my insert, the recovery model is SIMPLE, and the table is a rowstore.
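For reference, a minimal sketch of the two variants being weighed here; the table, column, and index names (dbo.FactSales, staging.FactSales, CIX_FactSales) are hypothetical:

    -- Variant A: insert straight into the empty clustered table.
    TRUNCATE TABLE dbo.FactSales;

    INSERT INTO dbo.FactSales WITH (TABLOCK) (SaleId, SaleDate, Amount)
    SELECT SaleId, SaleDate, Amount
    FROM   staging.FactSales;          -- plan shows Sort + Clustered Index Insert

    -- Variant B: drop the clustered index, load the heap, then build the index.
    DROP INDEX CIX_FactSales ON dbo.FactSales;
    TRUNCATE TABLE dbo.FactSales;

    INSERT INTO dbo.FactSales WITH (TABLOCK) (SaleId, SaleDate, Amount)
    SELECT SaleId, SaleDate, Amount
    FROM   staging.FactSales;          -- heap insert, no sort

    CREATE CLUSTERED INDEX CIX_FactSales ON dbo.FactSales (SaleId);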

Best Answer

I would keep the clustered index in place, especially if the data is being inserted in large lumps (rather than lots of individual inserts).

You should drop the non-clustered indexes if rebuilding the data from scratch.
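A hedged sketch of that pattern, using the same hypothetical names as above. Disabling is used here instead of dropping so the index definition does not have to be re-scripted; dropping and re-creating the non-clustered index achieves the same thing. The clustered index stays in place throughout:

    -- Disable the non-clustered index before the reload (hypothetical index name).
    ALTER INDEX IX_FactSales_SaleDate ON dbo.FactSales DISABLE;

    TRUNCATE TABLE dbo.FactSales;

    INSERT INTO dbo.FactSales WITH (TABLOCK) (SaleId, SaleDate, Amount)
    SELECT SaleId, SaleDate, Amount
    FROM   staging.FactSales;

    -- REBUILD re-enables the index and repopulates it in one pass.
    ALTER INDEX IX_FactSales_SaleDate ON dbo.FactSales REBUILD;
    -- Or rebuild everything on the table at once:
    -- ALTER INDEX ALL ON dbo.FactSales REBUILD;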

NOTE: As per your mention of truncating, this answer is talking about rebuilding a table from scratch.
Considerations would be different if you were adding millions of rows to a table that already contained billions.

the clustered index insert operator right before the root node is quite expensive

It will be, as that is the step that writes all the data to permanent storage. You'll get a similarly expensive step with a heap.

(basically all the cost is divided between the sort and the clustered index insert operators).

This is expected. If you drop the index so that you have a heap, then when you re-add the clustered index it will have to reread the data, perform the same sort, and rewrite the pages in the new order; so it will be at least as expensive as, and probably more expensive than, the initial insert with the clustered index in place.
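That later step looks like this (hypothetical names): a full read of the heap, the same sort, and a rewrite of every page in key order. SORT_IN_TEMPDB only moves the sort work to tempdb, it does not remove it:

    CREATE CLUSTERED INDEX CIX_FactSales ON dbo.FactSales (SaleId)
        WITH (SORT_IN_TEMPDB = ON);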

I know that I can test it and measure it

This is a very good point!

Don't just measure the data insert though: remember that creating the index on an already populated table will be expensive too, so it is "unfair" to compare just "insert into heap" against "insert with CI"; the fair comparison is "insert into heap + build CI" against "insert with CI". Also, the heap route will need more space in the relevant data file, as there will be two copies of the data while the index is being built (the heap, and the newly forming clustered index that replaces it when complete).

Try both in fresh, otherwise empty test DBs (and therefore near-zero-length data files) to see the different file-growth effects too.
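One rough way to compare the file growth, assuming two hypothetical test databases named LoadTest_CI and LoadTest_Heap:

    -- File sizes after each test run (size is reported in 8 KB pages).
    SELECT DB_NAME(database_id) AS database_name,
           name                 AS logical_file_name,
           type_desc,
           size * 8 / 1024      AS size_mb
    FROM   sys.master_files
    WHERE  DB_NAME(database_id) IN (N'LoadTest_CI', N'LoadTest_Heap');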

but I want to understand the reason

I suggest trying it, including the index rebuilds, and looking at the work done as displayed in the query plans and the IO statistics.
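A minimal way to capture that work per variant (same hypothetical names as above); run with the actual execution plan enabled in SSMS and compare the STATISTICS IO/TIME output for each load:

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- Run the load variant being measured, e.g.:
    INSERT INTO dbo.FactSales WITH (TABLOCK) (SaleId, SaleDate, Amount)
    SELECT SaleId, SaleDate, Amount
    FROM   staging.FactSales;

    -- For the heap variant, also measure the index build:
    -- CREATE CLUSTERED INDEX CIX_FactSales ON dbo.FactSales (SaleId);

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;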