Sql-server – Temporary Tables With Nonclustered Indexes Including All Columns

nonclustered-indexsql serversql server 2014temporary-tables

Background: Once per day several temporary tables that calculate a variety of different aggregated values are created. They all contain the same unique identifier (the field I would make a PRIMARY KEY on). Each table has around 100k rows with only 2–8 columns. Several JOINs and UNION ALLs are then done on the temporary tables. Every field in all of the temporary tables are used. Furthermore, most of the time every row is also used—worst case scenario, half of the rows is used somewhere and the other half is used somewhere else. Once the query is finished, the results are saved on the disk so that individuals can access the data for the rest of the day.

Question: Which of the following approaches should be the best fastest:

• Have no indexes on the temporary tables.

• Have a clustered index on all of the temporary tables (via the PRIMARY KEY declaration on the unique identifier).

• Have a nonclustered index on the unique identifier while including the remaining columns on all of the temporary tables.

• The last two bullet points together.

Thoughts: While running all four options simultaneously, each one had a query cost of 25% (relative to the batch); however, when the first bullet point is run, the execution plan (on the SELECT query) states that I should create a nonclustered index on the unique identifier while including the remaining columns for each of the temporary tables.

I'm a little perplexed by this. If I effectively use every field and row in all of the temporary tables, why would it suggest this? Wouldn't a heap or a clustered index be better?

To me, a heap would minimize the overhead in creating an index and sorting the data; and since I essentially need every row, there is no harm in doing a table scan.

A clustered index, on the other hand, should improve the temporary table that is created via a JOIN on two of the other temporary tables as well as improve the final SELECT query which relies on JOINs and UNION ALLs.

Related Note: On a large table with 400 columns and 70M rows—don't ask me why we have an absurd amount of columns—a query that only grabbed the PRIMARY KEY was ≈50 times faster when utilizing a non-relevant nonclustered index over the clustered index.

If anyone has any insight, I'd be grateful.

Best Answer

To summarize (and include some information from the comments), you have a process that runs once per day early in the morning that populates several 100k row temp tables with 2 - 8 columns each. This is a broad question but my initial reaction would be to create primary keys on all of the tables. If performance is good enough then I would be satisfied. If performance isn't good enough I would investigate further to find ways to improve the code. Removing the primary keys would technically be an option but in general I would expect to find performance gains by changing the queries that populate the temp tables.

A heap is a table without a clustered index. Note that a primary key does not need to be defined the same way as a clustered index. In fact, you can define a heap with a primary key. As far as I know this isn't a very common thing to do. There are a few advantages in using heaps for your workload:

  1. Inserting the data will not lead to a sort. If you insert 100k rows into a table with a clustered index it is possible that the data will need to be sorted depending on the query that populates the table. All of the columns or just the clustering key could be included in the sort.
  2. As of SQL Server 2014 the insert can run in parallel if you use the SELECT INTO syntax. Note that the query optimizer may choose not to use parallel inserts depending for a variety of reasons including the estimated size of the data, and MAXDOP settings.

In general, I would not worry about a sort of 100k rows, especially if this process runs early in the morning with nothing else. Parallel inserts are unlikely to be a big help as well for such a small amount of data.

There are a few disadvantages of using heaps for this workload:

  1. No data integrity. There might be a bug in your process or corrupt data which could have been caught earlier with a primary key on the temp tables.
  2. The data is not sorted, so any query plan operator which requires sorted data (such as a merge join operator or inserting into a clustered table) will require an explicit sort. Let's say you reference one table in three queries and all three times the query optimizer sorts the data. Why not sort the data on disk at the start to avoid those three sorts in the query plans?

Hard to say more on these points because I don't know anything about your process. It's possible that your queries against the temp tables could benefit from a MERGE JOIN but I doubt it will make a big difference at your volume.

Now let's talk about tables with a clustered index. As stated before, a clustered index does not need to match the primary key but I'll assume the most common default which is they are both the same. That is what you end up with when you define a PRIMARY KEY inline as part of the table definition.

There are a few advantages of tables with clustered indexes for your workload:

  1. Data integrity!
  2. Query plan operators benefit from a sort may already have the data sorted in the correct order. For example, this makes merge joins more attractive, especially if you are getting all of the data from the involved tables.
  3. The query optimizer may be able to pick a more efficient plan if it knows that certain columns are unique. One example of this is SQL Server may know that the joins are not many to many which can lead to more accurate cardinality estimates and costing of join operators.

There are a few disadvantages of tables with clustered indexes for your workload:

  1. Parallel inserts into the temp tables are not available in any version of SQL Server.
  2. Populating the tables may require an explicit sort.

The advantages and disadvantages are mostly the exact opposite of a heap table, which should not be surprising.

The third and four options aren't really worth considering. You can create an index that includes all columns but it will duplicate the data. Just define your primary key the right way and you should have no use for an additional nonclustered index that includes all columns. Note that SQL Server Management Studio will not recommend that you should create a clustered index on a table.

One difference between the clustered index/primary key and the nonclustered index is that the clustered index/primary will not allow NULL values in the key columns and automatically enforces a uniqueness constraint. Of course, it is possible for you to create a nonclustered index with the same restrictions.

Ultimately, all that you can do is test your workload with the different options. So we cannot definitely say which approach will be fastest. With one set of data and queries the heap approach may be fastest. With a different set of data and queries the clustered table approach may be fastest. Do not discount the value of data integrity, but you may have other ways of enforcing data integrity.