Background: Once per day several temporary tables that calculate a variety of different aggregated values are created. They all contain the same unique identifier (the field I would make a PRIMARY KEY on). Each table has around 100k rows with only 2–8 columns. Several JOINs and UNION ALLs are then done on the temporary tables. Every field in all of the temporary tables are used. Furthermore, most of the time every row is also used—worst case scenario, half of the rows is used somewhere and the other half is used somewhere else. Once the query is finished, the results are saved on the disk so that individuals can access the data for the rest of the day.
Question: Which of the following approaches should be the best fastest:
• Have no indexes on the temporary tables.
• Have a clustered index on all of the temporary tables (via the PRIMARY KEY declaration on the unique identifier).
• Have a nonclustered index on the unique identifier while including the remaining columns on all of the temporary tables.
• The last two bullet points together.
Thoughts: While running all four options simultaneously, each one had a query cost of 25% (relative to the batch); however, when the first bullet point is run, the execution plan (on the SELECT query) states that I should create a nonclustered index on the unique identifier while including the remaining columns for each of the temporary tables.
I'm a little perplexed by this. If I effectively use every field and row in all of the temporary tables, why would it suggest this? Wouldn't a heap or a clustered index be better?
To me, a heap would minimize the overhead in creating an index and sorting the data; and since I essentially need every row, there is no harm in doing a table scan.
A clustered index, on the other hand, should improve the temporary table that is created via a JOIN on two of the other temporary tables as well as improve the final SELECT query which relies on JOINs and UNION ALLs.
Related Note: On a large table with 400 columns and 70M rows—don't ask me why we have an absurd amount of columns—a query that only grabbed the PRIMARY KEY was ≈50 times faster when utilizing a non-relevant nonclustered index over the clustered index.
If anyone has any insight, I'd be grateful.
Best Answer
To summarize (and include some information from the comments), you have a process that runs once per day early in the morning that populates several 100k row temp tables with 2 - 8 columns each. This is a broad question but my initial reaction would be to create primary keys on all of the tables. If performance is good enough then I would be satisfied. If performance isn't good enough I would investigate further to find ways to improve the code. Removing the primary keys would technically be an option but in general I would expect to find performance gains by changing the queries that populate the temp tables.
A heap is a table without a clustered index. Note that a primary key does not need to be defined the same way as a clustered index. In fact, you can define a heap with a primary key. As far as I know this isn't a very common thing to do. There are a few advantages in using heaps for your workload:
SELECT INTO
syntax. Note that the query optimizer may choose not to use parallel inserts depending for a variety of reasons including the estimated size of the data, andMAXDOP
settings.In general, I would not worry about a sort of 100k rows, especially if this process runs early in the morning with nothing else. Parallel inserts are unlikely to be a big help as well for such a small amount of data.
There are a few disadvantages of using heaps for this workload:
Hard to say more on these points because I don't know anything about your process. It's possible that your queries against the temp tables could benefit from a
MERGE JOIN
but I doubt it will make a big difference at your volume.Now let's talk about tables with a clustered index. As stated before, a clustered index does not need to match the primary key but I'll assume the most common default which is they are both the same. That is what you end up with when you define a
PRIMARY KEY
inline as part of the table definition.There are a few advantages of tables with clustered indexes for your workload:
There are a few disadvantages of tables with clustered indexes for your workload:
The advantages and disadvantages are mostly the exact opposite of a heap table, which should not be surprising.
The third and four options aren't really worth considering. You can create an index that includes all columns but it will duplicate the data. Just define your primary key the right way and you should have no use for an additional nonclustered index that includes all columns. Note that SQL Server Management Studio will not recommend that you should create a clustered index on a table.
One difference between the clustered index/primary key and the nonclustered index is that the clustered index/primary will not allow NULL values in the key columns and automatically enforces a uniqueness constraint. Of course, it is possible for you to create a nonclustered index with the same restrictions.
Ultimately, all that you can do is test your workload with the different options. So we cannot definitely say which approach will be fastest. With one set of data and queries the heap approach may be fastest. With a different set of data and queries the clustered table approach may be fastest. Do not discount the value of data integrity, but you may have other ways of enforcing data integrity.