Sql-server – Save performance with large update on Index with Included Column

clustered-indexindexindex-tuningperformancesql server

Points
------------------
PK  QuestionId  int (+30.000.000 distinct values)
PK  EventId     int (large batches where 80.000 rows have the same EventId)
    Value       smallint

The table is about 40 million rows and has performance issues.

There are two main queries:

On QuestionId:

  • about 30 million different QuestionId values (lot of variation)
  • a lot of queries during busy times (multiple thousands per minute)

On EventId:

  • there will be batch updates of +150.000 rows where EventId=X to set the Value=NULL during very busy moments.

So my first thought to get the best performance was that I make EventId,QuestionId the ClusteredIndex so the batch update can easily find all of EventId close to eachother and directly update the values.

My second thought was adding an Index with QuestionId and included column Value so it can read the value directly from the Index (EventId is irrelevant in this case).

But then I thought: would the Clustered Index even matter? As the Included column Value from the Index also needs be updated during the batch update.

  • While not trading off on query performance – is it possible to get the batch update fast (few seconds) or do I have to accept this process will always be slow without upgrading hardware.
  • Any other thoughts what would be the best way to set the ClusteredIndex / Indexes?

I know in theory I should test everything out and measure it, but the site is live and used heavily.

I'm a solo developer and I don't have the resources to hire someone. Any estimated guesses and thoughts about this would be really helpful as this gives me already the right direction!

Best Answer

So if your primary access path is by question, then the unique clustered index that makes the most sense will be (QuestionId, EventId).

Adding a second index to EventId may not be useful as the index might not be selective enough and the query engine will decide it's just faster to read the entire table instead of doing a lot of work to read a large portion of it.

Alternately, if you always query based wholly or in part on EventId, the clustered index of (EventId,Questionid) is more suitable, and has the added benefit of making your updates based on EventId require less I/O to complete.

I would not include Value to an additional index, as this will essentially copy the entire table (just clustered on different columns) and your updates will take even longer as Value must be kept in sync across both the clustered index and the additional index.

At a certain point there is no free lunch, and the correct solution is probably choosing the clustered index with the leading column that supports the most use cases, then adding RAM/CPU/faster storage to deal with cases where the entire table (or large chunks of it) must be read or written to. With 40 million rows and such a narrow table I can't imagine this being something that more RAM wouldn't fix.

Depending on your edition of SQL Server, you can also look into if page compression would reduce the table size significantly, as this would reduce the number of reads/writes to disk (the additional CPU overhead is offset by fewer disk operations). My guess in your case is it would not, but it's work looking at.

So if I understand correctly (with the current knowledge) your gut feeling would be Clustered Index on QuestionId,EventId and then a normal Index on EventId for the batch update?

Only if the primary use is to return specific QuestionIds without regard to EventId. You can try the additional index in EventId, but you may find it isn't being used often (or at all) for updates (or updates still take longer than you'd like), depending on how EventIds are distributed through your data with respect to QuestionId.

You also have to determine what is more important to you overall - select performance or update performance. If the update is the pain point, (EventId,QuestionId) will undoubtedly be the better choice. Given the number of unique values for QuestionId, adding an index on that column could be useful for SELECT performance, but that's going to depend on how QuestionId is distributed and how many you are searching for at a time.

Keeping statistics up to date will be crucial in either case.

An extremely simple example (for the sake of completeness):

Let's say we have a DBMS that maintains a clustered index and stores 4 rows per page. We have a table with a primary key of (QuestionId, EventId) and one additional column, Value.

If we create the clustered index as (QuestionId, EventId), the data in our imaginary DBMS is (roughly speaking) stored like so:

Page | QuestionId | EventId | Value
-----------------------------------
A    | 1          | 2       | ...
A    | 1          | 3       | ...
A    | 1          | 6       | ...
A    | 1          | 7       | ...
B    | 1          | 8       | ...
B    | 1          | 10      | ...
B    | 1          | 11      | ...
B    | 2          | 2       | ...
C    | 3          | 2       | ...
C    | 4          | 1       | ...
C    | 5          | 6       | ...
C    | 5          | 7       | ...
D    | 6          | 1       | ...
D    | 7          | 2       | ...
D    | 7          | 6       | ...
D    | 7          | 8       | ...

So if I need to perform an operation based on QuestionId, the engine will not have to read more pages than necessary.

However, if I need to perform an operation based on EventId, I will have to read the entire table (clustered index scan), unless I add an additional index, which will look like this (and require four pages):

EventId | QuestionId
--------------------
1       | 4
1       | 6
2       | 1
2       | 2
2       | 3
2       | 7
3       | 1
6       | 1
6       | 5
6       | 7
7       | 1
7       | 5
8       | 1
8       | 7
10      | 1
11      | 1

This index be selective for some EventIds, but in the extreme case (EventId = 2) the whole table would still need to be read, and for some cases (EventId = 6) our optimizer might decide that searching the index and reading the table is more expensive than just reading the entire table.

If we instead cluster on EventId, QuestionId our table looks like this:

Page | EventId | QuestionId | Value
-----------------------------------
A    | 1        | 4         | ...
A    | 1        | 6         | ...
A    | 2        | 1         | ...
A    | 2        | 2         | ...
B    | 2        | 3         | ...
B    | 2        | 7         | ...
B    | 3        | 1         | ...
B    | 6        | 1         | ...
C    | 6        | 5         | ...
C    | 6        | 7         | ...
C    | 7        | 1         | ...
C    | 7        | 5         | ...
D    | 8        | 1         | ...
D    | 8        | 7         | ...
D    | 10       | 1         | ...
D    | 11       | 1         | ...

Any operation based on EventId will only read the necessary portions of the table, and like our first instance, any operation based on QuestionId will require a scan without an additional index. If we create an index on QuestionId, the index will be:

QuestionId | EventId
--------------------
1          | 2
1          | 3
1          | 6
1          | 7
1          | 8
1          | 10
1          | 11
2          | 2
3          | 2
4          | 1
5          | 6
5          | 7
6          | 1
7          | 2
7          | 6
7          | 8

So as in the first example, this index will be more useful for some questions and less useful for others. For QuestionId = 1 the optimizer will probably say the cost of reading half the index then looking up half the table isn't worth the cost and will just read the entire table instead of utilizing the index.

If we include Value on the index, we now have to change the table AND the index within the same transaction. Best case, this doubles the work for any operation. Worst case this requires the entire table or index (which is just a copy of the table) to be read and possibly locked.

Now it's possible with your actual data that adding an additional index on QuestionId or EventId will provide a lot of benefit. But it's not going to solve every problem, and it might not be worth the overhead imposed on inserts/updates/deletes.