Sql-server – Save performance with large update on Index with Included Column

clustered-indexindexindex-tuningperformancesql server

Points
------------------
PK  QuestionId  int (+30.000.000 distinct values)
PK  EventId     int (large batches where 80.000 rows have the same EventId)
    Value       smallint

The table is about 40 million rows and has performance issues.

There are two main queries:

On QuestionId:

about 30 million different QuestionId values (lot of variation)
a lot of queries during busy times (multiple thousands per minute)

On EventId:

there will be batch updates of +150.000 rows where EventId=X to set the Value=NULL during very busy moments.

So my first thought to get the best performance was that I make EventId,QuestionId the ClusteredIndex so the batch update can easily find all of EventId close to eachother and directly update the values.

My second thought was adding an Index with QuestionId and included column Value so it can read the value directly from the Index (EventId is irrelevant in this case).

But then I thought: would the Clustered Index even matter? As the Included column Value from the Index also needs be updated during the batch update.

While not trading off on query performance – is it possible to get the batch update fast (few seconds) or do I have to accept this process will always be slow without upgrading hardware.
Any other thoughts what would be the best way to set the ClusteredIndex / Indexes?

I know in theory I should test everything out and measure it, but the site is live and used heavily.

I'm a solo developer and I don't have the resources to hire someone. Any estimated guesses and thoughts about this would be really helpful as this gives me already the right direction!

Best Answer

So if your primary access path is by question, then the unique clustered index that makes the most sense will be (QuestionId, EventId).

Adding a second index to EventId may not be useful as the index might not be selective enough and the query engine will decide it's just faster to read the entire table instead of doing a lot of work to read a large portion of it.

Alternately, if you always query based wholly or in part on EventId, the clustered index of (EventId,Questionid) is more suitable, and has the added benefit of making your updates based on EventId require less I/O to complete.

I would not include Value to an additional index, as this will essentially copy the entire table (just clustered on different columns) and your updates will take even longer as Value must be kept in sync across both the clustered index and the additional index.

At a certain point there is no free lunch, and the correct solution is probably choosing the clustered index with the leading column that supports the most use cases, then adding RAM/CPU/faster storage to deal with cases where the entire table (or large chunks of it) must be read or written to. With 40 million rows and such a narrow table I can't imagine this being something that more RAM wouldn't fix.

Depending on your edition of SQL Server, you can also look into if page compression would reduce the table size significantly, as this would reduce the number of reads/writes to disk (the additional CPU overhead is offset by fewer disk operations). My guess in your case is it would not, but it's work looking at.

So if I understand correctly (with the current knowledge) your gut feeling would be Clustered Index on QuestionId,EventId and then a normal Index on EventId for the batch update?

Only if the primary use is to return specific QuestionIds without regard to EventId. You can try the additional index in EventId, but you may find it isn't being used often (or at all) for updates (or updates still take longer than you'd like), depending on how EventIds are distributed through your data with respect to QuestionId.

You also have to determine what is more important to you overall - select performance or update performance. If the update is the pain point, (EventId,QuestionId) will undoubtedly be the better choice. Given the number of unique values for QuestionId, adding an index on that column could be useful for SELECT performance, but that's going to depend on how QuestionId is distributed and how many you are searching for at a time.

Keeping statistics up to date will be crucial in either case.

An extremely simple example (for the sake of completeness):

Let's say we have a DBMS that maintains a clustered index and stores 4 rows per page. We have a table with a primary key of (QuestionId, EventId) and one additional column, Value.

If we create the clustered index as (QuestionId, EventId), the data in our imaginary DBMS is (roughly speaking) stored like so:

Page | QuestionId | EventId | Value
-----------------------------------
A    | 1          | 2       | ...
A    | 1          | 3       | ...
A    | 1          | 6       | ...
A    | 1          | 7       | ...
B    | 1          | 8       | ...
B    | 1          | 10      | ...
B    | 1          | 11      | ...
B    | 2          | 2       | ...
C    | 3          | 2       | ...
C    | 4          | 1       | ...
C    | 5          | 6       | ...
C    | 5          | 7       | ...
D    | 6          | 1       | ...
D    | 7          | 2       | ...
D    | 7          | 6       | ...
D    | 7          | 8       | ...

So if I need to perform an operation based on QuestionId, the engine will not have to read more pages than necessary.

However, if I need to perform an operation based on EventId, I will have to read the entire table (clustered index scan), unless I add an additional index, which will look like this (and require four pages):

EventId | QuestionId
--------------------
1       | 4
1       | 6
2       | 1
2       | 2
2       | 3
2       | 7
3       | 1
6       | 1
6       | 5
6       | 7
7       | 1
7       | 5
8       | 1
8       | 7
10      | 1
11      | 1

This index be selective for some EventIds, but in the extreme case (EventId = 2) the whole table would still need to be read, and for some cases (EventId = 6) our optimizer might decide that searching the index and reading the table is more expensive than just reading the entire table.

If we instead cluster on EventId, QuestionId our table looks like this:

Page | EventId | QuestionId | Value
-----------------------------------
A    | 1        | 4         | ...
A    | 1        | 6         | ...
A    | 2        | 1         | ...
A    | 2        | 2         | ...
B    | 2        | 3         | ...
B    | 2        | 7         | ...
B    | 3        | 1         | ...
B    | 6        | 1         | ...
C    | 6        | 5         | ...
C    | 6        | 7         | ...
C    | 7        | 1         | ...
C    | 7        | 5         | ...
D    | 8        | 1         | ...
D    | 8        | 7         | ...
D    | 10       | 1         | ...
D    | 11       | 1         | ...

Any operation based on EventId will only read the necessary portions of the table, and like our first instance, any operation based on QuestionId will require a scan without an additional index. If we create an index on QuestionId, the index will be:

QuestionId | EventId
--------------------
1          | 2
1          | 3
1          | 6
1          | 7
1          | 8
1          | 10
1          | 11
2          | 2
3          | 2
4          | 1
5          | 6
5          | 7
6          | 1
7          | 2
7          | 6
7          | 8

So as in the first example, this index will be more useful for some questions and less useful for others. For QuestionId = 1 the optimizer will probably say the cost of reading half the index then looking up half the table isn't worth the cost and will just read the entire table instead of utilizing the index.

If we include Value on the index, we now have to change the table AND the index within the same transaction. Best case, this doubles the work for any operation. Worst case this requires the entire table or index (which is just a copy of the table) to be read and possibly locked.

Now it's possible with your actual data that adding an additional index on QuestionId or EventId will provide a lot of benefit. But it's not going to solve every problem, and it might not be worth the overhead imposed on inserts/updates/deletes.

Answer

Since you refer to the website use-the-index-luke.com, consider the chapter:

Use The Index, Luke › The Where Clause › Searching For Ranges › Greater, Less and BETWEEN

It has an example that matches your situation perfectly (two-column index, one is tested for equality, the other for range), explains (with more of those nice index graphics) why @ypercube's advice is accurate and sums it up:

Rule of thumb: index for equality first — then for ranges.

Also good for just one column?

What to do for queries on just one column seems to be clear. More details and benchmarks concerning that under these related question:

Less selective column first?

Apart from that, what if you have only equality conditions for both columns?

It doesn't matter. Put the column first that is more likely to receive conditions of its own, which actually matters.

Consider this demo, or reproduce it yourself. I create a simple table of two columns with 100k rows. One with very few, the other one with lots of distinct values:

CREATE TEMP TABLE t AS
SELECT (random() * 10000)::int AS lots
     , (random() * 4)::int     AS few
FROM generate_series (1, 100000);

DELETE FROM t WHERE random() > 0.9;  -- create some dead tuples, more "real-life"

ANALYZE t;

SELECT count(distinct lots)   -- 9999
     , count(distinct few)    --    5
FROM   t;

Query:

SELECT *
FROM   t
WHERE  lots = 2345
AND    few = 2;

EXPLAIN ANALYZE output (Best of 10 to exclude caching effects):

Seq Scan on t  (cost=0.00..5840.84 rows=2 width=8)
               (actual time=5.646..15.535 rows=2 loops=1)
  Filter: ((lots = 2345) AND (few = 2))
  Buffers: local hit=443
Total runtime: 15.557 ms

Add index, retest:

CREATE INDEX t_lf_idx ON t(lots, few);

Index Scan using t_lf_idx on t  (cost=0.00..3.76 rows=2 width=8)
                                (actual time=0.008..0.011 rows=2 loops=1)
  Index Cond: ((lots = 2345) AND (few = 2))
  Buffers: local hit=4
Total runtime: 0.027 ms

Add other index, retest:

DROP INDEX t_lf_idx;
CREATE INDEX t_fl_idx  ON t(few, lots);

Index Scan using t_fl_idx on t  (cost=0.00..3.74 rows=2 width=8)
                                (actual time=0.007..0.011 rows=2 loops=1)
  Index Cond: ((few = 2) AND (lots = 2345))
  Buffers: local hit=4
Total runtime: 0.027 ms

Best Answer

An extremely simple example (for the sake of completeness):

Related Solutions

Sql-server – SQL Server 2012 – Frequent SELECT and UPDATE on bit column. Index doesn’t help

Postgresql – Multicolumn index and performance

Answer

Also good for just one column?

Less selective column first?

Related Question