SQL Server – Do I need separate indexes for each type of query, or will one multi-column index work?


I somewhat know the answer to this question already, but I always feel as though there is more I need to pick up on the topic.

My basic understanding is that generally speaking, a single index that just includes all the fields you might be querying/sorting on at any given time isn't likely to be useful, yet I have seen this type of thing. As in, someone thought, "Well, if we just put all this stuff in an index, the database can use it to find what it needs", without having ever seen an execution plan for some of the actual queries being run.

Imagine a table like so:

id int pk/uid
name varchar(50)
customerId int (foreign key)
dateCreated datetime

I might see a single index including the name, customerId and dateCreated fields.

But my understanding is that such an index would not be used in a query like, for example:

SELECT [id], [name], [customerId], [dateCreated]
   FROM Representatives WHERE customerId=1 
   ORDER BY dateCreated

For such a query, it seems to me that a better idea would be an index including the customerId and dateCreated fields, with the customerId field being 'first'. This would create an index that would have the data organized in such a way that this query could quickly find what it needs – in the order that it needs.

Another thing I see, perhaps as frequently as the first, is individual indexes on each field; so, one each on name, customerId and dateCreated fields.

Unlike the first example, this type of arrangement seems to me sometimes to at least be partially useful; the query's execution plan may show that at least it's using the index on the customerId to select the records, but it's not using the index with the dateCreated field to sort them.
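The two arrangements can be compared without a SQL Server instance. The sketch below uses Python's built-in sqlite3 module, which is an assumption made only so the example runs anywhere: SQLite's planner is not SQL Server's, and the index names are invented for illustration. It prints the plan for the question's query under three single-column indexes and then under one (customerId, dateCreated) index.

```python
import sqlite3

QUERY = """
SELECT id, name, customerId, dateCreated
  FROM Representatives
 WHERE customerId = 1
 ORDER BY dateCreated
"""

def plan_for(index_ddl):
    """Create an empty table with the given indexes; return the plan text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Representatives ("
        " id INTEGER PRIMARY KEY, name TEXT,"
        " customerId INTEGER, dateCreated TEXT)"
    )
    for ddl in index_ddl:
        conn.execute(ddl)
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    conn.close()
    # Column 3 of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return "\n".join(row[3] for row in rows)

# Arrangement 1: one index per field (hypothetical index names).
single_column = plan_for([
    "CREATE INDEX ix_name ON Representatives (name)",
    "CREATE INDEX ix_customer ON Representatives (customerId)",
    "CREATE INDEX ix_created ON Representatives (dateCreated)",
])

# Arrangement 2: one compound index with customerId first.
composite = plan_for([
    "CREATE INDEX ix_customer_created"
    " ON Representatives (customerId, dateCreated)",
])

print("single-column indexes:\n" + single_column)
print("compound index:\n" + composite)
```

With the single-column indexes, the plan should show a lookup on the customerId index followed by a separate sort step (SQLite reports it as a temp B-tree for ORDER BY). With the compound index, the sort step should disappear because the index already returns the rows in dateCreated order.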

I know this is a broad question, because the specific answer to any particular query on any particular set of tables is usually to see what the execution plan says it's going to do, and otherwise take the specifics of the table(s) and queries into account. Also, I know that it depends on how often a query might be run as opposed to the overhead of maintaining a particular index for it.

But I suppose what I'm asking is as a general 'starting point' for indexes, does the idea of having specific indexes for specific, frequently-pulled queries and the fields in the WHERE or ORDER BY clauses make sense?

Best Answer

You are right in that your example query would not use that index.

The query planner will consider using an index if:

all the fields contained in it are referenced in the query
some of the fields starting from the beginning are referenced

It will not be able to seek on an index that starts with a field the query does not filter on; at best it can scan the whole index.

So for your example:

SELECT [id], [name], [customerId], [dateCreated]
   FROM Representatives WHERE customerId=1 
   ORDER BY dateCreated

it would consider indexes such as:

[customerId]
[customerId], [dateCreated]
[customerId], [dateCreated], [name]

but not:

[name], [customerId], [dateCreated]
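The leading-column rule can be checked with a small experiment. This is a minimal sketch using Python's sqlite3 module rather than SQL Server (an assumption made for portability; the planners differ in detail, and ix_demo is a made-up name). It builds the same table twice, once with customerId leading the index and once with name leading it, and prints the plan for the example query each time.

```python
import sqlite3

QUERY = (
    "SELECT id, name, customerId, dateCreated "
    "FROM Representatives WHERE customerId = 1 ORDER BY dateCreated"
)

def explain(index_columns):
    """Build an empty table with one index and return the plan for QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Representatives ("
        " id INTEGER PRIMARY KEY, name TEXT,"
        " customerId INTEGER, dateCreated TEXT)"
    )
    conn.execute(f"CREATE INDEX ix_demo ON Representatives ({index_columns})")
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    conn.close()
    # Column 3 of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " | ".join(row[3] for row in rows)

leading_match = explain("customerId, dateCreated, name")
leading_mismatch = explain("name, customerId, dateCreated")

print("customerId first:", leading_match)
print("name first:      ", leading_mismatch)
```

In SQLite's wording, SEARCH means the index was used to jump straight to the matching rows, while SCAN means every entry was read. The index that leads with name can still be read end to end, but it cannot be used to locate customerId = 1 directly.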

If it found both [customerId] and [customerId], [dateCreated], [name], its choice between them would depend on the index statistics, which are estimates of how the data is distributed in those fields. If [customerId], [dateCreated] were defined, it should prefer that over the other two unless you give a specific index hint to the contrary.

In my experience it is also not uncommon to see one index defined for every field, though this is rarely optimal: the extra work needed to update the indexes on insert/update, and the extra space needed to store them, is wasted when half of them may never get used. Unless your DB sees write-heavy loads, though, performance is not going to suffer badly even with the excess indexes.

Specific indexes for frequent queries that would otherwise be slow due to table or index scanning are generally a good idea, though don't overdo it, as you could be exchanging one performance issue for another. If you do define [customerId], [dateCreated] as an index, for example, remember that the query planner can also use it for queries that would otherwise use an index on just [customerId]. Using an index on just [customerId] would be slightly more efficient than using the compound index, but that gain may be cancelled out by having two indexes competing for space in RAM instead of one (though if your entire normal working set fits easily into RAM, this extra memory competition may not be an issue).
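As a quick check of the point about reusing the compound index, here is a sketch in Python's sqlite3 module (an assumption: SQLite stands in for SQL Server, and the index name and sample rows are invented). The table has only the compound index, and the query filters on customerId alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Representatives ("
    " id INTEGER PRIMARY KEY, name TEXT,"
    " customerId INTEGER, dateCreated TEXT)"
)
# The only secondary index: the compound one discussed in the answer.
conn.execute(
    "CREATE INDEX ix_customer_created"
    " ON Representatives (customerId, dateCreated)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO Representatives (name, customerId, dateCreated)"
    " VALUES (?, ?, ?)",
    [
        ("Ann", 1, "2020-01-03"),
        ("Bob", 2, "2020-01-01"),
        ("Cy", 1, "2020-01-02"),
    ],
)

# A query that filters on customerId alone, with no ORDER BY.
sql = "SELECT name FROM Representatives WHERE customerId = 1"
plan_text = " | ".join(
    row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)
)
names = sorted(row[0] for row in conn.execute(sql))

print(plan_text)
print(names)
```

The plan should name ix_customer_created even though the query never mentions dateCreated, which is why a separate index on just customerId is often redundant.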

Related Solutions

Sql-server – Index included columns

So my question is, is there any difference in using the suggested index above, or what I think is a better alternative...

The missing-index suggestions made by the optimizer are opportunistic and relevant only to the particular query concerned. The optimizer goes through an index analysis phase, where it might note that a covering index it looked for does not exist. These suggestions are not intended to be a replacement for a full workload-representative DTA session, much less proper index design by a skilled database practitioner based on wide knowledge of the data and critical queries.

The suggestions should always be reviewed, as you have done, to ensure an optimal set of indexes for all queries is created - not one covering index per query as could be the case if the suggestions were followed literally.

There are naturally implications when widening the keys of an index compared with using INCLUDE columns, some of which have been noted by others. I personally prefer to INCLUDE the clustering keys explicitly where they are useful. Clustered indexes can be changed, and it is rare that the person performing this change would check to see if any queries were relying on the implicit behaviour.

Changing columns from INCLUDE to keys may also affect update query plans (overall shape and Halloween Protection requirements) and there are logging implications where keys of an index might change too.

I would probably choose to modify the suggestion as you have done, but I would be careful to validate update (= insert/update/delete/merge) query plans for the affected table.

Mysql – How to setup complex multi-column index for massive table

To answer your second question:

MySQL does not have a parallel query execution engine, so even if you partition the query, you are still single threaded. This will eventually kill your scale.

However, you could partition the table by visitor_id. This would allow you to run several queries (one per partition) in parallel, all of the form:

SELECT COUNT(DISTINCT visitor_id) 
FROM table WHERE location_id = # 
AND region_id = # 
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN <partition_start> and <partition_end>

The output of these parallel queries (which you could store in a temp table as they run) is trivially combinable into the final result by simply adding the distinct counts together.

This is very similar to sharding, but instead of doing it across machines, you are doing it on the same table. By picking a good hash function to generate visitor_id (for example, a modulo or bit reversal if the original id is generated with an AUTO_INCREMENT) you can ensure that all partitions are approximately equal sized.

The reason you want to partition by visitor_id and not one of the other columns is that it makes the DISTINCT additive across partitions. For example, consider a table with two partitions. One holds visitor_id 0-99 and the other holds 100-199. You can now express two queries that can run in parallel:

INSERT INTO TempResults (visitor_id)
SELECT COUNT(DISTINCT visitor_id) 
    FROM table WHERE location_id = # 
    AND region_id = # 
    AND action_id = # AND ts BETWEEN x AND y
    AND visitor_id BETWEEN 0 and 99

And this one in parallel:

INSERT INTO TempResults (visitor_id)
SELECT COUNT(DISTINCT visitor_id) 
    FROM table WHERE location_id = # 
    AND region_id = # 
    AND action_id = # AND ts BETWEEN x AND y
    AND visitor_id BETWEEN 100 and 199

Because you know the visitor_id is not overlapping between partitions, the final result is:

SELECT SUM(visitor_id) FROM TempResults

You would of course need to pick the partition boundaries in such a way that partitions have approximately the same size.
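To make the additivity argument concrete, here is a small sketch in plain Python (no database involved). The event rows are invented for illustration, and the other filters from the query are left out to keep it short. It compares the true distinct count with the sum of per-partition distinct counts, first partitioning by visitor_id range and then by location_id.

```python
# Hypothetical event rows: (visitor_id, location_id).
events = [(3, 1), (3, 2), (42, 1), (150, 1), (150, 2), (199, 2)]

# Ground truth: distinct visitors over the whole table.
true_count = len({visitor for visitor, _ in events})

# Partition by visitor_id range: each visitor falls in exactly one partition,
# so the per-partition distinct counts can simply be added.
ranges = [(0, 99), (100, 199)]
by_visitor = sum(
    len({visitor for visitor, _ in events if low <= visitor <= high})
    for low, high in ranges
)

# Partition by location_id: a visitor can appear in several partitions,
# so adding the per-partition distinct counts double-counts them.
locations = sorted({location for _, location in events})
by_location = sum(
    len({visitor for visitor, location in events if location == wanted})
    for wanted in locations
)

print(true_count, by_visitor, by_location)
```

Summing the visitor_id partitions reproduces the true count, while summing the location_id partitions overcounts every visitor seen in more than one location.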

I will let ypercube file the answer to the indexing question as this is the one that deserves the reward.
