SQL Server – Indexing for a table with many large subsets

Tags: index, sql-server, sql-server-2017

In SQL Server 2017 we have a table with millions of rows and a few dozen columns. One of the columns is a varchar(50) with a well-defined range of allowed values: the value is guaranteed to be one string from a list of about 4,000. The only queries we care to optimize against this table deal exclusively with rows that have just one of those values, but they need to retrieve all of the columns of those rows. Does it make sense to put an index on that column, and if so, which type of index should be used?

Example Schema:

+------+----------+---------------------+
| Id   | Category | ... 20 other columns|
+------+----------+---------------------+
| 1    | Food     |      .....          |
+------+----------+---------------------+
| 2    | Lumber   |      .....          |
+------+----------+---------------------+

Example Query: SELECT * FROM MyTable WHERE Category = 'Food';

So, in this example, the Category column contains one string from a list of about 4,000. I've considered a clustered index, but this column is not unique. I could add a nonclustered index, but the query calls for all columns to be returned, so it has to go from the index back to the base table (a key lookup) to retrieve the rest of the data anyway, right? Do we have to use an index that covers the full table, or is there a better option?
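For concreteness, a minimal sketch of the kind of schema being described (the table name dbo.Items and the Payload column are placeholders standing in for the real table and its ~20 other columns):

```sql
-- Hypothetical schema matching the example above.
-- By default, the PRIMARY KEY constraint here becomes the clustered index.
CREATE TABLE dbo.Items
(
    Id       int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Category varchar(50) NOT NULL,   -- one of ~4,000 known strings
    Payload  varchar(200) NULL       -- stand-in for the other columns
);
```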

Best Answer

Does it make sense to put an index on that column and if so which type of index should be used?

We can't answer that for you. It depends on whether query performance is good enough without the index. If end-user response time, transactions per second, overall server CPU, or whichever metric matters most to you is fine without the index, then you don't need it right now. You may need it in the future if the table grows. With all of that said, the current query and table structure won't give you the best performance. If an improvement is needed, then you have three main options:

  1. Create a nonclustered primary key on the id column and a clustered index on Category.

This will give you the best possible performance for the query listed in the question. The downside is that changing the clustered index may negatively impact other queries. Clustered indexes do not need to be unique, and you do not need a primary key on your table; I just proposed an example schema that has both (I assume that you have a primary key today).
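A sketch of option 1, using the same placeholder names as before (dbo.Items and the index names are illustrative, not from the question):

```sql
-- Nonclustered primary key on Id, non-unique clustered index on Category.
-- Rows with the same Category are then stored physically together, so
-- "WHERE Category = 'Food'" becomes a single clustered range scan.
CREATE TABLE dbo.Items
(
    Id       int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Items PRIMARY KEY NONCLUSTERED,
    Category varchar(50) NOT NULL,
    Payload  varchar(200) NULL
);

CREATE CLUSTERED INDEX CX_Items_Category
    ON dbo.Items (Category);
```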

  2. Add a nonclustered index on Category that includes all other columns.

This will also give you the best possible performance for the query listed in the question. The downside is that it will roughly double the disk space required for the table and will slow down DML operations against it.
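A sketch of option 2, again with placeholder names; in practice the INCLUDE list would name every remaining column of the real table:

```sql
-- Covering nonclustered index: the key is Category, and INCLUDE copies
-- the remaining columns into the index leaf level, so SELECT * can be
-- answered from the index alone with no key lookups.
CREATE NONCLUSTERED INDEX IX_Items_Category_Covering
    ON dbo.Items (Category)
    INCLUDE (Payload /* ..., every other non-key column ... */);
```

The doubled storage follows from this design: the leaf level of the index holds a second full copy of every row.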

  3. Add a nonclustered index on Category that doesn't include all other columns.

This won't give you the best possible performance, but it will improve performance as long as you're filtering on a value that's selective enough and SQL Server chooses to use the index. If you need to help the optimizer along, the techniques in this blog post by Distinguished Answerer Erik Darling may be helpful. There's still a DML impact with this option, but it should be significantly less than with option 2.
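A sketch of option 3 with the same placeholder names. The narrow index finds the matching rows, and a key lookup per row fetches the remaining columns; the FORCESEEK table hint shown is one real (if blunt) way to push the optimizer toward the seek-plus-lookup plan when it would otherwise scan:

```sql
-- Narrow nonclustered index on Category alone: cheap to maintain, but
-- every matching row costs an extra lookup into the base table.
CREATE NONCLUSTERED INDEX IX_Items_Category
    ON dbo.Items (Category);

-- If the optimizer estimates too many rows and refuses the seek for a
-- selective value, a hint can force it (use sparingly):
SELECT *
FROM dbo.Items WITH (FORCESEEK)
WHERE Category = 'Food';
```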

Ultimately, the only way to get a satisfactory answer is to test this yourself. We don't know how much importance you place on DML performance, total disk usage, the performance of other queries, etc. All of the options mentioned above have their own drawbacks. Good luck!