Sql-server – Indexing strategy for dynamic predicate

indexindex-tuningsql serversql-server-2012

Assume SQL Server 2012 Standard edition.

My database has a table with 500 million rows. The table has about a dozen columns, none of which are very wide (some varchar(100)'s and some ints).

The clustered index (also primary key) is an identity column.

The application using this table has a screen where the user can search on most of the columns. One search field on the screen, which is required, has the option to search starts with or contains, resulting in either

WHERE ABC LIKE 'something%' -- starts with

WHERE ABC LIKE '%something%' -- contains

The actual queries are parameterized, unlike my examples here.

The other search fields do a starts with search just like the first example above, but they are not required. So, any combination of these fields can be searched on resulting in a dynamic where clause.

Given this information, what indexes should be created for optimal performance?

Bearing in mind that I'm new to query performance tuning, my naive strategy for this scenario alone is to create a non-clustered index for each column and using full text search for the column that has the contains search option. I'd love to hear why or why not that's a bad idea and what a better approach would be.

Update

It's known to me that full text searching is how to optimize the case of a "contains" search.

I'm much more interested in the other aspect of the problem: how to optimize for the other search fields which may or may not be present in any given query predicate. The details surrounding the field that can benefit from a full text index are included in my question only to help paint a more complete picture of my particular situation.

Best Answer

If I were you I would run a trace specific to hits on that table. It shouldn't be overly intensive since you are restricting it to just queries against that table, from your application. Run just the minimum needed by the DTA (Database Tuning Advisor). Run it for a day here, a day there, make sure you get some end of week days and some end of month days. Then run the whole lot through the DTA.

Here is why, I'm willing to bet that you have specific combinations of columns that are going to come up more often than not. You can create more complex indexes based on that information. You might also find that you can create some correlated statistics. Basically statistics that have more than one column. For example creating a statistic on City and State together may improve queries against those two columns.

However make sure you don't create to many indexes. On a table that large I'm guessing you do a fair number of writes and every additional index added will slow them down. Of course you may do most of your writes during a batch process.

Also make sure that you put an automatic process to update your statistics periodically. With that many rows the statistics aren't going to update on their own very often. Only once 500+20% of the rows have changed, in at 500 mil rows that's a LOT.

Related Solutions

Mysql – index thesql concatenated columns

Two options I can think of. First, if you are on MyISAM or InnoDB 5.6+, you could store the concatenation in a separate field and use a FULLTEXT index on that field.

The other option is to index the first_name and last_name fields separately. Then change your query to:

WHERE a.first_name LIKE "twain%" OR a.last_name LIKE "twain%"

Removing the wildcard from the beginning will allow the indexes to be used.

Sql-server – SQL Server 2008 R2 : A Possible indexing strategy for the given schema

The way this is designed you only have suboptimal choices. Random GUIDs are not well suited as clustered index keys, since they are neither small (which affects the size of all secondary indexes) nor sequential (unless you can use NEWSEQUENTIALID()) which leads to index fragmentation, which leads to wasted space, slower insert performance and slower query performance through more I/O.

On the other hand, if your normalized tables are linked by such a GUID then each join depends on them and you will have to bite the bullet and use them as primary keys with clustered index anyway. Just create the PRIMARY KEY constraint and the clustered index in separate steps so you can define PAD_INDEX = ON and FILLFACTOR=50 to slow down the fragmentation somewhat. Still, expect to do regular, expensive index REBUILDs to reduce the inevitable fragmentation.

Your secondary indexes must not start with the id, because that renders them useless! Imagine a telephone book, where each entry is given a random or running id, then the phone book is sorted by that id plus the name. Have fun searching a given name in that. A useable index must start with the column that is used in the where- or join clause.

So, with the clustered indexes created so far you cover queries of the type

SELECT p.productname, s.name as StoreName 
FROM Products p
INNER JOIN Store s ON p.storeid = s.id

The query runs through the products, can efficiently look up the store ids and has immediate access to the store name, since the store id index is clustered.

Now you want to do this:

SELECT p.productname, s.name as StoreName 
FROM Products p
INNER JOIN Store s ON p.storeid = s.id
WHERE p.productname LIKE 'A%'

For this you need a nonclustered index with just productname as the key column (and optionally storeid as included column, if you do frequent range searches on productname).

OK, what about the reverse case?

SELECT p.productname, s.name as StoreName 
FROM Store s
INNER JOIN Products p ON p.storeid = s.id
WHERE s.name = 'My little cornershop'

For this, you need two additional indexes: One nonclustered on store with the name column and one nonclustered on products with storeid as the column. SQL Server can efficiently find the store record (expecting only one record), then through the second index can find all product entries for this store (still only a few compared to all entries in product), then for each of these products go through the clustered index (the clustered index key is automatically part of each nonclustered index) to get to the productname column.

I hope you see the pattern here. Create a nonclustered index for each column that gets queried with a high selectivity (meaning that only a small subset of all the rows will be selected).

The row-columns are completely useless in this scenario, just drop them to save space.

Using client generated GUIDs is attractive from the client point of view. You can create coherent datasets (such as a new customer including his first order) and push them to the database without caring for the correct INSERT order and without having to read database generated ids afterwards to update your object model. But you pay a nontrivial performance price for this when it comes to getting the data back from the database, as I hopefully made clear above. The large primary key (8 bytes) gets added to each nonclustered index and blows up its size, and you get a heavily fragmented clustered index which is never good.

Using IDENTITY values for primary keys has disadvantages at INSERT time, but pays off every time after that.

Update

Best Answer

Related Solutions

Mysql – index thesql concatenated columns

Sql-server – SQL Server 2008 R2 : A Possible indexing strategy for the given schema

Related Question