Sql-server – Indexing Referenced Table Columns

index-tuningsql serverstring-searching

I would like to do a search across columns of joined tables. What would the best index configuration be for the query below?

Table A
------------
Id
ValueA
-- Table A has many more columns

Table B
------------
Id
AId
ValueB
-- Table B has many more columns

At the moment, I have clustered indexes on the Primary Keys and non-clustered indexes (which don't include any columns) for columns AId, ValueA, and ValueB.

SELECT *
FROM TableA a
INNER JOIN TableB b ON a.Id = b.AId
WHERE 
    a.ValueA LIKE 'SearchTerm%' OR
    b.ValueB LIKE 'SearchTerm%'

Update
Table A is likely to run into the low millions while Table B will only have low tens of thousands of entries.

In the real world Table A will be a list of transactions, where the value is some sort of a transaction reference. Table B will be a list of users where the value could be anything such as a name or any other descriptive information.

At the moment we use an ORM which will essentially do a SELECT *. Naturally this could be changed, but there would be a fairly large impact on the application side.

Best Answer

Your query is a tricky one for the optimizer, for two reasons.

The optimizer doesn't currently have logic to transform a disjunction (OR) across tables into a union (related Q & A).
The optimizer doesn't reason about deferring lookups (dealing with keys only until the last moment). This relates to the SELECT * component of the given query.

Together, these limitations mean there is very little indexing can do to assist the query as written (ORM generated), leaving aside the possibility of a (nonclustered) column store index.

If performance is inadequate, you may have to bite the bullet and write a custom query. In that case, modifying the existing index on ValueB to include AId would be useful, coupled with a rewrite like the following:

SELECT TA.*, TB.* 
FROM 
(
    -- Disjunction as union
    SELECT A.Id, b.Id
    FROM TableA a 
    INNER JOIN TableB b 
        ON b.AId = a.Id
    WHERE 
        a.ValueA LIKE 'SearchTerm%'

    UNION

    SELECT A.Id, b.Id
    FROM TableB b 
    INNER JOIN TableA a 
        ON a.Id = b.AId
    WHERE 
        b.ValueB LIKE 'SearchTerm%'
) AS M (A_Id, B_Id)
-- Lookups
JOIN dbo.TableA AS TA ON TA.Id = M.A_Id
JOIN dbo.TableB AS TB ON TB.Id = M.B_Id;

The target plan shape is:

How much better this is in practice depends on several factors, including how selective the LIKE predicates are (and that they are always prefix searches).

Guessed schema:

DROP TABLE IF EXISTS dbo.TableB, dbo.TableA;

CREATE TABLE dbo.TableA 
(
    Id integer NOT NULL PRIMARY KEY, 
    ValueA varchar(256) NULL INDEX iValueA,
    Padding char(4000) NOT NULL DEFAULT '', -- other columns

    /*INDEX i (ValueA) INCLUDE (Id)*/
);

CREATE TABLE dbo.TableB 
(
    Id integer NOT NULL PRIMARY KEY,
    AId integer NOT NULL REFERENCES dbo.TableA (Id) INDEX iAId,
    ValueB varchar(256) NULL /*INDEX iValueB*/,
    Padding char(2000) NOT NULL DEFAULT '', -- other columns

    INDEX i (ValueB) INCLUDE (AId)
);

UPDATE STATISTICS dbo.TableA WITH ROWCOUNT = 3000000, PAGECOUNT = 3000000;
UPDATE STATISTICS dbo.TableB WITH ROWCOUNT = 30000, PAGECOUNT = 30000;

Related Solutions

Sql-server – SQL Server 2008 R2 : A Possible indexing strategy for the given schema

The way this is designed you only have suboptimal choices. Random GUIDs are not well suited as clustered index keys, since they are neither small (which affects the size of all secondary indexes) nor sequential (unless you can use NEWSEQUENTIALID()) which leads to index fragmentation, which leads to wasted space, slower insert performance and slower query performance through more I/O.

On the other hand, if your normalized tables are linked by such a GUID then each join depends on them and you will have to bite the bullet and use them as primary keys with clustered index anyway. Just create the PRIMARY KEY constraint and the clustered index in separate steps so you can define PAD_INDEX = ON and FILLFACTOR=50 to slow down the fragmentation somewhat. Still, expect to do regular, expensive index REBUILDs to reduce the inevitable fragmentation.

Your secondary indexes must not start with the id, because that renders them useless! Imagine a telephone book, where each entry is given a random or running id, then the phone book is sorted by that id plus the name. Have fun searching a given name in that. A useable index must start with the column that is used in the where- or join clause.

So, with the clustered indexes created so far you cover queries of the type

SELECT p.productname, s.name as StoreName 
FROM Products p
INNER JOIN Store s ON p.storeid = s.id

The query runs through the products, can efficiently look up the store ids and has immediate access to the store name, since the store id index is clustered.

Now you want to do this:

SELECT p.productname, s.name as StoreName 
FROM Products p
INNER JOIN Store s ON p.storeid = s.id
WHERE p.productname LIKE 'A%'

For this you need a nonclustered index with just productname as the key column (and optionally storeid as included column, if you do frequent range searches on productname).

OK, what about the reverse case?

SELECT p.productname, s.name as StoreName 
FROM Store s
INNER JOIN Products p ON p.storeid = s.id
WHERE s.name = 'My little cornershop'

For this, you need two additional indexes: One nonclustered on store with the name column and one nonclustered on products with storeid as the column. SQL Server can efficiently find the store record (expecting only one record), then through the second index can find all product entries for this store (still only a few compared to all entries in product), then for each of these products go through the clustered index (the clustered index key is automatically part of each nonclustered index) to get to the productname column.

I hope you see the pattern here. Create a nonclustered index for each column that gets queried with a high selectivity (meaning that only a small subset of all the rows will be selected).

The row-columns are completely useless in this scenario, just drop them to save space.

Using client generated GUIDs is attractive from the client point of view. You can create coherent datasets (such as a new customer including his first order) and push them to the database without caring for the correct INSERT order and without having to read database generated ids afterwards to update your object model. But you pay a nontrivial performance price for this when it comes to getting the data back from the database, as I hopefully made clear above. The large primary key (8 bytes) gets added to each nonclustered index and blows up its size, and you get a heavily fragmented clustered index which is never good.

Using IDENTITY values for primary keys has disadvantages at INSERT time, but pays off every time after that.

Mysql – Low Cardinality of Index performing better than High Cardinality

The between condition is short form for:

 ... range_from <= $value 
 and $value <= range_to

In this context range_from and range_to have an effective cardinality of 2, in range or out of range.

Given your indexes the range_from needs to be searched for all values up to and including $value. If value is close to the end the possible values, the number of records that need to be checked is quite high. If you are looking for recent records, this is will be close to all the records.

Using the index with the depot column first, far fewer records match the range_from value. If you have 100 depots, you have 1% of the records to check compared to not using it. Depots with few records will be faster than those with more records. Your results on table A will vary depending on the value of $other.

In either case, the query could be completed within the index. Without depot in the index, the query optimizer might decide to scan the table rather than the index. Some SQL dialects allow a hint to the optimizer indicating the number of rows the query will match. This can influence the resulting query plan.

While you consider the cardinality low, from an optimization standpoint, it is quite high it reduces the number of index rows to be scanned by a factor of 100.

While the cardinality of the from_range may be high from a values standpoint. When running a range search it is of little value in searching for records. For high values (which may be the most common case), it will match close to 100% of the records. Its high cardinality, is of more use when running a range check of the form:

where from_range between $low and $high

The index may be more useful if the range columns are ordered range_to, range_from.

Best Answer

Related Solutions

Sql-server – SQL Server 2008 R2 : A Possible indexing strategy for the given schema

Mysql – Low Cardinality of Index performing better than High Cardinality

Related Question