SQL Server Query Performance – Slow Query with ‘Contains’ and ‘=’ in WHERE Clause

full-text-searchsql serversql-server-2008-r2

The following query takes about 10 seconds to finish on a table with 12k records

select top (5) *
from "Physician"
where "id" = 1 or contains("lastName", '"a*"')

But if I change the where clause to either

where "id" = 1

where contains("lastName", '"a*"')

It'll return instantly.

Both columns are indexed and the lastName column is also full-text indexed.

CREATE TABLE Physician
(
   id         int identity    NOT NULL,
   firstName  nvarchar(100)   NOT NULL,
   lastName   nvarchar(100)   NOT NULL
);

ALTER TABLE Physician
  ADD CONSTRAINT Physician_PK
  PRIMARY KEY CLUSTERED (id);

CREATE NONCLUSTERED INDEX Physician_IX2
   ON Physician (firstName ASC);

CREATE NONCLUSTERED INDEX Physician_IX3
   ON Physician (lastName ASC);

CREATE FULLTEXT INDEX
    ON "Physician" ("firstName" LANGUAGE 0x0, "lastName" LANGUAGE 0x0)
    KEY INDEX "Physician_PK"
    ON "the_catalog"
    WITH stoplist = off;

Here is the Execution Plan

What could be the problem?

Best Answer

Your execution plan

When looking at the query plan, we can see that one index is touched to serve two filter operations.

Very simply put, due to the TOP operator, a row goal was set. Much more information & prerequisites on row goals can be found here

From that same source:

A row goal strategy generally means favouring non-blocking navigational operations (for example, nested loops joins, index seeks, and lookups) over blocking, set-based operations like sorting and hashing. This can be useful whenever the client can benefit from a quick start-up and steady stream of rows (with perhaps a longer overall execution time – see Rob Farley's post above). There are also the more obvious and traditional uses e.g. in presenting results a page at a time.

The entire table gets probed into the filters with the use of a left semi join that has a row goal set, hoping to return the 5 rows as fast and efficient as possible.

This does not happen, resulting in many iterations over the .Fulltextmatch TVF.

Recreating

Based on your plan, I was able to somewhat recreate your problem:

CREATE TABLE dbo.Person(id int not null,lastname varchar(max));

CREATE UNIQUE INDEX ui_id ON  dbo.Person(id)
CREATE FULLTEXT CATALOG ft AS DEFAULT;  
CREATE FULLTEXT INDEX ON dbo.Person(lastname)   
   KEY INDEX ui_id   
   WITH STOPLIST = SYSTEM;  
GO  

INSERT INTO dbo.Person(id,lastname)
SELECT top(12000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
REPLICATE(CAST('A' as nvarchar(max)),80000)+ CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) as varchar(10))
FROM master..spt_values spt1
CROSS APPLY master..spt_values spt2;
CREATE CLUSTERED INDEX cx_Id on dbo.Person(id);

Running the query

SELECT TOP (5) *
FROM dbo.Person
WHERE "id" = 1 OR contains("lastName", '"B*"');

Results into a query plan comparable to yours:

In the above example, B does not exist in the fulltext index. As a result it depends on the parameter & data how efficient the query plan can be.

A better explanation of this can be found in Row Goals, Part 2: Semi Joins by Paul White

...In other words, on each iteration of an apply, we can stop looking at input B as soon as the first match is found, using the pushed-down join predicate. This is exactly the sort of thing a row goal is good for: generating part of a plan optimized to return the first n matching rows quickly (where n = 1 here).

For example, changing the predicate so the results are found way sooner (at the beginning of the scan).

select top (5) *
from dbo.Person
where "id" = 124 
or contains("lastName", '"A*"');

the where "id" = 124 gets eliminated due to the fulltext index predicate already returning 5 rows, satisfying the TOP() predicate.

The results show this as well

id lastname 
1  'AAA...'   
2  'AAA...'
3  'AAA...'
4  'AAA...'
5  'AAA...'

And the TVF executions:

Inserting some new rows

INSERT INTO dbo.Person
SELECT 12001, REPLICATE(CAST('B' as nvarchar(max)),80000);
INSERT INTO dbo.Person
SELECT 12002, REPLICATE(CAST('B' as nvarchar(max)),80000);

Running the query to find these previous inserted rows

SELECT TOP (2) *
from dbo.Person
where "id" = 1
or contains("lastName", '"B*"');

This again results in too many iterations over almost all of the rows to return the last but one value found.

id   lastname
1     'AAA...'
12001 'BBB...'

Resolving

When removing the row goal by using traceflag 4138

SELECT TOP (5) *
FROM dbo.Person
WHERE "id" = 124 
OR contains("lastName", '"B*"')
OPTION(QUERYTRACEON 4138 );

The optimizer uses a join pattern closer to implementing a UNION, in our case this is favourable as it pushes the predicates down to their respective clustered index seeks, and does not use the row goaled left semi join operator.

Another way to write this, without using the above mentioned traceflag:

SELECT top (5) *
FROM
(
SELECT * 
FROM dbo.Person
WHERE "id" = 1 
UNION
SELECT * 
FROM dbo.Person
WHERE contains("lastName", '"B*"')
 ) as A;

With the resulting query plan:

where the fulltext function is applied directly

As a sidenote, for op, the query optimizer hotfix traceflag 4199 resolved his problem. He implemented this by adding OPTION(QUERYTRACEON(4199)) to the query. I was not able to reproduce that behaviour on my end. This hotfix does contain a semi join optimization:

Trace Flag: 4102 Function: SQL 9 - Query performance is slow if the execution plan of the query contains semi join operators Typically, semi join operators are generated when the query contains the IN keyword or the EXISTS keyword. Enable flag 4102 and 4118 to overcome this.

Source

Extra

During cost based optimization, the optimizer could also add an index spool to the execution plan, implemented by LogOp_Spool Index on fly Eager (or the physical counterpart)

It does this with my dataset for TOP(3) but not for TOP(2)

SELECT TOP (3) *
from dbo.Physician
where "id" = 1
or contains("lastName", '"B*"')

On the first execution, an eager spool reads and stores the entire input before returning the subset of rows that is requested by the Predicate Later executions read and return the same or a different subset of rows from the worktable, without ever having to execute the child nodes again.

Source

With the seek predicate applied to this index eager spool:

Related Solutions

Sql-server – Performance of SQL Server indexing

Depending on the amount of RAM your server has, it might just read the whole table into memory and keep it there - so your first query will load it, and any subsequent queries will just operate on those pages loaded in memory.

To get any meaningful data, you need to flush the cache before every run of your query! (run DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS before any query execution!)

I tried your scenario with 100'000 dummy file names and almost the same table structure (I changed FileName to be VARCHAR(260)).

Without index:

Table 'TestFiles'. Scan count 1, logical reads 2137, physical reads 5, read-ahead reads 2136, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Query Subtree Cost 1.68884

With non-clustered index on filename:

Table 'TestFiles'. Scan count 1, logical reads 6, physical reads 4, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Query Subtree Cost 0.0065704

This was done with this statement:

DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS 

SET STATISTICS IO ON

SELECT * 
FROM testfiles
WHERE FileName = 'File-D8584B44-518F-428A-86A1-7836E0B60502'

Sql-server – Best way to index a table with two datetime columns that are equally queried

There is no substitute for benchmarking. To answer the question, I would create and populate several possible tables. Then I would expose these tables to your typical workload, and benchmark.

Including additional columns in you NCIs will slow down modifications and speed up selects. Based on frequency of both, we can choose which approach uses less resources. If a row is on average read twice per year, your conclusions might be different as compared to the case when every row is on average read twice per minute.

Besides, not all queries are born equal. If some queries must complete in certain time no matter what, then you must make sure these requirements are met. Obviously such requirement trump the common good approach described above. Only you can know the actual requirements.