SQL Server – Understanding performance of highly unique WHERE clauses

azure-sql-database, condition, optimization, performance, sql-server

I've been struggling to understand how to deal with a specific type of performance issue that shows up very often in this scenario: when you want to apply multiple filters in a query, but you know that the first filter will return only a very small number of rows from a very large table.

For example, we have a 3rd-party heterogeneous table with 10M+ rows where the columns mean different things based on the "object type" in TYPEID. Here is an example query:

SELECT ID, NAME, INT109 FROM DATA WHERE TYPEID = 8301514 AND INT109 = 1

In this query, there are no covering indexes for the two filters, but there is an index on the 'TYPEID' column. What is perplexing is that even though there are only about 500 rows out of the 10M in the table with TYPEID = 8301514, this query sometimes takes many seconds to run.
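
For reference, the existing index is just a single-column index on TYPEID, something like the statement below (the index name is only illustrative):

CREATE NONCLUSTERED INDEX IX_DATA_TYPEID ON DATA (TYPEID)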

If I simply remove the INT109 = 1 filter at the end, the query runs almost instantly:

SELECT ID, NAME, INT109 FROM DATA WHERE TYPEID = 8301514

It makes no sense to me that having fewer filters would make a query run much faster. The behavior also seems to be inconsistent: the first query can run really fast too if it has been run multiple times, as if something is being cached. It's difficult to do reliable experiments (this is in SQL Azure). Is this normal behavior? Can it be caused by a bad execution plan (even though I'm not using parameters) or by out-of-date statistics?

Best Answer

What's probably going on here is that at some point you ran that first query with a TYPEID value that matched a lot more than 500 rows, so many that the optimizer decided it was better to scan the 10M rows than to do hundreds of thousands of lookups to fetch the NAME and INT109 columns. That plan got cached and was then reused when you supplied a different value that would have benefited from a different plan. SQL Server assumed it was cheaper to avoid recompiling than to re-evaluate the plan for each value you might provide.
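
One way to test that theory, as a quick diagnostic sketch rather than a fix (reusing the DATA table and literal values from the question), is to force a fresh compile with OPTION (RECOMPILE), which builds a plan for these specific values and does not cache it for reuse:

SELECT ID, NAME, INT109 FROM DATA WHERE TYPEID = 8301514 AND INT109 = 1 OPTION (RECOMPILE)

If the query is consistently fast with that hint, a reused plan is almost certainly the culprit.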

When you ran the modified query (without the INT109 filter), it was compiled fresh and got the best plan for that value, although that plan in turn might not be ideal for a different value.

The best way to fix this is to have an index on (TYPEID, INT109) INCLUDE (ID, NAME), so that no lookups are needed and the plan will seek regardless of the statistics on TYPEID. This index is also useful even if you leave out the INT109 filter.
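
A minimal sketch of that index (the index name is illustrative; adjust it to your naming convention):

CREATE NONCLUSTERED INDEX IX_DATA_TYPEID_INT109 ON DATA (TYPEID, INT109) INCLUDE (ID, NAME)

With that in place, the optimizer can seek on both predicates and return ID and NAME straight from the index, with no key lookups.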