Sql-server – Performance of query with complex where clause on 10M rows

performancequery-performancesql serversql-server-2008

We have a query that destroys our production server when run.

It is part of a reporting functionality and the bad part looks as follows:

SELECT DISTINCT 
    mt.ID           AS ID
FROM 
    [dbo].[MyTable]    mt
WITH (NOLOCK)
WHERE           
    (@aVariable           IS NULL 
      OR (CONVERT(VARCHAR(22), mt.Date1, 112) >= CONVERT(VARCHAR(22), @date1, 112))
    AND (@status            IS NULL 
    OR @status <> 2 
    OR  (   @status = 2 
        AND (  SELECT COUNT(*) 
               FROM 
               MyTable mt2
               WITH (NOLOCK)
               WHERE 
                  mt2.CaseID = mt.CaseID 
                  AND mt2.Date1 > mt.Date1
             ) = 0
        )
    )
    AND (@aSecondVariable IS NULL 
           OR (CONVERT(VARCHAR(22), mt.Date1, 112) <= CONVERT(VARCHAR(22), @date1, 112)))
    AND (@aThirdVariable  IS NULL 
           OR (CONVERT(VARCHAR(22), mt.Date2, 112) >= CONVERT(VARCHAR(22), @date2, 112)))
    AND (@aFourtVariable  IS NULL 
          OR (CONVERT(VARCHAR(22), mt.Date2, 112) <= CONVERT(VARCHAR(22), @date2, 112)))

Furthermore, the table and indexes are created as follows:

CREATE INDEX [MyIndex] ON [dbo].[MyTable]
    ([AColumn], [AColumn2], [Date1], [Date2]) 
WITH (FILLFACTOR = 90)

MyTable has about 80 columns and a single column Primary Key: (ID).

The MyTable mt consists of about 10.000.000 rows. There is an index on the columns which contain aVariable, aSecondVariable, aThirdVariable and aFourthVariable. About half of the values of the date columns are null. In the index they are on place 3 and 4.

When we run the query on one server (without users) it performs really well. When we run it on production (with users) it takes too long and times out.

We are wondering how this can be. The execution plans are the same on both servers. We thought that the result might be cached somewhere or that the memory which is free (2GB RAM) is insufficient.

We are no experts on database performance and hope some real DBAs can provide us with their views. Thanks.

Best Answer

A few suggestions:

Remove the DISTINCT as the ID is the primary key. There is no way you'll get duplicate rows in the result.
Do not convert the datetime columns. This makes your conditions non-sargable and the query will always do a table scan. The variables would need no conversion either if they are declared as dates but that is not a problem for sargability.
Use close-open ended ranges for date and datetimes, as @Aaron suggests in the comments. Please read his blog post: Bad habits to kick : mis-handling date / range queries .
Use NOT EXISTS instead of (SELECT COUNT(*) ...) = 0 to check if there is no rows matching some condition.
Drop the WITH (NOLOCK) hints. Another blog post: Bad habits : Putting NOLOCK everywhere and a question in this site with several valuable points: Is NOLOCK always bad?.
Add appropriate indexes. I'd guess separate indexes on (Date1) and (Date2) or an index on (Date1, Date2) would be ok but that needs testing. An index on (CaseId, Date1) will be useful for the subquery.
(Minor note) always use the schema prefixes. Another blog post by @Aaron Bertrand: Bad habits to kick : avoiding the schema prefix.
Add OPTION (RECOMPILE) (as @Mikael Eriksson suggested). This basically tells the optimizer to not rely on cached plans but spend some time for each query run to recompile the query - i.e. produce a new plan according to the new parameter values. With 7 variables that can change the query substantially, this seems a very good option. Read the article from @Paul White for more detailed explanation of the "Parameter Embedding Optimization", parameter sniffing and also other options and advantages: Parameter Sniffing, Embedding, and the RECOMPILE Options.

The query rewritten:

SELECT  
    mt.ID 
FROM 
    [dbo].[MyTable] AS mt
WHERE 
      (@aVariable       IS NULL OR mt.Date1 >= CAST(@date1 AS DATE))
  AND (@aSecondVariable IS NULL OR mt.Date1 < DATEADD(day, 1, CAST(@date1 AS DATE)))
  AND (@aThirdVariable  IS NULL OR mt.Date2 >= CAST(@date2 AS DATE))
  AND (@aFourtVariable  IS NULL OR mt.Date2 < DATEADD(day, 1, CAST(@date2 AS DATE)))
  AND ( @status IS NULL 
     OR @status <> 2 
     OR ( @status = 2 
          AND NOT EXISTS
              ( SELECT * 
                FROM dbo.MyTable AS mt2
                WHERE mt2.CaseID = mt.CaseID 
                  AND mt2.Date1 > mt.Date1
              )
        ) 
      ) 
OPTION (RECOMPILE) ;

The part about @status can be written in a slightly more compact way.

I don't think this will make much difference (if any) in performance and it may look obfuscated but I add for completeness this version as well:

SELECT  
    mt.ID 
FROM 
    [dbo].[MyTable] AS mt
WHERE 
      (@aVariable       IS NULL OR mt.Date1 >= CAST(@date1 AS DATE))
  AND (@aSecondVariable IS NULL OR mt.Date1 < DATEADD(day, 1, CAST(@date1 AS DATE)))
  AND (@aThirdVariable  IS NULL OR mt.Date2 >= CAST(@date2 AS DATE))
  AND (@aFourtVariable  IS NULL OR mt.Date2 < DATEADD(day, 1, CAST(@date2 AS DATE)))
  AND NOT EXISTS
      ( SELECT * 
        FROM dbo.MyTable AS mt2
        WHERE @status = 2 
          AND mt2.CaseID = mt.CaseID 
          AND mt2.Date1 > mt.Date1
      )
OPTION (RECOMPILE) ;

Related Solutions

Sql-server – How to speed up query on table with millions of rows

On reason this can happen is that you're using local variables.

The problem is that this query takes so much time to go, despite all of the indexes i've made on different columns.

Here's an example using a similar setup. In the Stack Overflow schema there's a narrow-ish table called Votes that looks like this.

With no index on CreationDate, our only option would be to scan the Clustered Index. But if we create one only on CreationDate, the optimizer can choose to use that index if it thinks doing a Key Lookup for the rest of the columns is cheaper than scanning the Clustered Index and applying a predicate.

CREATE INDEX ix_yourmom ON dbo.Votes(CreationDate)

For this query:

DECLARE @StartDate DATETIME = '2010-07-01';
DECLARE @EndDate DATETIME = '2010-07-02';

SELECT *
FROM   dbo.Votes AS v
WHERE  v.CreationDate BETWEEN @StartDate AND @EndDate;
GO

The cardinality estimate for unknown variables using between is 16.4317%. That leads to a clustered index scan and a missing index request for an index that covers the entire query.

If you run the query with RECOMPILE, you allow for the parameter embedding optimization.

DECLARE @StartDate DATETIME = '2010-07-01';
DECLARE @EndDate DATETIME = '2010-07-02';

SELECT *
FROM   dbo.Votes AS v
WHERE  v.CreationDate BETWEEN @StartDate AND @EndDate
OPTION ( RECOMPILE );

Which gives us a different query plan, and a more accurate estimate.

Hope this helps!

Mysql – Identical query, tables, but different EXPLAIN and performance

Just make a simple join. Sub-queries does not provide the best result quite often

EXPLAIN SELECT l.id, l.level_name, l.date_published, l.rating
FROM levels AS l
INNER JOIN users_favorites AS uf 
ON uf.level_id = l.id
WHERE l.user_id = 2;

Best Answer

Related Solutions

Sql-server – How to speed up query on table with millions of rows

Mysql – Identical query, tables, but different EXPLAIN and performance

Related Question