Sql-server – Alternative for WHERE clause in SQL query

cardinality-estimatessql serversql-server-2012

We have SQL Server 2012 Enterprise Edition running following query via a stored procedure:

declare @TopX int = 1000
declare @stores Table (Store varchar(5), LastDate datetime, LastId int, RangeEnd datetime)
insert into @stores 
  select * 
  from (select SourceStore, '2014-01-01' as i, null as ii, '2014-01-08' as iii 
        from StoreConfig.dbo.Version 
        group by SourceStore
       ) t 
  where (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < 50

IF OBJECT_ID('tempdb..#agreements') IS NOT NULL
DROP TABLE #agreements
IF OBJECT_ID('tempdb..#stores') IS NOT NULL
DROP TABLE #stores

select Store, 
       isnull(LastDate, '1899-01-01') StartDate, 
       isnull(LastId, -1) LastId, 
       isnull(RangeEnd, getdate()) RangeEnd
into #stores
from @stores
update #stores set StartDate = '2015-07-01', RangeEnd='2015-07-08'


-- THIS IS NOT FAST.. :(

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
--grab the @TopX agreements from primary customers
--note: only grabbing the columns in our index to prevent RID lookups 
  --on every agreement before we sort and take a relatively tiny subset
select top 1000
       a.SourceStore
       ,a.AgreementId
       ,isnull(a.ModifiedDate, a.CreatedDate) as ModifiedDate
       ,ca.CustomerId
from #stores s
inner join StoreOps.POSREPL3Agreement.Agreement a on a.SourceStore = s.Store
inner join StoreOps.Customer.CustomerAgreement ca on ca.SourceStore = a.SourceStore 
                                                 and ca.AgreementId = a.AgreementId 
                                                 and ca.IsPrimary = 1
where ( (a.ModifiedDate between s.StartDate and s.RangeEnd) 
       or ( a.ModifiedDate is null 
        and a.CreatedDate between s.StartDate and s.RangeEnd)
      ) 
  and ( isnull(a.ModifiedDate, a.CreatedDate) > s.StartDate 
     or a.AgreementId > s.LastId
      )
order by isnull(a.ModifiedDate, a.CreatedDate), a.AgreementId

We have following indexes for [Customer].[CustomerAgreement] table:

CREATE NONCLUSTERED INDEX [IX_CustomerAgreement_SourceStore_AgreementId] 
ON [Customer].[CustomerAgreement]
([SourceStore] ASC, [AgreementId] ASC, [IsPrimary] ASC)
INCLUDE ([CustomerId]) 
ON [PRIMARY]
GO

And this is index for [POSREPL3Agreement].[Agreement] table:

CREATE NONCLUSTERED INDEX [IX_Agreement_SourceStore_ModifiedDate]  
ON [POSREPL3Agreement].[Agreement]
([SourceStore] ASC, [ModifiedDate] ASC, [CreatedDate] ASC)
INCLUDE ([AgreementId]) 
ON [PRIMARY]
GO

If we remove the WHERE clause index works as expected and we see 1000 records from both the tables, but when we add the listed WHERE clause [Customer].[CustomerAgreement] estimates all the records instead of 1000.

How can we improve the WHERE clause or index to get [Agreement] table line up with [CustomerAgreement] table so that Estimated Row under [CustomerAgreement] is not ALL the records?

Table definitions

CREATE TABLE [Customer].[CustomerAgreement](
    [CustomerAgreementId] [int] NOT NULL,
    [CustomerId] [int] NOT NULL,
    [AgreementId] [int] NOT NULL,
    [IsPrimary] [bit] NOT NULL,
    [Store] [varchar](5) NOT NULL,
    [SourceStore] [varchar](5) NOT NULL,
    [RowGUID] [uniqueidentifier] ROWGUIDCOL  NOT NULL,
    [Repl_ID] [tinyint] NOT NULL,
 CONSTRAINT [PK_Customer_CustomerAgreementID_SourceStore] PRIMARY KEY NONCLUSTERED 
(
    [CustomerAgreementId] ASC,
    [SourceStore] ASC,
    [Repl_ID] ASC,
    [RowGUID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [POSREPL3Agreement].[Agreement](
    [AgreementId] [int] NOT NULL,
    [QuoteId] [int] NULL,
    [AgreementNumber] [varchar](11) NOT NULL,
    [AgreementStatusId] [tinyint] NOT NULL,
    [AgreementPrinted] [bit] NOT NULL,
    [IsNewPOSCreated] [bit] NOT NULL,
    [LeaseFrequencyId] [tinyint] NOT NULL,
    [DeferredLateFeeAmount] [decimal](8, 2) NOT NULL,
    [InHomeVisitFeeAmount] [decimal](8, 2) NOT NULL,
    [IsASPTaxable] [bit] NOT NULL,
    [ServicePlusRate] [decimal](7, 5) NOT NULL,
    [ServicePlusFloor] [decimal](5, 2) NOT NULL,
    [TaxRatePercentage] [decimal](7, 5) NOT NULL,
    [IgnoreTaxRateChange] [bit] NOT NULL,
    [Balance] [decimal](8, 2) NOT NULL,
    [AmountPaidToDate] [decimal](10, 2) NOT NULL,
    [Deposit] [decimal](8, 2) NOT NULL,
    [DeliveryFee] [decimal](8, 2) NOT NULL,
    [StartDate] [datetime] NOT NULL,
    [DueDay] [int] NOT NULL,
    [DueDayTypeId] [tinyint] NULL,
    [PaidThroughDate] [datetime] NOT NULL,
    [PayOutDate] [datetime] NOT NULL,
    [FinalDate] [datetime] NULL,
    [IsNSFOutstanding] [bit] NOT NULL,
    [LeadSourceId] [tinyint] NOT NULL,
    [AgreementTypeId] [tinyint] NOT NULL,
    [AcquisitionAgreementTypeId] [tinyint] NULL,
    [SameAsCashDuration] [int] NOT NULL,
    [SameAsCashDurationType] [int] NOT NULL,
    [MinimumPercentageOfCashPriceForFinalPayment] [decimal](3, 2) NOT NULL,
    [EarlyPayoutLeaseAmountRate] [decimal](3, 2) NOT NULL,
    [EarlyPayout] [decimal](10, 2) NULL,
    [FinalPaymentAdditionalFee] [decimal](8, 2) NOT NULL,
    [FinalPaymentProrateAmount] [decimal](10, 2) NOT NULL,
    [CreditedAssociateId] [int] NULL,
    [NonRenewalGracePeriod] [int] NOT NULL,
    [LastAgreementTransactionId] [int] NULL,
    [CanPayout] [bit] NOT NULL,
    [IsServicePlusIncludedInPayout] [bit] NOT NULL,
    [IsProrateEnabled] [bit] NOT NULL,
    [AgreementDocument] [varbinary](max) NULL,
    [CreatedDate] [datetime] NOT NULL,
    [CreatedBy] [int] NULL,
    [ModifiedDate] [datetime] NULL,
    [ModifiedBy] [int] NULL,
    [Store] [varchar](5) NOT NULL,
    [SourceStore] [varchar](5) NOT NULL,
    [RowGUID] [uniqueidentifier] ROWGUIDCOL  NOT NULL,
    [Repl_ID] [tinyint] NOT NULL,
    [ECommerceDeliveredDate] [datetime] NULL,
    [DDEStatusId] [tinyint] NULL,
    [DDEAmount] [decimal](8, 2) NULL,
    [OrderMethodTypeId] [int] NULL,
    [SemiMonthlyUpcharge] [decimal](8, 2) NULL,
    [DefaultNonRenewalFee] [decimal](8, 2) NULL,
    [DefaultInHomeVisitFee] [decimal](8, 2) NULL,
    [NonRenewalSemiMonthlyFee] [decimal](8, 2) NULL,
    [NonRenewalSemiMonthlyFeeGracePeriod] [int] NULL,
    [NonRenewalFeeTypeId] [int] NULL,
    [NonRenewalSemiMonthlyRate] [decimal](8, 4) NULL,
    [NonRenewalMonthlyRate] [decimal](8, 4) NULL,
    [NonRenewalWeeklyFeeGracePeriod] [int] NULL,
    [NonRenewalWeeklyRate] [decimal](8, 4) NULL,
    [NonRenewalWeeklyFee] [decimal](8, 2) NULL,
    [DefaultNSFFee] [decimal](8, 2) NULL,
    [WeeklyUpcharge] [decimal](8, 2) NULL,
    [IsInHomeFeeEnabled] [bit] NULL,
    [CanChargeInHomeFeeAndNonRenewalFeeInSamePeriod] [bit] NULL,
    [ExtensionBalance] [decimal](8, 2) NOT NULL,
 CONSTRAINT [PK_Agreement_AgreementID_SourceStore] PRIMARY KEY NONCLUSTERED 
(
    [AgreementId] ASC,
    [SourceStore] ASC,
    [Repl_ID] ASC,
    [RowGUID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

Execution plan

Showplan XML download here (pastebin)

Best Answer

If we remove the WHERE clause index works as expected and we see 1000 records from both the tables, but when we add the listed WHERE clause [Customer].[CustomerAgreement] estimates all the records instead of 1000.

The short answer is that without any filtering of rows in the WHERE clause, the query optimizer estimates that it only needs to read 1000 rows from each table to produce the desired result of 1000 rows.

You did not provide an execution plan for this query, so I can't say much more than that. It is likely your query also omitted the ORDER BY clause in this case, otherwise a Sort would likely have been necessary, which would typically entail reading all rows from its subtree.

With the WHERE clause, the optimizer expects many more rows to be read (due to the expecting filtering effect) before the 1000th row is available to return to the client. The difference between the estimated and actual number of rows read is due to the problem of estimating the selectivity of the complex predicates from the available statistics. This is true even assuming the statistics are representative of the data. Fundamentally: it's too hard, and the optimizer estimates cardinality incorrectly.

1. Indexed computed column solution

You may find the following changes worthwhile:

  1. Index the #stores temporary table:

    CREATE UNIQUE CLUSTERED INDEX index_name 
    ON #stores 
    (
        Store,
        StartDate
    );
    
  2. Add a computed column to the Agreements table. This uses no storage, and is a very fast metadata-only operation:

    ALTER TABLE POSREPL3Agreement.Agreement
    ADD ComputedModifiedDate
    AS ISNULL(ModifiedDate, CreatedDate);
    
  3. Create (or modify the existing) index to use the computed column. This will satisfy the ORDER BY clause.

    CREATE INDEX index_name 
    ON POSREPL3Agreement.Agreement
    (
        ComputedModifiedDate,
        AgreementId
    )
    INCLUDE (SourceStore);
    
  4. Simplify the query to reference the computed column directly:

    SELECT TOP (1000)
        A.SourceStore,
        A.AgreementId,
        ModifiedDate = A.ComputedModifiedDate
    FROM #stores AS S
    JOIN POSREPL3Agreement.Agreement AS A
        ON A.SourceStore = S.Store
    JOIN Customer.CustomerAgreement AS CA
        ON CA.SourceStore = A.SourceStore
        AND CA.AgreementId = A.AgreementId
    WHERE
        CA.IsPrimary = 1
        AND A.ComputedModifiedDate BETWEEN S.StartDate AND S.RangeEnd
        AND 
        ( 
            A.ComputedModifiedDate > S.StartDate
            OR A.AgreementId > S.LastId
        )
    ORDER BY
        A.ComputedModifiedDate, 
        A.AgreementId;
    

If you are unable to rewrite the query as shown, the computed column index will need to include two extra column for technical reasons:

CREATE INDEX index_name 
ON POSREPL3Agreement.Agreement
(
    ComputedModifiedDate,
    AgreementId
)
INCLUDE 
(
    SourceStore,
    CreatedDate,
    ModifiedDate
)
WITH DROP_EXISTING;

The expected execution plan will still show inaccurate estimates (because the optimizer is over-optimistic about how quickly it can stop scanning the ordered computed column index) but the (Top N and Distinct) Sorts are eliminated, and ought to still perform better:

Expected execution plan

2. Comprehensive indexed view solution

If you unable to add the computed column and index, you could investigate using an indexed view instead:

CREATE VIEW dbo.ViewName
WITH SCHEMABINDING
AS
SELECT
    A.SourceStore,
    A.AgreementId,
    ComputedModifiedDate = ISNULL(A.ModifiedDate, A.CreatedDate)
FROM POSREPL3Agreement.Agreement AS A
JOIN Customer.CustomerAgreement AS CA
    ON CA.SourceStore = A.SourceStore
    AND CA.AgreementId = A.AgreementId
WHERE
    CA.IsPrimary = 1;
GO
CREATE UNIQUE CLUSTERED INDEX index_name
ON dbo.ViewName
(
    ComputedModifiedDate,
    AgreementId,
    SourceStore
);

The query then becomes:

SELECT TOP (1000)
    VN.SourceStore,
    VN.AgreementId,
    ModifiedDate = VN.ComputedModifiedDate 
FROM #stores AS S
JOIN dbo.ViewName AS VN
    WITH (NOEXPAND)
    ON VN.SourceStore = S.Store
WHERE
    VN.ComputedModifiedDate BETWEEN S.StartDate AND S.RangeEnd
    AND 
    ( 
        VN.ComputedModifiedDate > S.StartDate
        OR VN.AgreementId > S.LastId
    )
ORDER BY
    VN.ComputedModifiedDate, 
    VN.AgreementId;

3. Simpler indexed view solution

It is also possible to more directly reflect the computed column solution in an indexed view, though this idea does not eliminate the join:

CREATE VIEW dbo.ViewName
WITH SCHEMABINDING
AS
SELECT
    A.SourceStore,
    A.AgreementId,
    ComputedModifiedDate = ISNULL(A.ModifiedDate, A.CreatedDate)
FROM POSREPL3Agreement.Agreement AS A;
GO
CREATE UNIQUE CLUSTERED INDEX index_name
ON dbo.ViewName
(
    ComputedModifiedDate,
    AgreementId,
    SourceStore
);

This time the query becomes:

SELECT TOP (1000)
    VN.SourceStore,
    VN.AgreementId,
    ModifiedDate = VN.ComputedModifiedDate
FROM #stores AS S
JOIN dbo.ViewName AS VN
    WITH (NOEXPAND)
    ON VN.SourceStore = S.Store
JOIN Customer.CustomerAgreement AS CA
    ON CA.SourceStore = VN.SourceStore
    AND CA.AgreementId = VN.AgreementId
WHERE
    CA.IsPrimary = 1
    AND VN.ComputedModifiedDate BETWEEN S.StartDate AND S.RangeEnd
    AND 
    ( 
        VN.ComputedModifiedDate > S.StartDate
        OR VN.AgreementId > S.LastId
    )
ORDER BY
    VN.ComputedModifiedDate, 
    VN.AgreementId;

I had to guess at which columns would be unique in the view. If you implement either solution, take your wider workload into account, and add unique columns from the base tables if necessary to make the view clustered index unique. The [RowGUID] columns look favourite for this.


You should test carefully to assess the impact of the indexed view on data changes to the base tables, and the amount of storage required.

With any of these solutions, performance may not be as good as expected if the temporary table contains very early date ranges, or if there is a store that doesn't have a recent created or modified date. You should test it and see how it goes with your real data and requirements.

You should remove the statement that sets the isolation level to READ UNCOMMITTED if that is only there in a desperate attempt to increase performance. Please see my article for details.

Also, I notice that both tables are currently heaps. You should be aware that most tables benefit from having a clustered index, for space management reasons if nothing else. If the heaps experience deletions, you may need to rebuild them from time to time to reclaim empty pages that have done been automatically deallocated.

By the way, the RAND in your @stores query doesn't have the effect you probably think. It generates the same value for every row (a runtime constant).