Is a Clustered Index Scan bad for performance

azure-sql-databaseperformancequery-performance

I have a complex query (see below) which performs badly on a large database.
I'm now analyzing the query by removing the joins, look at the actual plan and step by step add the joins back.
I see clustered index scan with high I/O costs and a 100% CPU cost for a sort.
Can this query be more optimized?

WITH currentPeriod AS
    (SELECT ROW_NUMBER() OVER (PARTITION BY tp.[StartDate]
      ORDER BY m.[Created] DESC) AS rowIndex
            , m.[Created], m.[MessageId]
            , tp.[StartDate], tp.[EndDate], tp.[IsCorrection]
       FROM [TimePeriods] tp
       JOIN [Messages] m on m.Id = tp.[MessageId]
     )
    SELECT [Created], [StartDate], [EndDate], [IsCorrection]
      FROM CurrentPeriod cur
     WHERE rowIndex = 1
     OPTION (RECOMPILE)

The plan (using ApexSQL):

I already changed the PK index on the Message table from Clustered to Non-clustered and added a Clustered index on Created.

This is the 'original' plan without the Clustered index on Created:

Edit
Here's a link to the plan of the above query: https://www.brentozar.com/pastetheplan/?id=Hkw-4TxPB
Here's the plan of the full query: https://www.brentozar.com/pastetheplan/?id=SyYrDplPH

Background (can be skipped)
I'm not a DBA, I'm just the guy that knows most about databases not meaning I know a lot about databases.
The tables are filled using a custom upload application. The files uploaded are XML-files. Each file is a Message and a Message can have multiple TimePeriods. The Message can also have corrected TimePeriods of earlier Messages. I'm only interested in the last updated period. That's why I use ROW_NUMBER OVER and PARTITION BY. The resulting data of the query is correct. That has been verified.

The full query:

WITH currentPeriod AS
    (SELECT ROW_NUMBER() OVER (PARTITION BY ce.[ReferenceCode], tp.[StartDate], e.[Id], ep.[ContractNumber]
        ORDER BY m.[Created] DESC, ep.[ContractNumber] DESC, IIF(fp.[LbTab] = N'010', 1, 0), fp.[DatAanv] DESC) AS rowIndex
        , ce.[ReferenceCode], ce.[Name] as [EntityName]
        , e.[SocialSecurityNumber], e.[EmployeeNumber], e.[Initials], e.[Firstname], e.[Prefix], e.[Surname], e.[BirthDate]
        , CAST((DATEDIFF(DAY, e.[BirthDate], ep.[HireDate]) / 365.24) as FLOAT) [AgeHired], e.[Gender], e.[PhoneNumber], e.[PhoneNumber2], e.[Email]
        , TRIM(CONCAT_WS(' ', a.[Street], COALESCE(a.[Number], COALESCE(a.[NumberString], '')), COALESCE(a.[NumberExtension], ''))) [StreetLine]
        , a.[ZipCode], a.[City], a.[CountryCode]
        , ep.[HireDate], ep.[DepartureDate]
        , f.[AantVerlU], f.[LnSv], f.[AantSV]
        , fp.[IndAvrLkvOudrWn], fp.[IndAvrLkvAgWn], fp.[IndAvrLkvDgBafSb], fp.[IndAvrLkvHpAgWn], fp.[DatAanv]
        , fp.[LbTab], fp.[IndWAO], fp.[IndWW], fp.[IndZW]
        , tp.[StartDate], tp.[EndDate], tp.[IsCorrection]
        FROM [TimePeriods] tp
        JOIN [Messages] m on m.Id = tp.[MessageId]
        JOIN [CorporateEntities] ce on m.[CorporateEntityId] = ce.Id
        JOIN [Financials] f on f.[TimePeriodId] = tp.Id
        JOIN [FinancialPeriods] fp on fp.[FinancialId] = f.Id
        JOIN [EmployementPeriods] ep on ep.Id = f.[EmployementPeriodId]
        JOIN [Employees] e on e.Id = ep.[EmployeeId]  
        JOIN [Addresses] a on a.Id = e.Id
        WHERE NOT EXISTS (SELECT 1 
                            FROM [dbo].[Withdrawals] w 
                            JOIN [dbo].[TimePeriods] tp2 ON tp2.Id = w.[TimePeriodId] 
                            JOIN [dbo].[Messages] m2 ON m2.id = tp2.MessageId AND m2.[CorporateEntityId] = m.[CorporateEntityId]
                            WHERE w.SofiNr is not null AND w.SofiNr = e.[SocialSecurityNumber] AND tp2.StartDate = tp.[StartDate] 
                            AND w.[NumIv] = ep.[ContractNumber] AND m2.[Created] > m.[Created] AND CONVERT(date, m.[Created]) <= CONVERT(date,@messageDate)) 
        AND CONVERT(date, tp.StartDate) >= CONVERT(date, @date26Param) 
        AND CONVERT(date, tp.EndDate) <= CONVERT(date, @endDate) AND CONVERT(date, m.[Created]) <= CONVERT(date, @messageDate)) 

        SELECT *
        FROM CurrentPeriod cur
       WHERE rowIndex = 1 AND [AantVerlU] > 0 AND [LbTab] != N'010'
         AND (SELECT count(1) 
                FROM CurrentPeriod cur2 
                WHERE cur2.[StartDate] > DATEADD(week, -26, cur.[StartDate]) AND cur2.[StartDate] < cur.[StartDate]
                    AND COALESCE(cur2.[SocialSecurityNumber],  LEFT(cur2.[ReferenceCode], 9) + '-' + cur2.[EmployeeNumber]) = COALESCE(cur.[SocialSecurityNumber],  LEFT(cur.[ReferenceCode], 9) + '-' + cur.Employeenumber)
                    AND cur2.[AantVerlU] > 0 AND cur2.[LbTab] != N'010') = 0
         AND StartDate >= @startDate 
    ORDER BY [SocialSecurityNumber], [StartDate];

Best Answer

The optimiser might be having a hard time to optimise; spool the CTE part into a #Table and query that instead.

Question: Clustered Index Scan bad for performance? Well, not necessarily. Means you don't have an index the query can use that is better (more selective) though. Also performs worse for larger tables... but it could be worse - a heap (no order to the data at all).

AND (SELECT 

    count(1) 
                FROM CurrentPeriod cur2 
                WHERE cur2.[StartDate] > DATEADD(week, -26, cur.[StartDate]) AND cur2.[StartDate] < cur.[StartDate]
                    AND COALESCE(cur2.[SocialSecurityNumber],  LEFT(cur2.[ReferenceCode], 9) + '-' + cur2.[EmployeeNumber]) = COALESCE(cur.[SocialSecurityNumber],  LEFT(cur.[ReferenceCode], 9) + '-' + cur.Employeenumber)
                    AND cur2.[AantVerlU] > 0 AND cur2.[LbTab] != N'010') = 0

Your logic here dictates for each row it's got to compute the validity of whether the dates are applicable... You might want to store the result of the join criteria in the #table also

            AND COALESCE(cur2.[SocialSecurityNumber],  LEFT(cur2.[ReferenceCode], 9) + '-' + cur2.[EmployeeNumber]) = COALESCE(cur.[SocialSecurityNumber],  LEFT(cur.[ReferenceCode], 9) + '-' + cur.Employeenumber)

SQL is set orientated and super efficient within big bunches of information; but details on row level basis in massive tables... spells for RBAR hell - row by agonising row!

Related Solutions

Sql-server – SQL datatype choice for flagging a read/ unread message

Performance of a particular design depends on the distribution of data to a large extent, and the access path. Are you expecting most messages to have the flag, or most messages not to? From the docs:

My tests show that a table scan often starts to perform better than a nonclustered index access when at least 10 percent of the rows are selected. I also found that the optimizer switches from nonclustered index access to table scan prematurely (i.e., when nonclustered index access still shows better response time than the corresponding table scan). In many cases, the optimizer forces a table scan for queries with result sets of approximately 5 percent, although the table scan becomes more efficient than index access at selectivities of 8 to 10 percent.

And of course, if there are any other predicates on the query, and the clustered index if any. For example, are you likely to want to access all unread messages within a certain timeframe? Or sent to/from a particular user? How big are the message bodies and are they stored inline? These are just rhetorical questions, mind.

So which is better, a BIT and a DATETIME or just a DATETIME for performance? You will have to benchmark with some representative data and access patterns to find out. I'm afraid this one can't be answered with just theory. But you can of course create the table with the boolean in and just not use it if you find the latter is better - it only adds an overhead of 1 byte per row.

If you were asking the opposite question, how to efficiently query the not-NULL rows, I would have said use a filtered index.

Sql-server – How to speed up query on table with millions of rows

On reason this can happen is that you're using local variables.

The problem is that this query takes so much time to go, despite all of the indexes i've made on different columns.

Here's an example using a similar setup. In the Stack Overflow schema there's a narrow-ish table called Votes that looks like this.

With no index on CreationDate, our only option would be to scan the Clustered Index. But if we create one only on CreationDate, the optimizer can choose to use that index if it thinks doing a Key Lookup for the rest of the columns is cheaper than scanning the Clustered Index and applying a predicate.

CREATE INDEX ix_yourmom ON dbo.Votes(CreationDate)

For this query:

DECLARE @StartDate DATETIME = '2010-07-01';
DECLARE @EndDate DATETIME = '2010-07-02';

SELECT *
FROM   dbo.Votes AS v
WHERE  v.CreationDate BETWEEN @StartDate AND @EndDate;
GO

The cardinality estimate for unknown variables using between is 16.4317%. That leads to a clustered index scan and a missing index request for an index that covers the entire query.

If you run the query with RECOMPILE, you allow for the parameter embedding optimization.

DECLARE @StartDate DATETIME = '2010-07-01';
DECLARE @EndDate DATETIME = '2010-07-02';

SELECT *
FROM   dbo.Votes AS v
WHERE  v.CreationDate BETWEEN @StartDate AND @EndDate
OPTION ( RECOMPILE );

Which gives us a different query plan, and a more accurate estimate.

Hope this helps!

Best Answer

Related Solutions

Sql-server – SQL datatype choice for flagging a read/ unread message

Sql-server – How to speed up query on table with millions of rows

Related Question