SQL Server – Why Execution Plan Chooses Scan Over Index?

execution-plan, index, sql-server

Here is my query (it is a Microsoft Axapta query):

(@P1 bigint)
SELECT TOP 1 T1.JOURNALNUM,T1.LINENUM,T1.ACCOUNTTYPE,T1.COMPANY,T1.TXT,
T1.AMOUNTCURDEBIT,T1.CURRENCYCODE,T1.EXCHRATE,T1.TAXGROUP,
T1.CASHDISCPERCENT,T1.QTY,T1.BANKNEGINSTRECIPIENTNAME,
-- *Snipped lots of columns in T1* --
T1.MODIFIEDDATETIME,T1.RECVERSION,T1.PARTITION,T1.RECID 
FROM LEDGERJOURNALTRANS T1 
WHERE (((PARTITION=123123123) AND (DATAAREAID=N'test')) AND (REVRECID=@P1))

Current execution plan:

(image: current plan)

Actually, there is an appropriate index on the table.

Index columns : (PARTITION,DATAAREAID,REVRECID)

Fragmentation :

I tried forcing the index. The resulting execution plan (index seek + key lookup) is faster than the current plan (index scan):

(image: forced index plan)

I also tried:

Running UPDATE STATISTICS
Changing the index column order, for example (REVRECID,PARTITION,DATAAREAID)

Why does MSSQL choose the clustered index scan?

Best Answer

Estimates, a huge number of columns selected, and predicate pushdown

The optimizer's estimates do not account for the cost of applying the residual predicate during the scan. With that cost left out, the clustered index scan looks cheaper than the index seek + key lookup needed to fetch all the extra columns from the clustered index, so the clustered index scan + residual predicate is chosen instead of the index seek.

My version is Microsoft SQL Server 2016 (RTM-GDR)

These estimates on predicate pushdown were improved in SQL Server 2016 SP1:

Update to improve diagnostics for query execution plans that involve residual predicate pushdown in SQL Server 2016

To improve diagnostics for the scenario that's described in the "Symptoms" section, SQL Server 2016 Service Pack 1 (SP1) introduces a new showplan XML attribute, Estimated Rows Read. This attribute provides the estimated count of rows will be read by the operator before the residual predicate is applied. This update is a complement to KB 3107397.

This adds the EstimatedRowsRead attribute to the query plan XML. In your case, if the scan is chosen, its value would be close to or equal to the number of rows the residual predicate is applied to.

This should fix your issue.

Residual predicate example

Reading 1.2M rows to return 0

Index scan query Estimated total cost

EstimatedTotalSubtreeCost="0.00449281">

Index seek query Estimated total cost

EstimatedTotalSubtreeCost="0.00672858">

The index seek estimate is higher than the index scan estimate because the scan estimate does not account for the residual predicate, and that is why the worse-performing plan was chosen.
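As a rough illustration of the comparison above (not a model of the optimizer's internals), the sketch below picks the plan with the lowest estimated cost from the two figures quoted. The dictionary keys and the choose_plan helper are hypothetical names used only for this example.

```python
# Estimated plan costs quoted above (from the SQL Server 2016 RTM plans).
estimated_costs = {
    "clustered index scan": 0.00449281,
    "index seek + key lookup": 0.00672858,
}


def choose_plan(costs):
    """Return the plan name with the lowest estimated cost (hypothetical helper)."""
    return min(costs, key=costs.get)


# A cost-based choice only sees the estimates, not the actual runtime.
chosen = choose_plan(estimated_costs)
print(chosen)
```

Because the scan's estimate leaves out the work of evaluating the residual predicate, the cheaper-looking plan wins even though it is slower at run time.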

The main solution

The main solution would be upgrading to at least SP1 to add the:

Update to improve diagnostics for query execution plans that involve residual predicate pushdown in SQL Server 2016

You should patch sooner and more often. SP2 CU6 has been out since March 19, 2019, and would be a far better choice.

As a side note, SP1 for SQL Server 2016 also makes many features that were previously Enterprise-only, such as In-Memory OLTP, compression, and columnstore indexes, available in the lower editions.

Other workarounds that may or may not be worth mentioning

Selecting fewer columns if they are not all needed
Adding all these columns to the NC index's included columns
You could try disabling row goals with OPTION(QUERYTRACEON 4138) (maybe)
Using the WITH(INDEX(...)) hint

Comparison with SQL Server 2016 SP1

When running a query like yours on SQL Server 2016 SP1, with the clustered index forced:

The estimated subtree cost is much higher.

EstimatedTotalSubtreeCost="93.6951"

Whereas your estimated subtree cost for the clustered index scan

<RelOp AvgRowSize="4788" EstimateCPU="1.36996" EstimateIO="185.267" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="1" LogicalOp="Clustered Index Scan" NodeId="1" Parallel="false" PhysicalOp="Clustered Index Scan" EstimatedTotalSubtreeCost="0.00448209" TableCardinality="1245280">

Is low

EstimatedTotalSubtreeCost="0.00448209"

With the main difference being

EstimatedRowsRead="1000000"

shown when executing the query on the SQL 2016 with SP1 applied.
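To make the difference concrete, here is a minimal sketch that reads these attributes out of a RelOp element with Python's standard XML parser. The first fragment is a trimmed copy of your operator; the second is a simplified, assumed fragment assembled from the SP1 values quoted above (its EstimateRows value is a guess), and real showplan XML carries more attributes plus an XML namespace. The summarize function is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# RTM-style operator: attributes copied (and trimmed) from the plan quoted above.
rtm_relop = (
    '<RelOp EstimateRows="1" PhysicalOp="Clustered Index Scan" '
    'EstimatedTotalSubtreeCost="0.00448209" TableCardinality="1245280" />'
)

# SP1-style operator: an assumed, simplified fragment built from the values
# quoted above; EstimateRows="1" is an assumption for illustration.
sp1_relop = (
    '<RelOp EstimateRows="1" EstimatedRowsRead="1000000" '
    'PhysicalOp="Clustered Index Scan" '
    'EstimatedTotalSubtreeCost="93.6951" TableCardinality="1000000" />'
)


def summarize(relop_xml):
    """Extract cost and row estimates from one RelOp fragment (hypothetical helper)."""
    elem = ET.fromstring(relop_xml)
    rows_read = elem.get("EstimatedRowsRead")  # None when the attribute is absent
    return {
        "operator": elem.get("PhysicalOp"),
        "cost": float(elem.get("EstimatedTotalSubtreeCost")),
        "rows_out": float(elem.get("EstimateRows")),
        "rows_read": float(rows_read) if rows_read is not None else None,
    }


for fragment in (rtm_relop, sp1_relop):
    print(summarize(fragment))
```

A large gap between rows_read and rows_out is the sign of a scan that reads many rows only to discard them with a residual predicate.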

And when testing with the NC index specified

CREATE INDEX IX_PARTITION_DATAAREAID_REVRECID
ON dbo.LEDGERJOURNALTRANS(PARTITION,DATAAREAID,REVRECID);

<RelOp AvgRowSize="980" EstimateCPU="0.0001581" EstimateIO="0.003125" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="1" LogicalOp="Clustered Index Seek" NodeId="4" Parallel="false" PhysicalOp="Clustered Index Seek" EstimatedTotalSubtreeCost="0.0032831" TableCardinality="1000000">

The EstimatedTotalSubtreeCost for the index seek (not the total for the entire plan) is also low:

EstimatedTotalSubtreeCost="0.0032831"

and the total estimated subtree cost for my test query is very close to yours:

EstimatedTotalSubtreeCost="0.00657048">

Related Solutions

Mysql – Why does MySQL choose this execution plan

Notice another contrast

The table count is 4636881
The EXPLAIN plan for the first query traversed 4000010 indexed_column keys. No need to read the last 636871 keys.
The EXPLAIN plan for the second query traversed 4636881 rows ordered by indexed_column. For every row, the non-indexed column is picked up from the table, and the indexed column (already sorted by the index) is looked up and comes along for the ride. The tmp table gets ordered by the indexed column, and mysqld then discards the first 4000000 rows, leaving 10 rows. All that interaction between table and index just for 10 rows is the bottleneck.

COMMON THINGS

In both instances, the query specifies the number of rows to traverse. Since the number of rows in the table is 4636881, we should readily expect a full scan. The contrast becomes apparent when the MySQL Query Optimizer decides where to perform the full scan.

The first query references an indexed column only in the SELECT list and the WHERE clause. The MySQL Query Optimizer chooses to perform a full index scan with no need to contact the table, since everything needed is in the index.
The second query references an indexed column in the WHERE clause. However, it must reach out to the table to retrieve the corresponding non-indexed column. The MySQL Query Optimizer was tipped off by the query that it must not use the index because of the number of rows it was expected to read. As a rule of thumb for any RDBMS, if more than 5% of a table has to be read to fulfill a query, the MySQL Query Optimizer will just throw the index 'under the bus' and do a full table scan.

Doing the math, here is what the MySQL Query Optimizer computes:

5% of 4636881 is 231844
The second query is commanded to read 4000000 rows, which is way higher than 231844
MySQL Query Optimizer realizes there would be too much interaction between table (because of the non-indexed column) and index (because of the indexed column) to get the needed data. It decides to read the table only (because both the indexed and non-indexed columns are residing together in the table) rather than bounce back and forth between them.
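The arithmetic above can be checked with a few lines of Python. This is only a sketch of the rule of thumb as stated in this answer; the 5% figure is the answer's heuristic, not a documented optimizer constant, and the variable names are made up for the example.

```python
table_rows = 4636881      # row count quoted above
rows_requested = 4000000  # rows the second query is asked to read
threshold_pct = 5         # rule-of-thumb percentage from this answer (assumption)

# Integer arithmetic avoids floating-point noise in the threshold.
threshold_rows = table_rows * threshold_pct // 100

# Above the threshold, the heuristic predicts a full table scan.
use_full_scan = rows_requested > threshold_rows

print(threshold_rows, use_full_scan)
```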

In my honest opinion, with the row count of the table, the present indexes the table has, and the number of rows prescribed by the query, the MySQL Query Optimizer made the correct decision.

RECOMMENDATION

Create this index

ALTER TABLE `table` ADD INDEX mynewndx (indexed_column,some_other_column);

and your second query will never touch the table again going forward. The MySQL Query Optimizer will behave quite differently when it sees this new index.

Sql-server – Proper Index for WHERE X <> 1 AND Y <> 1 ORDER BY Z

In general it is not possible to use an index seek on the condition x <> 1 AND y <> 1.

With an index on x,y the best you can do is convert it into two range seeks (x < 1 and x > 1) with a residual predicate on y <> 1 (and this wouldn't be able to use additional index key columns to avoid a sort).

A bit column can only hold three values: 0, 1, and NULL. Logically, WHERE bit_column <> 1 is therefore equivalent to WHERE bit_column = 0, but it seems SQL Server doesn't take advantage of that here and convert the <> conditions to = conditions for you.

Adding a couple of check constraints does the job, though, even though they are apparently redundant: they don't actually restrict the allowable values for the datatype in any way (for NULL, a check constraint that evaluates to UNKNOWN counts as passing).

CREATE TABLE MyTable
  (
     Foo     INT,
     IsFlag1 BIT NULL CHECK (IsFlag1 IN (0, 1)),
     IsFlag2 BIT NULL CHECK (IsFlag2 IN (0, 1)),
     SomeId  INT
  );

CREATE NONCLUSTERED INDEX ix
  ON MyTable(IsFlag1, IsFlag2, SomeId)
  INCLUDE (Foo);

The plan now does show a seek on IsFlag1 = 0 AND IsFlag2 = 0
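The logic this relies on can be checked outside SQL Server. The sketch below uses Python's built-in sqlite3 module purely as a stand-in: SQLite has no BIT type, so an INTEGER column is assumed, and it says nothing about SQL Server's plan choice. It only illustrates that a NULL passes the check constraint and that flag <> 1 and flag = 0 select the same rows. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# INTEGER stands in for BIT; the CHECK mirrors the constraint shown above.
conn.execute("CREATE TABLE t (flag INTEGER NULL CHECK (flag IN (0, 1)))")

# The NULL row is accepted: a CHECK that evaluates to UNKNOWN counts as passing.
conn.executemany("INSERT INTO t (flag) VALUES (?)", [(0,), (1,), (None,)])

total_rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# Under three-valued logic, NULL <> 1 and NULL = 0 are both UNKNOWN,
# so the NULL row is filtered out by either predicate.
not_one = conn.execute("SELECT rowid FROM t WHERE flag <> 1 ORDER BY rowid").fetchall()
is_zero = conn.execute("SELECT rowid FROM t WHERE flag = 0 ORDER BY rowid").fetchall()

print(total_rows, not_one, is_zero)
```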

Alternatively, this filtered index also avoids the need for a sort:

CREATE NONCLUSTERED INDEX ix 
           ON MyTable(SomeId) 
           INCLUDE (Foo,IsFlag1, IsFlag2) 
           WHERE IsFlag1 != 1 and IsFlag2 != 1

It does a scan of the filtered index (the qualifying rows ordered by SomeId) with a TOP to stop scanning after the 1,000 rows are retrieved. IsFlag1 and IsFlag2 are INCLUDE-d in the index to avoid the unnecessary lookup that would otherwise occur.
