Sql-server – Same query way faster with distinct than without distinct

performancequery-performancesql serversql server 2014t-sql

I have a query that runs very fast with distinct than without distinct. when running both queries, the one with distinct takes 13% execution time, the one without distinct takes 87% execution time.

These are TSQL query in SQL Server 2014 on a database with 110 compatibility mode.

Strange thing is that the query with distinct is using seeks while the query without distinct is using 1 scan. Both query are identical and have the same where clause.

Can you help me understand why the query with the distinct is faster and why the query without distinct is not using a seek ?

Query plans : www.brentozar.com/pastetheplan/?id=ryr18jJsb

Queries:

select 
    SMS_R_SYSTEM.ItemKey
   ,SMS_R_SYSTEM.Name0,SMS_R_SYSTEM.SMS_Unique_Identifier0
   ,SMS_R_SYSTEM.Resource_Domain_OR_Workgr0
   ,SMS_R_SYSTEM.Client0
from vSMS_R_System as SMS_R_SYSTEM
   inner join Add_Remove_Programs_DATA 
      on Add_Remove_Programs_DATA.MachineID = SMS_R_System.ItemKey
   inner join Add_Remove_Programs_64_DATA 
      on Add_Remove_Programs_64_DATA.MachineID = SMS_R_System.ItemKey
where Add_Remove_Programs_DATA.DisplayName00 = 'aze'
   or Add_Remove_Programs_64_DATA.DisplayName00 = 'aze'
;

select distinct 
    SMS_R_SYSTEM.ItemKey
   ,SMS_R_SYSTEM.Name0,SMS_R_SYSTEM.SMS_Unique_Identifier0
   ,SMS_R_SYSTEM.Resource_Domain_OR_Workgr0
   ,SMS_R_SYSTEM.Client0
from vSMS_R_System as SMS_R_SYSTEM
   inner join Add_Remove_Programs_DATA 
      on Add_Remove_Programs_DATA.MachineID = SMS_R_System.ItemKey
   inner join Add_Remove_Programs_64_DATA 
      on Add_Remove_Programs_64_DATA.MachineID = SMS_R_System.ItemKey
where Add_Remove_Programs_DATA.DisplayName00 = 'aze'
   or Add_Remove_Programs_64_DATA.DisplayName00 = 'aze' 
;

Best Answer

From the way I read your execution plans, your DISTINCT query is selecting columns only from vSMS_R_System, so the optimizer chooses to read that data first (retrieving the DISTINCT values used in your SELECT) and then uses a NESTED LOOP to join up against the other tables.

The non-DISTINCT query forces the optimizer to assume that many more rows might need to be read from vSMS_R_System, so it chooses a MERGE JOIN to scan the other tables and then a NESTED LOOP to match up against vSMS_R_System.

Related Solutions

SQL Server – Different Execution Plans Based on Columns Selected from CTE

The plan without row number is below.

This is assigned a cost of 44.866.

You have a TOP without ORDER BY so SQL Server just needs to scan the clustered index and as soon as it finds the first 30,000 rows matching the predicate it can stop.

The table has 13,283,300 rows. A full clustered index scan is costed at 730.467 + 14.6118 = 745.0788 but this gets scaled down to 43.9392 because of the TOP.

Applying the same scaling of 5.9% to the number of rows in the table this would imply that SQL Server estimates that it will only have to scan 783,350 rows before it finds 30,000 matching the WHERE and can stop scanning.

NB: You say that only 474,296 rows match this predicate in the whole table but 508,747 are estimated to. That means that on average one in every 26.1 (13283300/508747) rows is assumed to match the filter. So it is estimated that 30,000 * 26.1 rows ( = 783K) will be read.

When you select * that means that the rownum column must be calculated. the plan for this is below. It is costed at 69.1185

You have an index on COLUMNE that can be seeked into. This satisfies the range predicate on COLUMNE >= 1472738400000 AND COLUMNE <= 1475244000000 and also supplies the required ordering for your row numbering.

However it does not cover the query and lookups are needed to return the missing columns. The plan estimates that there will be 30,000 such lookups. There may in fact be more as the predicate on COLUMNF = 1 may mean some rows are discarded after being looked up (though not in this case as you say COLUMNF always has a value of 1).

If the row numbering plan was to use a clustered index scan it would need to be a full scan followed by a sort of all rows matching the predicate. 69.1185 is considerably cheaper than the 745.0788 + sort cost so the plan with lookups is chosen.

You say that the plan with lookups is in fact 5 times faster than the clustered index scan. Likely a much greater proportion of the clustered index needed to be read to find 30,000 matching rows than was assumed in the costings. You are on SQL Server 2014 SP1 CU5. On SQL Server 2014 SP2 the actual execution plan now has a new attribute Actual Rows Read which would tell you how many rows it did actually read. On previous versions you can use OPTION (QUERYTRACEON 9130) to see the same information.

Mysql – Why are these two queries having such different executions

LEFT is killing performance; remove it unless you have a good reason for keeping it. To elaborate...

((but first... The purpose for the LEFT was missing from the original query; this answer assumes the LEFT was not necessary.))

LEFT JOIN says that you want data from the right table, whether or not there was a match with the left table.

JOIN says to display only the rows that match (via ON) both tables.

(OUTER is optional and adds no semantics.)

If you are running a version before 5.6, the derived table (subquery in LEFT JOIN) will have no index, so it must be scanned repeatedly. This is a big reason to get rid of LEFT.

Without the LEFT, the Optimizer is likely to evaluate the subquery once, then efficiently JOIN to reporter to finish the query.

For JOIN, the inner query needs this composite (and covering) index INDEX(messagetypeenum, createdOn, reporterid).

There is another technique (I think)... Get rid of the inner SELECT, simply JOIN (or LEFT JOIN) to the table:

select  r.*
    from  reporter r
    left join  nubamessage m  ON m.reporterid = r.id
      AND  m.messagetypeenum = 7
      and  m.createdOn >= '2016-06-18 00:00:00'
      and  m.createdOn <= '2016-06-18 09:30:00'
    WHERE  r.enabled = 1
      AND  m.reporterid is null;

In this case, it may needs INDEX(reporterid, messagetypeenum, createdOn).

Yet another variant would use EXISTS and, I think, provides the equivalent of the LEFT

select  r.*
    from  reporter r
    WHERE  r.enabled=1
      AND NOT EXISTS 
    (
        SELECT  *
            from  nubamessage m
            where  m.messagetypeenum =7
              and  m.createdOn>='2016-06-18 00:00:00'
              and  m.createdOn<='2016-06-18 09:30:00'
              AND  m.reporterid=r.id 
    )

I can't predict which variant will be fastest. It partially depends on how "many" in the many:1 mapping of m.reporterid : r.id.

Best Answer

Related Solutions

SQL Server – Different Execution Plans Based on Columns Selected from CTE

Mysql – Why are these two queries having such different executions

Related Question