Sql-server – Why does a many to one merge join change the sort order of the data set

azure-sql-data-warehouseexecution-planoptimizationperformancequery-performancesql server

I have a query that calculates row_Number(). The table has a clustered index on the same columns (and order) as the partitioning and ordering for the row number.

When using a merge join (many to one), a sort is required, even though the clustered index is in the correct order. Removing the join also removes the sort operation.

The clustered index that should power the row_number calculation:

create clustered index [ClusteredIndex_e060df3fbf464a8eb9b6ea5d46a9a5f5] on [dbo].[log1]
(
    [client] asc,
    [orderId] asc,
    [campaign] asc,
    [id] asc,
    [DateStamp] asc
)

create clustered index [ClusteredIndex_dd0ee53e050d436cba2cab7c678a39e5] on [dbo].[LiveReference]
(
    [client] asc,
    [orderId] asc,
    [campaign] asc
)

The query:

    with cr as 
(
    select distinct client, orderId,campaign
    from LiveReference
)

select e.[DateStamp]
  ,e.[campaign]
  ,e.[client]
  ,e.[orderId]
  ,e.[ad]
  ,e.[id]
  ,e.[source]
,row_number() over (partition by e.[client] ,
                        e.[orderId] ,
                        e.[campaign] ,
                        e.[id] 
                    order by e.[DateStamp]) as num
from [dbo].[log1] e 
inner join cr on 
                        e.client = cr.client
                        and e.campaign = cr.campaign
                        and e.orderId =  cr.orderId

which gives the below plan:
enter image description here

Removing the join also removes the sort:

select e.[DateStamp]
  ,e.[campaign]
  ,e.[client]
  ,e.[orderId]
  ,e.[ad]
  ,e.[id]
  ,e.[source]
,row_number() over (partition by e.[client] ,
                        e.[orderId] ,
                        e.[campaign] ,
                        e.[id] 
                    order by e.[DateStamp]) as num
from [dbo].[log1] e 

(I know this also removes the filtering performed by the join, but this doesn't explain why excluding those rows changes the order)

enter image description here

Why would the results of a sorted join then not be in the correct order?

Best Answer

Generally speaking, merge join (including merge join concatenation) only preserves the sort order of the join keys.

The merge join keys are client, campaign, orderId. The required input sort order for the window function is client, orderId, campaign, id , datestamp.

The merge join cannot therefore provide the sort order required by your window function. You could avoid the sort with a nested loops join (e.g. using a hint).

nested loops plan

I wrote about the details in Avoiding Sorts with Merge Join Concatenation.