Sql-server – Join on primary keys to primary key vs primary key in range

sql server

Assume we have a #RelevantOrders table and an OrderRow table.

create table #RelevantOrders(Order_Id int primary key)

create table OrderRow (Order_Id int, OrderRow_Id int, [lots of data columns], primary key (Order_Id, OrderRow_Id))

What I instinctively did at first was a simple join:

select OrderRow.* from #RelevantOrders as ro join OrderRow on OrderRow.Order_Id = ro.Order_Id

Now, due to the nature of the data, the Order_Id values in #RelevantOrders form several contiguous ranges of IDs, i.e. 1,2,3,4,5, 100,101…120, 240..250 … I hope that is clear.

So what happens is that if there are 1000 relevant orders, the execution plan does 1000 index seeks on the OrderRow table, one per Order_Id. This leads to a fairly 'big' scan count and logical reads (~5000).
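For reference, the scan count and logical read figures can be captured with the standard session statistics options; this is just a minimal sketch using the tables defined above:

set statistics io on;
set statistics time on;

select OrderRow.*
from #RelevantOrders as ro
join OrderRow on OrderRow.Order_Id = ro.Order_Id;

set statistics io off;
set statistics time off;
-- The Messages tab then reports scan count, logical reads, CPU time and elapsed time.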

That got me thinking: I would expect that instead of doing 1000 index seeks, it could do ~80 index seeks (the number of contiguous ranges).

So I tried an alternative:

create table #RelevantOrderRanges(f int, t int)

select OrderRow.* from #RelevantOrderRanges
join OrderRow on OrderRow.Order_Id between f and t
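For completeness, one way to collapse #RelevantOrders into those (f, t) ranges is the classic gaps-and-islands pattern. This is only a sketch built on the tables defined above, not code from the original question:

-- Consecutive Order_Id values share the same (Order_Id - row_number) difference,
-- so grouping by that difference yields one row per contiguous range.
insert into #RelevantOrderRanges (f, t)
select min(Order_Id), max(Order_Id)
from (
    select Order_Id,
           Order_Id - row_number() over (order by Order_Id) as grp
    from #RelevantOrders
) as islands
group by grp;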

The good news is that the number of index seeks and logical reads drops to 80 index seeks (equal to the number of ranges) and 609 logical reads. Much better. The only issue I have here, based on past experience with this causing queries to gobble up memory, is the estimated number of rows and estimated data size, which goes up to about 3.5 GB in this case instead of 49 KB.

I also know that if the number of ranges approaches the number of orders, any gain is lost.

So this got me thinking, and I have two questions:
1) Am I overthinking the estimates? I know more testing is required, but I am a bit afraid the estimates will bite me later as a memory hog.

2) Is there any way to 'better explain' to SQL Server what to expect? Indexes, hints, statistics…? (A few generic options are sketched below.)
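A few generic ways to give the optimizer better information about the temp tables; these are standard T-SQL techniques sketched against the tables above, not a tested fix for this particular plan:

-- 1) Give the ranges table a clustered index so the optimizer has a sorted
--    input and its own statistics on f and t.
create clustered index IX_RelevantOrderRanges on #RelevantOrderRanges (f, t);

-- 2) Explicitly refresh statistics once the temp table is populated.
update statistics #RelevantOrderRanges;

-- 3) Ask for a recompile so the plan is costed with the actual row counts
--    present in the temp table at execution time.
select OrderRow.*
from #RelevantOrderRanges as r
join OrderRow on OrderRow.Order_Id between r.f and r.t
option (recompile);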

Thanks for the discussion.

PERFORMANCE UPDATE:
TC 1:
Order_Id count: 9,045
Range count: 5,499
Total rows produced:

Range based approach:
~800 ms CPU, ~800 ms elapsed time
Scan count: 5,500
Logical reads: 37,896

ID join approach:
~900 ms CPU, ~900 ms elapsed time
Scan count: 9,045
Logical reads: 56,774

TC 2:
Order_Id count: 19,684
Range count: 12,849
Total rows produced: 124,561

Range based approach:
~718 ms CPU, ~770 ms elapsed time
Scan count: 12,849
Logical reads: 74,127

ID join approach:
~950 ms CPU, ~950 ms elapsed time
Scan count: 19,684
Logical reads: 110,352

Best Answer

Edit: I misunderstood the question. I thought this was about app design rather than query performance.

For an app: personally, I would avoid using an intermediate table. It's just extra steps and complication.

The #table makes sense for queries.

Provided you have useful indexes and your stats are up to date, you are typically better off leaving SQL Server to decide the best execution plan.
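For example, the standard commands for keeping statistics fresh on the base table (generic maintenance, not anything specific to this schema):

update statistics OrderRow with fullscan;  -- rebuild statistics on the joined table
exec sp_updatestats;                       -- or refresh whatever the database considers stale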

Depending on the growth and distribution of the data, the best execution plan may change over time, so I would be reluctant to invest a lot of time and effort coding to squeeze out the last 1% of extra performance.