SQL Server – Using TOP and Getting Different Result Sets

sql servert-sqltop

I am trying to write an efficient query for deleting chunks of data. To this end I hoped to avoid an index scan by using the primary key to get the oldest records. However I'm seeing some unexpected results returned.

I hoped this

SELECT TOP 15 OrderID FROM [Order]

Would give me the oldest 15 records because I could rely on the primary key incrementing and therefore the order of storage would be low to high in the table.

However, this returns a different result set

SELECT TOP 15 OrderID FROM [Order] ORDER BY DateCreated ASC

Which seems to be a more accurate but more expensive way of getting the result I need.

Confusingly, this

SELECT TOP 15 * FROM [Order]

Gives a different set of OrderID s (PK) to this

SELECT TOP 15 OrderID FROM [Order]

I understand that http://msdn.microsoft.com/en-gb/library/ms189463.aspx explains that order cannot be guaranteed without an ORDER BY clause but expected the PK to order for me and can't explain the differences between the final two select clauses.

Best Answer

Take a look at the plans. When you use SELECT * it probably uses the clustered index, and when you only want one column, maybe there is a skinnier index to use.

Don't "expect" a certain order. If you don't tell SQL Server how to order, then it will use the most efficient way possible, and this can change due to probably more than 20 factors.

If you want a certain order, SAY SO. Please read #3 here:

T-SQL Tuesday #56 : SQL Server Assumptions

Also, this post by Michael Swart may be an interesting read:

Without ORDER BY, You Can’t Depend On the Order of Results

If you want your second query to be more efficient, you can consider creating an index on DateCreated (you may want to include OrderID - not sure of the present index structure).

For your actual goal of deleting n rows at a time, oldest first, and assuming OrderID is an IDENTITY column (so order create date should roughly align with that), why not use this approach (based on this great blog post, also by Michael Swart):

-- pick a datetime for the newest row you want to delete
-- let's say you want to delete all orders before Jan 1 2014:

SELECT @MaxOrderID = MAX(OrderID)
  FROM dbo.[Order] -- terrible table name, also always use dbo prefix
  WHERE DateCreated < '20140101';

DECLARE @BatchSize INT = 1000,
        @LargestOrderProcessed INT = -1,
        @NextBatchMax INT,
        @RC INT = 1;

WHILE (@RC > 0)
BEGIN
  SELECT TOP (@BatchSize) @NextBatchMax = OrderID
    FROM dbo.[Order]
    WHERE OrderID > @LargestOrderProcessed
    AND OrderID <= @MaxOrderID
    ORDER BY OrderID;

  DELETE dbo.[Order]
    WHERE OrderID > @LargestOrderProcessed
    AND OrderID <= @NextBatchMax;

  SET @RC = @@ROWCOUNT;
  SET @LargestOrderProcessed = @NextBatchMax;
END

In order to minimize the impact on the log, you may want to add some additional logic in there, from my blog post Break large delete operations into chunks. As for the dbo prefix, see Bad habits to kick : Avoiding the schema prefix.

Related Solutions

SQL Server – Does Concatenation Physical Operation Guarantee Order of Execution?

No, there is no documentation from Microsoft guaranteeing the behavior, therefore it is not guaranteed.

Additionally, assuming that the Simple Talk article is correct, and that the Concatenation physical operator always processes inputs in the order shown in the plan (very likely to be true), then without a guarantee that SQL Server will always generate plans that keep the same the order between the query text and the query plan, you're only slightly better off.

We can investigate this further though. If the query optimizer was able to reorder the Concatenation operator input, there should exist rows in the undocumented DMV, sys.dm_exec_query_transformation_stats corresponding to that optimization.

SELECT * FROM sys.dm_exec_query_transformation_stats 
    WHERE name LIKE '%CON%' OR name LIKE '%UNIA%'

On SQL Server 2012 Enterprise Edition, this produces 24 rows. Ignoring the false matches for transformations related to constants, there is one transformation related to the Concatenation Physical Operator UNIAtoCON (Union All to Concatenation). So, at the physical operator level, it appears that once a concatenation operator is selected, it will be processed in the order of the logical Union All operator it was derived from.

In fact that is not quite true. Post-optimization rewrites exist that can reorder the inputs to a physical Concatenation operator after cost-based optimization has completed. One example occurs when the Concatenation is subject to a row goal (so it may be important to read from the cheaper input first). See UNION ALL Optimization by Paul White for more details.

That late physical rewrite was functional up to and including SQL Server 2008 R2, but a regression meant it no longer applied to SQL Server 2012 and later. A fix has been issued that reinstates this rewrite for SQL Server 2014 and later (not 2012) with query optimizer hotfixes enabled (e.g. trace flag 4199).

But about the Logical Union All operator (UNIA)? There is a UNIAReorderInputs transformation, which can reorder the inputs. There are also two physical operators that can be used to implement a logical Union All, UNIAtoCON and UNIAtoMERGE (Union All to Merge Union).

Therefore it appears that the query optimizer can reorder the inputs for a UNION ALL; however, it doesn't appear to be a common transformation (zero uses of UNIAReorderInputs on the SQL Servers I have readily accessible. We don't know the circumstances that would make the optimizer use UNIAReorderInputs; though it is certainly used when a plan guide or use plan hint is used to force a plan generated using the row goal physical reordered inputs mentioned above.

Is there a way to have the engine process more than one input at a time?

The Concatenation physical operator can exist within a parallel section of a plan. With some difficulty, I was able to produce a plan with parallel concatenations using the following query:

SELECT userid, regdate  FROM (  --Users table is around 3mil rows
    SELECT  userid, RegDate FROM users WHERE userid > 1000000
    UNION 
    SELECT  userid, RegDate FROM users WHERE userid < 1000000
    UNION all
    SELECT userid, RegDate FROM users WHERE userid < 2000000
    ) d ORDER BY RegDate OPTION (RECOMPILE)

So, in the strictest sense, the physical Concatenation operator does seem to always process inputs in a consistent fashion (top one first, bottom second); however, the optimizer could switch the order of the inputs before choosing the physical operator, or use a Merge union instead of a Concatenation.

Sql-server – how to select top ID from join operation and staging tables

Window Functions, or the OVER() Clause

It appears that what you are looking for is a window function or an OVER() clause.

In your original example, you are trying to use two max() conditions, which doesn't work, because when you try to then apply GROUP BY, you can have a condition where the max of the first column ID and the second column startdate aren't in the same row, and so then GROUP BY simply can't be understood.

However, if you are looking to extract the max over a grouping, and you can define that grouping, and you want to obtain the 'column-wise' maximum for more than one column for that grouping (as you seem to want to do), then here's the solution.

CREATE TABLE test (
  jobcode integer,
  id integer, 
  a_startdate datetime,
  a_enddate datetime,
  b_startdate datetime,
  b_enddate datetime);

  INSERT INTO test VALUES (513801, 7136, '11-01-2011', '12-31-9998', '11-01-2011', '12-31-9998');
  INSERT INTO test VALUES (513801, 7137,'04-26-2014', '12-31-9998','04-26-2014', '12-31-9998');

I first made this table to recreate your input data in the picture. I then applied this query.

SELECT jobcode, max(id) OVER (PARTITION BY jobcode), 
max(a_startdate) OVER(PARTITION BY jobcode), 
a_enddate,  b_startdate, b_enddate FROM test;

Try out this SQL Fiddle, and see if it gives the results you're looking for. I did the best I could with the description I had.

Best Answer

Related Solutions

SQL Server – Does Concatenation Physical Operation Guarantee Order of Execution?

Sql-server – how to select top ID from join operation and staging tables

Window Functions, or the OVER() Clause

Related Question