SQL Server Window Function – How ORDER BY Affects MAX() OVER() Result

querysql serversql-server-2016t-sqlwindow functions

I have a table with below structure and it's data :

create table test_table
(
Item_index   int,
Item_name    varchar(50)
)

insert into test_table (Item_index,Item_name) values (0,'A')
insert into test_table (Item_index,Item_name) values (1,'B')
insert into test_table (Item_index,Item_name) values (0,'C')
insert into test_table (Item_index,Item_name) values (1,'D')
insert into test_table (Item_index,Item_name) values (0,'E')

I want to know why changing the column in order by section of the query , changes the result? In QUERY-1 , I used item_index and in the QUERY-2 I used item_name column in the order by section. I thought that both queries must generate the same result because I used item_index in both queries for partitioning! I'm completely confused now ! why the order by column should affect the final result?

QUERY-1:

select t.*,
       max(t.Item_name)over(partition by t.item_index order by item_index) new_column
from test_table t;

RESULT:

Item_index  Item_name     new_column
----------- --------------------------
0           A                E
0           C                E
0           E                E
1           D                D
1           B                D

QUERY-2:

select t.*,
       max(t.Item_name)over(partition by t.item_index order by item_name) new_column
from test_table t;

RESULT:

Item_index  Item_name  new_column
----------- -----------------------
0           A             A
0           C             C
0           E             E
1           B             B
1           D             D

Can anybody explain how exactly these two queries are being executed and why each of them generates different result?

Thanks in advance

Best Answer

max(t.Item_name)over(partition by t.item_index order by item_index) new_column

Let's take a group where t.item_index = 0. It is

Item_index	Item_name
0	A
0	C
0	E

When order by item_index is applied then all rows have the same value, hence all of them are included into the frame, and all rows values are used for MAX() selection. So the value 'E' is returned for all rows.

max(t.Item_name)over(partition by t.item_index order by item_name)

Let's take the same group.

Item_index	Item_name
0	A
0	C
0	E

Now the sorting key differs, and when the window RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is applied then different rows are included into the frame to be investigated.

For 1st row only this row is included into the frame, and 'A' is the only value in the frame, so it is returned.

For 2nd row first 2 rows are included into the frame, the values 'A' and 'C' are compared, and 'C' is returned as maximal value in the frame.

For 3rd row all 3 rows are included into the frame, the values 'A', 'C' and 'E' are compared, and 'E' is returned as maximal value in the frame.

Related Solutions

SQL Server – Aligned Index Column Order

You need to think about your partitions as if they are separate tables.

A clustered index on a partitioned table effectively has the partition number as its first key. For you, with a partition per unique PartitionKey, the clustered index will essentially have that column in the front of it. Or as good as...

I'd expect having Id first would feel better, in case you don't know the month and need to search all partitions. It'll seek on each partition rather than scan them.

Is the WHERE-JOIN-ORDER-(SELECT) Rule for Index Column Order Incorrect?

Is the WHERE-JOIN-ORDER-(SELECT) rule for index column order wrong?

At the least it is incomplete and potentially misleading advice (I didn't bother to read the whole article). If you're going to read stuff on the Internet (including this), you should adjust your amount of trust according to how well you already know and trust the author, but always then verify for yourself.

There are a number of "rules of thumb" for creating indexes, depending on the exact scenario, but none are really a good substitute for understanding the core issues for yourself. Read up on the implementation of indexes and execution plan operators in SQL Server, go through some exercises, and come to a good solid understanding of how indexes can be used to make execution plans more efficient. There is no effective shortcut to attaining this knowledge and experience.

In general, I can say that your indexes should most often have columns used for equality tests first, with any inequalities last, and/or provided by a filter on the index. This is not a complete statement, because indexes can also provide order, which may be more useful than seeking directly to one or more keys in some situations. For example, ordering can be used to avoid a sort, to reduce the cost of a physical join option like merge join, to enable a stream aggregate, find the first few qualifying rows quickly...and so on.

I'm being a little vague here, because selecting the ideal index(es) for a query depends on so many factors - this is a very broad topic.

Anyway, it is not unusual to find conflicting signals for the 'best' indexes in a query. For example, your join predicate would like rows ordered one way for a merge join, the group by would like rows sorted another way for a stream aggregate, and finding the qualifying rows using the where clause predicates would suggest other indexes.

The reason indexing is an art as well as science is that an ideal combination is not always logically possible. Choosing the best compromise indexes for the workload (not just a single query) requires analytic skills, experience, and system-specific knowledge. If it were easy, the automated tools would be perfect, and performance-tuning consultants would be much less in demand.

As far as missing index suggestions are concerned: these are opportunistic. The optimizer brings them to your attention when it tries to match predicates and required sort order to an index that does not exist. The suggestions are therefore based on particular matching attempts in the specific context of the particular sub-plan variation it was considering at the time.

In context, the suggestions always make sense, in terms of reducing the estimated cost of data access, according to the optimizer's model. It does not do a wider analysis of the query as a whole (much less the wider workload), so you should think of these suggestions as a gentle hint that a skilled person needs to look at the available indexes, with the suggestions as a starting point (and usually no more than that).

In your case, the (Status) INCLUDE (ID) suggestion probably came about when it was looking at the possibility of a hash or merge join (example later). In that narrow context, the suggestion makes sense. For the query as a whole, maybe not. The index (ID, Status) enables a nested loop join with ID as an outer reference: equality seek on ID and inequality on Status per iteration.

One possible selection of indexes is:

CREATE INDEX i1 ON dbo.I (ID, [Status]);
CREATE INDEX i1 ON dbo.IP (Deleted, OPID, IID) INCLUDE (Q);

...which produces a plan like:

I am not saying these indexes are optimal for you; they happen to work to produce a reasonable-looking plan to me, without being able to see statistics for the tables involved, or the full definitions and existing indexing. Also, I know nothing of the wider workload or real query.

Alternatively (just to show one of the myriad additional possibilities):

CREATE INDEX i1 ON dbo.I ([Status]) INCLUDE (ID);
CREATE INDEX i1 ON dbo.IP (Deleted, IID, OPID) INCLUDE (Q);

Gives:

Execution plans were generated using SQL Sentry Plan Explorer.

Best Answer

Related Solutions

SQL Server – Aligned Index Column Order

Is the WHERE-JOIN-ORDER-(SELECT) Rule for Index Column Order Incorrect?

Related Question