I saw a concise TSQL statement that effectively splits a string into its constituent characters, one per line, for the purpose of evaluating the ascii
value on each character.
If I am reading the query correctly, effectively, 3 CTEs are being used to prepare a table of 1 column containing 10,000 rows, each with the value '0'.
A fourth CTE is defined as follows:
cteTally(n) AS(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n
FROM E4
)
Subsequently, this CTE is joined to a table containing a column with the strings of interest, with the following select
:
SELECT n, SUBSTRING(LastName, n, 1), ASCII( SUBSTRING(LastName, n, 1))
That is, row number n, then the nth character in LastName, then the ascii value of that character.
My questions relate to the over
clause in the CTE above.
Essentially, what exactly is it doing?
If we are querying row_number from 10,000 identical rows, why do we need an order by
clause at all? Why is the order by
put into an over
clause rather than as an order by
clause for the select
statement – especially as the over
clause isn't even specifying any partition? (I presume this means the window over which row_number
operates is the full 10,000 rows?) And what does it mean to order by select null
?
Best Answer
ROW_NUMBER() is a ranking window function, and ranking window functions require a mandatory ORDER BY clause. If you try to write it without the ORDER BY you will get a syntax error.
The trick with the subquery was discovered by someone who blogged about it, as a performance optimization. SQL Server always performs a sort operation since constants are not allowed for the ORDER BY clause:
And neither are integers which are treated as indices:
Turns out that due to some glitch in the code, you can circumvent this limitation by using a subquery, which for some reason is allowed, and eliminates the sort operator.
You can use any constant in the sub query, the NULL is probably a reminiscent of the habit of using SELECT NULL within EXISTS predicates, which in the early history of SQL Server had a performance impact as opposed to * or any other column expression, as the optimizer wasn't smart enough to ignore it.
HTH
UPDATE @Erik-Darling commented that you can also circumvent it by using a computed expressions: