I have the following input:
id | value
----+-------
1 | 136
2 | NULL
3 | 650
4 | NULL
5 | NULL
6 | NULL
7 | 954
8 | NULL
9 | 104
10 | NULL
I expect the following result:
id | value
----+-------
1 | 136
2 | 136
3 | 650
4 | 650
5 | 650
6 | 650
7 | 954
8 | 954
9 | 104
10 | 104
The trivial solution would be join the tables with a <
relation, and then selecting the MAX
value in a GROUP BY
:
WITH tmp AS (
SELECT t2.id, MAX(t1.id) AS lastKnownId
FROM t t1, t t2
WHERE
t1.value IS NOT NULL
AND
t2.id >= t1.id
GROUP BY t2.id
)
SELECT
tmp.id, t.value
FROM t, tmp
WHERE t.id = tmp.lastKnownId;
However, the trivial execution of this code would create internally the square of the count of the rows of the input table ( O(n^2) ). I expected t-sql to optimize it out – on a block/record level, the task to do is very easy and linear, essentially a for loop ( O(n) ).
However, on my experiments, the latest MS SQL 2016 can't optimize this query correctly, making this query impossible to execute for a large input table.
Furthermore, the query has to run quickly, making a similarly easy (but very different) cursor-based solution infeasible.
Using some memory-backed temporary table could be a good compromise, but I am not sure if it can be run significantly quicker, considered that my example query using subqueries didn't work.
I am also thinking on to dig out some windowing function from the t-sql docs, what could be tricked to do what I want. For example, cumulative sum is doing some very similar, but I couldn't trick it to give the latest non-null element, and not the sum of the elements before.
The ideal solution would be a quick query without procedural code or temporary tables. Alternatively, also a solution with temporary tables is okay, but iterating the table procedurally is not.
Best Answer
A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:
Demo: db<>fiddle