Sql-server – How to select from a table without including repeated column values

sql-server-2008

In a previous question, "How to merge data sets without including redundant rows?", I asked about filtering redundant historical data during import, but @DavidSpillett correctly replied that I couldn't do what I was trying to do.

Instead of filtering the table during import, I now want to create a view on the table that returns only records where the price has changed.

Here's the original scenario rephrased to suit this question:

We have a table of historical prices for items. The table contains rows where the same price is recorded for multiple dates. I want to create a view on this data which only shows price changes over time, so if a price changes from A to B I want to see it, but if it "changes" from B to B then I don't want to see it.

Example: if the price yesterday was $1, and the price today is $1, and there were no other price changes, then the price today can be inferred from the price yesterday so I only need the record from yesterday.

Example (http://sqlfiddle.com/#!3/c95ff/1):

Table data:

Effective            Product  Kind  Price
2013-04-23T00:23:00  1234     1     1.00
2013-04-24T00:24:00  1234     1     1.00 -- redundant, implied by record 1
2013-04-25T00:25:00  1234     1     1.50
2013-04-26T00:26:00  1234     1     2.00
2013-04-27T00:27:00  1234     1     2.00 -- redundant, implied by record 4
2013-04-28T00:28:00  1234     1     1.00 -- not redundant, price changed back to 1.00
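
For anyone who wants to reproduce the sample without the fiddle, a script along these lines should work. The question only shows the rows, so the column types and the primary key definition here are assumptions rather than the actual schema.

```sql
-- Assumed schema: the types and the key column order are guesses,
-- only the column names and the data come from the question.
CREATE TABLE dbo.History
(
    Effective datetime      NOT NULL,
    Product   integer       NOT NULL,
    Kind      integer       NOT NULL,
    Price     decimal(9, 2) NOT NULL,
    CONSTRAINT PK_History
        PRIMARY KEY CLUSTERED (Effective, Product, Kind)
);

-- The six sample rows from the table above
INSERT INTO dbo.History
    (Effective, Product, Kind, Price)
VALUES
    ('2013-04-23T00:23:00', 1234, 1, 1.00),
    ('2013-04-24T00:24:00', 1234, 1, 1.00),
    ('2013-04-25T00:25:00', 1234, 1, 1.50),
    ('2013-04-26T00:26:00', 1234, 1, 2.00),
    ('2013-04-27T00:27:00', 1234, 1, 2.00),
    ('2013-04-28T00:28:00', 1234, 1, 1.00);
```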

Expected view data:

Effective            Product  Kind  Price
2013-04-23T00:23:00  1234     1     1.00
2013-04-25T00:25:00  1234     1     1.50
2013-04-26T00:26:00  1234     1     2.00
2013-04-28T00:28:00  1234     1     1.00

My initial attempt used ROW_NUMBER:

SELECT
    Effective,
    Product,
    Kind,
    Price
FROM
(
    SELECT
        History.*,
        ROW_NUMBER() OVER
        (
            PARTITION BY
                Product,
                Kind,
                Price
            ORDER BY
                Effective ASC
        ) AS RowNumber
    FROM History
) H
WHERE RowNumber = 1
ORDER BY Effective

Which returned:

Effective               Product  Kind  Price
2013-04-23T00:23:00     1234     1     1.00
                                             -- not 2013-04-24, good
2013-04-25T00:25:00     1234     1     1.50
2013-04-26T00:26:00     1234     1     2.00
                                             -- not 2013-04-27, good
                                             -- not 2013-04-28, bad

I tried searching for a similar question/answer, but it's hard to work out how to phrase the search; an example is worth a lot of words.

Any suggestions appreciated. Thanks

Best Answer

SQLfiddle

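-- For each row H, look up the price of the immediately preceding row
-- (same Product and Kind, latest earlier Effective). X.Price is NULL
-- when H is the first row for its Product and Kind.
SELECT H.*
FROM History AS H
OUTER APPLY
(
    SELECT TOP (1)
        H2.Price
    FROM History AS H2
    WHERE
        H2.Product = H.Product
        AND H2.Kind = H.Kind
        AND H2.Effective < H.Effective
    ORDER BY
        H2.Effective DESC
) AS X
WHERE
    -- Keep H only when its price differs from the previous price.
    -- INTERSECT treats NULLs as equal, so this is a NULL-safe "not equal":
    -- a first row with a non-NULL price (X.Price IS NULL) is kept too.
    NOT EXISTS (SELECT X.Price INTERSECT SELECT H.Price);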
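
Since the question asks for a view, the same query could be wrapped roughly as follows. This is only a sketch: the view name dbo.HistoryPriceChanges is hypothetical, and it assumes the table lives in the dbo schema with the four columns shown in the question.

```sql
-- Hypothetical view name; assumes dbo.History has the columns
-- Effective, Product, Kind and Price as shown in the question.
CREATE VIEW dbo.HistoryPriceChanges
AS
SELECT
    H.Effective,
    H.Product,
    H.Kind,
    H.Price
FROM dbo.History AS H
OUTER APPLY
(
    -- Price of the immediately preceding row for the same Product and Kind
    SELECT TOP (1)
        H2.Price
    FROM dbo.History AS H2
    WHERE
        H2.Product = H.Product
        AND H2.Kind = H.Kind
        AND H2.Effective < H.Effective
    ORDER BY
        H2.Effective DESC
) AS X
WHERE
    NOT EXISTS (SELECT X.Price INTERSECT SELECT H.Price);
GO  -- batch separator for SSMS/sqlcmd; CREATE VIEW needs its own batch

-- A view has no guaranteed order, so sort when querying it
SELECT Effective, Product, Kind, Price
FROM dbo.HistoryPriceChanges
ORDER BY Product, Kind, Effective;
```

The column list is spelled out instead of using H.* so that later changes to the table do not silently change the shape of the view.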

Execution plan

This is a good execution plan for the small number of rows given in the question. For a larger table, the ideal index for this query is:

CREATE UNIQUE INDEX [dbo.History Product, Kind, Effective]
ON dbo.History (Product, Kind, Effective DESC)
INCLUDE (Price);

Execution plan 2

That index is essentially the clustered index keys in a more helpful order. Depending on how the table is used for other queries, it might be better to replace the clustered index instead of creating this new index.

Do not use batch estimated cost percentages to compare different queries; this is not a valid comparison in general. The costs are always optimizer estimates and are not intended to be used this way. By all means check actual performance metrics (elapsed time, I/O, CPU usage, memory usage), but do not put your faith in the percentages.
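
One simple way to collect some of those metrics is to turn on the session statistics options before running each candidate query. A minimal sketch, assuming the queries are run from SSMS or sqlcmd so the output appears with the messages:

```sql
-- Report logical/physical reads per table, plus CPU and elapsed time
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the query being measured here, for example the OUTER APPLY
-- query from this answer, then repeat with the alternative query.

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

These options cover I/O, CPU and elapsed time only; memory usage has to be checked separately, for example from the actual execution plan.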

Whether this query or the one based on ROW_NUMBER is more efficient depends on the distribution of the data and other factors. Each has its advantages and disadvantages. This query has advantages when there are many rows for each (product, kind) combination.
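
For comparison, one possible ROW_NUMBER formulation that works on SQL Server 2008 (which has no LAG) is sketched below. It is not necessarily the exact ROW_NUMBER query referred to above, and it assumes Price is never NULL. The attempt in the question fails because it includes Price in the PARTITION BY, which puts both runs of 1.00 into the same partition; here the rows are numbered per (Product, Kind) only, and each row is compared with the one numbered just before it.

```sql
-- Sketch only: assumes dbo.History as in the question and Price NOT NULL
WITH Ordered AS
(
    SELECT
        H.Effective,
        H.Product,
        H.Kind,
        H.Price,
        ROW_NUMBER() OVER
        (
            PARTITION BY H.Product, H.Kind
            ORDER BY H.Effective ASC
        ) AS rn
    FROM dbo.History AS H
)
SELECT
    Cur.Effective,
    Cur.Product,
    Cur.Kind,
    Cur.Price
FROM Ordered AS Cur
LEFT JOIN Ordered AS Prev
    ON  Prev.Product = Cur.Product
    AND Prev.Kind = Cur.Kind
    AND Prev.rn = Cur.rn - 1          -- the immediately preceding row
WHERE
    Prev.rn IS NULL                   -- first row for this Product and Kind
    OR Prev.Price <> Cur.Price        -- price changed from the previous row
ORDER BY
    Cur.Product,
    Cur.Kind,
    Cur.Effective;
```

On the six sample rows this keeps the rows for 2013-04-23, 2013-04-25, 2013-04-26 and 2013-04-28, matching the expected view data in the question.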