Sql-server – Method for Finding Top Record with WHERE conditions in Child Tables

optimizationsql servert-sql

Setup

Recently I worked on an optimization for a SQL Query which returned the Prescription Product most recently used by a Patient. The original query handled this via 3 nested CTE's and had really poor performance after migrating to a new server. Instead of trying to make the original version work with either server tuning or index changes we modified the query using ROW_NUMBER() partitions. This took the query execution time from about 10 minutes to about 6 seconds for approximately 215,000 rows in the final output.

Question

I am looking to see if there is a better way than how I re-wrote it since in my mind this almost seems like a hack. I just want to see if this is only working well because of the size of our data set or if this really is a good solution. I would assume we would rather use some type of solution that utilizes a couple of different JOINs that find this kind of information instead of using ROW_NUMBER() and a CTE.

Additional Details

For the sake of this question I am going to leave out certain details in our database that are less than desirable (composite primary keys, using VarChar(1) instead of a BIT, etc…) so that the root of the question can be better addressed.

We have 4 tables in scope:

Patient – we are only going to return the Identity Column from this table
OrderHeader – holds information on the package, like the address it is going to. And is the parent record to OrderDetail. This table is tied to Patient via a PatientID column.
OrderDetail – creates a relationship between OrderHeader and Product to indicate which Products went in the actual box when it was shipped out
Product – holds the list of products

We are needing a query that returns for each patient, the top OrderDetail record, for the top OrderHeader record, where the Product record tied to the OrderDetail record Product.IsRx = 1.

Unfortunately we can't just get the MAX value for OrderDetail.ID where Product.IsRx = 1 since that OrderDetail.ID may not belong to the MAX value for OrderHeader.ID. We also can't rely on MAX value for OrderHeader.ID since it isn't guaranteed to have a record in OrderDetail nor is any record in there guaranteed to have a Product.IsRx = 1

I solved this problem with the below query:

WITH CTE
(
    PatientID,
    OrderHeaderID,
    OrderDetaillID,
    ProductNDCCode,
    ProductNDCDescription,
    RowNumber
)
AS
(
    SELECT P.ID AS 'PatientID',
    H.ID AS 'OrderHeaderID',
    D.ID AS 'OrderDetaillID',
    P.NDCCode AS 'ProductNDCCode',
    P.NDCDescription AS 'ProductNDCDescription',
    ROW_NUMBER() 
    OVER
    (
    PARTITION BY P.ID
    ORDER BY 
        P.ID ASC, --This part in the ORDER BY may not be needed, I apologize if it is unnecessary
        SH.OrderHdrID DESC, 
        SD.OrderDtlID DESC
    ) AS 'RowNumber'
    FROM Patient P
        INNER JOIN OrderHeader H
            ON P.ID = H.PatientID
        INNER JOIN OrderDetail D
            ON H.ID = D.OrderHeaderID
        INNER JOIN Product PR
            ON PR.ID = D.ProductID
    WHERE PR.IsRx = 1
)
SELECT PatientID,
OrderHeaderID,
OrderDetaillID,
ProductNDCCode,
ProductNDCDescription
FROM CTE
WHERE RowNumber = 1

An guidance or expertise I could gain would be very helpful.

Best Answer

I wouldn't consider this a hack. Windowing functions, including ROW_NUMBER are fairly efficient and generally perform equal to or better than alternatives. On new instances of SQL Server, I assume that FIRST_VALUE would be more performant, but have not specifically confirmed this.

Note: I've seen LAST_VALUE produce incorrect results, and would recommend staying clear of it, opting instead to use FIRST_VALUE with the opposite sort.

A word of caution, however. If your ordering criteria are not specific enough to produce unique a single "best" record (e.g. suppose you omitted OrderDtlID, any of the detail records for that order could be first), then which item is chosen is functionally random (or, more accurately, based upon hidden variables that you can neither easily discover nor control in many cases). If you change from ROW_NUMBER to RANK and your candinality changes (you start getting a different number of results) this is a good indicator of a problem. Also worth pointing out this can be a problem with other means of picking the latest/best record, not just ROW_NUMBER.

Even if you are indifferent to which would be selected, the results of such a query may not be repeatable, though you may no see it right away (the hidden variables don't general change after each query). It can cause problems troubleshooting issues later and, if you run any sort of parallel environment (dev/qa), make accurate comparison between the two impossible.

Related Solutions

SQL Server Execution Plan – SHOWPLAN vs ‘Include Execution Plan’ Warnings for the Same Query

This:

SET SHOWPLAN_XML ON;
GO
SELECT * FROM sys.objects;
GO

Is equivalent to pressing Display Estimated Execution Plan on the toolbar (or hitting Ctrl + L). You'll notice that no rows are returned from the query, like there is when you use Include Actual Execution Plan (Ctrl + M).

The spill warning is only a runtime warning. There is no way that SQL Server can know, when displaying the estimated plan, that a spill will happen at runtime. This is because a spill is caused by factors that might only be present during certain invocations of the query (for example, when there is memory pressure). The estimated plan knows roughly how much memory it's going to ask for, but it can't know until execution that it isn't going to get it.

As an aside, may I recommend* our free tool, SQL Sentry Plan Explorer? I think it provides much more obvious information than Management Studio. I recently wrote a lengthy blog post that can act as a tutorial, and Jonathan Kehayias has a great PluralSight course on it as well.

_{* Disclaimer: I work for SQL Sentry.}

T-SQL – How to Query Orders Containing a Product Without Sub-Query

I have no idea about efficiency^* but a view like this:

CREATE VIEW 
    Orders_with_a_product_View AS
SELECT 
    p.ProductID, o.*
FROM
    dbo.Product AS p
  JOIN
    dbo.[Order] AS o
      ON EXISTS
         ( SELECT 1
           FROM dbo.OrderDetail AS od
           WHERE od.OrderID = o.OrderID
             AND od.ProductId = p.ProductId
         ) ;

would allow you to use it with:

SELECT * 
FROM Orders_with_a_product_View 
WHERE ProductID = X ;

(*Regarding efficiency: Hopefully the optimizer will "push-down" the condition and not do a cross join of all products and orders.)

It might be better to follow a simpler road, like this:

CREATE VIEW 
    Order_Product_IDs AS
SELECT DISTINCT 
    ProductID, OrderID
FROM
    dbo.OrderDetail ;

and the utilize it with:

SELECT o.* 
FROM dbo.Order_Product_IDs AS op 
  JOIN dbo.[Order] AS o
    ON o.OrderID = op.OrderID
WHERE op.ProductID = X ;

Best Answer

Related Solutions

SQL Server Execution Plan – SHOWPLAN vs ‘Include Execution Plan’ Warnings for the Same Query

T-SQL – How to Query Orders Containing a Product Without Sub-Query

Related Question