Sql-server – Method for Finding Top Record with WHERE conditions in Child Tables

optimizationsql servert-sql

Setup

Recently I worked on an optimization for a SQL Query which returned the Prescription Product most recently used by a Patient. The original query handled this via 3 nested CTE's and had really poor performance after migrating to a new server. Instead of trying to make the original version work with either server tuning or index changes we modified the query using ROW_NUMBER() partitions. This took the query execution time from about 10 minutes to about 6 seconds for approximately 215,000 rows in the final output.

Question

I am looking to see if there is a better way than how I re-wrote it since in my mind this almost seems like a hack. I just want to see if this is only working well because of the size of our data set or if this really is a good solution. I would assume we would rather use some type of solution that utilizes a couple of different JOINs that find this kind of information instead of using ROW_NUMBER() and a CTE.

Additional Details

For the sake of this question I am going to leave out certain details in our database that are less than desirable (composite primary keys, using VarChar(1) instead of a BIT, etc…) so that the root of the question can be better addressed.

We have 4 tables in scope:

  • Patient – we are only going to return the Identity Column from this table
  • OrderHeader – holds information on the package, like the address it is going to. And is the parent record to OrderDetail. This table is tied to Patient via a PatientID column.
  • OrderDetail – creates a relationship between OrderHeader and Product to indicate which Products went in the actual box when it was shipped out
  • Product – holds the list of products

We are needing a query that returns for each patient, the top OrderDetail record, for the top OrderHeader record, where the Product record tied to the OrderDetail record Product.IsRx = 1.

Unfortunately we can't just get the MAX value for OrderDetail.ID where Product.IsRx = 1 since that OrderDetail.ID may not belong to the MAX value for OrderHeader.ID. We also can't rely on MAX value for OrderHeader.ID since it isn't guaranteed to have a record in OrderDetail nor is any record in there guaranteed to have a Product.IsRx = 1

I solved this problem with the below query:

WITH CTE
(
    PatientID,
    OrderHeaderID,
    OrderDetaillID,
    ProductNDCCode,
    ProductNDCDescription,
    RowNumber
)
AS
(
    SELECT P.ID AS 'PatientID',
    H.ID AS 'OrderHeaderID',
    D.ID AS 'OrderDetaillID',
    P.NDCCode AS 'ProductNDCCode',
    P.NDCDescription AS 'ProductNDCDescription',
    ROW_NUMBER() 
    OVER
    (
    PARTITION BY P.ID
    ORDER BY 
        P.ID ASC, --This part in the ORDER BY may not be needed, I apologize if it is unnecessary
        SH.OrderHdrID DESC, 
        SD.OrderDtlID DESC
    ) AS 'RowNumber'
    FROM Patient P
        INNER JOIN OrderHeader H
            ON P.ID = H.PatientID
        INNER JOIN OrderDetail D
            ON H.ID = D.OrderHeaderID
        INNER JOIN Product PR
            ON PR.ID = D.ProductID
    WHERE PR.IsRx = 1
)
SELECT PatientID,
OrderHeaderID,
OrderDetaillID,
ProductNDCCode,
ProductNDCDescription
FROM CTE
WHERE RowNumber = 1

An guidance or expertise I could gain would be very helpful.

Best Answer

I wouldn't consider this a hack. Windowing functions, including ROW_NUMBER are fairly efficient and generally perform equal to or better than alternatives. On new instances of SQL Server, I assume that FIRST_VALUE would be more performant, but have not specifically confirmed this.

Note: I've seen LAST_VALUE produce incorrect results, and would recommend staying clear of it, opting instead to use FIRST_VALUE with the opposite sort.

A word of caution, however. If your ordering criteria are not specific enough to produce unique a single "best" record (e.g. suppose you omitted OrderDtlID, any of the detail records for that order could be first), then which item is chosen is functionally random (or, more accurately, based upon hidden variables that you can neither easily discover nor control in many cases). If you change from ROW_NUMBER to RANK and your candinality changes (you start getting a different number of results) this is a good indicator of a problem. Also worth pointing out this can be a problem with other means of picking the latest/best record, not just ROW_NUMBER.

Even if you are indifferent to which would be selected, the results of such a query may not be repeatable, though you may no see it right away (the hidden variables don't general change after each query). It can cause problems troubleshooting issues later and, if you run any sort of parallel environment (dev/qa), make accurate comparison between the two impossible.