Setup
Recently I worked on an optimization for a SQL Query which returned the Prescription Product most recently used by a Patient. The original query handled this via 3 nested CTE
's and had really poor performance after migrating to a new server. Instead of trying to make the original version work with either server tuning or index changes we modified the query using ROW_NUMBER()
partitions. This took the query execution time from about 10 minutes to about 6 seconds for approximately 215,000 rows in the final output.
Question
I am looking to see if there is a better way than how I re-wrote it since in my mind this almost seems like a hack. I just want to see if this is only working well because of the size of our data set or if this really is a good solution. I would assume we would rather use some type of solution that utilizes a couple of different JOIN
s that find this kind of information instead of using ROW_NUMBER()
and a CTE
.
Additional Details
For the sake of this question I am going to leave out certain details in our database that are less than desirable (composite primary keys, using VarChar(1)
instead of a BIT
, etc…) so that the root of the question can be better addressed.
We have 4 tables in scope:
Patient
– we are only going to return the Identity Column from this tableOrderHeader
– holds information on the package, like the address it is going to. And is the parent record toOrderDetail
. This table is tied toPatient
via aPatientID
column.OrderDetail
– creates a relationship betweenOrderHeader
andProduct
to indicate which Products went in the actual box when it was shipped outProduct
– holds the list of products
We are needing a query that returns for each patient, the top OrderDetail
record, for the top OrderHeader
record, where the Product
record tied to the OrderDetail
record Product.IsRx
= 1.
Unfortunately we can't just get the MAX
value for OrderDetail.ID
where Product.IsRx
= 1 since that OrderDetail.ID
may not belong to the MAX
value for OrderHeader.ID
. We also can't rely on MAX
value for OrderHeader.ID
since it isn't guaranteed to have a record in OrderDetail
nor is any record in there guaranteed to have a Product.IsRx
= 1
I solved this problem with the below query:
WITH CTE
(
PatientID,
OrderHeaderID,
OrderDetaillID,
ProductNDCCode,
ProductNDCDescription,
RowNumber
)
AS
(
SELECT P.ID AS 'PatientID',
H.ID AS 'OrderHeaderID',
D.ID AS 'OrderDetaillID',
P.NDCCode AS 'ProductNDCCode',
P.NDCDescription AS 'ProductNDCDescription',
ROW_NUMBER()
OVER
(
PARTITION BY P.ID
ORDER BY
P.ID ASC, --This part in the ORDER BY may not be needed, I apologize if it is unnecessary
SH.OrderHdrID DESC,
SD.OrderDtlID DESC
) AS 'RowNumber'
FROM Patient P
INNER JOIN OrderHeader H
ON P.ID = H.PatientID
INNER JOIN OrderDetail D
ON H.ID = D.OrderHeaderID
INNER JOIN Product PR
ON PR.ID = D.ProductID
WHERE PR.IsRx = 1
)
SELECT PatientID,
OrderHeaderID,
OrderDetaillID,
ProductNDCCode,
ProductNDCDescription
FROM CTE
WHERE RowNumber = 1
An guidance or expertise I could gain would be very helpful.
Best Answer
I wouldn't consider this a hack. Windowing functions, including ROW_NUMBER are fairly efficient and generally perform equal to or better than alternatives. On new instances of SQL Server, I assume that FIRST_VALUE would be more performant, but have not specifically confirmed this.
Note: I've seen LAST_VALUE produce incorrect results, and would recommend staying clear of it, opting instead to use FIRST_VALUE with the opposite sort.
A word of caution, however. If your ordering criteria are not specific enough to produce unique a single "best" record (e.g. suppose you omitted OrderDtlID, any of the detail records for that order could be first), then which item is chosen is functionally random (or, more accurately, based upon hidden variables that you can neither easily discover nor control in many cases). If you change from ROW_NUMBER to RANK and your candinality changes (you start getting a different number of results) this is a good indicator of a problem. Also worth pointing out this can be a problem with other means of picking the latest/best record, not just ROW_NUMBER.
Even if you are indifferent to which would be selected, the results of such a query may not be repeatable, though you may no see it right away (the hidden variables don't general change after each query). It can cause problems troubleshooting issues later and, if you run any sort of parallel environment (dev/qa), make accurate comparison between the two impossible.