SQL Server – Indexing strategy when using the BETWEEN operator (SQL Server 2008)

Tags: index, optimization, sql-server

I have a large table ~25 million rows with the structure

CREATE TABLE [dbo].[rx](
            [clmid] [int] NOT NULL, -- definition missing from the original post; type assumed
            [pat_id] [int] NOT NULL,
            [fill_Date] [date] NOT NULL,
            [script_End_Date] AS (DATEADD(DAY, [days_Sup], [fill_Date])) PERSISTED,
            [drug_Name] [varchar](50) NULL,
            [days_Sup] [int] NOT NULL,
            [quantity] [float] NOT NULL,
            [drug_Class] [char](3) NOT NULL,
            CHECK ([fill_Date] <= [script_End_Date]),
            PRIMARY KEY NONCLUSTERED ([clmid])
)


create clustered index ix_rx_temporal on rx(fill_date asc, script_end_date asc, pat_id asc)

The primary key on this table is never queried on. By far, the most common queries against this table involve date ranges. I have a calendar table with the structure

CREATE TABLE [dbo].[Calendar](
             [cal_date] [date] PRIMARY KEY,
             [Year] AS YEAR(cal_date) PERSISTED,
             [Month] AS MONTH(cal_date) PERSISTED,
             [Day] AS DAY(cal_date) PERSISTED,
             [julian_seq] AS 1 + DATEDIFF(DD, CONVERT(date, CONVERT(varchar(4), YEAR(cal_date)) + '0101'), cal_date));
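For reference, a calendar table like this can be populated with a single set-based statement. A sketch, assuming a range of 2000-01-01 through 2049-12-31 (pick whatever range you need); the computed columns fill themselves in:

```sql
;WITH n AS (
    -- generate one integer per day in the range
    SELECT TOP (DATEDIFF(DAY, '20000101', '20500101'))
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS i
    FROM   sys.all_objects AS a
           CROSS JOIN sys.all_objects AS b
)
INSERT INTO dbo.Calendar (cal_date)
SELECT DATEADD(DAY, i, '20000101')
FROM   n;
```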

The query that I'm trying to speed up is:

;WITH x 
     AS ( 
        --join finds the number of distinct drugs that a person was prescribed on a given day
        SELECT rx.pat_id, 
               c.cal_date, 
               Count(DISTINCT rx.drug_name) AS distinctDrugs 
         FROM   rx, 
                calendar AS c 
         WHERE  c.cal_date BETWEEN rx.fill_date AND rx.script_end_date 
         GROUP  BY rx.pat_id, 
                   c.cal_date), 
     y 
     AS ( 
        --makes a sequence so that contiguous dates can be grouped together as a date range 
        SELECT x.pat_id, 
               x.distinctdrugs, 
               c2.julian_seq- Row_number() 
                         OVER( 
                           partition BY x.pat_id, distinctdrugs 
                           ORDER BY x.cal_date) AS rn, 
               x.cal_date 
         FROM   x, 
                calendar AS c2 
         WHERE  c2.cal_date = x.cal_date) 
--finds the min and max dates between which a person was taking X number of drugs 
SELECT y.pat_id, 
       Min(y.cal_date)    AS min_overlap, 
       Max(y.cal_date)    AS max_overlap, 
       Min(distinctdrugs) AS distinctDrugs 
FROM   y 
GROUP  BY y.pat_id, 
          rn 

I have tried several indexes, including the following:

CREATE NONCLUSTERED INDEX [ix_rx2] ON [dbo].[rx] 
(
[drug_name] ASC,
[drug_class] ASC,
[fill_date] ASC,
[script_end_date] ASC
)

and

CREATE NONCLUSTERED INDEX [ix_rx3] ON [dbo].[rx] 
(
[fill_date] ASC,
[script_end_date] ASC
)
INCLUDE ( [pat_id],
[drug_class],
[drug_name])

Whenever I check the execution plan, none of the indexes on the rx table are used. There are other parts of the plan, but the bulk of it looks like this:
(screenshot of the execution plan)

I even tried removing the current clustered index and clustering on just (fill_date, admi_date), but I still can't avoid the hash match operator that consumes the vast majority of the resources for this query. The primary key on the rx table is never queried on, and I'll primarily be running date-range queries against this table. The indexes on the tables have nominal fragmentation, and their densities are all very low. What can I do to speed up this query, or am I limited by hardware?

Best Answer

This is not a complete answer, as I do not have the time right now, so let me just share a few thoughts. The complete answer would be huge, and I am not sure you want all the details.

I have been working with various temporal queries for several years and have learned a lot in the process. As a result, I would rather not have to optimize a query like this in my production system, and I would try very hard to avoid solving it with T-SQL. It is a complex problem. Itzik Ben-Gan has written about "gaps and islands" several times, including a chapter in his latest book on window functions. Your problem is a variation of gaps and islands.

First, I would consider reading all the data to the client and solving the problem there using loops. I know this requires sending data over the network, but fast loops in Java/C++/C# work very well for me most of the time. For instance, I was once struggling with a query involving time series and temporal data. When I moved most of the logic to the client, the C# solution was several times shorter and it ran 20,000 times faster. That's not a typo: twenty thousand times faster.

There is another problem with solving such problems in T-SQL: performance may be unstable. If a query is complex, the optimizer can suddenly choose a different plan, the query runs many times slower, and you have to optimize it all over again.

Alternatively, I would consider storing data differently. Right now I see two possible approaches.

First, instead of storing intervals we could use this table:

ClientId,
PrescriptionId,
DrugId,
Date

We can use trusted constraints to make sure each PrescriptionId covers a range of dates without gaps and overlaps, so that each interval is stored as one unbroken sequence of dates.
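A sketch of such a table (all names and types here are my assumptions, not from the question):

```sql
CREATE TABLE dbo.PrescriptionDay
(
    ClientId       int  NOT NULL,
    PrescriptionId int  NOT NULL,
    DrugId         int  NOT NULL,
    [Date]         date NOT NULL,
    CONSTRAINT PK_PrescriptionDay PRIMARY KEY (PrescriptionId, [Date]),
    -- no overlaps: at most one prescription per client/drug/day
    CONSTRAINT UQ_ClientDrugDay UNIQUE (ClientId, DrugId, [Date])
);
```

The UNIQUE constraint rules out overlaps declaratively. Enforcing "no gaps" takes more work (for example, a chained foreign key from each row to the previous day's row for the same prescription), which is beyond this sketch.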

Note: since you use DISTINCT in your first subquery, you are assuming that one person can take one drug on one day under more than one prescription. For simplicity, I am not making this assumption. Are you sure it is a correct assumption? If yes, we will have to change the design.

Once we have this table, we can essentially materialize your first subquery as an indexed view:

SELECT ClientId,
       [Date],
       COUNT_BIG(*) AS DistinctDrugs
FROM   dbo.PrescriptionDay -- hypothetical name; the per-day table above is not named
GROUP  BY ClientId,
          [Date]
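Spelled out as DDL, the indexed view could look like this (again assuming the hypothetical dbo.PrescriptionDay table; SCHEMABINDING and COUNT_BIG(*) are required for indexed views, and the unique clustered index is what materializes the view):

```sql
CREATE VIEW dbo.ClientDrugsPerDay
WITH SCHEMABINDING
AS
SELECT ClientId,
       [Date],
       COUNT_BIG(*) AS DistinctDrugs
FROM   dbo.PrescriptionDay
GROUP  BY ClientId, [Date];
GO

-- materializes the aggregate; kept up to date automatically on DML
CREATE UNIQUE CLUSTERED INDEX IX_ClientDrugsPerDay
    ON dbo.ClientDrugsPerDay (ClientId, [Date]);
```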

That done, you can use your second subquery to group the data points into intervals, or just run it on the client, where it can be solved as one trivial loop.

Second approach: instead of intervals, I would store a sequence of events. There would be two kinds of events: interval start and interval end. With every event I would store a running total, the number of open intervals after this event has happened. Essentially, this running total is the number of prescriptions active after the event has happened.

As in the previous approach, much of the data you calculate on the fly every time you run your query is pre-calculated in this table. We can use trusted constraints to ensure the integrity of precalculated data.
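A sketch of the event table (again, all names are mine, not from the question):

```sql
CREATE TABLE dbo.PrescriptionEvent
(
    ClientId    int     NOT NULL,
    EventDate   date    NOT NULL,
    EventType   char(1) NOT NULL
        CONSTRAINT CK_EventType CHECK (EventType IN ('S', 'E')), -- start or end
    -- precalculated running total: prescriptions open after this event
    ActiveCount int     NOT NULL
        CONSTRAINT CK_ActiveCount CHECK (ActiveCount >= 0),
    CONSTRAINT PK_PrescriptionEvent PRIMARY KEY (ClientId, EventDate, EventType)
);
```

Keeping ActiveCount consistent with its neighbors (each start increments it, each end decrements it) is what the trusted constraints would have to enforce, typically by chaining each row to the previous one for the same client.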

I can describe it in more detail later if you are interested.