This is not a complete answer; I do not have the time right now, so let me just share a few thoughts. The complete answer would be huge, and I am not sure you want to know all the details.
I have been working with various temporal queries for several years and have learned a lot in the process. As such, I would rather not have to optimize a query like yours in my production system, and I would try very hard to avoid solving it in T-SQL. It is a complex problem. Itzik Ben-Gan has written about "gaps and islands" several times, including a chapter in his latest book on OLAP functions. Your problem is a variation of gaps and islands.
First, I would consider reading all the data to the client and solve it there using loops. I know, it requires sending data over the network, but fast loops in Java/C++/C# work very well for me most of the time. For instance, once I was struggling with a query involving time series and temporal data. When I moved most of the logic to the client, the C# solution was several times shorter and it ran 20,000 times faster. That's not a typo - twenty thousand times faster.
There is another drawback to solving such problems in T-SQL: your performance may be unstable. If a query is complex, the optimizer can all of a sudden choose another plan, the query runs many times slower, and you have to optimize it again.
Alternatively, I would consider storing data differently. Right now I see two possible approaches.
First, instead of storing intervals we could use this table:
ClientId,
PrescriptionId,
DrugId,
Date
We can use trusted constraints to make sure each PrescriptionId covers a range of dates without gaps or overlaps, so that each interval is stored as one unbroken sequence of dates.
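Just to illustrate the invariant those constraints would enforce, here is a client-side sketch in Python (my own function name; the real enforcement would be done with constraints in the database, not like this):

```python
from datetime import date, timedelta

def is_unbroken(dates):
    """True if the dates form one gapless, duplicate-free daily sequence."""
    s = sorted(dates)
    return all(b - a == timedelta(days=1) for a, b in zip(s, s[1:]))

# One prescription stored as one unbroken sequence of dates:
ok = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)]
# This one violates the invariant: Jan 2 is missing.
bad = [date(2020, 1, 1), date(2020, 1, 3)]
```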
Note: I know you are using DISTINCT in your first subquery, so you are assuming that one person can take one drug on one day under more than one prescription. I am not making this assumption, for simplicity. Are you sure it is a correct assumption? If yes, we will have to change the design.
Once we have this table, we can essentially materialize your first subquery as an indexed view:
CREATE VIEW dbo.DrugsPerDay WITH SCHEMABINDING AS
SELECT ClientId, Date, COUNT_BIG(*) AS DistinctDrugs
FROM dbo.DrugDays -- the table described above; both names here are placeholders
GROUP BY ClientId, Date;
-- then materialize it by creating a unique clustered index on (ClientId, Date)
That done, you can use your second subquery to group data points into intervals, or just run it on the client, where it can be solved as one trivial loop.
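That trivial loop looks roughly like this (a Python sketch under my own assumptions: rows arrive sorted by ClientId and Date, and all names are hypothetical):

```python
from datetime import date, timedelta

def to_intervals(rows):
    """Collapse sorted (client_id, day, distinct_drugs) points into
    (client_id, start_day, end_day, distinct_drugs) intervals."""
    intervals = []
    for client_id, day, drugs in rows:
        if (intervals
                and intervals[-1][0] == client_id
                and intervals[-1][3] == drugs
                and intervals[-1][2] == day - timedelta(days=1)):
            # Same client, same drug count, adjacent day: extend the interval.
            prev = intervals[-1]
            intervals[-1] = (prev[0], prev[1], day, prev[3])
        else:
            # A gap, a new client, or a changed count starts a new interval.
            intervals.append((client_id, day, day, drugs))
    return intervals
```

Each row is visited exactly once, which is why this kind of loop tends to be so much faster and more predictable than the equivalent set-based query.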
Second approach: instead of intervals I would store a sequence of events. There would be two kinds of events: interval start and interval end. With every event I would store a running total: the number of open intervals after this event has happened. Essentially this running total is the number of prescriptions active after the event has happened.
As in the previous approach, much of the data you calculate on the fly every time you run your query is pre-calculated in this table. We can use trusted constraints to ensure the integrity of precalculated data.
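A minimal client-side sketch of that representation (Python, my own names; I am assuming end dates are exclusive, i.e. an interval covers start..end-1):

```python
from datetime import date

def build_events(intervals):
    """Turn (start, end) prescription intervals into a sorted event list,
    each event carrying the running count of prescriptions active after it."""
    raw = []
    for start, end in intervals:
        raw.append((start, +1))  # interval start opens a prescription
        raw.append((end, -1))    # interval end closes it
    raw.sort()
    events, active = [], 0
    for day, delta in raw:
        active += delta
        events.append((day, active))  # the precalculated running total
    return events
```

In the table version, this running total is exactly the column you would precalculate and protect with trusted constraints, so queries never have to recompute it.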
I can describe it in more detail later if you are interested.
Best Answer
You may believe a primary key is unnecessary, but in my experience it is critical, even if you never use it directly.
But the database itself uses the primary key to determine the PHYSICAL organization of the data.

Correction: I have been corrected here - in SQL Server, the physical organization is based on the clustering index. By default the primary key IS the clustering index, so by adding a primary key you add the clustered index, and the rest of my point still stands. But if you don't want to use an autonumber or some other primary key as the clustered index, you CAN still cluster on some other column. You should definitely pick one, though. Check out this link: http://technet.microsoft.com/en-us/library/ms186342.aspx

Without a clustered index you risk corruption of your data. And in some cases, even when it seems totally unnecessary, adding a clustered index that you don't even search on will dramatically improve query performance. Here is an excellent description of the basic indexes in SQL Server: https://www.simple-talk.com/sql/learn-sql-server/sql-server-index-basics/
Based on the basics you have given us, I would say you need something like a primary key (an autonumber works) to serve as your clustered index, plus several nonclustered indexes (one on each of your three search fields), which are the typical choice when you want to search on columns that are not the primary key.
If the data will be edited frequently (as when a software application is used to manage it), many nonclustered indexes will slow down insert/update/delete activity, because every index must be updated whenever the table is. But if your focus is on selecting data, indexes are critical.
If this is updated in batch daily and is not part of a software system, you could have a routine which drops the indexes, does a batch insert, then recreates the indexes.
But regardless - what you want for your search fields are nonclustered indexes. A clustered index "clusters" the entire record around the clustering key (the primary key, or whatever field is chosen as the cluster basis), and a table can have only one. Nonclustered indexes are what you create when you index any other field.