This is not a complete answer; I do not have the time right now, so let me just share a few thoughts. The complete answer would be huge, and I am not sure you want to know all the details.
I have been working with various temporal queries for several years and have learned a lot in the process. As such, I would rather not have to optimize a query like yours in my production system, and I would try very hard to avoid solving it in T-SQL. It is a complex problem. Itzik Ben-Gan has written about "gaps and islands" several times, including a chapter in his latest book on OLAP functions. Your problem is a variation of gaps and islands.
First, I would consider reading all the data to the client and solve it there using loops. I know, it requires sending data over the network, but fast loops in Java/C++/C# work very well for me most of the time. For instance, once I was struggling with a query involving time series and temporal data. When I moved most of the logic to the client, the C# solution was several times shorter and it ran 20,000 times faster. That's not a typo - twenty thousand times faster.
There is another drawback to solving such problems in T-SQL: your performance may be unstable. If a query is complex, the optimizer can all of a sudden choose another plan, the query runs many times slower, and you have to optimize it again.
Alternatively, I would consider storing data differently. Right now I see two possible approaches.
First, instead of storing intervals we could use this table:
ClientId,
PrescriptionId,
DrugId,
Date
We can use trusted constraints to make sure each PrescriptionId covers a range of dates without gaps or overlaps, so that each interval is stored as one unbroken sequence of dates.
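Just to illustrate the invariant those constraints would enforce, here is a client-side sketch in Python (my own function name; the real enforcement would be done with constraints in the database, not like this):

```python
from datetime import date, timedelta

def is_unbroken(dates):
    """True if the dates form one gapless, duplicate-free daily sequence."""
    s = sorted(dates)
    return all(b - a == timedelta(days=1) for a, b in zip(s, s[1:]))

# One prescription stored as one unbroken sequence of dates:
ok = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)]
# This one violates the invariant: Jan 2 is missing.
bad = [date(2020, 1, 1), date(2020, 1, 3)]
```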
Note: I know you are using DISTINCT in your first subquery, so you are assuming that one person can take one drug on one day under more than one prescription. I am not making this assumption, for simplicity. Are you sure it is a correct assumption? If yes, we will have to change the design.
Once we have this table, we can essentially materialize your first subquery as an indexed view:
CREATE VIEW dbo.DrugsPerDay WITH SCHEMABINDING AS
SELECT ClientId, Date, COUNT_BIG(*) AS DistinctDrugs
FROM dbo.DrugDays -- the table described above; both names here are placeholders
GROUP BY ClientId, Date;
-- then materialize it by creating a unique clustered index on (ClientId, Date)
That done, you can use your second subquery to group data points into intervals, or just run it on the client, where it can be solved as one trivial loop.
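That trivial loop looks roughly like this (a Python sketch under my own assumptions: rows arrive sorted by ClientId and Date, and all names are hypothetical):

```python
from datetime import date, timedelta

def to_intervals(rows):
    """Collapse sorted (client_id, day, distinct_drugs) points into
    (client_id, start_day, end_day, distinct_drugs) intervals."""
    intervals = []
    for client_id, day, drugs in rows:
        if (intervals
                and intervals[-1][0] == client_id
                and intervals[-1][3] == drugs
                and intervals[-1][2] == day - timedelta(days=1)):
            # Same client, same drug count, adjacent day: extend the interval.
            prev = intervals[-1]
            intervals[-1] = (prev[0], prev[1], day, prev[3])
        else:
            # A gap, a new client, or a changed count starts a new interval.
            intervals.append((client_id, day, day, drugs))
    return intervals
```

Each row is visited exactly once, which is why this kind of loop tends to be so much faster and more predictable than the equivalent set-based query.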
Second approach: instead of intervals I would store a sequence of events. There would be two kinds of events: interval start and interval end. With every event I would store a running total: the number of open intervals after this event has happened. Essentially this running total is the number of prescriptions active after the event has happened.
As in the previous approach, much of the data you calculate on the fly every time you run your query is pre-calculated in this table. We can use trusted constraints to ensure the integrity of precalculated data.
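A minimal client-side sketch of that representation (Python, my own names; I am assuming end dates are exclusive, i.e. an interval covers start..end-1):

```python
from datetime import date

def build_events(intervals):
    """Turn (start, end) prescription intervals into a sorted event list,
    each event carrying the running count of prescriptions active after it."""
    raw = []
    for start, end in intervals:
        raw.append((start, +1))  # interval start opens a prescription
        raw.append((end, -1))    # interval end closes it
    raw.sort()
    events, active = [], 0
    for day, delta in raw:
        active += delta
        events.append((day, active))  # the precalculated running total
    return events
```

In the table version, this running total is exactly the column you would precalculate and protect with trusted constraints, so queries never have to recompute it.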
I can describe it in more detail later if you are interested.
Best Answer
You may believe a primary key is unnecessary, but in my experience it is critical, even if you never use it directly.
But the database itself uses the primary key to determine the PHYSICAL organization of the data.

Correction: I have been corrected here - in SQL Server, the physical organization is based on the clustering index. By default the primary key IS the clustering index, so by adding a primary key you add the clustered index, and the rest of my point still stands. But if you don't want to use an autonumber or some other primary key as the clustered index, you CAN still cluster on some other column. You should definitely pick one, though. Check out this link: http://technet.microsoft.com/en-us/library/ms186342.aspx

Without a clustered index you risk corruption of your data. And in some cases, even when it seems totally unnecessary, adding a clustered index that you don't even search on will dramatically improve query performance. Here is an excellent description of the basic indexes in SQL Server: https://www.simple-talk.com/sql/learn-sql-server/sql-server-index-basics/
Based on the basics you have given us, I would say you need something like a primary key (an autonumber works) to serve as your clustered index, plus several nonclustered indexes (one on each of your three search fields), which are the typical choice when you want to search on columns that are not the primary key.
If the data will be edited frequently (as when a software application is used to manage it), many nonclustered indexes will slow down insert/update/delete activity, because every index must be updated whenever the table is. But if your focus is on selecting data, indexes are critical.
If this is updated in batch daily and is not part of a software system, you could have a routine which drops the indexes, does a batch insert, then recreates the indexes.
But regardless - what you want for your search fields are nonclustered indexes. A clustered index "clusters" the entire record around the clustering key (the primary key, or whatever field is chosen as the cluster basis), and a table can have only one. Nonclustered indexes are what you create when you index any other field.