SQL Server – How to prevent an RID lookup

clustered-index | index-tuning | sql-server

I have a table that stores time-series data and has about 8 million rows. The table structure is as follows:

Timestamp | ComponentID | Parameter1 | Parameter2 | Parameter3

I have a clustered index on the Timestamp and ComponentID columns. (This is also my primary key.)

The query I am trying to run is:

SELECT * FROM table
WHERE Timestamp BETWEEN '2020-01-01' AND '2020-01-02'
  AND ComponentId = 5

When I run this query, an RID lookup seems to take place. I have read that this only happens when the index does not cover the query, but since I have a clustered index I thought it should itself cover everything and prevent the lookup. How do I prevent this lookup from happening?

Query 1 (Timestamp + ComponentId):
[screenshot: Query 1]

Query 2 (Timestamp only):
[screenshot: Query 2]

Execution Plan

Best Answer

RID Lookups occur on a heap data structure in SQL Server (as opposed to a B-Tree). They happen when a non-covering nonclustered index is used to fetch the data, and the remaining fields that index is missing have to be looked up in the heap. Your table data is stored in a heap when there is no clustered index on that table (the clustered index defines the key by which the records are sorted into a B-Tree).
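You can confirm whether your table is actually a heap by checking the system catalog. A minimal sketch, assuming a hypothetical table name of dbo.SensorData:

```sql
-- index_id = 0 with type_desc 'HEAP' means the table is a heap;
-- index_id = 1 with type_desc 'CLUSTERED' means it has a clustered index.
SELECT i.index_id, i.name, i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.SensorData');
```

If this returns a 'HEAP' row, the RID lookup you're seeing is consistent with the table having no clustered index, regardless of what indexes you believe are in place.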

If your table had a clustered index and this query still used the same nonclustered index it's currently using, you'd see a Key Lookup operator in the execution plan instead.

If you create a clustered index on your table on the two fields Timestamp and ComponentID, it will be covering for your query, and you should see the clustered index used in the execution plan instead, which eliminates any additional lookup operation.
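As a sketch of what that could look like, assuming the hypothetical names dbo.SensorData for the table and PK_SensorData for the existing (nonclustered) primary key constraint:

```sql
-- Drop the existing nonclustered primary key, then recreate it as the
-- clustered index, so the table's B-Tree is keyed on (Timestamp, ComponentID)
-- and the query needs no lookup at all.
ALTER TABLE dbo.SensorData DROP CONSTRAINT PK_SensorData;

ALTER TABLE dbo.SensorData
ADD CONSTRAINT PK_SensorData
PRIMARY KEY CLUSTERED (Timestamp, ComponentID);
```

Rebuilding the clustered index on an 8-million-row table will take some time and lock the table, so run it in a maintenance window.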


Regarding your second question in the comments: based on your recent comment updates, it sounds like the difference in runtime you're seeing is due to the first run pulling the data into memory from disk (generally the most bottlenecking part of the process, from a hardware perspective), while the second run leverages the data already in memory.

Depending on how large your table is (SQL Server pages are a fixed 8 KB), this normally shouldn't be too much of a concern, given the number of rows your execution plan shows the query returning. All subsequent runs of the query, while the data is still in memory, will have optimal performance.

If the initial run that pulls the data off the disk becomes a problem, then you can either look into compression or see whether you can upgrade your disks to something faster (if you're currently on a mechanical hard drive, switching to an SSD, or better yet an NVMe drive, will help).
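For the compression route, a minimal sketch (again assuming the hypothetical dbo.SensorData name):

```sql
-- Rebuild the table with page compression so fewer 8 KB pages
-- have to be read from disk on a cold run. Compression trades
-- some CPU on reads/writes for reduced IO.
ALTER TABLE dbo.SensorData
REBUILD WITH (DATA_COMPRESSION = PAGE);
```

ROW compression is a lighter-weight alternative if the CPU overhead of PAGE compression turns out to matter for your workload.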

I'll also add a final note from the comments that is true and related, but not the root issue itself: your query that does the additional filtering on ComponentId results in more data being read. This is evident in your IO statistics screenshots if you compare the grand totals for Logical Reads between the two queries. The query filtering on Timestamp and ComponentId results in 396 Logical Reads, whereas the query filtering on just Timestamp results in only 90 Logical Reads.

Logical Reads are the number of 8 KB pages read from memory; Physical Reads are the number of 8 KB pages read from disk. It's not a huge difference, but it would account for a small difference in time between the first runs of the two queries. This becomes a moot point on subsequent runs of both queries, once the data is already in memory (as discussed in my previous few paragraphs).
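You can reproduce those IO numbers yourself rather than relying on screenshots. A sketch, with the same hypothetical dbo.SensorData table standing in for yours:

```sql
-- Report per-table logical reads (pages from memory) and
-- physical reads (pages from disk) in the Messages tab.
SET STATISTICS IO ON;

SELECT * FROM dbo.SensorData
WHERE Timestamp BETWEEN '2020-01-01' AND '2020-01-02'
  AND ComponentId = 5;

SELECT * FROM dbo.SensorData
WHERE Timestamp BETWEEN '2020-01-01' AND '2020-01-02';

SET STATISTICS IO OFF;
```

Running both statements in one batch makes it easy to compare the "logical reads" totals side by side, and the "physical reads" counts will show whether a run was served from disk or from memory.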