SQL Server Temporal Tables – Improving Performance on Older Values

performance, sql-server, temporal-tables

I'm encountering a strange issue when accessing historical records in a temporal table. Queries that access older entries via the AS OF sub-clause take longer than queries on recent historical entries.

The history table was generated by SQL Server (it has a clustered index on the date columns and uses page compression). I added 50 million rows to the history table, and my queries retrieve about 25,000 rows.

I have tried to determine the root cause of the issue but have not been able to identify it. So far I have tested:

Creating a test table with 50 million rows and a clustered index, to see whether the slowdown was simply due to volume. I was able to retrieve 25K rows in constant time (~400ms).
Removing page compression from the history table. That had no effect on the retrieval time but did significantly increase the size of the table.
Accessing the rows of the history table directly using an ID column instead of the date columns. This is where things got more interesting: I could access older rows in ~400ms, whereas the AS OF sub-clause took ~1200ms. Filtering my test table on the date column showed a similar slowdown compared to filtering on the ID column. This leads me to believe that the date comparisons are behind some of the slowdown.

I want to look at this more, but I also want to make sure that I am not barking up the wrong tree. First, has anyone else experienced this behavior when accessing older historical data in a temporal table (we only noticed slowdowns past 10 million rows)? Second, what strategies can I use to further isolate the root cause of the performance issue (I just started looking into execution plans, but they are still a bit cryptic to me)?

Execution plans

These are simple retrieval queries: the first accesses older rows, the second accesses newer rows.

Older Rows ~1200ms execution time

Recent Rows ~350ms execution time

Table details

These are the columns in the temporal table. The history table has the same columns but does not have a primary key (as per the history table requirements):

Below are the indices on the history table:

Best Answer

In a comment from Zane on your question, he stated:

...It seems like part of your problem is you're reading 50 million rows in order to return 20K in the plan.

This is, indeed, the problem. There's no index available to push some, or all, of the predicates down to the storage engine. Microsoft recommends this baseline indexing strategy for temporal tables in the Docs article Temporal Table Considerations and Limitations:

An optimal indexing strategy will include a clustered columns store index and / or a B-tree rowstore index on the current table and a clustered columnstore index on the history table for optimal storage size and performance. If you create / use your own history table, we strongly recommend that you create this type of index consisting of period columns starting with the end of period column to speed up temporal querying as well as speeding up the queries that are part of the data consistency check. The default history table has a clustered rowstore index created for you based on the period columns (end, start). At a minimum, a non-clustered rowstore index is recommended

The phrasing of that is a little confusing (to me, anyway). But the takeaway is that you could create these indexes to improve performance somewhat, and possibly quite a lot:

NC index on the current table, leading with `SysEndTime`:

CREATE NONCLUSTERED INDEX IX_SysEndTime_SysStartTime 
ON dbo.Benefits (SysEndTime, SysStartTime)
/*INCLUDE (ideally, include your other important fields here)*/;

This will allow you to avoid reading some of the rows in the current table by seeking to the appropriate end time.

CCI on the history table:

CREATE CLUSTERED COLUMNSTORE INDEX ix_BenefitsHistory
ON dbo.BenefitsHistory
WITH (DROP_EXISTING = ON);

This will let you get batch mode on the history table, which should make the scans much faster.
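After creating the index, it can be worth confirming what the history table actually ended up with. The query below is a minimal sketch using the standard catalog views `sys.indexes` and `sys.partitions`; it assumes the history table is named `dbo.BenefitsHistory`, as in the statement above.

```sql
-- List each index on the history table with its type and compression setting.
-- Assumes the history table is dbo.BenefitsHistory (name taken from the CREATE statement above).
SELECT
    i.name                   AS index_name,
    i.type_desc              AS index_type,
    p.data_compression_desc  AS compression,
    SUM(p.rows)              AS row_count
FROM sys.indexes AS i
JOIN sys.partitions AS p
    ON p.object_id = i.object_id
   AND p.index_id  = i.index_id
WHERE i.object_id = OBJECT_ID(N'dbo.BenefitsHistory')
GROUP BY i.name, i.type_desc, p.data_compression_desc;
```

If the change took effect, the clustered index should be reported with a columnstore type rather than a rowstore one.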

NC index on the current table, leading with `SysStartTime`:

See Paul's answer to the question Most Efficient Way to Retrieve Date Ranges for more details on why indexing for date range queries is hard. Based on the logic there, it makes sense to add another NC index on the current table that leads with SysStartTime, so that the optimizer can choose which one to use based on statistics and the specific parameters of your query:

CREATE NONCLUSTERED INDEX IX_SysStartTime_SysEndTime
ON dbo.Benefits (SysStartTime, SysEndTime)
/*INCLUDE (ideally, include your other important fields here)*/;
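To see why the period columns matter so much for indexing, it helps to write out what an AS OF query asks for. The sketch below shows the temporal syntax next to a roughly equivalent manual form; `BenefitId` is a hypothetical key column and the timestamp is an arbitrary example, while the table and period column names are taken from the index definitions above.

```sql
-- Sketch only: BenefitId is a hypothetical column; replace it with your own column list.
DECLARE @AsOf datetime2 = '2018-01-01T00:00:00';  -- example point in time (UTC)

-- Temporal syntax
SELECT BenefitId, SysStartTime, SysEndTime
FROM dbo.Benefits FOR SYSTEM_TIME AS OF @AsOf;

-- Roughly equivalent manual form: rows whose period contains @AsOf,
-- taken from both the current table and the history table
SELECT BenefitId, SysStartTime, SysEndTime
FROM dbo.Benefits
WHERE SysStartTime <= @AsOf
  AND SysEndTime   >  @AsOf
UNION ALL
SELECT BenefitId, SysStartTime, SysEndTime
FROM dbo.BenefitsHistory
WHERE SysStartTime <= @AsOf
  AND SysEndTime   >  @AsOf;
```

Each branch filters on two inequality predicates over different columns, so a single B-tree index can seek on only one of them. That is the reason for offering the optimizer one index leading with each period column.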

Creating the 3 indexes outlined above made a significant difference in resource usage in my test cases. I set up a test case which runs two queries that return 1.5 million total rows. Both the history and current tables have 50 million rows.

Note: To reduce SSMS overhead, I ran the test with "Discard results after execution" option enabled.
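One common way to collect logical reads, CPU time, and elapsed time for a comparison like this is `SET STATISTICS IO, TIME ON`. The following is a minimal sketch, not necessarily how the numbers here were gathered; the two timestamps are arbitrary examples standing in for an "older" and a "recent" point in time.

```sql
-- Report per-table logical reads plus CPU and elapsed time in the Messages tab.
SET STATISTICS IO, TIME ON;

-- Example timestamps only; pick values that hit old and recent history in your data.
SELECT *
FROM dbo.Benefits FOR SYSTEM_TIME AS OF '2017-01-01T00:00:00';

SELECT *
FROM dbo.Benefits FOR SYSTEM_TIME AS OF '2019-01-01T00:00:00';

SET STATISTICS IO, TIME OFF;
```

Running the same pair of queries before and after an indexing change gives directly comparable figures.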

Execution Plan - Default Indexes

Logical reads: 1,330,612
CPU time: 00:00:14.718
Elapsed time: 00:00:06.198

Execution Plan - With Indexes Described Above

Logical reads: 27,656 (8,111 row store + 19,545 columnstore)
CPU time: 00:00:01.828
Elapsed time: 00:00:01.150

As you can see, all 3 measures dropped significantly - including total elapsed time, from 6 seconds to 1 second.

The other option presented by the Docs article is to forgo the two NC indexes on the current table in favor of a clustered columnstore index. In my test, performance was very similar to the indexing solution described above.

Related Solutions

SQL Server Performance – Improving Inner Join Performance Using Dates and Between

It is much more efficient to do this without having to go back and join to the periods table.

-- @Year, @Period and @Week are assumed to be the parameters of the original query.
DECLARE @StartDate DATE, @EndDate DATE;

Select @StartDate = Min(StartDate), @EndDate = MAX(EndDate) 
from dbo.PeriodCalendar_Weeks pcw
where (pcw.Year = @Year and pcw.Period < @Period) 
  or  (pcw.Year = @Year and pcw.Period = @Period and pcw.Week <= @Week) 
  or (pcw.Year = @Year -1 and pcw.Period >= @Period);

SELECT 
  WeekEndDate = DATEADD(DAY, 6, DATEADD(WEEK, SalesWeek, @StartDate)), 
  Store, 
  DeliveryChargesTotal = dct
FROM 
(
  SELECT DATEDIFF(DAY, @StartDate, SalesDate)/7, Store, SUM(DeliveryChargesTotal)
  FROM dbo.Daily_GC_Headers
  WHERE SalesDate BETWEEN @StartDate AND @EndDate AND isCanceled = 0
  GROUP BY DATEDIFF(DAY, @StartDate, SalesDate)/7, Store
) AS x (SalesWeek, Store, dct)
ORDER BY WeekEndDate, Store;

A filtered index may help, if many rows exist where isCanceled = 1 (these are just possible suggestions, depending on cardinality of Store, and may not be the most optimal):

CREATE INDEX x ON dbo.Daily_GC_Headers
  (SalesDate) INCLUDE (Store, DeliveryChargesTotal)
  WHERE isCanceled = 0;

If there are very few rows where isCanceled = 1, this may be better:

CREATE INDEX x ON dbo.Daily_GC_Headers
  (SalesDate, IsCanceled) INCLUDE (Store, DeliveryChargesTotal);

Both are worth trying on a test system, as well as moving Store into the key in either case, or moving IsCanceled to the INCLUDE list in the latter case. On my system, I found the best results with everything but the date in the INCLUDE list:

CREATE INDEX x ON dbo.Daily_GC_Headers
  (SalesDate) INCLUDE (Store, IsCanceled, DeliveryChargesTotal);

Again, you will need to test if any of these work out, or if the query above gives a different/better recommendation directly from SQL Server.
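Besides the missing-index hint shown in an execution plan, SQL Server's suggestions can also be read from the missing-index DMVs. The query below is a sketch under the assumption that the query has already run on the instance since the last restart and that you have permission to view server state; the ordering expression is just one common heuristic for ranking suggestions.

```sql
-- Missing-index suggestions recorded for dbo.Daily_GC_Headers in the current database.
SELECT TOP (10)
    d.statement           AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.user_seeks,
    s.avg_user_impact     -- estimated percentage improvement
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s
    ON s.group_handle = g.index_group_handle
WHERE d.database_id = DB_ID()
  AND d.object_id   = OBJECT_ID(N'dbo.Daily_GC_Headers')
ORDER BY s.avg_total_user_cost * s.avg_user_impact * (s.user_seeks + s.user_scans) DESC;
```

Treat the output as a starting point for testing rather than a list of indexes to create as-is.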

MySQL – Improving Performance of JOIN, WHERE, ORDER BY

With your datasets, MySQL has to obtain those 450,000 records from posts (in 1000 little chunks, one for each matching source_id), sort them, and then return the top 10. It is a costly exercise.

You could resort to using a stored procedure, and accumulate results going back in time, say daily or weekly, looping until obtaining at least 10 records, and then returning the 10 most recent ones. You'll need an index on posts by (date, source_id). It would return quickly for the most recently active users, but take much longer for users without recent posts. Something like the following:

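DELIMITER ;;
CREATE DEFINER=CURRENT_USER PROCEDURE stored_procedure_name(u_id int)
BEGIN
  DECLARE fd DATE;  -- date of the oldest post
  DECLARE d DATE;   -- end of the one-week window currently being searched

  SELECT MIN(date), MAX(date) INTO fd, d FROM posts;

  -- Remove any copy left over from an earlier call on this connection
  DROP TEMPORARY TABLE IF EXISTS last_posts;
  CREATE TEMPORARY TABLE last_posts (id int);

  -- Walk back one week at a time until 10 posts are collected
  -- or the window has moved past the oldest post
  WHILE d >= fd AND (SELECT COUNT(*) FROM last_posts) < 10 DO
    INSERT INTO last_posts (id)
    SELECT p.id
    FROM relations AS r
    JOIN posts AS p ON (p.source_id = r.source_id AND
                        p.date > date_sub(d, interval 7 day) AND p.date <= d)
    WHERE r.user_id = u_id
    ORDER BY p.date DESC
    LIMIT 10;

    SET d = date_sub(d, interval 7 day);
  END WHILE;

  -- Return the 10 most recent collected posts, joined only to this user's relations
  SELECT p.id, p.title, p.content, r.id AS rid
  FROM posts p
  JOIN relations AS r ON (r.source_id = p.source_id AND r.user_id = u_id)
  WHERE p.id IN (SELECT id FROM last_posts)
  ORDER BY p.date DESC
  LIMIT 10;

  DROP TEMPORARY TABLE last_posts;
END;;
DELIMITER ;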

You could play with the intervals, reducing them to 1 day (more cycles with faster queries). Ensure you have an index on posts (date).
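For completeness, the supporting index and a call to the procedure might look like the following in MySQL. This is a sketch: the index name and the user id 42 are hypothetical, and the column order follows the (date, source_id) suggestion made earlier in this answer.

```sql
-- Hypothetical index name; lets each weekly window be read as a range on date.
CREATE INDEX idx_posts_date_source ON posts (date, source_id);

-- Fetch the 10 most recent posts for an example user id.
CALL stored_procedure_name(42);
```

Whether the composite index beats a plain index on (date) depends on how many sources each user follows, so it is worth comparing both with EXPLAIN on your own data.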