SQL Server – Query strategies using SQL Server 2016 system-versioned temporal tables for Slowly Changing Dimensions

data-warehouse, slowly-changing-dimension, sql-server, sql-server-2016, temporal-tables

When using a system-versioned temporal table (new in SQL Server 2016), what are the query authoring and performance implications when this feature is used to handle Slowly Changing Dimensions in a large relational data warehouse?

For example, assume I have a 100,000-row Customer dimension with a Postal Code column and a multi-billion-row Sales fact table with a CustomerID foreign key column. And assume I want to query "Total 2014 sales by customer's postal code". Simplified DDL is like this (omitting many columns for clarity):

CREATE TABLE Customer
(
    CustomerID int identity (1,1) NOT NULL PRIMARY KEY CLUSTERED, 
    PostalCode varchar(50) NOT NULL,
    -- period columns are populated automatically by SQL Server
    SysStartTime datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
)
WITH (SYSTEM_VERSIONING = ON);
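As written, SQL Server will create an anonymous history table automatically. A small variation (dbo.CustomerHistory is an example name, not from the original DDL) names the history table explicitly, which makes it easier to find and index:

-- Variation on the WITH clause above: name the history table instead of
-- letting SQL Server generate an anonymous one
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CustomerHistory));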

CREATE TABLE Sale
(
    SaleId int identity(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    SaleDateTime datetime2 NOT NULL,
    CustomerId int NOT NULL FOREIGN KEY REFERENCES Customer(CustomerID),
    SaleAmount decimal(10,2) NOT NULL
);

Where it gets interesting is that customers may have moved during the year, so the same customer may have different postal codes. It's even remotely possible that a customer moved away and then moved back, meaning there could be multiple history records for the same customer with the same postal code! My query of "sales by postal code" should calculate correct results regardless of how customers' postal codes change over time.
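For illustration (a sketch; CustomerID 42 and the postal codes are invented), every UPDATE moves the superseded row version into the history table, and FOR SYSTEM_TIME ALL shows all versions:

UPDATE Customer SET PostalCode = '90210' WHERE CustomerID = 42; -- customer moves away
UPDATE Customer SET PostalCode = '10001' WHERE CustomerID = 42; -- and later moves back

-- FOR SYSTEM_TIME ALL returns the current row plus every history row
SELECT CustomerID, PostalCode, SysStartTime, SysEndTime
FROM Customer FOR SYSTEM_TIME ALL
WHERE CustomerID = 42
ORDER BY SysStartTime;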

I understand how to use temporal tables to query the customer dimension alone (e.g. SELECT * FROM Customer FOR SYSTEM_TIME FROM '2014-01-01' TO '2015-01-01') but I'm not sure how to join to the fact table most accurately and efficiently.
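For a single instant, FOR SYSTEM_TIME AS OF returns exactly one version per customer, which is the semantics each individual sale needs, but every sale carries its own instant. A sketch of the single-instant form (the date is arbitrary):

-- One row per customer: the version that was effective at this instant
SELECT CustomerID, PostalCode
FROM Customer FOR SYSTEM_TIME AS OF '2014-06-15';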

Is this how I should query it?

SELECT c.PostalCode, SUM(s.SaleAmount) AS SaleAmount
FROM Customer FOR SYSTEM_TIME FROM '2014-01-01' TO '2015-01-01' AS c
    JOIN Sale s ON s.CustomerId = c.CustomerID
WHERE s.SaleDateTime >= '2014-01-01' AND s.SaleDateTime < '2015-01-01'
    -- pick the single customer version effective at the moment of the sale
    AND c.SysStartTime <= s.SaleDateTime
    AND c.SysEndTime > s.SaleDateTime
GROUP BY c.PostalCode
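Or should I expose every row version with FOR SYSTEM_TIME ALL and move the effective-dating predicate into the join, like this sketch (the period columns form a half-open [SysStartTime, SysEndTime) interval)?

SELECT c.PostalCode, SUM(s.SaleAmount) AS SaleAmount
FROM Sale s
    JOIN Customer FOR SYSTEM_TIME ALL AS c
        ON c.CustomerID = s.CustomerId
        -- customer version effective at the moment of the sale
        AND c.SysStartTime <= s.SaleDateTime
        AND c.SysEndTime > s.SaleDateTime
WHERE s.SaleDateTime >= '2014-01-01' AND s.SaleDateTime < '2015-01-01'
GROUP BY c.PostalCode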

And what are the performance considerations that I should watch out for when making queries like this?

Best Answer

I think in your case a derived table is necessary to isolate the customers whose postal code mutated during the period:

SELECT c.PostalCode
    , SUM(s.SaleAmount) AS SaleAmount
    , COUNT(postcode_mutations.CustomerID) AS CntCustomerChangedPostCode
FROM dbo.Sale s
JOIN dbo.Customer c ON s.CustomerId = c.CustomerID -- current row only, so sales roll up to today's postal code
-- customers who had more than one distinct postal code during 2014
LEFT JOIN (
    SELECT CustomerID
    FROM dbo.Customer
    FOR SYSTEM_TIME FROM '2014-01-01' TO '2015-01-01'
    GROUP BY CustomerID
    HAVING COUNT(DISTINCT PostalCode) > 1
) postcode_mutations ON s.CustomerId = postcode_mutations.CustomerID
WHERE s.SaleDateTime >= '2014-01-01' AND s.SaleDateTime < '2015-01-01'
GROUP BY c.PostalCode

Update: since the query is supposed to serve DWH/analytics scenarios, columnstore indexing is an option worth checking. I also ran some benchmarks previously on a 10-million-row table.
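For example (a sketch; the index name and column list are my assumptions), SQL Server 2016 supports an updatable nonclustered columnstore index alongside the rowstore clustered primary key, covering exactly the fact columns this query scans:

-- Column-oriented storage for the aggregation over billions of fact rows
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Sale
    ON dbo.Sale (SaleDateTime, CustomerId, SaleAmount);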