SQL Server – Query strategies using SQL Server 2016 system-versioned temporal tables for Slowly Changing Dimensions

data-warehouse, slowly-changing-dimension, sql-server, sql-server-2016, temporal-tables

When using a system-versioned temporal table (new in SQL Server 2016), what are the query authoring and performance implications when this feature is used to handle Slowly Changing Dimensions in a large relational data warehouse?

For example, assume I have a 100,000-row Customer dimension with a Postal Code column and a multi-billion-row Sales fact table with a CustomerID foreign key column. And assume I want to query "Total 2014 sales by customer's postal code". Simplified DDL is like this (omitting many columns for clarity):

CREATE TABLE Customer
(
    CustomerID int identity (1,1) NOT NULL PRIMARY KEY CLUSTERED, 
    PostalCode varchar(50) NOT NULL,
    -- period columns are populated automatically by SQL Server
    SysStartTime datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
)
WITH (SYSTEM_VERSIONING = ON);
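As written, SQL Server will create an anonymous history table automatically. A small variation (dbo.CustomerHistory is an example name, not from the original DDL) names the history table explicitly, which makes it easier to find and index:

-- Variation on the WITH clause above: name the history table instead of
-- letting SQL Server generate an anonymous one
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CustomerHistory));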

CREATE TABLE Sale
(
    SaleId int identity(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    SaleDateTime datetime2 NOT NULL,
    CustomerId int NOT NULL FOREIGN KEY REFERENCES Customer(CustomerID),
    SaleAmount decimal(10,2) NOT NULL
);

Where it gets interesting is that customers may have moved during the year, so the same customer may have different postal codes. It's even remotely possible that a customer moved away and then moved back, meaning there could be multiple history records for the same customer with the same postal code! My query of "sales by postal code" should calculate correct results regardless of how customers' postal codes change over time.
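For illustration (a sketch; CustomerID 42 and the postal codes are invented), every UPDATE moves the superseded row version into the history table, and FOR SYSTEM_TIME ALL shows all versions:

UPDATE Customer SET PostalCode = '90210' WHERE CustomerID = 42; -- customer moves away
UPDATE Customer SET PostalCode = '10001' WHERE CustomerID = 42; -- and later moves back

-- FOR SYSTEM_TIME ALL returns the current row plus every history row
SELECT CustomerID, PostalCode, SysStartTime, SysEndTime
FROM Customer FOR SYSTEM_TIME ALL
WHERE CustomerID = 42
ORDER BY SysStartTime;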

I understand how to use temporal tables to query the customer dimension alone (e.g. SELECT * FROM Customer FOR SYSTEM_TIME FROM '2014-01-01' TO '2015-01-01') but I'm not sure how to join to the fact table most accurately and efficiently.
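For a single instant, FOR SYSTEM_TIME AS OF returns exactly one version per customer, which is the semantics each individual sale needs, but every sale carries its own instant. A sketch of the single-instant form (the date is arbitrary):

-- One row per customer: the version that was effective at this instant
SELECT CustomerID, PostalCode
FROM Customer FOR SYSTEM_TIME AS OF '2014-06-15';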

Is this how I should query it?

SELECT c.PostalCode, SUM(s.SaleAmount) AS SaleAmount
FROM Customer FOR SYSTEM_TIME FROM '2014-01-01' TO '2015-01-01' AS c
    JOIN Sale s ON s.CustomerId = c.CustomerID
WHERE s.SaleDateTime >= '2014-01-01' AND s.SaleDateTime < '2015-01-01'
    -- pick the single customer version effective at the moment of the sale
    AND c.SysStartTime <= s.SaleDateTime
    AND c.SysEndTime > s.SaleDateTime
GROUP BY c.PostalCode
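Or should I expose every row version with FOR SYSTEM_TIME ALL and move the effective-dating predicate into the join, like this sketch (the period columns form a half-open [SysStartTime, SysEndTime) interval)?

SELECT c.PostalCode, SUM(s.SaleAmount) AS SaleAmount
FROM Sale s
    JOIN Customer FOR SYSTEM_TIME ALL AS c
        ON c.CustomerID = s.CustomerId
        -- customer version effective at the moment of the sale
        AND c.SysStartTime <= s.SaleDateTime
        AND c.SysEndTime > s.SaleDateTime
WHERE s.SaleDateTime >= '2014-01-01' AND s.SaleDateTime < '2015-01-01'
GROUP BY c.PostalCode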

And what are the performance considerations that I should watch out for when making queries like this?

Best Answer

I think in your case a derived table is necessary to isolate the customers whose postal code mutated during the period:

SELECT c.PostalCode
    , SUM(s.SaleAmount) AS SaleAmount
    , COUNT(postcode_mutations.CustomerID) AS CntCustomerChangedPostCode
FROM dbo.Sale s
JOIN dbo.Customer c ON s.CustomerId = c.CustomerID -- current row only, so sales roll up to today's postal code
-- customers who had more than one distinct postal code during 2014
LEFT JOIN (
    SELECT CustomerID
    FROM dbo.Customer
    FOR SYSTEM_TIME FROM '2014-01-01' TO '2015-01-01'
    GROUP BY CustomerID
    HAVING COUNT(DISTINCT PostalCode) > 1
) postcode_mutations ON s.CustomerId = postcode_mutations.CustomerID
WHERE s.SaleDateTime >= '2014-01-01' AND s.SaleDateTime < '2015-01-01'
GROUP BY c.PostalCode

Update: since the query is supposed to serve DWH/analytics scenarios, columnstore indexing is an option worth checking. I also ran some benchmarks previously on a 10-million-row table.
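For example (a sketch; the index name and column list are my assumptions), SQL Server 2016 supports an updatable nonclustered columnstore index alongside the rowstore clustered primary key, covering exactly the fact columns this query scans:

-- Column-oriented storage for the aggregation over billions of fact rows
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Sale
    ON dbo.Sale (SaleDateTime, CustomerId, SaleAmount);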