Data Warehouse – Strategies to Validate Slowly Changing Dimension

business-intelligence data-warehouse slowly-changing-dimensions ssas star-schema

I have a customer dimension in our data warehouse (SSAS 2014 Multidimensional). It is set up as a Type II SCD, with just Start Date and End Date fields to track changes (a null end date marks the current row).

As our daily load runs and I see the changes flowing in, I sometimes worry that it's creating new records when it shouldn't.

What kind of scripts can be run to compare each customer over its respective history and make sure it's changing appropriately?

Also, and I think I'm touching on the gaps and islands problem, what strategies are there for verifying that the start dates and end dates are all contiguous?

I want to feel confident that the table is reporting correct data day to day without becoming a beast due to some null check causing a ton of records to be created each day.

Thanks for any help you can offer.

Best Answer

Checking whether the dates are contiguous

You don't say which DBMS you're using here, but since you're on SSAS I'm guessing SQL Server. If you're on SQL Server 2012 or later, the LAG and LEAD window functions can be really handy for this kind of task. You can order the rows by start date (or by an incremental ID if you have one) and then use these functions to check values in the next or previous row.

So for instance, you could set up a window function partitioned by your business key/durable surrogate key (I'll just use the term durable key from here on out) and ordered by your start date (ascending). You could then use LEAD to check that the start date on the next row equals the end date + 1 on the current row. You can't use window functions in the WHERE clause, so you could wrap the check in a CASE expression or a CTE that flags rows belonging to the same customer (i.e. with the same durable key) where the next row's StartDate isn't equal to this row's EndDate + 1. You could do the same in reverse with LAG to check against the previous row as well. An outer query could then select only the rows which need investigation; there's a sketch of this below.
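To make that concrete, here's a minimal sketch of the contiguity check. The table and column names (dbo.DimCustomer, CustomerDurableKey, StartDate, EndDate) are placeholders for whatever your dimension actually uses, and it assumes whole-day date granularity:

```sql
-- Placeholder names: dbo.DimCustomer, CustomerDurableKey, StartDate, EndDate.
-- Flags rows where the next version doesn't start the day after this one ends.
WITH Ordered AS
(
    SELECT
        CustomerDurableKey,
        StartDate,
        EndDate,
        LEAD(StartDate) OVER (PARTITION BY CustomerDurableKey
                              ORDER BY StartDate) AS NextStartDate
    FROM dbo.DimCustomer
)
SELECT
    CustomerDurableKey,
    StartDate,
    EndDate,
    NextStartDate
FROM Ordered
WHERE (EndDate IS NULL AND NextStartDate IS NOT NULL)      -- a "current" row that isn't the latest version
   OR NextStartDate <> DATEADD(DAY, 1, EndDate);           -- gap or overlap between consecutive versions
```

The current (open-ended) row for each customer naturally drops out of the check, since its LEAD value is NULL.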

Checking the SCD is working correctly

To some extent the checks you can do are going to depend on your particular situation, but I'll try to tackle this from a generic standpoint that might be helpful to you (if you still need help, 7 months later!) and to other people having similar issues. If you return and give more specific detail, I'll be happy to read through it and improve my answer if possible.

One option might be to run some queries that compute a few COUNTs over different aspects of the dimension. For instance, you could group by the durable key and count how many rows exist for each one. If you don't have a feel for how many changes are reasonable, you could take the average count and then look for customers which are over that average (or over it by some percentage or set amount); there's a sketch of this below.
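Something along these lines, again with dbo.DimCustomer and CustomerDurableKey as placeholder names and an arbitrary "twice the average" threshold you'd tune to your data:

```sql
-- Count versions per customer and surface the ones well above the average.
WITH RowCounts AS
(
    SELECT
        CustomerDurableKey,
        COUNT(*) AS VersionCount
    FROM dbo.DimCustomer
    GROUP BY CustomerDurableKey
)
SELECT
    rc.CustomerDurableKey,
    rc.VersionCount
FROM RowCounts AS rc
CROSS JOIN (SELECT AVG(VersionCount * 1.0) AS AvgVersions FROM RowCounts) AS a
WHERE rc.VersionCount > a.AvgVersions * 2    -- arbitrary threshold: twice the average
ORDER BY rc.VersionCount DESC;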

From there, digging through the data belonging to a handful of suspicious-looking customers might turn up obvious issues which are worth further investigation. Without specifics I don't know what issues you might be running into, but as an example let's say you notice a customer with hundreds of rows, where a value is flipping back and forth between two states. From there, you could write some queries to look for that problem across the rest of the table - for instance, comparing the number of rows for each customer against the number of distinct rows for each customer (looking only at the actual attribute columns, so not the dates or any other metadata columns you might have); a rough query along those lines follows.
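Here's one way to sketch that rows-vs-distinct-rows comparison. Attribute1, Attribute2 and Attribute3 stand in for your real Type II attribute columns (the keys and dates are deliberately left out), and the '|' delimiter is just there to reduce accidental collisions when the values are concatenated:

```sql
-- Customers where the total number of versions exceeds the number of
-- genuinely distinct attribute combinations (a sign of flip-flopping rows).
SELECT
    CustomerDurableKey,
    COUNT(*) AS TotalVersions,
    COUNT(DISTINCT CONCAT(Attribute1, '|', Attribute2, '|', Attribute3)) AS DistinctVersions
FROM dbo.DimCustomer
GROUP BY CustomerDurableKey
HAVING COUNT(*) > COUNT(DISTINCT CONCAT(Attribute1, '|', Attribute2, '|', Attribute3))
ORDER BY COUNT(*) - COUNT(DISTINCT CONCAT(Attribute1, '|', Attribute2, '|', Attribute3)) DESC;
```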

Checking on whether the customer is changing correctly over time will depend on what historic data you have held in your source(s) or staging area. If you do have historic data in one or both of those places, you might be able to compare the values with your dimension rows either directly by date (if said historic data includes dates), or perhaps by the order those changes happened in.
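If you do happen to have dated history in a staging table, a point-in-time comparison might look something like the sketch below. It's speculative: stg.CustomerHistory, CustomerBusinessKey, SnapshotDate and Attribute1 are all made-up names standing in for whatever you actually have, and you'd repeat the attribute comparison for each Type II column:

```sql
-- For each dated snapshot in staging, find the dimension row that should have
-- been in effect on that date and flag any missing or mismatched values.
SELECT
    s.CustomerBusinessKey,
    s.SnapshotDate,
    s.Attribute1 AS SourceValue,
    d.Attribute1 AS DimensionValue
FROM stg.CustomerHistory AS s
LEFT JOIN dbo.DimCustomer AS d
       ON d.CustomerBusinessKey = s.CustomerBusinessKey
      AND s.SnapshotDate >= d.StartDate
      AND s.SnapshotDate <= COALESCE(d.EndDate, '9999-12-31')   -- null end date = current row
WHERE d.CustomerBusinessKey IS NULL          -- no dimension row covers this date at all
   OR d.Attribute1 <> s.Attribute1;          -- the value in effect doesn't match the source
```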


On a slight tangent, but worth mentioning: I'm guessing, from the fact that you're not sure whether this dimension is behaving correctly, that it's something you've inherited from another developer? If you're really worried about the size of this dimension and you don't know the full reason for its current design, it might be worth checking whether all of the Type II SCD attributes really need to be Type II.

Some data warehouse developers tend to presume Type II by default, which can be problematic. Aside from the space issues, overusing Type II can lead to unexpected results for the people consuming the data. This might not be relevant in your case (the Type II attributes might be well thought out and suitable for your users' requirements), but if it is, fixing the design might alleviate your concern about the size of the dimension and also remove the need for some of these data quality checks.