Sql-server – Create Slowly Changing Dimension table from Repeated Data

data-warehouseperformanceperformance-tuningslowly-changing-dimensionsql serversql-server-2016

I am receiving transaction snapshot data from a file, it contains a history of repeated data. Currently trying to find Slowly Changing Dimensions on table with [ProductId] Business Key. Many attributes exist: ProductTitle, Category, this is a sample table, actually have around 10 more attributes. How do I create a Product Slowly Changing Dimension table query?

Searching for a performance optimized way, if I have 10 columns, not sure if Group By on 10 columns is optimal

With SQL 2016, is there a function to obtain this data? Should I use Lead/Lag Function? FirstValue/Last Value? New analytics syntax? An attempted query is below.

Note: Data comes from a 1970 legacy file system containing historical data.

Data:

create table dbo.Product
(
    ProductId int,
    ProductTitle varchar(55),
    ProductCategory varchar(255),
    Loaddate datetime
)

insert into dbo.Product
values 
 (1,'Table','ABCD','3/4/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/5/2018')
,(1,'Table','ABCD','3/6/2018')
,(1,'Table','XYZ','3/7/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/8/2018')
,(1,'Table','XYZ','3/9/2018')
,(1,'Table-Dinner', 'GHI','3/10/2018')
,(1,'Table-Dinner', 'GHI','3/11/2018')
....more data with ProductId =2,3,4, etc

Current Repeated Data in File:

Expected Output:

Attempted Query

(seems to be inefficient, especially when having 10 attribute columns)

select
    product.Productid
    ,product.ProductTitle
    ,product.ProductCategory
    ,min(product.LoadDate) as BeginDate
    ,case when max(product.LoadDate)  = (select max(subproduct.LoadDate) from dbo.Product subproduct where subproduct.productid = product.productid) then '12/31/9999' else max(product.loadDate) end as EndDate
from dbo.Product product
group by Productid, ProductTitle, ProductCategory

Best Answer

if I have 10 columns, not sure if Group By on 10 columns is optimal

Its true,Group By on so many columns is sub optimal.But there is no other way as per your data and requirement.

Window function with partition is worse than Group By.

See as per my understanding Group by on those columns is necessary to get correct output,so if you use Window Function then you have to use those columns in Partition by too .

Therefore Partition by on so any column is worse and also you have to use few more select like in example above.

So what you are already doing is nearly correct except that subquery part.

Once try this,

SELECT product.productid, 
       product.producttitle, 
       product.productcategory, 
       Min(product.loaddate) AS BeginDate 
       -- ,max(product.LoadDate) as BeginDate1 
       , 
       CASE 
         WHEN Max(product.loaddate) = Max(oa.enddate1) THEN '12/31/9999' 
         ELSE Max(product.loaddate) 
       END                   AS EndDate 
FROM   dbo.product product 
       CROSS apply(SELECT Max(subproduct.loaddate) EndDate1 
                   FROM   dbo.product subproduct 
                   WHERE  subproduct.productid = product.productid)oa 
GROUP  BY productid, 
          producttitle, 
          productcategory

If suppose your main query is really very very slow because of subquery or my cross apply then Can you divide your query in 2 steps ?

I think you should open more about your requirement.

How many rows will be updated at a time ?

If those selected rows are inserted/updated in new table then those rows should again not be selected next time when Insert happen.

What are you doing for this ?

If I am wrong about your requirement then let me know so that I correct my answer.

Best Answer

Related Solutions

Sql-server – Data Warehouse – Slowly Changing Dimensions with Many to Many Relationships

What known strategies, if any, can I use to validate a Slowly Changing Dimension

Checking whether the dates are contiguous

Checking the SCD is working correctly

Related Question