SSIS – Aggregating Duplicate Data

ssis

We have an interesting problem around duplicate data. Our source files can contain multiple rows of data for a single employee benefit. As an example, a health care benefit could be split up and reported over 3 rows (3 is just an example – it could be many more).

The data is loaded into staging with no problems and will contain a mix of single and multiple benefit rows. What I have done thus far is to perform aggregation to find identify with multiple benefits (aggregate on employeeid and benefitid) and if it is > 1, move it to a separate staging table.

What I now need to do is analyse the staging table with multiple benefits and aggregate the data based on employeeid and benefitid. Let's assume a benefit has 3 rows.

The employeeid and benefitid will be the key to identify those rows. Other rows may include the benefit provider, yearly cost, eligible & benefit name (and many more but these cover the 4 types of aggregation required to output a single row).

  1. if there are multiple providers in the list of 3, we should set the value to "Multiple" otherwise it should be the distinct value
  2. yearly cost is simple…SUM of all 3 rows
  3. if any of the active flags are true, set the value to true (1 could be true and 2 could be false)
  4. if there are multiple benefit names, we just select the 1st non null value.

After the data is aggregated (3 rows into 1, applying the rules above), it is then merged with the single row benefits table.

Best Answer

Since you're using 2012+, you have windowing functions available to you which I think make this easier to solve.

Here's my set up.

IF NOT EXISTS
(
    SELECT * 
    FROM 
        sys.tables AS T 
        INNER JOIN 
            sys.schemas AS S 
            ON s.schema_id = t.schema_id
    WHERE 
        S.name ='dbo' AND T.name = 'StagingBenefit'
)
BEGIN
    CREATE TABLE 
        dbo.StagingBenefit
    (
        StagingBenefitSK int identity(1,1) NOT NULL

    ,   EmployeeID int NOT NULL
    ,   BenefitID int NOT NULL

    ,   BenefitProvider char(1) NOT NULL
    ,   Cost int NOT NULL
    ,   FlagA bit NOT NULL
    ,   FlagB bit NOT NULL
    ,   FlagC bit NOT NULL
    ,   BenefitName varchar(30) NULL
    ,   Scenario varchar(30) NOT NULL
    );
END

TRUNCATE TABLE dbo.StagingBenefit;

INSERT INTO
    dbo.StagingBenefit
SELECT
*
FROM
(
    VALUES
        (1,1, 'A', 100, 0, 0, 0, 'Teeth', 'SingleRow')
    ,   (2,2, 'A', 200, 0, 0, 0, 'Teeth', 'DoubleRow, Samesies')
    ,   (2,2, 'A', 222, 0, 0, 0, 'Teeth', 'DoubleRow, Samesies')
    ,   (3,3, 'A', 300, 0, 0, 0, 'Teeth', 'ThreeRow,DoubleFlagged')
    ,   (3,3, 'A', 330, 1, 0, 0, 'Elbow', 'ThreeRow,DoubleFlagged')
    ,   (3,3, 'A', 333, 0, 1, 0, 'Elbow', 'ThreeRow,DoubleFlagged')
    ,   (4,4, 'A', 400, 0, 0, 0, 'Teeth', 'MultipleProviders')
    ,   (4,4, 'A', 440, 0, 0, 0, 'First', 'MultipleProviders')
    ,   (4,4, 'B', 444, 0, 0, 1,    NULL, 'MultipleProviders')
)D
(
    EmployeeID
,   BenefitID

,   BenefitProvider
,   Cost
,   FlagA
,   FlagB
,   FlagC
,   BenefitName
,   Scenario
)
;

I would then look at a query something like this

SELECT DISTINCT
    SB.EmployeeID
,   SB.BenefitID
,   
    CASE 
        WHEN
            MIN(SB.BenefitProvider) OVER (PARTITION BY SB.EmployeeID, SB.BenefitID) 
            <> MAX(SB.BenefitProvider) OVER (PARTITION BY SB.EmployeeID, SB.BenefitID)
            THEN 'Multiple'
        ELSE
            MIN(SB.BenefitProvider) OVER (PARTITION BY SB.EmployeeID, SB.BenefitID) 
        END AS BenefitProvider
    -- SUM this
,   SUM(SB.Cost) OVER (PARTITION BY SB.EmployeeID, SB.BenefitID) As TotalCost
,   CAST(MAX(CAST(SB.FlagA AS int)) OVER (PARTITION BY SB.EmployeeID, SB.BenefitID) AS bit) AS FlagA
,   CAST(MAX(CAST(SB.FlagB AS int)) OVER (PARTITION BY SB.EmployeeID, SB.BenefitID) AS bit) AS FlagB
,   CAST(MAX(CAST(SB.FlagC AS int)) OVER (PARTITION BY SB.EmployeeID, SB.BenefitID) AS bit) AS FlagC
,   MIN(SB.BenefitName) OVER 
    (
        PARTITION BY SB.EmployeeID, SB.BenefitID 
    ) AS FirstBenefit
,   SB.Scenario
FROM
    dbo.StagingBenefit AS SB

Picking through some of the tricks of the query

  • DISTINCT I feel this is the dirtiest part of it but it solves the problem
  • MIN/MAX BenefitProvider - I find the first and last for my window (employee and benefit ids) and if they are different, which I determine via the CASE expression, I use the text Multiple. Otherwise, I just grabbed the minimum but max would work just as well. Heck, I probably could have just skipped the second aggregation call and just made it column name.
  • MAX Flag - I made this as hard as I could by using a bit data type which you cannot aggregate on so I explicitly convert to an integer and then cast the results of that back to a bit. Same windowing though. Net result is if anywhere my flag is set to true, it'll get preserved across rows
  • MIN BenefitName - the rule here is to pick the first non-null benefit name. The aggregate function eliminates the NULL and returns the text "First"