We have an interesting problem around duplicate data. Our source files can contain multiple rows of data for a single employee benefit. As an example, a health care benefit could be split up and reported over 3 rows (3 is just an example – it could be many more).
The data is loaded into staging with no problems and will contain a mix of single and multiple benefit rows. What I have done thus far is to perform aggregation to find identify with multiple benefits (aggregate on employeeid and benefitid) and if it is > 1, move it to a separate staging table.
What I now need to do is analyse the staging table with multiple benefits and aggregate the data based on employeeid and benefitid. Let's assume a benefit has 3 rows.
The employeeid and benefitid will be the key to identify those rows. Other rows may include the benefit provider, yearly cost, eligible & benefit name (and many more but these cover the 4 types of aggregation required to output a single row).
- if there are multiple providers in the list of 3, we should set the value to "Multiple" otherwise it should be the distinct value
- yearly cost is simple…SUM of all 3 rows
- if any of the active flags are true, set the value to true (1 could be true and 2 could be false)
- if there are multiple benefit names, we just select the 1st non null value.
After the data is aggregated (3 rows into 1, applying the rules above), it is then merged with the single row benefits table.
Best Answer
Since you're using 2012+, you have windowing functions available to you which I think make this easier to solve.
Here's my set up.
I would then look at a query something like this
Picking through some of the tricks of the query