SQL Server – Combine 7 Sources Without UNION

normalizationsql serversql server 2014ssis

I have a source table that looks essentially like this:

  • EmployeeCode
  • WeekStartDate
  • HoursWorkedDay1
  • HoursWorkedDay2
  • HoursWorkedDay3
  • HoursWorkedDay4
  • HoursWorkedDay5
  • HoursWorkedDay6
  • HoursWorkedDay7

The actual table has something like 500 numbered columns (didn't really count them – there are various and numerous fields numbered 1-7, and then another handful numbered 1-25, times 7) per weekday (no, that's not my design), and there are currently something like 38,600 rows (growing every week).

I have an SSIS package that's trying to normalize this data… that currently looks like this:

union all

Each "source" is selecting one set of numbered columns from the same source table, and the UNION ALL component combines the 7 sources into one, resulting in some 258,900 rows.

The rest of the workflow adds some calculated columns, looks up surrogate keys (e.g. EmployeeCode is used to lookup an EmployeeId, and then the date is computed and used for looking up a TimeId), and then the "modified" rows get updated and the "new" ones get inserted into a normalized table; unchanged rows end up nowhere.

Is there any better way (e.g. a bit less heavy on memory pressure) to normalize the source data?

Best Answer

Without the full table definition, it is difficult to provide a perfect answer. However, in an attempt to show the differences in a limited repro, with a very small amount of data, I've created the following testbed:

IF OBJECT_ID('tempdb..#src') IS NOT NULL
DROP TABLE #src;
CREATE TABLE #src
(
    EmployeeCode INT NOT NULL
    , WeekStartDate DATE NOT NULL
    , HoursDay1 INT NOT NULL
    , HoursDay2 INT NOT NULL
    , HoursDay3 INT NOT NULL
    , HoursDay4 INT NOT NULL
    , HoursDay5 INT NOT NULL
    , HoursDay6 INT NOT NULL
    , HoursDay7 INT NOT NULL
    , Widget1Day1 INT NOT NULL
    , Widget1Day2 INT NOT NULL
    , Widget1Day3 INT NOT NULL
    , Widget1Day4 INT NOT NULL
    , Widget1Day5 INT NOT NULL
    , Widget1Day6 INT NOT NULL
    , Widget1Day7 INT NOT NULL
    , Widget2Day1 INT NOT NULL
    , Widget2Day2 INT NOT NULL
    , Widget2Day3 INT NOT NULL
    , Widget2Day4 INT NOT NULL
    , Widget2Day5 INT NOT NULL
    , Widget2Day6 INT NOT NULL
    , Widget2Day7 INT NOT NULL
    , PRIMARY KEY CLUSTERED (WeekStartDate, EmployeeCode)
);
GO
INSERT INTO #src (EmployeeCode, WeekStartDate
    , HoursDay1, HoursDay2, HoursDay3, HoursDay4, HoursDay5, HoursDay6, HoursDay7
    , Widget1Day1, Widget1Day2, Widget1Day3, Widget1Day4, Widget1Day5, Widget1Day6, Widget1Day7
    , Widget2Day1, Widget2Day2, Widget2Day3, Widget2Day4, Widget2Day5, Widget2Day6, Widget2Day7
    )
VALUES (1, '2016-09-18'
    , 0, 8, 8, 8, 8, 8, 0
    , ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID()))
    , ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID()))
    );

Below we are comparing the two queries; the first uses the CROSS APPLY method, detailed by me at SQLServerScience.com, and the second uses the UNION ALL method.

SELECT s.WeekStartDate
    , s.EmployeeCode
    , ItemsByDay.DayOfWeekName
    , ItemsByDay.HoursWorked
    , ItemsByDay.Widget1
    , ItemsByDay.Widget2
FROM #src s
CROSS APPLY (VALUES ('Sunday', HoursDay1, Widget1Day1, Widget2Day1)
    , ('Monday', HoursDay2, Widget1Day2, Widget2Day2)
    , ('Tuesday', HoursDay3, Widget1Day3, Widget2Day3)
    , ('Wednesday', HoursDay4, Widget1Day4, Widget2Day4)
    , ('Thursday', HoursDay5, Widget1Day5, Widget2Day5)
    , ('Friday', HoursDay6, Widget1Day6, Widget2Day6)
    , ('Saturday', HoursDay7, Widget1Day7, Widget2Day7)
    ) ItemsByDay(DayOfWeekName, HoursWorked, Widget1, Widget2);


SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Sunday'
    , s.HoursDay1
    , s.Widget1Day1
    , s.Widget2Day1
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Monday'
    , s.HoursDay2
    , s.Widget1Day2
    , s.Widget2Day2
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Tuesday'
    , s.HoursDay3
    , s.Widget1Day3
    , s.Widget2Day3
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Wednesday'
    , s.HoursDay4
    , s.Widget1Day4
    , s.Widget2Day4
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Thursday'
    , s.HoursDay5
    , s.Widget1Day5
    , s.Widget2Day5
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Friday'
    , s.HoursDay6
    , s.Widget1Day6
    , s.Widget2Day6
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Saturday'
    , s.HoursDay7
    , s.Widget1Day7
    , s.Widget2Day7
FROM #src s;

First thing to note, the CROSS APPLY is easier to look at. This already makes me happy.

Lets check the execution plans for the two variants:

enter image description here

The UNION ALL variant scans the source table 7 times, whereas the CROSS APPLY uses a single table scan. By using the cross apply, we're #Winning.

Let's add more data:

/* create a table with 2 years worth of week start dates */
IF OBJECT_ID('tempdb..#Weeks') IS NULL
BEGIN
    CREATE TABLE #Weeks
    (
        WeekStart DATE NOT NULL
        PRIMARY KEY CLUSTERED
    );

    INSERT INTO #Weeks (WeekStart)
    SELECT TOP(104) DATEADD(DAY, (ROW_NUMBER() OVER (ORDER BY o1.name) - 1) * 7, '2016-01-03')
    FROM sys.objects o1
        CROSS JOIN sys.objects o2;
END

/* remove the single row from the source table we inserted above */
TRUNCATE TABLE #src;

/* insert a load of rows into the #src table */
INSERT INTO #src (EmployeeCode, WeekStartDate, HoursDay1, HoursDay2, HoursDay3, HoursDay4, HoursDay5, HoursDay6, HoursDay7)
SELECT ABS(CHECKSUM(NEWID()))
    , w.WeekStart
    , 0, 8, 8, 8, 8, 8, 0
FROM #Weeks w
    CROSS JOIN sys.objects o1;

On my system the above code generated around 85,000 rows. The plans for the two queries are now:

enter image description here

SQL Sentry Plan Explorer shows the following summary information, which is invaluable:

enter image description here

This says the CPU is used more intensively by the CROSS APPLY, however there is 7 times more I/O used by the UNION ALL variant.