SQL Server – Combine 7 Sources Without UNION

normalizationsql serversql server 2014ssis

I have a source table that looks essentially like this:

EmployeeCode
WeekStartDate
HoursWorkedDay1
HoursWorkedDay2
HoursWorkedDay3
HoursWorkedDay4
HoursWorkedDay5
HoursWorkedDay6
HoursWorkedDay7

The actual table has something like 500 numbered columns (didn't really count them – there are various and numerous fields numbered 1-7, and then another handful numbered 1-25, times 7) per weekday (no, that's not my design), and there are currently something like 38,600 rows (growing every week).

I have an SSIS package that's trying to normalize this data… that currently looks like this:

union all

Each "source" is selecting one set of numbered columns from the same source table, and the UNION ALL component combines the 7 sources into one, resulting in some 258,900 rows.

The rest of the workflow adds some calculated columns, looks up surrogate keys (e.g. EmployeeCode is used to lookup an EmployeeId, and then the date is computed and used for looking up a TimeId), and then the "modified" rows get updated and the "new" ones get inserted into a normalized table; unchanged rows end up nowhere.

Is there any better way (e.g. a bit less heavy on memory pressure) to normalize the source data?

Best Answer

Without the full table definition, it is difficult to provide a perfect answer. However, in an attempt to show the differences in a limited repro, with a very small amount of data, I've created the following testbed:

IF OBJECT_ID('tempdb..#src') IS NOT NULL
DROP TABLE #src;
CREATE TABLE #src
(
    EmployeeCode INT NOT NULL
    , WeekStartDate DATE NOT NULL
    , HoursDay1 INT NOT NULL
    , HoursDay2 INT NOT NULL
    , HoursDay3 INT NOT NULL
    , HoursDay4 INT NOT NULL
    , HoursDay5 INT NOT NULL
    , HoursDay6 INT NOT NULL
    , HoursDay7 INT NOT NULL
    , Widget1Day1 INT NOT NULL
    , Widget1Day2 INT NOT NULL
    , Widget1Day3 INT NOT NULL
    , Widget1Day4 INT NOT NULL
    , Widget1Day5 INT NOT NULL
    , Widget1Day6 INT NOT NULL
    , Widget1Day7 INT NOT NULL
    , Widget2Day1 INT NOT NULL
    , Widget2Day2 INT NOT NULL
    , Widget2Day3 INT NOT NULL
    , Widget2Day4 INT NOT NULL
    , Widget2Day5 INT NOT NULL
    , Widget2Day6 INT NOT NULL
    , Widget2Day7 INT NOT NULL
    , PRIMARY KEY CLUSTERED (WeekStartDate, EmployeeCode)
);
GO
INSERT INTO #src (EmployeeCode, WeekStartDate
    , HoursDay1, HoursDay2, HoursDay3, HoursDay4, HoursDay5, HoursDay6, HoursDay7
    , Widget1Day1, Widget1Day2, Widget1Day3, Widget1Day4, Widget1Day5, Widget1Day6, Widget1Day7
    , Widget2Day1, Widget2Day2, Widget2Day3, Widget2Day4, Widget2Day5, Widget2Day6, Widget2Day7
    )
VALUES (1, '2016-09-18'
    , 0, 8, 8, 8, 8, 8, 0
    , ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID()))
    , ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID()))
    );

Below we are comparing the two queries; the first uses the CROSS APPLY method, detailed by me at SQLServerScience.com, and the second uses the UNION ALL method.

SELECT s.WeekStartDate
    , s.EmployeeCode
    , ItemsByDay.DayOfWeekName
    , ItemsByDay.HoursWorked
    , ItemsByDay.Widget1
    , ItemsByDay.Widget2
FROM #src s
CROSS APPLY (VALUES ('Sunday', HoursDay1, Widget1Day1, Widget2Day1)
    , ('Monday', HoursDay2, Widget1Day2, Widget2Day2)
    , ('Tuesday', HoursDay3, Widget1Day3, Widget2Day3)
    , ('Wednesday', HoursDay4, Widget1Day4, Widget2Day4)
    , ('Thursday', HoursDay5, Widget1Day5, Widget2Day5)
    , ('Friday', HoursDay6, Widget1Day6, Widget2Day6)
    , ('Saturday', HoursDay7, Widget1Day7, Widget2Day7)
    ) ItemsByDay(DayOfWeekName, HoursWorked, Widget1, Widget2);


SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Sunday'
    , s.HoursDay1
    , s.Widget1Day1
    , s.Widget2Day1
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Monday'
    , s.HoursDay2
    , s.Widget1Day2
    , s.Widget2Day2
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Tuesday'
    , s.HoursDay3
    , s.Widget1Day3
    , s.Widget2Day3
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Wednesday'
    , s.HoursDay4
    , s.Widget1Day4
    , s.Widget2Day4
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Thursday'
    , s.HoursDay5
    , s.Widget1Day5
    , s.Widget2Day5
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Friday'
    , s.HoursDay6
    , s.Widget1Day6
    , s.Widget2Day6
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Saturday'
    , s.HoursDay7
    , s.Widget1Day7
    , s.Widget2Day7
FROM #src s;

First thing to note, the CROSS APPLY is easier to look at. This already makes me happy.

Lets check the execution plans for the two variants:

The UNION ALL variant scans the source table 7 times, whereas the CROSS APPLY uses a single table scan. By using the cross apply, we're #Winning.

Let's add more data:

/* create a table with 2 years worth of week start dates */
IF OBJECT_ID('tempdb..#Weeks') IS NULL
BEGIN
    CREATE TABLE #Weeks
    (
        WeekStart DATE NOT NULL
        PRIMARY KEY CLUSTERED
    );

    INSERT INTO #Weeks (WeekStart)
    SELECT TOP(104) DATEADD(DAY, (ROW_NUMBER() OVER (ORDER BY o1.name) - 1) * 7, '2016-01-03')
    FROM sys.objects o1
        CROSS JOIN sys.objects o2;
END

/* remove the single row from the source table we inserted above */
TRUNCATE TABLE #src;

/* insert a load of rows into the #src table */
INSERT INTO #src (EmployeeCode, WeekStartDate, HoursDay1, HoursDay2, HoursDay3, HoursDay4, HoursDay5, HoursDay6, HoursDay7)
SELECT ABS(CHECKSUM(NEWID()))
    , w.WeekStart
    , 0, 8, 8, 8, 8, 8, 0
FROM #Weeks w
    CROSS JOIN sys.objects o1;

On my system the above code generated around 85,000 rows. The plans for the two queries are now:

SQL Sentry Plan Explorer shows the following summary information, which is invaluable:

This says the CPU is used more intensively by the CROSS APPLY, however there is 7 times more I/O used by the UNION ALL variant.

Related Solutions

Sql-server – Query to normalize table/combine row text

This should work, I will clean it up later so its more efficient.

DECLARE @Old TABLE ( 
  id         INT, 
  rank       INT, 
  linenumber INT, 
  sometext   VARCHAR(1000)) 
DECLARE @New TABLE ( 
  id           INT, 
  rank         INT, 
  combinedtext VARCHAR(1000)) 


;WITH combinedresults(ctid, id, rank, linenumber, combinedtext) 
     AS (SELECT 0, 
                id, 
                rank, 
                linenumber, 
                CAST (sometext AS VARCHAR(8000)) 
         FROM   @Old o 
         WHERE  NOT EXISTS (SELECT TOP 1 1 
                            FROM   @Old 
                            WHERE  id = o.id 
                                   AND rank = o.rank 
                                   AND linenumber < o.linenumber) 
         UNION ALL 
         SELECT ctid + 1, 
                o.id, 
                o.rank, 
                o.linenumber, 
                ct.combinedtext + o.sometext 
         FROM   @Old o 
                INNER JOIN combinedresults ct 
                  ON ct.id = o.id 
                     AND ct.rank = o.rank 
         WHERE  o.linenumber > ct.linenumber) 

UPDATE n 
SET    combinedtext = ct.combinedtext 
FROM   @New n 
       INNER JOIN (SELECT n.id, 
                          n.rank, 
                          MAX(o.rank) orank 
                   FROM   @new n 
                          INNER JOIN @Old o 
                            ON n.id = o.id 
                               AND o.rank <= n.rank 
                   GROUP  BY n.id, 
                             n.rank) r 
         ON n.id = r.id 
            AND n.rank = r.rank 
       INNER JOIN (SELECT id, 
                          ct.rank, 
                          MAX(ctid) ctid 
                   FROM   combinedresults ct 
                   GROUP  BY ct.id, 
                             ct.rank) r2 
         ON r2.id = r.id 
            AND r2.rank = r.orank 
       INNER JOIN combinedresults ct 
         ON r.id = ct.id 
            AND ct.rank = r.orank 
            AND ct.ctid = r2.ctid 

SELECT * 
FROM   @New

How to normalize data based only on the repetitions

I've had to reverse engineer several existing complex data sets. The most important thing to establish are the keys and dependencies in the data. The problem is frankly NP hard, so some intuition and inspection will go a long way to getting you to a sensible answer, so don't count on a simple "turn the crank" solution unless you've got a lot of time on your hands.

What you need to do is to query the data a column at a time and by combinations of columns. You want frequency distributions for column values (and column combination values). Columns (or combos) with maximum frequencies of 1 are candidate keys.

You can also look at frequency distributions of combinations of columns to find potential hierarchies. In your example each value in Col2 only ever has one value in Col1 and so forth.

When you identify candidate keys and dependencies between columns you can apply normalization.

EDIT: In response to OP's quesiton in comments:

This is a query that would tell you whether or not one column may have a functional dependency on another column:

select COL2 
from MYSTERY
group by COL2
having count(distinct(COL1)) <> 1

If COL2 has a functional dependency on COL1 then this query would return 0 rows. This is because every value of COL2 has exactly 1 corresponding value of COL1.

Best Answer

Related Solutions

Sql-server – Query to normalize table/combine row text

How to normalize data based only on the repetitions

Related Question