SSIS Get max value in column and create new column

ssis

I'm importing two data sources. The problem is that one data source has a date, the other does not. Everything in the data flow has the same date and is inserted as one transaction into the same destination table.

Source A: [Calendar Date], [Item Number], [Units Sold]

Source B: [Item Number], [Inventory Units]

Final Output: [Item Number],[Calendar Date],[Units Sold],[Inventory Units]

Source A & B always refer to the same date. Therefore I want MAX([Calendar Date]) applied to every row in the final output. I just can't figure out how to do that in SSIS

Update: Possible Solution?

So what I did was add a temp id of 1 to every row following the merge of A&B. I then split the data, grouped by the MAX(DATE) which in this case is the only date. Then I joined back both sides on 1=1 thus giving each row the single value of MAX(DATE). Is this the recommended approach?

enter image description here

Best Answer

I took a different approach and well, mine's going to be a bit easier to maintain. ;)

I have created a Variable named SourceDate. I'm going to run a query against my Source A table and derived the maximum date in there. Due to data type quirkiness, even though I used a date datatype, SSIS 2008 freaks out and thinks I'm changing data type so I explicitly cast it to datetime

My control flow looks like the following.

enter image description here

SQL Derive MAX date

I return a single result set and assign into my variable ordinal position 0. This gives me the thing I need to ensure I'm only pulling current data out of table A and since it's the same for all values, it'll used after the merge join.

SELECT
    CAST(MAX(SA.CalendarDate) AS datetime) AS CalendarDate
FROM
    dbo.SourceA AS SA;

In my data flow, I'm going to use two OLE DB Sources. If this were all on the same source, I'd have you craft the appropriate query there and be done with it but this approach allows you to use two different servers and/or RDBMSes.

Source Query A

Since I like to make things complex, I assumed that Source A could have many days worth of data in there but we only wanted the most recent. Therefore, I'm going to use a parameter of ? and map that to my Variable @[User::SourceDate]

SELECT
    SA.ItemNumber AS ItemNumberA
,   SA.UnitsSold
FROM
    dbo.SourceA AS SA
WHERE
    SA.CalendarDate = ?
ORDER BY
    SA.ItemNumber ASC;

Source Query B

SELECT
    SB.ItemNumber AS ItemNumberB
,   SB.InventoryUnits
FROM
    dbo.SourceB AS SB
ORDER BY
    SB.ItemNumber ASC;

The important thing is that for a Merge Join to work, you must have sorted data. Sort in your database. It has indexes which might mean the data is already sorted for you. But, when you do that, you have to tell SSIS that this data is sorted. You do this by going to the Advanced tab after right clicking on the OLE DB Source component. You mark the Output as Sorted and then you need to identify the SortKey position. Since I'm sorting on ItemNumber I will put a value of 1 there.

In my source queries, note that I'm renaming ItemNumber. After the join, I don't know if my Left, right or Both sources had an ItemNumber.

I merge both data sets together based on the ItemNumber matching and use a type of Full Outer Join

I then add in Derived columns.

The first adds my Variable @[User::SourceDate] as a column called CalendarDate.

The second derived column helps me render a single column called ItemNumber I use an expression to pick the first non-null version of my two columns ItemNumberA and ItemNumberB

ISNULL([ItemNumberA])? [ItemNumberB] : [ItemNumberA]

My Data Flow ends up looking like

enter image description here

Database setup

I spun up a little database with some data to demo against. Nothing fancy here.

CREATE TABLE dbo.SourceA
(
    CalendarDate date NOT NULL
,   ItemNumber int NOT NULL
,   UnitsSold int NOT NULL
);

CREATE TABLE dbo.SourceB
(
    ItemNumber int NOT NULL
,   InventoryUnits int NOT NULL
);

-- add 3 values, 2 for today, one for yesterday
INSERT INTO
    dbo.SourceA
(
    CalendarDate
,   ItemNumber
,   UnitsSold
)
VALUES
    (CURRENT_TIMESTAMP,100,10)
,   (CURRENT_TIMESTAMP,200,20)
,   (DATEADD(d, -1, CURRENT_TIMESTAMP),100,10);

-- Add inventory. Something for items sold + something not sold
INSERT INTO dbo.SourceB
(
    ItemNumber
,   InventoryUnits
)
VALUES
    (100, 30)
,   (200, 10)
,   (300, 7);

Biml

Biml, the Business Intelligence Markup Language, describes the platform for business intelligence. Here, we're going to use it to describe the ETL. BIDS Helper, is a free add on for Visual Studio/BIDS/SSDT that addresses a host of shortcomings with it. Specifically, we're going to use the ability to transform a Biml file describing ETL into an SSIS package. This has the added benefit of providing you a mechanism for being able to generate exactly the solution I'm describing versus clicking through many tedious dialogue boxes.

So, install BIDS Helper if you haven't already. Run my table create scripts. Fix the Connection String to point to wherever you created the tables and Whoosh, you have my solution. Adapt for your actual table and servers.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <OleDbConnection ConnectionString="Provider=SQLNCLI11;Data Source=localhost\dev2014;Integrated Security=SSPI;Initial Catalog=tempdb" Name="CM_OLE" />
    </Connections>
    <Packages>
        <Package Name="dba_77666" ConstraintMode="Linear">
            <Variables>
                <Variable DataType="DateTime" Name="SourceDate">2014-01-01</Variable>
            </Variables>
            <Tasks>
                <ExecuteSQL ConnectionName="CM_OLE" Name="SQL Derive MAX date" ResultSet="SingleRow">
                    <DirectInput>
SELECT
    CAST(MAX(SA.CalendarDate) AS datetime) AS CalendarDate
FROM
    dbo.SourceA AS SA;
                    </DirectInput>
                    <Results>
                        <Result Name="0" VariableName="User.SourceDate" />
                    </Results>
                </ExecuteSQL>
                <Dataflow Name="DFT Make Data">
                    <Transformations>
                        <OleDbSource ConnectionName="CM_OLE" Name="OLE_SRC SourceA">
                            <DirectInput>
SELECT
    SA.ItemNumber AS ItemNumberA
,   SA.UnitsSold
FROM
    dbo.SourceA AS SA
WHERE
    SA.CalendarDate = ?
ORDER BY
    SA.ItemNumber ASC;

                            </DirectInput>
                            <Columns>
                                <Column SourceColumn="ItemNumberA" SortKeyPosition="1"></Column>
                            </Columns>
                            <Parameters>
                                <Parameter Name="0" VariableName="User.SourceDate" />
                            </Parameters>
                        </OleDbSource>
                        <OleDbSource ConnectionName="CM_OLE" Name="OLE_SRC SourceB">
                            <DirectInput>
SELECT
    SB.ItemNumber AS ItemNumberB
,   SB.InventoryUnits
FROM
    dbo.SourceB AS SB
ORDER BY
    SB.ItemNumber ASC;

                            </DirectInput>
                            <Columns>
                                <Column SourceColumn="ItemNumberB" SortKeyPosition="1"></Column>
                            </Columns>
                        </OleDbSource>

                        <MergeJoin JoinType="FullOuterJoin" Name="MJ Mash all">
                            <LeftInputPath OutputPathName="OLE_SRC SourceA.Output" />
                            <RightInputPath OutputPathName="OLE_SRC SourceB.Output" />
                            <JoinKeys>
                                <JoinKey LeftColumn="ItemNumberA" RightColumn="ItemNumberB"></JoinKey>
                            </JoinKeys>
                        </MergeJoin>


                        <DerivedColumns Name="DER Add Calendar Date">
                            <Columns>
                                <Column DataType="Date" Name="CalendarDate">@[User::SourceDate]</Column>
                            </Columns>
                        </DerivedColumns>

                        <!--
                        We have two ItemNumbers, One of them is populated, possibly both.
                        Create a single column that has the ItemNumber
                        -->
                        <DerivedColumns Name="DER Unified ItemNumber">
                            <Columns>
                                <Column DataType="Int32" Name="ItemNumber">ISNULL([ItemNumberA])? [ItemNumberB] : [ItemNumberA]</Column>
                            </Columns>
                        </DerivedColumns>
                        <!--  
                        Perform any NULL handlings here
                        -->

                        <DerivedColumns Name="DER Fix outstanding nulls">

                        </DerivedColumns>
                    </Transformations>
                </Dataflow>
            </Tasks>
        </Package>
    </Packages>
</Biml>

Update

With the updated information that this will be coming from a file source, the package will change in 3 places.

The first will "SQL Derive MAX date". Instead of using an Execute SQL Task, you will need to use another Data Flow and wire the Source to an Aggregate transformation. Find the MAX date and then use the ... variable sampling thing to assign 1 row into our Variable (or use a Script Component, which is what I'd do).

If the source file that has a date could hold a mix of older and newer data, then you'd need to add a Conditional Split transformation after the data source A to filter out anything older than Max datetime derived in addendum 1 in "DFT Make Data".

Finally, you will need to add a Sort transform immediately after both of your data sources/Conditional Split in the "DFT Make Data" step.

1-2-3, solution holds true. Too lazy to update the Biml as I don't know your file definitions.

Future reader note, this does force you to reprocess one of your source files. If it's a few hundred/thousand/million rows, depending on your IO subsystem, that may not be a concern. If you have a slow IO sub, then you'll need to investigate alternate mechanism of