SQL Server – Advice on how to approach data cleaning & importing

business-intelligence, sql-server-2012, ssis

I am importing and cleaning data from a data source that has records like this:

[Source data sample: image in the original post]

The data needs sorting and rearranging based on the Event Type.

The destination schema would be something like this:

Create Table dbo.Destination
(
    DestinationId int Not Null Primary Key Clustered,
    EventType int Not Null,
    fkyCustomerId int Not Null
        Constraint FK_Destination_Customers References dbo.Customers (pkyCustomerId),
    fkyCategoryId int Not Null
        Constraint FK_Destination_Categories References dbo.Categories (pkyCategoryId),
    fkyCompanyId int Not Null
        Constraint FK_Destination_Companies References dbo.Companies (pkyCompanyId),
    fkyLocationId int Not Null
        Constraint FK_Destination_Locations References dbo.Locations (pkyLocationId),
    Amount money Not Null
        Constraint DF_Destination_Amount Default (0)
)
-- Lookup table
Create Table dbo.Categories
(
    pkyCategoryId int Not Null Primary Key Clustered,
    SourceCategoryId char(10) Not Null,
    CategoryName varchar(50) Not Null
)

-- Lookup table
Create Table dbo.Companies
(
    pkyCompanyId int Not Null Primary Key Clustered,
    SourceCompanyId varchar(10) Not Null,
    CompanyName varchar(50) Not Null
)

-- Lookup table
Create Table dbo.Locations
(
    pkyLocationId int Not Null Primary Key Clustered,
    SourceLocationId varchar(10) Not Null,
    LocationName varchar(50) Not Null
)

For each Event Type, and there could be hundreds of event types, there are rules governing what data appears in each of the Col columns (Col0, Col1, and so on).

So for Event Type 1234, Col0 has the mapping for the Categories lookup and Col1 has the mapping for the Companies lookup.

However, for Event Type 5000, Col0 contains the mapping for the Companies lookup and Col1 contains the mapping for Locations.

Also, there are situations where all of the Col columns are null, which means I need to map those rows to default category, company, etc. values.
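
For illustration, one way to represent those defaults is to seed each lookup table with a designated fallback row; the -1 keys and 'Unknown' names below are assumed placeholders, not part of the schema above:

-- Assumed fallback rows for null-only source records.
Insert dbo.Categories (pkyCategoryId, SourceCategoryId, CategoryName)
Values (-1, 'UNKNOWN', 'Unknown Category');

Insert dbo.Companies (pkyCompanyId, SourceCompanyId, CompanyName)
Values (-1, 'UNKNOWN', 'Unknown Company');

Insert dbo.Locations (pkyLocationId, SourceLocationId, LocationName)
Values (-1, 'UNKNOWN', 'Unknown Location');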

I need to design an import process that can handle the existing rules and accommodate new ones as they arise. The import will handle approximately 2.5 million rows per day, and that volume is increasing.

What would be a good way to implement this kind of import? Nothing is set in stone here apart from the source data, so I am more than willing to entertain the idea of changing the schema, etc.

The final goal is to transform the source data so it can be easily used for reporting purposes.

Best Answer

I think the biggest challenge will be doing the event type transformations efficiently, so I'm going to focus on that and ignore the schema completely.

My first thought (read: I've never actually tried this) is to use dynamic SQL to build up a SELECT statement that will do the data transformation such that values end up in their correct columns (i.e., Col0 --> CompanyId, or whatever). This boils down to a mapping table consisting of the event type, the source column, the target column, and a default value if required.
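
For illustration, such a mapping table might look like this (a sketch; the table and column names are assumptions, and the rules shown are the two from the question):

-- Hypothetical mapping table: one row per (event type, target column) rule.
Create Table dbo.EventColumnMappings
(
    EventType int Not Null,
    SourceColumn sysname Not Null,      -- e.g. 'Col0', 'Col1'
    TargetColumn sysname Not Null,      -- e.g. 'Category', 'Company', 'Location'
    DefaultValue varchar(50) Null,      -- substituted when the source column is null
    Constraint PK_EventColumnMappings Primary Key Clustered (EventType, TargetColumn)
)

-- The rules described in the question (default names assumed for illustration):
Insert dbo.EventColumnMappings (EventType, SourceColumn, TargetColumn, DefaultValue)
Values (1234, 'Col0', 'Category', 'DefaultCategory'),
       (1234, 'Col1', 'Company',  'DefaultCompany'),
       (5000, 'Col0', 'Company',  'DefaultCompany'),
       (5000, 'Col1', 'Location', 'DefaultLocation');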

Once you have the mapping table (or tables, depending on how complex this actually is), you can use it to dynamically build a SQL statement of the form:

SELECT
    (
        CASE EventType
            WHEN 1234 THEN COALESCE(Col0, 'DefaultCategory')
        END
    ) AS Category,
    (
        CASE EventType
            WHEN 1234 THEN COALESCE(Col1, 'DefaultCompany')
            WHEN 5000 THEN COALESCE(Col0, 'DefaultCompany')
        END
    ) AS Company,
    (
        CASE EventType
            WHEN 5000 THEN COALESCE(Col1, 'DefaultLocation')
        END
    ) AS Location
FROM RawSourceTable

Build the statement as part of the ETL process. If you need to add a new event type, just add the required column mappings to the mapping table, and they'll be handled the next time the ETL process runs.
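
For illustration, here is one way to generate the statement above from the hypothetical EventColumnMappings table sketched earlier. This is a sketch rather than a hardened implementation; it uses FOR XML PATH for string aggregation because STRING_AGG is not available in SQL Server 2012:

Declare @sql nvarchar(max);

-- Build one CASE expression per distinct target column,
-- with one WHEN branch per event type rule.
Select @sql = N'SELECT' + Stuff(
    (
        Select N',' + Char(10) + N'    (CASE EventType' + Char(10)
             + (
                   Select N'        WHEN ' + Cast(m.EventType As nvarchar(10))
                        + N' THEN '
                        + Case When m.DefaultValue Is Null
                               Then QuoteName(m.SourceColumn)
                               Else N'COALESCE(' + QuoteName(m.SourceColumn)
                                    + N', ' + QuoteName(m.DefaultValue, '''') + N')'
                          End
                        + Char(10)
                   From dbo.EventColumnMappings As m
                   Where m.TargetColumn = t.TargetColumn
                   Order By m.EventType
                   For Xml Path(''), Type
               ).value('.', 'nvarchar(max)')
             + N'    END) AS ' + QuoteName(t.TargetColumn)
        From (Select Distinct TargetColumn From dbo.EventColumnMappings) As t
        Order By t.TargetColumn
        For Xml Path(''), Type
    ).value('.', 'nvarchar(max)'), 1, 1, N'')
    + Char(10) + N'FROM RawSourceTable';

Print @sql;                  -- inspect the generated statement
Exec sys.sp_executesql @sql;

Because everything is driven by the mapping rows, a new event type only means new rows in the mapping table; no code changes are required.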

Once the source data is in this kind of state, the rest of the import process should be pretty standard.