I am importing and cleaning data from a data source that has records like this:
The data needs sorting and rearranging based on the Event Type.
The destination schema would be something like this:
Create Table dbo.Destination
(
DestinationId int not Null primary key clustered,
EventType int not null,
fkyCustomerId int not null
Constraint FK_Destination_Customers References dbo.Customers (pkyCustomerId),
fkyCategoryId int not null
Constraint FK_Destination_Categories References dbo.Categories (pkyCategoryId),
fkyCompanyId int not null
Constraint FK_Destination_Companies References dbo.Companies(pkyCompanyId),
fkyLocationId int not null
Constraint FK_Destination_Locations References dbo.Locations(pkyLocationId),
Amount money not null
Constraint DF_Destination_Amount Default(0)
)
-- Lookup table
Create Table dbo.Categories
(
pkyCategoryId int not null primary key clustered,
SourceCategoryId char(10) Not Null,
CategoryName varchar(50) Not Null
)
-- Lookup table
Create Table dbo.Companies
(
pkyCompanyId int not null primary key clustered,
SourceCompanyId varchar(10) Not null,
CompanyName varchar(50) Not Null
)
-- Lookup Table
Create Table dbo.Locations
(
pkyLocationId int not null primary key clustered,
SourceLocationId varchar(10) Not Null,
LocationName varchar(50) Not Null
)
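For reference in the discussion below, the raw rows might land in a staging table shaped roughly like this. This is only a sketch: the actual source record layout is not shown here, so the table name, the column count, and the varchar(10) widths are all assumptions.

```sql
-- Hypothetical staging table for the raw feed; Col0..Col2 stand in for
-- however many generic columns the source actually carries.
Create Table dbo.Staging
(
EventType int not null,
Col0 varchar(10) null,
Col1 varchar(10) null,
Col2 varchar(10) null,
Amount money null
)
```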
For each Event Type, and there could be hundreds of event types, there are rules on what data exists in each of the Col columns. So for Event Type 1234, Col0 has the mapping for the Categories lookup and Col1 has the mapping for the Companies lookup. However, for Event Type 5000, Col0 contains the mapping for the Companies lookup and Col1 contains the mapping for the Locations lookup.
Also, there are situations where all of the Col columns will be null, which means I need to map these to default category, company, etc. values.
I need to design an import process that can handle the existing rules and potentially new ones. The import will handle approximately 2.5 million rows per day, and that volume is increasing.
What would be a good way to implement this kind of import? Nothing is set in stone here, apart from the source data, so I am more than willing to entertain the idea of changing the schema, etc.
The final goal is to transform the source data so it can be easily used for reporting purposes.
Best Answer
I think the biggest challenge will be doing the event type transformations efficiently, so I'm going to focus on that and ignore the schema completely.
My first thought (read: I've never actually tried this) is to use dynamic SQL to build up a SELECT statement that will do the data transformation such that values end up in their correct columns (i.e., Col0 --> CompanyId, or whatever). This boils down to a mapping table consisting of the event type, the source column, the target column, and a default value if required.

Once you have the mapping table (or tables, depending on how complex this actually is), you can use it to dynamically build a SQL statement of the form:
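Something along these lines, sketched for the two event types described in the question. The names dbo.Staging, dbo.Conformed, and the 'UNKNOWN' default are illustrative assumptions, not part of the original design:

```sql
-- Generated per event type: route each generic column into its
-- role-named column, substituting a default where the source is null.
Insert dbo.Conformed (EventType, SourceCategoryId, SourceCompanyId)
Select EventType, IsNull(Col0, 'UNKNOWN'), IsNull(Col1, 'UNKNOWN')
From   dbo.Staging
Where  EventType = 1234;

Insert dbo.Conformed (EventType, SourceCompanyId, SourceLocationId)
Select EventType, IsNull(Col0, 'UNKNOWN'), IsNull(Col1, 'UNKNOWN')
From   dbo.Staging
Where  EventType = 5000;
```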
Build the statement as part of the ETL process. If you need to add a new event type, just add the required column mappings to the mapping table, and they'll be handled the next time the ETL process runs.
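One possible shape for the mapping table and the build step, as a hedged sketch. All object and column names here are illustrative, and STRING_AGG requires SQL Server 2017+ (earlier versions can use FOR XML PATH instead):

```sql
-- Mapping table: one row per (event type, target column) pair.
Create Table dbo.EventTypeColumnMap
(
EventType int not null,
SourceColumn sysname not null,     -- e.g. 'Col0'
TargetColumn sysname not null,     -- e.g. 'SourceCompanyId'
DefaultValue varchar(10) not null, -- used when the source column is null
Constraint PK_EventTypeColumnMap Primary Key (EventType, TargetColumn)
)

Insert dbo.EventTypeColumnMap Values
(1234, 'Col0', 'SourceCategoryId', 'UNKNOWN'),
(1234, 'Col1', 'SourceCompanyId',  'UNKNOWN'),
(5000, 'Col0', 'SourceCompanyId',  'UNKNOWN'),
(5000, 'Col1', 'SourceLocationId', 'UNKNOWN');

-- Build one INSERT ... SELECT per event type. Both aggregates are
-- ordered by TargetColumn so the column and value lists stay aligned.
Declare @sql nvarchar(max);

Select @sql = String_Agg(Cast(stmt As nvarchar(max)), '; ')
From (
    Select 'Insert dbo.Conformed (EventType, '
         + String_Agg(QuoteName(TargetColumn), ', ')
               Within Group (Order By TargetColumn)
         + ') Select EventType, '
         + String_Agg('IsNull(' + QuoteName(SourceColumn) + ', '
               + QuoteName(DefaultValue, '''') + ')', ', ')
               Within Group (Order By TargetColumn)
         + ' From dbo.Staging Where EventType = '
         + Cast(EventType As varchar(10)) As stmt
    From dbo.EventTypeColumnMap
    Group By EventType
) s;

Exec sp_executesql @sql;
```

Adding a new event type then really is just new rows in dbo.EventTypeColumnMap; the builder picks them up on the next run.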
Once the source data is in this kind of state, the rest of the import process should be pretty standard.