Sql-server – SQL Server : ETL Stage Index

data-warehouse, etl, index, sql server, ssis

We have a very large staging table (> 80 GB). We load invoice data from our source system into the staging table. From staging, we transform the data and load it into the DWH fact tables. Every day we delete the current month, then reload it from the source into the stage. The stage contains the complete history over time.

Some DW loads only need the current month. Others need the current and previous year.

What is a better index strategy:

Clustered index on a date column (Fiscal Period)
Primary key with IDENTITY as surrogate key
Clustered index for natural key (some kind of line item e.g. invoice number)

All queries filter on the date column (Fiscal Period) and sometimes on additional columns such as Invoice type, which have non-clustered indexes. During the ETL we can disable the non-clustered indexes, but not the clustered index.

Which of the three types has the best performance for:

Insert into Stage table
Query the Stage table

Best Answer

If it is possible for you to use partitioning, I would highly recommend leveraging it in this situation, since you know that you're working with periodic data loads which fall into months or years. With partitioning, you can reserve your indexes for the other columns you may need, such as InvoiceNo. Let partitioning take care of isolating which month or period you are working with.

If you are fairly sure that the date is an integral part of your joins or searches, then I would associate a "smartkey" with your loading strategy and place a clustered index on that instead.

If you are truly determined to use the date for an index seek, then I would place a non-clustered index on this field, as you'll likely have to scan rows in your use case anyway. Keep in mind that creating a clustered index on the date and then incurring updates or deletes will fragment your table.
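To make the idea of isolating a period concrete, here is a minimal conceptual sketch in Python. It is not SQL Server partitioning itself, just an in-memory stand-in; the class name `PartitionedStage` and the "YYYY-MM" period keys are assumptions made for illustration.

```python
class PartitionedStage:
    """Toy in-memory stand-in for a staging table partitioned by fiscal period.

    Hypothetical illustration only: real partitioning is done by the
    database engine, not by application code like this.
    """

    def __init__(self):
        # One bucket of rows per fiscal period, keyed e.g. by "YYYY-MM".
        self._partitions = {}

    def reload_period(self, period, rows):
        # Replace a single period wholesale; every other period is untouched.
        self._partitions[period] = list(rows)

    def rows_for(self, periods):
        # Read only the requested periods instead of scanning everything.
        return [row for p in periods for row in self._partitions.get(p, [])]
```

The daily "delete the current month, then reload" step from the question corresponds to one `reload_period` call, and a load that only needs the current month corresponds to `rows_for` with a single period.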

Related Solutions

Sql-server – ETL: extraction strategy for 200 source databases

If you have 200 identical sources then you can parameterise an SSIS package with the data source and kick off multiple threads. These can be controlled within the package by a foreach loop or from an external source that kicks off the extractors with a parameter.
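The "external source that kicks off the extractors with a parameter" could look roughly like the Python sketch below. This is only an illustration of the pattern, not SSIS itself; `extract_source` is a hypothetical placeholder for whatever actually runs the parameterised package, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_source(source_name):
    # Hypothetical placeholder: in practice this would launch the
    # parameterised extract for one source database.
    return f"extracted:{source_name}"


def run_extracts(source_names, max_workers=8):
    # Run the same extract against every source, a few at a time.
    names = list(source_names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(extract_source, names))
    # Map each source name to the result of its extract.
    return dict(zip(names, results))
```

Capping `max_workers` well below 200 keeps the number of simultaneous extracts bounded.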

You could consider a full load for relatively small dimensional sources and an incremental load for transactional data. This would require you to have persistent dimensions, but this is fairly straightforward to do with MERGE operations, or a pre-load area and dimension handler if you need slowly-changing dimensions.

You may wish to consider giving each source its own staging area (maybe a schema for each source in the staging database). This eliminates locking issues on the staging tables. Build a set of views over the staging tables (essentially just a set of unions that correspond to each of the source tables) that includes data source information. These can be generated fairly easily, so you don't have to manually cut and paste 200 different queries into the union. Once you've staged the data, the ETL process can read the whole lot from the view.
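As a rough sketch of generating such a view, the Python function below builds the SQL text for one staging table. The schema names, the `source_system` column, and the choice of UNION ALL (rather than UNION) are my assumptions; the names are trusted and not sanitised, so treat this as a starting point only.

```python
def build_union_view(view_name, table_name, source_schemas):
    # One SELECT per source schema, tagged with the schema it came from.
    selects = [
        f"SELECT '{schema}' AS source_system, * FROM [{schema}].[{table_name}]"
        for schema in source_schemas
    ]
    # UNION ALL is assumed here: rows from different sources are kept as-is.
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE VIEW {view_name} AS\n{body};"
```

For example, `build_union_view("stage.v_invoice", "invoice", ["src001", "src002"])` returns a CREATE VIEW statement with two SELECTs joined by one UNION ALL.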

This allows the ETL to run in one hit, although you will have to come up with a strategy to deal with extract failures from individual systems. For this, you might want to look into an architecture that deals with late arriving data gracefully, so you can catch up individual feeds that had transient issues.

BCP

For 200 simple extracts, BCP is probably a good way to go. The sources are all identical, so the BCP files will be the same across sources. You can build a load controller with SSIS. Getting multiple threads to read the top off a common list would require you to implement synchronised access to the list. The SSIS process has a bunch of loops running in parallel in a sequence container that pop the next item, execute it and update the corresponding status.

Implementing the 'next' function uses a sproc running in a serializable transaction that pops the 'next' eligible source off the list and marks it as 'in progress' within the transaction. This is a 'table as queue' problem, but you don't have to implement synchronised inserts - a whole batch can be pushed into the table at the start of the run.
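The 'pop the next eligible source' step can be sketched as follows. This uses Python's built-in sqlite3 purely as a stand-in for the real database, with `BEGIN IMMEDIATE` playing the role of the serializable transaction; the table name `extract_queue` and the status values are assumptions for illustration.

```python
import sqlite3


def open_queue(sources):
    # isolation_level=None lets us issue BEGIN/COMMIT explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE extract_queue (source TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    # The whole batch is pushed into the table at the start of the run.
    conn.executemany(
        "INSERT INTO extract_queue VALUES (?, 'pending')", [(s,) for s in sources]
    )
    return conn


def claim_next(conn):
    # Take the write lock up front so two workers cannot claim the same row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT source FROM extract_queue "
            "WHERE status = 'pending' ORDER BY source LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE extract_queue SET status = 'in progress' WHERE source = ?",
                (row[0],),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    # None signals that nothing is left to claim.
    return row[0] if row is not None else None
```

Each parallel loop would call `claim_next` repeatedly until it returns `None`, updating the row's status again once the extract finishes.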

Structure the individual extract process so that it tries once or twice again if the first attempt fails. This will mitigate a lot of failures caused by transient errors. Fail the task if it fails twice, and structure the ETL so it is resilient to individual extraction failures.
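A minimal retry wrapper along those lines might look like this. The default of two attempts and the zero delay are assumptions, and `task` stands for any callable that performs one extract.

```python
import time


def run_with_retries(task, max_attempts=2, delay_seconds=0.0):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Return as soon as one attempt succeeds.
            return task()
        except Exception as exc:
            last_error = exc
            # Pause before retrying, but not after the final attempt.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    # Every attempt failed: fail the task and keep the original cause.
    raise RuntimeError(f"extract failed after {max_attempts} attempts") from last_error
```

The caller can catch the final `RuntimeError`, mark that source as failed, and carry on with the rest of the run.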

Incremental loads

An incremental loader is probably not worth bothering with for dimension tables unless you have a really big dimension that shows real performance issues. For the fact table data sources it probably is worth it. If you can add a row version to the application table with a timestamp column or some such, you can pick up stuff that's new. However, you will need to track this locally to record the last timestamp. If there is an insert or update date on the data you may be able to use that instead.
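The bookkeeping for "pick up stuff that's new" can be sketched like this. The `row_version` field name is an assumption, and the rows are plain dictionaries standing in for source records; a real loader would push the filter into the source query.

```python
def incremental_extract(rows, last_version):
    # Keep only rows changed since the last recorded watermark.
    new_rows = [r for r in rows if r["row_version"] > last_version]
    # The next run starts from the highest version seen so far;
    # if nothing is new, the watermark stays where it was.
    new_watermark = max((r["row_version"] for r in new_rows), default=last_version)
    return new_rows, new_watermark
```

The returned watermark is what needs to be stored locally between runs, ideally only after the batch has been loaded successfully.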

Full Loads

What could possibly go wrong?

200 processes kicking off to do a full load places a load spike on the network and possibly the staging database. This could lead to all sorts of transient issues like timeouts. For small dimension tables it's probably not such a big issue. However, for 100GB there is quite a wide variety of issues: WAN saturation, locking (although the right staging architecture will mitigate that), and availability of sources. The longer the extract process has to run, the bigger the influence environmental factors have on the reliability of the process.

There are quite a lot of imponderables here, so YMMV. I'd suggest an incremental load for the larger tables if possible.

Does surrogate key assignment for a fact table require that the source data has natural keys

I'm not sure that you need "natural" keys, but you probably do need to maintain a key mapping of sorts. So you need to understand what relationships map between your source and target systems, identify the keys for those relationships and build your key mappings from there.

I had a previous question on this called "What is the best practice for mapping from natural keys to integer-based keys? (ETL)".

EDIT: So far I am seeing at least three, if not four mappings.
CustomersToDim_Customers (customer_id, dim_customer_id)
ProductsToDim_Products (product_id, dim_product_id)
OrderDatesToDim_Date (order_date, date_id) or (map_id,order_date,date_id) if you want to use a key to map.
And lastly, I see the order_id as your key to the fact table. So I would go
OrdersToFactOrders (order_id,dim_date_id,dim_customer_id,dim_product_id)
In my case I renamed the fields for the mart with dim_field_id because I didn't want name collision within my tables or confusion as to which Id they pointed to. Your ETL would have to know that CustomersToDim_Customers.dim_customer_id really maps to Dim_Customers.customer_id and that CustomersToDim_Customers.customer_id really maps to Customers.customer_id.
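To show how those mappings fit together, here is a small Python sketch that resolves one source order into an OrdersToFactOrders row. The dictionaries stand in for the mapping tables, and the function name and sample values in the usage note are made up for illustration.

```python
def build_fact_order_row(order, customer_map, product_map, date_map):
    # Each *_map plays the role of one mapping table:
    # source key -> key used in the mart.
    return {
        "order_id": order["order_id"],
        "dim_date_id": date_map[order["order_date"]],
        "dim_customer_id": customer_map[order["customer_id"]],
        "dim_product_id": product_map[order["product_id"]],
    }
```

A lookup that raises `KeyError` means the source key has no mapping yet, which is the case the ETL has to decide how to handle (for example by loading the dimension first).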

I would also be half-inclined to include the order_number in the OrdersToFactOrders mapping table, but that is because I like to have tracking data for audit purposes. Makes my life easier. But, based on what you told me, order_number and order_id are one-to-one, so including order_number would be redundant and only necessary if you have a perfectionist paranoia about making sure your data is correct on both sides (I really like to make sure that A on side A and B on side B are really correct after the ETL is done).
