Sql-server – Approaches for handling nvarchar sources to data warehouse

data-warehouseetlsql server

Situation is this…our data warehouse doesn't need nvarchar and is built using all non-unicode datatypes. But all new source systems are all coming in with unicode (utf8 Oracle or nvarchar in sqlserver).

If we leave the source nvarchar to our staging then we get implicit datatype conversions if we have to join or compare to our EDW during loads. Plus it just takes up excess storage that is unnecessary. Sounds great to stay true the source and not risk losing characters in conversion but the reality is we need to optimize storage and our load processes and that means not using unicode anytime we can avoid it.

Thoughts?

Best Answer

In SQL Server with Table and Index Compression (Row, Page or Columnstore) NVarchar columns don't use two bytes per character. They use Unicode Compression.

Here's an example:

use master 
go
drop database TestUnicodeCompression
go
create database TestUnicodeCompression

go
use TestUnicodeCompression
drop table if exists A
drop table if exists B
go
create table A(id int identity, a nvarchar(200) default cast(newid() as nvarchar(200) ))

create table B(id int identity, a varchar(200) default cast(newid() as varchar(200)))

go

with q as
(
  select top 1000000 row_number() over (order by (select null)) i
  from sys.objects o, sys.columns c, sys.columns c2
)
insert into A(A) select cast(newid() as varchar(200))
from q;

with q as
(
  select top 1000000 row_number() over (order by (select null)) i
  from sys.objects o, sys.columns c, sys.columns c2
)
insert into B(A) select cast(newid() as varchar(200))
from q;

go
alter table A rebuild with (data_compression=none)
alter table B rebuild with (data_compression=none)
go

select 'before compression' state,  
       cast( object_name(object_id) as varchar(10)) table_name, 
       used_page_count 
from sys.dm_db_partition_stats
where object_id in (object_id('A'), object_id('B'))

go

alter table A rebuild with (data_compression=page)
alter table B rebuild with (data_compression=page)

go

select 'after compression' state, 
        cast( object_name(object_id) as varchar(10)) table_name, 
        used_page_count 
from sys.dm_db_partition_stats
where object_id in (object_id('A'), object_id('B'))

outputs

(1000000 rows affected)

(1000000 rows affected)
state              table_name used_page_count
------------------ ---------- --------------------
before compression A          11114
before compression B          6581

(2 rows affected)

state             table_name used_page_count
----------------- ---------- --------------------
after compression A          6210
after compression B          6062

(2 rows affected)

Related Solutions

Sql-server – ETL: extraction strategy for 200 source databases

If you have 200 identical sources then you can parameterise a SSIS package with the data source and kick off nultiple threads. These can be controlled within the package by a foreach loop or from an external source that kicks off the extractors with a parameter.

You could consider a full load for relatively small dimensional sources and an incremental load for transactional data. This would require you to have persistent dimensions, but this is fairly straightforward to do with MERGE operations, or a pre-load area and dimension handler if you need slowly-changing dimensions.

You may wish to consider giving each source its own staging area (maybe a schema for each source in the staging database). This eliminates locking issues on the staging tables. Build a set of views over the staging tables (essentially just set of unions that correspond to each of the source tables) that includes data source information. These can be generated fairly easily, so you don't have to manually cut and paste 200 different queries into the union. Once you've staged the data then ETL process can read the whole lot from the view.

This allows the ETL to run in one hit, although you will have to come up with a strategy to deal with extract failures from individual systems. For this, you might want to look into an architecture that deals with late arriving data gracefully, so you can catch up individual feeds that had transient issues.

BCP

For 200 simple extracts, BCP is probably a good way to go. The sources are all identical, so the BCP files will be the same across sources. You can build a load controller with SSIS. Getting multiple threads to read the top off a common list would require you to implement synchronised access to the list. The SSIS process has a bunch of loops running in parallel in a sequence container that pop the next item, execute it and update the corresponding status.

Implementing the 'next' function uses a sproc running in a serializable transaction that pops the 'next' eligible source off the list and marks it as 'in progress' within the transaction. This is a 'table as queue' problem, but you don't have to implement synchronised inserts - a whole batch can be pushed into the table at the start of the run.

Structure the individual extract process so that it tries once or twice again if the first attempt fails. This will mitigate a lot of failures caused by transient errors. Fail the task if it fails twice, and structure the ETL so it is resilient to individual extraction failures.

Incremental loads

An incremental loader is probably not worth bothering for dimension tables unless you have a really big dimension that shows real performance issues. For the fact table data sources it probably is worth it. If you can add a row version to the application table with a timestamp column or some such, you can pick up stuff that's new. However, you will need to track this locally to record the last timestamp. If there is an insert or update date on the data you may be able to use that instead.

Full Loads

What could possibly go wrong?

200 processes kicking off to do a full load places a load spike on the network and possibly the staging database. This could lead to all sorts of transient issues like timeouts. For small dimension tables it's probably not such a big issue. However for 100GB there are quite a wide variety of issues - WAN saturation, locking (although the right staging architecture will mitigate that), availability of sources. The longer the extract process has to run the bigger influence environmental factors have on the reliability of the process.

There are quite a lot of imponderables here, so YMMV. I'd suggest an incremental load for the larger tables if possible.

Sql-server – Handling time zones in data mart/warehouse

Firstly...

Separating Datime/Time into a Date dimension and a Time dimension is definitely the way to go.

To manage multiple time zones you need to duplicate the DateKey and the TimeKey so that you have the following:

LocalDateKey
LocalTimeKey
UtcDateKey
UtcTimeKey

You say...

The problem I am having with all that is that 11:00 PM on Tuesday, December 31, 2013 in UTC is Wednesday, January 1st, 2014 in all time zones that are after UTC+2.

By having the 4 columns I've listed above you, will be able to join the fact table to the Date and/or Time dimension Using Table Aliases (in Kimball terminology these aliased dimension tables are known as "Role Playing Dimensions"), so you would have something like the following:

/*
    Assumes the following:
        - [DateLongName] has the format of this example "Tuesday, December 31, 2013"
        - [TimeShortName] has the format of this example "11:00 PM"
        - Both [DateLongName] & [TimeShortName] are strings
*/
select
    -- Returns a string matching this example  "11:00 PM Tuesday, December 31, 2013"
    localTime.TimeShortName + ' ' + localDate.DateLongName
    ,utcTime.TimeShortName + ' ' + utcDate.DateLongName
    ,f.*
from
    FactTableName  AS f

    -- Local Date and Local Time joins          
    inner join dbo.Date  AS localDate
        on localDate.DateKey = f.LocalDateKey

    inner join dbo.Time  AS localTime
        on localTime.TimeKey = f.LocalTimeKey 

    -- Utc Date and Utc Time joins    
    inner join dbo.Date  AS utcDate
        on utcDate.DateKey = f.UtcDateKey

    inner join dbo.Time  AS utcTime
        on utcTime.TimeKey = f.UtcTimeKey

In closing...

As you're building a data mart, and not an OLTP database, the generation of the Local and Utc times should be performed in your ETL, NOT in any client side applications for the following reasons (apart from localisation of the UTC time to the report reader's perspective):

Having the calculation reside in any queries places an extra performance burden on them, multiplied by the number of times you have to run said query for any reports you have (this matters when reading millions of rows)
Extra burden of ensuring the calculation is maintained correctly in each query (especially when you take daylight savings time into account)
Prevent range scanning of any indexes the column is part of, as you'll be performing a calculation on the column which forces queries to perform index scans instead of seeks (which are typically more expensive as each data page is needed to be read); this is known as being non-sargable.
- Edit due to comments: This applies if you push the conversion down into the actual query.
Using the concept of having the additional UTC dates and times available, there is nothing stopping you from taking this concept and extending it by calling this StandardisedDateKey, or CorporateHQDateKey, where instead of a UTC date table you standardise based on some other business agreed standard
Having the two separate column types (Local and UTC), allows for side-by-side comparison across geographic distance. Think -> someone in Australia enters a record that is timestamped with both Local and UTC, someone in New York reads the report with the Local (Australia) date and time and the New York representation of the UTC date and time, thereby seeing that something their Australian counterpart did during the middle of the day (Australia time) happened in the middle of the night their time (New York time). This comparison of time is indispensable in multi-national businesses.