Sql-server – Approaches for handling nvarchar sources to data warehouse

data-warehouseetlsql server

Situation is this…our data warehouse doesn't need nvarchar and is built using all non-unicode datatypes. But all new source systems are all coming in with unicode (utf8 Oracle or nvarchar in sqlserver).

If we leave the source nvarchar to our staging then we get implicit datatype conversions if we have to join or compare to our EDW during loads. Plus it just takes up excess storage that is unnecessary. Sounds great to stay true the source and not risk losing characters in conversion but the reality is we need to optimize storage and our load processes and that means not using unicode anytime we can avoid it.

Thoughts?

Best Answer

In SQL Server with Table and Index Compression (Row, Page or Columnstore) NVarchar columns don't use two bytes per character. They use Unicode Compression.

Here's an example:

use master 
go
drop database TestUnicodeCompression
go
create database TestUnicodeCompression

go
use TestUnicodeCompression
drop table if exists A
drop table if exists B
go
create table A(id int identity, a nvarchar(200) default cast(newid() as nvarchar(200) ))

create table B(id int identity, a varchar(200) default cast(newid() as varchar(200)))

go

with q as
(
  select top 1000000 row_number() over (order by (select null)) i
  from sys.objects o, sys.columns c, sys.columns c2
)
insert into A(A) select cast(newid() as varchar(200))
from q;

with q as
(
  select top 1000000 row_number() over (order by (select null)) i
  from sys.objects o, sys.columns c, sys.columns c2
)
insert into B(A) select cast(newid() as varchar(200))
from q;

go
alter table A rebuild with (data_compression=none)
alter table B rebuild with (data_compression=none)
go

select 'before compression' state,  
       cast( object_name(object_id) as varchar(10)) table_name, 
       used_page_count 
from sys.dm_db_partition_stats
where object_id in (object_id('A'), object_id('B'))

go

alter table A rebuild with (data_compression=page)
alter table B rebuild with (data_compression=page)

go

select 'after compression' state, 
        cast( object_name(object_id) as varchar(10)) table_name, 
        used_page_count 
from sys.dm_db_partition_stats
where object_id in (object_id('A'), object_id('B'))

outputs

(1000000 rows affected)

(1000000 rows affected)
state              table_name used_page_count
------------------ ---------- --------------------
before compression A          11114
before compression B          6581

(2 rows affected)

state             table_name used_page_count
----------------- ---------- --------------------
after compression A          6210
after compression B          6062

(2 rows affected)