T-sql – TSQL – Bringing Data Together from Different Sources …refactoring PK and FKs

stored-procedurest-sql

I have various offices and one central head office. Each office has its own SQL Server instance so each office has its own data set with its own set of IDs.

Each office has already imported data into the head office and stored the data on a set of STAGING_Tables that look like this.

DECLARE @STAGING_COUNTRY TABLE
(
  Original_CountryID INT NOT NULL, 
  OfficeID VARCHAR(10) NOT NULL,
  Data VARCHAR(200) NOT NULL
);

DECLARE @STAGING_CITY TABLE
(
  Original_CityID INT NOT NULL, 
  Original_CountryID_FK INT NOT NULL, 
  OfficeID VARCHAR(10) NOT NULL,
  OtherData VARCHAR(100) NOT NULL
);

STAGING_COUNTRY has the original ID of each row (which off course will be duplicated since each office will have ID=1 for the 1st row on their Country table) and also has a unique OfficeID value that together with the Original_CountryID ..makes a unique value.

STAGING_CITY has also the original ID of each row, the unique OfficeID value that represent each office and in this case a FK to CountryID, (but of course at this point we have a reference to the Original_CountryID ..that in conjunction with the office ID could be identified).

Let's add some dummy rows:

/* ADD DUMMY VALUES TO STAGING_COUNTRY */ 
INSERT INTO @STAGING_COUNTRY
(Original_CountryID, OfficeID, Data) VALUES  (1, 'Office1', 'USA')

INSERT INTO @STAGING_COUNTRY (Original_CountryID, OfficeID, Data)
VALUES  (2, 'Office1', 'Canada')

INSERT INTO @STAGING_COUNTRY (Original_CountryID, OfficeID, Data)
VALUES  (3, 'Office1', 'Japan')

INSERT INTO @STAGING_COUNTRY (Original_CountryID, OfficeID, Data)
VALUES  (1, 'Office2', 'USA')

INSERT INTO @STAGING_COUNTRY (Original_CountryID, OfficeID, Data)
VALUES  (1, 'Office2', 'Italy')

INSERT INTO @STAGING_COUNTRY (Original_CountryID, OfficeID, Data)
VALUES  (3, 'Office2', 'Canada')

INSERT INTO @STAGING_COUNTRY (Original_CountryID, OfficeID, Data)
VALUES  (3, 'Office3', 'Canada')

INSERT INTO @STAGING_COUNTRY (Original_CountryID, OfficeID, Data)
VALUES  (2, 'Office3', 'France')

INSERT INTO @STAGING_COUNTRY (Original_CountryID, OfficeID, Data)
VALUES  (3, 'Office3', 'USA')



/* ADD DUMMY VALUES TO STAGING_CITY */  
INSERT INTO @STAGING_CITY (Original_CityID, Original_CountryID_FK, OfficeID, OtherData) VALUES 
(1, 1, 'Office1', 'New York')

INSERT INTO @STAGING_CITY (Original_CityID, Original_CountryID_FK,
OfficeID, OtherData) VALUES  (2, 1, 'Office1', 'Vancouver')

INSERT INTO @STAGING_CITY (Original_CityID, Original_CountryID_FK,
OfficeID, OtherData) VALUES  (3, 1, 'Office1', 'Tokia')

INSERT INTO @STAGING_CITY (Original_CityID, Original_CountryID_FK,
OfficeID, OtherData) VALUES  (1, 2, 'Office2', 'New York')

INSERT INTO @STAGING_CITY (Original_CityID, Original_CountryID_FK,
OfficeID, OtherData) VALUES  (2, 2, 'Office2', 'Rome')

INSERT INTO @STAGING_CITY (Original_CityID, Original_CountryID_FK,
OfficeID, OtherData) VALUES  (3, 2, 'Office2', 'Vancouver')

INSERT INTO @STAGING_CITY (Original_CityID, Original_CountryID_FK,
OfficeID, OtherData) VALUES  (1, 3, 'Office3', 'Vancouver')

INSERT INTO @STAGING_CITY (Original_CityID, Original_CountryID_FK,
OfficeID, OtherData) VALUES  (2, 3, 'Office3', 'Paris')

INSERT INTO @STAGING_CITY (Original_CityID, Original_CountryID_FK,
OfficeID, OtherData) VALUES  (3, 3, 'Office3', 'New York')

The central head office wants to run reports from a central dtabase that pretty much contains copy all the data from all offices but in order to make this reporting DB optimized, we need to reshuffle a bit the STAGING_Tables …and reorganize the data in FINAL_Tables that look like this:

DECLARE @FINAL_COUNTRY TABLE
(
  CountryID INT IDENTITY PRIMARY KEY, 
  Original_CountryID INT NOT NULL, 
  OfficeID VARCHAR(10) NOT NULL,
  Data VARCHAR(200) NOT NULL
);

DECLARE @FINAL_CITY TABLE
(
  CityID INT IDENTITY PRIMARY KEY, 
  Original_CityID INT NOT NULL, 
  CountryID_FK INT NOT NULL, 
  OfficeID VARCHAR(10) NOT NULL,
  OtherData VARCHAR(100) NOT NULL
);

PROBLEM:
The FINAL_COUNTRY and FINAL_CITY tables should be as optimized as possible for reporting purposes. These reports will be written in T-SQL stored procedures.

QUESTION:
What is the best way to reorganize the FINAL_Tables so that each record has a TRUE PK identifier (like in the original Office_Tables) and each FK is updated to point to the right newly created PK …at the server level?

NOTE:
Please note that both staging & final tables are inside the same DB, on the server.
Also we still need to keep the OriginalIDs on the FINAL_Tables for other purposes.

GOALS:
The main goal here is to reorganize into a set of tables that can be easily indexed for performance purposes.

Please ask more info if needed.

Many many thanks in advanced…

Best Answer

The IDENTITY key in your FINAL_COUNTRIES table is just fine for a PK. You've created a "synthetic key." I would also add a UQ on { OriginalCountryID, OfficeID }, to speed joins from staged data.

Another option would be to have a Countries table in your warehouse with only unique values (e.g., "USA" just once) and then a CountriesMap table relating OfficeID and CountryID to the ID in the warehouse table. This would help performance, as when you get to your fact tables you'll want to use a single INT to relate to the Countries table, rather than an INT and a VARCHAR() - fact tables can get big fast, and it's important to keep them as narrow as you can.