Sql-server – How should I build a DB to hold data about another database

data-warehousedatabase-designreplicationsql-server-2008-r2

I know that this is a strange question. I need ideas and/or advice, not necessarily "how-to"s.

I work with data from a 3rd party tool/database. I do not have permissions to modify this production database. Any changes or additional information that is needed that cannot go into this production database has to be housed somewhere else.

I have tables whose data has some sort of relationship to each other. For example: Let's say I have a "Caller" table that contains the caller's name and the company that they work for, and I have a "Company" table. Normally, I would use the PK on the Company table in the Caller table as an FK. In this case, that is not possible. On top of that, the Company Names may not be EXACTLY the same. For example: Kim Smith might have "ABC Distributing" as her company in the Caller table, while the Company table has the name as "ABC Dist".

I need to create a database to "support" this other database. In addition to the problem listed above, sometimes companies change names. The powers-that-be want the historical records attached to the company's name at that period in time. So if "ABC Dist" changed their name to "ABC Distribution and Manufacturing" in January 2016, the powers-that-be want the records for that company before Jan 1, 2016 to have the "ABC Dist" name, while records AFTER Jan 1, 2016 to have the new name. We now have TWO records for the same company, with two PKs. Again, this is not something that I can change. If I could rewrite this 3rd party tool, I would.

So, I need a way to keep callers associated with the correct company in the company table and a way to associate multiple lines in the same company table together (company name changes).

I know that I'll have to write an ETL to keep these two databases synchronized. What I need is ideas on how to design the support DB. Do I copy company names over or just use the PK for everything?

I hope that I've made the issues clear and understandable. I'm really frustrated with this mess, and need a way to tame it a bit.

OH! Also, I need suggestions on how to keep track of when changes are made to the production database. No one bothers to tell me changes have occurred until something downstream breaks. (Using MS SQL Server 2008 R2)

Should I have submitted this under the developers' StackOverflow?

Best Answer

Strongly recommend you don't do it. There clearly are either technical deficiencies in the current database/app or you have contractual deficiencies that prevent your company from getting the 3rd party to fix/enhance their app/DB. Building a helper DB quickly becomes unmaintainable and you end up with errors that can be hard to troubleshoot.

For example, consider your scenario above where the company changes its name. If you can't modify the primary database to create triggers on that table or modify the app/SP so that changes to company name in the primary DB always ensure propagation to the helper DB, you're almost guaranteed to get out of sync over time. Regardless of whatever procedures you put in place, humans fail. On top of that, you've also tightly coupled another database to the primary so if the helper is not online and operational, there needs to be some mechanism in place to "hold changes" and make sure they get applied later. You also need to know what operations should not be permitted on the primary because they depend on the helper.

Those are just some top level things to deal with, there are more.

If you still want/have to do it, design the table logically the way you would if it was in the same DB. Using your customer name scenario, you'd have the column you want to track changes/variations of so for your example it would be company_name. You'll also need the column that authoritatively identifies the customer. This might be companyID or some other identifier the app uses. If you are on SQL Server 2016, enable temporal table for this and you'll have system managed versioning. Else, you'll need to do your own versioning. You'll need a column to track versions. Date is probably most natural but not necessarily the best for performance if you have a very large table. I suspect it won't matter for your case (if the table becomes very large, this won't be your biggest problem). You'll also need to consider creating a composite primary key or some surrogate key. Again, assuming the table won't be huge, I'd go with composite PK but you know your system best so evaluate options and recommendations carefully - this is a contentious subject.

Once you have that, create your database + tables and make sure you have the right permissions setup and whatever you do to protect the primary database, do the same for this helper DB. Make sure your backup plans will get you the right level of logical consistency and you have a scripted way to reconcile when they get out of sync.

Populating the first time is easy, keeping it in sync is a bit trickier. Since you're not really just doing a 1-1 copy, things like replication won't work without some customization. Writing an ETL job to kick in regularly will do the trick and likely easier but what is the latency you can tolerate? Is it ok to have the helper DB 10 minutes behind the primary? 20? 30? Careful with the selection. Setting high frequency can cause interesting problems if a job gets blocked/delayed for whatever reason. Change tracking or change data capture are helpful here. They are not the same thing and picking the right technology really depends on your environment. At a very high level, if changes to the same row are very infrequent and if they occur, the same row would not be changed multiple times over a short period, change tracking might be a better solution - lighter weight. If the data may be changed repeatedly over a short period, consider change tracking - more overhead but you get full history to move to the helper.

A quick and dirty solution would be to write a trigger on the table so every time a value is changed, you would push that to the helper db but triggers are easy to mess up. It's easy to make a mistake and flood the database. Also, failures are often hard to troubleshoot. I wouldn't go this route.

Whatever you finally implement, be sure you test to ensure it works correctly for deleted companies. If you "blindly" propagate a delete to the helper DB, will that delete the latest or all occurrences? What is the right behavior for the business and are there legal requirements? How do you recover from an accidental delete?

Related Question