Sql-server – Are there tools to perform data comparison between two different schemas

sql serversql-server-2008tools

We are migrating our database to a new schema but want to validate the data has been moved correctly.

Traditional data comparison tools can compare two databases for differences if the schemas are the same. In our case there have been changes to the table designs but all the data from the old schema is in the new one, it's just been moved around a bit, and I need to make sure it is correct. We have tens of millions of rows so manual inspection is not an option.

Are there any tools which could aid in this type of comparison?

If not are there any libraries/frameworks which could help kick start the development of a custom solution?

I'm happy to use a database specific solution if necessary, in this case for SQL Server 2008.

My soluton: I'm comparing the two data sets by creating a VIEW of each table on the old database with the same fields as the new database table.

I then compare the data using the technique described here: The shortest, fastest, and easiest way to compare two tables in SQL Server: UNION !

I'm lucky in this migration as the overall table structure is similar to the old database, with fields moved from one table to another, dropped or added. In the case of the dropped and added there is nothing to compare, for the fields which have moved or been aggregated I do calculations in the view to provide the correct information for comparison.

The UNION comparison shows me only the rows with differences, so as soon as the data is correct I get an empty result set.

Best Answer

It's not impossible to compare two different schemas, it's a calculation of how confident you are in the result. I've essentially borrowed from Bank Reconciliation techniques

Important: This reconciliation isn't about making sure the destination exactly matches the source in it's data context (there's a reason you're migrating to a new system), but you do need to explain any discrepancies!

Basis:

  1. Identify metrics that you will use to measure (ie: total number of users, sum of all their ages, list of user id's and their postcodes...) I try to use several metrics from: Totals, Averages & sample/detailed records
  2. Dump this data into a common location (via views/reports/whatever is reasonable)
  3. Compare your data and make sure any discrepancy can be explained

I've generally split the comparison into several methods (in particular for the detailed metrics):

  • By Table: Usually using the Source Table as the definition, creating a set of views on the destination DB to try and reproduce the Source Table Data
  • By Object: Depending on the DB and what you're storing it might make more sense to create a holistic view of the objects (ie: the user) across several tables, similar to above creating a set of views to return that object as a set of results
  • By Reports: If the Source DB has a good comprehensive set of Management Reports, another technique is to reproduce those reports with the exact same formatting

Techniques:

No matter what method I've used to produce what will be compared, I end up with a set of files/views/dbs that hold the reconcilable Source and Destination data, then depending on the medium I can use one of the commonly available tools to compare them. My own preferences are:

  1. File Comparison
    Create two different folders for SourceDB and DestinationDB (time/version stamped so I know when I did the reconciliation) and dump out the output of my metrics as relevantly named files, then use a comparison tool (such as CSDiff) to identify differences

  2. Excel Comparison
    In particular when dealing with Management reports, I'll create a set of Excel of workbooks to compare the reports (in essence making use of VLookups and comparing totals)

  3. Data Comparison
    Rather than outputting the reconciliation data to files or reports, output them to separate DBs, then use something like Redgate SQL Data Compare to compare the DBs

Other tools:

Not tried any of these, but a quick google search on 'Data reconciliation tools' gives:

There's others, but those were the ones that looked reasonable