Sql-server – Are there tools to perform data comparison between two different schemas

sql serversql-server-2008tools

We are migrating our database to a new schema but want to validate the data has been moved correctly.

Traditional data comparison tools can compare two databases for differences if the schemas are the same. In our case there have been changes to the table designs but all the data from the old schema is in the new one, it's just been moved around a bit, and I need to make sure it is correct. We have tens of millions of rows so manual inspection is not an option.

Are there any tools which could aid in this type of comparison?

If not are there any libraries/frameworks which could help kick start the development of a custom solution?

I'm happy to use a database specific solution if necessary, in this case for SQL Server 2008.

My soluton: I'm comparing the two data sets by creating a VIEW of each table on the old database with the same fields as the new database table.

I then compare the data using the technique described here: The shortest, fastest, and easiest way to compare two tables in SQL Server: UNION !

I'm lucky in this migration as the overall table structure is similar to the old database, with fields moved from one table to another, dropped or added. In the case of the dropped and added there is nothing to compare, for the fields which have moved or been aggregated I do calculations in the view to provide the correct information for comparison.

The UNION comparison shows me only the rows with differences, so as soon as the data is correct I get an empty result set.

Best Answer

It's not impossible to compare two different schemas, it's a calculation of how confident you are in the result. I've essentially borrowed from Bank Reconciliation techniques

Important: This reconciliation isn't about making sure the destination exactly matches the source in it's data context (there's a reason you're migrating to a new system), but you do need to explain any discrepancies!

Basis:

Identify metrics that you will use to measure (ie: total number of users, sum of all their ages, list of user id's and their postcodes...) I try to use several metrics from: Totals, Averages & sample/detailed records
Dump this data into a common location (via views/reports/whatever is reasonable)
Compare your data and make sure any discrepancy can be explained

I've generally split the comparison into several methods (in particular for the detailed metrics):

By Table: Usually using the Source Table as the definition, creating a set of views on the destination DB to try and reproduce the Source Table Data
By Object: Depending on the DB and what you're storing it might make more sense to create a holistic view of the objects (ie: the user) across several tables, similar to above creating a set of views to return that object as a set of results
By Reports: If the Source DB has a good comprehensive set of Management Reports, another technique is to reproduce those reports with the exact same formatting

Techniques:

No matter what method I've used to produce what will be compared, I end up with a set of files/views/dbs that hold the reconcilable Source and Destination data, then depending on the medium I can use one of the commonly available tools to compare them. My own preferences are:

File Comparison
Create two different folders for SourceDB and DestinationDB (time/version stamped so I know when I did the reconciliation) and dump out the output of my metrics as relevantly named files, then use a comparison tool (such as CSDiff) to identify differences
Excel Comparison
In particular when dealing with Management reports, I'll create a set of Excel of workbooks to compare the reports (in essence making use of VLookups and comparing totals)
Data Comparison
Rather than outputting the reconciliation data to files or reports, output them to separate DBs, then use something like Redgate SQL Data Compare to compare the DBs

Other tools:

Not tried any of these, but a quick google search on 'Data reconciliation tools' gives:

There's others, but those were the ones that looked reasonable

Related Solutions

Sql-server – Migrate Data and/or Schema from DBv1 to DBv2

It sounds like you've got this entirely back to front and upside down. Typically changes are tracked using scripts, which you would then apply to the production database. I've never encountered a case where it made sense to import production data to the next version created from development.

If your database isn't currently in source control, the articles linked to from Jeff Atwood's article Get Your Database Under Version Control are a good introduction. Also, the free Redgate ebook SQL Server Team Based Development includes a chapter on source control for databases.

If you've been making changes via GUI tools and don't have any record of what's changed your best bet is a database comparison tool, like Redgate SQLCompare or Apex SQLDiff. These tools will generate the scripts to upgrade your production database to match the schema of development.

If the boss won't part with the cash for a comparison tool, you could reconcile the changes manually using a diff tool like WinMerge or DiffMerge. Use SSMS Generate Script tools to script objects to individual files, then use the diff tool to identify differences. Finally, hand craft the necessary ALTER statements for any changes to tables.

Sql-server – Easily show rows that are different between two tables or queries

You don't need 30 join conditions for a FULL OUTER JOIN here.

You can just Full Outer Join on the PK, preserve rows with at least one difference with WHERE EXISTS (SELECT A.* EXCEPT SELECT B.*) and use CROSS APPLY (SELECT A.* UNION ALL SELECT B.*) to unpivot out both sides of the JOINed rows into individual rows.

WITH TableA(Col1, Col2, Col3) 
     AS (SELECT 'Dog',1,1     UNION ALL 
         SELECT 'Cat',27,86   UNION ALL 
         SELECT 'Cat',128,92), 
     TableB(Col1, Col2, Col3) 
     AS (SELECT 'Dog',1,1     UNION ALL 
         SELECT 'Cat',27,105  UNION ALL 
         SELECT 'Lizard',83,NULL) 
SELECT CA.*
FROM   TableA A 
       FULL OUTER JOIN TableB B 
         ON A.Col1 = B.Col1 
            AND A.Col2 = B.Col2 
/*Unpivot the joined rows*/
CROSS APPLY (SELECT 'TableA' AS what, A.* UNION ALL
             SELECT 'TableB' AS what, B.*) AS CA     
/*Exclude identical rows*/
WHERE  EXISTS (SELECT A.* 
               EXCEPT 
               SELECT B.*) 
/*Discard NULL extended row*/
AND CA.Col1 IS NOT NULL      
ORDER BY CA.Col1, CA.Col2

Gives

what   Col1   Col2        Col3
------ ------ ----------- -----------
TableA Cat    27          86
TableB Cat    27          105
TableA Cat    128         92
TableB Lizard 83          NULL

Or a version dealing with the moved goalposts.

SELECT DISTINCT CA.*
FROM   TableA A 
       FULL OUTER JOIN TableB B 
         ON EXISTS (SELECT A.*  INTERSECT  SELECT B.*) 
CROSS APPLY (SELECT 'TableA' AS what, A.* UNION ALL
             SELECT 'TableB' AS what, B.*) AS CA     
WHERE NOT EXISTS (SELECT A.*  INTERSECT  SELECT B.*) 
AND CA.Col1 IS NOT NULL
ORDER BY CA.Col1, CA.Col2

For tables with many columns it can still be difficult to identify the specific column(s) that differ. For that you can potentially use the below.

(though just on relatively small tables as otherwise this method likely won't have adequate performance)

SELECT t1.primary_key,
       y1.c,
       y1.v,
       y2.v
FROM   t1
       JOIN t2
         ON t1.primary_key = t2.primary_key
       CROSS APPLY (SELECT t1.*
                    FOR xml path('row'), elements xsinil, type) x1(x)
       CROSS APPLY (SELECT t2.*
                    FOR xml path('row'), elements xsinil, type) x2(x)
       CROSS APPLY (SELECT n.n.value('local-name(.)', 'sysname'),
                           n.n.value('.', 'nvarchar(max)')
                    FROM   x1.x.nodes('row/*') AS n(n)) y1(c, v)
       CROSS APPLY (SELECT n.n.value('local-name(.)', 'sysname'),
                           n.n.value('.', 'nvarchar(max)')
                    FROM   x2.x.nodes('row/*') AS n(n)) y2(c, v)
WHERE  y1.c = y2.c
       AND EXISTS(SELECT y1.v
                  EXCEPT
                  SELECT y2.v)