SQL Server 2016 – How Does the Microsoft tablediff Utility Work?

sql serversql-server-2016

Our team is curious about the architecture of tablediff utility. We want to find the difference between table rows.

Microsoft mentions tablediff here, however does not state how it works. Does tablediff take the checksum or hash of every row and compare tables? What is the internal algorithm method?

https://docs.microsoft.com/en-us/sql/tools/tablediff-utility?view=sql-server-2017

create table dbo.CustomerTransaction
(
    CustomerTransactionId int primary key,
    CustomerName varchar(50),
    ProductName varchar(50),
    QuantityBought int
)

Using tablediff example:

So row values Table 1: (1,'Bob','Table',8) is the Same as Table 2: (1,'Bob','Table',8)

These are different (1,'Bob','Table',8), (1,'Bob','Chair',8), different on the primary key.

I know it requires the source table to have primary Key/Identity/rowguid column to compare, and was originally based on replication technology.

Best Answer

I'm not aware of authoritative documentation of the fact that the current version of the tablediff utility uses a checksum to identify data differences. However, I was able to verify this by capturing the queries executed by the utility using a sql_batch_completed event trace. The trace showed a query is run against each table, returning only the primary key value(s) along with the T-SQL BINARY_CHECKSUM over all table columns.

This is the T-SQL script I ran on my local SQL instance to the create tables and trace:

USE tempdb;
DROP TABLE IF EXISTS dbo.table1;
DROP TABLE IF EXISTS dbo.table2;
CREATE TABLE dbo.Table1(Col1 int NOT NULL CONSTRAINT PK_Table1 PRIMARY KEY, Col2 int);
CREATE TABLE dbo.Table2(Col1 int NOT NULL CONSTRAINT PK_Table2 PRIMARY KEY, Col2 int);
INSERT INTO dbo.Table1 VALUES
      (1, 1)
    , (2, 2)
    , (3, 3);
INSERT INTO dbo.Table2 VALUES
      (1, 1)
    , (2, 0)
    , (4, 4);

While viewing the trace in SSMS (Object Explorer: right-click Management-->Extended Events-->Sessions-->sql_batch_completed and select Watch Live Data), I ran this from the command prompt:

"C:\Program Files\Microsoft SQL Server\140\COM\tablediff" -sourceserver "." -sourcedatabase "tempdb" -sourceschema "dbo" -sourcetable "Table1" -destinationserver "." -destinationschema "dbo" -destinationdatabase "tempdb" -destinationtable "Table2"

These were the relevant queries executed by tablediff:

SELECT [dbo].[table1].[col1],BINARY_CHECKSUM([dbo].[table1].[col1],[dbo].[table1].[col2],[dbo].[table1].[col3]) as MShash_54267293 FROM [dbo].[table1] WITH (READUNCOMMITTED) ORDER BY [dbo].[table1].[col1]

SELECT [dbo].[table2].[col1],BINARY_CHECKSUM([dbo].[table2].[col1],[dbo].[table2].[col2],[dbo].[table2].[col3]) as MShash_54267293 FROM [dbo].[table2] WITH (READUNCOMMITTED) ORDER BY [dbo].[table2].[col1]

Related Solutions

Sql-server – Indexing – Uniqueidentifier Foreign Key or Intermediary mapping table

Ok, I am making a lot of assumptions (INT instead of VARCHAR(50) being one of them) with this answer, so feel free to correct me if needed. The problem with option B is that it introduces a new join to relate Users to Alerts without any real added benefit. If joining on the UserID, it is best to index the UserID, so you can utilize seeks for your joins.

For Option A, UserID will be the clustering key (index key for the clustered index) on the Users table. UserID will be a nonclustered index key on Alerts table. This will cost 16 bytes per Alert.

For Option B, UserID will be the clustering key on the Users table. UserId will probably be the clustering key in UserMap too, to make joining more efficient. UserKey (assuming this is an INT) would then be a nonclustered index key on the Alerts table. This will cost 4 bytes per Alert. And 20 bytes per UserMap.

Looking at the big picture, one relationship, for Option A, costs 16 bytes of storage, and involves 1 join operation. Whereas, one relationship, for Option B, costs 24 bytes of storage, and involves 2 join operations.

Furthermore, there are a possibility of 340,282,366,920,938,000,000,000,000,000,000,000,000 uniqueidentifiers and only 4,294,967,296 INTs. Implementing a uniqueidentifier to INT map for a this type of relationship could cause unexpected results when you start reusing INTs.

The only reason for creating this type map table, is if you plan on creating a Many to Many relationship between Users and Alerts.

Taking all of this into consideration, I would recommend Option A.

I hope this helps,

Matt

Sql-server – Why do sequential GUID keys perform faster than sequential INT keys in the test case

I modified @Phil Sandler's code to remove the effect of calling GETDATE() (there may be hardware effects/interrupts involved??), and made rows the same length.

[There have been several articles since SQL Server 2000 relating to timing issues and high-resolution timers, so I wanted to minimise that effect.]

In simple recovery model with data and log file both sized way over what is required, here are the timings (in seconds): (Updated with new results based on exact code below)

       Identity(s)  Guid(s)
       ---------    -----
       2.876        4.060    
       2.570        4.116    
       2.513        3.786   
       2.517        4.173    
       2.410        3.610    
       2.566        3.726
       2.376        3.740
       2.333        3.833
       2.416        3.700
       2.413        3.603
       2.910        4.126
       2.403        3.973
       2.423        3.653
    -----------------------
Avg    2.650        3.857
StdDev 0.227        0.204

The code used:

SET NOCOUNT ON

CREATE TABLE TestGuid2 (Id UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
SomeDate DATETIME, batchNumber BIGINT, FILLER CHAR(88))

CREATE TABLE TestInt (Id Int NOT NULL identity(1,1) PRIMARY KEY,
SomeDate DATETIME, batchNumber BIGINT, FILLER CHAR(100))

DECLARE @Numrows INT = 1000000

CREATE TABLE #temp (Id int NOT NULL Identity(1,1) PRIMARY KEY, rowNum int, adate datetime)

DECLARE @LocalCounter INT = 0

--put rows into temp table
WHILE (@LocalCounter < @NumRows)
BEGIN
    INSERT INTO #temp(rowNum, adate) VALUES (@LocalCounter, GETDATE())
    SET @LocalCounter += 1
END

--Do inserts using GUIDs
DECLARE @GUIDTimeStart DateTime = GETDATE()
INSERT INTO TestGuid2 (SomeDate, batchNumber) 
SELECT adate, rowNum FROM #temp
DECLARE @GUIDTimeEnd  DateTime = GETDATE()

--Do inserts using IDENTITY
DECLARE @IdTimeStart DateTime = GETDATE()
INSERT INTO TestInt (SomeDate, batchNumber) 
SELECT adate, rowNum FROM #temp
DECLARE @IdTimeEnd DateTime = GETDATE()

SELECT DATEDIFF(ms, @IdTimeStart, @IdTimeEnd) AS IdTime, DATEDIFF(ms, @GUIDTimeStart, @GUIDTimeEnd) AS GuidTime

DROP TABLE TestGuid2
DROP TABLE TestInt
DROP TABLE #temp
GO

After reading @Martin's investigation, I re-ran with the suggested TOP(@num) in both cases, i.e.

...
--Do inserts using GUIDs
DECLARE @num INT = 2147483647; 
DECLARE @GUIDTimeStart DATETIME = GETDATE(); 
INSERT INTO TestGuid2 (SomeDate, batchNumber) 
SELECT TOP(@num) adate, rowNum FROM #temp; 
DECLARE @GUIDTimeEnd DATETIME = GETDATE();

--Do inserts using IDENTITY
DECLARE @IdTimeStart DateTime = GETDATE()
INSERT INTO TestInt (SomeDate, batchNumber) 
SELECT TOP(@num) adate, rowNum FROM #temp;
DECLARE @IdTimeEnd DateTime = GETDATE()
...

and here are the timing results:

       Identity(s)  Guid(s)
       ---------    -----
       2.436        2.656
       2.940        2.716
       2.506        2.633
       2.380        2.643
       2.476        2.656
       2.846        2.670
       2.940        2.913
       2.453        2.653
       2.446        2.616
       2.986        2.683
       2.406        2.640
       2.460        2.650
       2.416        2.720

    -----------------------
Avg    2.426        2.688
StdDev 0.010        0.032

I wasn't able to get the actual execution plan, as the query never returned! It seems a bug is likely. (Running Microsoft SQL Server 2008 R2 (RTM) - 10.50.1600.1 (X64))

Best Answer

Related Solutions

Sql-server – Indexing – Uniqueidentifier Foreign Key or Intermediary mapping table

Sql-server – Why do sequential GUID keys perform faster than sequential INT keys in the test case

Related Question