SQL Server – Validate data between two large tables

pt-table-checksum, sql-server, sql-server-2008, t-sql

I have this scenario: transaction data files are transferred from multiple stores to the head office as they happen. At each store the table in question is pretty big; at the head office, that same table contains the same info from all stores, which makes it huge. The data is transferred through files, as there are no direct connections between the HO and the stores.

I'm trying to find an efficient way to make sure that the data from each store has made it safe and sound to the head office table. To do this I have tried a CHECKSUM_AGG of a CHECKSUM over the fields that should match between the store and the HO, for a specific day and a specific store, but the performance was poor. I've also tried summing specific numeric columns and doing the same at the HO for a specific date and store to compare, but the performance also seems to be poor.

I've tried to use a persisted computed column to calculate each row's HASHBYTES value and maybe use that, but unfortunately I couldn't, since the table was created with ANSI_NULLS OFF, which is a restriction for persisted computed columns. And it's unthinkable to fix that issue across a very large number of stores and HOs. I also thought of creating a trigger to fill in some type of checksum per row, but the reality is that I would need to do this on multiple tables (maybe 50), and adding triggers on every table would be very heavy, especially in an OLTP system at the stores.
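For reference, the persisted computed column attempt looked roughly like the sketch below (RowHash and the three columns are only illustrative; the real table has around 200 columns). It fails on any table that was created with ANSI_NULLS OFF:

-- Illustrative sketch only: fails with a SET-options error on tables
-- created with ANSI_NULLS OFF, which is exactly the restriction described above.
ALTER TABLE dbo.BigTable
ADD RowHash AS HASHBYTES('SHA1',
        ISNULL(CONVERT(varchar(30), amt),   '') + '|' +
        ISNULL(CONVERT(varchar(30), qty),   '') + '|' +
        ISNULL(CONVERT(varchar(30), total), '')) PERSISTED;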

I have verified the fragmentation of the indexes and all is good.

Ideally it would be great if I could transfer all the data from the store line by line and compare it against the HO, but that's unthinkable since it's way too big.

I'm looking for ideas that I can explore to see whether the performance is acceptable. The idea is to get some type of checksum of the table per store and date at the store and compare it with the result of the same operation at the HO. That way I would only need to transfer one checksum value per store and date to the HO.
The solution must be compatible from SQL Server 2008 up to the latest SQL Server version.
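To illustrate the idea, here is a minimal sketch of the comparison at the HO, assuming the store-side values arrive (via the existing file transfer) in a hypothetical staging table StoreChecksums(store, trsDate, chk):

-- Hypothetical sketch: flag any (store, date) whose HO aggregate
-- does not match the checksum computed at the store.
SELECT s.store, s.trsDate
FROM StoreChecksums AS s
LEFT JOIN (
    SELECT store, trsDate,
           chk = CHECKSUM_AGG(CHECKSUM(amt, qty, total))
    FROM dbo.BigTable
    GROUP BY store, trsDate
) AS h
    ON  h.store   = s.store
    AND h.trsDate = s.trsDate
WHERE h.chk IS NULL
   OR h.chk <> s.chk;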

More info as requested
The sample table that I've been working with is one from an HO for my tests…
Size: 160 GB for 105 stores and 633 days (200 columns… I know, not ideal, but it's what it is and practically impossible to change for all stores and HOs)

Examples of queries I tried in order to create a daily checksum per store:

select [trsDate],[store],CHECKSUM_AGG(checksum(amt,qty,total))
from BigTable
where store='111'
AND trsDate BETWEEN '2016-01-01' and '2016-01-30'
group by [trsDate],[store]
order by [trsDate],[store]

SELECT trsDate, store,'BigTable',
amt=SUM(ISNULL(amt,0)),
qty=SUM(ISNULL(qty,0)),
total=SUM(ISNULL(total,0)) 
FROM BigTable
WHERE trsDate BETWEEN '2016-01-01' and '2016-01-30'
AND store='111'
GROUP BY trsDate,store

select [trsDate],[store],CHECKSUM_AGG(checksum([trsDate],[store],terminal,trsNumb,linenumb))
from BigTable
where store='111'
AND trsDate BETWEEN '2016-01-01' and '2016-01-30'
group by [trsDate],[store]
order by [trsDate],[store]

SELECT [trsDate],[store], HASHBYTES('SHA2_512', concat(COUNT(*),CHAR(26), SUM(amt+qty+total)))
from BigTable
where store='111'
AND trsDate BETWEEN '2016-01-01' and '2016-01-30'
group by [trsDate],[store]
order by [trsDate],[store]

The execution plan shows an Index Seek at 100%… Adding an index on store/trsDate with amt, qty and total as included columns has helped drastically, but I have to investigate the impact of this index on the thousands of transactions processed and transferred.
Ideally I would love to have a checksum over all columns, but the more I dig the more I realise that it will be practically impossible (otherwise I would need to double the size of each table by putting all columns in the index's INCLUDE list).
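For reference, the covering index that helped is along these lines (the index name is just a placeholder):

-- Sketch of the covering index used for the per-store/per-day aggregates.
CREATE NONCLUSTERED INDEX IX_BigTable_store_trsDate
ON dbo.BigTable (store, trsDate)
INCLUDE (amt, qty, total);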

Best Answer

Transactional Replication.

As one idea, this sounds like a reasonable candidate for transactional replication. Checksums, cyclic redundancy checks (CRC), and row counts are used to validate source data against the destination data.

From Validate Replicated Data, Books Online:

How Data Validation Works

SQL Server validates data by calculating a row count or a checksum at the Publisher and then comparing those values to the row count or checksum calculated at the Subscriber. One value is calculated for the entire publication table and one value is calculated for the entire subscription table, but data in text, ntext, or image columns is not included in the calculations.

While the calculations are performed, shared locks are placed temporarily on tables for which row counts or checksums are being run, but the calculations are completed quickly and the shared locks removed, usually in a matter of seconds.

When binary checksums are used, a 32-bit redundancy check (CRC) occurs on a column-by-column basis rather than a CRC on the physical row on the data page. This allows the columns within the table to be in any order physically on the data page, but still compute to the same CRC for the row. Binary checksum validation can be used when there are row or column filters on the publication.
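If replication is in place, validation can be initiated at the Publisher with sp_publication_validation (the publication name below is just a placeholder):

-- Run at the Publisher, in the publication database.
-- @rowcount_only = 2 requests row counts plus binary checksums;
-- @full_or_fast = 2 uses fast row counts where possible and falls
-- back to full COUNT(*) when the fast counts do not match.
EXEC sp_publication_validation
     @publication   = N'BigTablePublication',   -- placeholder
     @rowcount_only = 2,
     @full_or_fast  = 2;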