SQL Server – Validation of data with CHECKSUM vs BINARY_CHECKSUM

checksum, sql-server, sql-server-2014, sql-server-2008

Based on this post, I was able to solve the performance issue by adding indexes, and I thought I had completed my project until I ran into an issue with the checksum. I then did some research (which I should have done before starting the project) and found that CHECKSUM and BINARY_CHECKSUM are not reliable in a system-critical environment.

I'm looking for a reliable way to compute a validation value for a table at a store and then compare it with the same table at the head office.

As explained in the post mentioned above, data files are transferred from multiple stores (transactions) to the head office as they happen. For each store, the table in question is pretty big. At the head office, that same table contains the same info from all stores, which makes it huge. The data is transferred through files, as there is no direct connection between the head office and the stores. Ideally it would be great if I could transfer all the data from the store, line by line, and compare it against the head office's table, but that's unthinkable since it's way too big.

I've tried to use HASHBYTES by casting columns to strings, but just found out that its input is limited to 8,000 bytes up to and including SQL Server 2014 (the limit was only lifted in SQL Server 2016). Our systems have to support every version of SQL Server from 2008 up until today.
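For context, this is roughly the per-row hashing that was attempted; it is only a minimal sketch, and the table and column names are placeholders:

    -- Minimal sketch of per-row hashing (dbo.StoreSales and its columns are
    -- placeholders). On SQL Server 2008-2014 the input to HASHBYTES is
    -- limited to 8,000 bytes, so the concatenated string breaks on wide rows.
    -- SHA1 is used here because SQL Server 2008 does not support SHA2_256.
    SELECT SaleId,
           HASHBYTES('SHA1',
                     CAST(SaleId   AS varchar(20)) + '|' +
                     CAST(SaleDate AS varchar(30)) + '|' +
                     CAST(Amount   AS varchar(30)) + '|' +
                     ISNULL(Comment, '')) AS RowHash
    FROM dbo.StoreSales;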

I'm trying to find a reliable way to get some type of aggregated checksum value that can be compared, table by table, between each store and the head office.

UPDATE: I've tried to use HASHBYTES, but it doesn't work in my case since I'm not comparing row by row, because of the amount of data at each store and at the head office. Example: if a store has 100,000 rows in a table, the head office, which holds the info for 10 stores, would contain 1,000,000 rows in that same table. I can't send all the rows from the store to the head office just to compare them and make sure that all the data has been transferred successfully. So my way of thinking is: aggregate the info at the store per date and compare it with the same aggregation for that store at the head office, and if I find a mismatch, a process is triggered to request the info for that specific date from the store again (a sketch of that comparison is below).
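To make the idea concrete, here is a minimal sketch with invented table and column names. Since there is no direct connection, the store would ship its small per-date summary as a file, the head office would load it into a staging table, and the mismatching dates would be the ones to re-request:

    -- Hypothetical reconciliation at the head office. The store computes its
    -- per-date totals and sends them as a file, loaded here into
    -- dbo.StoreDailyTotals_Staging (SaleDate, RowCnt, AmountSum).
    DECLARE @StoreId int = 42;   -- the store being reconciled

    SELECT COALESCE(st.SaleDate, ho.SaleDate) AS SaleDate   -- dates to re-request
    FROM dbo.StoreDailyTotals_Staging AS st                 -- aggregated at the store
    FULL OUTER JOIN
    (
        SELECT SaleDate,
               COUNT(*)    AS RowCnt,
               SUM(Amount) AS AmountSum
        FROM dbo.Sales                                       -- head office copy
        WHERE StoreId = @StoreId
        GROUP BY SaleDate
    ) AS ho
        ON ho.SaleDate = st.SaleDate
    WHERE st.SaleDate  IS NULL
       OR ho.SaleDate  IS NULL
       OR st.RowCnt    <> ho.RowCnt
       OR st.AmountSum <> ho.AmountSum;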

I was so happy to find CHECKSUM_AGG of a CHECKSUM, which did just that, but after reading up and testing different scenarios, such as CHECKSUM(100.00) giving the same result as CHECKSUM(10.00), it killed my whole logic.
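The collision described above is easy to check:

    -- Per the scenario above, these two different values come back with the
    -- same CHECKSUM, which is what makes CHECKSUM / CHECKSUM_AGG unsafe here.
    SELECT CHECKSUM(100.00) AS Chk_100,
           CHECKSUM(10.00)  AS Chk_10;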

Any other ideas? Anyone?

Best Answer

The solution that helped me solve this dilemma was to use HASHBYTES with SHA2_256, convert the result to a BIGINT, then convert that to a FLOAT, and finally sum the values.

Note that the conversion from VARBINARY to BIGINT can be dangerous (only part of the hash survives the conversion), but in my case it was acceptable. Converting directly from VARBINARY to FLOAT is not supported, and summing many records as BIGINT wasn't enough due to overflow, hence the conversion of the BIGINT to a FLOAT before summing.
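A minimal sketch of that approach, with invented table and column names (the row-to-string build-up has to be identical at the store and at the head office); note that SHA2_256 requires SQL Server 2012 or later, so 2008 instances would have to fall back to SHA1:

    -- Per-date aggregated validation value: hash each row, convert the hash
    -- to BIGINT (the conversion keeps only the last 8 bytes of the varbinary),
    -- widen to FLOAT so the SUM cannot overflow, and sum.
    DECLARE @StoreId int = 42;

    SELECT SaleDate,
           SUM(CONVERT(float,
               CONVERT(bigint,
                   HASHBYTES('SHA2_256',
                             CAST(SaleId   AS varchar(20)) + '|' +
                             CAST(SaleDate AS varchar(30)) + '|' +
                             CAST(Amount   AS varchar(30)) + '|' +
                             ISNULL(Comment, ''))))) AS ValidationValue
    FROM dbo.Sales
    WHERE StoreId = @StoreId   -- drop this filter when running at the store
    GROUP BY SaleDate;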