SQL Server – Efficient Way to Compare Two Large Data Sets

exceptperformancequery-performancesql serversql-server-2008-r2

Currently, I'm comparing two data sets, that contain unique StoreKey/ProductKey combinations.

The 1st data set has the unique StoreKey/ProductKey combinations for sales between begin January 2012 and end May 2014 (result = 450K lines). The 2nd data set has the unique StoreKey/ProductKey combinations, for sales begin June 2014, until today (result = 190K lines).

I'm looking to find the StoreKey/ProductKey combinations that are in the 2nd set, but not in the 1st set – i.e. new products sold from the beginning of June.

Up until now, I've dumped the two data sets into temp tables, created indexes for both tables on both keys, and used the EXCEPT statement to find unique items.

What is the most efficient way of comparing such large data sets?
Is there a more efficient way of doing this type of large comparison?

Best Answer

Using EXCEPT is in my opinion the way to go here, but you might want to reconsider the use of the temporary table. By doing so you are effectively duplicating your data in memory, which will slow you down. If the indexes you need exist on the source tables (as I suspect), just compare the appropriate SELECTS:

SELECT StoreKey,ProductKey FROM table WHERE sales BETWEEN date1 AND date2
EXCEPT
SELECT StoreKey,ProductKey FROM table WHERE sales BETWEEN date3 AND date4