Sql-server – Fetch large data from Sql Server and process in chunks

sql serversql-server-2012

I am working on an API which would need to process large dataset. The query to select records is complex and takes time to complete. I cannot use the pagination using ROW_NUMBER as not all records returned by query will be marked closed. The same records (potentially all but not necessarily) may get reselected by the query.

I am thinking to use global temporary table as working set. Below is my approach:

  • API will first create the global temporary table.
  • API will execute the query and populate the temp table.
  • API will take data in chunks and process it.
  • API will drop the table after processing all records.

The API can be scheduled to run at an interval of 5 or 10 minutes. There will not be concurrent instances running, only one instance will run at one time.

Do you see any issues with the usage of global temporary table? Would a permanent table be more appropriate in this scenario?

Best Answer

Using a permanent table (as the commenter suggests) makes it a lot easier to track dependencies. Instead of recreating the table every time, just TRUNCATE it.

When you implement a pattern like this, you have to make sure you are making progress and not selecting the same records over and over again. One way to do this, is to process records in the order of the primary key on the input table. Here is the pseudocode for doing this:

DECLARE @Progress [Type of key]
/* Initial state, start from the bottom of the table */
SELECT  @Progress = MIN(Key) FROM Input

/* Main event loop to do work */
WHILE 1=1 BEGIN
  INSERT INTO WorkTable
  SELECT TOP (@SizeOfBatch) * 
  FROM Input WHERE Key > @Progress
  ORDER BY Key ASC

  /* Do work on WorkTable */

  /* Make sure we start where we left off */
  SELECT @Progress = MAX(Key) FROM WorkTable

  TRUNCATE TABLE WorkTable

  /* Wait for 5-10 min */
END

Of course, the above only works if the Key column is ascending. If it isn't you have to maintain a column in the Input table that track whether work has been done or not (and then SELECT TOP the ones where it hasn't)