Sql-server – Fetch large data from SQL Server and process in chunks

sql server, sql-server-2012

I am working on an API that needs to process a large dataset. The query that selects the records is complex and takes time to complete. I cannot paginate with ROW_NUMBER because not all records returned by the query will be marked closed, so the same records (potentially all, but not necessarily) may get reselected by the query.

I am thinking of using a global temporary table as the working set. Here is my approach:

1. The API will first create the global temporary table.
2. The API will execute the query and populate the temp table.
3. The API will take the data in chunks and process it.
4. The API will drop the table after processing all records.

The API can be scheduled to run at an interval of 5 or 10 minutes. Instances will not run concurrently; only one instance runs at a time.

Do you see any issues with using a global temporary table? Would a permanent table be more appropriate in this scenario?

Best Answer

Using a permanent table (as the commenter suggests) makes it a lot easier to track dependencies. Instead of recreating the table every time, just TRUNCATE it.
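A minimal sketch of that setup, assuming a hypothetical dbo.WorkTable whose columns (Id, Payload) are placeholders for whatever the real query returns:

```sql
-- Hypothetical working table; column names and types are placeholders.
IF OBJECT_ID(N'dbo.WorkTable', N'U') IS NULL
BEGIN
    -- First run only: create the permanent working table.
    CREATE TABLE dbo.WorkTable
    (
        Id      INT           NOT NULL PRIMARY KEY,
        Payload NVARCHAR(400) NULL
    );
END
ELSE
BEGIN
    -- The table already exists from an earlier run: empty it instead of dropping it.
    TRUNCATE TABLE dbo.WorkTable;
END;
```

Because the table is never dropped, anything that references it (procedures, permissions, dependency tracking) stays valid between runs.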

When you implement a pattern like this, you have to make sure you are making progress and not selecting the same records over and over again. One way to do this is to process records in the order of the primary key on the input table. Here is the pseudocode for doing this:

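DECLARE @SizeOfBatch INT = 1000  /* assumed batch size, tune as needed */
DECLARE @Progress INT            /* assumes an INT key; use the real type of the key */
/* Initial state: @Progress is NULL, meaning nothing has been processed yet.
   Starting at MIN(Key) with a strict > comparison would skip the first row. */

/* Main event loop to do work */
WHILE 1=1 BEGIN
  /* Assumes WorkTable has the same columns as Input.
     Key is a reserved word in T-SQL, hence the brackets. */
  INSERT INTO WorkTable
  SELECT TOP (@SizeOfBatch) *
  FROM Input WHERE @Progress IS NULL OR [Key] > @Progress
  ORDER BY [Key] ASC

  /* Do work on WorkTable */

  /* Make sure we start where we left off.
     If the batch was empty, MAX returns NULL and we keep the old position. */
  SELECT @Progress = COALESCE(MAX([Key]), @Progress) FROM WorkTable

  TRUNCATE TABLE WorkTable

  /* Wait for 5-10 min */
  WAITFOR DELAY '00:05:00'
END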

Of course, the above only works if new rows always receive a higher Key value than rows that have already been processed (an ever-increasing key). If that is not the case, you have to maintain a column in the Input table that tracks whether the work has been done, and then SELECT TOP from the rows where it has not.
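The flag-column variant might look roughly like the sketch below. The IsProcessed column, the Id and Payload columns, and the dbo schema are all assumptions made for illustration; they are not part of the original tables.

```sql
-- Assumed schema (hypothetical):
--   dbo.Input(Id INT PRIMARY KEY, Payload NVARCHAR(400), IsProcessed BIT NOT NULL DEFAULT 0)
--   dbo.WorkTable(Id INT PRIMARY KEY, Payload NVARCHAR(400))
DECLARE @SizeOfBatch INT = 1000;

-- Pick the next batch of rows that have not been processed yet.
INSERT INTO dbo.WorkTable (Id, Payload)
SELECT TOP (@SizeOfBatch) Id, Payload
FROM dbo.Input
WHERE IsProcessed = 0
ORDER BY Id;

/* Do work on dbo.WorkTable */

-- Mark exactly the rows that were in this batch as done.
UPDATE i
SET i.IsProcessed = 1
FROM dbo.Input AS i
JOIN dbo.WorkTable AS w
    ON w.Id = i.Id;

TRUNCATE TABLE dbo.WorkTable;
```

Marking rows only after the work has finished means a failed run leaves them unflagged, so they are picked up again next time. An index on the flag column (a filtered index on IsProcessed = 0, for example) would probably help if Input is large.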

Related Solutions

Sql-server – LATCH_EX Waits on Resource METADATA_SEQUENCE_GENERATOR

Assuming you can isolate the problem to the generation of identity values (try removing that column as a test), what I would recommend is this:

1. Remove the IDENTITY property from the column in the final table.
2. Generate identity values in each of the #Temporary tables.
3. When loading the final table, combine a numeric identifier for the particular store with the identity values from step 2.

So if you have store ids 3 and 4, you would end up with final id values like this:

Or something similar to that. You get the idea.

This will eliminate the need to serialize on IDENTITY generation while preserving uniqueness in the final result.
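One possible way to combine the two values is sketched below. The encoding (store id multiplied by a constant larger than any per-store identity value, plus the local identity) is an assumption chosen for illustration, as are all table and column names; the exact scheme is left open above.

```sql
-- Hypothetical per-store staging tables, each with its own IDENTITY sequence.
CREATE TABLE #Store3 (LocalId INT IDENTITY(1,1) NOT NULL, Amount MONEY NOT NULL);
CREATE TABLE #Store4 (LocalId INT IDENTITY(1,1) NOT NULL, Amount MONEY NOT NULL);

INSERT INTO #Store3 (Amount) VALUES (10.00), (12.50);
INSERT INTO #Store4 (Amount) VALUES (7.25);

-- Final table: a plain BIGINT key, no IDENTITY property.
CREATE TABLE #Final
(
    Id      BIGINT NOT NULL PRIMARY KEY,
    StoreId INT    NOT NULL,
    Amount  MONEY  NOT NULL
);

-- Assumed encoding: StoreId * 1,000,000,000 + LocalId.
-- This stays unique as long as no store produces 1,000,000,000 or more rows.
INSERT INTO #Final (Id, StoreId, Amount)
SELECT CAST(3 AS BIGINT) * 1000000000 + LocalId, 3, Amount FROM #Store3
UNION ALL
SELECT CAST(4 AS BIGINT) * 1000000000 + LocalId, 4, Amount FROM #Store4;

SELECT Id, StoreId, Amount FROM #Final ORDER BY Id;
```

Each staging table generates its identity values independently, so the loads do not contend on a single shared sequence generator.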

Alternatively, depending on how the process works, insert the final calculated id values into the #Temporary tables. Then you could create a view that UNION ALLs them together, eliminating the need to copy the data at all.

Sql-server – Scheduled copy of large table from SQL Server 2000 to 2008

I would likely put a TRIGGER on the 2000 table and insert the captured changes into an audit table. Then either sync the 2008 database immediately with just those changes through SSIS (or even a linked server, if that suits you), or do an overnight sync, depending on which has the lesser impact on business-hours performance and on the network as a whole. Either way, truncate the audit table after each successful transfer. This would ensure the changes are easy to acquire and small enough to pull across the network without generating calls from your infrastructure engineers. The audit table itself could even live in a different filegroup to reduce disk contention if necessary.
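A rough sketch of such a trigger is shown below. The table dbo.SourceTable, its integer key column Id, and the audit table layout are hypothetical; the real audit table would capture whatever the sync job needs.

```sql
-- Hypothetical audit table recording which keys changed and how.
CREATE TABLE dbo.SourceTable_Audit
(
    AuditId   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    SourceId  INT      NOT NULL,
    Operation CHAR(1)  NOT NULL,            -- 'I', 'U' or 'D'
    ChangedAt DATETIME NOT NULL DEFAULT GETDATE()
);
GO

CREATE TRIGGER trg_SourceTable_Audit
ON dbo.SourceTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Rows in "inserted" are inserts, or updates when "deleted" also has rows.
    INSERT INTO dbo.SourceTable_Audit (SourceId, Operation)
    SELECT i.Id,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
    FROM inserted AS i;

    -- Rows that appear only in "deleted" were removed.
    INSERT INTO dbo.SourceTable_Audit (SourceId, Operation)
    SELECT d.Id, 'D'
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.Id = d.Id);
END
GO
```

The sync job would then read the distinct SourceId values, copy the matching rows (or apply the deletes) on the 2008 side, and truncate the audit table once the transfer succeeds.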

That said, there's probably half a million ways to go about doing this, and your specific, optimal solution is going to revolve largely around the details of how your environment operates today and even the specific structure of the table in question. For instance, if the "rebuild" you've mentioned is a severe example, involving either TRUNCATE or a complete DROP and CREATE, the simple TRIGGER suggestion I've made is not going to work for you without significant extra effort, especially if your source table uses an IDENTITY column.

In that case, a Lookup task in a Data Flow may seem like a reasonable route at a glance, but this effectively results in the entire table being pulled across the network anyway for the comparison. Comparison logic at this point is just unnecessary overhead; you may as well put the receiving catalog in the SIMPLE recovery model and stage all of the data you've received on the other side of the pipe. Doing an ALTER ... SWITCH after the transfer would at least then limit the lock on your reporting table to a single metadata operation, keeping it otherwise completely available to the reports being generated from it.
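As a sketch of the staging-and-switch idea, with every object name (ReportingDb, dbo.Report, dbo.Report_Staging, dbo.Report_Old) invented for illustration: ALTER TABLE ... SWITCH requires the target table to be empty and the tables to have matching structure (columns, indexes, constraints) on the same filegroup.

```sql
-- All names here are hypothetical.
-- Minimal logging for the bulk load; note this changes the backup strategy for the database.
ALTER DATABASE ReportingDb SET RECOVERY SIMPLE;
GO

USE ReportingDb;
GO

-- ... bulk load the full copy of the source data into dbo.Report_Staging here ...

-- dbo.Report_Old must be empty at this point.
BEGIN TRANSACTION;
    -- Move the current reporting data out of the way (metadata-only operation).
    ALTER TABLE dbo.Report SWITCH TO dbo.Report_Old;
    -- Move the freshly staged data into the reporting table (metadata-only operation).
    ALTER TABLE dbo.Report_Staging SWITCH TO dbo.Report;
COMMIT TRANSACTION;

-- Discard the previous copy so the next run finds an empty target.
TRUNCATE TABLE dbo.Report_Old;
```

The two switches only need a brief schema-modification lock on the reporting table, so reports are blocked for the duration of the metadata change rather than for the whole load.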
