I see no glaring issue with that data flow. I always suggest to do as much work as possible in your source queries, so if you can create a dataset at the outset that allows you to populate both tables via a simple split, it would certainly use less memory. But things like that are not always possible depending on the data source and data format.
Also, the trash destination is good for development/debugging but does no good in production. Remove it. Let the 'Unwanted Data'expire at the split. SSIS can figure out the rest.
As long as you avoid blocking components (UNION, MERGE, etc.) there isn't any reason this process would take days. I frequently process many millions of rows in SSIS without issue. SSIS is only as slow as the person who designed the process.
Is it currently having performance issues?
I am not a huge fan of either the extra "lock" table or the idea of locking the whole table to grab the next record. I get why it is being done, but that also hurts concurrency for operations that are updating to release a locked record (surely two processes can't be fighting over that when it is not possible for two processes to have locked the same record at the same time).
My preference would be to add a ProcessStatusID (typically TINYINT) column to the table with the data being processed. And is there a field for LastModifiedDate? If not, then it should be added. If yes, then do these records get updated outside of this processing? If records can be updated outside of this particular process, then another field should be added to track StatusModifiedDate (or something like that). For the rest of this answer I will just use "StatusModifiedDate" as it is clear in its meaning (and in fact, could be used as the field name even if there is currently no "LastModifiedDate" field).
The values for ProcessStatusID (which should be placed into a new lookup table called "ProcessStatus" and Foreign Keyed to this table) could be:
- Completed (or even "Pending" in this case as both mean "ready to be processed")
- In Process (or "Processing")
- Error (or "WTF?")
At this point it seems safe to assume that from the application, it just wants to grab the next record to process and won't be passing anything in to help make that decision. So we want to grab the oldest (at least in terms of StatusModifiedDate) record that is set to "Completed" / "Pending". Something along the lines of:
SELECT TOP 1 pt.RecordID
FROM ProcessTable pt
WHERE pt.StatusID = 1
ORDER BY pt.StatusModifiedDate ASC;
We also want to update that record to "In Process" at the same time to prevent the other process from grabbing it. We could use the OUTPUT
clause to let us do the UPDATE and SELECT in the same transaction:
UPDATE TOP (1) pt
SET pt.StatusID = 2,
pt.StatusModifiedDate = GETDATE() -- or GETUTCDATE()
OUTPUT INSERTED.RecordID
FROM ProcessTable pt
WHERE pt.StatusID = 1;
The main problem here is that while we can do a TOP (1)
in an UPDATE
operation, there is no way to do an ORDER BY
. But, we can wrap it in a CTE to combine those two concepts:
;WITH cte AS
(
SELECT TOP 1 pt.RecordID
FROM ProcessTable pt (READPAST, ROWLOCK, UPDLOCK)
WHERE pt.StatusID = 1
ORDER BY pt.StatusModifiedDate ASC;
)
UPDATE cte
SET cte.StatusID = 2,
cte.StatusModifiedDate = GETDATE() -- or GETUTCDATE()
OUTPUT INSERTED.RecordID;
The obvious question is whether or not two processes doing the SELECT at the same time can grab the same record. I am pretty sure that the UPDATE with OUTPUT clause, especially combined with the READPAST and UPDLOCK hints (see below for more details), will be fine. However, I have not tested this exact scenario. If for some reason the above query does not take care of the race condition, then adding the following will: application locks.
The CTE query above can be wrapped in sp_getapplock and sp_releaseapplock to create a "gate keeper" for the process. In doing so, only one process at a time will be able to enter in order to run the query above. The other process(es) will be blocked until the process with the applock releases it. And since this step of the overall process is just to grab the RecordID, it is fairly quick and won't be blocking the other process(es) for very long. And, just as with the CTE query, we are not blocking the entire table, thereby allowing other updates to other rows (to set their status to either "Completed" or "Error"). Essentially:
BEGIN TRANSACTION;
EXEC sp_getapplock @Resource = 'GetNextRecordToProcess', @LockMode = 'Exclusive';
{CTE UPDATE query shown above}
EXEC sp_releaseapplock @Resource = 'GetNextRecordToProcess';
COMMIT TRANSACTION;
Application locks are very nice but should be used sparingly.
Lastly, you just need a stored procedure to handle setting the status to either "Completed" or "Error". And that can be a simple:
CREATE PROCEDURE ProcessTable_SetProcessStatusID
(
@RecordID INT,
@ProcessStatusID TINYINT
)
AS
SET NOCOUNT ON;
UPDATE pt
SET pt.ProcessStatusID = @ProcessStatusID,
pt.StatusModifiedDate = GETDATE() -- or GETUTCDATE()
FROM ProcessTable pt
WHERE pt.RecordID = @RecordID;
Table Hints (found at Hints (Transact-SQL) - Table):
Best Answer
The table lock option facilitates minimal logging, reducing transaction log space requirements under certain circumstances. Below are relevant excerpts from the Prerequisites for Minimal Logging topic in the SQL Server Books Online.
The table lock will block other processes using the table, including the other OLEDB Destinations within the SSIS package. You could set the Maximum Insert Commit Size to improve concurrency but with the consideration that file will be partially imported in the event of a failure.