SQL Server – Distribute data into X groups of at most N records each

sql-server, stored-procedures

I have a set of Z entries (e.g. 100 records) which I want to insert into Header and Detail tables.

I want to divide this set into X groups, each containing at most N entries; the last group may contain <= N entries. Here X is variable and N is static. BatchID is the IDENTITY key of the header table.

  1. Simple Approach

    a. dump all the data in Temp table
    b. Find number of Loops : CEILING(COUNT/N)
    c. For every loop insert 1 entry in Header, find IDENTITY KEY and use that as Batch ID and insert N records in Detail Table with that Batch ID like:

    UPDATE TOP(N) #Temp 
    SET IsPickedUp = 1
    OUTPUT @BatchID,deleted., deleted.... 
    INTO Detail(BatchID, ...)
    WHERE IsPickedUp = 0
    
  2. Optimized Approach

    a. Find number of batches to be created of N records, where last Batch Count will be <= N
    b. Insert all records from temp table to Detail table with correct BatchID

I need help for the second approach.

Here is my existing code. It works fine but takes 40 minutes in production. One crucial requirement is that either all the batches are created or none are; we don't want incomplete population (hence the transaction around the while loop).

Header and Detail Sample Data

BEGIN TRY
BEGIN TRAN
    CREATE TABLE #TotalCID (ConsumerID VARCHAR(16),IsPickedUp BIT NOT NULL CONSTRAINT DF_IsPickedUp DEFAULT (0))
    CREATE CLUSTERED INDEX IX_ConsumerID ON #TotalCID(IsPickedUp,ConsumerID)

    INSERT INTO #TotalCID (ConsumerID)
    SELECT ConsumerID FROM ConsumerTable WHERE SeedingStatus IN (0,2) -- 2Cr Records

    DECLARE @TotalCount DECIMAL(11,2) = 0
    DECLARE @BatchCount DECIMAL(9,2) = 0
    DECLARE @LoopCnt INT = 0
    DECLARE @BatchSize INT = 25000
    DECLARE @batchId INT = 0

    SELECT @TotalCount=COUNT(ConsumerID) FROM #TotalCID

    SET @LoopCnt = CEILING (@TotalCount/@BatchSize) 

    WHILE (@LoopCnt > 0)
    BEGIN

        INSERT INTO ETLHeader(Batch_Date,Total_Rec_count)
        VALUES(GETDATE(),0)

        SET @batchId = SCOPE_IDENTITY() -- BatchID in ETLHeader is PK

        UPDATE TOP (CONVERT(INT,@BatchSize)) #TotalCID
        SET IsPickedUp=1
        OUTPUT @batchId,deleted.ConsumerID
        INTO ETLDetail(Batch_ID,ConsumerID)   
        WHERE IsPickedUp = 0

        SET @BatchCount = @@rowcount

        UPDATE ETLHeader
        SET  Total_Rec_count = @BatchCount 
        WHERE Batch_Id = @batchId 

        SET @LoopCnt=@LoopCnt-1
    END

        UPDATE ConsumerTable 
        SET SeedingStatus = 1
        WHERE ConsumerID IN (SELECT ConsumerID FROM #TotalCID Where IsPickedUp= 1)          
        AND SeedingStatus IN (0,2)

IF OBJECT_ID('tempdb..#TotalCID') IS NOT NULL
    DROP TABLE #TotalCID
COMMIT TRAN
END TRY 
BEGIN CATCH
-- ERROR and Open TRAN Handling 
END CATCH

Best Answer

From the example code it seems that you want to transfer all outstanding rows, or none. This can be done in a couple of statements, though there are good reasons not to. More on that later.

If there's a foreign key in the detail table referencing the header table, it must be honoured or worked around. Either the header table must be populated first and then the detail table, or the foreign key must be disabled and then re-enabled once the data is in place. Either way some data will be read twice.

I'll not use the temporary table (#TotalCID) either, saving the read and write it consumes. Instead I'll create a view on the source table (ConsumerTable). Even though there are a lot more rows in the base table than in the temporary table (twenty million versus one hundred thousand), you have to read the base table at some point to find the rows to transfer. I believe a filtered index will be useful here. If there are concurrent writes to the base table, they can be avoided by using a "high water mark" (if there's an identity column) or a suitable isolation level. It's OK to retain #TotalCID if you're more comfortable with that approach.
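
As a brief aside, here is a minimal sketch of the "high water mark" idea. It assumes a hypothetical ever-increasing identity column ConsumerSeq on ConsumerTable (not part of your schema); the main solution below does not depend on it.

-- Capture the current maximum once, then have every later statement filter on it
-- so rows inserted while the run is in progress are ignored and all statements
-- see the same row set.
declare @HighWaterMark bigint = (select MAX(ConsumerSeq) from ConsumerTable);

select ConsumerID
from ConsumerTable
where SeedingStatus in (0, 2)
  and ConsumerSeq <= @HighWaterMark;   -- concurrent inserts fall above the mark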

First I will show all the code, then I'll explain my thinking.

drop table if exists ConsumerTable;
drop table if exists ETLDetail;
drop table if exists ETLHeader;
go

create table ConsumerTable
(
    ConsumerID      varchar(16) primary key,    -- Note A
    SeedingStatus   int         default (0)     -- Note B
);

go
-- Note C
create index ToTransfer on ConsumerTable (ConsumerID)
where SeedingStatus in (0,2);
go

insert ConsumerTable(ConsumerID)
values ('a'), ('b'), ('c'), ('d'), ('e'),
       ('f'), ('g'), ('h'), ('i'), ('j');
go

create table ETLHeader
(
    HeaderId        int     primary key,
    Batch_Date      date    not null,
    Total_Rec_count int     not null
);

create table ETLDetail
(
    Batch_ID    int not null,
    ConsumerID  varchar(16) not null
);

go
create or alter view ConsumersToTransfer as
with DataRows as
(
    select
        ConsumerID,
        RowId       = ROW_NUMBER() over (order by ConsumerID)   -- Note D, E
    from ConsumerTable
    where SeedingStatus in (0, 2)                               -- Note F
)
select
    ConsumerID,
    RowId,
    BatchId = (RowId - 1) / (select BatchSize = 4)              -- Note G
from DataRows;
go

Notes

A For testing I'll make this the PK. It doesn't have to be for this method to work.

B The default is a convenience for my test data. It is not required for this solution.

C I think this filtered index will be useful, but you must have code to find these rows efficiently already. Continue to use that.

D To pre-calculate BatchID I will use integer division. From the docs

If an integer dividend is divided by an integer divisor, the result is an integer that has any fractional part of the result truncated.

For that I'll need a unique, gap-less sequence of integers, one for each source row. I'll use ROW_NUMBER() to build this. If your table has an identity column which provides the same conditions it can be used instead. Note that identities can leave gaps on occasion, or rows may have been deleted. If so the batch sizes will not be uniform.

E The OVER() clause is required. We don't care which rows end up together in a batch, however, so use the clustering index's leading column to avoid a sort. If you need to keep some rows near others you can change this clause, though there'll likely be a performance hit.

F The predicate matches that of the filtered index defined above.

G I'll use a batch size of 4 for testing. Change this to your production value (25,000) for your tests. I subtract 1 from RowId because ROW_NUMBER() is 1-based and I want the Nth row to end a batch rather than start a new batch.
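
To see how notes D and G combine, here is a quick throwaway query (not part of the solution) that maps ten row numbers to batch numbers with a batch size of 4:

-- (RowId - 1) / 4 gives batches 0,0,0,0  1,1,1,1  2,2
with DataRows as
(
    select RowId = ROW_NUMBER() over (order by (select null))
    from (values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) as v(n)
)
select RowId, BatchId = (RowId - 1) / 4
from DataRows;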

Here is the code to transfer rows:

begin transaction;

    declare @NextBatchStart int
              = 1 + (select ISNULL(MAX(HeaderId), 0) from ETLHeader);       -- Note H

    insert ETLHeader                    -- Note I
    (
        HeaderId,
        Batch_Date,
        Total_Rec_count
    )
    select
        BatchId + @NextBatchStart,
        Batch_Date = GETDATE(),
        COUNT(*)
    from ConsumersToTransfer
    group by BatchId;

    insert ETLDetail                    -- Note J
    (
        Batch_ID,
        ConsumerID
    )
    select
        BatchId + @NextBatchStart,
        ConsumerId
    from ConsumersToTransfer;

    update ct
    set SeedingStatus = 1
    from ConsumerTable as ct
    inner join ETLDetail as d                   -- Note K
        on d.ConsumerId = ct.ConsumerId
        and d.Batch_ID >= @NextBatchStart       -- Note L
    where ct.SeedingStatus in (0, 2);           -- Note M

commit;

Notes

H This plays the role of your SCOPE_IDENTITY() ensuring new header rows follow on from existing rows.

I Each INSERT will read all source rows. Since this will be in clustered key sequence (because of the OVER() clause), it should be fast.

J If there are concurrent writes to ConsumerTable these two statements could return different results and, hence, different batches. Specifically Total_Rec_count will be off for the last (non-full) batch.

K An inner join is logically equivalent to your .. IN (select ..).

L Only need to consider batches created in the current run.

M Predicate matches the filtered index.

Wrap this in the error-catching and rollback logic to handle your desired all-or-nothing outcome.
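
A minimal sketch of that wrapper, with the statements shown above placed where the comment indicates (tailor the error handling to your own standards):

begin try
    begin transaction;

    -- declare @NextBatchStart, insert ETLHeader, insert ETLDetail,
    -- update ConsumerTable: the statements shown above go here

    commit;
end try
begin catch
    -- undo everything if any statement failed, then surface the error
    if XACT_STATE() <> 0
        rollback;
    throw;
end catch;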


I think part of the reason your current approach takes so long is that the system does not use IX_ConsumerID, because the leading column (IsPickedUp) is not selective enough. I touch on this in another question here.

I understand you want an all-or-nothing outcome. Clearly your hardware can handle it, because you have written code that runs. However, it takes a toll on the system. All those written rows have to be held in the transaction log; the log must grow to accommodate them and cannot be truncated until the transaction completes. The system must also hold locks, preventing other work from continuing.

A better way may be to process each batch in its own transaction. Fewer resources will be needed for each batch compared to the whole, and locks will be released sooner. Restart logic will ensure no row is processed twice or missed.
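
One possible shape for that, reusing the loop from the question but committing per batch. It assumes Batch_ID in ETLHeader is an identity, as in the original code; SeedingStatus itself acts as the restart marker, since rows already transferred are no longer 0 or 2.

declare @BatchSize int = 25000,
        @Rows      int = 1,
        @BatchId   int;

while @Rows > 0
begin
    begin try
        begin transaction;

        insert ETLHeader (Batch_Date, Total_Rec_count)
        values (GETDATE(), 0);

        set @BatchId = SCOPE_IDENTITY();

        -- move one batch; flipping SeedingStatus is the restart marker
        update top (@BatchSize) ct
        set SeedingStatus = 1
        output @BatchId, deleted.ConsumerID
        into ETLDetail (Batch_ID, ConsumerID)
        from ConsumerTable as ct
        where ct.SeedingStatus in (0, 2);

        set @Rows = @@ROWCOUNT;

        if @Rows = 0
        begin
            -- nothing left to move; drop the empty header row
            delete ETLHeader where Batch_Id = @BatchId;
        end
        else
        begin
            update ETLHeader
            set Total_Rec_count = @Rows
            where Batch_Id = @BatchId;
        end

        commit;
    end try
    begin catch
        if XACT_STATE() <> 0
            rollback;
        throw;   -- stop; a rerun resumes where this left off
    end catch;
end;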