Sql-server – How to assign GUID to batches of data based on row counts and tunable batch size

batch processingsql serveruuid

I need to assign batches of rows a common GUID id (for consumption by an external process).

Below is a simplified setup describing what I'm looking for:

BASE_TABLE represents one of pre-existing tables in this scenario, much simplified obviously. Values in the "real" table should be considered random; they're not consecutive in real life (even the ID is not really an int)

CREATE TABLE BASE_TABLE (
    ID int
    ,VALUE1 varchar(10)
    ,VALUE2 varchar(255)
) 

TARGET_TABLE represents the "output" of the process I'm working on, which is just the input to something else that requires batches of data to have a uniqueIdentifier BATCHID for each batch. Batches in real life are ~1000 rows each.

CREATE TABLE TARGET_TABLE 
(
  ID int
  ,VALUE1 varchar(10)
  ,VALUE2 varchar(255)
  ,BATCHID uniqueIdentifier
)

(Add some dummy data)

DECLARE @DATA1 int
set @DATA1=0
WHILE @DATA1<100
BEGIN
  INSERT INTO BASE_TABLE (ID,VALUE1,VALUE2) VALUES (@DATA1,'v1'+CONVERT(varchar,@DATA1), 'v2'+CONVERT(varchar,@DATA1))
  SET @DATA1=@DATA1+1
END

/**
example DESIRED RESULT, with batch size of 5

ID VALUE1 VALUE2 BATCHID
0 v10 v20 38B1B4FB-7F1E-44FD-9336-19095C01C629
1 v11 v21 38B1B4FB-7F1E-44FD-9336-19095C01C629
2 v12 v22 38B1B4FB-7F1E-44FD-9336-19095C01C629
3 v13 v23 38B1B4FB-7F1E-44FD-9336-19095C01C629
4 v14 v24 38B1B4FB-7F1E-44FD-9336-19095C01C629
5 v15 v25 41122454-A743-4545-8F0C-D7B461E072AE
6 v16 v26 41122454-A743-4545-8F0C-D7B461E072AE
7 v17 v27 41122454-A743-4545-8F0C-D7B461E072AE
8 v18 v28 41122454-A743-4545-8F0C-D7B461E072AE
9 v19 v29 41122454-A743-4545-8F0C-D7B461E072AE
10 v110 v210 41122454-A743-4545-8F0C-D7B461E072AE
11 v111 v211 FBDE5513-C869-4F2D-AC4D-40CBEF4A2D48

etc

Important result is that for each batch of size N there are N rows with the same BatchID, which must be a GUID.

I'm pretty sure I could do what I want with a cursor, but I want to do something like this if possible:

select 
ID,VALUE1,VALUE2,
NEWID() OVER(PARTITION BY ROW_NUMBER() OVER (ORDER BY ID)) AS BatchID
from BASE_TABLE

but it's not valid because NEWID() isn't an aggregate function.

Best Answer

you have to play with row_numbers

 declare @part int = 3
 
 ;with baseRowNums as 
  (
    SELECT ID, VALUE1, VALUE2,
       Row_Number() Over (ORDER BY ID) AS rn
    FROM BASE_TABLE
  )
 ,batchNums as
 (
 SELECT *,
    Sum(CASE WHEN rn % @part = 1 THEN 1 ELSE 0 END)
    Over (ORDER BY ID, rn 
          ROWS Unbounded Preceding) AS part
 FROM baseRowNums
 )
 ,GUIDs as
 (select part, MAX(newid()) as GUID
  from batchNums
  group by part
 )
 select ID, VALUE1, VALUE2, GUIDs.GUID
 from batchNums
 join GUIDs on batchNums.part = GUIDs.part

https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=7c26d9bab095ce9d8ca1abf1f4c1a898