Sql-server – Select data divided in groups evenly distributed by value

sql-server, sql-server-2012, t-sql

I would like to split the rows of a table into 4 groups so that the sum of a value column is distributed as evenly as possible across the groups. I am probably not explaining this clearly enough, so I will try to give an example.

Here I use NTILE(4) to create the 4 groups:

SELECT Time, NTILE(4) OVER (ORDER BY Time DESC) AS N FROM TableX

Time -  N
-------------
10  -   1
 9  -   2
 8  -   3
 7  -   4
 6  -   1
 5  -   2
 4  -   3
 3  -   4
 2  -   1
 1  -   2

In the above query and result, the other columns have been omitted for brevity.

So you can see the groups also as follows:

  1    2    3    4
---  ---  ---  ---
 10    9    8    7
  6    5    4    3
  2    1    
---  ---  ---  ---
 18   15   12   10  Sum Totals of Time

Notice that the sum totals of Time using NTILE are not really balanced between the groups. A better distribution of the Time values would be, for example:

  1    2    3    4
---  ---  ---  ---
 10    9    8    7
  3    5    4    6
  1         2
---  ---  ---  ---
 14   14   14   13  Sum Totals of Time

Here the sum totals of Time are more evenly distributed over the 4 groups.

How can I do this with a T-SQL statement?

Furthermore I have to say that I am using SQL Server 2012.
If you have something that can help me out, let me know.

I wish you a nice day.

Stan

Best Answer

Here's a stab at an algorithm. It's not perfect, and depending on how much time you want to spend refining it, there are probably some further small gains to be made.

Let's assume you have a table of tasks to be performed by four queues. You know the amount of work associated with performing each task, and you want all four queues to get an almost equal amount of work to do, so all queues will complete at about the same time.

First off, I'd partition the tasks using a modulo, ordered by their size, from small to large.

SELECT [time], ROW_NUMBER() OVER (ORDER BY [time])%4 AS grp, 0

The ROW_NUMBER() orders every row by size, then assigns a row number, starting at 1. This row number is assigned a "group" (the grp column) on a round-robin basis. First row is group 1, second row is group 2, then 3, the fourth gets group 0, and so on.

time ROW_NUMBER() grp
---- ------------ ---
   1            1   1
  10            2   2
  12            3   3
  15            4   0
  19            5   1
  22            6   2
...
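As a minimal, self-contained sketch of this round-robin step, the batch below uses a hypothetical table variable @tasks holding the six sample values from the table above. The name and the data are assumptions for illustration only.

```sql
-- Hypothetical sample data: the six values shown in the table above.
DECLARE @tasks TABLE ([time] int NOT NULL);

INSERT INTO @tasks ([time])
VALUES (1), (10), (12), (15), (19), (22);

-- Number the rows by size, then deal them out to four groups in turn.
SELECT [time],
       ROW_NUMBER() OVER (ORDER BY [time])     AS rn,
       ROW_NUMBER() OVER (ORDER BY [time]) % 4 AS grp
FROM @tasks
ORDER BY [time];
```

The grp column should cycle through 1, 2, 3, 0, 1, 2, matching the table.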

For ease of use, I'm storing the time and grp columns in a table variable called @work.

Now, we can perform a few calculations on this data:

WITH cte AS (
    SELECT *, SUM([time]) OVER (PARTITION BY grp)
             -SUM([time]) OVER (PARTITION BY (SELECT NULL))/4 AS _grpoffset
    FROM @work)
...

The column _grpoffset is how much the total time per grp differs from the "ideal" average. If the total time of all the tasks is 1000 and there are four groups, there should ideally be a total of 250 in each group. If a group contains a total of 268, that group's _grpoffset=18.

The idea is to identify the two best rows, one in a "positive" group (with too much work) and one in a "negative" group (with too little work). If we can swap groups on those two rows, we could reduce the absolute _grpoffset of both groups.

Example:

time grp total _grpoffset
---- --- ----- ----------
   3   1   222         40
  46   1   222         40
  73   1   222         40
 100   1   222         40
   6   2   134        -48
  52   2   134        -48
  76   2   134        -48
  11   3   163        -19
  66   3   163        -19
  86   3   163        -19
  45   0   208         26
  71   0   208         26
  92   0   208         26
----
=727

With a grand total of 727, each group should have a score of about 182 for the distribution to be perfect. The difference between the group's score and 182 is what we're putting in the _grpoffset column.
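To make these numbers reproducible, here is a sketch that loads the 13 example rows into a table variable and computes each group's total and offset. The @work declaration is a cut-down, hypothetical version with only the columns needed here. Note that the expression uses integer division, so 727/4 is truncated to 181 rather than rounded to 182, and the computed offsets can differ slightly from figures worked out against 182.

```sql
-- Cut-down, hypothetical version of @work: only the columns needed here.
DECLARE @work TABLE ([time] int NOT NULL, grp int NOT NULL);

INSERT INTO @work ([time], grp) VALUES
    (3, 1), (46, 1), (73, 1), (100, 1),
    (6, 2), (52, 2), (76, 2),
    (11, 3), (66, 3), (86, 3),
    (45, 0), (71, 0), (92, 0);

-- One row per group: total work and distance from the (truncated) average.
SELECT grp,
       SUM([time])                                AS total,
       SUM([time]) - SUM(SUM([time])) OVER () / 4 AS _grpoffset
FROM @work
GROUP BY grp
ORDER BY grp;
```

With this data the totals for groups 0 to 3 should come out as 208, 222, 134 and 163, and the offsets against 181 as 27, 41, -47 and -18.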

As you can see now, in the best of worlds, we should move about 40 points' worth of rows from group 1 to group 2 and about 20 points from group 0 to group 3.

Here's the code to identify those candidate rows:

    SELECT TOP 1 pos._row AS _pos_row, pos.grp AS _pos_grp,
                 neg._row AS _neg_row, neg.grp AS _neg_grp
    FROM cte AS pos
    INNER JOIN cte AS neg ON
        pos._grpoffset>0 AND
        neg._grpoffset<0 AND
        --- To prevent infinite recursion:
        pos.moved<4 AND
        neg.moved<4
    WHERE --- must improve positive side's offset:
          ABS(pos._grpoffset-pos.[time]+neg.[time])<=pos._grpoffset AND
          --- must improve negative side's offset:
          ABS(neg._grpoffset-neg.[time]+pos.[time])<=ABS(neg._grpoffset)
    --- Largest changes first:
    ORDER BY ABS(pos.[time]-neg.[time]) DESC;

I'm self-joining the common table expression that we created before, cte: On one side, groups with a positive _grpoffset, on the other side groups with negative ones. To further filter out which rows are supposed to match each other, the swap of the positive and negative sides' rows must improve _grpoffset, i.e. get it closer to 0.

The TOP 1 and ORDER BY selects the "best" match to swap first.
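To see how the two WHERE conditions play out, the sketch below plugs in one pair of rows from the example table: the row with time 46 in group 1 (offset 40) and the row with time 6 in group 2 (offset -48). The variable names are made up for illustration.

```sql
-- Values taken from the example table; variable names are hypothetical.
DECLARE @pos_offset int = 40,  @pos_time int = 46,
        @neg_offset int = -48, @neg_time int = 6;

SELECT @pos_offset - @pos_time + @neg_time AS new_pos_offset,
       @neg_offset - @neg_time + @pos_time AS new_neg_offset,
       CASE WHEN ABS(@pos_offset - @pos_time + @neg_time) <= @pos_offset
             AND ABS(@neg_offset - @neg_time + @pos_time) <= ABS(@neg_offset)
            THEN 'swap qualifies'
            ELSE 'swap rejected'
       END AS verdict;
```

Swapping these two rows would take group 1 from 40 to 0 and group 2 from -48 to -8, so the pair qualifies. It is not necessarily the pair that TOP 1 picks, because the ORDER BY prefers the qualifying pair with the largest difference in time.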

Now, all we need to do is add an UPDATE, and loop it until there's no more optimization to be found.

TL;DR - here's the query

Here's the complete code:

DECLARE @work TABLE (
    _row    int IDENTITY(1, 1) NOT NULL,
    [time]  int NOT NULL,
    grp     int NOT NULL,
    moved   tinyint NOT NULL,
    PRIMARY KEY CLUSTERED ([time], _row)
);

WITH cte AS (
    SELECT 0 AS n, CAST(1+100*RAND(CHECKSUM(NEWID())) AS int) AS [time]
    UNION ALL
    SELECT n+1,    CAST(1+100*RAND(CHECKSUM(NEWID())) AS int) AS [time]
    FROM cte WHERE n<100)

INSERT INTO @work ([time], grp, moved)
SELECT [time], ROW_NUMBER() OVER (ORDER BY [time])%4 AS grp, 0
FROM cte;



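--- The INSERT above sets @@ROWCOUNT, which starts the loop. Every UPDATE
--- that swaps two rows keeps it going; the loop ends when no swap is found.
WHILE (@@ROWCOUNT!=0)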
    WITH cte AS (
        SELECT *, SUM([time]) OVER (PARTITION BY grp)
                 -SUM([time]) OVER (PARTITION BY (SELECT NULL))/4 AS _grpoffset
        FROM @work)

    UPDATE w
    SET w.grp=(CASE w._row
               WHEN x._pos_row THEN x._neg_grp
               ELSE x._pos_grp END),
        w.moved=w.moved+1
    FROM @work AS w
    INNER JOIN (
        SELECT TOP 1 pos._row AS _pos_row, pos.grp AS _pos_grp,
                     neg._row AS _neg_row, neg.grp AS _neg_grp
        FROM cte AS pos
        INNER JOIN cte AS neg ON
            pos._grpoffset>0 AND
            neg._grpoffset<0 AND
            --- To prevent infinite recursion:
            pos.moved<4 AND
            neg.moved<4
        WHERE --- must improve positive side's offset:
              ABS(pos._grpoffset-pos.[time]+neg.[time])<=pos._grpoffset AND
              --- must improve negative side's offset:
              ABS(neg._grpoffset-neg.[time]+pos.[time])<=ABS(neg._grpoffset)
        --- Largest changes first:
        ORDER BY ABS(pos.[time]-neg.[time]) DESC
        ) AS x ON w._row IN (x._pos_row, x._neg_row);
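The script above does not return anything by itself. To inspect the outcome, a query along these lines can be appended to the same batch, since a table variable goes out of scope when the batch ends. This is an addition for checking the result, not part of the original answer's code.

```sql
-- Append to the same batch as the script above; @work is declared there.
SELECT grp,
       COUNT(*)    AS tasks,
       SUM([time]) AS total_time
FROM @work
GROUP BY grp
ORDER BY grp;
```

Because the test data is random, the totals will vary from run to run, but the four total_time values should end up close to each other.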

Related Solutions

Sql-server – SELECT every nth value by Date from large table

Technically, to display every n-th value you use one of the ranking functions, such as RANK, DENSE_RANK or ROW_NUMBER. Which one exactly depends on the case, but what you describe best matches ROW_NUMBER:

with cte as (
 select row_number() over (order by Date) as rn, *
 from table)
select *
from cte
where rn % 1000 = 0;

But you say that you want to do this for performance reasons, to avoid going through millions of rows. Such a query has already done the damage: it read all the rows from disk and had to sort them, so the performance price has already been paid.

A better alternative for selecting a random sample of data is to use the TABLESAMPLE clause, see Limiting Result Sets by Using TABLESAMPLE:

select * 
from table tablesample (100 rows);

TABLESAMPLE will be way more efficient as it actually avoids reading all the data, it only samples some of the pages in the table and returns all rows in the sampled pages.

But consider that if you use TABLESAMPLE, your WHERE clause is applied after the sampling, so the sample may not contain any rows for the device/sensor you're interested in. This is especially true for sensors with little data.

Sql-server – How to write windowing query which sums a column to create discrete buckets

I am not sure what kind of performance you are looking for, but if CLR or an external app is not an option, a cursor is all that is left. On my aged laptop I get through 1,000,000 rows in about 100 seconds using the following solution. The nice thing about it is that it scales linearly, so I would be looking at roughly 20 minutes to run through the entire thing. With a decent server you will be faster, but not by an order of magnitude, so it would still take several minutes to complete. If this is a one-off process, you can probably afford the slowness. If you need to run this regularly, as a report or similar, you might want to store the values in the same table and update them as new rows get added, e.g. in a trigger.

Anyway, here is the code:

IF OBJECT_ID('dbo.MyTable') IS NOT NULL DROP TABLE dbo.MyTable;

CREATE TABLE dbo.MyTable(
 Id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
 v NUMERIC(5,3) DEFAULT ABS(CHECKSUM(NEWID())%100)/100.0
);


MERGE dbo.MyTable T
USING (SELECT TOP(1000000) 1 X FROM sys.system_internals_partition_columns A,sys.system_internals_partition_columns B,sys.system_internals_partition_columns C,sys.system_internals_partition_columns D)X
ON(1=0)
WHEN NOT MATCHED THEN
INSERT DEFAULT VALUES;

--SELECT * FROM dbo.MyTable

DECLARE @st DATETIME2 = SYSUTCDATETIME();
DECLARE cur CURSOR FAST_FORWARD FOR
  SELECT Id,v FROM dbo.MyTable
  ORDER BY Id;

DECLARE @id INT;
DECLARE @v NUMERIC(5,3);
DECLARE @running_total NUMERIC(6,3) = 0;
DECLARE @bucket INT = 1;

CREATE TABLE #t(
 id INT PRIMARY KEY CLUSTERED,
 v NUMERIC(5,3),
 bucket INT,
 running_total NUMERIC(6,3)
);

OPEN cur;
WHILE(1=1)
BEGIN
  FETCH NEXT FROM cur INTO @id,@v;
  IF(@@FETCH_STATUS <> 0) BREAK;
  IF(@running_total + @v > 1)
  BEGIN
    SET @running_total = 0;
    SET @bucket += 1;
  END;
  SET @running_total += @v;
  INSERT INTO #t(id,v,bucket,running_total)
  VALUES(@id,@v,@bucket, @running_total);
END;
CLOSE cur;
DEALLOCATE cur;
SELECT DATEDIFF(SECOND,@st,SYSUTCDATETIME());
SELECT * FROM #t;

GO 
DROP TABLE #t;

It drops and recreates the table MyTable, fills it with 1000000 rows and then goes to work.

The cursor copies each row into a temp table while running the calculations. At the end the select returns the calculated results. You might be a little faster if you don't copy the data around but do an in-place update instead.

If you have the option to upgrade to SQL Server 2012, you can look at the new moving window aggregates supported by the window spool, which should give you better performance.
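As a rough illustration of what those window aggregates look like, the sketch below computes a plain running total with a ROWS frame, which is available from SQL Server 2012. The table variable @t and its four values are made up for the example. A plain running total never resets when it passes 1, so this is only a building block and not a drop-in replacement for the cursor's bucket logic.

```sql
-- Hypothetical sample data, not taken from the script above.
DECLARE @t TABLE (Id int PRIMARY KEY, v numeric(5,3) NOT NULL);

INSERT INTO @t (Id, v)
VALUES (1, 0.400), (2, 0.300), (3, 0.500), (4, 0.200);

-- Running total over all rows up to and including the current one.
SELECT Id, v,
       SUM(v) OVER (ORDER BY Id
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM @t
ORDER BY Id;
```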

On a side note, if you have an assembly installed with permission_set=safe, you can do more bad stuff to a server with standard T-SQL than with the assembly, so I would keep working on removing that barrier - You have a good use case here where CLR really would help you.
