Sql-server – How to write windowing query which sums a column to create discrete buckets

aggregatesql serversql-server-2008window functions

I have a table which includes a column of decimal values, such as this:

id value size
-- ----- ----
 1   100  .02
 2    99  .38
 3    98  .13
 4    97  .35
 5    96  .15
 6    95  .57
 7    94  .25
 8    93  .15

What I need to accomplish is a little difficult to describe, so please bear with me. What I am trying to do is create an aggregate value of the size column which increments by 1 each time the preceding rows sum up to 1, when in descending order according to value. The result would look something like this:

id value size bucket
-- ----- ---- ------
 1   100  .02      1
 2    99  .38      1
 3    98  .13      1
 4    97  .35      1
 5    96  .15      2
 6    95  .57      2
 7    94  .25      2
 8    93  .15      3

My naive first attempt was to keep a running SUM and then CEILING that value, however it doesn't handle the case where some records' size end up contributing to the total of two separate buckets. The below example might clarify this:

id value size crude_sum crude_bucket distinct_sum bucket
-- ----- ---- --------- ------------ ------------ ------
 1   100  .02       .02            1          .02      1
 2    99  .38       .40            1          .40      1
 3    98  .13       .53            1          .53      1
 4    97  .35       .88            1          .88      1
 5    96  .15      1.03            2          .15      2
 6    95  .57      1.60            2          .72      2
 7    94  .25      1.85            2          .97      2
 8    93  .15      2.00            2          .15      3

As you can see, if I were to simply use CEILING on crude_sum record #8 would be assigned to bucket 2. This is caused by the size of records #5 and #8 being split across two buckets. Instead, the ideal solution is to reset the sum each time it reaches 1, which then increments the bucket column and begins a new SUM operation starting at the size value of the current record. Because the order of the records is important to this operation, I've included the value column, which is intended to be sorted in descending order.

My initial attempts have involved making multiple passes over the data, once to perform the SUM operation, once more to CEILING that, etc. Here is an example of what I did to create the crude_sum column:

SELECT
  id,
  value,
  size,
  (SELECT TOP 1 SUM(size) FROM table t2 WHERE t2.value<=t1.value) as crude_sum
FROM
  table t1

Which was used in an UPDATE operation to insert the value into a table to work with later.

Edit:
I'd like to take another stab at explaining this, so here goes. Imagine each record is a physical item. That item has a value associated with it, and a physical size less than one. I have a series of buckets with a volume capacity of exactly 1, and I need to determine how many of these buckets I will need and which bucket each item goes in according to the value of the item, sorted from highest to lowest.

A physical item cannot exist in two places at once, so it must be in one bucket or the other. This is why I can't do a running total + CEILING solution, because that would allow records to contribute their size to two buckets.

Best Answer

I am not sure what type of performance you are looking for, but if CLR or external app is not an option, a cursor is all that is left. On my aged laptop I get through 1,000,000 rows in about 100 seconds using the following solution. The nice thing about it is that it scales linearly, so I would be looking at a little about 20 minutes to run through the entire thing. With a decent server you will be faster, but not an order of magnitude, so it would still take several minutes to complete this. If this is a one off process, you probably can afford the slowness. If you need to run this as a report or similar regularly, you might want to store the values in the same table un update them as new rows get added, e.g. in a trigger.

Anyway, here is the code:

IF OBJECT_ID('dbo.MyTable') IS NOT NULL DROP TABLE dbo.MyTable;

CREATE TABLE dbo.MyTable(
 Id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
 v NUMERIC(5,3) DEFAULT ABS(CHECKSUM(NEWID())%100)/100.0
);


MERGE dbo.MyTable T
USING (SELECT TOP(1000000) 1 X FROM sys.system_internals_partition_columns A,sys.system_internals_partition_columns B,sys.system_internals_partition_columns C,sys.system_internals_partition_columns D)X
ON(1=0)
WHEN NOT MATCHED THEN
INSERT DEFAULT VALUES;

--SELECT * FROM dbo.MyTable

DECLARE @st DATETIME2 = SYSUTCDATETIME();
DECLARE cur CURSOR FAST_FORWARD FOR
  SELECT Id,v FROM dbo.MyTable
  ORDER BY Id;

DECLARE @id INT;
DECLARE @v NUMERIC(5,3);
DECLARE @running_total NUMERIC(6,3) = 0;
DECLARE @bucket INT = 1;

CREATE TABLE #t(
 id INT PRIMARY KEY CLUSTERED,
 v NUMERIC(5,3),
 bucket INT,
 running_total NUMERIC(6,3)
);

OPEN cur;
WHILE(1=1)
BEGIN
  FETCH NEXT FROM cur INTO @id,@v;
  IF(@@FETCH_STATUS <> 0) BREAK;
  IF(@running_total + @v > 1)
  BEGIN
    SET @running_total = 0;
    SET @bucket += 1;
  END;
  SET @running_total += @v;
  INSERT INTO #t(id,v,bucket,running_total)
  VALUES(@id,@v,@bucket, @running_total);
END;
CLOSE cur;
DEALLOCATE cur;
SELECT DATEDIFF(SECOND,@st,SYSUTCDATETIME());
SELECT * FROM #t;

GO 
DROP TABLE #t;

It drops and recreates the table MyTable, fills it with 1000000 rows and then goes to work.

The cursor copies each row into a temp table while running the calculations. At the end the select returns the calculated results. You might be a little faster if you don't copy the data around but do an in-place update instead.

If you have an option to upgrade to SQL 2012 you can look at the new window-spool supported moving window aggregates, that should give you better performance.

On a side note, if you have an assembly installed with permission_set=safe, you can do more bad stuff to a server with standard T-SQL than with the assembly, so I would keep working on removing that barrier - You have a good use case here where CLR really would help you.

Related Solutions

Sql-server – How to model and enforce constraints on a categorical item contained in a collection

Categories and Buckets.

create table categories
(
  category_id int primary key
);

create table buckets
(
  bucket_id int primary key
);

Depending on how to interpret "an item may belong to at most 1 bucket" you need to do some different things with Items.

An item has to belong to one bucket.

create table items
(
  item_id int primary key,
  category_id int not null references categories(category_id),
  bucket_id int not null references buckets(bucket_id),
  unique (category_id, bucket_id)
);

If an item may belong to one bucket but does not have to you need to allow for null values in bucket_id and use a filtered index as the unique constraint.

create table items
(
  item_id int primary key,
  category_id int not null references categories(category_id),
  bucket_id int references buckets(bucket_id)
);

create unique index uq_items on items(category_id, bucket_id) where bucket_id is not null;

There is a collection of items where each item belongs to 1 category.

category_id as a foreign key in items.

an item may belong to at most 1 bucket.

bucket_id as a foreign key in items.

There is a collection of buckets which may contain at most 1 item from each category.

A unique constraint on bucket_id and category_id in items.

Postgresql – Postgres Nested WHEN Aggregate Function

After some processing this boiled down to:

While your predicate d."SettlementPointName" = 'John' is filtering a single value for "SettlementPointName" anyway, simplify to:

SELECT count(                                     d."SettlementPointPrice" < 10.5 OR NULL) AS da_00_10
     , count(d."SettlementPointPrice" >= 10.5 AND d."SettlementPointPrice" < 20.5 OR NULL) AS da_11_20
     , count(d."SettlementPointPrice" >= 20.5 AND d."SettlementPointPrice" < 30.5 OR NULL) AS da_21_30
FROM   public.da d
JOIN   public.rt_aggregate r USING ("DeliveryDate", "SettlementPointName")
WHERE  d."SettlementPointName" = 'John'
AND    d."DeliveryDate" >= '2015-02-01'
AND    d."DeliveryDate" <= '2015-02-20'
AND    r."DeliveryHour" = 14
AND    date_part('hour', d."DeliveryHour") = r."DeliveryHour";

About the counting technique:

For absolute performance, is SUM faster or COUNT?

Or better, yet, use the new aggregate filter technique in pg 9.4:

SELECT d."SettlementPointName"
     , count(*) FILTER (WHERE d."SettlementPointPrice" <  10.5) AS da_00_10
     , count(*) FILTER (WHERE d."SettlementPointPrice" >= 10.5
                        AND   d."SettlementPointPrice" <  20.5) AS da_11_20
     , count(*) FILTER (WHERE d."SettlementPointPrice" >= 20.5
                        AND   d."SettlementPointPrice" <  30.5) AS da_21_30
FROM   public.da d
JOIN   public.rt_aggregate r USING ("DeliveryDate", "SettlementPointName")
WHERE  d."DeliveryDate" >= '2015-02-01'
AND    d."DeliveryDate" <= '2015-02-20'
AND    r."DeliveryHour" = 14
AND    date_part('hour', d."DeliveryHour") = r."DeliveryHour"
GROUP  BY 1;

This time, selecting all names and returning one row per name like you asked in the comment.

Details for FILTER:

Return counts for multiple ranges in a single SELECT statement

Best Answer

Related Solutions

Sql-server – How to model and enforce constraints on a categorical item contained in a collection

Postgresql – Postgres Nested WHEN Aggregate Function

Related Question