I have a table which includes a column of decimal values, such as this:
id  value  size
--  -----  ----
1   100    .02
2   99     .38
3   98     .13
4   97     .35
5   96     .15
6   95     .57
7   94     .25
8   93     .15
What I need to accomplish is a little difficult to describe, so please bear with me. What I am trying to do is create an aggregate value of the size column which increments by 1 each time the preceding rows sum up to 1, when in descending order according to value. The result would look something like this:
id  value  size  bucket
--  -----  ----  ------
1   100    .02   1
2   99     .38   1
3   98     .13   1
4   97     .35   1
5   96     .15   2
6   95     .57   2
7   94     .25   2
8   93     .15   3
My naive first attempt was to keep a running SUM and then take the CEILING of that value; however, that doesn't handle the case where some records' sizes end up contributing to the totals of two separate buckets. The example below might clarify this:
id  value  size  crude_sum  crude_bucket  distinct_sum  bucket
--  -----  ----  ---------  ------------  ------------  ------
1   100    .02   .02        1             .02           1
2   99     .38   .40        1             .40           1
3   98     .13   .53        1             .53           1
4   97     .35   .88        1             .88           1
5   96     .15   1.03       2             .15           2
6   95     .57   1.60       2             .72           2
7   94     .25   1.85       2             .97           2
8   93     .15   2.00       2             .15           3
As you can see, if I were to simply use CEILING on crude_sum, record #8 would be assigned to bucket 2. This is caused by the size of records #5 and #8 being split across two buckets. Instead, the ideal solution is to reset the sum each time it reaches 1, which then increments the bucket column and begins a new SUM operation starting at the size value of the current record. Because the order of the records is important to this operation, I've included the value column, which is intended to be sorted in descending order.
My initial attempts have involved making multiple passes over the data: once to perform the SUM operation, once more to CEILING that, and so on. Here is an example of what I did to create the crude_sum column:
SELECT
    id,
    value,
    size,
    -- sum over this row and every row before it in value-descending order;
    -- note the comparison must be >=, not <=, to match the crude_sum shown above
    (SELECT SUM(t2.size) FROM MyTable t2 WHERE t2.value >= t1.value) AS crude_sum
FROM
    MyTable t1;
This was used in an UPDATE operation to store the values in a table to work with later.
Edit:
I'd like to take another stab at explaining this, so here goes. Imagine each record is a physical item. That item has a value associated with it, and a physical size less than one. I have a series of buckets with a volume capacity of exactly 1, and I need to determine how many of these buckets I will need and which bucket each item goes in according to the value of the item, sorted from highest to lowest.
A physical item cannot exist in two places at once, so it must be in one bucket or the other. This is why I can't do a running total + CEILING solution, because that would allow records to contribute their size to two buckets.
Best Answer
I am not sure what kind of performance you are looking for, but if CLR or an external app is not an option, a cursor is all that is left. On my aged laptop I get through 1,000,000 rows in about 100 seconds using the following solution. The nice thing about it is that it scales linearly, so I would be looking at a little over 20 minutes to run through the entire thing. With a decent server you will be faster, but not by an order of magnitude, so it would still take several minutes to complete. If this is a one-off process, you can probably afford the slowness. If you need to run this as a report or similar regularly, you might want to store the values in the same table and update them as new rows get added, e.g. in a trigger.
Anyway, here is the code:
It drops and recreates the table MyTable, fills it with 1000000 rows and then goes to work.
The cursor copies each row into a temp table while running the calculations. At the end the select returns the calculated results. You might be a little faster if you don't copy the data around but do an in-place update instead.
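A minimal sketch of that cursor pass follows; the table name MyTable and its columns come from the question, while the temp-table layout, the decimal precision, and the cursor options are my assumptions, not the exact original listing:

```sql
CREATE TABLE #result (id INT, value INT, size DECIMAL(9,2), bucket INT);

DECLARE @id INT, @value INT, @size DECIMAL(9,2);
DECLARE @running DECIMAL(9,2) = 0;
DECLARE @bucket INT = 1;

DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT id, value, size
    FROM dbo.MyTable
    ORDER BY value DESC;

OPEN cur;
FETCH NEXT FROM cur INTO @id, @value, @size;
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @running + @size > 1
    BEGIN
        -- this row would overflow the current bucket:
        -- start a new bucket and restart the sum at this row's size
        SET @bucket = @bucket + 1;
        SET @running = @size;
    END
    ELSE
        SET @running = @running + @size;

    INSERT INTO #result (id, value, size, bucket)
    VALUES (@id, @value, @size, @bucket);

    FETCH NEXT FROM cur INTO @id, @value, @size;
END
CLOSE cur;
DEALLOCATE cur;

SELECT id, value, size, bucket
FROM #result
ORDER BY value DESC;
```

The IF test is the heart of it: a bucket never accepts a row that would push its total past 1, which is exactly the reset behavior the question describes.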
If you have the option to upgrade to SQL Server 2012, you can look at the window-spool-supported moving window aggregates introduced there; they should give you better performance.
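For illustration, on 2012 the crude running total from the question can be computed in a single indexed pass with a framed window aggregate (a sketch against the assumed dbo.MyTable). Note this only replaces the correlated-subquery pass; a plain windowed SUM cannot express the reset-at-1 logic by itself:

```sql
SELECT id, value, size,
       SUM(size) OVER (ORDER BY value DESC
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS crude_sum
FROM dbo.MyTable;
```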
On a side note, if you have an assembly installed with permission_set=safe, you can do more bad stuff to a server with standard T-SQL than with that assembly, so I would keep working on removing that barrier: you have a good use case here where CLR really would help you.
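For completeness, on smaller sets the same reset logic can also be written without procedural code as a recursive CTE. This is only a sketch (again assuming MyTable(id, value, size)); one-row-per-recursion-level processing will not beat the cursor at the million-row scale:

```sql
WITH ordered AS (
    SELECT id, value, size,
           ROW_NUMBER() OVER (ORDER BY value DESC) AS rn
    FROM dbo.MyTable
), packed AS (
    -- anchor: first row opens bucket 1
    -- (CASTs keep the running-total type identical in anchor and recursive parts)
    SELECT rn, id, value, size,
           CAST(size AS DECIMAL(10,2)) AS running,
           1 AS bucket
    FROM ordered
    WHERE rn = 1
    UNION ALL
    -- recursive step: either extend the current bucket or open a new one
    SELECT o.rn, o.id, o.value, o.size,
           CAST(CASE WHEN p.running + o.size > 1
                     THEN o.size
                     ELSE p.running + o.size END AS DECIMAL(10,2)),
           CASE WHEN p.running + o.size > 1
                THEN p.bucket + 1
                ELSE p.bucket END
    FROM packed p
    JOIN ordered o ON o.rn = p.rn + 1
)
SELECT id, value, size, bucket
FROM packed
ORDER BY rn
OPTION (MAXRECURSION 0);
```

OPTION (MAXRECURSION 0) lifts the default 100-level recursion cap, since the recursion here is one level per row.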