Sql-server – Selecting Values That Differ From Section Average

aggregatesql serversql-server-2012window functions

I'm attempting to select data from a table that's already been made containing records of file size as taken every twelve hours. The table looks something like this:

surrogate_id  | partition_id | data_wanted | collection_ts
1             | 1            | 100         | 2015-09-30 17:00:00
2             | 1            | 100         | 2015-09-30 17:00:00
3             | 1            | 150         | 2015-09-30 17:00:00
4             | 2            | 120         | 2015-09-30 17:00:00
5             | 2            | 130         | 2015-09-30 17:00:00
6             | 3            | 100         | 2015-09-30 17:00:00
7             | 3            | 100         | 2015-09-30 17:00:00
8             | 3            | 100         | 2015-09-30 17:00:00
9             | 3            | 100         | 2015-09-30 17:00:00
10            | 3            | 100         | 2015-09-30 17:00:00
11            | 3            | 100         | 2015-09-30 17:00:00
1             | 1            | 100         | 2015-09-30 18:00:00
2             | 1            | 100         | 2015-09-30 18:00:00
3             | 1            | 150         | 2015-09-30 18:00:00
4             | 2            | 120         | 2015-09-30 18:00:00
5             | 2            | 130         | 2015-09-30 18:00:00
6             | 3            | 100         | 2015-09-30 18:00:00
7             | 3            | 100         | 2015-09-30 18:00:00
8             | 3            | 100         | 2015-09-30 18:00:00
9             | 3            | 100         | 2015-09-30 18:00:00
10            | 3            | 100         | 2015-09-30 18:00:00
11            | 3            | 100         | 2015-09-30 18:00:00

Primary key here would be (surrogate_id, collection_ts).

What I need to do is select partitions that do not have uniform values in the data_wanted column based on the most recent timestamp. So in the above example I'd want to pick partitions 1 and 2 out. I understand that in order to get the proper timestamp selected a window function is appropriate, which I've used to generate other reports fine. This is the CTE I'm working with that implements a window function:

;WITH cte
AS
(
    SELECT *, ROW_NUMBER() OVER(PARTITION BY surrogate_id 
    ORDER BY collection_ts DESC) RowNumber
    FROM example_table
)

I've been attempting to do things with averages but I can't seem to find a way to do it in particular that works out well. It's apparent to me at least that if a partition's given data is different from the average value of that partition in that timestamp then we need to select that partition. I just can't figure out how to individually compare each piece of data in a partition to just the average of that one partition and still allow myself to return the partition_id while also using the CTE I defined.

Best Answer

You're on your way and understand CTEs and Window Functions so here's an option:

With AvgSize As
(
SELECT
     AVG(Data_Wanted) As AveragePartitionSize
    ,[partition_id]
FROM Tble
GROUP BY [partition_id]
)

,LatestValue As
(
SELECT
      surrogate_id
     ,collection_ts
    ,ROW_NUMBER() OVER(PARTITION BY surrogate_id ORDER BY collection_ts DESC) RowNumber
FROM example_table
)

SELECT
    *
FROM BaseTable BT
JOIN AvgSize
    ON BT.[partition_id] = AvgSize.[partition_id]
JOIN LatestValue LV
    ON BT.surrogate_id = LV.surrogate_id
    AND BT.collection_ts = LV.collection_ts
    AND LV.RowNumber = 1
WHERE 
    BT.Data_Wanted > AveragePartitionSize

CTE AvgSize establishes the average for the partition ID.

CTE LatestValue is your provided CTE restricted to necessary columns to establish the latest row.

The final step is to bring them all together. The CTEs restrict the query to the latest row and provide the partition average. The only thing left to do is compare the latest row partition size to the average partition size and you're left with your result set.

You could get a little more advanced with your variance check with something like:

ABS(AveragePartitionSize - BT.Data_Wanted) > 20