PostgreSQL – Nested aggregate functions over different dimensions

Tags: aggregate, postgresql, postgresql-9.6, window-functions

I have a table where I periodically record the throughput of all network interfaces on a set of servers:

create table net (
    ts timestamptz not null,
    host text not null,
    interface text not null,
    recv_bytes bigint not null
);
create index on net (ts);
insert into net (ts, host, interface, recv_bytes) values
    ('2017-01-01 00:00:00+00', 'a', 'eth0',  500),
    ('2017-01-01 00:00:00+00', 'b', 'eth1', 2000),
    ('2017-01-01 00:00:01+00', 'b', 'eth0', 1000),  -- measurements arrive with some jitter, +1s here
    ('2017-01-01 00:00:02+00', 'c', 'eth0',  100),  -- only present in this interval
    ('2017-01-01 00:00:04+00', 'b', 'eth1', 1900),
    ('2017-01-01 00:00:05+00', 'a', 'eth0',  550),
    ('2017-01-01 00:00:05+00', 'b', 'eth0', 1200),

    ('2017-01-01 00:00:10+00', 'a', 'eth0',  600),
    ('2017-01-01 00:00:10+00', 'b', 'eth0', 1500),
    ('2017-01-01 00:00:11+00', 'b', 'eth1', 1900),
    ('2017-01-01 00:00:15+00', 'a', 'eth0',  600),
    ('2017-01-01 00:00:15+00', 'b', 'eth1', 1400),
    ('2017-01-01 00:00:16+00', 'b', 'eth0', 1400),
    ('2017-01-01 00:00:16+00', 'b', 'eth1', 1700);  -- (b,eth1) appears 3 times in this interval

I would like to find the total throughput across all interfaces for a given time range, averaged over 10-second intervals. If a (host, interface) pair is missing data for a given interval, it can simply be omitted. This is what I came up with:

select ts_interval, host, sum(recv_bytes) as recv_bytes
from (
    select
        to_timestamp(floor(extract(epoch from ts) / 10) * 10) as ts_interval,
        host,
        interface,
        avg(recv_bytes) as recv_bytes
    from net
    group by ts_interval, host, interface
) avg_net
where ts_interval >= '2017-01-01 00:00:00+00' and ts_interval < '2017-01-01 00:00:20+00'
group by ts_interval, host
order by ts_interval, host;

Result:

      ts_interval       | host |      recv_bytes
------------------------+------+-----------------------
 2017-01-01 00:00:00+00 | a    |  525.0000000000000000
 2017-01-01 00:00:00+00 | b    | 3050.0000000000000000
 2017-01-01 00:00:00+00 | c    |  100.0000000000000000
 2017-01-01 00:00:10+00 | a    |  600.0000000000000000
 2017-01-01 00:00:10+00 | b    | 3116.6666666666666667

This query seems overly verbose. The essence of what I'm trying to do is to sum over one dimension and average over the other. In pseudo-SQL:

select
    to_timestamp(floor(extract(epoch from ts) / 10) * 10) as ts_interval,
    host,
    sum(avg(recv_bytes OVER ts_interval) OVER host) as recv_bytes
from net
where ts >= '2017-01-01 00:00:00+00' and ts < '2017-01-01 00:00:20+00'
group by ts_interval, host
order by ts_interval, host;

Is there a more concise way to do this?

Best Answer

Query

You can use generate_series() to generate a time raster to join to:

SELECT g AS ts_interval, host, sum(recv_bytes) AS recv_bytes
FROM  (
   SELECT g, host, interface, avg(recv_bytes) AS recv_bytes
   FROM   generate_series(timestamptz '2017-01-01 00:00:00+00'
                        , timestamptz '2017-01-01 00:00:10+00'  -- 00:00:10, not 00:00:20: the series holds only interval lower bounds
                        , interval '10 sec') g
   JOIN   net n ON ts >= g
               AND ts <  g + interval '10 sec'
   GROUP  BY g, host, interface
   ) sub
GROUP  BY g, host
ORDER  BY g, host;

Same result. Not less verbose, but much faster if you have an index on (ts).

Then again, the crucial point is to have "sargable" predicates, which you also get without generate_series() this way:

SELECT ts_interval, host, sum(recv_bytes) as recv_bytes
FROM  (
   SELECT to_timestamp(trunc(extract(epoch from ts) / 10) * 10) AS ts_interval
        , host, interface, avg(recv_bytes) as recv_bytes
   FROM   net
   WHERE  ts >= '2017-01-01 00:00:00+00'
   AND    ts <  '2017-01-01 00:00:20+00' -- just make sure to match bounds
   GROUP  BY 1, 2, 3
   ) avg_net
GROUP  BY 1, 2
ORDER  BY 1, 2;

Minor point: With exclusively positive numbers, trunc() is equivalent to floor() and a bit faster.
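For example (a quick check, not part of the original answer):

SELECT trunc(12.7)  AS t_pos, floor(12.7)  AS f_pos   -- both 12
     , trunc(-12.7) AS t_neg, floor(-12.7) AS f_neg;  -- -12 vs. -13: they only differ for negative input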

I am using ordinal numbers, since you asked for "less verbose" code. But that's not the core of the question here ...

Core question

You can run a window function over an aggregate function in one query level (though it's not typically faster than using a subquery).

But not the other way round: you cannot run an aggregate function over a window function at the same query level. That's just not possible, because window functions are evaluated after aggregate functions (and after GROUP BY) in SQL.
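
To illustrate the permitted direction only (a sketch, not part of the original answer, assuming the same net table as above): the aggregate avg() can be nested inside the window function sum() at one query level, at the price of a DISTINCT to fold the duplicate rows per interface:

SELECT DISTINCT
       to_timestamp(trunc(extract(epoch from ts) / 10) * 10) AS ts_interval
     , host
     , sum(avg(recv_bytes)) OVER (           -- window sum() over aggregate avg()
          PARTITION BY to_timestamp(trunc(extract(epoch from ts) / 10) * 10), host
       ) AS recv_bytes
FROM   net
WHERE  ts >= '2017-01-01 00:00:00+00'
AND    ts <  '2017-01-01 00:00:20+00'
GROUP  BY 1, 2, interface                    -- avg() is computed per (interval, host, interface)
ORDER  BY 1, 2;

This should return the same rows as the queries above, but it is neither shorter nor faster than the subquery variants; it only demonstrates which nesting direction SQL allows.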