PostgreSQL – Strategies for Reducing Data to Summaries Over Time

Tags: aggregate, postgresql

I have a large data set in which many entries differ only by timestamp. In the short term it's beneficial to maintain individual entries so that we can correlate individual data points by time. But after enough time passes, that level of granularity is unnecessary, and I'm comfortable replacing those individual entries with a single summarized entry by means of a count column.

So if the table has entries like this when it's fresh:

+---------------------+------+-------+
| stamp               | data | count |
+---------------------+------+-------+
| 2016/01/01:12:54:00 | abcd | 1     |
| 2016/01/01:12:56:58 | abcd | 1     |
| 2016/01/01:13:02:36 | abcd | 1     |
+---------------------+------+-------+

I'm happy to combine all the entries of a given hour after, say, a week:

+---------------------+------+-------+
| stamp               | data | count |
+---------------------+------+-------+
| 2016/01/01:12:00:00 | abcd | 2     |
| 2016/01/01:13:00:00 | abcd | 1     |
+---------------------+------+-------+

And maybe after a month I only care about the granularity of a day:

+---------------------+------+-------+
| stamp               | data | count |
+---------------------+------+-------+
| 2016/01/01:00:00:00 | abcd | 3     |
+---------------------+------+-------+

This isn't hard to do in SQL (PostgreSQL shown here):

SELECT DISTINCT date_trunc('hour', stamp), data, sum(count)
INTO   TABLE tmpdata
FROM   data
WHERE  stamp < now() - interval '7 days'
GROUP  BY date_trunc('hour', stamp), data;

…and then delete the rows I just summarized from data and replace them with the summaries I created in tmpdata. It seems preferable to keep the data within a single table rather than maintain separate tables for different granularities.
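For completeness, the manual swap described above might look like the following sketch. It assumes tmpdata was just built by the summarizing query; the INSERT matches tmpdata's columns positionally:

```sql
-- Sketch of the manual delete-and-replace step. Ideally capture the cutoff
-- timestamp once beforehand, so rows crossing the 7-day boundary between the
-- summarizing query and this DELETE are not dropped without being summarized.
BEGIN;

DELETE FROM data
WHERE  stamp < now() - interval '7 days';   -- same cutoff as the summary query

INSERT INTO data    -- tmpdata columns match data positionally
SELECT * FROM tmpdata;

DROP TABLE tmpdata;

COMMIT;
```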

This all works, but it's awkward, and I'd like to learn a better way if there is one. So I have two questions:

  1. What is the name for this sort of summary-replacement of data over time? I would say the data is being coalesced, but COALESCE already has a distinct meaning in SQL. It would help to know the terminology used to describe this sort of thing so I can tag it and talk about it properly.

  2. What strategies / algorithms / methods of achieving this goal would you recommend, with an eye towards elegance?

Best Answer

I would simply call it "aggregation".

As for elegance, you can do this with a single SQL command without another table using a data-modifying CTE:

For (only) hourly aggregation of rows that are older than a week:

WITH del AS (
   DELETE FROM data d
   WHERE  stamp < now() - interval '7 days'   -- ">" would be wrong
   AND   (date_trunc('hour', stamp) <> stamp  -- not truncated yet.
          OR EXISTS (  -- more than one row with same data in same hour
             SELECT 1
             FROM   data
             WHERE  date_trunc('hour', stamp) = date_trunc('hour', d.stamp)
             AND    data = d.data    -- only collapse rows with equal data
             AND    ctid <> d.ctid   -- except self
             )
         )
   RETURNING *
   )
INSERT INTO data(stamp, data, count) 
SELECT date_trunc('hour', stamp), data, sum(count)
FROM   del
GROUP  BY 1,2;

This only replaces rows that need replacing, so repeated calls would not repeat writes.
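The same single-statement pattern extends to the coarser step from the question. A sketch for day granularity after a month, assuming the same table (untested variant of the query above):

```sql
WITH del AS (
   DELETE FROM data d
   WHERE  stamp < now() - interval '1 month'
   AND   (date_trunc('day', stamp) <> stamp  -- not truncated to the day yet
          OR EXISTS (  -- more than one row with same data on same day
             SELECT 1
             FROM   data
             WHERE  date_trunc('day', stamp) = date_trunc('day', d.stamp)
             AND    data = d.data   -- only collapse rows with equal data
             AND    ctid <> d.ctid  -- except self
             )
         )
   RETURNING *
   )
INSERT INTO data(stamp, data, count)
SELECT date_trunc('day', stamp), data, sum(count)
FROM   del
GROUP  BY 1,2;
```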

If you know a lower bound where all rows have been aggregated already, add another WHERE condition to make it faster:

AND stamp > $lower_bound

Obviously, you need an index on stamp or, ideally, a partial index covering only rows younger than $lower_bound, which would need to be re-created from time to time.
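Such a partial index might be created like this (the index name and cutoff literal are placeholders; a partial-index predicate must be immutable, so now() is not allowed in it, which is why the index needs periodic re-creation with a fresh cutoff):

```sql
-- Hypothetical partial index covering only rows that may still need
-- aggregation; re-create periodically with a fresh cutoff.
CREATE INDEX data_stamp_recent_idx ON data (stamp)
WHERE  stamp > timestamp '2016-01-01 00:00:00';
```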

The query uses ctid to identify rows, since no PK is defined.

You may want to run VACUUM FULL data; afterwards to reclaim table bloat, but be aware of the implications: it rewrites the whole table while holding an exclusive lock. pg_repack can achieve the same with only brief locking.

Aside: DISTINCT was redundant in your original query; GROUP BY already returns distinct groups. And make sure the cutoff reads stamp < now() - interval '7 days': with > you would be aggregating rows younger than a week instead of older.