PostgreSQL – Strategies for Reducing Data to Summaries Over Time

Tags: aggregate, postgresql

I have a large data set in which many entries differ only by timestamp. In the short term it's beneficial to maintain individual entries so that we can correlate individual data points by time. But after enough time passes, that level of granularity is unnecessary, and I'm comfortable replacing those individual entries with a single summarized entry by means of a count column.

So if the table has entries like this when it's fresh:

+---------------------+------+-------+
| stamp               | data | count |
+---------------------+------+-------+
| 2016/01/01:12:54:00 | abcd | 1     |
| 2016/01/01:12:56:58 | abcd | 1     |
| 2016/01/01:13:02:36 | abcd | 1     |
+---------------------+------+-------+

I'm happy to combine all the entries of a given hour after, say, a week:

+---------------------+------+-------+
| stamp               | data | count |
+---------------------+------+-------+
| 2016/01/01:12:00:00 | abcd | 2     |
| 2016/01/01:13:00:00 | abcd | 1     |
+---------------------+------+-------+

And maybe after a month I only care about the granularity of a day:

+---------------------+------+-------+
| stamp               | data | count |
+---------------------+------+-------+
| 2016/01/01:00:00:00 | abcd | 3     |
+---------------------+------+-------+

This isn't hard to do in SQL (PostgreSQL shown here):

SELECT DISTINCT date_trunc('hour', stamp), data, sum(count)
INTO   TABLE tmpdata
FROM   data
WHERE  stamp < now() - interval '7 days'
GROUP  BY date_trunc('hour', stamp), data;

…and then delete the rows I just summarized from data and replace them with the summaries I created in tmpdata. It seems preferable to keep the data within a single table rather than maintain separate tables for different granularities.
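For completeness, the manual swap described above might look like the following sketch. It assumes tmpdata was just built by the summarizing query; the INSERT matches tmpdata's columns positionally:

```sql
-- Sketch of the manual delete-and-replace step. Ideally capture the cutoff
-- timestamp once beforehand, so rows crossing the 7-day boundary between the
-- summarizing query and this DELETE are not dropped without being summarized.
BEGIN;

DELETE FROM data
WHERE  stamp < now() - interval '7 days';   -- same cutoff as the summary query

INSERT INTO data    -- tmpdata columns match data positionally
SELECT * FROM tmpdata;

DROP TABLE tmpdata;

COMMIT;
```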

This all works, but it's awkward, and I'd like to learn a better way if there is one. So I have two questions:

  1. What is the name for this sort of summary-replacement of data over time? I would say the data is being coalesced, but COALESCE already has a distinct meaning in SQL. It would help to know the terminology used to describe this sort of thing so I can tag it and talk about it properly.

  2. What strategies / algorithms / methods of achieving this goal would you recommend, with an eye towards elegance?

Best Answer

I would simply call it "aggregation".

As for elegance, you can do this with a single SQL command without another table using a data-modifying CTE:

For (only) hourly aggregation of rows that are older than a week:

WITH del AS (
   DELETE FROM data d
   WHERE  stamp < now() - interval '7 days'   -- ">" would be wrong
   AND   (date_trunc('hour', stamp) <> stamp  -- not truncated yet.
          OR EXISTS (  -- more than one row with same data in same hour
             SELECT 1
             FROM   data
             WHERE  date_trunc('hour', stamp) = date_trunc('hour', d.stamp)
             AND    data = d.data    -- only collapse rows with equal data
             AND    ctid <> d.ctid   -- except self
             )
         )
   RETURNING *
   )
INSERT INTO data(stamp, data, count) 
SELECT date_trunc('hour', stamp), data, sum(count)
FROM   del
GROUP  BY 1,2;

This only replaces rows that need replacing, so repeated calls would not repeat writes.
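The same single-statement pattern extends to the coarser step from the question. A sketch for day granularity after a month, assuming the same table (untested variant of the query above):

```sql
WITH del AS (
   DELETE FROM data d
   WHERE  stamp < now() - interval '1 month'
   AND   (date_trunc('day', stamp) <> stamp  -- not truncated to the day yet
          OR EXISTS (  -- more than one row with same data on same day
             SELECT 1
             FROM   data
             WHERE  date_trunc('day', stamp) = date_trunc('day', d.stamp)
             AND    data = d.data   -- only collapse rows with equal data
             AND    ctid <> d.ctid  -- except self
             )
         )
   RETURNING *
   )
INSERT INTO data(stamp, data, count)
SELECT date_trunc('day', stamp), data, sum(count)
FROM   del
GROUP  BY 1,2;
```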

If you know a lower bound where all rows have been aggregated already, add another WHERE condition to make it faster:

AND stamp > $lower_bound

Obviously, you need an index on stamp or, ideally, a partial index covering only rows younger than $lower_bound, which would need to be re-created from time to time.
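Such a partial index might be created like this (the index name and cutoff literal are placeholders; a partial-index predicate must be immutable, so now() is not allowed in it, which is why the index needs periodic re-creation with a fresh cutoff):

```sql
-- Hypothetical partial index covering only rows that may still need
-- aggregation; re-create periodically with a fresh cutoff.
CREATE INDEX data_stamp_recent_idx ON data (stamp)
WHERE  stamp > timestamp '2016-01-01 00:00:00';
```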

The query uses ctid to identify rows, since no PK is defined.

You may want to run VACUUM FULL data; afterwards to reclaim table bloat, but be aware of the implications: it rewrites the whole table while holding an exclusive lock. pg_repack can achieve the same with only brief locking.

Aside: DISTINCT was redundant in your original query; GROUP BY already returns distinct groups. And make sure the cutoff reads stamp < now() - interval '7 days': with > you would be aggregating rows younger than a week instead of older.