PostgreSQL – Creating a summary table for an averaging query

optimization, performance, postgresql, query-performance

In relation to another post of mine regarding slow performance for an averaging query on a table with ~30 million rows, I tried creating a summary table to store the averages.

First, I have two tables: activity_scores and members. Below is an example of the activity_scores table:

 id | member_id | score |         created_at         |         updated_at         
----+-----------+-------+----------------------------+----------------------------
  1 |         8 |    73 | 2016-11-04 05:32:55.564235 | 2016-11-04 05:32:55.564235
  2 |        10 |    20 | 2016-11-04 05:22:55.564235 | 2016-11-04 05:22:55.564235

The scores are inserted every 10 minutes over time. In response to ypercube, I created a summary table with the constraint unique(member_id, date) for quicker access when graphing.
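For context, here is a minimal sketch of that summary table's DDL. The column types are assumptions (the post only states the unique constraint); date is kept as text to match the TO_CHAR output in the query below.

CREATE TABLE activity_score_daily_averages (
    id            serial  PRIMARY KEY,
    member_id     integer NOT NULL REFERENCES members (id),
    date          text    NOT NULL,  -- 'YYYY/MM/DD', matching TO_CHAR below
    average_score numeric,
    UNIQUE (member_id, date)         -- the constraint mentioned above
);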

Then, I wrote the following query to store the average score for 'yesterday' per member into the summary table.

INSERT INTO activity_score_daily_averages (member_id, date, average_score)
SELECT
    activity_scores.member_id,
    TO_CHAR(activity_scores.created_at AT TIME ZONE COALESCE(members.zone, 'Etc/UTC'), 'YYYY/MM/DD') AS date,
    AVG(activity_scores.score) AS average_score
FROM activity_scores
INNER JOIN members ON members.id = activity_scores.member_id
WHERE
    NOT EXISTS (
        SELECT 1
        FROM activity_score_daily_averages
        WHERE activity_score_daily_averages.member_id = activity_scores.member_id
          -- the expression must be repeated here: a bare "date" resolves to the
          -- subquery's own column and would compare the column to itself
          AND activity_score_daily_averages.date =
              TO_CHAR(activity_scores.created_at AT TIME ZONE COALESCE(members.zone, 'Etc/UTC'), 'YYYY/MM/DD')
    )
    AND activity_scores.created_at AT TIME ZONE COALESCE(members.zone, 'Etc/UTC') > (current_date - 1) AT TIME ZONE COALESCE(members.zone, 'Etc/UTC')
    AND activity_scores.created_at AT TIME ZONE COALESCE(members.zone, 'Etc/UTC') < current_date AT TIME ZONE COALESCE(members.zone, 'Etc/UTC')
GROUP BY activity_scores.member_id, date;

Yet running the above query took ~5 minutes. I feel there is a lot of room for improvement, considering that the table I am working with isn't even considered that large by the DBMS masters out there…

Some things I was considering for optimizing this query were:

  1. Somehow using a CTE or variable(?) for COALESCE(members.zone, 'Etc/UTC')

  2. Finding a different way to handle DUPLICATE KEY exceptions for entries that have the same member_id and date (i.e., replacing WHERE NOT EXISTS(...) with something else; see the sketch after this list)
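To illustrate both ideas, here is a minimal sketch, not a tested rewrite: it computes the COALESCE(members.zone, 'Etc/UTC') conversions once in a CTE (point 1) and replaces the NOT EXISTS subquery with ON CONFLICT ... DO NOTHING against the unique(member_id, date) constraint (point 2), which requires PostgreSQL 9.5 or later.

WITH localized AS (
    -- compute the zone-adjusted values once (point 1)
    SELECT s.member_id,
           s.score,
           s.created_at AT TIME ZONE COALESCE(m.zone, 'Etc/UTC') AS local_ts,
           (current_date - 1) AT TIME ZONE COALESCE(m.zone, 'Etc/UTC') AS day_start,
           current_date AT TIME ZONE COALESCE(m.zone, 'Etc/UTC') AS day_end
    FROM activity_scores AS s
    INNER JOIN members AS m ON m.id = s.member_id
)
INSERT INTO activity_score_daily_averages (member_id, date, average_score)
SELECT member_id,
       TO_CHAR(local_ts, 'YYYY/MM/DD') AS date,
       AVG(score) AS average_score
FROM localized
WHERE local_ts > day_start
  AND local_ts < day_end
GROUP BY member_id, date
ON CONFLICT (member_id, date) DO NOTHING;  -- duplicates are skipped (point 2)

Note that before PostgreSQL 12 a CTE is an optimization fence and gets materialized in full, so check the plan with EXPLAIN before relying on this form.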

I would appreciate any constructive criticisms and/or pointers.

Best Answer

Here is a possible scheme (a SQL sketch follows the list):

  1. Add an is_processed column to the activity_scores table that flags whether the row has been summarized into the summary table
  2. Add a partial index to the activity_scores table covering only the unprocessed rows; it could include all columns
  3. Now, when you go to load yesterday's data, you are only looking through rows that you know have not been processed
  4. Identify yesterday's rows as part of the load process and insert them into a temp table, then insert the summary of those rows into the summary table
  5. Join the temp table back to the source table on id and mark the rows as processed (this should be quick, since you have the ids and aren't doing a scan)
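A minimal sketch of the five steps in SQL. The column and index names are illustrative, and the per-member time-zone handling from the question is omitted (dates are taken in server time) to keep the example short:

-- 1. flag column for summarized rows
ALTER TABLE activity_scores
    ADD COLUMN is_processed boolean NOT NULL DEFAULT false;

-- 2. partial index covering only unprocessed rows
CREATE INDEX activity_scores_unprocessed_idx
    ON activity_scores (member_id, created_at, score, id)
    WHERE NOT is_processed;

-- 3 and 4. pull yesterday's unprocessed rows into a temp table ...
CREATE TEMP TABLE yesterday_scores AS
SELECT id, member_id, score, created_at
FROM activity_scores
WHERE NOT is_processed
  AND created_at >= current_date - 1
  AND created_at <  current_date;

-- ... then summarize them into the summary table
INSERT INTO activity_score_daily_averages (member_id, date, average_score)
SELECT member_id,
       TO_CHAR(created_at, 'YYYY/MM/DD') AS date,
       AVG(score) AS average_score
FROM yesterday_scores
GROUP BY member_id, TO_CHAR(created_at, 'YYYY/MM/DD');

-- 5. join back on id and mark the rows as processed
UPDATE activity_scores AS s
SET is_processed = true
FROM yesterday_scores AS y
WHERE s.id = y.id;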