Postgresql – Aggregating statistics from child rows

aggregatepostgresql

I have a parent row and then many child rows, and I must aggregate statistics from these child rows.

For a more concrete example let's imagine I have one round of golf (parent row) and then child rows with the score of each hole. I need to aggregate statistics on these rows, like number of birdies, pars, bogies, putts and so forth. And then put these numbers in a round_detail table.

What would be the best solution to this?

I'm using postgres if it matters, but this is probably just a general sql propblem.

EDIT: As requested http://sqlfiddle.com/#!1/6549c

Best Answer

The concept you consider is known as materialized view. Since you comment:

rounds are rarely edited afterwards

.. I think this should be good idea for your case. Some RDBMS' have distinct objects for that. In Postgres it's just a matter of writing the results from a view or SRF (set returning function) to a table. You can find a basic code example for refreshing MVs in the manual here.

Refer to this related answer on SO for the more sophisticated regime I use for that purpose.

Your query to (re-)create a materialized view could look something like this:

SELECT c.name AS course
      ,r.user_id
      ,s.round_id
      ,COUNT(NULLIF(s.strokes = h.par, FALSE))     AS pars
      ,COUNT(NULLIF(s.strokes = h.par - 1, FALSE)) AS birdies
      -- more columns ..
FROM   score  s
JOIN   round  r ON r.id = s.round_id
JOIN   course c ON c.id = r.course_id
JOIN   hole   h ON h.id = s.hole_id
GROUP  BY 1,2,3

I simplified your test case and added the query: ->sqlfiddle.

Related Solutions

Postgresql – postgres fetch record from partitions without the check constraints

A few details are still not clear, but let's see, what we can do now. First, having about 100M rows and 231 partitions sounds not that good. The resulting tables will be too small, in turn their number too high - I cannot tell the threshold, but at some point the query planning migt get too expensive. I think it is quite possible that yearly partitions would be enough. Alternatively, if you really want to fetch a whole month at a time, create monthly partitions.

Now to the actual problem.

It is not quite clear to me why you have rows in the parent table. The usual way of partitioning is that the parent is empty, and every row is redirected to one of the children.

At the same time, is you have an index on posted_at of the parent table (as you have it on the children), finding rows in the parent based on the timestamp is easy.

On the other hand, while I'm not sure which column shared_parent_id refers to, you can define an index on it, too - looking rows up based on this will be easy, too.

The only thing still has to be added is tell your query to look for parents in the parent table only. Let's have a look at a possible query:

WITH child_messages AS (
    SELECT shared_parent_id, {other interesting columns}
      FROM messages
     WHERE posted_at {matches your needs}
)
SELECT *
  FROM child_messages

UNION ALL

SELECT shared_parent_id, {other interesting columns}
  FROM ONLY messages -- this way it does not go to the children
 WHERE {unclear column} IN (SELECT shared_parent_id FROM child_messages);

The WITH query may pick up rows from the parent, too - this you may or may not want, adjust the query accordingly.

Furthermore, the performance might not be ideal, in this case there is room for tweaking the query (eg. a JOIN instead of the IN(), pushing the query in the WITH clause into a (sub)query and so on).

And a final notice: varchar(255) is usually a sign of a value of unknown-before length - if you really want to constrain it, you may want to choose a meaningful limit. Otherwise, an unlimited varchar (or text) has a slight performance advantage in PostgreSQL over the limited ones. Furthermore, from your example it seems that shared_parent_id is a number (integer) - use the best fitting type.

PostgreSQL – Make Custom Aggregate Function Easier to Use

Before going into details - are you sure you are not re-inventing a wheel? This might burn down to the very popular topic of greatest-n-per-group.

Your query:

SELECT min(value) as value, valued_min(value, name) as name FROM kv;

Can be rewritten with stock Postgres as:

SELECT value, name
FROM   kv
ORDER  BY value
LIMIT  1;

Which also can use a simple btree index on (value) or (value, name) for an index or index-only scan - much faster.

I am pretty sure any other example can be solved with built-in functionality as well. To get one row per group, your query would be:

SELECT grp_col, min(value) AS value, valued_min(value, name) AS name
FROM   kv
GROUP  BY grp_col;

Replace with:

SELECT DISTINCT ON (grp_col)
       grp_col, value, name
FROM   kv
ORDER  BY grp_col, value;

Again, faster. And much more versatile. Detailed explanation:

Select first row in each GROUP BY group?

Best Answer

Related Solutions

Postgresql – postgres fetch record from partitions without the check constraints

PostgreSQL – Make Custom Aggregate Function Easier to Use

Related Question