PostgreSQL – How to Avoid Invoking Functions Twice with GROUP BY and HAVING

group byperformancepostgresqlpostgresql-9.2query-performance

I have a PostgreSQL database (9.2) with a table of parent-child relations. I have a query that looks for nodes having multiple parents.

The following query works and returns the correct results:

SELECT node,parents FROM
(
  SELECT nr.child AS node, COUNT(nr.parent) AS parents 
  FROM node_relation nr 
  GROUP BY nr.child
) AS count WHERE parents > 1;

The result set:

 node   | parents
--------+---------
 n21174 |       2
 n8635  |       2
(2 rows)

The table definition is:

            Table "public.node_relation"
   Column    |         Type          |   Modifiers
-------------+-----------------------+---------------
 child       | character varying(50) | not null
 parent      | character varying(50) | not null
Indexes:
    "node_relation_pkey" PRIMARY KEY, btree (child, parent)

I re-wrote the query to not use a sub-select:

SELECT child AS node, COUNT(parent) AS parents 
FROM node_relation 
GROUP BY child 
HAVING COUNT(parent) > 1;

The new query works, but I wonder about the COUNT function being invoked multiple times.

Update: Here is the query plan:

                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..1658.81 rows=19970 width=16)
   Filter: (count(parent) > 1)
   ->  Index Only Scan using node_relation_pkey on node_relation  (cost=0.00..1259.40 rows=19971 width=16)

I would prefer to use the parents alias but the following does not work:

SELECT child AS node, COUNT(parent) AS parents 
FROM node_relation 
GROUP BY child 
HAVING parents > 1;

ERROR:  column "parents" does not exist
LINE 1: ...parents FROM node_relation GROUP BY child HAVING parents > ...
                                                            ^

Will PostgreSQL optimize out the multiple invocations of COUNT?

If not, is there an alternative form of this query that would be more efficient?

Best Answer

Your second query (the one where you implement it with the HAVING clause) is probably faster. In your first query (with the subselect), postgres has to calculate the count values for the entire table. In your second query, it can start ignoring rows to count once it reaches a count value above 1 (although I don't 100% know if postgres is smart enough to do that - I'm fairly certain it is, though).

Since COUNT() is an aggregate function, it's going to be run the number of times it'll be run regardless of the number of rows returned. If you had a function that was NOT an aggregate function, then running your group and where/having clause in a subselect would likely be faster.

Example of what I'm referring to:

SELECT
  some_non_agg_function(a.id, a.child)
FROM join_tab1 a
GROUP BY a.id, a.child
HAVING COUNT(a.id) > 1;
-- probably not as fast as
WITH rows_to_process AS (
  SELECT DISTINCT
    id, child
  FROM join_tab1 a
  GROUP BY a.id, a.child
  HAVING COUNT(a.id) > 1
) SELECT
  some_non_agg_function(id, child)
FROM rows_to_process;

To answer your question specifically - yes, postgres will keep track of what aggregate values it has calculated and re-use them (as opposed to re-calculating them) within the HAVING clause. I believe it will also re-use them in the SELECT clause (if for some weird reason you run the exact same aggregate more than once in the SELECT)

To quote Postgres's excellent documentation (bold mine)

It is important to understand the interaction between aggregates and SQL's WHERE and HAVING clauses. The fundamental difference between WHERE and HAVING is this: WHERE selects input rows before groups and aggregates are computed (thus, it controls which rows go into the aggregate computation), whereas HAVING selects group rows after groups and aggregates are computed. Thus, the WHERE clause must not contain aggregate functions; it makes no sense to try to use an aggregate to determine which rows will be inputs to the aggregates. On the other hand, the HAVING clause always contains aggregate functions. (Strictly speaking, you are allowed to write a HAVING clause that doesn't use aggregates, but it's seldom useful. The same condition could be used more efficiently at the WHERE stage.)

This doesn't specifically say that it re-uses the calculated values .. but it implies it with saying that the HAVING clause is used after aggregates are calculated.