I have a PostgreSQL database (9.2) with a table of parent-child relations. I have a query that looks for nodes having multiple parents.
The following query works and returns the correct results:
SELECT node,parents FROM
(
SELECT nr.child AS node, COUNT(nr.parent) AS parents
FROM node_relation nr
GROUP BY nr.child
) AS count WHERE parents > 1;
The result set:
node | parents
--------+---------
n21174 | 2
n8635 | 2
(2 rows)
The table definition is:
Table "public.node_relation"
Column | Type | Modifiers
-------------+-----------------------+---------------
child | character varying(50) | not null
parent | character varying(50) | not null
Indexes:
"node_relation_pkey" PRIMARY KEY, btree (child, parent)
I re-wrote the query to not use a sub-select:
SELECT child AS node, COUNT(parent) AS parents
FROM node_relation
GROUP BY child
HAVING COUNT(parent) > 1;
The new query works, but I wonder about the COUNT function being invoked multiple times.
Update: Here is the query plan:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.00..1658.81 rows=19970 width=16)
Filter: (count(parent) > 1)
-> Index Only Scan using node_relation_pkey on node_relation (cost=0.00..1259.40 rows=19971 width=16)
I would prefer to use the parents
alias but the following does not work:
SELECT child AS node, COUNT(parent) AS parents
FROM node_relation
GROUP BY child
HAVING parents > 1;
ERROR: column "parents" does not exist
LINE 1: ...parents FROM node_relation GROUP BY child HAVING parents > ...
^
Will PostgreSQL optimize out the multiple invocations of COUNT
?
If not, is there an alternative form of this query that would be more efficient?
Best Answer
Your second query (the one where you implement it with the
HAVING
clause) is probably faster. In your first query (with the subselect), postgres has to calculate the count values for the entire table. In your second query, it can start ignoring rows to count once it reaches a count value above 1 (although I don't 100% know if postgres is smart enough to do that - I'm fairly certain it is, though).Since
COUNT()
is an aggregate function, it's going to be run the number of times it'll be run regardless of the number of rows returned. If you had a function that was NOT an aggregate function, then running your group and where/having clause in a subselect would likely be faster.Example of what I'm referring to:
To answer your question specifically - yes, postgres will keep track of what aggregate values it has calculated and re-use them (as opposed to re-calculating them) within the
HAVING
clause. I believe it will also re-use them in theSELECT
clause (if for some weird reason you run the exact same aggregate more than once in theSELECT
)To quote Postgres's excellent documentation (bold mine)
This doesn't specifically say that it re-uses the calculated values .. but it implies it with saying that the
HAVING
clause is used after aggregates are calculated.