Postgresql – For each category, find the count of foreign-key items in all child categories using a PostgreSQL Recursive CTE

countctepostgresqltree

I have a typical tree structure stored as an adjacency list in PostgreSQL 9.4:

gear_category (
   id INTEGER PRIMARY KEY,
   name TEXT,
   parent_id INTEGER   
);

As well as a list of items attached to the categories:

gear_item (
   id INTEGER PRIMARY KEY,
   name TEXT,
   category_id INTEGER REFERENCES gear_category
);

Any category can have attached gear items, not just the leaves.

For speed reasons, I want to pre-calculate some data about each category, which I'll use to generate a Materialized View.

Desired output:

speedy_materialized_view (
   gear_category_id INTEGER,
   count_direct_child_items INTEGER,
   count_recursive_child_items INTEGER
);

count_recursive_child_items is the cumulative number of GearItems attached to the current category or any child categories. There should be one row for each category, with a 0 for any counts that are 0.

In order to calculate this, we need to use a recursive CTE to traverse the tree:

WITH RECURSIVE children(id, parent_id) AS (
    --base case
    SELECT gear_category.id AS id, gear_category.parent_id AS parent_id
    FROM gear_category
    WHERE gear_category.id = 37  -- setting this to id includes current object
                             -- setting to parent_id includes only children
    --combine with recursive part
    UNION ALL 
    SELECT gear_category.id AS gear_category_id
         , gear_category.parent_id AS gear_category_parent_id
    FROM gear_category, children
    WHERE children.id = gear_category.parent_id
)
TABLE children;

It's simple to count the child gear items attached to this list of child categories:

--Subselect variant
SELECT count(gear_item.id) AS count_recursive_child_items_for_single_cat
FROM gear_item 
WHERE gear_item.category_id IN (
SELECT children.id AS children_id
FROM children);


-- JOIN variant
SELECT count(gear_item.id) AS count_recursive_child_items_for_single_cat
FROM gear_item, children
WHERE gear_item.category_id = children.id;

But if you look at the CTE, I've hardcoded the starting category ID of '37'. I can't figure out how to combine these queries to generate the count_recursive_child_items for all categories, not just a single one.

How do I combine these?

Also, currently for each category, I calculate all the child categories, which creates a lot of duplicated work, and I'm not sure how to remove that. For example, say I have Grandparent > Parent > Leaf. Currently I separately calculate child categories for Grandparent and Parent, which means I twice calculate the Parent > Leaf relationship.

And since I'm already returning count_direct_child_items for each category, it might be faster to just use those when calculating count_recursive_child_items rather than starting from scratch in my counts like I currently do.

Separately, each of these concepts makes sense to me. I just can't figure out how to combine them into a single elegant/optimized query.

Best Answer

This does the job:

CREATE MATERIALIZED VIEW speedy_materialized_view AS
WITH RECURSIVE tree AS (
   SELECT id, parent_id, ARRAY[id] AS path
   FROM   gear_category
   WHERE  parent_id IS NULL

   UNION ALL 
   SELECT c.id, c.parent_id, path || c.id
   FROM   tree t
   JOIN   gear_category c ON c.parent_id = t.id
   )
, tree_ct AS (
   SELECT t.id, t.path, COALESCE(i.item_ct, 0) AS item_ct
   FROM   tree t
   LEFT   JOIN  (
      SELECT category_id AS id, count(*) AS item_ct
      FROM   gear_item
      GROUP  BY 1
      ) i USING (id)
   )
SELECT t.id
     , t.item_ct       AS count_direct_child_items
     , sum(t1.item_ct) AS count_recursive_child_items 
FROM   tree_ct t
LEFT   JOIN tree_ct t1 ON t1.path[1:array_upper(t.path, 1)] = t.path
GROUP  BY t.id, t.item_ct;

count_recursive_child_items is counted separately for each category, so I am not convinced it's the fastest possible way for deep trees.

However, aggregate functions are not allowed in recursive CTEs.

Recursive Function

Strictly speaking, it's really iterative - but so is a "recursive" CTE.

You could build a function working with temp tables. You need to know your way around plpgsql or there is too much to explain.

CREATE OR REPLACE FUNCTION f_tree_ct()
  RETURNS TABLE (id int, count_direct_child_items int, count_recursive_child_items int) AS
$func$
DECLARE
   _lvl int;
BEGIN
   -- basic table with added path and count
   CREATE TEMP TABLE t1 AS
   WITH RECURSIVE tree AS (
      SELECT c.id, c.parent_id, '{}'::int[] AS path, 0 AS lvl
      FROM   gear_category c
      WHERE  c.parent_id IS NULL

      UNION ALL 
      SELECT c.id, c.parent_id, path || c.parent_id, lvl + 1
      FROM   tree t
      JOIN   gear_category c ON c.parent_id = t.id
      )
   , tree_ct AS (
      SELECT t.id, t.parent_id, t.path, t.lvl, COALESCE(i.item_ct, 0) AS item_ct
      FROM   tree t
      LEFT   JOIN  (
         SELECT i.category_id AS id, count(*)::int AS item_ct
         FROM   gear_item i
         GROUP  BY 1
         ) i USING (id)
      )
   TABLE tree_ct;

   -- CREATE INDEX ON t1 (lvl); -- only for very deep trees

   SELECT INTO _lvl max(lvl) FROM t1;  -- identify max lvl to start bottom up

   -- recursively aggregate each level in 2nd temp table
   CREATE TEMP TABLE t2 AS
   SELECT t1.id, t1.parent_id, t1.lvl
        , t1.item_ct
        , t1.item_ct AS sum_ct 
   FROM   t1
   WHERE  t1.lvl = _lvl;

   IF _lvl > 0 THEN
      FOR i IN REVERSE _lvl .. 1 LOOP
         INSERT INTO t2
         SELECT t1.id, t1.parent_id, t1.lvl, t1.item_ct
              , CASE WHEN t2.sum_ct IS NULL THEN t1.item_ct ELSE t1.item_ct + t2.sum_ct END
         FROM   t1
         LEFT   JOIN (
            SELECT t2.parent_id AS id, sum(t2.sum_ct) AS sum_ct
            FROM   t2
            WHERE  t2.lvl = i
            GROUP  BY 1
            ) t2 USING (id)
         WHERE  t1.lvl = i - 1;
      END LOOP;
   END IF;

   RETURN QUERY  -- only requested columns, unsorted
   SELECT t2.id, t2.item_ct, t2.sum_ct FROM t2;

   DROP TABLE t1, t2; -- to allow repeated execution in one transaction
   RETURN;
END
$func$  LANGUAGE plpgsql;

This cannot be included in a CREATE MATERIALIZED VIEW statement because of the use of temporary tables. You could just create another (temp) table with it, acting as a manually maintained "materialized view":

CREATE TABLE speedy_materialized_view AS
SELECT * FROM f_tree_ct();

Alternatively you could TRUNCATE speedy_materialized_view in the function and write to it directly. The function would RETURNS void instead or you could return some meta-information like a row count ...

SQL Fiddle.

Aside:
Column aliases in the recursive term of a CTE are just for documentation since output column names are determined by the non-recursive term only.