Postgresql – How to aggregate over a “multi source” with recursive, in postgres

postgresqlrecursive

Imagine I have a table "departments" like this:

name TEXT
parent_department TEXT (nullable)

I also have a "budgets" table like this:

department_name TEXT
budget INTEGER

I need to "traverse" this departments table from "leaf to root", following the parent department column. The end result is that I need to sum up the budgets of each department, but the twist is that the "budgets" table that I'll be using at any given time is very sparse (10s to 100s of rows), while departments is very dense (100m rows).

Normally, I would approach this by creating a "with recursive" CTE that "flattens" the department relationship into a temporary table of just "root department", "leaf department". Then I would join that table on leaf department to budgets, and group by root. The problem is that the "department" table is FAR too big now (the analogy to departments breaks down here a little bit I admit, no company has >100m departments).

Concretely, if I had departments like this (child -> parent):

C -> B
B -> A
A -> NULL
E -> D
D -> NULL
F -> G
G -> NULL

And budgets like this:

A 1
C 3
E 5

I would want to get as output:

A 4
D 5

And ideally, F and G, for example, would never be touched, because there are hundreds of millions of rows like F and G that shouldn't come up. But in my current query, they do. How can I fix this so that I only "traverse" the departments that I actually need to?

Best Answer

It took a while, but I figured it out:

WITH RECURSIVE
initial AS (
    SELECT
        department_name,
        department_name,
        FALSE
    FROM
        budgets
    GROUP BY department_name
),
tmp (child, parent, root) AS (
    SELECT
        *
    FROM
        initial
    UNION ALL
        SELECT
            tmp.child,
            CASE WHEN
                departments.parent_department IS NULL THEN tmp.parent
                ELSE departments.parent_department
            END,
            departments.parent_department IS NULL
        FROM
            tmp
            INNER JOIN departments
                ON tmp.parent = departments.name
        WHERE
            NOT tmp.root
)
SELECT
    tmp.parent,
    SUM(budget)
FROM
    tmp
    INNER JOIN budgets
        ON tmp.child = budgets.department_name
WHERE
    tmp.root
GROUP BY tmp.parent