PostgreSQL – How to use a recursive CTE to get ancestors in a hierarchy

cte postgresql recursive

create table division (
  id serial primary key,
  name varchar not null
);


-- nested sets table

create table location (
  id serial primary key,
  name varchar not null,
  division_id integer not null references division(id),
  tree_id integer not null,
  parent_id integer references location(id),
  lft integer not null,
  rgt integer not null,
  level integer not null
);


create table report (
  id serial primary key,
  somevar_1 integer not null,
  somevar_2 integer not null,
  tyme timestamptz not null,
  -- other stuff
  location_id integer not null references location(id)
);


-- division hierarchy is Division0 > Division1 > Division2 > Division3
-- reports are generated at Division3, and aggregated at
-- either Division2 or Division1 (usually the latter)

-- this means that to get my data for Division1, i have to
-- do two self-joins on the location table
-- plus one for the report table

Given the above schema, I'm experimenting with different ways to speed up a query that sums each somevar, aggregated at either Division2 or Division1. Right now, what I do is:

select
  sum(somevar_1) as somevar_1, sum(somevar_2) as somevar_2,
  div1.name, div1.id
from
  report join location as div3 on report.location_id = div3.id
join
  location as div2 on div3.parent_id = div2.id
join
  location as div1 on div2.parent_id = div1.id
where
  -- for example, all reports this year
  date_trunc('year', tyme) = '2019-01-01'
group by div1.id, div1.name

The above query runs in about 4.8s locally, with a bit shy of 55k reports for 2019. This is just the base data, without any further processing. The processing step does something similar, and it all runs in a web application, so the user waits a not-insignificant amount of time.

Now for my question: can I use recursive CTEs to speed this up? I haven't yet grokked recursive CTEs, so writing one is still a bit beyond me at this point.

Best Answer

I would stick with one model. If you store both nested set information and parent information, you have to maintain both, and you may end up with inconsistent information about your tree. If your main focus is on reporting both per div1 and per (div1, div2), you may consider GROUPING SETS. They are a UNION ALL of several GROUP BY queries in disguise:

select sum(somevar_1) as somevar_1
     , sum(somevar_2) as somevar_2
     , div1.name as d1name
     , div1.id as d1id
     , COALESCE(div2.name, 'TOTAL') as d2name
     , div2.id as d2id 
from report 
join location as div3 
    on report.location_id = div3.id
join location as div2 
    on div3.parent_id = div2.id
join location as div1 
    on div2.parent_id = div1.id
-- for example, all reports this year
where date_trunc('year', tyme) = '2019-01-01'
group by 
    GROUPING SETS((div1.id, div1.name, div2.id, div2.name)
                 ,(div1.id, div1.name))

For the second grouping set (div1.id, div1.name), you get NULL in place of div2.id and div2.name. That's why I used COALESCE to map the name to TOTAL.
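As a variation on the COALESCE approach, the subtotal rows can be marked with the GROUPING() function, which reports whether a column is part of the current grouping set. This is only a sketch against the schema from the question; it matters mainly when the grouped column could itself contain NULL, which is not the case here since name is declared not null.

```sql
-- Sketch: same joins and filter as above.
-- grouping(div2.id) is 1 on rows where div2.id is NOT part of the
-- current grouping set, i.e. on the per-div1 subtotal rows.
select sum(somevar_1) as somevar_1
     , sum(somevar_2) as somevar_2
     , div1.name as d1name
     , div1.id as d1id
     , case when grouping(div2.id) = 1 then 'TOTAL' else div2.name end as d2name
     , div2.id as d2id
from report
join location as div3
    on report.location_id = div3.id
join location as div2
    on div3.parent_id = div2.id
join location as div1
    on div2.parent_id = div1.id
where date_trunc('year', tyme) = '2019-01-01'
group by
    grouping sets ((div1.id, div1.name, div2.id, div2.name)
                  ,(div1.id, div1.name));
```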

You may also add a level of a GRAND TOTAL like:

    GROUPING SETS((div1.id, div1.name, div2.id, div2.name)
                 ,(div1.id, div1.name)
                 ,())

Grouping sets are the most general construct, but in certain situations (not yours) you can use GROUP BY ROLLUP and GROUP BY CUBE, which are a bit shorter. See:

Grouping sets, rollup, and cube
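For illustration, here is how ROLLUP relates to the grouping sets above. ROLLUP always includes the empty grouping set, so it cannot express the two-set query, but with parenthesized sublists it should be equivalent to the three-set variant that includes the grand total. A sketch against the schema from the question:

```sql
-- rollup ((a, b), (c, d)) expands to
-- grouping sets ((a, b, c, d), (a, b), ())
select sum(somevar_1) as somevar_1
     , sum(somevar_2) as somevar_2
     , div1.name as d1name
     , div1.id as d1id
     , div2.name as d2name
     , div2.id as d2id
from report
join location as div3
    on report.location_id = div3.id
join location as div2
    on div3.parent_id = div2.id
join location as div1
    on div2.parent_id = div1.id
where date_trunc('year', tyme) = '2019-01-01'
group by
    rollup ((div1.id, div1.name), (div2.id, div2.name));
```

On the grand-total row, the div1 columns are NULL as well as the div2 columns.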

If I understand correctly, you always traverse the div1, div2 and div3 levels in your queries, and I therefore strongly doubt that a recursive CTE will perform better than three static joins. The power of a recursive CTE is that it can traverse a variable number of levels. For a fixed number of levels, there is no performance gain.
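For reference, this is roughly what a recursive CTE that collects the ancestors of one location could look like. It is a minimal sketch using the parent_id column from the question's schema; the starting id 42 is a placeholder, not a value from the question.

```sql
with recursive ancestors as (
    -- anchor member: the starting location (42 is a placeholder id)
    select id, name, parent_id, level
    from location
    where id = 42

    union all

    -- recursive member: step from each row found so far to its parent;
    -- the recursion stops at the root, where parent_id is null
    select p.id, p.name, p.parent_id, p.level
    from location as p
    join ancestors as a
        on p.id = a.parent_id
)
select id, name, level
from ancestors
order by level;
```

The result includes the starting row itself. The number of iterations depends on the depth of the tree, which is the case where a recursive CTE is useful; with exactly three known levels it does the same work as the static joins.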

Related Solutions

PostgreSQL – Postgres multiple joins slow query, how to store default child record

You write:

Each customer can have multiple sites, but only one should be displayed in this list.

Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.

Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?

From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?

Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.

If this query is your bottleneck, I would suggest you implement a materialized view.

After you provided more information I find that my diagnosis pretty much holds.

The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows; that would solve most of it, even if it means executing the whole query every time.

If you go the route of a materialized view as I proposed, you could add a gapless serial number to the table, plus an index on it, by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.
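A minimal sketch of such a materialized view follows. The table customer, its columns, and the sort order are assumptions made for illustration, since the original query is not shown here; only the row_number() column mv_id comes from the text above.

```sql
-- Hypothetical source table and columns: customer(id, name).
create materialized view materialized_view as
select row_number() over (order by c.name, c.id) as mv_id  -- gapless 1..n
     , c.id   as customer_id
     , c.name as customer_name
from customer as c;

-- Index for fast range lookups on the serial number.
create unique index on materialized_view (mv_id);

-- Re-run whenever the underlying data has changed;
-- mv_id is recomputed from scratch on each refresh.
refresh materialized view materialized_view;
```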

Then you can query:

SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;

This will perform very fast. LIMIT / OFFSET cannot compete: it needs to compute the whole table before it can sort and pick 100 rows.

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server, you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:

Total runtime: 0.269 ms
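As a short example, the statement can be run directly from any SQL client. The table materialized_view is assumed to exist as in the query further up. Note that EXPLAIN ANALYZE actually executes the statement, and that PostgreSQL 9.4 and later label the last line "Execution Time" instead of "Total runtime".

```sql
-- Runs the query on the server and reports the server-side timing only,
-- without the time needed to transfer and render the result rows.
explain analyze
select *
from   materialized_view
where  mv_id >= 2700
and    mv_id <  2800;
```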

PostgreSQL – Retrieve additional columns in recursive CTE

Option 1

Include additional columns (just content in the example) in every iteration of the rCTE right away. You have to GROUP BY it (them) in the outer SELECT additionally.
Obviously, you don't want to count parents, so join the first level of children right away:

WITH RECURSIVE tree AS (
   SELECT p.id AS root_id, c.id, p.content
   FROM   comment p
   LEFT   JOIN comment c ON c.parent_id = p.id
   WHERE  p.parent_id IS NULL

   UNION ALL
   SELECT p.root_id, c.id, p.content
   FROM   tree p
   JOIN   comment c ON c.parent_id = p.id
)
SELECT root_id, content, count(id) AS comment_count
FROM   tree
GROUP  BY 1, 2
ORDER  BY 1;

To preserve parents without any children use LEFT JOIN in the base SELECT of the rCTE. Consequently, use count(id) in the outer SELECT to ignore NULL values produced by this.
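To try the query, a small data set is enough. The table definition below is an assumption, since the original one is not shown; only the columns id, parent_id and content are taken from the query.

```sql
-- Hypothetical table definition and sample rows.
create table comment (
  id        integer primary key,
  parent_id integer references comment(id),
  content   text not null
);

insert into comment (id, parent_id, content) values
  (1, null, 'A'),    -- root with descendants
  (2, 1,    'A1'),   -- child of 1
  (3, 2,    'A1a'),  -- grandchild of 1
  (4, null, 'B');    -- root without children
```

With this data, the query above returns root 1 with a comment_count of 2 and root 4 with a comment_count of 0. The latter row only survives because of the LEFT JOIN, and it counts as 0 because count(id) ignores the NULL id.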

Option 2

For very deep trees or big additional columns it may be cheaper to just retrieve IDs in the rCTE and join to the table comment once more in the outer SELECT to retrieve more columns:

WITH RECURSIVE tree AS (
   SELECT p.id AS root_id, c.id
   FROM   comment p
   LEFT   JOIN comment c ON c.parent_id = p.id
   WHERE  p.parent_id IS NULL

   UNION ALL
   SELECT p.root_id, c.id
   FROM   tree p
   JOIN   comment c ON c.parent_id = p.id
   )
SELECT p.*, c.content  -- add more columns?
FROM  (
   SELECT root_id, count(id) AS comment_count
   FROM   tree
   GROUP  BY 1  -- cheaper ...
   ) p
JOIN   comment c ON c.id = p.root_id  -- ... but additional join
ORDER  BY p.root_id;

Your fiddle had two errors. Consider this fixed SQL Fiddle.