Postgresql – Possible to use first CTE as filter for second CTE

cteoptimizationperformancepostgresqlpostgresql-performance

I'm trying to optimize a query on a large database that essentially fetches 50 images for a site and the URL's for the pages they appear on. The current query works okay for smaller sites but larger ones, such as the one I'm testing with that has 50,000 images, the query effectively times out.

I've narrowed down the slowness to the second CTE (y) because it's fetching all matching records in that table that consists of over 150M rows. What I'd like to do is use the first CTE (x) as an input to the second so that only records tied to resources fetched in x are fetched in y, since x is where the filtering occurs and results are limited to 50 per query. Is this possible?

WITH x AS 
(
    SELECT vi.resourceid
    FROM vw_image vi
    WHERE vi.siteid = 2294
    AND vi.childtype = 'i'
    ORDER BY vi.id
    OFFSET 1 ROWS
    FETCH NEXT 50 ROWS ONLY
),
y AS (
    SELECT rop.resourceid as resourceid, r.url as url
    FROM resourceonpage rop
    INNER JOIN resource r ON rop.pageid = r.id
    WHERE rop.siteid = 2294
)
SELECT i.*, string_agg(y.url, ',') as urllist
FROM x 
INNER JOIN vw_image i ON x.resourceid = i.resourceid
LEFT JOIN y ON y.resourceid = x.resourceid;

Best Answer

Looking into my crystal ball, your query might work faster by orders of magnitude like this:

SELECT i.*, rr.urllist
FROM   vw_image i
LEFT   JOIN LATERAL (
   SELECT string_agg(r.url, ',') AS urllist
   FROM   resourceonpage rop
   JOIN   resource r ON r.id = rop.pageid
   WHERE  rop.siteid   = 2294  -- or: = i.siteid
   AND    r.resourceid = i.resourceid
   ) rr ON true
WHERE  i.siteid = 2294
AND    i.childtype = 'i'
ORDER  BY i.id
OFFSET 1 ROWS
FETCH  NEXT 50 ROWS ONLY;  -- or just: OFFSET 1 LIMIT 50

You don't need CTEs here, those will only slow you down.

I use a LATERAL join to use indexes on resourceonpage and resource and aggregate in the subquery right away. About LEFT JOIN LATERAL ... ON true:

Looking even deeper into my crystal ball, the best indexes might be:

CREATE INDEX ON vw_image (siteid, id) WHERE childtype = 'i';
CREATE INDEX ON resourceonpage (siteid, pageid);
CREATE INDEX ON resource (resourceid, id, url);

Appending url is only useful if you get index-only scans out of it.

An alternative approach would be to use parentheses in the FROM list like demonstrated here:

But, according to my crystal ball (and an educated guess) the LATERAL version will be faster. There is not enough information in the question to give you more than educated guesses.