PostgreSQL – Why does this LEFT JOIN perform so much worse than LEFT JOIN LATERAL

execution-plan; join; performance; postgresql; postgresql-10; postgresql-performance

I have the following tables (taken from the Sakila database):

  • film: film_id is pkey
  • actor: actor_id is pkey
  • film_actor: film_id and actor_id are fkeys to film/actor

I am selecting a particular film, and for that film I also want all participating actors. I have two queries for this: one with a LEFT JOIN and one with a LEFT JOIN LATERAL.

select film.film_id, film.title, a.actors
from   film
left join
  (         
       select     film_actor.film_id, array_agg(first_name) as actors
       from       actor
       inner join film_actor using(actor_id)
       group by   film_actor.film_id
  ) as a
on       a.film_id = film.film_id
where    film.title = 'ACADEMY DINOSAUR'
order by film.title;

select film.film_id, film.title, a.actors
from   film
left join lateral
  (
       select     array_agg(first_name) as actors
       from       actor
       inner join film_actor using(actor_id)
       where      film_actor.film_id = film.film_id
  ) as a
on       true
where    film.title = 'ACADEMY DINOSAUR'
order by film.title;

Comparing the query plans, the first query performs much worse (20x) than the second:

 Merge Left Join  (cost=507.20..573.11 rows=1 width=51) (actual time=15.087..15.089 rows=1 loops=1)
   Merge Cond: (film.film_id = film_actor.film_id)
   ->  Sort  (cost=8.30..8.31 rows=1 width=19) (actual time=0.075..0.075 rows=1 loops=1)
     Sort Key: film.film_id
     Sort Method: quicksort  Memory: 25kB
     ->  Index Scan using idx_title on film  (cost=0.28..8.29 rows=1 width=19) (actual time=0.044..0.058 rows=1 loops=1)
           Index Cond: ((title)::text = 'ACADEMY DINOSAUR'::text)
   ->  GroupAggregate  (cost=498.90..552.33 rows=997 width=34) (actual time=15.004..15.004 rows=1 loops=1)
     Group Key: film_actor.film_id
     ->  Sort  (cost=498.90..512.55 rows=5462 width=8) (actual time=14.934..14.937 rows=11 loops=1)
           Sort Key: film_actor.film_id
           Sort Method: quicksort  Memory: 449kB
           ->  Hash Join  (cost=6.50..159.84 rows=5462 width=8) (actual time=0.355..8.359 rows=5462 loops=1)
             Hash Cond: (film_actor.actor_id = actor.actor_id)
             ->  Seq Scan on film_actor  (cost=0.00..84.62 rows=5462 width=4) (actual time=0.035..2.205 rows=5462 loops=1)
             ->  Hash  (cost=4.00..4.00 rows=200 width=10) (actual time=0.303..0.303 rows=200 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 17kB
               ->  Seq Scan on actor  (cost=0.00..4.00 rows=200 width=10) (actual time=0.027..0.143 rows=200 loops=1)
 Planning time: 1.495 ms
 Execution time: 15.426 ms

 Nested Loop Left Join  (cost=25.11..33.16 rows=1 width=51) (actual time=0.849..0.854 rows=1 loops=1)
   ->  Index Scan using idx_title on film  (cost=0.28..8.29 rows=1 width=19) (actual time=0.045..0.048 rows=1 loops=1)
     Index Cond: ((title)::text = 'ACADEMY DINOSAUR'::text)
   ->  Aggregate  (cost=24.84..24.85 rows=1 width=32) (actual time=0.797..0.797 rows=1 loops=1)
     ->  Hash Join  (cost=10.82..24.82 rows=5 width=6) (actual time=0.672..0.764 rows=10 loops=1)
           Hash Cond: (film_actor.actor_id = actor.actor_id)
           ->  Bitmap Heap Scan on film_actor  (cost=4.32..18.26 rows=5 width=2) (actual time=0.072..0.150 rows=10 loops=1)
             Recheck Cond: (film_id = film.film_id)
             Heap Blocks: exact=10
             ->  Bitmap Index Scan on idx_fk_film_id  (cost=0.00..4.32 rows=5 width=0) (actual time=0.041..0.041 rows=10 loops=1)
               Index Cond: (film_id = film.film_id)
           ->  Hash  (cost=4.00..4.00 rows=200 width=10) (actual time=0.561..0.561 rows=200 loops=1)
             Buckets: 1024  Batches: 1  Memory Usage: 17kB
             ->  Seq Scan on actor  (cost=0.00..4.00 rows=200 width=10) (actual time=0.039..0.275 rows=200 loops=1)
 Planning time: 1.722 ms
 Execution time: 1.087 ms

Why is this? I want to learn to reason about this, so I can understand what is going on and can predict how the query will behave when data size increases and which decisions the planner will make under certain conditions.

My thoughts: in the first LEFT JOIN query, it looks like the subquery is executed for all films in the database, ignoring that the outer query is only interested in one particular film. Why is the planner not able to push that knowledge down into the subquery?

In the LEFT JOIN LATERAL query, we are more or less 'pushing' that filtering downwards. So the issue we had in the first query is not present here, hence the better performance.

I guess I am mainly looking for rules of thumb and general wisdom, so this planner magic becomes second nature – if that makes sense.

update (1)

Rewriting the LEFT JOIN as follows also gives better performance (slightly better than the LEFT JOIN LATERAL):

select film.film_id, film.title, array_agg(a.first_name) as actors
from   film
left join
  (         
       select     film_actor.film_id, actor.first_name
       from       actor
       inner join film_actor using(actor_id)
  ) as a
on       a.film_id = film.film_id
where    film.title = 'ACADEMY DINOSAUR'
group by film.film_id
order by film.title;

 GroupAggregate  (cost=29.44..29.49 rows=1 width=51) (actual time=0.470..0.471 rows=1 loops=1)
   Group Key: film.film_id
   ->  Sort  (cost=29.44..29.45 rows=5 width=25) (actual time=0.428..0.430 rows=10 loops=1)
     Sort Key: film.film_id
     Sort Method: quicksort  Memory: 25kB
     ->  Nested Loop Left Join  (cost=4.74..29.38 rows=5 width=25) (actual time=0.149..0.386 rows=10 loops=1)
           ->  Index Scan using idx_title on film  (cost=0.28..8.29 rows=1 width=19) (actual time=0.056..0.057 rows=1 loops=1)
             Index Cond: ((title)::text = 'ACADEMY DINOSAUR'::text)
           ->  Nested Loop  (cost=4.47..19.09 rows=200 width=8) (actual time=0.087..0.316 rows=10 loops=1)
             ->  Bitmap Heap Scan on film_actor  (cost=4.32..18.26 rows=5 width=4) (actual time=0.052..0.089 rows=10 loops=1)
               Recheck Cond: (film_id = film.film_id)
               Heap Blocks: exact=10
               ->  Bitmap Index Scan on idx_fk_film_id  (cost=0.00..4.32 rows=5 width=0) (actual time=0.035..0.035 rows=10 loops=1)
                 Index Cond: (film_id = film.film_id)
             ->  Index Scan using actor_pkey on actor  (cost=0.14..0.17 rows=1 width=10) (actual time=0.011..0.011 rows=1 loops=10)
               Index Cond: (actor_id = film_actor.actor_id)
 Planning time: 1.833 ms
 Execution time: 0.706 ms

How can we reason about this?

update (2)

I continued with some experiments and I think an interesting rule of thumb is: apply the aggregate function as high/late as possible. The query in update (1) probably performs better because we are now aggregating in the outer query rather than in the inner one.

The same seems to apply if we rewrite the LEFT JOIN LATERAL above as follows:

select film.film_id, film.title, array_agg(a.first_name) as actors
from   film
left join lateral
  (
       select     actor.first_name
       from       actor
       inner join film_actor using(actor_id)
       where      film_actor.film_id = film.film_id
  ) as a
on       true
where    film.title = 'ACADEMY DINOSAUR'
group by film.film_id
order by film.title;

 GroupAggregate  (cost=29.44..29.49 rows=1 width=51) (actual time=0.088..0.088 rows=1 loops=1)
   Group Key: film.film_id
   ->  Sort  (cost=29.44..29.45 rows=5 width=25) (actual time=0.076..0.077 rows=10 loops=1)
     Sort Key: film.film_id
     Sort Method: quicksort  Memory: 25kB
     ->  Nested Loop Left Join  (cost=4.74..29.38 rows=5 width=25) (actual time=0.031..0.066 rows=10 loops=1)
           ->  Index Scan using idx_title on film  (cost=0.28..8.29 rows=1 width=19) (actual time=0.010..0.010 rows=1 loops=1)
             Index Cond: ((title)::text = 'ACADEMY DINOSAUR'::text)
           ->  Nested Loop  (cost=4.47..19.09 rows=200 width=8) (actual time=0.019..0.052 rows=10 loops=1)
             ->  Bitmap Heap Scan on film_actor  (cost=4.32..18.26 rows=5 width=4) (actual time=0.013..0.024 rows=10 loops=1)
               Recheck Cond: (film_id = film.film_id)
               Heap Blocks: exact=10
               ->  Bitmap Index Scan on idx_fk_film_id  (cost=0.00..4.32 rows=5 width=0) (actual time=0.007..0.007 rows=10 loops=1)
                 Index Cond: (film_id = film.film_id)
             ->  Index Scan using actor_pkey on actor  (cost=0.14..0.17 rows=1 width=10) (actual time=0.002..0.002 rows=1 loops=10)
               Index Cond: (actor_id = film_actor.actor_id)
 Planning time: 0.440 ms
 Execution time: 0.136 ms

Here, we moved array_agg() upwards. As you can see, this plan is also better than the original LEFT JOIN LATERAL.

That said, I am not sure whether this self-invented rule of thumb (apply the aggregate function as high/late as possible) holds in other cases.

additional information

Fiddle: https://dbfiddle.uk/?rdbms=postgres_10&fiddle=4ec4f2fffd969d9e4b949bb2ca765ffb

Version: PostgreSQL 10.4 on x86_64-pc-linux-musl, compiled by gcc (Alpine 6.4.0) 6.4.0, 64-bit

Environment: Docker: docker run -e POSTGRES_PASSWORD=sakila -p 5432:5432 -d frantiseks/postgres-sakila. Please note that the image on Docker Hub is outdated, so I did a local build first: docker build -t frantiseks/postgres-sakila after cloning the git repository.

Table definitions:

film

 film_id              | integer                     | not null default nextval('film_film_id_seq'::regclass)
 title                | character varying(255)      | not null

 Indexes:
    "film_pkey" PRIMARY KEY, btree (film_id)
    "idx_title" btree (title)

 Referenced by:
    TABLE "film_actor" CONSTRAINT "film_actor_film_id_fkey" FOREIGN KEY (film_id) REFERENCES film(film_id) ON UPDATE CASCADE ON DELETE RESTRICT

actor

 actor_id    | integer                     | not null default nextval('actor_actor_id_seq'::regclass)
 first_name  | character varying(45)       | not null

 Indexes:
    "actor_pkey" PRIMARY KEY, btree (actor_id)

 Referenced by:
    TABLE "film_actor" CONSTRAINT "film_actor_actor_id_fkey" FOREIGN KEY (actor_id) REFERENCES actor(actor_id) ON UPDATE CASCADE ON DELETE RESTRICT

film_actor

 actor_id    | smallint                    | not null
 film_id     | smallint                    | not null

 Indexes:
    "film_actor_pkey" PRIMARY KEY, btree (actor_id, film_id)
    "idx_fk_film_id" btree (film_id)
 Foreign-key constraints:
    "film_actor_actor_id_fkey" FOREIGN KEY (actor_id) REFERENCES actor(actor_id) ON UPDATE CASCADE ON DELETE RESTRICT
    "film_actor_film_id_fkey" FOREIGN KEY (film_id) REFERENCES film(film_id) ON UPDATE CASCADE ON DELETE RESTRICT

Data: this is from the Sakila sample database. This question is not a real-life case; I am using this database mostly as a learning sample. I was introduced to SQL some months ago and am trying to expand my knowledge. The data has the following distributions:

select count(*) from film: 1000
select count(*) from actor: 200
select avg(a) from (select film_id, count(actor_id) a from film_actor group by film_id) a: 5.47

Best Answer

Test setup

Your original setup in the fiddle leaves room for improvement. I kept asking for your setup for a reason.

  • You have these indexes on film_actor:

    "film_actor_pkey" PRIMARY KEY, btree (actor_id, film_id)  
    "idx_fk_film_id" btree (film_id)
    

    Which is pretty helpful already. But to best support your particular query, you would have a multicolumn index on (film_id, actor_id), columns in this order. A practical solution: replace idx_fk_film_id with an index on (film_id, actor_id) - or create the PK on (film_id, actor_id) for the purpose of this test, like I do below; see the sketch after this list.

    In a read-only database (or mostly read-only, or generally when VACUUM can keep up with write activity) it also helps to have an index on (title, film_id) to allow index-only scans. My test case is now highly optimized for read performance.

  • Type mismatch between film.film_id (integer) and film_actor.film_id (smallint). While that works, it makes queries slower and can lead to various complications. It also makes FK constraints more expensive. Never do this if it can be avoided. If you are not sure, pick integer over smallint. While smallint can save 2 bytes per field (often consumed by alignment padding), there are more complications than with integer.

  • To optimize the performance of the test itself, create indexes and constraints after bulk-inserting lots of rows. It is substantially slower to add tuples incrementally to existing indexes than to create them from scratch with all rows present.
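
A minimal sketch of these setup changes, assuming the Sakila object names shown above (the new index names here are my own invention):

ALTER TABLE film_actor
  ALTER COLUMN film_id  TYPE integer
, ALTER COLUMN actor_id TYPE integer;  -- align with the referenced integer PKs

DROP INDEX idx_fk_film_id;             -- superseded by the multicolumn index below
CREATE INDEX idx_film_actor_film_id_actor_id ON film_actor (film_id, actor_id);

CREATE INDEX idx_film_title_film_id ON film (title, film_id);  -- allows index-only scans

In a fresh test, do the bulk load first and create these indexes afterwards, per the last point above.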

Unrelated to this test:

  • Free-standing sequences plus column defaults instead of much simpler and more reliable serial (or IDENTITY) columns. Don't.

  • timestamp without time zone is typically unreliable for a column like last_update. Use timestamptz instead. And note that column defaults do not cover the "last update", strictly speaking.

  • The length modifier in character varying(255) indicates that the test case is not intended for Postgres to begin with because the odd length is pretty pointless here. (Or the author is clueless.)

Consider the audited test case in the fiddle:

db<>fiddle here - building on your fiddle, optimized and with added queries.

A test setup with 1000 films and 200 actors has limited validity. The most efficient queries take < 0.2 ms. Planning time is more than execution time. A test with 100k or more rows would be more revealing, as sketched below.
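
For instance, a rough sketch of inflating the data set with generate_series() - the numbers are arbitrary, and this assumes the pared-down two-column film table shown above (the full Sakila film table has more NOT NULL columns that would need values, too):

INSERT INTO film (title)
SELECT 'FILM ' || g
FROM   generate_series(1, 100000) g;          -- 100k synthetic films

INSERT INTO film_actor (film_id, actor_id)
SELECT f.film_id, 1 + (f.film_id + s) % 200   -- 5 distinct actors per film; assumes actor_id 1..200 exist
FROM   film f, generate_series(0, 4) s
ON CONFLICT DO NOTHING;                       -- skip (actor_id, film_id) pairs that already exist

ANALYZE film;
ANALYZE film_actor;                           -- refresh statistics so the planner sees the new distributions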

Why retrieve only the first names of actors? Once you retrieve multiple columns, you already have a slightly different situation.

ORDER BY title makes no sense while filtering for a single title with WHERE title = 'ACADEMY DINOSAUR'. Maybe ORDER BY film_id?

And for total runtime, rather use EXPLAIN (ANALYZE, TIMING OFF) to reduce (potentially misleading) noise from sub-timing overhead.
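
For example (any query below can be prefixed the same way):

EXPLAIN (ANALYZE, TIMING OFF)
SELECT f.film_id, f.title
FROM   film f
WHERE  f.title = 'ACADEMY DINOSAUR';

This still reports total planning and execution time, but skips the per-node timings that can skew measurements of very fast queries.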

Answer

It's hard to form a simple rule of thumb, because total performance depends on many factors. Very basic guidelines:

  • Aggregating all rows in sub-tables carries less overhead but only pays when you actually need all rows (or a very large part).

  • For selecting few rows (your test!), different query techniques yield better results. That's where LATERAL comes in. It carries more overhead but only reads required rows from sub-tables. A big win if only a (very) small fraction is needed.

For your particular test case, I would also test an ARRAY constructor in the LATERAL subquery:

SELECT f.film_id, f.title, a.actors
FROM   film f
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT a.first_name
      FROM   film_actor fa
      JOIN   actor a USING (actor_id)
      WHERE  fa.film_id = f.film_id
      ) AS actors
   ) a ON true
WHERE  f.title = 'ACADEMY DINOSAUR';
-- ORDER  BY f.title; -- redundant while we filter for a single title 

While only aggregating a single array in the lateral subquery, a simple ARRAY constructor performs better than the aggregate function array_agg().

Or with a lowly correlated subquery for the simple case:

SELECT f.film_id, f.title
     , ARRAY (SELECT a.first_name
              FROM   film_actor fa
              JOIN   actor a USING (actor_id)
              WHERE  fa.film_id = f.film_id) AS actors
FROM   film f
WHERE  f.title = 'ACADEMY DINOSAUR';

Or, very basically, just 2x LEFT JOIN and then aggregate (grouping by the PK film_id allows the functionally dependent title in the SELECT list):

SELECT f.film_id, f.title, array_agg(a.first_name) AS actors
FROM   film f
LEFT   JOIN film_actor fa USING (film_id)
LEFT   JOIN actor a USING (actor_id)
WHERE  f.title = 'ACADEMY DINOSAUR'
GROUP  BY f.film_id;

These three seem fastest in my updated fiddle (planning + execution time).

Your first attempt (only slightly modified) is typically fastest to retrieve all or most films, but not for a small selection:

SELECT f.film_id, f.title, a.actors
FROM   film f
LEFT   JOIN (         
   SELECT fa.film_id, array_agg(first_name) AS actors
   FROM   actor
   JOIN   film_actor fa USING (actor_id)
   GROUP  by fa.film_id
   ) a USING (film_id)
WHERE  f.title = 'ACADEMY DINOSAUR';  -- not good for a single (or few) films!

Tests with much bigger cardinalities will be more revealing. And don't generalise results lightly; there are many factors for total performance.