I'm just starting studying Postgres and I'm in a situation where, depending how I do the JOIN on my tables, the performance and plan output seems really strange.
Those are the table used with their indexes:
create table escola
(
pk_codigo integer not null
constraint pk_escola
primary key,
nome varchar(100),
municipio varchar(150),
uf char(2),
cod_municipio integer,
uf_id integer default 0 not null
constraint fk_escola_uf_id
references tb_uf
)
;
create index idx_escola_uf
on escola (uf)
;
create index idx_escola_uf_id
on escola (uf_id)
;
create index idx_multi_escola_uf_pk
on escola (uf, pk_codigo)
;
create table if not exists candidato
(
pk_numero_inscricao bigint not null
constraint candidato_pk
primary key,
cod_municipio_residencia integer,
municipio_residencia varchar(150),
uf_residencia char(2),
uf_nascimento char(2),
situacao_conclusao numeric(1),
ano_concluiu smallint,
idade smallint,
sexo char,
fk_codigo_escola integer
constraint fk_candidato_codigo_escola
references escola,
uf_prova char(2)
)
;
create index if not exists idx_candidato_codigo_escola
on candidato (fk_codigo_escola)
;
create table tb_uf
(
uf varchar(2),
pk_id serial not null
constraint tb_uf_pkey
primary key
)
;
create unique index tb_uf_uf_uindex
on tb_uf (uf)
;
create unique index tb_uf_pk_id_uindex
on tb_uf (pk_id)
;
And the queries (with plans):
EXPLAIN ANALYZE
SELECT pk_numero_inscricao, pk_codigo
FROM escola e
JOIN candidato c
ON c.fk_codigo_escola = e.pk_codigo
WHERE e.uf = 'RJ'
;
Time without EXPLAIN ANALYZE
: 916ms
Plan: https://explain.depesz.com/s/M6B
EXPLAIN ANALYZE
SELECT pk_numero_inscricao, pk_codigo
FROM escola AS e
JOIN candidato AS c
ON c.fk_codigo_escola = e.pk_codigo
JOIN tb_uf AS u
ON e.uf_id = u.pk_id
WHERE u.uf = 'RJ'
;
Time without EXPLAIN ANALYZE
: 72ms
Plan: https://explain.depesz.com/s/E3MR
EXPLAIN ANALYZE
SELECT pk_numero_inscricao, pk_codigo
FROM escola AS e
JOIN candidato AS c
ON c.fk_codigo_escola = e.pk_codigo
WHERE e.uf_id = 19
;
Time without EXPLAIN ANALYZE
: 961ms
Plan: https://explain.depesz.com/s/v67V
The weird thing happening for me is that queries 1 and 3 are slower than query 2, although query 2 has an extra join. Does anybody know what might be causing this?
I noticed that Index Scan
on table candidato
is a lot slower on queries 1 and 3 also, and that makes no sense for me, since the final result is the same.
Another point is that EXPLAIN ANALYZE
is adding a lot of overhead into the queries.
Thanks in advance! If I need to provide any more information, I can edit this post if needed!
Best Answer
The answer is pretty "searchable" when you notice, that 1st and 3rd query fetch 1.2M rows from
candidato
...just to exclude 90% of them from the results.2nd query returns 1 row from
tb_uf
, which forces Nested Loop plan.This means that planner has wrong assumptions about statistics (expected results count) or costs (of random seek). You could either tune these values:
https://www.postgresql.org/message-id/20060926193553.GA27268@oppetid.no
Have Postgresql query planner use nested loop w/ indices over hash join
or manually force a Nested Loop. Although my intuition tells me it should be better to have Hash Join here.
Not having the data, I'd suggest to try one of:
set enable_mergejoin = off