Postgresql – Postgres NOT NULL optimization

optimization, postgresql, query-performance

I am trying to optimize this SQL query:

select topics.id from "topics"
     left join "articles_topics" on "topics"."id" = "articles_topics"."topic_id"
     left join "articles" on "articles_topics"."article_id" = "articles"."id"
where not "topics"."type" = 'sport' and "articles"."image" is not null
group by "topics"."id"
having COUNT(articles.id) > 10

Here is the full query plan with costs (I used EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)):

Finalize HashAggregate  (cost=12881.12..12974.90 rows=2501 width=8) (actual time=209.037..210.463 rows=1381 loops=1)
  Output: topics.id
  Group Key: topics.id
  Filter: (count(articles.id) > 10)
  Rows Removed by Filter: 5672
  Buffers: shared hit=8624
  ->  Gather  (cost=12018.39..12843.61 rows=7502 width=16) (actual time=198.146..205.348 rows=10376 loops=1)
"        Output: topics.id, (PARTIAL count(articles.id))"
        Workers Planned: 1
        Workers Launched: 1
        Buffers: shared hit=8624
        ->  Partial HashAggregate  (cost=11018.39..11093.41 rows=7502 width=16) (actual time=192.791..194.319 rows=5188 loops=2)
"              Output: topics.id, PARTIAL count(articles.id)"
              Group Key: topics.id
              Buffers: shared hit=8624
              Worker 0: actual time=188.316..190.218 rows=5394 loops=1
                Buffers: shared hit=3745
              ->  Hash Join  (cost=7499.10..10515.66 rows=100546 width=16) (actual time=54.077..159.378 rows=63672 loops=2)
"                    Output: topics.id, articles.id"
                    Inner Unique: true
                    Hash Cond: (articles_topics.topic_id = topics.id)
                    Buffers: shared hit=8624
                    Worker 0: actual time=47.006..148.933 rows=65595 loops=1
                      Buffers: shared hit=3745
                    ->  Parallel Hash Join  (cost=6948.79..9699.13 rows=101364 width=16) (actual time=48.622..113.016 rows=87035 loops=2)
"                          Output: articles_topics.topic_id, articles.id"
                          Inner Unique: true
                          Hash Cond: (articles_topics.article_id = articles.id)
                          Buffers: shared hit=7900
                          Worker 0: actual time=39.510..116.939 rows=90075 loops=1
                            Buffers: shared hit=3383
                          ->  Parallel Seq Scan on public.articles_topics  (cost=0.00..2464.56 rows=108856 width=16) (actual time=0.010..17.317 rows=92554 loops=2)
"                                Output: articles_topics.article_id, articles_topics.topic_id"
                                Buffers: shared hit=1376
                                Worker 0: actual time=0.010..21.592 rows=96072 loops=1
                                  Buffers: shared hit=732
                          ->  Parallel Hash  (cost=6720.30..6720.30 rows=18279 width=8) (actual time=46.963..46.964 rows=21942 loops=2)
                                Output: articles.id
                                Buckets: 65536  Batches: 1  Memory Usage: 2240kB
                                Buffers: shared hit=6524
                                Worker 0: actual time=39.462..39.462 rows=17804 loops=1
                                  Buffers: shared hit=2651
                                ->  Parallel Seq Scan on public.articles  (cost=0.00..6720.30 rows=18279 width=8) (actual time=0.010..30.455 rows=21942 loops=2)
                                      Output: articles.id
                                      Filter: (articles.image IS NOT NULL)
                                      Rows Removed by Filter: 1636
                                      Buffers: shared hit=6524
                                      Worker 0: actual time=0.014..26.579 rows=17804 loops=1
                                        Buffers: shared hit=2651
                    ->  Hash  (cost=456.54..456.54 rows=7502 width=8) (actual time=5.394..5.394 rows=7502 loops=2)
                          Output: topics.id
                          Buckets: 8192  Batches: 1  Memory Usage: 358kB
                          Buffers: shared hit=724
                          Worker 0: actual time=7.437..7.437 rows=7502 loops=1
                            Buffers: shared hit=362
                          ->  Seq Scan on public.topics  (cost=0.00..456.54 rows=7502 width=8) (actual time=0.022..2.176 rows=7502 loops=2)
                                Output: topics.id
                                Filter: ((topics.type)::text <> 'sport'::text)
                                Rows Removed by Filter: 61
                                Buffers: shared hit=724
                                Worker 0: actual time=0.027..2.189 rows=7502 loops=1
                                  Buffers: shared hit=362
Planning Time: 1.580 ms
Execution Time: 211.823 ms

I tried adding indexes, and I moved the "articles"."image" is not null condition into the join clause on articles …; I also tried this: https://stackoverflow.com/questions/31966218/postgresql-create-an-index-to-quickly-distinguish-null-from-non-null-values
But there was no improvement. Can this query be optimized somehow?
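
For reference, the partial-index approach from that linked answer would look roughly like the sketch below; the index name and the choice of indexed column are illustrative, not taken from the original post:

-- Hypothetical partial index that only covers articles with an image,
-- so a lookup for "image IS NOT NULL" never has to touch the other rows.
CREATE INDEX articles_image_not_null_index
    ON public.articles (id)
    WHERE image IS NOT NULL;

Given the plan above, the sequential scan on articles discards only a small share of rows for that filter (Rows Removed by Filter: 1636 per worker), so such an index by itself is unlikely to change the overall plan much.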

Create scripts:

CREATE TABLE public.articles (
    id bigserial NOT NULL,
    long_id varchar(255) NOT NULL,
    title varchar(1023) NULL,
    summary text NULL,
    is_top bool NULL DEFAULT false,
    date_published timestamptz NULL,
    image varchar(1023) NULL,
    original_url varchar(2047) NULL,
    created_at timestamptz NULL,
    updated_at timestamptz NULL,
    CONSTRAINT articles_long_id_unique UNIQUE (long_id),
    CONSTRAINT articles_pkey PRIMARY KEY (id)
);
CREATE INDEX articles_date_published_id_index ON public.articles USING btree (date_published, id);
CREATE INDEX articles_date_published_index ON public.articles USING btree (date_published);

CREATE TABLE public.topics (
    id bigserial NOT NULL,
    long_id varchar(255) NOT NULL,
    "name" varchar(255) NULL,
    icon_name varchar(255) NULL,
    "type" varchar(255) NULL,
    created_at timestamptz NULL,
    updated_at timestamptz NULL,
    logo varchar(255) NULL,
    image varchar(255) NULL,
    short_name varchar(255) NULL,
    source_id varchar(255) NULL,
    source_type_id int4 NULL,
    full_name varchar(255) NULL,
    nick_name varchar(255) NULL,
    first_name varchar(255) NULL,
    surname varchar(255) NULL,
    parent_topic_id int8 NULL,
    CONSTRAINT topics_long_id_unique UNIQUE (long_id),
    CONSTRAINT topics_pkey PRIMARY KEY (id),
    CONSTRAINT topics_type_source_id_unique UNIQUE (type, source_id),
    CONSTRAINT topics_parent_topic_id_foreign FOREIGN KEY (parent_topic_id) REFERENCES topics(id) ON DELETE SET NULL
);
CREATE INDEX topics_type_index ON public.topics USING btree (type);

CREATE TABLE public.articles_topics (
    article_id int8 NOT NULL,
    topic_id int8 NOT NULL,
    CONSTRAINT articles_topics_pkey PRIMARY KEY (article_id, topic_id),
    CONSTRAINT articles_topics_article_id_foreign FOREIGN KEY (article_id) REFERENCES articles(id) ON UPDATE CASCADE ON DELETE CASCADE,
    CONSTRAINT articles_topics_topic_id_foreign FOREIGN KEY (topic_id) REFERENCES topics(id) ON UPDATE CASCADE ON DELETE CASCADE
);
CREATE INDEX articles_topics_topic_id_index ON public.articles_topics USING btree (topic_id);

Best Answer

You say that "the actual speed is not the problem, the problem is the cost", but the cost reported in EXPLAIN is just a loose estimate of the speed, so I don't understand this distinction.

If the problem is speed under high concurrency, I would turn off parallelization (max_parallel_workers_per_gather=0).
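
For a quick comparison you can try this per session before changing anything cluster-wide; these are standard PostgreSQL settings, shown here only as a sketch:

-- Disable parallel workers for the current session and re-run the query.
SET max_parallel_workers_per_gather = 0;

-- Or make it the cluster-wide default (takes effect after a config reload).
ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
SELECT pg_reload_conf();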

But really this doesn't look like the kind of query whose results change frequently, or where absolute accuracy is required. So do look into a materialized view, or some other form of caching.
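
A minimal sketch of the materialized-view idea, assuming the view name and the refresh schedule are up to you (the left joins can be written as inner joins here, because the WHERE condition on articles.image already discards unmatched rows):

-- Precompute the topic ids that currently have more than 10 articles with an image.
CREATE MATERIALIZED VIEW topics_with_many_articles AS
SELECT topics.id
FROM topics
JOIN articles_topics ON articles_topics.topic_id = topics.id
JOIN articles ON articles.id = articles_topics.article_id
WHERE topics.type <> 'sport'
  AND articles.image IS NOT NULL
GROUP BY topics.id
HAVING count(articles.id) > 10;

-- A unique index is required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX ON topics_with_many_articles (id);

-- Refresh on whatever schedule your staleness tolerance allows (e.g. from cron).
REFRESH MATERIALIZED VIEW CONCURRENTLY topics_with_many_articles;

The application then selects from topics_with_many_articles instead of running the aggregation on every request, at the price of results being as stale as the last refresh.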