I work with a table that contains about 1.5 billion records.
DB: Amazon RDS, PostgreSQL 12.4, 16 GB RAM, 4 vCPUs
Schema:
CREATE TABLE public.trip (
    id bigint NOT NULL,
    cell_to character varying NOT NULL,
    cell_from character varying NOT NULL,
    indicator character varying NOT NULL,
    time_id integer,
    weight double precision
);
CREATE INDEX ix_trip_cell_from ON public.trip USING btree (cell_from);
CREATE INDEX ix_trip_cell_to ON public.trip USING btree (cell_to);
CREATE INDEX ix_trip_indicator ON public.trip USING btree (indicator);
CREATE INDEX ix_trip_time_id ON public.trip USING btree (time_id);
I'm trying to pull all trips that happen within some cells (the output is around 7-12 million records):
EXPLAIN ANALYZE
SELECT
    cell_to,
    cell_from,
    time_id,
    weight AS trips
FROM trip
WHERE
    cell_to IN (VALUES ... 1k values)
    AND cell_from IN (VALUES ... 1k values (the same as above))
    AND time_id IN (VALUES ... 3 to 20 values)
    AND indicator = 'some string';
You can find the EXPLAIN ANALYZE result here: https://explain.depesz.com/s/RxH4.
What I've tried:
- Replaced the IN (VALUES ...) lists with INNER JOINs -> some improvement (a sketch of the rewrite follows this list)
- Changed the B-tree indexes to BRIN -> slightly better timing
- VACUUM, REINDEX, tuning work_mem -> no effect
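For reference, the JOIN rewrite looked roughly like this (a sketch; the value lists are elided exactly as in the query above, and the alias names are mine):

-- Each IN list becomes a joined VALUES list; the planner can then
-- treat the lists as small relations instead of large IN conditions.
SELECT t.cell_to, t.cell_from, t.time_id, t.weight AS trips
FROM trip AS t
INNER JOIN (VALUES ... 1k values) AS ct (cell) ON t.cell_to = ct.cell
INNER JOIN (VALUES ... 1k values) AS cf (cell) ON t.cell_from = cf.cell
INNER JOIN (VALUES ... 3 to 20 values) AS ti (id) ON t.time_id = ti.id
WHERE t.indicator = 'some string';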
The query still runs too long.
UPDATE:
Thanks to @NikitaSerbskiy and @Laurenz Albe, forcing PostgreSQL to use the indexes and adding a multicolumn index helped a lot.
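The post doesn't show the final index, but a plausible sketch covering the query's predicate columns (the name and column order here are assumptions, leading with the equality-filtered column) is:

-- Hypothetical multicolumn index: the equality column first, then the
-- IN-list columns; the exact columns/order actually used are not shown.
CREATE INDEX ix_trip_indicator_time_from_to
    ON public.trip (indicator, time_id, cell_from, cell_to);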
Best Answer
You might get bitmap index scans and better performance if you increase work_mem to something like 200MB or more, so that a bitmap for your table fits into it. Other than that, the only remedy I can see is using more parallel workers by raising max_parallel_workers_per_gather. But all these optimizations are questionable if you plan to run more than a single concurrent query on this tiny machine.
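As a minimal sketch, both settings can be tried at the session level before re-running the query; the 200MB value comes from the answer, while the worker count of 4 is an assumption matching the instance's 4 vCPUs:

-- Session-level experiment: 200MB per the answer above; 4 workers is
-- an assumption based on the instance's 4 vCPUs.
SET work_mem = '200MB';
SET max_parallel_workers_per_gather = 4;
-- then re-run the EXPLAIN ANALYZE query above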
UPDATE:
Experiments with enable_seqscan = off suggest that PostgreSQL overestimates the cost of an index scan. So if you lower random_page_cost to something closer to 1, PostgreSQL should choose the better plan automatically.
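A minimal sketch of that experiment; 1.1 is an assumed value for "closer to 1" (typical for SSD-backed storage), and note that on RDS a persistent change would go through the parameter group rather than a session setting:

-- Diagnostic only: disable sequential scans for this session to see
-- whether the planner then picks the faster index plan.
SET enable_seqscan = off;
-- re-run the EXPLAIN ANALYZE query above, then restore the default:
RESET enable_seqscan;

-- If the index plan wins, lower the estimated cost of random I/O;
-- 1.1 is an assumed value suitable for SSD storage.
SET random_page_cost = 1.1;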