Postgresql – Postgres: Performance Issue: Query on enormous data fails to use index

Tags: index, performance, postgresql

This is the schema of the task_statuses table:

                         Table "public.task_statuses"
     Column     |            Type             | Collation | Nullable | Default 
----------------+-----------------------------+-----------+----------+---------
 id             | uuid                        |           | not null | 
 updated_at     | timestamp without time zone |           | not null | 
 status         | task_status                 |           | not null | 
 status_details | json                        |           |          | 
Indexes:
    "task_statuses_id_updated_at_key" UNIQUE CONSTRAINT, btree (id, updated_at)
    "idx_task_statuses_id_updated_at" btree (id, updated_at)
Foreign-key constraints:
    "task_statuses_id_fkey" FOREIGN KEY (id) REFERENCES tasks(id) ON DELETE CASCADE

However, the table is huge: it holds 150 GB of data in production.
I am trying to run an extremely simple query:

    SELECT ts.id
    FROM task_statuses ts
    WHERE ts.status IN ('succeeded', 'failed', 'cancelled')
    ORDER BY ts.id, ts.updated_at DESC
    LIMIT 1000

It keeps timing out in production. When I remove the ORDER BY, the query runs successfully. Since I have an index on id and updated_at, I am not sure why the ORDER BY version is timing out.

EXPLAIN ANALYZE times out as well.

Here is the EXPLAIN output for the above query:

Limit  (cost=10651159.84..10651276.51 rows=1000 width=24)
  ->  Gather Merge  (cost=10651159.84..10744721.60 rows=801902 width=24)
        Workers Planned: 2
        ->  Sort  (cost=10650159.81..10651162.19 rows=400951 width=24)
              Sort Key: id, updated_at DESC
              ->  Parallel Seq Scan on task_statuses ts  (cost=0.00..10628176.10 rows=400951 width=24)
                    Filter: (status = ANY ('{succeeded,failed,cancelled}'::task_status[]))

Query plan without order by:

https://explain.depesz.com/s/CfIU

Suggestions or help would be much appreciated.

Best Answer

Most of your cost comes from the WHERE predicate on ts.status. You can see in the EXPLAIN output that it is doing a Parallel Seq Scan with an estimated 400,951 rows, at a cost of 10,628,176.10, because no index covers the status column.

While an index on the ORDER BY columns can help with sorting, you should generally focus first on indexing your predicates (JOIN, WHERE, and HAVING clauses). With a suitable index the planner does not have to do a sequential scan; it can scan, or even seek into, the index instead.

In this case, if you had an index on the status column, your performance would likely be better, regardless of the sorting required by your ORDER BY clause.

The difference in performance you're currently seeing is probably due to a different query plan being chosen when you remove the ORDER BY clause, one that happens to be more efficient overall (for example, without a sort, the LIMIT lets the scan stop as soon as 1,000 matching rows are found). If you ran an EXPLAIN for the query without the ORDER BY clause, I'm sure you'd see different operations occurring. But again, proper indexing on the status column should give you consistent performance either way.
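As a rough sketch of the indexing advice above, the statements below create an index on status and then re-check the plan. The index names are hypothetical. The second, partial index is an assumption of this sketch and not something the answer proposes; it is shown only because its column order matches the ORDER BY and its predicate matches the WHERE clause. Building either index on a 150 GB table will take a while, so try it on a copy first.

```sql
-- Sketch only: index names are hypothetical.
-- CONCURRENTLY avoids blocking writes while the index builds
-- (it cannot be run inside a transaction block).
CREATE INDEX CONCURRENTLY idx_task_statuses_status
    ON task_statuses (status);

-- Assumed alternative, not from the answer: a partial index whose
-- column order matches the ORDER BY and whose predicate matches the WHERE.
CREATE INDEX CONCURRENTLY idx_task_statuses_finished
    ON task_statuses (id, updated_at DESC)
    WHERE status IN ('succeeded', 'failed', 'cancelled');

-- Refresh planner statistics, then inspect the chosen plan.
-- Plain EXPLAIN only plans the query; it does not execute it.
ANALYZE task_statuses;

EXPLAIN
SELECT ts.id
FROM task_statuses ts
WHERE ts.status IN ('succeeded', 'failed', 'cancelled')
ORDER BY ts.id, ts.updated_at DESC
LIMIT 1000;
```

If the plan still shows a Parallel Seq Scan followed by a Sort, the planner has judged the new index to be no cheaper, which can happen when most rows match the status filter.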

Related Solutions

Postgresql – Postgres Index scan forward vs backward = speed difference of 357X slower

Since I like replacing aggregate functions by old-fashioned self-joins and NOT EXISTS clauses, here is my attempt:

CREATE SCHEMA IF NOT EXISTS tmp;
SET search_path='tmp';

DROP TABLE IF EXISTS tmp.changes CASCADE;
CREATE TABLE tmp.changes
        ( id integer NOT NULL PRIMARY KEY
        , fullname varchar
        , issuer varchar
        , rsymbol varchar
        , industry varchar
        , activity INTEGER NOT NULL
        , shareschange FLOAT
        , sharespchange FLOAT
        , mfiled FLOAT
        );

        -- lacking information from the OP
        -- I can only presume a flat distribution.
INSERT INTO tmp.changes(id, activity, shareschange,sharespchange,mfiled )
SELECT nm.*
        , (random() *20)::integer -- activity
        , random() *10000         -- shareschange
        , random() *100           -- sharespchange
        , random() *100000        -- mfiled
FROM generate_series(1,1000000) nm
        ;

ALTER TABLE tmp.changes
        ALTER shareschange
        SET STATISTICS 1000
        ;
ALTER TABLE tmp.changes
        ALTER mfiled
        SET STATISTICS 1000
        ;

VACUUM ANALYZE tmp.changes
        ;


CREATE INDEX changes_mfiled_shareschange
    ON tmp.changes(mfiled,shareschange)
        ;

EXPLAIN ANALYZE
SELECT initcap(ch.fullname) AS some_name1
     , initcap(ch.issuer) AS some_name2
     , upper(ch.rsymbol) AS some_name3
     , initcap(ch.industry) AS some_name4
     , ch.activity
     , to_char(ch.shareschange,'FM9,999,999,999,999,999') AS some_name5
     , ch.sharespchange || '%' AS some_name6
FROM   changes ch
WHERE  ch.activity IN (4,5)
        -- NOTE: the subquery is *not* correlated.
        -- [I had expected a subselect of nx.activity IN (4,5)
        -- like in the main query. ]
AND    NOT EXISTS (SELECT * FROM changes nx
        WHERE nx.mfiled > ch.mfiled
        )
ORDER  BY ch.shareschange ASC
LIMIT  15
        ;

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

I see a couple of issues.

The biggest one is that PostgreSQL is using a sequential scan on A when filtering A. I think you need a composite index on A.flag and A.strvalue. If such an index already exists, PostgreSQL is choosing not to use it for some reason. This scan accounts for about 92% of your cost estimate and is likely what's making the query run for so long.
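A minimal sketch of that composite index follows. The table name a and the columns flag and strvalue come from the wording above; the index name is hypothetical, and the column order (flag first) is an assumption, since the original query's predicates are not shown here.

```sql
-- Sketch only: index name and column order are assumptions.
CREATE INDEX idx_a_flag_strvalue
    ON a (flag, strvalue);

-- Refresh statistics so the planner can cost the new index.
ANALYZE a;
```

Re-running EXPLAIN on the original query afterwards shows whether the sequential scan on A has been replaced by an index scan.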

As for the ILIKE, PostgreSQL cannot natively use an index when the pattern starts with a wildcard (but see below for a module that can). That is simply a restriction of the ILIKE operator. For that reason you are getting a sequential scan, which means every single row is loaded and the C.name column is scanned for matching characters. One odd thing is that the ILIKE sequential scan doesn't seem to account for much of the cost estimate in this query plan. Anyway, if it is the ILIKE operator causing the slowdown, I would consider rewriting your query so that the pattern is anchored at the start, like ILIKE 'value%', or else consider using PostgreSQL's full text search.

UPDATED

The ILIKE operator can use a trigram index. Superb!
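A minimal sketch of such a trigram index, assuming the pg_trgm extension is available on the server. The table c and column name follow the C.name reference above; the index name and the search string are hypothetical.

```sql
-- pg_trgm ships with PostgreSQL's contrib modules; creating the
-- extension usually requires elevated privileges.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Sketch only: the index name is hypothetical.
-- gin_trgm_ops lets LIKE/ILIKE patterns with a leading wildcard use the index.
CREATE INDEX idx_c_name_trgm
    ON c USING gin (name gin_trgm_ops);

-- Check whether the planner now picks the trigram index.
EXPLAIN
SELECT *
FROM c
WHERE name ILIKE '%value%';
```

Trigram indexes help most when the search string is at least three characters long; very short patterns may still fall back to a sequential scan.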
