PostgreSQL Performance – Analyzing Execution Time Over 24 Hours

Tags: disaster-recovery, index, performance, postgresql, postgresql-10

I upgraded a Postgres DB from 9.3.2 to 10.5 using pg_upgrade (in place). I did everything according to the documentation and the instructions given by pg_upgrade. Everything went fine, but then I realized that the indexes on one of the tables were no longer being used (other tables may be affected too).
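
(For reference: one way to check whether an index is used at all is the pg_stat_user_indexes view, plus an EXPLAIN of a typical lookup. This is just a sketch; the filter value 42 is a placeholder.)

-- per-index scan counters; idx_scan stays at 0 for indexes that are never used
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname = 'chs_contact_history_events';

-- plan of a typical lookup; a sequential scan here means the index is being ignored
EXPLAIN ANALYZE
SELECT * FROM public.chs_contact_history_events
WHERE chs_id = 42;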

So yesterday I started an ANALYZE on that table, and it is still running (for over 22 hours)!

The question: Is it normal for ANALYZE to have such a long execution time?

The table contains about 30M records. The structure is:

CREATE TABLE public.chs_contact_history_events (
    event_id bigint NOT NULL
           DEFAULT nextval('chs_contact_history_events_event_id_seq'::regclass),
    chs_id integer NOT NULL,
    event_timestamp bigint NOT NULL,
    party_id integer NOT NULL,
    event integer NOT NULL,
    cause integer NOT NULL,
    text text COLLATE pg_catalog."default",
    timestamp_offset integer,
    CONSTRAINT pk_contact_history_events PRIMARY KEY (event_id)
);

ALTER TABLE public.chs_contact_history_events OWNER to c_chs;

CREATE INDEX ix_chs_contact_history_events_chsid
    ON public.chs_contact_history_events USING hash (chs_id)
    TABLESPACE pg_default;

CREATE INDEX ix_chs_contact_history_events_id
    ON public.chs_contact_history_events USING btree (event_id)
    TABLESPACE pg_default;

CREATE INDEX ix_history_events_partyid
    ON public.chs_contact_history_events USING hash (party_id)
    TABLESPACE pg_default;

UPDATE:

I ran the query below to list the currently running processes and got some rather interesting results:

SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
AND state = 'active';

[Screenshot: abnormally long-running queries on that table]

It seems that the maintenance tasks and the concurrent index recreation on that table are frozen!

So the next question: is it safe to cancel those processes? And what should I do next? I suspect stopping them all and restarting the index creation will be necessary, but I'm unsure.


ANNEX 1

Possibly related errors corrected in v9:

9.3.7 and 9.4.2
Fix possible failure during hash index bucket split, if other processes are modifying the index concurrently

9.3.18 and 9.4.13 and 9.5.8 and 9.6.4
Fix low-probability corruption of shared predicate-lock hash table in Windows builds

9.5.4
Fix building of large (bigger than shared_buffers) hash indexes
The code path used for large indexes contained a bug causing incorrect hash values to be inserted into the index, so that subsequent index searches always failed, except for tuples inserted into the index after the initial build.

Possibly related errors corrected in v10:

10.2
Fix failure to mark a hash index's metapage dirty after adding a new overflow page, potentially leading to index corruption

Prevent out-of-memory failures due to excessive growth of simple hash tables


And last but not least, one that concerns me (since a further upgrade does not seem realistic in the production environment):

10.6
Avoid overrun of a hash index's metapage when BLCKSZ is smaller than default

Fix missed page checksum updates in hash indexes


ANNEX 2

Upgrade instruction in v10:

Hash indexes must be rebuilt after pg_upgrade-ing from any previous major PostgreSQL version

Major hash index improvements necessitated this requirement. pg_upgrade will create a script to assist with this.

Note that I did, of course, run that script at the time of the upgrade.
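
(As far as I can tell, that script essentially issues a REINDEX for every hash index. For this table it would boil down to something like the following, using the index names from the definitions above.)

REINDEX INDEX public.ix_chs_contact_history_events_chsid;
REINDEX INDEX public.ix_history_events_partyid;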

Best Answer

After several hours of research and examining the current situation, I think I managed to solve the issue. (Many thanks to fellow user ypercube for the inspiration, and to Erwin Brandstetter, who arrived at the same solution in parallel.)

So there were several layers to the problem.

1.) UPGRADE

Upgrading with pg_upgrade from 9.3.2 to 10.5 should be done in two steps: first within the same minor line (9.3.2 --> 9.3.25) and only then to 10.x (10.5 in my case).

I did a direct upgrade, and it seems that this was the root cause of the problem.

2.) HASH INDEXES

It seems that hash indexes suffered from several bugs in Postgres which have since been fixed, but using indexes built by the pre-fix versions still leads to errors.

3.) FROZEN TASKS

It makes sense to look for Postgres processes that have been running for an unrealistically long time (see the query in the question). In my case it turned out that the index recreation had gotten stuck somehow, and several other tasks were blocked as well.

It is safe to cancel most of them with SELECT pg_cancel_backend(__pid__); where __pid__ is the process ID found in the result set of the aforementioned query. So I did exactly that; I even stopped the autovacuum processes.
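
In practice the cancellation looked roughly like this; the one-hour threshold is only an example filter, adjust it to match the PIDs you actually identified:

-- cancel the stuck backends found by the monitoring query above
SELECT pid, pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 hour';

-- if a backend ignores the cancel request, terminate it as a last resort
-- SELECT pg_terminate_backend(__pid__);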

4.) MEMORY HANDLING

When, after all of this, I finally thought I could drop and create the new indexes, I faced the next problem: after about one minute every maintenance query exited with an error message:

ERROR: out of memory
DETAIL: Failed on request of size 22355968.
SQL state: 53200

It seems that memory handling changed between 9.3 and 10. I had to reduce maintenance_work_mem in the config:

maintenance_work_mem = 64MB     # min 1MB

It was 512MB before, and although the server has 32GB of RAM, it still did not work with that setting.
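
If you cannot (or do not want to) edit postgresql.conf right away, the same limit can also be set per session or persisted from SQL; these are just the standard alternatives, shown with the value that finally worked for me:

-- session-level override, affects only the following maintenance commands
SET maintenance_work_mem = '64MB';

-- or persist it and reload the configuration
ALTER SYSTEM SET maintenance_work_mem = '64MB';
SELECT pg_reload_conf();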

5.) RECREATING INDEXES

After all that, it was possible to recreate the indexes (drop the old ones and create new ones). It would have been easier with a proper script, but I had to do it manually. Don't forget that creating and dropping indexes locks the table, so in a production environment (like mine) you should do it CONCURRENTLY.

Edit:

I also realized that using hash indexes was not really meaningful in my specific case, so I decided to switch them to btree when recreating them, as sketched below.
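
For one of the hash indexes the recreation looked roughly like this (the _new suffix is just my naming; the other hash index is analogous):

-- build the replacement btree index first, without blocking writes
CREATE INDEX CONCURRENTLY ix_chs_contact_history_events_chsid_new
    ON public.chs_contact_history_events USING btree (chs_id);

-- then drop the old hash index, also without taking a long lock
DROP INDEX CONCURRENTLY public.ix_chs_contact_history_events_chsid;

-- finally rename the new index to the old name
ALTER INDEX public.ix_chs_contact_history_events_chsid_new
    RENAME TO ix_chs_contact_history_events_chsid;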

6.) ANALYZE

After recreating the indexes it is necessary to run ANALYZE on the affected tables (or on the whole DB). After all of the above actions it runs surprisingly quickly, even in a huge DB like mine.
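
For completeness, that is a one-liner per table; VERBOSE is optional but shows what is being processed:

-- refresh planner statistics for the affected table
ANALYZE VERBOSE public.chs_contact_history_events;

-- or for the whole database
-- ANALYZE;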

The indexes are being used again and performance is back to normal. So this is a happy ending to my first StackExchange post. :-)