PostgreSQL Performance – Analyzing Execution Time Over 24 Hours

Tags: disaster-recovery, index, performance, postgresql, postgresql-10

I upgraded a Postgres DB from 9.3.2 to 10.5 using pg_upgrade (in place). I did everything according to the documentation and the instructions given by pg_upgrade. Everything went fine, but then I realized that the indexes on one of the tables were no longer being used (other tables may be affected too).
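
(For reference: one way to check whether an index is used at all is the pg_stat_user_indexes view, plus an EXPLAIN of a typical lookup. This is just a sketch; the filter value 42 is a placeholder.)

-- per-index scan counters; idx_scan stays at 0 for indexes that are never used
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname = 'chs_contact_history_events';

-- plan of a typical lookup; a sequential scan here means the index is being ignored
EXPLAIN ANALYZE
SELECT * FROM public.chs_contact_history_events
WHERE chs_id = 42;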

So yesterday I started an ANALYZE on that table, and it is still running (for over 22 hours)!

The question: Is it normal for ANALYZE to have such a long execution time?

The table contains about 30M records. The structure is:

CREATE TABLE public.chs_contact_history_events (
    event_id bigint NOT NULL
           DEFAULT nextval('chs_contact_history_events_event_id_seq'::regclass),
    chs_id integer NOT NULL,
    event_timestamp bigint NOT NULL,
    party_id integer NOT NULL,
    event integer NOT NULL,
    cause integer NOT NULL,
    text text COLLATE pg_catalog."default",
    timestamp_offset integer,
    CONSTRAINT pk_contact_history_events PRIMARY KEY (event_id)
);

ALTER TABLE public.chs_contact_history_events OWNER to c_chs;

CREATE INDEX ix_chs_contact_history_events_chsid
    ON public.chs_contact_history_events USING hash (chs_id)
    TABLESPACE pg_default;

CREATE INDEX ix_chs_contact_history_events_id
    ON public.chs_contact_history_events USING btree (event_id)
    TABLESPACE pg_default;

CREATE INDEX ix_history_events_partyid
    ON public.chs_contact_history_events USING hash (party_id)
    TABLESPACE pg_default;

UPDATE:

I ran the query below to list the currently running processes and got some rather interesting results:

SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
AND state = 'active';

[Screenshot: abnormally long-running queries on that table]

It seems that the maintenance tasks and the concurrent index recreation on that table are frozen!

So the next question: is it safe to cancel those processes? And what should I do next? I suspect stopping them all and restarting the index creation will be necessary, but I'm unsure.


ANNEX 1

Possibly related errors corrected in v9:

9.3.7 and 9.4.2
Fix possible failure during hash index bucket split, if other processes are modifying the index concurrently

9.3.18 and 9.4.13 and 9.5.8 and 9.6.4
Fix low-probability corruption of shared predicate-lock hash table in Windows builds

9.5.4
Fix building of large (bigger than shared_buffers) hash indexes
The code path used for large indexes contained a bug causing incorrect hash values to be inserted into the index, so that subsequent index searches always failed, except for tuples inserted into the index after the initial build.

Possibly related errors corrected in v10:

10.2
Fix failure to mark a hash index's metapage dirty after adding a new overflow page, potentially leading to index corruption

Prevent out-of-memory failures due to excessive growth of simple hash tables


And last but not least, one that concerns me (since a further upgrade does not seem realistic in the production environment):

10.6
Avoid overrun of a hash index's metapage when BLCKSZ is smaller than default

Fix missed page checksum updates in hash indexes


ANNEX 2

Upgrade instruction in v10:

Hash indexes must be rebuilt after pg_upgrade-ing from any previous major PostgreSQL version

Major hash index improvements necessitated this requirement. pg_upgrade will create a script to assist with this.

Note that I did, of course, run that script at the time of the upgrade.
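
(As far as I can tell, that script essentially issues a REINDEX for every hash index. For this table it would boil down to something like the following, using the index names from the definitions above.)

REINDEX INDEX public.ix_chs_contact_history_events_chsid;
REINDEX INDEX public.ix_history_events_partyid;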

Best Answer

After several hours of research and examining the current situation, I think I managed to solve the issue. (Many thanks to fellow user ypercube for the inspiration, and to Erwin Brandstetter, who arrived at the same solution in parallel.)

So there were several layers to the problem.

1.) UPGRADE

Upgrading with pg_upgrade from 9.3.2 to 10.5 should be done in two steps: first within the same minor line (9.3.2 --> 9.3.25) and only then to 10.x (10.5 in my case).

I did a direct upgrade, and it seems that this was the root cause of the problem.

2.) HASH INDEXES

It seems that hash indexes suffered from several bugs in Postgres which have since been fixed, but using indexes built by the pre-fix versions still leads to errors.

3.) FROZEN TASKS

It makes sense to look for Postgres processes that have been running for an unrealistically long time (see the query in the question). In my case it turned out that the index recreation had gotten stuck somehow, and several other tasks were blocked as well.

It is safe to cancel most of them with SELECT pg_cancel_backend(__pid__); where __pid__ is the process ID found in the result set of the aforementioned query. So I did exactly that; I even stopped the autovacuum processes.
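
In practice the cancellation looked roughly like this; the one-hour threshold is only an example filter, adjust it to match the PIDs you actually identified:

-- cancel the stuck backends found by the monitoring query above
SELECT pid, pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 hour';

-- if a backend ignores the cancel request, terminate it as a last resort
-- SELECT pg_terminate_backend(__pid__);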

4.) MEMORY HANDLING

When, after all of this, I finally thought I could drop and create the new indexes, I faced the next problem: after about one minute every maintenance query exited with an error message:

ERROR: out of memory
DETAIL: Failed on request of size 22355968.
SQL state: 53200

It seems that memory handling changed between 9.3 and 10. I had to reduce maintenance_work_mem in the config:

maintenance_work_mem = 64MB     # min 1MB

It was 512MB before, and although the server has 32GB of RAM, it still did not work with that setting.
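
If you cannot (or do not want to) edit postgresql.conf right away, the same limit can also be set per session or persisted from SQL; these are just the standard alternatives, shown with the value that finally worked for me:

-- session-level override, affects only the following maintenance commands
SET maintenance_work_mem = '64MB';

-- or persist it and reload the configuration
ALTER SYSTEM SET maintenance_work_mem = '64MB';
SELECT pg_reload_conf();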

5.) RECREATING INDEXES

After all that, it was possible to recreate the indexes (drop the old ones and create new ones). It would have been easier with a proper script, but I had to do it manually. Don't forget that creating and dropping indexes locks the table, so in a production environment (like mine) you should do it CONCURRENTLY.

Edit:

I also realized that using hash indexes was not really meaningful in my specific case, so I decided to switch them to btree when recreating them, as sketched below.
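
For one of the hash indexes the recreation looked roughly like this (the _new suffix is just my naming; the other hash index is analogous):

-- build the replacement btree index first, without blocking writes
CREATE INDEX CONCURRENTLY ix_chs_contact_history_events_chsid_new
    ON public.chs_contact_history_events USING btree (chs_id);

-- then drop the old hash index, also without taking a long lock
DROP INDEX CONCURRENTLY public.ix_chs_contact_history_events_chsid;

-- finally rename the new index to the old name
ALTER INDEX public.ix_chs_contact_history_events_chsid_new
    RENAME TO ix_chs_contact_history_events_chsid;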

6.) ANALYZE

After recreating the indexes it is necessary to run ANALYZE on the affected tables (or on the whole DB). After all of the above actions it runs surprisingly quickly, even in a huge DB like mine.
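
For completeness, that is a one-liner per table; VERBOSE is optional but shows what is being processed:

-- refresh planner statistics for the affected table
ANALYZE VERBOSE public.chs_contact_history_events;

-- or for the whole database
-- ANALYZE;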

The indexes are being used again and performance is back to normal. So this is a happy ending to my first StackExchange post. :-)