Postgres Performance Issue – Troubleshooting Intermittent Problems (Docker)

postgresql-9.3

I am running a local Postgres 9.3 in a docker container with a temporary database that is used for some file import operations.

The basic scenario is:

  • Load files into input tables
  • INSERT from a SELECT into the output tables (transform)
  • Read the output tables

There are several tables that we transform with separate and sequential statements. We are running into severe performance degradation on two of our transform steps. It changes between which of the two will have the issue and if one has it, the other will not.

The degradation is ~100x slower, 2 seconds -> 20 minutes for one; and 4 seconds -> 40 minutes for the other.

The queries look like:

INSERT INTO target_table (
  field1,
  field2,
  -- elided
  field20,
  member_key,
  source_file_name,
  source_line_number
)
SELECT
  src.field1 as field1,
  src.field2 as field2,
  -- elided
  src.field20 as field20,
  mem.member_key as member_key,
  src.source_file as source_file_name,
  src.source_line as source_line_number
FROM
source_table src
INNER JOIN members mem
ON src.member_identifier = mem.member_identifier;

The two queries are nearly identical, the slower of the two has a few more fields, and about twice the rows. The 'source_table' and 'target_table' change, but both queries use the same members table.

In our sample set the first source table has 360,000 rows and the second has 600,000 rows. There is no where clause, so the SELECT portion will operate on all rows.

This slowdown started showing up for us about a week ago. The only thing that we changed was creating a reduced data set to work on performance issues elsewhere in the system. The original datasets for those queries were 3.6M and 6M rows and the inserts never took longer than a couple of minutes to complete.

Other Information:

  • This is running on an AWS EC2 i2.2xlarge box.
  • OS is Ubuntu 14.04.2 LTS
  • Postgres and the caller of this query are in Docker containers (separate)
  • The Postgres image is the stock postgres:9.3 image run with postgres -F (disables fsync, this is an ephemeral database)
  • There is nothing else happening on the box
  • When running the query, one CPU is at 100%, the others hover around 0%
  • When running the query, top reports a 'wa' parameter of 0.0%
  • When the query was running and stuck in the 100% CPU state, I ran just the SELECT portion from a separate connection. This completed in a couple of seconds.
  • When running the query, the checkpoints_req column of the table pg_stat_bgwriter is 2.

Best Answer

I think this is the answer. I'll update/remove this if I get any contra-indications.

In our (admittedly bizarre) use case, the tables in this ephemeral database are being loaded via COPY FROM statements. Then we start performing transforms.

My hypothesis is that the table statistics do not reflect the actual data population when the queries are run, causing the generation of a poor query plan.

I have forced the execution of an ANALYZE statement after the bulk load, but before the queries began. This seems to have addressed the issue.