PostgreSQL 9.4 – Performance Issues After Upgrade

auto-growth, fill-factor, locking, postgresql, postgresql-9.4

After upgrading our database from 9.3.5 to 9.4.1 last night, the server suffers from high CPU spikes. The upgrade was done with pg_dump: the database was dumped to SQL and then imported into 9.4.

During the CPU spikes, there are a lot of these messages in the logs:

process X still waiting for ExclusiveLock on extension of relation Y of database Z 
after 1036.234 ms

And:

process X acquired ExclusiveLock on extension of relation Y of database Z
after 2788.050 ms

What looks suspicious is that there are sometimes several "acquired" messages for the exact same relation number in the exact same millisecond.
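To see how often this happens per relation, the messages can be tallied from the server log. This is only a rough sketch: it assumes each message sits on a single log line, and the function name and the log path in the usage note are made up for illustration.

```shell
#!/bin/sh
# Tally "still waiting" and "acquired" extension-lock messages per relation.
# Assumes one message per log line, in the wording quoted above.
summarize_extension_waits() {
    awk '/ExclusiveLock on extension of relation/ {
        # The relation number is the word right after the first "relation".
        for (i = 1; i <= NF; i++) if ($i == "relation") { rel = $(i + 1); break }
        if ($0 ~ /still waiting/) waiting[rel]++
        else if ($0 ~ /acquired/) acquired[rel]++
        seen[rel] = 1
    }
    END {
        for (r in seen) printf "%s waiting=%d acquired=%d\n", r, waiting[r], acquired[r]
    }' "$1" | sort
}
```

Usage would be something like summarize_extension_waits /path/to/postgresql.log, which prints one line per relation number with both counts.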

Why would Postgres grow a table twice in the same millisecond? Could it be an index with a high fill factor?

Any suggestions on how to approach this issue are welcome.

P.S. I've also asked this question on the Postgres mailing list. If that's not okay, let me know.

Best Answer

The problem had to do with a kernel feature called Transparent Huge Pages (THP). You can diagnose this with perf top:

 59.73%       postmaster  [kernel.kallsyms]      [k] compaction_alloc
  1.31%       postmaster  [kernel.kallsyms]      [k] _spin_lock
  0.94%       postmaster  [kernel.kallsyms]      [k] __reset_isolation_suitable
  0.78%       postmaster  [kernel.kallsyms]      [k] compact_zone
  0.67%       postmaster  [kernel.kallsyms]      [k] get_pageblock_flags_group
  0.64%       postmaster  [kernel.kallsyms]      [k] copy_page_c
  0.48%           :13410  [kernel.kallsyms]      [k] compaction_alloc
  0.45%           :13465  [kernel.kallsyms]      [k] compaction_alloc
  0.45%       postmaster  [kernel.kallsyms]      [k] clear_page_c
  0.44%       postmaster  postgres               [.] hash_search_with_hash_value
  0.41%           :13324  [kernel.kallsyms]      [k] compaction_alloc
  0.40%           :13561  [kernel.kallsyms]      [k] compaction_alloc

The compaction_alloc function points at a problem. You can turn off THP with:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

Postgres versions before 9.4 do not specifically ask for huge pages, but the kernel can force huge pages on them when THP is set to always.
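Before changing anything, it may help to check which THP mode is currently active. The sketch below only reads the setting. The redhat_ path is the one used above; the second path is where mainline kernels keep the same file, which is an assumption about your system. The helper name is made up.

```shell
#!/bin/sh
# Print the active mode from the contents of a THP "enabled" file.
# The kernel marks the active mode with square brackets,
# e.g. "[always] madvise never".
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Check both candidate locations and report whichever exists.
for f in /sys/kernel/mm/redhat_transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/enabled; do
    if [ -r "$f" ]; then
        printf '%s: %s\n' "$f" "$(thp_active_mode "$(cat "$f")")"
    fi
done
```

Note that writing never with echo only lasts until the next reboot, so the setting has to be reapplied at boot if you want it to stick.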

Red Hat also discourages THP for database workloads.
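If perf is not at hand, the kernel's compaction counters might give a similar hint. This sketch assumes your kernel exposes a compact_stall counter in /proc/vmstat (not every build does); both function names are invented for the example.

```shell
#!/bin/sh
# Print the value of one counter from a vmstat-style file
# ("name value" per line). Defaults to /proc/vmstat.
vmstat_counter() {
    awk -v name="$1" '$1 == name { print $2 }' "${2:-/proc/vmstat}"
}

# Sample compact_stall twice and print how much it grew in between.
# A missing counter is treated as 0.
compact_stall_delta() {
    interval=${1:-10}
    before=$(vmstat_counter compact_stall)
    sleep "$interval"
    after=$(vmstat_counter compact_stall)
    echo $(( ${after:-0} - ${before:-0} ))
}
```

Running compact_stall_delta 10 during a CPU spike and again after turning THP off would show whether the stalls stop growing.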

Related Solutions

Postgresql – pg_dump does not finish

I was able to figure out which file in the database storage was the culprit, by copying all the files to /dev/null.

cp -vR /usr/lib/postgresql/8.4 /dev/null

(The path to your DB files might differ)
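Some cp versions may refuse to copy a directory onto /dev/null, so here is an alternative sketch of the same idea: read every file and report the ones that fail. The function name is made up, and the directory you pass in is whatever your data directory happens to be.

```shell
#!/bin/sh
# Read every regular file under a directory and print the ones
# that cannot be read completely.
check_readable() {
    find "$1" -type f | while IFS= read -r f; do
        if ! cat "$f" > /dev/null 2>&1; then
            printf 'READ ERROR: %s\n' "$f"
        fi
    done
}
```

Call it as check_readable followed by the data directory, as a user that is allowed to read the files. No output means every file could be read to the end.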

The corrupt file could not be copied, and there was nothing I could do to change that, so it was most probably a filesystem error or hardware failure.

So I restarted the server with a forced fsck (e.g. touch /forcefsck), to make sure the filesystem would do its best to fix itself. This might not be the way you'll want to go, since a total data loss is possible afterwards, but I had already saved the most precious data beforehand, so I took the risk.

After the reboot I could finally access the inaccessible table again, but I am not sure whether the data it contains is corrupted. Anyway, I do have a backup now, which I can dissect to find out, and my server can go back online for now.

I recommend reading the PostgreSQL wiki page about corruption and the slides of this FOSDEM presentation for some more info on DB corruption.

Postgresql – How to maintain high INSERT-performance on PostgreSQL

There are a few things that might be causing this issue, but I can't be sure any of them are the real problem. The troubleshooting all involves turning on extra logging in the database, then seeing if the slow parts line up with messages there. Make sure you put a timestamp in the log_line_prefix setting to have useful logs to look at. See my tuning intro to get started here: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Postgres does all of its writes to the operating system cache, then later they head to disk. You can watch that by turning on log_checkpoints and reading the messages. When things slow down, it may simply be that all the caches are now full, and all writes are stuck waiting for the slowest part of I/O. You might improve this by changing the Postgres checkpoint settings.

There's an internal issue with the database people hit sometimes where heavy inserts get stuck waiting for a lock in the database. Turn on log_lock_waits to see if you're hitting that one.

Sometimes the rate you can do burst inserts at is higher than you can sustain once the system autovacuum process kicks in. Set log_autovacuum_min_duration to see if the problems coincide with autovacuum runs.
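Put together, the logging suggestions above could look roughly like this. It is a sketch, not a recommendation for every system: the prefix format and the threshold of 0 (log every autovacuum run) are my own choices, and the function name is made up.

```shell
#!/bin/sh
# Emit a postgresql.conf fragment that turns on the logging discussed above.
logging_conf() {
    cat <<'EOF'
# Timestamp and process ID on every log line (exact format is a matter of taste).
log_line_prefix = '%t [%p] '
# Log each checkpoint with its timing.
log_checkpoints = on
# Log lock waits that last longer than deadlock_timeout.
log_lock_waits = on
# Log every autovacuum run; raise this (in ms) to log only the slow ones.
log_autovacuum_min_duration = 0
EOF
}
```

Append the output of logging_conf to postgresql.conf and reload the server, then compare the timestamps of the slow periods with what shows up in the log.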

We know that a large amount of memory in the database's private shared_buffers cache doesn't work as well on Windows as it does on other operating systems. There isn't as much visibility into what goes wrong when it happens either. I would not try to host something that's doing 1000+ inserts a second on a Windows PostgreSQL database. It's just not a good platform for really heavy writes yet.
