Postgresql – Effect of over allocating shared_buffers

amazon-rdscrashpostgresqltuning

I have a 9.6 PostgreSQL RDS instance that I am tending to, but did not initially configure, and have been struggling with two different symptoms:

  1. the system (a db.r3.2xlarge w/ 61GB memory) would experience "cliffs" in freeable memory, dropping 20-30GB almost instantaneously, followed by –

  2. an auto-recovery event after some amount of time (could be minutes, could be hours).

On investigation, I found shared_buffers set abnormally high, actually exceeding physical memory on the machine. I have reset to a more sane level, but would like some more insight into the expected behavior.

The workload is modest – about a dozen or so concurrent users, manually triggering some large queries (e.g., aggregates on 300+ million row tables). My guess is that those queries trigger the cliffs. My question is about the after effects:

  1. Would over-allocating shared_buffers explain the stability problems? E.g., PostgreSQL tries to allocate more buffer than the OS has left, the OS says "nope" and things go south from there.

  2. Is shared_buffer memory ever released back to the OS? For example, I would frequently see a couple idle backends consuming 20-30GB, terminate them, but never see a rise in freeable memory.

Best Answer

Would over-allocating shared_buffers explain the stability problems? E.g., PostgreSQL tries to allocate more buffer than the OS has left, the OS says "nope" and things go south from there.

When the OS says "nope" it's in the system log, and likely the PostgreSQL error log. Just open it up and look. The process should also not start if a bigger allocation is requested than SHMMAX. But that value may not be set properly -- it may be higher than the physical memory.

On investigation, I found shared_buffers set abnormally high, actually exceeding physical memory on the machine. I have reset to a more sane level, but would like some more insight into the expected behavior.

shared_buffers doesn't have to fit into RAM. It'll just page. This could be very bad for performance, but I don't know if it would cause the symptoms seen. "Stability problems" are usually hard to define. A bottleneck that reduces memory speed to disk speed can cause all kinds of bad behavior.