PostgreSQL – Could not fork new process for connection: Cannot allocate memory

Tags: postgresql, postgresql-9.3

I need help to understand the cause of this error and how to fix it.

I have a server running PostgreSQL 9.3. The cluster has around 1,000 databases. I use pgBouncer for connection pooling, with a pool_size of only 1 connection per database.
The total number of active connections at any given time is ~80, because most of the databases have almost no activity at all.
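In case it helps, this is roughly how I count the active backends, with a simple query against pg_stat_activity (the state column is what I filter on; adjust as needed):

SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;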

My application load has been increasing slowly over the last few months, and in the last couple of days I started to see the error showing up intermittently in the Postgres log file. The times the error is logged correspond to brief moments when the load is a bit higher than usual. Notice that the error happens when Postgres tries to fork a new process for a connection, and even for an autovacuum worker process:

2020-05-07 07:16:16 -03  LOG:  main could not fork new process for connection: Cannot allocate memory
2020-05-07 07:16:17 -03  LOG:  could not fork autovacuum worker process

These are some of my PostgreSQL settings:

max_connections = 300
shared_buffers = 2GB
effective_cache_size = 2GB
maintenance_work_mem = 1GB
work_mem = 288MB
wal_buffers = 8MB
checkpoint_segments = 16
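Those values come from postgresql.conf; to double-check what the running server actually uses, I run something like this (just a quick sanity check on my side):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_connections', 'shared_buffers', 'effective_cache_size',
               'maintenance_work_mem', 'work_mem');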

The server has 64GB of total RAM, 16 CPU cores, and it is running CentOS 7.
My stack is: Nginx, uWSGI, Redis, pgBouncer and PostgreSQL.
It's all installed on the same server, so resources must be shared between the elements of the stack. Redis is set to use no more than 30GB of RAM, Nginx uses ~8GB, uWSGI uses ~10GB, and PostgreSQL uses ~8GB.

I'm not an expert at PostgreSQL. I've been reading the documentation regarding memory consumption, and considering my settings, my best guess is that I need to set higher values for shared_buffers and effective_cache_size. I came to that conclusion because those settings have been in place for a long time (since my server had far fewer resources). Over the last few years I've been adding resources to the server, but at the same time it has become busier. So I think shared_buffers should be at least 8GB. What do you think about that?

Also, from what I've read, I think effective_cache_size should be higher than shared_buffers (considering the amount of RAM the server has available). Am I right?
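If raising them is the way to go, my understanding is that on 9.3 I have to edit postgresql.conf directly (ALTER SYSTEM only exists from 9.4 on) and restart for shared_buffers to take effect. Something like this is what I have in mind, the exact numbers being just my guess:

shared_buffers = 8GB            # needs a restart to take effect
effective_cache_size = 24GB     # planner hint only, allocates no memory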

One more thing: what do you think about my work_mem setting? I'm planning to reduce it (I'm not sure why it was set to 288MB). I know that the vast majority of the queries are very simple and fast, because all the databases belong to simple blog applications, where most queries are SELECTs and the UPDATEs change some value in a specific row, nothing complex. So, if I reduce work_mem, am I heading in the right direction?
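To be concrete, what I have in mind is lowering the default in postgresql.conf, for example (16MB is just a number I picked for illustration):

work_mem = 16MB

and then raising it per session only when an occasional heavy query needs it:

SET work_mem = '256MB';
-- run the heavy query here
RESET work_mem;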

Thank you very much in advance!
Warm regards,
Lisandro.

Best Answer

I think that your machine is overloaded. Your max_connections is set to 300, which is far more than the number of sessions this machine can actually serve at the same time.

As long as the number of active sessions stays around 80, your memory should not get exhausted, but I assume that there are spikes with more active sessions, and then your RAM is exhausted.
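To put rough numbers on it (a back-of-the-envelope estimate on my part; remember that work_mem is per sort or hash operation, so a single query can consume several multiples of it):

300 max_connections x 288MB work_mem = ~84GB of potential memory use
64GB RAM - ~48GB for Redis, Nginx and uWSGI = ~16GB left for PostgreSQL

A spike of active sessions therefore does not have to be very large before the kernel can no longer satisfy the fork.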

I am surprised that your machine runs well with 80 active sessions, since you have only 16 cores (you didn't say anything about the storage).

1,000 databases seems like a lot; I assume that each of them is small.

Since you have so many databases, I recommend that you move some to other clusters on different machines. That is an easy way to distribute the load.

Using a lower value for max_connections would reduce the danger of overload and memory exhaustion.
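On 9.3 that means editing postgresql.conf and restarting, since ALTER SYSTEM only exists from 9.4 on; for example (the exact value is a guess on my part, size it to what the machine can really serve):

max_connections = 100    # requires a restart

Also keep in mind that with pool_size = 1 per database and roughly 1,000 databases, pgBouncer can in theory open close to 1,000 server connections if many otherwise idle databases become active at the same time, so make sure the pgBouncer configuration cannot exceed whatever max_connections you settle on.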