PostgreSQL – FATAL 53300: Remaining connection slots are reserved for non-replication superuser connections

max-connections postgresql postgresql-12

I have a PostgreSQL 12.1 database system (I'll refer to it as PGSQL) running on a remotely hosted VM server (Windows Server 2019). We upgraded the server OS and PGSQL a couple of months ago. Everything has been running more-or-less normally since then, until this morning when I started receiving the above-mentioned database error in pretty much every one of our in-house applications that connect to this PGSQL instance.

To check the connections, I ran SELECT * FROM pg_stat_activity;, which returned 103 rows. My postgresql.conf file has max_connections = 100 (the three extra rows are presumably the default superuser_reserved_connections slots), so that makes sense, but what doesn't make sense is that, of those 103 connections, 90+ of them are listed as idle with query text of DISCARD ALL. All of these show as being executed by the same non-superuser account from the server's own internal address. However, several of the connections show a query_start value from a month or more ago.
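For anyone checking the same thing, a quick way to see where the slots are going is to summarize pg_stat_activity by user, address, state, and query text. This is just a sketch against the standard v12 columns:

-- Summarize connections by who opened them, where from, and what state they're in,
-- along with the oldest session start in each group.
SELECT
    usename,
    client_addr,
    state,
    query,
    COUNT(*) AS connection_count,
    MIN(backend_start) AS oldest_backend_start
FROM
    pg_stat_activity
GROUP BY
    usename, client_addr, state, query
ORDER BY
    connection_count DESC;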

Now, many of the applications we have in place are unfortunately built with hard-coded credentials (I have a lot of "clean-up" work to do on the code for these applications that I inherited) and are generally being executed from shortcuts pointing to an "Application" share on the server that's hosting the PGSQL database, so none of this looks particularly "suspicious". I tried to simply kill the processes using SELECT pg_cancel_backend(<pid>); on one of the pid values from the previous query, but requerying pg_stat_activity still shows the same record in the result set (all of the values appear to be exactly the same, from what I can tell).

Perhaps I'm not using the correct function to terminate these "hung" processes or something, but I could not figure out how to clear out these connections individually. Because I needed to get our production environment back to a usable state, I ended up just stopping and restarting the PGSQL service on the server, which did clear out all of those old DISCARD ALL sessions, but I'm curious whether there's something I could do to prevent this backlog of "hung" connections in the future.

My question here is, how can I prevent this from happening in the future? One thing to note is that, prior to upgrading our PGSQL server to v12.1, we ran v9.4 for a number of years and never once encountered this issue. I'm wondering if there might be something inherent to the newer version of PGSQL, or perhaps even something about running PGSQL in the Windows Server 2019 environment that might be causing this behavior.


EDIT

For reference and consolidation, the following information comes from the comments:

I do not have anything server-side managing connection pooling (I've seen references in other questions to PgBouncer, but haven't had a chance to evaluate whether it would be helpful in our environment). Most of my applications implement pooling in the connection string via the Npgsql library. I've built a "common library" for managing my applications' connections – connecting, disconnecting, disposing, etc. – which at least seems to be working normally.
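For reference, the pooling knobs I'm relying on live in the Npgsql connection string. As far as I can tell from the Npgsql documentation, the relevant keywords are Pooling, Maximum Pool Size, Connection Idle Lifetime (the number of seconds an idle connection may sit in the pool before being closed), and Connection Pruning Interval; the host, database, and credential values below are just placeholders:

Host=dbserver;Port=5432;Database=appdb;Username=app_user;Password=<secret>;Pooling=true;Maximum Pool Size=20;Connection Idle Lifetime=300;Connection Pruning Interval=10

(As far as I can tell, DISCARD ALL is also what Npgsql issues when it resets a pooled connection, which would explain the query text on all of those idle sessions.)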

However, it's certainly possible that some of the "legacy" code I've inherited doesn't implement this correctly, but it'll take me a while to dig through the code and find every place a connection is opened (there are a lot of issues with some of that code). I'll investigate that as a potential/likely source of the issue as time permits.

As stated above, I hadn't encountered this issue until we upgraded PGSQL to v12.1, and that same legacy code has been in place for several years. As a 1-man IT Dept, I manage the server's reboot "schedule" myself, and I've rarely rebooted the server or restarted the PGSQL service because it's a production environment. Hopefully, I'm just being hyper-sensitive because of the recent upgrade and this whole thing is a "one-off" situation I won't see again.

I guess what triggered my question was wondering why the database hadn't dumped these idle connections that have been hanging around for a month or more. I'll keep an eye on things and, if the problem persists, I'll look into more aggressive connection pool management options.


UPDATE/EDIT

Since posting this question 4 months ago, I, unfortunately, haven't been actively checking the status of the idle connections. This morning, however, the error popped back up and I was reminded of this post. Again, executing SELECT * FROM pg_stat_activity; showed over 100 connections (the max defined by my server configuration), of which over 70 were idle / DISCARD ALL with "older" (several weeks/months) values for the backend_start / query_start date.

Understanding that the database itself doesn't "manage" these, I still find it odd that I never encountered this error in more than a decade of running previous versions of PostgreSQL; it only appeared after we upgraded to v12, even though the environment has remained basically the same.

Regardless, because I have so many "legacy" applications that potentially could be leaving these connections open, I've decided to implement a check in one of my daily routines that should prevent this issue from reoccurring. I have an application that is run every day to replicate and clean up certain information in the database. In that application, I am adding a small function that will execute the following SQL command to keep the backend clear of these "leftover" idle connections:

SELECT
    pg_terminate_backend(pid)
FROM
    pg_stat_activity
WHERE
    pid <> pg_backend_pid()
    AND backend_start < (SELECT (NOW() - INTERVAL '2 WEEK')::DATE)
    AND state = 'idle';

As you can see, this should look for any idle connections older than 2 weeks and terminate them. Since this application is run daily, I believe this will help to keep the pool as clear as possible. Of course, if there's a better way – one that does not require the implementation of some new 3rd-party solution (at least, at this time) – I'm all ears, but, for the time being, I believe this will "solve" my problem.
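If terminating on backend_start turns out to be too blunt (a pooled connection can be weeks old and still in regular use between queries), a variant keyed off state_change, which records when the session last changed state and so roughly how long it has actually been idle, might be closer to the intent. Just a sketch, with an arbitrary 1-day interval:

SELECT
    pg_terminate_backend(pid)
FROM
    pg_stat_activity
WHERE
    pid <> pg_backend_pid()
    AND state = 'idle'
    AND state_change < NOW() - INTERVAL '1 day';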


ADDITIONAL REFERENCE / POSSIBLE ALTERNATE SOLUTION

As I was looking around a bit more, I also came across this answer on Stack Overflow:

How to close idle connections in PostgreSQL automatically?

The linked answer refers to the idle_in_transaction_session_timeout setting introduced in PostgreSQL 9.6. As stated in fresko's answer, I should be able to set this with a single SQL command, either server-wide as a superuser:

ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';

or for individual user connections on a per-session basis (if needed):

SET SESSION idle_in_transaction_session_timeout = '5min';

I checked this setting on my server and it appears to be disabled (0), so I may look into configuring this setting as well. I'll be referring to the v12 documentation page for Client Connection Defaults in case anyone wants to "follow along".
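One caveat I noticed while reading: idle_in_transaction_session_timeout only terminates sessions that are idle inside an open transaction (state "idle in transaction" in pg_stat_activity), not sessions that are simply "idle" like the DISCARD ALL ones above, so it may not clear those on its own. In any case, checking and persisting the setting looks something like this (the 5-minute value is just an example):

-- Current value; 0 means the timeout is disabled.
SHOW idle_in_transaction_session_timeout;

-- Persist a server-wide default and reload the configuration so sessions pick it up.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();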

Best Answer

Regardless of whatever issues you might have, pg_cancel_backend cancels the backend's current query. But an idle backend doesn't have a current query, which is what makes it idle. You want pg_terminate_backend.
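To make the distinction concrete (12345 stands in for a pid taken from pg_stat_activity):

-- Cancels whatever query the backend is currently running; does nothing for an idle session.
SELECT pg_cancel_backend(12345);

-- Terminates the backend process itself, which is what actually frees the connection slot.
SELECT pg_terminate_backend(12345);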

More fundamentally, it seems like you have an application problem or a connection pooler problem, but you haven't described them enough for us to give advice.

If you just want to kick the can down the road a bit further, you can increase max_connections. Having a large number of idle connections (not "idle in transaction"!) is not desirable, but is also not a huge problem.
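If you go that route, note that max_connections is a restart-only setting; something like the following (200 is an arbitrary example) takes effect only after the service is restarted:

ALTER SYSTEM SET max_connections = 200;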

prior to upgrading our PGSQL server to v12.1, we ran v9.4 for a number of years and never once encountered this issue.

So 0 vs 1? That doesn't sound like a sample size on which you can draw a conclusion. Maybe 9.4 just got rebooted for other reasons often enough to keep the leak under control. Or maybe someone increased the max_connections to a large enough value that it was never reached between reboots, but that increase didn't survive the upgrade.