Postgresql – pgBouncer works great but occasionally becomes unavailable

pgbouncerpostgresqlpython

I'm running pgBouncer in front of a busy postgres 9 database.
For most of the time it works fine. But every few hours I'll get an error email from my application with an exception from psycopg2:

OperationalError('could not connect to server: Cannot assign requested address
Is the server running on host "neo-hulk" and accepting
TCP/IP connections on port 6432?')

This is a python app with a bunch of celery workers running tasks. When those errors arrive I check the pgbouncer db and the pool size is within limits.
After some experimentation I've set the pool max size to 400 and pool size to 200. pool mode is "session" (requests are mostly auto-commit, almost no transactions).

What makes pgBouncer 'vanish' like that? its only for short periods of time (and in total we're talking about a tiny amount of requests compared to the sheer volume of requests its handing) but those requests that fail are important.

Thanks!

Best Answer

The "Cannot assign requested address" part in the error message comes from the kernel TCP stack. When encountered intermittently, this typically means that the space of available sockets is exhausted because of too much sockets in wait state (TIME_WAIT, or less probably FIN_WAIT_1 or FIN_WAIT_2)

The range of socket ports can be output by cat /proc/sys/net/ipv4/ip_local_port_range. The default value on a stock Linux kernel is generally 32768 61000.

You may check the result of netstat -ton|grep WAIT on the client(s) and on the pgBouncer's host when the system is busy. The -o flag will show the timeout counters related to wait states.

If the total number of TCP sockets is close to 61000-32768=28232 then exhaustion of this range is likely your problem. Since a closed socket spends 60 seconds in TIME_WAIT state in normal condition, if a client host connects more than 28232 times in one minute, new connections will fail with the mentioned error until ports are freed.

As a first workaround, the TCP ports range may be extended:

 # echo "1025 65535" >/proc/sys/net/ipv4/ip_local_port_range

If it's not satisfactory, check the tcp_tw_recycle and tcp_tw_reuse flags, also tunable through /proc/sys/net/ipv4 and sysctl.

They're defined as (from man tcp):

       tcp_tw_recycle (Boolean; default: disabled; since Linux 2.4)
              Enable  fast  recycling  of  TIME_WAIT  sockets.   Enabling this
              option is not recommended since this causes problems when  work‐
              ing with NAT (Network Address Translation).

       tcp_tw_reuse (Boolean; default: disabled; since Linux 2.4.19/2.6)
              Allow  to reuse TIME_WAIT sockets for new connections when it is
              safe from protocol viewpoint.  It should not be changed  without
              advice/request of technical experts.

Personally I had success with tcp_tw_recycle when faced with this problem with a MySQL client app, but don't take this as a recommendation, my understanding of TCP being superficial at best.