PostgreSQL – Troubleshooting PgPool Query Lags

load-balancing, pgpool, postgresql, postgresql-9.2

I have been using pgpool-II for quite a while but just noticed an intermittent lag (a slow request taking about 20 s to execute) which I am able to replicate consistently. Before I explain, here is a quick rundown of the underlying architecture:

  • 2 different database setups (1 master, 2 replicas) using streaming replication
  • Each of which interacts with its respective pgpool server
  • 2 client applications balanced across 5 servers
  • Pgpool Load balance mode turned on

I was able to rule out slow queries by pointing the connections directly at the master server. I can only reproduce the lag consistently when a different client application subsequently queries pgpool in repetition; it usually appears around the second iteration. Setting connection_cache = off makes the lag less frequent and less severe, but it still happens nonetheless. I tried turning on pgpool logging to figure out the issue, but after tailing pgpool.log there is so much output that I have no idea what to look for, and grepping specifically for the string ERROR yields nothing while the lag is happening.
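
To illustrate what I mean by querying in repetition, a minimal loop along the following lines (a rough sketch, assuming psycopg2 and placeholder host, database and credentials for the pgpool node) times each connect-and-query cycle, so a slow iteration stands out immediately:

    # Hypothetical reproduction script: time repeated connect+query cycles
    # through pgpool. Host, database and credentials are placeholders.
    import time
    import psycopg2

    PGPOOL_DSN = "host=pgpool3 port=5432 dbname=mydb user=app password=secret"

    for i in range(1, 11):
        start = time.monotonic()
        conn = psycopg2.connect(PGPOOL_DSN)  # goes through pgpool, not directly to PostgreSQL
        with conn.cursor() as cur:
            cur.execute("SELECT 1")          # trivial query: any delay is connection/pool overhead
            cur.fetchone()
        conn.close()
        print(f"iteration {i}: {time.monotonic() - start:.2f}s")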

Here is the configuration:

listen_addresses = '*'
port = 5432
socket_dir = '/var/run/postgresql/'
listen_backlog_multiplier = 2
pcp_listen_addresses = '*'
pcp_port = 9898
pcp_socket_dir = '/var/run/postgresql/'
backend_hostname0 = 'database3-master'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/var/lib/pgsql/data'
backend_flag0 = 'DISALLOW_TO_FAILOVER'
backend_hostname1 = 'database3-replica'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/var/lib/pgsql/data'
backend_flag1 = 'DISALLOW_TO_FAILOVER'
backend_hostname2 = 'database3-replica2'
backend_port2 = 5432
backend_weight2 = 1
backend_data_directory2 = '/var/lib/pgsql/data'
backend_flag2 = 'DISALLOW_TO_FAILOVER'
enable_pool_hba = on
pool_passwd = 'pool_passwd'
authentication_timeout = 60
ssl = off
num_init_children = 32
max_pool = 4
child_life_time = 300
child_max_connections = 0
connection_life_time = 60
client_idle_limit = 0
log_destination = 'stderr'
log_connections = on
log_hostname = off
log_statement = off
log_per_node_statement = off
log_standby_delay = 'none'
syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'
debug_level = 1
pid_file_name = '/var/run/pgpool/pgpool.pid'
logdir = '/var/log/pgpool/'
connection_cache = off
replication_mode = off
replicate_select = off
insert_lock = on
replication_stop_on_mismatch = off
failover_if_affected_tuples_mismatch = off
load_balance_mode = on
ignore_leading_white_space = on
black_function_list = 'nextval,setval,nextval,setval'
allow_sql_comments = off
master_slave_mode = on
master_slave_sub_mode = 'stream'
sr_check_period = 0
sr_check_user = 'pgpool'
sr_check_password = 'password'
delay_threshold = 0
follow_master_command = '/bin/echo %M > /tmp/postgres_master'
health_check_period = 30
health_check_timeout = 20
health_check_user = 'pg_produser'
health_check_password = '9password'
health_check_max_retries = 0
health_check_retry_delay = 1
connect_timeout = 10000
failover_command = '/etc/pgpool-II/failover.sh %d %H %P /tmp/postgresql.trigger.failover startup-pgpool4'
fail_over_on_backend_error = on
search_primary_node_timeout = 10
recovery_user = 'pgpool'
recovery_password = 'password'
recovery_timeout = 90
client_idle_limit_in_recovery = 0
use_watchdog = on
wd_hostname = 'pgpool3'
wd_port = 9000
wd_authkey = ''
wd_escalation_command = '/bin/bash /etc/pgpool-II/pgpool-failover.sh'
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
wd_heartbeat_port = 9694
heartbeat_destination0 = 'pgpool4'
heartbeat_destination_port0 = 9694
other_pgpool_hostname0 = 'pgpool4'
other_pgpool_port0 = 5432
other_wd_port0 = 9000
relcache_expire = 0
relcache_size = 256
check_temp_table = on
check_unlogged_table = on
memory_cache_enabled = off

I have been seriously considering another solution but have not found much. Ideally what I want is some middleware that splits requests, sending writes to the master and reads to the replicas. HAProxy cannot do this because it does not parse queries.

At the application level, is there an ideal solution for splitting read and write queries? I would assume any query containing an UPDATE or INSERT would go to the master server, but I feel like I am missing something.
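
To make the question concrete, here is a minimal sketch of the kind of application-level splitting I have in mind, assuming psycopg2 and hypothetical DSNs for the master and one replica (run_query and the DSNs are illustrative names, not an existing library):

    # Minimal read/write splitting sketch (assumptions: psycopg2, hypothetical DSNs).
    import psycopg2

    PRIMARY_DSN = "host=database3-master port=5432 dbname=mydb user=app"
    REPLICA_DSN = "host=database3-replica port=5432 dbname=mydb user=app"

    # Statements treated as writes; anything ambiguous should default to the master.
    WRITE_PREFIXES = ("insert", "update", "delete", "create", "alter", "drop", "truncate")

    def run_query(sql, params=None):
        """Route writes to the master and plain reads to a replica."""
        is_write = sql.lstrip().lower().startswith(WRITE_PREFIXES)
        conn = psycopg2.connect(PRIMARY_DSN if is_write else REPLICA_DSN)
        try:
            with conn:  # transaction scope: commit on success, rollback on error
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall() if cur.description else None
        finally:
            conn.close()

    rows = run_query("SELECT id, name FROM customers WHERE active = %s", (True,))
    run_query("UPDATE customers SET active = %s WHERE id = %s", (False, 42))

One caveat with prefix-based routing: a plain SELECT is not always a safe read. SELECT ... FOR UPDATE and SELECTs calling functions with side effects (nextval, setval) need the master, which is why black_function_list exists in the configuration above, and reads issued right after a write can race against replication lag.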

So to recap:

  1. Is there an efficient way to debug pgpool lags, and if so, what should I be looking for?
  2. If not, is there an ideal solution besides pgpool for splitting read/write queries to replica/master respectively?
  3. If not, what is the most reliable way to split read/write queries at the application level, in a high-level, language-agnostic explanation?

Best Answer

I seem to have partially figured out the answer to my first question, and I have also found a change that greatly reduces the lag, although it may pose other issues down the road.

First, how I was able to get some meaningful debugging output:

  • pass the -d parameter for debugging: /usr/bin/pgpool -f /etc/pgpool-II/pgpool.conf -d -n

  • edit pgpool.conf to set the following:

    log_connections = on
    log_hostname = on
    log_statement = on
    log_per_node_statement = on
    
  • restart pgpool and run tcpdump, filtering for the particular incoming client connection:

    tcpdump -i eth1 | grep '[clientname] * > [pgpoolhostname]'

  • in a separate window grep the pgpool logs for LOG and any other useful information:

    tail -f /var/log/pgpool/pgpool.log | grep "LOG\|error\|anythingelseuseful"

I was able to determine that a TCP connection was being established, but pgpool did not respond with anything until several seconds later. This leads me to believe that there was an issue with the idle connections accepting new requests.

In pgpool the number of idle connections is controlled with:

num_init_children = 20
                               # Number of pools
                               # (change requires restart)
max_pool = 4
                               # Number of connections per pool
                               # (change requires restart)

I simply changed num_init_children to 100 and max_pool to 2.

This may, however, introduce different issues down the road, as there are now up to 100 idle pgpool child processes (holding as many as 200 cached backend connections) on the server at once. Adjusting these connection-pool settings does seem to resolve the lag. More information about these variables can be found here:

http://www.pgpool.net/mediawiki/index.php/Relationship_between_max_pool,_num_init_children,_and_max_connections
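
As I read that page, the sizing rule is that each of the num_init_children children can cache up to max_pool backend connections, so PostgreSQL's max_connections has to cover their product (plus reserved connections). A quick back-of-the-envelope check for the new settings, as a throwaway Python snippet:

    # Back-of-the-envelope check of the pgpool sizing rule from the linked wiki
    # page (my reading of it; verify against your pgpool version).
    num_init_children = 100   # pgpool child processes
    max_pool = 2              # cached backend connections per child

    backend_conns = num_init_children * max_pool  # 200
    print(f"pgpool may hold up to {backend_conns} backend connections per node")
    print(f"so PostgreSQL needs max_connections >= {backend_conns} "
          f"(plus superuser_reserved_connections)")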