PostgreSQL – Troubleshooting PgPool Query Lags

load-balancing, pgpool, postgresql, postgresql-9.2

I have been using pgpool-II for quite a while but just noticed an intermittent lag (a slow request taking about 20 s to execute) which I am able to replicate consistently. Before I explain, here is a quick rundown of the underlying architecture:

  • 2 different database setups (1 master, 2 replicas) using streaming replication
  • Each of which interacts with its respective pgpool server
  • 2 client applications balanced across 5 servers
  • Pgpool Load balance mode turned on

I was able to rule out slow queries by pointing the connections directly at the master server. I can only reproduce the lag consistently when a different client application subsequently queries pgpool in repetition; it usually appears around the second iteration. Setting connection_cache = off makes the lag less frequent and less severe, but it still happens nonetheless. I tried turning on pgpool logging to figure out the issue, but after tailing pgpool.log there is so much output that I have no idea what to look for, and grepping specifically for the string ERROR yields nothing while the lag is happening.
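
To illustrate what I mean by querying in repetition, a minimal loop along the following lines (a rough sketch, assuming psycopg2 and placeholder host, database and credentials for the pgpool node) times each connect-and-query cycle, so a slow iteration stands out immediately:

    # Hypothetical reproduction script: time repeated connect+query cycles
    # through pgpool. Host, database and credentials are placeholders.
    import time
    import psycopg2

    PGPOOL_DSN = "host=pgpool3 port=5432 dbname=mydb user=app password=secret"

    for i in range(1, 11):
        start = time.monotonic()
        conn = psycopg2.connect(PGPOOL_DSN)  # goes through pgpool, not directly to PostgreSQL
        with conn.cursor() as cur:
            cur.execute("SELECT 1")          # trivial query: any delay is connection/pool overhead
            cur.fetchone()
        conn.close()
        print(f"iteration {i}: {time.monotonic() - start:.2f}s")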

Here is the configuration:

listen_addresses = '*'
port = 5432
socket_dir = '/var/run/postgresql/'
listen_backlog_multiplier = 2
pcp_listen_addresses = '*'
pcp_port = 9898
pcp_socket_dir = '/var/run/postgresql/'
backend_hostname0 = 'database3-master'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/var/lib/pgsql/data'
backend_flag0 = 'DISALLOW_TO_FAILOVER'
backend_hostname1 = 'database3-replica'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/var/lib/pgsql/data'
backend_flag1 = 'DISALLOW_TO_FAILOVER'
backend_hostname2 = 'database3-replica2'
backend_port2 = 5432
backend_weight2 = 1
backend_data_directory2 = '/var/lib/pgsql/data'
backend_flag2 = 'DISALLOW_TO_FAILOVER'
enable_pool_hba = on
pool_passwd = 'pool_passwd'
authentication_timeout = 60
ssl = off
num_init_children = 32
max_pool = 4
child_life_time = 300
child_max_connections = 0
connection_life_time = 60
client_idle_limit = 0
log_destination = 'stderr'
log_connections = on
log_hostname = off
log_statement = off
log_per_node_statement = off
log_standby_delay = 'none'
syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'
debug_level = 1
pid_file_name = '/var/run/pgpool/pgpool.pid'
logdir = '/var/log/pgpool/'
connection_cache = off
replication_mode = off
replicate_select = off
insert_lock = on
replication_stop_on_mismatch = off
failover_if_affected_tuples_mismatch = off
load_balance_mode = on
ignore_leading_white_space = on
black_function_list = 'nextval,setval,nextval,setval'
allow_sql_comments = off
master_slave_mode = on
master_slave_sub_mode = 'stream'
sr_check_period = 0
sr_check_user = 'pgpool'
sr_check_password = 'password'
delay_threshold = 0
follow_master_command = '/bin/echo %M > /tmp/postgres_master'
health_check_period = 30
health_check_timeout = 20
health_check_user = 'pg_produser'
health_check_password = '9password'
health_check_max_retries = 0
health_check_retry_delay = 1
connect_timeout = 10000
failover_command = '/etc/pgpool-II/failover.sh %d %H %P /tmp/postgresql.trigger.failover startup-pgpool4'
fail_over_on_backend_error = on
search_primary_node_timeout = 10
recovery_user = 'pgpool'
recovery_password = 'password'
recovery_timeout = 90
client_idle_limit_in_recovery = 0
use_watchdog = on
wd_hostname = 'pgpool3'
wd_port = 9000
wd_authkey = ''
wd_escalation_command = '/bin/bash /etc/pgpool-II/pgpool-failover.sh'
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
wd_heartbeat_port = 9694
heartbeat_destination0 = 'pgpool4'
heartbeat_destination_port0 = 9694
other_pgpool_hostname0 = 'pgpool4'
other_pgpool_port0 = 5432
other_wd_port0 = 9000
relcache_expire = 0
relcache_size = 256
check_temp_table = on
check_unlogged_table = on
memory_cache_enabled = off

I have been seriously considering another solution but have not found much. Ideally what I want is some middleware that splits requests, sending writes to the master and reads to the replicas. HAProxy cannot do this because it does not parse queries.

At the application level, is there an ideal solution for splitting read and write queries? I would assume any query containing an UPDATE or INSERT would go to the master server, but I feel like I am missing something.
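
To make the question concrete, here is a minimal sketch of the kind of application-level splitting I have in mind, assuming psycopg2 and hypothetical DSNs for the master and one replica (run_query and the DSNs are illustrative names, not an existing library):

    # Minimal read/write splitting sketch (assumptions: psycopg2, hypothetical DSNs).
    import psycopg2

    PRIMARY_DSN = "host=database3-master port=5432 dbname=mydb user=app"
    REPLICA_DSN = "host=database3-replica port=5432 dbname=mydb user=app"

    # Statements treated as writes; anything ambiguous should default to the master.
    WRITE_PREFIXES = ("insert", "update", "delete", "create", "alter", "drop", "truncate")

    def run_query(sql, params=None):
        """Route writes to the master and plain reads to a replica."""
        is_write = sql.lstrip().lower().startswith(WRITE_PREFIXES)
        conn = psycopg2.connect(PRIMARY_DSN if is_write else REPLICA_DSN)
        try:
            with conn:  # transaction scope: commit on success, rollback on error
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall() if cur.description else None
        finally:
            conn.close()

    rows = run_query("SELECT id, name FROM customers WHERE active = %s", (True,))
    run_query("UPDATE customers SET active = %s WHERE id = %s", (False, 42))

One caveat with prefix-based routing: a plain SELECT is not always a safe read. SELECT ... FOR UPDATE and SELECTs calling functions with side effects (nextval, setval) need the master, which is why black_function_list exists in the configuration above, and reads issued right after a write can race against replication lag.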

So to recap:

  1. Is there an efficient way to debug pgpool lags, and if so, what should I be looking for?
  2. If not, is there an ideal solution besides pgpool for splitting read/write queries to replica/master respectively?
  3. If not, what is the most reliable way to split read/write queries at the application level, in a high-level, language-agnostic explanation?

Best Answer

I seem to have partially figured out the answer to my first question, and I have also found a change that greatly reduces the lag, although it may pose other issues down the road.

First, how I was able to get some meaningful debugging output:

  • pass the -d parameter for debugging: /usr/bin/pgpool -f /etc/pgpool-II/pgpool.conf -d -n

  • edit pgpool.conf to set the following:

    log_connections = on
    log_hostname = on
    log_statement = on
    log_per_node_statement = on
    
  • restart pgpool and run tcpdump, filtering for the particular incoming client connection:

    tcpdump -i eth1 | grep '[clientname] * > [pgpoolhostname]'

  • in a separate window grep the pgpool logs for LOG and any other useful information:

    tail -f /var/log/pgpool/pgpool.log | grep "LOG\|error\|anythingelseuseful"

I was able to determine that a TCP connection was being established, but pgpool did not respond with anything until several seconds later. This leads me to believe that there was an issue with the idle connections accepting new requests.

In pgpool the number of idle connections is controlled with:

num_init_children = 20
                               # Number of pools
                               # (change requires restart)
max_pool = 4
                               # Number of connections per pool
                               # (change requires restart)

I simply changed num_init_children to 100 and max_pool to 2.

This may, however, introduce different issues down the road, as there are now up to 100 idle pgpool child processes (holding as many as 200 cached backend connections) on the server at once. Adjusting these connection-pool settings does seem to resolve the lag. More information about these variables can be found here:

http://www.pgpool.net/mediawiki/index.php/Relationship_between_max_pool,_num_init_children,_and_max_connections
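
As I read that page, the sizing rule is that each of the num_init_children children can cache up to max_pool backend connections, so PostgreSQL's max_connections has to cover their product (plus reserved connections). A quick back-of-the-envelope check for the new settings, as a throwaway Python snippet:

    # Back-of-the-envelope check of the pgpool sizing rule from the linked wiki
    # page (my reading of it; verify against your pgpool version).
    num_init_children = 100   # pgpool child processes
    max_pool = 2              # cached backend connections per child

    backend_conns = num_init_children * max_pool  # 200
    print(f"pgpool may hold up to {backend_conns} backend connections per node")
    print(f"so PostgreSQL needs max_connections >= {backend_conns} "
          f"(plus superuser_reserved_connections)")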