Postgresql – Promotion of PostgreSQL standby and application reconnection

failoverpostgresqlreplication

Background:

I work for a small company. We have been using PostgreSQL on Heroku as one of our core technologies. While, in general, we've been happy with Heroku, the company has made the decision that our monthly bill with them is getting prohibitively expensive. As a result, the engineering team has decided to port everything over to DigitalOcean. Since I'm the most experienced database developer on the team, I've been given the task of figuring out "what to do about the databases". To be clear, I've got almost 20 years of writing SQL statements and stored procedures and 4 years with PostgreSQL, but I'm definitely not a DBA (And frankly, I'm scared to death that I'm going to screw things up). I've spent every night for a week reading blogs and watching videos on configuration strategies for "high availability". What I've decided to do is create a master, with 2 hot standbys. For each service/product we have, the architecture is basically the same:

  • 1:n webserver(s)
  • 1:n jobserver(s)
  • 1:n reporting server(s)

Web and job(s) will be pointed to master, and the reporting server(s) will be pointed to 1 of the hot standbys.

note: We are using Anisble for Cloud/IT automation.

So, the idea is that if the master fails, the most up-to-date standby gets promoted to Master. Obviously, that means that the other hot standby must be switched to the new master, and the all the servers (web, job, reporting) must be reconfigured and reconnected.

Questions

I've read a lot about how to do this with something like repmgr. But, nothing on how all of the other servers know about this change and are reconfigured. Is there a simple way this is done? With Anisible, we could have a playbook that connects and changes environment vars and restarts the services, but it's hard for me to believe this is the way everyone does it. It feels like there is a piece missing for me. It feels like you'd just need/want some kind of a proxy in the middle and just reconfigure it. Is this where people use something like PgBouncer or PgPool?

Any help will be greatly appreciated.

Relevant technologies:

  • PostgreSQL 9.3 (but, could move up to 9.4 if there is a reason)
  • DigitalOcean
  • Ubuntu 14.04
  • Python 2.7
  • Ansible

Best Answer

To be truly highly available you will need some logic built into the application to handle this. You can do this sort of thing with pg_pool, but what if pg_pool breaks?

I think there are many ways you could work this one, but the way I would do it would be:

Master DB (lets call it A), and two slaves (lets call them B and C). B pulls changes from A and C pulls changes from B in a cascading replication setup. This means if A fails and is a gonner promote B, no need to change C as its already a slave of B, when A is back you make it a slave of C and so on.

You will need to make changes to your application so that it has some intelligence in it regarding the state of the database. This would be typically done by having a layer that manages the connections, and it runs a test query (like "SELECT current_timestamp") if the query passes all is fine and it gives the connection to your application to use. Each application will need to know a list of servers to use in which order so the one that writes will use A first then B. So in the event the primary fails it tries the secondary (and potentially the third and so on), until it gets one where the test query works.

With this in place if C fails its only read only, the applications will just fail over to B or A and start using C if you get it back. If B fails, same deal though you might want to do something with C pointing it to A. If A fails then the system will go read only (but will still work for all read queries). There will then need to be some human intervention to say "yep A is gone", promote B (which will only take seconds), when B is promoted the writers will go though the connection layer, it will test A and fail test B and pass to the application will get a connection to B which is now a master so writes will be allowed.

I appreciate that this is still not a full HA solution (as there is still a manual step, but its a very simple and quick one), but I think we are going to have to wait for PostgreSQL 9.5 for that.

As I say I dont think there is one dead cert right answer to this, but this is how I would have done it.

Hope it helps