PostgreSQL – How to Set Up a Shadow Production Database for PostgreSQL

Tags: development, postgresql, testing

I'm working on a Django project and I'd like to avoid the common pitfall of using SQLite for development and PostgreSQL for production referenced in Two Scoops of Django.

I'd like to take this a step further and use a "shadow production database" (shadow prod DB) for PostgreSQL. How can I get this set up? I think that once the shadow prod DB is set up and configured, swapping the database settings from "production" to "development" (shadow prod) will be trivial in the Django application's settings.py file.

However, when it comes to setting up the framework where the shadow prod DB actually records all of the same transactions that occur in the production database, I have no idea where to start. Any help is greatly appreciated.

Further, what do I do if I want to get the shadow prod DB onto a laptop? What happens when the database size grows prohibitively large?

Thanks, Scott.

For posterity, I originally heard about the concept of a "shadow prod DB" on the podcast Scaling Postgres, which linked to this post from Heap.

Best Answer

One option is to use a "hot standby" server, either with log shipping or with streaming replication (or both). This automatically applies changes from the master to the replica whenever there is connectivity. The problem is that a hot standby is read-only. If your Django application never needs to write to the database, this is fine, but if it does, this will not work. In practice that rules it out for most applications, since even an application used purely for reporting will typically write to the database when someone logs in. Another potential problem is that unlogged tables are not available on a hot standby.
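If you do go the hot-standby route, the setup looks roughly like the sketch below. This is only an outline, assuming PostgreSQL 12 or later; the role name, slot name, host, and data directory are placeholders, and you also need a pg_hba.conf entry allowing replication connections from the standby.

    -- On the primary: create a replication role and a physical replication slot.
    -- "replicator" and "shadow_prod" are illustrative names, not anything standard.
    CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'change-me';
    SELECT pg_create_physical_replication_slot('shadow_prod');

    -- On the standby machine, take the base copy with pg_basebackup, e.g.
    --   pg_basebackup -h primary-host -U replicator -D /var/lib/postgresql/data -S shadow_prod -R
    -- The -R flag writes standby.signal and primary_conninfo so the copy
    -- starts up as a read-only hot standby.

    -- Back on the primary: confirm the standby is streaming.
    SELECT client_addr, state, sync_state FROM pg_stat_replication;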

Another possibility is to set up a standby server and then promote it when you need to use it. Once promoted it becomes writable, but it severs the connection to the master and will slowly diverge, so you will have to recreate it every now and then to bring the two back into sync. If you need to minimize downtime, you can keep using the old promoted copy while re-establishing a new standby in the background. This is a bit more work, and it requires enough storage for two full copies of the database (although it could be less if your master changes slowly and your file system supports fancy deduplication).
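Promotion itself is a one-liner on the standby. A minimal example, again assuming PostgreSQL 12 or later (the data directory path is a placeholder):

    -- Run on the standby to make it writable (this severs it from the primary):
    SELECT pg_promote(wait => true);

    -- From the shell, "pg_ctl promote -D <data-directory>" does the same thing.

    -- Verify it has left recovery; this should now return false.
    SELECT pg_is_in_recovery();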

You could use logical replication to replicate changes from the master into a "live", writable test database. This is a lot more work: you need to go through all the tables and decide which ones should have changes replicated from the master, which should stand alone on the replica after the initial sync, and which need to do both with some kind of conflict resolution. And if the purpose of this dev database is to test work that is likely to have bugs and push bad data into the database, while protecting production from that bad data, it won't work very well either: locally caused corruption stays in the local database, replicating new changes from the master is unlikely to fix it, and you will need a full refresh to get rid of the bad data. Overall, this method will probably be more work but no more effective than the previous solution. It also will not replicate schema changes, so if your master schema has not stabilized and is still getting new tables, new columns, etc., this will be a nightmare.
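For reference, the basic logical-replication plumbing looks roughly like this. The publication, subscription, database, connection, and table names are placeholders; it assumes wal_level = logical on the primary and that the subscriber already has matching table definitions, since logical replication does not copy schema.

    -- On the production (publisher) side:
    CREATE PUBLICATION shadow_pub FOR TABLE public.app_customer, public.app_order;

    -- On the shadow/dev (subscriber) side:
    CREATE SUBSCRIPTION shadow_sub
        CONNECTION 'host=primary-host dbname=appdb user=replicator password=change-me'
        PUBLICATION shadow_pub;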

Further, what do I do if I want to get the shadow prod DB onto a laptop? What happens when the database size grows prohibitively large?

Either create a demo database which has enough data to serve its purpose and forget about trying to keep it up to date, or get a bigger laptop. If you really need to have all the latest data from the master but don't need the old stuff, you can use logical replication for that: create your subscription with copy_data = false to avoid copying the historical data. Presumably you need at least some history, but it would then be on you to figure out how much and how to get it.
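A minimal sketch of that subscription, using the same placeholder names as above:

    -- Subscribe to future changes only; existing rows are not copied.
    CREATE SUBSCRIPTION shadow_sub
        CONNECTION 'host=primary-host dbname=appdb user=replicator password=change-me'
        PUBLICATION shadow_pub
        WITH (copy_data = false);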