PostgreSQL – Restoring PostgreSQL tables faster

backup, postgresql

There's a PostgreSQL PITR setup running on a fairly large system, with hundreds of databases containing hundreds of tables each, and we're running into time constraints when trying to restore individual tables from a single database or from a set of databases. PITR is managed with Barman.

With this setup, we need to restore a whole cluster (which could contain dozens of databases), extract the single table, and then restore the data – which can take hours.
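
For context, the current workflow looks roughly like the sketch below; the Barman server name, backup ID, target time, ports, paths, and the database and table names are all illustrative:

    # Recover the whole cluster to a staging area, to just before the mistake.
    barman recover --target-time "2024-01-01 11:55:00+00" \
        main latest /restore/staging/pgdata

    # Start the restored cluster on a side port; once recovery reaches the
    # target, copy the one damaged table back into the production database.
    pg_ctl -D /restore/staging/pgdata -o "-p 5499" start
    pg_dump -p 5499 --clean -t damaged_table mydb | psql -p 5432 -d mydb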

Statistically, given the number of databases and people involved, this happens often enough that it's becoming a problem.

Are there faster solutions which would give us PITR-like features? I've thought of doing manual pg_dump backups for every table individually every hour, but maybe there are better solutions?

Best Answer

I assume you have already tried the "shame people into being more careful" approach without success.

I think I would try for an in-database solution, like adding timestamp columns to the table and adding columns for "soft deletes" rather than doing hard deletes, in such a way that you can run queries to restore a table to its previous state. If forbidding user-level hard deletes is not an option, then you can keep some log tables around that allow you to craft such restorative queries. I use this module extensively for that purpose, although if I were starting from scratch I might consider something more modern like this. The triggers needed to do this logging impose a considerable overhead on DML operations, so you will have to assess whether you can tolerate that.
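
Neither of those modules is reproduced here, but the general shape of that kind of change-logging setup is roughly the following sketch, to be run through psql; the audit_log table and the function, trigger and table names are made up for the example and are not taken from either module:

    -- One generic log table per database (illustrative schema).
    CREATE TABLE IF NOT EXISTS audit_log (
        changed_at  timestamptz NOT NULL DEFAULT now(),
        table_name  text        NOT NULL,
        operation   text        NOT NULL,   -- 'INSERT', 'UPDATE' or 'DELETE'
        old_row     jsonb,
        new_row     jsonb
    );

    CREATE OR REPLACE FUNCTION audit_changes() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO audit_log (table_name, operation, old_row)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
        ELSIF TG_OP = 'UPDATE' THEN
            INSERT INTO audit_log (table_name, operation, old_row, new_row)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD), to_jsonb(NEW));
        ELSE
            INSERT INTO audit_log (table_name, operation, new_row)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
        END IF;
        RETURN NULL;   -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    -- Attach to each table you want to be able to "rewind" with a query.
    CREATE TRIGGER audit_important_table
        AFTER INSERT OR UPDATE OR DELETE ON important_table
        FOR EACH ROW EXECUTE FUNCTION audit_changes();

A restorative query can then, for example, re-insert the rows an accidental DELETE removed by expanding the logged old_row values back into the original table with jsonb_populate_record.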

A benefit of this approach is that you can often come up with queries that will fix one person's mistake without losing all the changes made after it. Of course, if the bad data from the oopsie leads to cascading errors because later work depended on it, this may not be possible.

If that won't work, perhaps a hybrid approach would be a good idea. You could take a pg_dump snapshot every hour and keep the snapshots for 12 hours (or whatever amount of storage you are willing to dedicate to this purpose). On the rare occasions when you need to go back more than 12 hours, fall back to the PITR recovery method instead. If you go this route, you might want to use the -Fc or -Fd options for pg_dump. That way you can take just one dump of each database per hour, rather than one per table, and choose which tables to restore at restoration time rather than at dump time.
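
As a sketch of that hybrid approach, an hourly cron job could look something like this; the database name, paths and the 12-snapshot retention are illustrative:

    #!/usr/bin/env bash
    # Hourly snapshot of one database in custom format; -Fc means individual
    # tables can be picked at restore time rather than at dump time.
    set -euo pipefail

    DB="mydb"                   # illustrative database name
    OUTDIR="/backups/hourly"    # illustrative snapshot directory
    STAMP=$(date +%Y%m%d%H)

    pg_dump -Fc -f "$OUTDIR/${DB}_${STAMP}.dump" "$DB"

    # Keep the 12 most recent snapshots, i.e. roughly 12 hours of history.
    ls -1t "$OUTDIR/${DB}"_*.dump | tail -n +13 | xargs -r rm --

    # Later, restore just the damaged table from whichever snapshot you need:
    #   pg_restore -d "$DB" --clean -t damaged_table "$OUTDIR/${DB}_2024010112.dump"

With -Fd instead of -Fc you get a directory with one file per table and can also parallelise the dump with -j; either format lets pg_restore pick individual tables.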

If you want to stick with your current practice, there are many ways to make it faster, but it depends on what the current bottleneck is. If the bottleneck is restoring the base backup from the archive before you can start recovery, then you can keep some already-restored copies around, so that you can initiate recovery quickly from a pre-staged backup by just editing its recovery configuration (recovery.conf, or postgresql.conf plus a recovery.signal file on v12 and later) and starting it. Once used, you would throw it away and recreate it in the background so it is ready to go for next time.

If the bottleneck is instead the time it takes to replay WAL during the PITR itself, then you could take more frequent base backups and use the most recent one that is old enough to be usable for your desired restoration time, so there is less work to recover. For example, take a base backup every day, but keep only the one from today and the one from yesterday: every time you successfully complete a base backup, delete the one from two days ago.
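
For the pre-staged copy idea, kicking off the targeted recovery might look roughly like this, assuming PostgreSQL 12 or later (recovery.signal plus recovery settings in postgresql.conf) and Barman's barman-wal-restore client for fetching WAL; the paths, host names, port and target time are illustrative:

    #!/usr/bin/env bash
    set -euo pipefail

    PGDATA="/restore/staging/pgdata"      # pre-restored base backup, kept ready
    TARGET_TIME="2024-01-01 11:55:00+00"  # just before the mistake happened

    # Point recovery at the WAL archive and at the desired point in time.
    {
        echo "restore_command = 'barman-wal-restore barman-host main %f %p'"
        echo "recovery_target_time = '$TARGET_TIME'"
        echo "recovery_target_action = 'pause'"
    } >> "$PGDATA/postgresql.conf"

    # Ask the server to perform targeted recovery on startup, then start it on
    # a side port; once the target is reached, pg_dump the table you need.
    touch "$PGDATA/recovery.signal"
    pg_ctl -D "$PGDATA" -o "-p 5499" start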

You can also keep a lagging hot standby, using "recovery_min_apply_delay" so that it is always around 12 hours behind. But by the time you stop that server, change its configuration to fast-forward it to the point in time you need, and then throw it away and recreate it for next time, it will probably not be much faster than just keeping a recent base backup around, and it adds a bit of complexity.
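
The settings for such a delayed standby are small; a minimal sketch, assuming a streaming replica on v12 or later whose postgresql.conf you control (the 12-hour delay, path and connection string are illustrative):

    # Keep the standby roughly 12 hours behind the primary.
    echo "recovery_min_apply_delay = '12h'" >> /pgdata/standby/postgresql.conf
    echo "primary_conninfo = 'host=primary-host user=replicator'" >> /pgdata/standby/postgresql.conf

    # To "fast forward" after a mistake: lower or remove the delay, reload, and
    # call pg_wal_replay_pause() once replay has reached the point you need.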