PostgreSQL – Restore postgres data/tablespace to new tablespace (at new mount point)

postgresql

I have a bunch of test/development databases which we currently refresh with production data as-needed using a NetApp filer's snapshot capabilities – we have a production slave with its datadir on a filer mount (NFS), and once a night (via cron) we shut down the slave, snapshot the filer volume, and then start the database back up. When we need to do a data refresh in a test/dev environment, we stop postgres there, clone the NetApp snapshot, and use that clone as the test/dev environment's data directory. This is wonderful from a storage point of view, as the clones only store changed blocks on the filer. We have a ~1.5TB datadir, but these clones are only a few MB each, because there's little changed data. (We're running Postgres 9.0.13, but if what I'm about to ask is "more possible" with a newer version, upgrading is a vague possibility.)
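To make that concrete, the nightly job and the per-environment refresh look roughly like this today (volume names, mount points and the filer steps are placeholders for our actual scripts, not the real commands):

    #!/bin/bash
    # Nightly, on the production slave (illustrative only):
    pg_ctl stop -D /mnt/pgslave/data -m fast      # stop the slave cleanly
    # <NetApp step: snapshot the volume backing /mnt/pgslave>
    pg_ctl start -D /mnt/pgslave/data             # bring the slave back up

    # On demand, in a test/dev VM:
    pg_ctl stop -D /mnt/testdb/data -m fast
    umount /mnt/testdb
    # <NetApp step: clone this morning's snapshot and export it to this host>
    mount filer:/vol/testdb_clone /mnt/testdb
    pg_ctl start -D /mnt/testdb/data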

Up until now, each test/dev environment has had its own postgres server instance, running as the only thing in a dedicated virtual machine. That's pretty wasteful of resources, since the instances are idle most of the time.

Now, we have to spin up somewhere around 100-150 of these environments. The NetApp is about the only way we can do it, because we simply don't have 150-225TB of disk to spare. It would also be a real pain (and inefficient) to run 100-150 separate machines, each running a single instance of postgres.

So, I'd like to replicate this setup, but instead of having one postgres instance on its own VM for each database, I'd like to do this with multiple databases in the same postgres instance on one host. I believe that tablespaces will be the key to this, but I'm not sure how to do it. Essentially, I'll have a datadir (an NFS mounted filer volume) and need to be able to unmount the current volume and mount a different one.

Or, put another way, I'd like to be able to "add" a tablespace that points to an already existing datadir. In other words, I have the filesystem data from a database called "test", at a given path on disk. How do I get postgres to use that as a database called "test25"?
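Conceptually, what I'd love to be able to do is something like the following (purely illustrative; I know CREATE TABLESPACE isn't designed to adopt a directory that already contains data, which is exactly the problem):

    # Mount a clone of the production data at a new path...
    mount filer:/vol/test25_clone /mnt/tablespaces/test25

    # ...then somehow tell the shared postgres instance to use that existing
    # data as a new database. This is the part I don't know how to do:
    psql -c "CREATE TABLESPACE ts_test25 LOCATION '/mnt/tablespaces/test25'"
    psql -c "CREATE DATABASE test25 TABLESPACE ts_test25"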

Is this even possible? Anyone have experience with, essentially, creating a tablespace that points to an existing data directory? Or, alternatively, swapping out the data directory of one tablespace with that of another?

PS – The slightly-less-abstract explanation:

We're spinning up nearly 100 (maybe 150) new test environments. We simply don't have the ~200TB of disk to maintain a separate DB for each of them, so using the NetApp filer and letting it do thin clones is an absolute requirement. We also need to be able to do quick restores to the latest daily production snapshot. Quick as in, the developers and testers run a script in their test environment that does the restore.

We're currently doing this for ~50 environments, and we run a separate VM with postgres for each one, so it's relatively straightforward – stop postgres, unmount the datadir, do the filer magic to refresh it with a clone of this morning's production DB, then re-mount the new filer volume (clone) and start postgres.

Unfortunately, the massive overhead of running a separate virtual machine with a separate postgres instance for every DB is more than we can handle when scaling 2-3x. In addition, the resource waste is awful (a dedicated VM running a dedicated postgres instance for each DB, most of which are idle about 14 hours/day). So, we'd like to run multiple databases in separate tablespaces, but on the same host and the same postgres instance, so N databases can use the same shared memory, etc.

The developers and testers who use these instances need to be able to "get fresh data" as often as needed (sometimes 3x/day). Ideally we'd be able to refresh one of these filer volumes without stopping postgres. But it's also an acceptable alternative to build out, say, 3x the number of DBs we need, and refresh all of the not-currently-in-use ones on a schedule every night, during a maintenance/downtime window. However, the key here is still that I have an NFS mount that contains existing data (the actual datadir or tablespace, not a pg_dump) and I need to "add" that to the postgres server without altering its contents.

Best Answer

I don't think you can do what you want here - certainly not without changes to PostgreSQL.

Tablespaces are not an independent datastore, and cannot be exchanged between different PostgreSQL instances. Making them so would be a huge change to PostgreSQL's core, with significant impacts on transactional DDL support, etc.

See this related answer I wrote on the topic, explaining some of the issues with attaching a tablespace to a different PostgreSQL instance: How do I access a old saved tablespace after reinstalling postgres?

If you were to create the entire database in a different tablespace then you'd be able to avoid the issues with pg_class and TOAST tables ... but you'd still have to deal with the problems with pg_clog, pg_xlog, transaction visibility, etc.
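For illustration (names invented), "the entire database in a different tablespace" means creating it that way from the outset on the source side, so that its per-database catalogs and TOAST tables also live under the tablespace directory:

    # On the source instance, before any snapshots are taken:
    psql -c "CREATE TABLESPACE ts_test LOCATION '/mnt/tablespaces/test'"
    psql -c "CREATE DATABASE test TABLESPACE ts_test"

That only helps with the per-database catalog and TOAST relations, though; pg_clog, pg_xlog and pg_control still live in the instance's main data directory.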

pg_xlog issues could potentially be solved if you were able to copy the tablespace at a point after a checkpoint, before any further writes were performed, so the tablespace was "clean". PostgreSQL doesn't currently offer any assistance to do this, but I think it'd be possible to add it with moderate code changes (at the cost of a stall in the production DB when you flushed it for the checkpoint and snapshotted it) by taking an exclusive lock on the internal xlog insert lock. Possibly do-able even as an extension function.
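The part that exists today is easy to show; the part that doesn't is the gap. Roughly (hypothetical paths, and the filer step is a placeholder):

    # Force a checkpoint so dirty buffers are flushed to disk:
    psql -c "CHECKPOINT"
    psql -c "SELECT pg_current_xlog_location()"   # note the WAL position
    # <filer snapshot of the tablespace volume would happen here>
    # Missing piece: nothing stops new WAL/heap writes between the CHECKPOINT
    # and the snapshot, which is what the xlog-insert-lock idea would address.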

The pg_clog and pg_control problem is a lot harder. Transaction IDs are global across the entire PostgreSQL instance. So your tablespace (or whatever other collection of tables less than an entire instance) will have transaction IDs in the future, and/or with different committed/rolled-back statuses in the local pg_clog to what they had in the original pg_clog. Badness ensues. The only way I can think of to deal with this is to VACUUM FREEZE all tables in the database, which will set their xmin to FrozenTransactionId, an xid defined in the source code that's always considered visible to all transactions. However, doing this without having unfrozen xids added in parallel requires that you take an EXCLUSIVE lock (blocking INSERT, UPDATE and DELETE) on all tables in the database for as long as it takes to VACUUM FREEZE and then snapshot the tablespace.
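The freezing step itself is simple enough; it's the "while holding everything locked" part that hurts. A sketch of just the freeze and a sanity check afterwards (the database name is only an example, and this does nothing about concurrent writes):

    psql -d test -c "VACUUM FREEZE"
    # Check how much unfrozen xid history remains per table:
    psql -d test -c "SELECT relname, age(relfrozenxid)
                       FROM pg_class
                      WHERE relkind = 'r'
                      ORDER BY age(relfrozenxid) DESC
                      LIMIT 10"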

So. Maybe it's possible to do what you want, but there's no way to do it without production impact. You'd have to basically rewrite PostgreSQL to store per-database pg_clog and pg_xlog, keep a per-database pg_control, and use per-database transaction IDs. Then add support for keeping these in a non-default tablespace. Once you'd done that, you'd have the exciting job of making this work with streaming replication and WAL-shipping based PITR, which currently rely on a single WAL stream as an ordered timeline of events. And you'd have to deal with the extra fsync() costs, etc., of per-DB WAL. Finally, you'd need to figure out how you were going to make this work with PostgreSQL's shared relations, like pg_database, pg_authid, etc., that live in the pg_global tablespace - they are transactional tables and participate in transactions along with tables in the current database, and you'd have to find a way to keep that working when you now had two separate transaction IDs (global and per-database) that had to commit or roll back atomically.

This is quite an interesting problem, and I suggest raising it on the pgsql-general mailing list. Link back here for context (and so people can explain all the things I got wrong in the above explanation).

Personally I'd really like to see PostgreSQL able to handle individual databases or tablespaces more independently, because I'd really like to be able to use replication to stream only a particular DB, or a subset of a DB. Right now that's just not possible, meaning you must do frustrating things like log-ship change data for high-churn transaction tables and low-churn important archive tables together, or separate them into completely separate Pg instances. It doesn't look easy to do; the new logical replication and BDR features should help, though.