PostgreSQL – Estimating Bandwidth for Streaming Replication

postgresqlreplication

I have an existing postgresql server that I would like to replicate to another location using streaming replication. How can I best estimate the amount of bandwidth that will be required to keep the databases reasonably synchronized, using the actual workload I have now? I've read through the pg_stat_* list but nothing jumps out at me for an overall transaction rate or amount of data written (except for pg_stat_bgwriter, which has totals since the server was last started). Would the number of buffers (which I believe are 8kB each) written according to the checkpoint log be in the correct ballpark?

Best Answer

Well, very roughly speaking, at steady-state (i.e. after you have a standby server initialized with a basebackup and in-sync with the primary), the amount of bandwidth needed to keep a standby in-sync will be roughly your primary's WAL volume throughput.

Now, what is your primary's WAL volume throughput? Basically, how many 16 MiB WAL files your primary server produces per unit of time. You can either poke around in your primary's pg_xlog directory and see how many new files are being churned through. Or here's a nifty shell command using psql, credit of depesz that you can use:

(echo "ibase=16" psql -qAtX -c "select pg_xlogfile_name(pg_current_xlog_location())" | cut -b 9-16,23-24) | bc

The output of that command will give you a decimal count of what number WAL file you are on. Run that command again in, say, 10 minutes. Subtract the first number from the second, and that gives you how many 16 MiB WAL segments would need to be synchronized between the primary and the standby in that period of time. That will answer your question about:

... the amount of bandwidth that will be required to keep the databases reasonably synchronized

But! A few caveats:

  • The initial step of copying over a base backup to bring a new standby online will need a LOT of bandwidth (since you will effectively need to copy over your entire DATADIR to bring up the new server), and then the standup will have work to do to catch up to the primary's current position. Effectively, this means you'll need plenty of additional bandwidth on top of your steady-state needs to be able to reliably initialize a new standby server in a reasonable time.

  • The volume of WAL produced by the primary will be influenced by settings such as wal_level, which must be cranked up to hot_standby when you actually want to bring a standby server up

  • Useful settings to enable once you're using streaming replication such as hot_standby_feedback will consume further bandwidth.

  • You will likely find that your WAL throughput will vary considerably depending on how busy your database is, autovacuuming, checkpoints, DDL, etc. You'll want to be able to absorb this additional WAL volume to keep your standby in-sync.