PostgreSQL – What Happens When Starting Replication Using tar?

postgresql replication

Hi, I follow this tutorial every time I need to set up binary replication. Specifically, I use the second method, called:
"Starting Replication with only a Quick Master Restart"
found here:
https://wiki.postgresql.org/wiki/Binary_Replication_Tutorial

That works fine.

In a test environment, I tried to use tar instead of this rsync command:

    rsync -av --exclude pg_xlog --exclude postgresql.conf --exclude postgresql.pid \
        data/* 192.168.0.2:/var/lib/postgresql/data/

Like this:

tar -cz data >data.tar.gz

If I generate the .tar.gz file with the data and extract it on the slave, how bad can this be for the database?

In the logs, the slave shows that it connects successfully to the master.

Best Answer

Assuming the replica's data directory starts out empty (which it should, or you are probably doing something wrong), these commands are basically equivalent. Both read all the data and do something with it.

The tar command is going to use a bit more CPU on the server, because it runs gzip, which will probably max out one CPU as long as one is free. It will also consume a bit more IO capacity, because it writes the .tar.gz file to disk rather than streaming directly over the network (but you could change the tar command to stream over ssh instead).
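As a rough sketch of the streaming variant mentioned above: the helper below pipes an uncompressed tar archive straight into a receiving command, so no archive file is written on the master. The function name `stream_dir` is made up for illustration, and the host and target path in the usage comment are simply taken from the rsync command in the question; adjust them to your setup.

```shell
# Hypothetical helper (not part of PostgreSQL or tar): archive directory
# NAME found under PARENT and pipe the archive into the receiving command
# given by the remaining arguments, without writing a file on disk.
stream_dir() {
    parent=$1
    name=$2
    shift 2
    # "-f -" writes the archive to stdout; "$@" is the receiver, which
    # reads the archive from its stdin.
    tar -C "$parent" -cf - "$name" | "$@"
}

# Assumed usage, run from the directory that contains "data"
# (host and path copied from the question's rsync command):
#   stream_dir . data ssh 192.168.0.2 'tar -C /var/lib/postgresql -xf -'
```

Unlike the rsync command in the question, this sketch excludes nothing, so pg_xlog and postgresql.conf are copied as well; tar's `--exclude` option could be added if you want to mirror that behaviour. Adding `z` to both tar invocations would compress the stream, trading CPU for network bandwidth.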

Neither of these should make much of a difference, unless you are already on the brink.

Considering your comment about how much slower rsync is: I find rsync somewhat slower, but not drastically so. I wonder if you have a dysfunctional implementation of it. I've heard (vaguely) of versions of rsync that interacted with versions of the Linux kernel in a way that caused context-switching storms. Anyway, if tar is so much faster, then it will naturally impose a higher IO load on the server, since it reads all the data in less time. This might interfere with server operation while the copy runs. On the other hand, whatever rsync is doing that makes it slow might also be consuming resources the database would like to use.

Unfortunately, when you get to issues of one tool's implementation interacting with particular kernel versions and so on, you have a problem that can really only be addressed experimentally.

If you are concerned about correctness, both rsync (when used as you show and on an empty target directory) and tar should preserve your data integrity. It is just a matter of which one performs better and has the least performance impact on the production server by competing with it for resources. I prefer tar, because rsync invites people to try clever things that put their data at risk.