Periodic filesystem replication with snapshots

replication, synchronization

I'm looking for a non-realtime replication solution that supports periodic snapshots.

Here's my current situation:

  • I have 2 email servers running Ubuntu 12.04 LTS
  • The mail server program I use is Axigen v8.1.1, serving 2000+ mailboxes, at a rate of approx. 2000 emails per hour during work hours
  • The configuration is Master/Slave, using heartbeat/pacemaker
  • Axigen uses its own proprietary database to store all settings and messages
  • Most users access the email server using POP3, but some use IMAP4

What I want to implement:

  • Every N minutes, a snapshot should be taken on the Master and shipped to the Slave
  • The Slave should be able to efficiently store at least the M latest snapshots, plus 2 daily snapshots, in case we need to roll back
  • (We can live with N minutes of lost emails; all emails are stored in a MailArchiva storage system)
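The retention rule on the Slave boils down to "keep the newest M, destroy the rest". A minimal sketch of that selection logic (the keep count and snapshot names are placeholders; it only prints the candidates, so you can review before feeding them to a destroy command):

```shell
# Print the snapshots that fall outside the retention window.
# $1 = number of snapshots to keep, $2 = file listing snapshot names, oldest first
prune_candidates() {
    keep=$1
    file=$2
    count=$(wc -l < "$file")
    drop=$((count - keep))
    if [ "$drop" -gt 0 ]; then
        head -n "$drop" "$file"   # everything older than the newest $keep
    fi
}

# Hypothetical usage on the Slave (echo kept in for a dry run):
#   zfs list -H -t snapshot -o name -s creation > /tmp/snaps
#   prune_candidates 12 /tmp/snaps | while read -r s; do echo zfs destroy "$s"; done
```

The daily snapshots would simply be excluded from the input list (e.g. by naming convention) so they are never offered for pruning.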

My original plan was to implement Axigen's datastore on a ZFS-on-Linux (ZoL) filesystem, with regular snapshots that would be shipped (incrementally) to the Slave. However, I was bitten by ZoL's instability under heavy I/O, experiencing several CPU soft-lockup incidents. The ZoL discussion group suggested reducing the ARC cache size, but of course that would hurt performance, so I reverted to ext4-backed storage on the Master instead. (I might still implement ZFS on the Slave, though.)

I'm considering several options:

  1. Reconfigure the Master so that the datastore is on an LVM-backed storage, and regularly create an LVM snapshot to sync to the Slave using csync2 or rsnapshot (and delete the LVM snapshot after a successful sync). On the Slave side, after every successful sync, do a ZFS snapshot to maintain the required number of snapshots & daily snapshots.
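Option 1 could be scripted roughly as follows; the VG/LV names (vg0/axigen), snapshot size, mount point, and host name are all hypothetical, and setting RUN=echo previews the steps without executing anything:

```shell
# Sketch of one option-1 sync round. With RUN=echo the commands are
# printed instead of run, so the sequence can be reviewed safely.
RUN=${RUN:-}

lvm_sync_round() {
    # 1. Freeze a point-in-time view of the datastore
    $RUN lvcreate --snapshot --size 5G --name axigen_snap /dev/vg0/axigen
    $RUN mount -o ro /dev/vg0/axigen_snap /mnt/snap
    # 2. Ship the snapshot's contents to the Slave
    $RUN rsync -aH --delete /mnt/snap/ slave:/tank/axigen/
    # 3. Drop the LVM snapshot so copy-on-write overhead doesn't accumulate
    $RUN umount /mnt/snap
    $RUN lvremove -f /dev/vg0/axigen_snap
    # 4. On the Slave, pin the freshly synced state with a ZFS snapshot
    $RUN ssh slave zfs snapshot "tank/axigen@$(date +%Y%m%d%H%M)"
}

# lvm_sync_round   # would be invoked from cron every N minutes
```

Note that the LVM snapshot must be sized to absorb all writes that land on /dev/vg0/axigen during the sync window, or it will be invalidated mid-rsync.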

  2. Implement DRBD in a Master/Slave configuration, with a hard disk-backed storage on Master, but a ZVOL-backed storage on Slave.
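A DRBD resource for option 2 might look like the fragment below (hostnames, devices, and addresses are placeholders). One caveat: DRBD replicates continuously rather than periodically, so the point-in-time copies would have to come from ZFS snapshots of the backing ZVOL on the Slave.

```
# /etc/drbd.d/axigen.res — hypothetical resource definition
resource axigen {
  protocol C;                          # synchronous; protocol A trades safety for latency
  on master {
    device    /dev/drbd0;
    disk      /dev/sda3;               # plain partition backing on the Master
    address   10.0.0.1:7789;
    meta-disk internal;
  }
  on slave {
    device    /dev/drbd0;
    disk      /dev/zvol/tank/axigen;   # ZVOL backing on the Slave
    address   10.0.0.2:7789;
    meta-disk internal;
  }
}
```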

  3. Implement a Clustered File System that supports snapshots… but which one?

Your thoughts and inputs are very much appreciated.


Edit: due to my department's budget situation, I can't use a commercial solution. Maybe next year, but unfortunately the need is immediate.


Edit 2: the ZoL instability might not be instability of ZoL itself per se; I suspect it is caused by the unbelievable memory churn of the email server. (For various reasons I have to run Perdition in front of the Axigen server, and Perdition creates one process per connection, so the server's memory may be severely fragmented, preventing ZoL from claiming SLABs to grow its ARC cache.)

Best Answer

Well, at least you can consider using lvmsync, which is «… able to read the metadata that device-mapper uses to keep track of what parts of the block device have changed, and use that information to only send those modified blocks over the network. …»
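A hypothetical lvmsync round, assuming the destination LV was seeded once with a full block-level copy (volume and host names are invented; the helper only prints the commands so the sequence can be inspected before being run):

```shell
# One lvmsync replication round: snapshot, ship changed blocks, clean up.
lvmsync_round() {
    echo "lvcreate --snapshot --size 2G --name axigen_snap /dev/vg0/axigen"
    echo "lvmsync /dev/vg0/axigen_snap slave:/dev/vg0/axigen"
    echo "lvremove -f /dev/vg0/axigen_snap"
}

lvmsync_round
```

The middle step is where lvmsync earns its keep: it reads the snapshot's copy-on-write table to find exactly which blocks changed since lvcreate, and transfers only those, which fits the "every N minutes" requirement without rsync walking 2000+ mailboxes.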
