How azure sql geo replication works internally

azure-sql-databasetransactional-replication

My company moved to azure and started to use azure Sql database and geo replication.

For our on-premise servers, we heavily rely on Microsoft transactional replication to move data. Now we setup transactional replication between on-premise server (publisher) and azure sql db. Then we found that the transactional replication itself is constantly consuming 100% DTU on azure(we are using a 125 DTU P1).

We upgraded it to 250 DTU and replication is much happier. However, we are not sure it is a great idea to constantly updating to higher DTU. So we did this:

  1. Created two 125DTU servers.
  2. Setup transactional replication between on-premise server and server A.
  3. Setup geo replication between server A and server B.
  4. Our customers will use server B.

The benefit is that heo replication will not trigger the 100% DTU throttle, and when we raised this to a Microsoft support, he also said it is a good idea. The support person also said that sql geo replication is much more efficient to transactional replication.

However, when we asked him what kind of technology azure geo-replication internally uses and why it is better than old transactional replication, he just gave out some marketing reply:

geo-replication leverages the Always On technology of SQL Server to asynchronously replicate committed transactions on the primary database to a secondary database using snapshot isolation. Impact to your main database is minimal as this process runs with the same context where the automated backups run, for example.

Personally, I didn't really get why geo replication is better than old replication from the reply. Does anyone know some technical detail?

Best Answer

As you were told by Support, Azure SQL Database's geo replication uses a special flavor of Always On Availability Groups to replicate data. The "special flavor" doesn't matter much in the context of your question, so I am going to respond in terms of "Replication vs Availability Groups" since it applies to your case, too.

Replication

SQL Server Transaction Replication is a way to replicate specific data between databases. By default, it will replication data changes between the Published and the Subscriber. The Distributor has a Log Reader Agent that reads the transaction Log of the Publisher, and converts that into separate update statements, which are sent to the Subscriber via a Distribution Agent.

A single update statement that updates 1,000 rows on the Publisher will generate 1,000 separate update statements on the Subscriber, doing row-by-row updates based on the PK.

There are alternatives, but in most cases, this is the configuration people use, as it is the default behavior. (For example, you can configure stored procedure executions to replicate, so that rather than replicating row-by-row data updates, the Subscriber executes the stored procedure again... You probably aren't doing that.)

As you can imagine, this row-by-row update on the subscriber can be inefficient. Each individual update on the Subscriber will be quick because it is a single-row update based on a Primary Key--but the row-by-row vs set-based operation isn't always ideal.

Availability Groups

Availability Groups (AGs) handle things differently. With AGs, the entire database is replicated, not just specific data. The replication method involves writing to the transaction log on the secondary server, and then redoing the transaction on the secondary server. These two steps are generally referred to as "send" and "redo" phases.

A simplified version of how it works: The overhead on the Primary server for "send" is essentially just that it has to write to the transaction log locally, and also send that same exact information to the Secondary for it's copy of the transaction log. The overhead on the Secondary server for "redo" is essentially just processing that transaction the same way it was handled on Primary.