SQL Server – Transactional Replication issues

Tags: replication, sql-server, transactional-replication

I have a question. I am managing two SQL Server instances:

  1. the Master Server (Publisher + Distributor)
  2. the Client Server (Subscriber)

The Client Server is owned by a client who is buying data from us.

The two servers are linked by transactional replication.
Only selected tables of a single database are replicated from the Master Server to the Client Server.

There are several problems I want to discuss.

Problem 1:
When I add or remove any table and take a new replication snapshot, all the replicated data vanishes on the customer side and is then rebuilt over the next 20-30 minutes. (I only have limited access, so I don't get logs.)

Problem 2:
When I reinitialize the subscription, it takes a long time to complete, and monitoring shows the table scripts as only partially applied.

Problem 3:
The Publisher-to-Distributor section shows high command and transaction counts, but the Distributor-to-Subscriber command and transaction counts are actually hundreds of times lower than on the publisher side.

Problem 4:
Latency between Publisher and Distributor (Server 1) is very high: it sits around 5+ minutes, jumps to 7 minutes, then drops back to a few seconds on its own, but the average is really very high. Distributor-to-Subscriber latency seems fine.

Problem 5:
The log file of the master database grows and now occupies around 3.5 GB (it is normally 3 GB with 80-87% free space), and the data file is also growing and is now 4 GB+ (normally 3.3-3.5 GB).

Best Answer

First, you can set up a canary table and other detailed monitoring of transactional replication per Kendra Little's article Monitoring SQL Server Transactional Replication. Canary tables, the easiest step in the article, are described as:

It’s easy to build your own system for tracking replication latency for each publication. Here are the ingredients for the simplest version:

  • Add a table named dbo.Canary_PubName to each publication

  • dbo.Canary_PubName has a single row with a datetime column in it

  • A SQL Server Agent job on the publisher updates the datetime to the current timestamp every minute

  • A SQL Server Agent job on the subscriber checks dbo.Canary_PubName every minute and alerts if the difference between the current time and the timestamp is greater than N minutes
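
A minimal T-SQL sketch of that canary setup (the object names, datetime2 column type, and 5-minute threshold are illustrative choices, not prescribed by the article):

    -- On the publisher, in the published database; add this table to the publication
    CREATE TABLE dbo.Canary_PubName
    (
        CanaryId   tinyint      NOT NULL CONSTRAINT PK_Canary_PubName PRIMARY KEY,
        LastUpdate datetime2(0) NOT NULL
    );
    INSERT INTO dbo.Canary_PubName (CanaryId, LastUpdate) VALUES (1, SYSDATETIME());

    -- SQL Server Agent job step on the publisher, scheduled every minute
    UPDATE dbo.Canary_PubName SET LastUpdate = SYSDATETIME() WHERE CanaryId = 1;

    -- SQL Server Agent job step on the subscriber, scheduled every minute
    IF EXISTS (SELECT 1 FROM dbo.Canary_PubName
               WHERE DATEDIFF(MINUTE, LastUpdate, SYSDATETIME()) > 5)   -- N = 5 minutes
        RAISERROR('Replication latency on Canary_PubName exceeds 5 minutes.', 16, 1);  -- or fire your alert of choice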

Then take a look at the activity happening on the tables being replicated. It sounds like most of your problems are related to very high activity - your log growing from 3 GB with only 13-20% used to 3.5 GB (meaning the full 3 GB was used and the file had to grow) is an indicator of big transactions, which may be saturating your replication bandwidth.
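
If you want to confirm what is filling and holding the log, two standard checks (the database name below is a placeholder) are:

    -- Log size and percent used for every database on the instance
    DBCC SQLPERF(LOGSPACE);

    -- Why the log can't be truncated; REPLICATION here means the Log Reader Agent
    -- (or distribution cleanup) hasn't yet processed those transactions
    SELECT name, log_reuse_wait_desc
    FROM   sys.databases
    WHERE  name = N'YourPublishedDb';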

Also, if the tables take 20-30 minutes to replicate when reinitialized, that's a combination of their size and your available bandwidth.
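
As a complement to the canary table (independent of the article, but built into transactional replication), tracer tokens measure the Publisher-to-Distributor and Distributor-to-Subscriber legs separately, which lines up with your problems 3 and 4. Roughly, with a placeholder publication name:

    -- Run at the Publisher, in the published database
    EXEC sys.sp_posttracertoken @publication = N'YourPublication';

    -- A few minutes later, still at the Publisher in the published database:
    -- list the tokens for the publication, then show per-leg latency for one of them
    EXEC sys.sp_helptracertokens @publication = N'YourPublication';
    EXEC sys.sp_helptracertokenhistory
         @publication = N'YourPublication',
         @tracer_id   = 1;   -- substitute a tracer_id returned by sp_helptracertokens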

So, the general fixes are:

  • Increase the bandwidth

  • Reduce the amount of data

    • Use the smallest integer-based keys you can, and simply replicate lookup tables with the descriptions, rather than having the descriptions in your main tables.

      • BIT when possible, then TINYINT, then SMALLINT, then INT, then BIGINT where you're forced to.
    • Replicate indexed views instead of the actual tables (publisher = indexed view, subscriber = table) - send over only the required columns, and only the required rows (see the indexed-view sketch after this list).

    • Don't use NCHAR/NVARCHAR unless you need to - use CHAR/VARCHAR when you know the data contains, or perhaps even should contain, no UCS-2 "Unicode" characters.

      • Use CHAR when the average data size is within 2 bytes of the maximum field size, since VARCHAR has a 2-byte overhead to store the length.
  • Reduce the number of transactions on the replicated data

    • Whatever you've got happening that causes all that activity on the replicated tables, analyze it. Is it required? Could it be re-written to minimize the amount of data being replicated?

      • Don't touch the same row more than once, i.e. instead of INSERT, UPDATE, UPDATE, UPDATE, do just a single INSERT into the final table - perhaps work in #temp tables before then (see the staging sketch after this list). And so on and so forth.
    • If they don't absolutely need up-to-the-minute data in any given table, then don't replicate the core tables that get all this activity.

      • For each non-realtime-required table, create reporting tables on the publisher that get updated only every so often (once a week, once a day, once an hour, whatever), and replicate those.

      • And make sure to carefully use only the UPDATEs, INSERTs, and DELETEs that are required - the goal is to minimize replication load.

      • Don't use MERGE; it has too many issues.

  • Split the tables up into multiple publications, so you aren't taking a snapshot of, and reinitializing, every table every time you have to add or remove one (see the publication-split sketch after this list).

    • I.e., if you add/drop/alter an article in Publication C, you need to reinitialize and snapshot that publication, but publications A, B, D, E, and F aren't affected.

      • And thus it moves less data, which in turn takes less time.

      • And the subscriber retains access to all the articles in all the other publications

      • So split them up along functional usage lines; if tables A, E, and G are used together most of the time, while C and D aren't involved in most of those queries, that's a natural split that will leave your client happier; at least some functions continue unobstructed during re-initialization of either set.

  • Increase the bandwidth
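
As promised above, a sketch of the indexed-view article approach (all object and publication names are hypothetical, and the script is abridged - check that your edition and publication settings allow publishing indexed views):

    -- On the publisher: a schema-bound view exposing only the columns and rows the client needs
    CREATE VIEW dbo.vOrders_ForClient
    WITH SCHEMABINDING
    AS
    SELECT OrderId, CustomerId, OrderDate, TotalAmount
    FROM   dbo.Orders
    WHERE  IsCancelled = 0;
    GO

    -- The unique clustered index is what makes it an indexed view
    CREATE UNIQUE CLUSTERED INDEX IX_vOrders_ForClient ON dbo.vOrders_ForClient (OrderId);
    GO

    -- Publish the indexed view as a log-based article; it arrives as a plain table at the subscriber
    EXEC sp_addarticle
         @publication       = N'Pub_Orders',
         @article           = N'vOrders_ForClient',
         @source_owner      = N'dbo',
         @source_object     = N'vOrders_ForClient',
         @type              = N'indexed view logbased',
         @destination_table = N'Orders_ForClient';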
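
For the "touch each row only once" point, the idea is simply to keep the intermediate churn in a #temp table (which is never replicated) and hit the published table with one final statement; table names are hypothetical:

    -- All the multi-step massaging happens against the #temp table
    SELECT OrderId, CustomerId, OrderDate, TotalAmount
    INTO   #StagedOrders
    FROM   dbo.IncomingOrders;                                      -- placeholder source

    UPDATE #StagedOrders SET TotalAmount = ROUND(TotalAmount, 2);   -- fix-ups, enrichment, etc.

    -- A single statement touches the published table, so each row is replicated once
    INSERT INTO dbo.Orders (OrderId, CustomerId, OrderDate, TotalAmount)
    SELECT OrderId, CustomerId, OrderDate, TotalAmount
    FROM   #StagedOrders;

    DROP TABLE #StagedOrders;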
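
And a minimal sketch of splitting articles across publications (names hypothetical; the usual sp_addpublication_snapshot / sp_addsubscription steps are omitted). Reinitializing Pub_Sales later won't touch anything in Pub_Reference:

    -- Two narrower publications instead of one wide one
    EXEC sp_addpublication @publication = N'Pub_Sales',     @status = N'active';
    EXEC sp_addpublication @publication = N'Pub_Reference', @status = N'active';

    -- Tables that are used together stay together
    EXEC sp_addarticle @publication = N'Pub_Sales', @article = N'Orders',
         @source_owner = N'dbo', @source_object = N'Orders';
    EXEC sp_addarticle @publication = N'Pub_Sales', @article = N'OrderLines',
         @source_owner = N'dbo', @source_object = N'OrderLines';

    -- Slow-changing lookup tables get their own publication
    EXEC sp_addarticle @publication = N'Pub_Reference', @article = N'ProductTypes',
         @source_owner = N'dbo', @source_object = N'ProductTypes';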