Sql-server – Transactional replication latency issue

sql serversql-server-2008-r2transactional-replication

I inherited database systems like this. Currently I have publisher database in SQL Server 2005 compatibility mode on Windows Server 2008 R2 with SQL Server 2008R2 SP2 machine. Distributor is on the same machine. Subscriber is 2008R2 SP2 and database is in SQL Server 2008 compatibility mode. We are using Transactional Replication. Isolation level is Read Committed. Distributor resides on Publisher. Even though when I right click on publication and even though subscription shows as pull subscription, I think it won't matter since distributor resides on publisher itself. Please correct me if I am wrong. Storage system is IBM flex which is shared by five servers including publisher and subscriber.

Since couple of days, I see latency of few hours, it catches up in the morning and starts going up again in the afternoon. I followed https://www.mssqltips.com/sqlservertip/3598/troubleshooting-transactional-replication-latency-issues-in-sql-server/ to see exactly what was happening. I ran following query.

USE distribution 
go 
EXEC Sp_browsereplcmds 
@xact_seqno_start = '<seq#>' -- seq# is same for start and end
,@xact_seqno_end = '<seq#>'
,@publisher_database_id = <publisher database id --this is different than database_id

I see that there are supposedly massive updates being done on few tables involved in replication and Log Reader is just scanning transaction log, not able to replicate anything till transaction completes. Interestingly, I can't see any blocking on either publisher and/or subscriber. Will changing isolation level to Read Committed Snapshot Isolation (RCSI) help here? Will it help to change polling interval to 1 and readbatchsize to 1000 or 5000. What's the command to change that setting?

I changed Log Reader Agent default profile as follows. Polling Interval from 5 to 1 and ReadBatchSize to 5000. This brought latency from 13 hours to zero almost instantly. But I see that it went back to 13 hours.

Replication is in sync and I don't have a single clue for actual root cause which caused the latency and now it went away.

Best Answer

I had to finally call Microsoft support and just a simple command called DBCC LOG INFO on publisher revealed a possible root cause. I saw more than 8600 VLFs! and this was the cause of that latency. Also, our log file is preallocated to 538GB.

Replication was out of sync by almost 19 hours by the time I got follow up call from Microsoft help 4:00pm next day after I opened case with Microsoft. Steps to be taken were really simple. Backup publisher db log couple of time and try to shrink log file. Set increment factor for log file by 8GB or 12GB instead of percentage or 500MB. So, that next time log file grows, it will create 16 VLFs per 8GB or 12GB depending on your increment factor.

After I backed up log, I was able to shrink log file to 350GB and total VLFs to around 5300. Still higher. Latency didn't come down though. It went upto 22 hours. I started wondering whether number of VLFs was just one of the causes. However, around 11:30pm, latency reduced to around 7h30m and I freed up more space around that time, reducing VLFs to 2001. By 2am, replication was in sync. I hurried up and backed up log twice and shrink the log file to 10GB and grew it back to around 248GB. Total number VLFs as of now are 184 and replication is in sync since then. Whew! Log file is almost empty.

Let me know if you have any questions about this. I am glad to help. Hopefully someone else doesn't have to call Microsoft for this issue.