Sql-server – Transactional replication reports ok, but tracer token not arriving

sql-server-2016transactional-replication

I've got an odd problem here. I was trialling the use of 'initialize from lsn' in our test replication setup and now have broken all publications from one of our databases, rebuilding from scratch (applying a new snapshot) has not fixed the issue.

In Replication monitor, distributor agent running continuously and reporting:

No replicated transactions are available

Tracer token is reporting that:

  • Publisher to Distributor: 00:00:02
  • Distributor to Subscriber: Pending…
  • Total Latency: Pending…

When I investigated the distribution logs using:

SELECT time,
       CAST(comments AS XML) AS comments,
       runstatus,
       duration,
       xact_seqno,
       delivered_transactions,
       delivered_commands,
       average_commands,
       delivery_time,
       delivery_rate,
       delivery_latency / (1000 * 60) AS delivery_latency_Min,
       agent_id
FROM dbo.MSlogreader_history WITH (NOLOCK)
WHERE agent_id = 5
ORDER BY time DESC;

SELECT *
FROM dbo.MSdistribution_history
WHERE agent_id = 125
ORDER BY time DESC;

I can see that the latest xact_seqno from:

  • LogReader agent is: 0x00000378000384880003
  • Distribution agent is: 0x000CA68B000010A8000A000000070000

As a comparison on one of the other databases (going to the same subscriber), I get:

  • LogReader agent is: 0x00000D140001F52A0003
  • Distribution agent is: 0x00000D140001F52A0003000000000000

To me it looks like the distribution agent is remembering the old broken xact_seqno (from when I made a mistake when testing 'initialize from lsn'), but now it appears to have permanently broken all replication from that database.

Servers:

  • PubA – One publisher hosting several databases that are replicated
  • Dist – A separate distributor
  • SubA – First subscriber
  • SubB – Second subscriber

Databases:

  • PubA.DB1 = Publisher DB that is working fine
  • PubA.DB2 = Publisher DB that is broken
  • SubA.PullWorking1 = 1st subscriber has a working pull subscription from DB1
  • SubA.PullBroken2 = 1st subscriber also has a broken pull subscription from DB2
  • SubA.PushWorking1 = 2nd subscriber has a working push subscription from DB1
  • SubA.PushBroken2 = 2nd subscriber also has a broken push subscription from DB2

Publications:

  • PubA.DB1.PushPub_works
  • PubA.DB1.PullPub1_works
  • PubA.DB1.PullPub2_works
  • PubA.DB2.PushPub_broken
  • PubA.DB2.PullPub_broken

Best Answer

Ok, in case anyone ever has this problem I found the fix (though not the root cause).

Fix

  1. Stop the distribution agents for all subscribers to publications from this database
  2. Get the LSN that is causing the problem from one of the subscribers:

SELECT publication, '0x' + CONVERT(VARCHAR(32),MAX(transaction_timestamp),2) as LastPubLSN FROM dbo.MSreplication_subscriptions GROUP BY publication

  1. Find a valid LSN to start from:

DECLARE @Publisher sysname = N'PubA', @PubDB sysname = N'DB2', @Publication sysname = N'PushPub_Broken'; SELECT TOP 1000 trans.entry_time, trans.publisher_database_id, trans.xact_seqno FROM dbo.MSpublications AS p JOIN master..sysservers AS srv ON srv.srvid = p.publisher_id JOIN dbo.MSpublisher_databases AS d ON d.publisher_id = p.publisher_id AND d.publisher_db = p.publisher_db JOIN dbo.MSrepl_transactions AS trans ON trans.publisher_database_id = d.id WHERE p.publication = @Publication AND p.publisher_db = @PubDB AND srv.srvname = @Publisher AND trans.xact_seqno >= p.min_autonosync_lsn ORDER BY trans.entry_time DESC

  1. Check what invalid commands are in the repl queue:

DECLARE @PubDB INT = 5 /* use the publisher_database_id retrieved above */ select * from MSrepl_commands with (READPAST) where publisher_database_id = @PubDB and command_id = 1 and type <> 30 and /* 30 = normal repl commands */ type <> -2147483611 /* -2147483611 = is ignored by sp_MSget_repl_commands (it's snapshot related?) */

  1. Backup and delete the commands that are causing the blocking, this should be the same LSN that is causing the problem (see step 2)
  2. Reset the LSN on each of the subscriber DBs

EXEC sp_setsubscriptionxactseqno @publisher = 'PubA', @publisher_db = 'DB2', @publication = 'PushPub_Broken', @xact_seqno = 0x0000... /* New valid LSN to start from (step 3) */

  1. Start up your distribution agent and verify that all is working.

How I found it

So after finding and trialling sp_setsubscriptionxactseqno in conjunction with knowing how to get the LastLSN applied to the subscriber, I ran a trace on the distributor, I was able to see that the sequence was:

exec sp_MSget_repl_commands 126,/* good LSN that I was telling it to use */,0,10000000
go
exec sp_MShelp_distribution_agentid 6,N'PushPub_broken',N'DB2',8,N'PushBroken2',0
go
exec sp_MSget_repl_commands 126,/* bad LSN that it was resetting too */,0,10000000
go

Running the first command myself, I could see 2 resultsets returned, the 2nd of which included the bad LSN. Obviously something inside this proc was causing it to overwrite the LSN I wanted it to apply.

Finding the code to sp_MSget_repl_commands, I was able to identify that it was:

select @max_xact_seqno = max(xact_seqno) from MSrepl_commands with (READPAST)
  where 
     publisher_database_id = @publisher_database_id and
     command_id = 1 and
     type <> -2147483611 

From there, I found the culprit records and was able to restart replication (after resyncing my data)