I've got an odd problem here. I was trialling 'initialize from lsn' in our test replication setup and have now broken all publications from one of our databases; rebuilding from scratch (applying a new snapshot) has not fixed the issue.
In Replication Monitor, the Distribution Agent is running continuously and reporting:
No replicated transactions are available
Tracer token is reporting that:
- Publisher to Distributor: 00:00:02
- Distributor to Subscriber: Pending…
- Total Latency: Pending…
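For reference, tracer tokens can also be posted and inspected from T-SQL rather than Replication Monitor. A minimal sketch, run at the publisher against the publication database (the publication name here is taken from the list below; the tracer_id must be substituted from the output of sp_helptracertokens):

EXEC sys.sp_posttracertoken @publication = N'PushPub_broken';

-- List the tokens posted for this publication, then view latency for one.
EXEC sys.sp_helptracertokens @publication = N'PushPub_broken';
EXEC sys.sp_helptracertokenhistory
    @publication = N'PushPub_broken',
    @tracer_id   = 1; /* substitute a tracer_id returned above */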
When I investigated the agent history tables in the distribution database using:
-- Log Reader Agent history
SELECT time,
CAST(comments AS XML) AS comments,
runstatus,
duration,
xact_seqno,
delivered_transactions,
delivered_commands,
average_commands,
delivery_time,
delivery_rate,
delivery_latency / (1000 * 60) AS delivery_latency_Min,
agent_id
FROM dbo.MSlogreader_history WITH (NOLOCK)
WHERE agent_id = 5
ORDER BY time DESC;
-- Distribution Agent history
SELECT *
FROM dbo.MSdistribution_history
WHERE agent_id = 125
ORDER BY time DESC;
I can see that the latest xact_seqno from:
- LogReader agent is: 0x00000378000384880003
- Distribution agent is: 0x000CA68B000010A8000A000000070000
As a comparison on one of the other databases (going to the same subscriber), I get:
- LogReader agent is: 0x00000D140001F52A0003
- Distribution agent is: 0x00000D140001F52A0003000000000000
To me it looks like the Distribution Agent has remembered the old broken xact_seqno (from a mistake I made while testing 'initialize from lsn'), and this now appears to have permanently broken all replication from that database.
Servers:
- PubA – One publisher hosting several databases that are replicated
- Dist – A separate distributor
- SubA – First subscriber
- SubB – Second subscriber
Databases:
- PubA.DB1 = Publisher DB that is working fine
- PubA.DB2 = Publisher DB that is broken
- SubA.PullWorking1 = 1st subscriber has a working pull subscription from DB1
- SubA.PullBroken2 = 1st subscriber also has a broken pull subscription from DB2
- SubB.PushWorking1 = 2nd subscriber has a working push subscription from DB1
- SubB.PushBroken2 = 2nd subscriber also has a broken push subscription from DB2
Publications:
- PubA.DB1.PushPub_works
- PubA.DB1.PullPub1_works
- PubA.DB1.PullPub2_works
- PubA.DB2.PushPub_broken
- PubA.DB2.PullPub_broken
Best Answer
OK, in case anyone ever has this problem: I found a fix (though not the root cause).
Fix
1. On the subscriber database, find the last LSN that was successfully applied:

SELECT publication,
       '0x' + CONVERT(VARCHAR(32), MAX(transaction_timestamp), 2) AS LastPubLSN
FROM dbo.MSreplication_subscriptions
GROUP BY publication;
2. On the distributor (distribution database), list the most recent transactions held for the publication:

DECLARE @Publisher   sysname = N'PubA',
        @PubDB       sysname = N'DB2',
        @Publication sysname = N'PushPub_Broken';

SELECT TOP 1000
       trans.entry_time,
       trans.publisher_database_id,
       trans.xact_seqno
FROM dbo.MSpublications AS p
JOIN master..sysservers AS srv
    ON srv.srvid = p.publisher_id
JOIN dbo.MSpublisher_databases AS d
    ON d.publisher_id = p.publisher_id
   AND d.publisher_db = p.publisher_db
JOIN dbo.MSrepl_transactions AS trans
    ON trans.publisher_database_id = d.id
WHERE p.publication = @Publication
  AND p.publisher_db = @PubDB
  AND srv.srvname = @Publisher
  AND trans.xact_seqno >= p.min_autonosync_lsn
ORDER BY trans.entry_time DESC;
3. Still on the distributor, look for abnormal commands for that publisher database (this is where the culprit records showed up, and where a new valid LSN to restart from can be picked):

DECLARE @PubDB INT = 5; /* use the publisher_database_id retrieved above */

SELECT *
FROM MSrepl_commands WITH (READPAST)
WHERE publisher_database_id = @PubDB
  AND command_id = 1
  AND type <> 30          /* 30 = normal repl commands */
  AND type <> -2147483611 /* -2147483611 is ignored by sp_MSget_repl_commands (it's snapshot related?) */;
4. Finally, on the distributor, reset the subscription's starting LSN to a known-good value:

EXEC sp_setsubscriptionxactseqno
    @publisher    = 'PubA',
    @publisher_db = 'DB2',
    @publication  = 'PushPub_Broken',
    @xact_seqno   = 0x0000...; /* New valid LSN to start from (step 3) */
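After resetting the sequence number, the Distribution Agent has to be restarted to pick it up. A minimal sketch via the SQL Agent job on the distributor (push) or subscriber (pull); the job name below is a placeholder, the real one can be found in msdb or via Replication Monitor:

/* Substitute the actual distribution agent job name. */
EXEC msdb.dbo.sp_stop_job  @job_name = N'PubA-DB2-PushPub_Broken-SubB-1';
EXEC msdb.dbo.sp_start_job @job_name = N'PubA-DB2-PushPub_Broken-SubB-1';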
How I found it
So, having found and trialled sp_setsubscriptionxactseqno, and knowing how to get the last LSN applied at the subscriber, I ran a trace on the distributor and was able to see the sequence of calls the Distribution Agent was making.
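For anyone repeating this, the agent's calls against the distribution database can be captured server-side with an Extended Events session; a minimal sketch, assuming the distribution database is named 'distribution' (the session name is arbitrary):

-- Run on the distributor. Captures completed RPCs (e.g. sp_MSget_repl_commands)
-- issued against the distribution database.
CREATE EVENT SESSION [repl_agent_calls] ON SERVER
ADD EVENT sqlserver.rpc_completed
(
    ACTION (sqlserver.client_app_name, sqlserver.sql_text)
    WHERE sqlserver.database_name = N'distribution'
)
ADD TARGET package0.ring_buffer;

ALTER EVENT SESSION [repl_agent_calls] ON SERVER STATE = START;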
Running the first of those commands myself, I could see two result sets returned, the second of which included the bad LSN. Clearly something inside this proc was overwriting the LSN I wanted it to apply.
Finding the code for sp_MSget_repl_commands, I was able to identify where it was doing this. From there, I found the culprit records and was able to restart replication (after resyncing my data).