Sql-server – What would cause a transactional replication subscription to restart repeatedly with no errors

azure-sql-databasesql servertransactional-replication

I have set up a transactional replication publication/subscription to test a migration from an on-premise SQL 2016 Enterprise Edition installation to Azure SQL DB and I'm finding that when the database snapshot finishes uploading to Azure, it just starts over again from the beginning. It finishes within 5-6 hours and uploads ~125GB of data, so I don't believe it is anything to do with the subscription expiring or exceeding any timeout thresholds. It is set to run continuously, rather than on a schedule.

I've looked at the distribution history in the MSdistribution_history table and can't see anything which mentions the subscription being marked for reinitialization, all that is there are entries showing 'Bulk copied data into table …', and then at the point where I can tell it restarted, it goes back to 'Creating primary key index on table…' and 'Applied script x…'. I have also looked at the MSrepl_errors table and msdb..sysreplicationalerts and both are empty.

I have tried deleting the Azure SQL DB and publication and going through the process from the beginning again, but I am seeing the same thing after having done this.

Is this something that someone has seen before and could explain the cause of and how to fix it?

Best Answer

Just in case anyone else comes across this problem, I believe I have found the solution.

I ran through all of the steps again from beginning to end in preparing the database for replication, and I realised that one of the triggers we have on the database was not marked with the 'Not for replication' property. It previously had been, but a code change had been made and deployed a version of the trigger without this property. After changing this back I set up a new snapshot and applied it successfully with no problems.

Due to the size of the database I was trying to replicate, I haven't tested flipping the setting back to off and trying it again to confirm this was what fixed it, although I did try it with a smaller database and didn't see the same behaviour. My thought on this is that the trigger was causing some sort of error with the data but not properly throwing an error message.

To check whether this property is set or not just run the following script and look at the is_not_for_replication property - 1 means it is set, 0 means it is not.

SELECT name,
is_not_for_replication
FROM sys.triggers
WHERE is_ms_shipped = 0;

If you are seeing the issue described above and then find that one or more triggers do not have this property set, update the definition of the stored procedure to include the not for replication property and try again. Setting the property is explained here: https://stackoverflow.com/questions/44882512/how-to-create-trigger-with-not-for-replication-in-merge-replication-of-mssql