Can anyone shed some light on this behavior?
The behavior you're describing is not working as described. The end result is the same but the inner workings are different than what you're thinking.
The Availability Group for this specific DB is synchronized and healthy.
I had to add a second transaction log file (logfile2) to the primary. The file did not replicate until I restarted the instance (was doing other maintenance). After the instance was restarted the transaction log appeared at the primary and secondary. 12 hours have passed since the instance was restarted and the log files show the same time/datestamp AND file size.
Restarting the instance didn't actually fix anything. If the databases were synchronized then the log blocks were hardened on the secondary. This doesn't mean there were redone yet, jus that they were hardened to disk.
By restarting the instance you've effectively forced redo to complete to a point and thus creating the second log file. If it is taking this much time to redo log blocks, there may be an issue with how much load the secondary could handle and this is one of the ways you'll see it manifest.
The last item that is indicative of this behavior is a blocked redo thread, which is most likely the cause.
It appears however that transactions hitting logfile2 on the primary are not being synchronized to the secondary for over 20 minutes.
Then the AG wouldn't be synchronized, it'd change to synchronizing. This is more likely what I descried above, where the log blocks were hardened and weren't yet redone (blocked redo thread).
Assuming the log blocks were sent, the AG is synchronized, and the redo thread was not blocked... this could be due to the transaction still being open. Until the transactions is committed (trancount = 0) your queries on the secondary won't see it as it isn't a yet committed transaction.
Assuming the transaction was committed (trancount = 0), the default isolation level of read committed
is mapped to the snapshot isolation level on the secondary. This could also be due to isolation level changes and the query running.
In addition to the normal transaction log backup that runs every 2 hours, which includes a CHECKPOINT command, I had to install a SQL agent job that performs a CHECKPOINT on the DB every 15 minutes. The reasoning is that transactions going to logfile2 are languishing there and not getting applied to the database; so, I force the flush of the log files to the database, which then synchronizes to the secondary.
This is not how the process works. Checkpoint itself doesn't flush log records, they may need to be flushed from the log buffer because of WAL (Write Ahead Logging) enforcement for the data pages being flushed but it will do nothing to the log file. The setup done with the agent job and checkpoint is not helping one bit.
The unit of transport is a log block, not individual records. When log blocks end and are flushed to disk, if the database is part of AlwaysOn Availability Groups it is copied into (depending on version) another buffer and eventually sent through the transport when the secondary checks in. This is a gross over simplification but shows how it works.
What to do next?
- Checked for blocked redo threads on the secondary.
- Make sure the AG is synchronized and not synchronizing
- Make sure the transactions have @@TRANCOUNT = 0
- Make sure the secondary has enough disk IOPs to handle the load without growing the redo queue size
Best Answer
Are all replicas configured to become primary in case of failover?
In case YES, you need to create the same job in all instances, scheduled at the same time.
Keep in mind that each instance has a different MSDB, which means different jobs/schedules,etc..
So you will need to create a extra "check" in your jobs, to make sure it's running in the primary replica. Here you can find a complete post about it.