If I disconnect DEV-AWEB5
Define "disconnect", if you will. My guess is you kept the box up but took SQL Server down.
I cannot connect to the Group Listener (DevListener), but I can ping it and it will respond to my ping
That's because the listener is just a virtual network name (VNN) within the WSFC cluster resource group for the represented availability group. Your DEV-AWEB5 node still owns the cluster resource group; most likely it's only the AG cluster resource that is in a failed state. The VNN must still be online (expected behavior). It's simply pointing to whichever node owns that resource group (in this case, DEV-AWEB5). In fact, if you had PowerShell remoting enabled, you could run the following:
Invoke-Command -ComputerName "YourListenerName" -ScriptBlock { $env:computername }
The return of that would be the computer name of your owning node.
Likewise, if you can RDP into DEV-AWEB5 (provided you have the capability and accessibility, etc.), then you'd be able to RDP using the listener name (mstsc /v:YourListenerName). It's just a VNN.
By all of your symptoms, I'd be willing to bet that you've reached your failover threshold. The failover threshold determines how many times the cluster will attempt to fail over your resource group in a specified time period. The defaults are a maximum of n - 1 failovers (where n is the count of nodes) in a period of 6 hours. You can see these values through the following WSFC PowerShell command:
Get-ClusterGroup -Name "YourAgName" |
Select-Object Name, FailoverThreshold, FailoverPeriod
That just gives you the settings (which you can modify if you so choose, of course).
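As a minimal sketch of modifying those settings, assuming the FailoverClusters module is available and that "YourAgName" is the cluster group name used above, something like the following should work. The values 5 and 1 are purely illustrative, not recommendations:

```powershell
# Requires the FailoverClusters module (installed with the Failover Clustering tools).
Import-Module FailoverClusters

# "YourAgName" is the same placeholder used above for the AG's cluster group.
$group = Get-ClusterGroup -Name "YourAgName"

# Illustrative values only: allow up to 5 failovers within a 1-hour window.
$group.FailoverThreshold = 5
$group.FailoverPeriod    = 1   # expressed in hours

# Confirm the new settings.
Get-ClusterGroup -Name "YourAgName" |
    Select-Object Name, FailoverThreshold, FailoverPeriod
```

Raising the threshold is handy for failover testing; for production you would normally want to find out why the group keeps failing over rather than just lifting the limit.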
The best way to prove that this is the case for you is to generate the cluster log (the system event logs only go into detail as far as " has failed", or something like that).
Get-ClusterLog -Node "YourClusterNode" -TimeSpan <amount_of_minutes_since_failure>
By default, that gets written to the "C:\Windows\Cluster\Reports" folder, and the file is called "Cluster.log".
If you were to open up that cluster log, you should be able to find the following string in there, indicating exactly what happened and why it happened:
Not failing over group [YourClusterGroupName], failoverCount [# of failovers], failover threshold [failover threshold value], nodeAvailCount [node available count].
The above message is simply WSFC telling you that it will not fail over your group because it has already failed over too many times (you hit the threshold).
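Rather than scrolling through the log by hand, you could search for that message. This is a small sketch that assumes the log was generated to the default location mentioned above; adjust the path if you wrote it elsewhere:

```powershell
# Assumes the default output location of Get-ClusterLog.
$logPath = "C:\Windows\Cluster\Reports\Cluster.log"

# -SimpleMatch treats the pattern as a literal string rather than a regex.
Select-String -Path $logPath -Pattern "Not failing over group" -SimpleMatch |
    Select-Object LineNumber, Line
```

If nothing comes back, either the threshold was not the cause or the time span used when generating the log did not reach back far enough.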
Why does this happen? Simply to prevent the Ping-Pong effect of cluster resources going back and forth too frequently between nodes.
While it is common to hit these thresholds during failover testing, in production it would typically point to a problem that should be investigated.
If the secondary is up and running, then when the log block is flushed to disk (either because it is full or because of a commit), the record is pushed to the log writer on the primary and to the log scanner (log reader) process on the primary simultaneously. The log scanner then communicates with the secondary, and the secondary pulls the transaction from the log scanner on the primary and processes the log record. The primary log writer doesn't push transactions across; it only communicates with the secondary to see whether it is up, so that it knows it doesn't have to mark the replica as NOT SYNCHRONIZED.
When the secondary is not up, the log writer can't communicate with it, so it marks it as NOT SYNCHRONIZED and keeps the records in the transaction log on the primary. If you look at the log_reuse_wait_desc column in sys.databases, it should show AVAILABILITY_REPLICA, which means the primary is hanging on to all the records.
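To check this from PowerShell, a sketch along these lines should do. It assumes the SqlServer module is installed and uses "YourPrimaryInstance" as a placeholder for the primary replica's instance name; depending on your module version and connection settings you may need extra connection parameters:

```powershell
# Requires the SqlServer module; "YourPrimaryInstance" is a placeholder.
Import-Module SqlServer

# List databases whose log truncation is being held up by an availability replica.
$query = @"
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE log_reuse_wait_desc = 'AVAILABILITY_REPLICA';
"@

Invoke-Sqlcmd -ServerInstance "YourPrimaryInstance" -Query $query
```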
Once the secondary is up, it communicates with the primary to request a log scan. It then processes the transactions and sends progress messages to the primary to indicate the hardened LSN. Presumably the primary then adjusts its MinLSN, which in turn means the records prior to MinLSN become eligible for removal as checkpoints happen, and hence VLFs get truncated, releasing space, when you do a log backup.
But yes, the short answer is: if your secondary is down, you need a log file as big as it takes for as long as it is down. Once it is back up and synchronized, you may at some point need to remove the database from the availability group to shrink the log if it is humongous and you don't want it that big.
Best Answer
As you noted, the docs refer to "log stream transport" when performing automatic seeding. Similar to the way that you can stream movies to your device without downloading a specific file, automatic seeding streams data to the secondary without storing an actual backup file.
While the technical implementation details aren't exactly this, it's essentially taking a backup, but rather than sending it to a storage device or file share, it sends it to the secondary server. Rather than writing a file, it immediately begins to restore to the secondary server. The pseudo-backup is buffered and restored in a single step, without a literal backup being written to file.
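If you want to watch that stream while it is in flight, one option is to query the sys.dm_hadr_physical_seeding_stats DMV during seeding. This is only a sketch: it assumes the SqlServer module is installed, "YourPrimaryInstance" is a placeholder, and the DMV only returns rows while a seeding operation is actually running:

```powershell
# Requires the SqlServer module; "YourPrimaryInstance" is a placeholder.
Import-Module SqlServer

# Show how much of each database has been streamed to the secondary so far.
$query = @"
SELECT local_database_name,
       role_desc,
       internal_state_desc,
       transferred_size_bytes,
       database_size_bytes
FROM sys.dm_hadr_physical_seeding_stats;
"@

Invoke-Sqlcmd -ServerInstance "YourPrimaryInstance" -Query $query
```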