Well, one thing you may want to consider is upgrading a test instance that contains the same database to the newer service pack and validating that your application(s) still work against it. Changes are usually backward compatible and problems are rare, but there could be behavior changes, for example to the optimizer, whose effect on your application is impossible to predict exactly without testing. The risk is certainly lower with a service pack than with a version upgrade, but installing a service pack on a cluster does not have a trivial "undo" button, so in this scenario backups won't really help you directly (they are still an absolutely sound idea, of course).
Clustering is complex, and there are lots of moving parts (no pun intended). Let me try to break this down into more manageable chunks:
From a terminology perspective, there's your Windows Server Failover Cluster (WSFC) and your SQL Server Failover Cluster Instances (FCIs). I avoid saying just "cluster" and use these acronyms instead to prevent ambiguity.
Quorum:
The quorum is the number of votes necessary to transact business on your WSFC. Depending on your WSFC configuration, voters can be nodes (servers), a disk witness, or a file share witness. You need more than 50% of the votes for the WSFC to stay online. If you lose 50% or more of your votes, the WSFC and all clustered services (including your FCIs) will go offline and won't come back until you regain (or force) quorum.
In your configuration, you have two nodes and one file share, for a total of three votes, so any one of those voters can go offline and the remaining two still form a majority. When you lost the file share, you still had both nodes online (two of three votes), so your WSFC and all clustered services stayed online.
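The majority rule above reduces to simple arithmetic. The sketch below is a simplified model assuming static vote counting only: it deliberately ignores features such as dynamic quorum in newer Windows Server versions, which can adjust votes at runtime. The function names and voter labels are illustrative, not part of any Windows API.

```python
"""Simplified majority-quorum check for a WSFC (static votes only)."""


def has_quorum(online_votes: int, total_votes: int) -> bool:
    """True when strictly more than half of all configured votes are online."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    if not 0 <= online_votes <= total_votes:
        raise ValueError("online_votes must be between 0 and total_votes")
    # Integer arithmetic for "online > total / 2" avoids float edge cases.
    return 2 * online_votes > total_votes


def cluster_has_quorum(voters: dict[str, bool]) -> bool:
    """voters maps each voter (node or witness) to whether it is online."""
    return has_quorum(sum(voters.values()), len(voters))


if __name__ == "__main__":
    # Two nodes plus a file share witness: three votes in total.
    voters = {"Node1": True, "Node2": True, "FileShareWitness": False}
    print(cluster_has_quorum(voters))  # True: 2 of 3 votes is a majority

    voters["Node2"] = False
    print(cluster_has_quorum(voters))  # False: 1 of 3 votes is not
```

Under this static model, a two-node WSFC with no witness that loses one node is left with exactly 50% of the votes, which is not a majority; that is the usual reason for adding a witness as a third voter.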
Cluster Owner/Host Server:
When you say that "Node2 was now specified as the active node by Windows", I suspect you are referring to the "Current Host Server" for the cluster. So what is that?
Your WSFC has its own network name and IP address. That name and IP address are owned by exactly one node at a time, and that can be any node in your cluster. This is part of your WSFC, not of your FCIs.
In your scenario, you have three FCIs on a two-node WSFC. It would be perfectly valid to have one FCI on Node1 and two FCIs on Node2, and the "Current Host Server" for the WSFC could be either node. SQL Server won't care.
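A toy model of this independence, using hypothetical instance names FCI1 to FCI3: the node owning the WSFC name and IP (the "Current Host Server") and the owner of each FCI are treated as separate choices. It deliberately ignores possible/preferred owner settings and resource dependencies, so it sketches the ownership idea only, not how Windows actually places groups.

```python
from dataclasses import dataclass, replace
from itertools import product

NODES = ("Node1", "Node2")
FCIS = ("FCI1", "FCI2", "FCI3")  # hypothetical instance names


@dataclass(frozen=True)
class ClusterState:
    # Node that currently owns the WSFC network name and IP address.
    host_server: str
    # (FCI name, owning node) pairs; each FCI is owned independently.
    fci_owners: tuple[tuple[str, str], ...]


def all_states() -> list[ClusterState]:
    """Every combination of host server and FCI owners on a two-node WSFC."""
    states = []
    for host in NODES:
        for owners in product(NODES, repeat=len(FCIS)):
            states.append(ClusterState(host, tuple(zip(FCIS, owners))))
    return states


if __name__ == "__main__":
    print(len(all_states()))  # 2 host choices * 2**3 FCI placements = 16

    before = ClusterState(
        "Node1", (("FCI1", "Node1"), ("FCI2", "Node2"), ("FCI3", "Node2"))
    )
    after = replace(before, host_server="Node2")  # only the WSFC name/IP moves
    print(after.fci_owners == before.fci_owners)  # True: no FCI changed owner
```

Moving the host server changes only the host_server field, which mirrors why the FCIs and their databases were unaffected.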
So what happened? As you said, there were no adverse effects on the databases. I'd expect that, because SQL Server isn't tied to the WSFC host server. I wouldn't necessarily have expected the host server to move when the file share failed, but I'd let your Windows guys dig into that more. From a SQL perspective, everything worked as expected.
First, make sure both logins and jobs are fully synchronised between your instances. There are different ways to achieve this.
I've had good results using Ola Hallengren's maintenance jobs for this.
Always point to the listener instead of individual IPs or server names.
Inside every step of every job, I test whether I am on the primary server of my availability group (currently called sqlprodag), so the job only runs on the primary server.
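A minimal sketch of such a check, written as a small Python helper that wraps a job step's T-SQL in a guard. It assumes the primary is identified by comparing primary_replica from sys.dm_hadr_availability_group_states with @@SERVERNAME; guard_job_step and dbo.usp_NightlyLoad are hypothetical names, and this is one common pattern rather than the author's exact code.

```python
import textwrap


def guard_job_step(ag_name: str, step_sql: str) -> str:
    """Return T-SQL that runs step_sql only when this instance hosts the
    primary replica of the availability group named ag_name."""
    # Double any single quotes so the name is safe inside an N'...' literal.
    safe_name = ag_name.replace("'", "''")
    body = textwrap.indent(step_sql.strip(), "    ")
    return (
        "IF EXISTS (\n"
        "    SELECT 1\n"
        "    FROM sys.dm_hadr_availability_group_states AS ags\n"
        "    JOIN sys.availability_groups AS ag\n"
        "        ON ag.group_id = ags.group_id\n"
        f"    WHERE ag.name = N'{safe_name}'\n"
        "      AND ags.primary_replica = @@SERVERNAME\n"
        ")\n"
        "BEGIN\n"
        f"{body}\n"
        "END\n"
    )


if __name__ == "__main__":
    # Hypothetical step body; paste the printed T-SQL into the job step.
    print(guard_job_step("sqlprodag", "EXEC dbo.usp_NightlyLoad;"))
```

On a secondary the IF condition is false, so the step completes successfully without doing any work. On SQL Server 2014 and later, sys.fn_hadr_is_primary_replica('YourDatabase') offers a similar check per database rather than per availability group.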
For the failover itself, see the SQL Server documentation on performing a planned manual failover of an availability group.