If you evict nodes you'll end up formatting the nodes and rebuilding them. Once a node has been evicted from the cluster you won't be able to uninstall the SQL Server instance from the node. You can try manually removing all the registry keys and files, but honestly it's probably faster to reinstall Windows and try getting it back into the cluster.
Has Microsoft not been able to get you towards any resolution on this?
Either way you're looking at a cluster rebuild basically from scratch here with 21 SQL installs, which, as I'm sure you know, takes forever. There are some ways to make this take less time: build a new two-node cluster, install everything there, then do a controlled migration between clusters, using DNS to redirect connections to the new names so applications don't need to change, etc.
If this was my environment I'd burn it down and bring it back fresh, obviously with as little downtime as possible.
--sales pitch--
It might be a good idea to bring someone in who has a lot of clustering experience (me, someone else, either way) to help with the rebuild, to make sure that everything is set up exactly as it should be and to help take the pressure off your internal team (who probably haven't been getting a lot of sleep the last two weeks). If you've got a lot of clustering experience in house then this isn't needed, of course. I'm just worried because something went horribly wrong doing something which shouldn't have been that big of a deal to do.
--end sales pitch--
Disclaimer: I'm a consultant.
So just in case someone wants to know what caused this: it was a group policy!
Some time ago, unbeknownst to me, the domain controllers for the domain in question had been upgraded to Server 2012. Along with this came a whole bunch of Windows Server 2012 group policies. Additional policies had been added to one of the parent server OUs with a filter applied.
Unfortunately the filter had a typo, and so it was being applied to all of my database servers on this domain.
I had almost ruled out it being a GP issue, as I seemed to have all the permissions I needed, and I could see connections coming in on the correct ports between each node. The servers were happily running as single nodes!
I asked the server team to move the servers (just to see) to the 'computers' OU, and after a forced gpupdate, bingo!
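For anyone chasing down something similar, forcing a policy refresh and then dumping the resultant set of policy is a quick way to see exactly which GPOs are hitting a node. These are the standard Windows commands (the report path below is just an example):

```shell
:: Force an immediate Group Policy refresh on this node
gpupdate /force

:: Summarise which GPOs applied to the computer (and which were filtered out)
gpresult /scope computer /r

:: Dump the full resultant set of policy to an HTML report,
:: handy for diffing a working node against a broken one
gpresult /h C:\temp\rsop.html
```

Comparing the `gpresult` output between a node in the server OU and one moved to the 'computers' OU would have shown the mis-filtered policies straight away.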
Unfortunately, I am unsure exactly which policy caused the problem as there were a lot! There were a number of them to do with NTLM usage and authentication, and I'm convinced the issue was related to this. I will at some point set up a test lab to try and replicate the issue.
So there you go! Always check group policies, even if you believe (like I did) that nothing had changed!
Best Answer
All nodes in a cluster should be using the same service account for SQL Server. This is true for both a Failover Cluster Instance (FCI) and an Availability Group setup.
For security's sake, consider using Group Managed Service Accounts: https://blogs.technet.microsoft.com/askpfeplat/2012/12/16/windows-server-2012-group-managed-service-accounts/
This will allow you to have a service account with a strong, automatically changing password for your SQL Servers. Auditors love this. Your risk is minimal since Active Directory handles the password change, and no downtime is required to do this.
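As a rough sketch of what the gMSA setup looks like in PowerShell (account name, domain, and node names here are made up, and this assumes the Active Directory module is available and you have domain admin rights):

```powershell
# One-time per forest: create the KDS root key gMSAs depend on.
# The -10h backdate makes it usable immediately in a lab; in production
# just run Add-KdsRootKey and wait the ~10 hours for replication.
Add-KdsRootKey -EffectiveTime ((Get-Date).AddHours(-10))

# Create the gMSA and allow the cluster nodes to retrieve its password
New-ADServiceAccount -Name "sqlsvc" `
    -DNSHostName "sqlsvc.contoso.com" `
    -PrincipalsAllowedToRetrieveManagedPassword "SQLNODE1$","SQLNODE2$"

# On each node: install the account locally and verify it works
Install-ADServiceAccount -Identity "sqlsvc"
Test-ADServiceAccount -Identity "sqlsvc"
```

Then in SQL Server Configuration Manager you'd set the service account to `CONTOSO\sqlsvc$` (note the trailing `$`) with a blank password, and Active Directory takes care of the rotation from there.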