UPDATE 10: The original question and previous updates follow this update below (and will soon be cleaned up into one current update).
Last week be broke the clusters, and now everything is truly a single server for ServerA and a single server for ServerB. We also have dev copies of ServerA (DevSQLServer1) and ServerB (DevSQLServer2) which share the same SQL Service accounts.
We've since setup replication from ServerB to ServerA to side-step the issue in production for now, but we encountered the same issue on our dev environment this week as well.
I tried using xp_cmdshell
to run klist purge
and I cycled the SQL Services for both DevSQLServer1 and DevSQLServer2, but the issue still persisted. At this point, I'm pretty sure the issue is due to the botched Windows update November 10, 2020—KB4586830 (OS Build 14393.4046) that specifically states "you might encounter Kerberos authentication and ticket renewal issues". Unfortunately this seemingly was installed on a multitude of our servers (both SQL Servers and Domain Controllers) and subsequent Windows updates that have ran, remove the botched update from the Windows Update History, so I have no reliable way to tell which servers even originally received the botched update.
Per John Eisbrener's questions this is what we currently see (sorry for all the obfuscation):
ServerB Successful SPN Registration SQL Logs:
ServerB Successful SPN Registration SQL Logs:
- Notice both servers are successfully registering SPNs for themselves
for both port 1433 and no port specified.
ServerA Registered SPNs:
- I'm not sure what the second "CN=" line (second to last line) is but it literally says "CN=ServerASvc", but that's not the name of the SQL Service account. The SQL Service Account is "SQLServiceAccount1".
- I'm also not sure why the last line has an SPN for just the server name "ServerA" (which we do have a Host / CName for in our DC), but seems redundant to the one of the previous lines.
ServerB Registered SPNs:
ServerA's service account is SQLServiceAccount1 and is also the same service account used on our DevSqlServer1.
ServerB's service account is SQLServiceAccount2 and is also the same service account used on our DevSqlServer2.
Running the query SELECT net_transport, auth_scheme FROM sys.dm_exec_connections WHERE session_id = @@SPID
returns TCP
and KERBEROS
on both ServerA and ServerB for me.
** For all intents and purposes, ServerA is really WindowsFailoverClusterA, and ServerB is really WindowsFailoverClusterB. I apologize for just realizing now that this fact might be important too. Also, different SQL Server Instance Service Accounts are running for each one (ServiceAccountA for the servers / instance in WindowsFailoverClusterA, and ServiceAccountB for the servers / instances WindowsFailoverClusterB.)
I have two SQL Server 2016 Standard Edition servers on the same domain. Server A has two linked server connections setup to Server B.
The first linked server connection is using a dedicated remote SQL Login security context. That one always was and still is working fine.
The second connection uses the "login's current security context" which is a Windows Authentication (AD) Login on Server A. This was also working fine up until today.
Today out of nowhere, everyone who connects to Server A and then tries to run a query that uses the second linked server connection, ends up getting hit with this error:
OLE DB provider "SQLNCLI11" for linked server "LinkedServerName"
returned message "Cannot generate SSPI context". Msg -2146893044,
Level 16, State 1, Line 0 SQL Server Network Interfaces: The logon
attempt failed
The only other potentially relevant thing I'm aware of, is a lot of users AD passwords are due to expire this week (due to our security policy). Supposedly some of them have already changed their AD passwords and are still hitting the above error, but I'm not sure how reliable that information is (can't always trust the end users :).
Also the users are able to directly connect to both Server A and Server B fine currently, and even execute queries against objects on Server B directly, but when they try to run queries referencing the same objects on Server B via the linked server on Server A is the only time they get the above error, currently.
Update 1: So far I've found when I changed my AD password (which was due to expire in 2 days) and restarted my machine, it fixed the issue for me. (Logging out and logging back in after changing my AD password did not fix the issue, I had to physically restart my machine.)
Update 2: Per John K. N.'s suggestion, I dove into the SQL Server Error Log for Server A and found the following error that occurred 5 hours ago (this morning EST):
03/17/2021 05:04:39,Server,Unknown,The SQL Server Network Interface
library could not register the Service Principal Name (SPN) [
MSSQLSvc/ServerA.MyDomainName.com:1433 ] for the SQL Server service.
Windows return code: 0x2098 state: 20. Failure to register a SPN
might cause integrated authentication to use NTLM instead of Kerberos.
Also the following 2 errors occurred 1 second before that in the log as well (not sure if relevant):
03/17/2021 05:04:38,Server,Unknown,Implied authentication manager
initialization failed. Implied authentication will be disabled.
03/17/2021 05:04:38,Server,Unknown,InitializeExternalUserGroupSid
failed. Implied authentication will be disabled.
Update 3: After some quick research on the error code from the SPN registration error message above (0x2098) I found this Pinal Dave article which seems to confirm it's a non-issue if I see active Kerberos connections in the sys.dm_exec_connections
DMV, which I currently do. So I guess I'm back to square 1.
Update 4: Scratch the previous update, I think Pinal Dave is wrong in my case. Despite me having active Kerberos connections, the SPN my SQL Error Log mentions it failed to register, is definitely not registered on that machine. When I run a setspn -l ServerA
command in an elevated Console window on that machine, it is not in the list of registered SPNs, so I think my issue might still be related to the above error from Update 2.
Update 5: I'm scrambling, so I'm trying to follow all answers, and am jumping back and forth, apologies for lack of response yesterday.
I realized the setspn -l ServerA
command wasn't really what I wanted to run, and I needed to run setspn -l ServerA_SQLServiceAccountName
after I did that, I did see the SPN that was in my SQL Error Log. So I think Pinal Dave was right in my case. What I don't fully get is the SPN it was complaining about was for the same server that the SQL Server Instance's Service Account is running on (I don't fully understand SPNs, and am confused why the SQL Server Service needs to trust the server it's currently running on.)
Today I woke up and everyone on my domain is hosed again, myself included.
I tried flushing the DNS on ServerA, and now we're receiving the error "Linked Server error: Login Failed for user 'NT AUTHORITY\ANONYMOUS LOGON'" instead of my original error, when querying the linked server from ServerA.
UPDATE 6: It appears I am facing some sort of SPN / double-hop issue because when I RDP directly to ServerA and connect to my SQL Instance on ServerA (from within the RDP session – same account as I'm connecting to the instance when not RDP'ed in) and try querying the linked server, everything is all good. The question is what SPN am I missing?
To recap my SPNs:
- My ServerA's SQL Server Instance's Service Account (let's call it ServiceAccountA) has an SPN to trust its own server.
- My ServerB's SQL Server Instance's Service Account (ServiceAccountB) has an SPN to trust its own server.
- Delegation is setup for ServiceAccountA to be able to access the SPN for ServiceAccountB.
UPDATE 7: The night before this issue started happening the primary node in WindowsFailoverClusterB was purposefully failed over so Windows updates could be installed on it (and a few of our other servers). Specifically this update November 10, 2020—KB4586830 (OS Build 14393.4046)
which apparently states this:
After installing this update on domain controllers (DCs) and read-only
domain controllers (RODCs) in your environment, you might encounter
Kerberos authentication and ticket renewal issues. This is caused by
an issue in how CVE-2020-17049 was addressed in these updates.
Sounds suspect. It also got installed on the other node of WindowsFailoverClusterB (the current primary) but we rolled it back. Things seemed to start working again, but then started breaking again for just our web server (which hits our WindowsFailoverClusterA to query the linked server to WindowsFailoverClusterB).
In addition to that, this DBA.StackExchange answer by Hannah Vernon sounds very similar to what we experienced since we have Windows Failover Clusters setup on top of these SQL Servers.
The only difference is we previously weren't letting the Service Accounts register / deregister the SPNs, when the issue started. Now we are letting them do so, and there haven't been any failover events, and we triple checked all SPNs are setup correctly currently. So I don't think that's our issue here.
UPDATE 8: Still down on day 3, as mentioned above, we triple checked the SPNs, and the AD delegation properties of the Service Accounts. All ducks are in order. We're fully down again: web server, SSMS, etc, no matter how we try to query the linked server, which I think is due to the default 10 hour ticket time of Kerberos lapsing, and it not correctly renewing a new instance of the same ticket. We're currently facing this error:
Msg 18456, Level 14, State 1, Line 1 Login failed for user 'NT
AUTHORITY\ANONYMOUS LOGON'.
We've been trying to troubleshoot using the Kerberos Configuration Manager and the only issue it found (this entire time) is a warning saying "Kerberos Configuration Manager reports that "TCP must be enabled to use Kerberos Authentication":
We've checked TCP is enabled in all nodes of both of our clusters, so I'm not sure why it's saying this warning or how to troubleshoot it, and if it's our root issue or not.
UPDATE 9:
Another thing worth noting is we've only been intermittently down for most of the time. That is, some users are able to successfully query the linked server when using our applications (web, mobile, etc) and some aren't. So if it was a true SPN issue, I would think no one would be able to query the linked server – an all or nothing situation. This definitely feels more like either a random Kerberos issue (possibly related to the Windows Update I linked in my Update 7) or something with how our Clusters are referencing / being referenced / complicating things because they're Clusters. 🙁
At this point, we've tried purge
on our KLIST Kerberos tickets, ipconfig /flushdns
both on the client machine and servers (SQL Servers, DNS Servers, and Web Server), and have tried restarting the client machines and server machines, in addition to everything else previously mentioned, to no avail. At best we've gotten very intermittent mixed results with sometimes restarting the client machine sort of works, temporarily.
I think the fact that this question was upvoted and bookmarked a few times in only a couple days shows I'm not the only one facing this issue too unfortunately. So hopefully the solution can prove useful to multiple people.
Best Answer
You have already stated, respectively listed what the issue could be:
So what is SSPI? Microsoft describes it as:
Reference: Security Support Provider Interface (SSPI) (Microsoft Docs)
Basically it's how an application authenticates itself, the user or other components against an endpoint asking for authentication.
In the Architectural Overview there is the description that ...
Reference: SSPI Architectural Overview (Microsoft Docs)
NTLM
When your old application starts in the other domain it will pass on the old NTLM SID value and compare it with what is stored in the SQL Server instance.
If there is a match [Domain/User] with a SQL Server Login [Domain/User] then SQL Server will let the user login to the relevant databases. SQL Server will do the same for a [Domain/User] that is assigned to a SQL Server Windows Group Login.
Kerberos
Kerberos relies on the ability of a user or service account being able to generate a token (of a request) and to compare this with a central provider to determine if a user is who he/she/them claim to be.
emphasis mine
Reference 1.1 What is Kerberos and How Does it Work? (MIT Kerberos V 5.0)
Explanation Summary
So your Server A is trying to access Server B via a Kerberos Authentication but failling to succeed, because the SPN could be missing.
To achieve a successful authentication the Server(s) should have a Service Principal Name (SPN). Which is described as:
And a bit further down the possible root cause:
Reference Service Principal Names (Microsoft Docs)
To Cut a Long Story Short
If either your application (host) and/or your SQL Server Instance are unable to register a Service Principal Name then your application/server will be unable to authenticate users via Kerberos and your linked server may fail.
Check the ERRORLOG of your SQL Server instance and verify that the SQL Server instance was able to register its own SPN. There will be an entry during the instance startup procedure.
The SQL Server Service Account can do this if it is a Domain Service Account and some other conditions are met. Some being:
and
Reference: Register a Service Principal Name for Kerberos Connections (Microsoft Docs)
Output of successful SPN registration
If the service can't register the SPN, then:
Output of Unsuccessful SPN registration
Once your SQL Server instance has its SPN registered, then you should be up and running.
Solutions
Manual SPN Registration
Well you have to be a Domain Administrator or allowed to change Computer Account objects on the OU of the domain. Then you can run SETSPN with the relevant parameters:
Reference: Manual SPN Registration (Microsoft Docs)
Once an SPN has been configured for a service account, then you should be able to start the server again and access the database.
SPN registration requires a restart of the service.
Modifiy OU Object's Permissions
In order for the SQL Server's Service Account to create a SPN automatically in Active Directory it (the service account) should have the permissions to modify its own object (OU).
Reference: Register a Service Principal Name for Kerberos Connections (Microsoft Docs)
Assigning the Permissions
Reference: To grant permission to modify SPNs (Microsoft Docs)
SPN registration requires a restart of the service. You should see a successful SPN registration in the ERRORLOG of the permissions are set correctly
Increase Kerberos Token Size
In some cases an account can belong to too many objects/AD groups/ ... which will result in a rather larger Kerberos Token. In these cases you might have to increase the max token size.... Symptoms may include:
But
Reference: Problems with Kerberos authentication when a user belongs to many groups (Microsoft | Docs)
Fix
Reference: How to use Group Policy to add the MaxTokenSize registry entry to multiple computers (Microsoft | Docs)
Solution Based on Referenced Answer and Multiple Updates to Question
After reading through all the additional information in the question and other answers, I have come to the conclusion that all possible solutions have been looked at. I think this Q & A could possibly be closed as duplicate of the question which you referenced with Hannah Vernon's answer.
However I would like to provide a solution on how to apply a fix for this issue.
Reference: Register a Service Principal Name for Kerberos Connections (Microsoft | SQL Docs)
Interesting Bit of Information
Reference: Register a SPN for SQL Server Authentication with Kerberos (MSSQLTips.com)
So the solution would be to actdually revoke the Service Account the ability to modify its own object, which would deny it the possibility of removing an SPN during failover.
Start Active Directory Computers and Users and make sure you are looking at the Advanced View version.
Open up the Service Accounts properties and switch to the Security tab.
Select the Self item from the Group- or Username section and look at the Permissions for SELF.
Remove the Full Access and/or Write and/or Write servicePrincipalName and/or Write Public Information permission if they are selected.
The Write servicePrincipalName might be located in the Advanced section of the Permissions (click on the button Advanced).
This should prevent further issues when the cluster does a failover because instances are restarted, which would result in the SPN being de-registered.