Sql-server – Linked Server connection started failing with “Cannot generate SSPI context … SQL Server Network Interfaces: The logon attempt failed” error

errorslinked-serverlogonsql serversql-server-2016

UPDATE 10: The original question and previous updates follow this update below (and will soon be cleaned up into one current update).

Last week be broke the clusters, and now everything is truly a single server for ServerA and a single server for ServerB. We also have dev copies of ServerA (DevSQLServer1) and ServerB (DevSQLServer2) which share the same SQL Service accounts.

We've since setup replication from ServerB to ServerA to side-step the issue in production for now, but we encountered the same issue on our dev environment this week as well.

I tried using xp_cmdshell to run klist purge and I cycled the SQL Services for both DevSQLServer1 and DevSQLServer2, but the issue still persisted. At this point, I'm pretty sure the issue is due to the botched Windows update November 10, 2020—KB4586830 (OS Build 14393.4046) that specifically states "you might encounter Kerberos authentication and ticket renewal issues". Unfortunately this seemingly was installed on a multitude of our servers (both SQL Servers and Domain Controllers) and subsequent Windows updates that have ran, remove the botched update from the Windows Update History, so I have no reliable way to tell which servers even originally received the botched update.

Per John Eisbrener's questions this is what we currently see (sorry for all the obfuscation):

ServerB Successful SPN Registration SQL Logs:

Notice both servers are successfully registering SPNs for themselves
for both port 1433 and no port specified.

ServerA Registered SPNs:

I'm not sure what the second "CN=" line (second to last line) is but it literally says "CN=ServerASvc", but that's not the name of the SQL Service account. The SQL Service Account is "SQLServiceAccount1".
I'm also not sure why the last line has an SPN for just the server name "ServerA" (which we do have a Host / CName for in our DC), but seems redundant to the one of the previous lines.

ServerB Registered SPNs:

ServerA's service account is SQLServiceAccount1 and is also the same service account used on our DevSqlServer1.

ServerB's service account is SQLServiceAccount2 and is also the same service account used on our DevSqlServer2.

Running the query SELECT net_transport, auth_scheme FROM sys.dm_exec_connections WHERE session_id = @@SPID returns TCP and KERBEROS on both ServerA and ServerB for me.

** For all intents and purposes, ServerA is really WindowsFailoverClusterA, and ServerB is really WindowsFailoverClusterB. I apologize for just realizing now that this fact might be important too. Also, different SQL Server Instance Service Accounts are running for each one (ServiceAccountA for the servers / instance in WindowsFailoverClusterA, and ServiceAccountB for the servers / instances WindowsFailoverClusterB.)

I have two SQL Server 2016 Standard Edition servers on the same domain. Server A has two linked server connections setup to Server B.

The first linked server connection is using a dedicated remote SQL Login security context. That one always was and still is working fine.

The second connection uses the "login's current security context" which is a Windows Authentication (AD) Login on Server A. This was also working fine up until today.

Today out of nowhere, everyone who connects to Server A and then tries to run a query that uses the second linked server connection, ends up getting hit with this error:

OLE DB provider "SQLNCLI11" for linked server "LinkedServerName"
returned message "Cannot generate SSPI context". Msg -2146893044,
Level 16, State 1, Line 0 SQL Server Network Interfaces: The logon
attempt failed

The only other potentially relevant thing I'm aware of, is a lot of users AD passwords are due to expire this week (due to our security policy). Supposedly some of them have already changed their AD passwords and are still hitting the above error, but I'm not sure how reliable that information is (can't always trust the end users :).

Also the users are able to directly connect to both Server A and Server B fine currently, and even execute queries against objects on Server B directly, but when they try to run queries referencing the same objects on Server B via the linked server on Server A is the only time they get the above error, currently.

Update 1: So far I've found when I changed my AD password (which was due to expire in 2 days) and restarted my machine, it fixed the issue for me. (Logging out and logging back in after changing my AD password did not fix the issue, I had to physically restart my machine.)

Update 2: Per John K. N.'s suggestion, I dove into the SQL Server Error Log for Server A and found the following error that occurred 5 hours ago (this morning EST):

03/17/2021 05:04:39,Server,Unknown,The SQL Server Network Interface
library could not register the Service Principal Name (SPN) [
MSSQLSvc/ServerA.MyDomainName.com:1433 ] for the SQL Server service.
Windows return code: 0x2098 state: 20. Failure to register a SPN
might cause integrated authentication to use NTLM instead of Kerberos.

Also the following 2 errors occurred 1 second before that in the log as well (not sure if relevant):

03/17/2021 05:04:38,Server,Unknown,Implied authentication manager
initialization failed. Implied authentication will be disabled.

03/17/2021 05:04:38,Server,Unknown,InitializeExternalUserGroupSid
failed. Implied authentication will be disabled.

Update 3: After some quick research on the error code from the SPN registration error message above (0x2098) I found this Pinal Dave article which seems to confirm it's a non-issue if I see active Kerberos connections in the sys.dm_exec_connections DMV, which I currently do. So I guess I'm back to square 1.

Update 4: Scratch the previous update, I think Pinal Dave is wrong in my case. Despite me having active Kerberos connections, the SPN my SQL Error Log mentions it failed to register, is definitely not registered on that machine. When I run a setspn -l ServerA command in an elevated Console window on that machine, it is not in the list of registered SPNs, so I think my issue might still be related to the above error from Update 2.

Update 5: I'm scrambling, so I'm trying to follow all answers, and am jumping back and forth, apologies for lack of response yesterday.

I realized the setspn -l ServerA command wasn't really what I wanted to run, and I needed to run setspn -l ServerA_SQLServiceAccountName after I did that, I did see the SPN that was in my SQL Error Log. So I think Pinal Dave was right in my case. What I don't fully get is the SPN it was complaining about was for the same server that the SQL Server Instance's Service Account is running on (I don't fully understand SPNs, and am confused why the SQL Server Service needs to trust the server it's currently running on.)

Today I woke up and everyone on my domain is hosed again, myself included.
I tried flushing the DNS on ServerA, and now we're receiving the error "Linked Server error: Login Failed for user 'NT AUTHORITY\ANONYMOUS LOGON'" instead of my original error, when querying the linked server from ServerA.

UPDATE 6: It appears I am facing some sort of SPN / double-hop issue because when I RDP directly to ServerA and connect to my SQL Instance on ServerA (from within the RDP session – same account as I'm connecting to the instance when not RDP'ed in) and try querying the linked server, everything is all good. The question is what SPN am I missing?

To recap my SPNs:

My ServerA's SQL Server Instance's Service Account (let's call it ServiceAccountA) has an SPN to trust its own server.
My ServerB's SQL Server Instance's Service Account (ServiceAccountB) has an SPN to trust its own server.
Delegation is setup for ServiceAccountA to be able to access the SPN for ServiceAccountB.

UPDATE 7: The night before this issue started happening the primary node in WindowsFailoverClusterB was purposefully failed over so Windows updates could be installed on it (and a few of our other servers). Specifically this update November 10, 2020—KB4586830 (OS Build 14393.4046)
which apparently states this:

After installing this update on domain controllers (DCs) and read-only
domain controllers (RODCs) in your environment, you might encounter
Kerberos authentication and ticket renewal issues. This is caused by
an issue in how CVE-2020-17049 was addressed in these updates.

Sounds suspect. It also got installed on the other node of WindowsFailoverClusterB (the current primary) but we rolled it back. Things seemed to start working again, but then started breaking again for just our web server (which hits our WindowsFailoverClusterA to query the linked server to WindowsFailoverClusterB).

In addition to that, this DBA.StackExchange answer by Hannah Vernon sounds very similar to what we experienced since we have Windows Failover Clusters setup on top of these SQL Servers.

The only difference is we previously weren't letting the Service Accounts register / deregister the SPNs, when the issue started. Now we are letting them do so, and there haven't been any failover events, and we triple checked all SPNs are setup correctly currently. So I don't think that's our issue here.

UPDATE 8: Still down on day 3, as mentioned above, we triple checked the SPNs, and the AD delegation properties of the Service Accounts. All ducks are in order. We're fully down again: web server, SSMS, etc, no matter how we try to query the linked server, which I think is due to the default 10 hour ticket time of Kerberos lapsing, and it not correctly renewing a new instance of the same ticket. We're currently facing this error:

Msg 18456, Level 14, State 1, Line 1 Login failed for user 'NT
AUTHORITY\ANONYMOUS LOGON'.

We've been trying to troubleshoot using the Kerberos Configuration Manager and the only issue it found (this entire time) is a warning saying "Kerberos Configuration Manager reports that "TCP must be enabled to use Kerberos Authentication":

We've checked TCP is enabled in all nodes of both of our clusters, so I'm not sure why it's saying this warning or how to troubleshoot it, and if it's our root issue or not.

UPDATE 9:

Another thing worth noting is we've only been intermittently down for most of the time. That is, some users are able to successfully query the linked server when using our applications (web, mobile, etc) and some aren't. So if it was a true SPN issue, I would think no one would be able to query the linked server – an all or nothing situation. This definitely feels more like either a random Kerberos issue (possibly related to the Windows Update I linked in my Update 7) or something with how our Clusters are referencing / being referenced / complicating things because they're Clusters. 🙁

At this point, we've tried purge on our KLIST Kerberos tickets, ipconfig /flushdns both on the client machine and servers (SQL Servers, DNS Servers, and Web Server), and have tried restarting the client machines and server machines, in addition to everything else previously mentioned, to no avail. At best we've gotten very intermittent mixed results with sometimes restarting the client machine sort of works, temporarily.

I think the fact that this question was upvoted and bookmarked a few times in only a couple days shows I'm not the only one facing this issue too unfortunately. So hopefully the solution can prove useful to multiple people.

Best Answer

You have already stated, respectively listed what the issue could be:

OLE DB provider "SQLNCLI11" for linked server "LinkedServerName" returned message "Cannot generate SSPI context". Msg -2146893044, Level 16, State 1, Line 0 SQL Server Network Interfaces: The logon attempt failed

So what is SSPI? Microsoft describes it as:

In conjunction with its operating systems, Microsoft offers the Security Support Provider Interface (SSPI). The SSPI provides a universal, industry-standard interface for secure distributed applications.

^{Reference: Security Support Provider Interface (SSPI) (Microsoft Docs)}

Basically it's how an application authenticates itself, the user or other components against an endpoint asking for authentication.

In the Architectural Overview there is the description that ...

A variety of SSPs and packages are available. Windows ships with the NTLM security package and the Microsoft Kerberos protocol security package. In addition, you may choose to install the Secure Socket Layer (SSL) security package, or any other SSPI-compatible SSP.

^{Reference: SSPI Architectural Overview (Microsoft Docs)}

NTLM

When your old application starts in the other domain it will pass on the old NTLM SID value and compare it with what is stored in the SQL Server instance.

If there is a match [Domain/User] with a SQL Server Login [Domain/User] then SQL Server will let the user login to the relevant databases. SQL Server will do the same for a [Domain/User] that is assigned to a SQL Server Windows Group Login.

Kerberos

Kerberos relies on the ability of a user or service account being able to generate a token (of a request) and to compare this with a central provider to determine if a user is who he/she/them claim to be.

_{emphasis mine}

Kerberos V5 is based on the Kerberos authentication system developed at MIT. Under Kerberos, a client (generally either a user or a service) sends a request for a ticket to the Key Distribution Center (KDC). The KDC creates a ticket-granting ticket (TGT) for the client, encrypts it using the client's password as the key, and sends the encrypted TGT back to the client. The client then attempts to decrypt the TGT, using its password. If the client successfully decrypts the TGT (i.e., if the client gave the correct password), it keeps the decrypted TGT, which indicates proof of the client's identity.

^{Reference 1.1 What is Kerberos and How Does it Work? (MIT Kerberos V 5.0)}

Explanation Summary

So your Server A is trying to access Server B via a Kerberos Authentication but failling to succeed, because the SPN could be missing.

To achieve a successful authentication the Server(s) should have a Service Principal Name (SPN). Which is described as:

A service principal name (SPN) is a unique identifier of a service instance. SPNs are used by Kerberos authentication to associate a service instance with a service logon account. This allows a client application to request that the service authenticate an account even if the client does not have the account name.

And a bit further down the possible root cause:

If you install multiple instances of a service on computers throughout a forest, each instance must have its own SPN. A given service instance can have multiple SPNs if there are multiple names that clients might use for authentication. For example, an SPN always includes the name of the host computer on which the service instance is running, so a service instance might register an SPN for each name or alias of its host. For more information about SPN format and composing a unique SPN, see Name Formats for Unique SPNs.

^{Reference Service Principal Names (Microsoft Docs)}

To Cut a Long Story Short

If either your application (host) and/or your SQL Server Instance are unable to register a Service Principal Name then your application/server will be unable to authenticate users via Kerberos and your linked server may fail.

Check the ERRORLOG of your SQL Server instance and verify that the SQL Server instance was able to register its own SPN. There will be an entry during the instance startup procedure.

The SQL Server Service Account can do this if it is a Domain Service Account and some other conditions are met. Some being:

The client and server computers must be part of the same Windows domain, or in trusted domains.

and

A Service Principal Name (SPN) must be registered with Active Directory, which assumes the role of the Key Distribution Center in a Windows domain. The SPN, after it's registered, maps to the Windows account that started the SQL Server instance service. If the SPN registration hasn't been performed or fails, the Windows security layer can't determine the account associated with the SPN, and Kerberos authentication isn't used.

^{Reference: Register a Service Principal Name for Kerberos Connections (Microsoft Docs)}

Output of successful SPN registration

2020-12-01 11:14:18.87 Server      
SQL Server is attempting to register a Service Principal Name (SPN) for the SQL Server service. 
Kerberos authentication will not be possible until a SPN is registered for the SQL Server service. 
This is an informational message. No user action is required.
...
2020-12-01 11:14:18.94 Server      
The SQL Server Network Interface library successfully registered the Service Principal Name (SPN) 
[ MSSQLSvc/SERVERNAME.TG.CH ] for the SQL Server service.

If the service can't register the SPN, then:

Note

If the server can't automatically register the SPN, the SPN must be registered manually. See Manual SPN Registration.

Output of Unsuccessful SPN registration

2020-11-27 15:41:56.24 Server      
SQL Server is attempting to register a Service Principal Name (SPN) for the SQL Server service. 
Kerberos authentication will not be possible until a SPN is registered for the SQL Server service. 
This is an informational message. No user action is required.
...
2020-11-27 15:41:56.43 Server      
The SQL Server Network Interface library could not register the Service Principal Name (SPN) 
[ MSSQLSvc/SERVERNAME:INSTANCENAME ] for the SQL Server service. 
Windows return code: 0x200b, state: 15.
...
2020-11-27 15:41:56.43 Server      
The SQL Server Network Interface library could not register the Service Principal Name (SPN) 
[ MSSQLSvc/SERVERNAME:INSTANCEPORT ] for the SQL Server service. 
Windows return code: 0x200b, state: 15.

Once your SQL Server instance has its SPN registered, then you should be up and running.

Solutions

Manual SPN Registration

Well you have to be a Domain Administrator or allowed to change Computer Account objects on the OU of the domain. Then you can run SETSPN with the relevant parameters:

setspn -A MSSQLSvc/host.domain.tld:1433 domain\accountname

^{Reference: Manual SPN Registration (Microsoft Docs)}

Once an SPN has been configured for a service account, then you should be able to start the server again and access the database.

SPN registration requires a restart of the service.

Modifiy OU Object's Permissions

In order for the SQL Server's Service Account to create a SPN automatically in Active Directory it (the service account) should have the permissions to modify its own object (OU).

When the Database Engine service starts, it attempts to register the Service Principal Name (SPN). Suppose the account starting SQL Server doesn't have permission to register an SPN in Active Directory Domain Services. In that case, this call fails, and a warning message is logged in the Application event log as well as the SQL Server error log. To register the SPN, the Database Engine must be running under a built-in account, such as Local System (not recommended), or NETWORK SERVICE, or an account that has permission to register an SPN. You can register an SPN using a domain administrator account, but this is not recommended in a production environment. When SQL Server runs on the Windows 7 or Windows Server 2008 R2 operating system, you can run SQL Server using a virtual account or a managed service account (MSA). Both virtual accounts and MSA's can register an SPN. If SQL Server isn't running under one of these accounts, the SPN isn't registered at startup, and the domain administrator must register the SPN manually.

^{Reference: Register a Service Principal Name for Kerberos Connections (Microsoft Docs)}

Assigning the Permissions

To grant permission to modify SPNs

Open Active Directory Users and Computers. To open Active Directory Users and Computers, click Start, click Run, type dsa.msc, and then press ENTER.

Click View, and verify that the Advanced Features check box is selected.

Click Advanced Features, if it is not selected. If the domain to which you want to allow a disjoint namespace does not appear in the console, take the following steps:

In the console tree, right-click Active Directory Users and Computers, and then click Connect to Domain.

In the Domain box, type the name of the Active Directory domain to which you want to allow the disjoint namespace, and then click OK. As an alternative, you can use the Browse button to locate the Active Directory domain.

In the console tree, right-click the node that represents the domain to which you want to allow a disjoint namespace, and then click Properties.

On Security tab, click Advanced.

On the Permissions tab, click Add.

In Enter the object name to select, type the group or user account name to which you want to delegate permission, and then click OK.

Configure the Apply onto box for Computer objects.

At the bottom of the Permissions box, select the Allow check box that corresponds to the Validated write to service principal name permissions, and then click OK on the three open dialog boxes to confirm your changes.

Close Active Directory Users and Computers.

^{Reference: To grant permission to modify SPNs (Microsoft Docs)}

SPN registration requires a restart of the service. You should see a successful SPN registration in the ERRORLOG of the permissions are set correctly

Increase Kerberos Token Size

In some cases an account can belong to too many objects/AD groups/ ... which will result in a rather larger Kerberos Token. In these cases you might have to increase the max token size.... Symptoms may include:

A user who belongs to a large number of security groups has problems authenticating. When authenticating, the user may see a message such as HTTP 400 > - Bad Request (Request Header too long). The user also has problems accessing resources, and the user's Group Policy settings may not update correctly.

But

Under similar conditions, Windows NTLM authentication works as expected. You may not see the Kerberos authentication problem unless you analyze the Windows behavior. However, in such scenarios, Windows may not be able to update Group Policy settings.

^{Reference: Problems with Kerberos authentication when a user belongs to many groups (Microsoft | Docs)}

Fix

On a domain controller that is running Windows Server 2003, Windows Server 2008, Windows Server 2008 R2, or Windows Server 2012, you can use Group Policy to add the following registry entry to multiple computers:
Key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters
Entry: MaxTokenSize
Data type: REG_DWORD
Value: 48000

^{Reference: How to use Group Policy to add the MaxTokenSize registry entry to multiple computers (Microsoft | Docs)}

Solution Based on Referenced Answer and Multiple Updates to Question

After reading through all the additional information in the question and other answers, I have come to the conclusion that all possible solutions have been looked at. I think this Q & A could possibly be closed as duplicate of the question which you referenced with Hannah Vernon's answer.

However I would like to provide a solution on how to apply a fix for this issue.

When an instance of the SQL Server Database Engine starts, SQL Server tries to register the SPN for the SQL Server service. When the instance is stopped, SQL Server tries to unregister the SPN. For a TCP/IP connection, the SPN is registered in the format MSSQLSvc/:.Both named instances and the default instance are registered as MSSQLSvc, relying on the value to differentiate the instances.

For other connections that support Kerberos the SPN is registered in the format MSSQLSvc// for a named instance. The format for registering the default instance is MSSQLSvc/.

Manual intervention might be required to register or unregister the SPN if the service account lacks the permissions that are required for these actions.

^{Reference: Register a Service Principal Name for Kerberos Connections (Microsoft | SQL Docs)}

Interesting Bit of Information

One thing that should be noted is granting these rights is not recommended (see http://support.microsoft.com/kb/319723) if SQL Server is clustered or if you have multiple domain controllers as latency in Active Directory replication can cause connectivity issues with your SQL Server instance.

^{Reference: Register a SPN for SQL Server Authentication with Kerberos (MSSQLTips.com)}

So the solution would be to actdually revoke the Service Account the ability to modify its own object, which would deny it the possibility of removing an SPN during failover.

Start Active Directory Computers and Users and make sure you are looking at the Advanced View version.
Open up the Service Accounts properties and switch to the Security tab.
Select the Self item from the Group- or Username section and look at the Permissions for SELF.
Remove the Full Access and/or Write and/or Write servicePrincipalName and/or Write Public Information permission if they are selected.
The Write servicePrincipalName might be located in the Advanced section of the Permissions (click on the button Advanced).

This should prevent further issues when the cluster does a failover because instances are restarted, which would result in the SPN being de-registered.