Sql-server – Intermittent SQL Delegated Authentication Errors

active-directorykerberossql serverwindows-server

Here is an odd issue that is causing me to lose all my hair.

Every few days or so our delegated authentication from one SQL server to another fails with the error:

Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'. The SQL server logs show: "Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'. Reason: Could not find a login matching the name provided. [CLIENT: 10.1.x.x] Error: 18456, Severity: 14, State: 5."

Our environment consists of 6 Domain controllers ranging from Windows 2012 to Windows 2019, The SQL servers are Windows Server 2016 running SQL Server 2017.

After a few hours the issue goes away by itself, unless the Kerberos tickets are purged on either one of the affected SQL servers which resolves the issue in about 15 minutes.

At first we thought it was an issue with a particular SQL server so we replaced it, however over the last year it has affected several servers with several different delegated connections.

To date we have tried updating SQL server, deleting and recreating SPNs, upgrading the bulk of our DCs to Windows 2019, updating the domain functionality level, and multiple Microsoft tickets. We have a suspicion that leaving SSMS sessions to the SQL Servers overnight triggers the Kerberos delegated connections to fail more often but it may be superstition at this point.

Any ideas? It works 80% of the time which has made it especially tricky to hunt down. The domain controllers appear to be syncing normally, but I'm not sure what to look for in the case that the DCs are the problem.

Oddly, we are able to connect to the individual servers via protected AD accounts during the outage and the log ins are listed as Kerberos, it's just the AD delegated permissions that stop working.

We do the DC maintenance over the weekends and the issue seems to pop up at any time of any day.

Best Answer

If your SQL Servers are clustered, and their service accounts are configured to allow them to manage their own Service Principal Names, you might be running into a known issue.

In an environment with physically distanced Domain Controllers, a planned failover event can cause the Kerberos delegation for double-hop authentication to fail with symptoms matching the error message in your question.

When SQL Server shutdown occurs, the SQL Server service attempts to deregister the Kerberos SPN with a Domain Controller. Then cluster failover occurs, and the SQL Server service is started up on one of the other nodes. This can happen quickly enough that the other Domain Controllers in the organization haven't yet received the synchronization for the removal of the SPN. At this point, SQL Server will try to register the SPN, but can't because it hasn't yet been removed on all Domain Controllers. As a result, the attempt at registering the SPN fails. SQL Server starts up normally, accepting auth requests via NTLM and no one is any wiser unless someone attempts double-hop authentication.

There isn't a great solution to this, other than having permanent SPN entries created in the Active Directory domain, and preventing SQL Server from managing them automatically.

Related Question