SQL Server – Why does the scalar UDF perform so differently on two different (but extremely similar) servers?

functions, performance, query-performance, sql-server, sql-server-2016

Recently, I was troubleshooting a strange performance issue that impacted an application's production environment, but not any of the lower environments. I managed to replicate the issue in its simplest form with this query:

SELECT product_id, dbo.TranslateStatusToActive(status_id) FROM prod_Products

TranslateStatusToActive is a very simple scalar UDF which essentially joins the supplied value to another table and returns 1 or 0 based on a CASE expression. I'd post the code, but it's a vendor-written function and I'm not particularly interested in getting sued today. (Yes, the logic can be inlined. Yes, that fixes the performance issue. Yes, we've convinced the vendor to implement the change. This is not my question.)
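To give a sense of the pattern without posting the vendor's code, here is a completely made-up sketch with the same general shape; the lookup table (prod_Statuses), the column names, and the CASE values are all invented for illustration:

CREATE FUNCTION dbo.TranslateStatusToActive (@status_id INT)
RETURNS BIT
AS
BEGIN
    -- Hypothetical: look the status up in a lookup table and map it to 1 or 0
    RETURN (
        SELECT CASE WHEN s.status_name IN ('Active', 'Pending') THEN 1 ELSE 0 END
        FROM dbo.prod_Statuses AS s
        WHERE s.status_id = @status_id
    );
END

An inlined equivalent of a function like that would look roughly like this, with the CASE expression moved into a join:

SELECT p.product_id,
       CASE WHEN s.status_name IN ('Active', 'Pending') THEN 1 ELSE 0 END AS is_active
FROM prod_Products AS p
LEFT JOIN dbo.prod_Statuses AS s
    ON s.status_id = p.status_id;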

When executed in production, the query took between 10 and 20 seconds to return results; in development, the same query returned in less than 3 seconds. The execution plans were nearly identical, the main difference being that CPU time was around 15,000 ms in production and roughly 3,000 ms elsewhere.
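(If you want to see the same CPU vs. elapsed split yourself, SET STATISTICS TIME gives it without opening the plan XML:)

SET STATISTICS TIME ON;

SELECT product_id, dbo.TranslateStatusToActive(status_id) FROM prod_Products;

SET STATISTICS TIME OFF;
-- The Messages tab then reports "CPU time" and "elapsed time" for the statement.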

I suspected there were some environmental differences, so I set up another server that replicated the conditions of production as closely as possible: I made sure the number of CPUs, the amount of memory given to SQL Server, and the specific patch level (13.0.4451) were the same.
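(For reference, the standard places to check those three things are SERVERPROPERTY, sys.dm_os_sys_info, and sys.configurations:)

-- Patch level
SELECT SERVERPROPERTY('ProductVersion') AS product_version,
       SERVERPROPERTY('ProductLevel')   AS product_level;

-- Logical CPU count visible to SQL Server
SELECT cpu_count FROM sys.dm_os_sys_info;

-- Memory given to SQL Server
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'max server memory (MB)';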

I restored a copy of the production database to this new sandbox server, and to my surprise, the query executed about as quickly as it did in development. Once again, the plan and data were identical, save for the extra CPU time. The waits listed in the execution plan were the same types and within a few ms of each other on every environment.

At a loss for what to do next, I enabled optimize for ad hoc workloads on the production server. This fixed the performance issue! One thing, though: neither of the other environments had this setting enabled. I had been regularly clearing the procedure and system caches in each environment during testing, so I don't think the improvement was simply the setting change forcing a recompile.
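For completeness, this is the standard way to flip that setting; it's an advanced option, so show advanced options has to be on first, and it takes effect without a restart:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Enable optimize for ad hoc workloads
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;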

Questions

  • What could cause the UDF to run so differently in each environment despite an identical plan and nearly identical systems?
  • Why did the production environment need to have optimize for ad hoc workloads enabled to perform as well as the other environments, which did not have it enabled?
  • Is there some setting I didn't think to check that might cause such a large difference?

Development is shared, while production is currently used only by this application; the third box's usage would be almost the same as production's. I cleared pretty much every cache they give a DBCC command for. The development environment is regularly used as a training system, so I'm fairly confident this wasn't a plan-caching issue.
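To be concrete, by "every cache they give a DBCC command for" I mean the usual set, along the lines of:

-- Drop all cached execution plans
DBCC FREEPROCCACHE;

-- Drop clean pages from the buffer pool
DBCC DROPCLEANBUFFERS;

-- Drop the non-plan memory caches
DBCC FREESYSTEMCACHE ('ALL');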

The only difference with the third box is that there isn't an app hooked up to it, but there was little to no application activity while I was testing the function in production, so in my experience with this environment the difference is negligible. The only thing I couldn't do was restart the production server, but Microsoft's documentation explicitly states that enabling optimize for ad hoc workloads doesn't clear or affect any existing plans, so I don't see what difference that would make.

Best Answer

The situation you described can happen when some kind of monitoring is enabled (a trace, an Extended Events session, or a third-party tool) that does logging or other work per UDF execution, or even per statement inside the UDF.

If the UDF is executed many times in a query, that monitoring can add a very large amount of overhead. If the monitoring is only running on one server, you would see a large performance difference between them.
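A quick way to compare the servers on that front is to look at what is actively tracing on each one, for example:

-- Server-side SQL Traces that are currently running (trace id 1 is normally just the default trace)
SELECT id, path, start_time
FROM sys.traces
WHERE status = 1;

-- Extended Events sessions that are currently started
SELECT name
FROM sys.dm_xe_sessions;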