Sql-server – Is an SQL Server trace ‘considered harmful’ in production

performanceprofilersql server

Context: Careful DBAs

We've been bought out a few times and our latest overlords have DBAs who keep carefully guard their production database servers.

Overall this is good.

However, we have run into one point of contention between developers and DBAs: they will not run traces against production database servers.

Our experience as developers: SQL Server trace + RML Utilities rocks

In the past, when application instances ran sluggish, we would run a trace via an .sql script, zip up the output, and process it with Microsoft's RML Utilities.

Note that we do not run the GUI-based profiler tool as that dramatically slows down performance (the database needs to wait on updating the GUI). These traces simply "log activity" to the database server's local filesystem. Our scripts run a trace for a prescribed number of minutes at a time (for example, 10 minutes).

The combination of trace and RML Utilities works great: The report is easy to use. It shows hotspots (see below), the reports give helpful information, they allow drilling down to the actual execution plans, etc. It's much better than anything Oracle offers (or at least a few years ago).

(Hotspot: if a query takes 1/10 second, but executes 400,000 times, it wil appear higher in the report, higher than the once-executed query that takes 30 seconds.)

I've done a bit of performance work over the years. My experience is that capturing 'actual data' from 'real databases under real load' trumps programmers 'taking a guess at' and simulating the problem they imagine. This is why I think that the SQL Server trace is pretty much 'silver bullet' for killing the problem.

My Questions

What are the main, legitimate concerns against running SQL Server traces in production (filling up the disk, slowing down performance, etc.)?
Are these concerns valid? If so, how can we address them?
Taking a trace and profiling the "actual problem" seems like the clearest step toward understanding and solving the situation. What are the alternatives?

Best Answer

The latest version of the RML Utilities (09.04.0051 as at today) support Extended Events tracing, so maybe you could work with your DBAs to do some controlled tracing. A screenshot from the help file:

They do however have a point because the default trace template provided with RML includes statement-level events (SQL:StmtStarting, SQL:StmtCompleted, SP:StmtStarting, SP:StmtCompleted), ie every statement in every proc. This could easily bring a server to its knees under the wrong conditions, eg scalar function in a SELECT statement running across a big table.

Extended Events (XE) were a bit limited in versions prior to 2012, but it is more lightweight. You still need to be careful as SQL statement-level events will still incur overhead for very intensive workloads (sqlserver.sp_statement_starting, sqlserver.sp_statement_completed, sqlserver.sql_statement_starting, sqlserver.sql_statement_completed). Execution plan capture (sqlserver.query_post_execution_showplan) can also be intensive. Consider using the XETraceCaptureDefLightWeight.sql template which is provided and includes less events.

In summary I suggest moving to XE and using the lightweight XE definition provided with the latest versions of RML.

Best Answer

Related Solutions

Sql-server – Any pointers on reducing the privileges in production but not making the job excessively difficult

Sql-server – SQL Server statements intermittently slow on SQL Server 2008 R2

Related Question