SQL Server – Diagnosing Excessive ‘Lock’ Time Issues

sql server, sql-server-2016, wait-types

Given:

Production environment
Application Servers using Hibernate
SQL Server Management Studio (SSMS) v17.5
SQL Server 2016 in a clustered HAG setup
SQL Servers do NOT have the Query Store feature enabled
The author of this question is a software engineer with enough SQL Server knowledge to be categorized as mostly harmless

update 1

Database growth settings: Unlimited, 1024000 KB, data only
instant_file_initialization_enabled – Yes
is_auto_update_stats_async_on – No

update 2

The server has 4 CPU cores
There are spikes of waiting tasks of over 3,000,000. I have no idea yet what they are. This must be the reason for the large 'lock' times.
These spikes occur every 10 or 15 seconds. I have the following graph updating once a second:

The problem:

The root problem is that, at seemingly random times on a busy day, a couple of SQL queries time out. For the purposes of this question, however, I am interested in whether the screen grab is indicative of a problem in itself. Perhaps this is too subjective, but I have no experience with this value.

Action:

The failures themselves do not point directly to a concrete issue and therefore I am currently gathering evidence and attempting a process of elimination where possible.
Currently I am investigating whether excessive wait times and a 'perfect storm' of queries could cause a cascade of locks and thus a query timeout.

Evidence Gathered:

Several queries are resulting in either full index scans or full table scans.
Several screen grabs with execution plans showing table scans.
Cursory inspection shows that indexes do exist, yet they are not used. I might be able to sanitize the screen grabs if they would prove useful.
The screen grab below showing a large wait time.

Question:

What other information would help determine whether locking and wait times might be the cause of the query timeouts? For example, I have the following screen grab from the Activity Monitor in SQL Server Management Studio. The value looked surprising to me.

Best Answer

So, the elephant in the room: is that wait time excessive?

The lock waits from the screenshot show 18,024,389 ms / sec as the recent (average) wait time over the last couple of minutes. That means for every second of "wall clock time," there are 18,000 seconds (5 hours?!) of lock waits accumulated by queries. That's so tremendously bad that I wonder if it's just a bug in the Activity Monitor UI.
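One way to check whether the Activity Monitor figure is real is to sample the lock waits directly. The following is a minimal sketch, assuming you have VIEW SERVER STATE permission; the 10-second sample window and the temp table name are arbitrary choices of mine, not anything from the question. It snapshots sys.dm_os_wait_stats, waits, and reports how many milliseconds of LCK_M_* waits were recorded per second of wall clock time.

```sql
-- Snapshot cumulative lock waits, pause, then compute the delta per second.
DECLARE @t0 datetime2 = SYSDATETIME();

SELECT wait_type, wait_time_ms, waiting_tasks_count
INTO #lock_waits_before            -- hypothetical temp table name
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE N'LCK[_]M[_]%';

WAITFOR DELAY '00:00:10';          -- assumed sample window of 10 seconds

DECLARE @elapsed_ms int = DATEDIFF(MILLISECOND, @t0, SYSDATETIME());

SELECT a.wait_type,
       a.waiting_tasks_count - b.waiting_tasks_count AS waits_in_sample,
       a.wait_time_ms - b.wait_time_ms               AS wait_ms_in_sample,
       (a.wait_time_ms - b.wait_time_ms) * 1000.0 / @elapsed_ms AS wait_ms_per_sec
FROM sys.dm_os_wait_stats AS a
JOIN #lock_waits_before AS b
  ON b.wait_type = a.wait_type
WHERE a.wait_time_ms > b.wait_time_ms
ORDER BY wait_ms_in_sample DESC;

DROP TABLE #lock_waits_before;
```

Keep in mind that sys.dm_os_wait_stats only counts a wait once it has finished, so a handful of very long waits ending inside the sample window can show up as a sudden spike.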

Depending on how many cores the server has, and how many queries are running concurrently, even the smaller number (2.5 seconds of lock waits per second of wall clock time) is not ideal.

Those waits could imply a blocking chain (you can use sp_WhoIsActive to identify the lead blocker and then work out why it is blocking everything else). Regardless, they can definitely contribute to the client-side timeouts you've described: every second that a query waits on locks is a second that it's not making progress on the actual query being run.
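As a rough sketch of how to look for the lead blocker while a spike is happening: the first statement assumes sp_WhoIsActive is installed on the server (it is a community procedure, not part of SQL Server), and the second uses only built-in DMVs as a fallback.

```sql
-- Option 1: sp_WhoIsActive, sorted so that sessions blocking the most
-- other sessions appear first.
EXEC sp_WhoIsActive
     @find_block_leaders = 1,
     @sort_order = '[blocked_session_count] DESC';

-- Option 2: built-in DMVs. Lists every request that is currently blocked
-- and the session that is blocking it.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time      AS wait_time_ms,
       r.wait_resource,
       st.text          AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;
```

In the DMV output, a session that shows up in blocking_session_id but never in session_id is not blocked itself, which makes it a candidate for the head of the chain.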

Related Solutions

Sql-server – Select from linked server to local server causing waits on both sides

Pulling data through a linked server is unlikely to be quick especially if you are joining remote tables to local tables. (Just use Profiler to watch what your local server sends to the remote server and I think that you'll be convinced.) Linked Servers are convenient and often "good enough", but not when you start blowing through your batch processing windows.

I have seen overall processing go faster by pulling the remote data into local (temporary or permanent) tables and then "doing the joins" locally to figure out what to put into the production tables.
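A minimal sketch of that pattern follows. The linked server name REMOTE_SRV, the remote table EtlDb.dbo.Orders, the local table dbo.Customers, and all column names are hypothetical placeholders. OPENQUERY is used here so that the remote server runs the inner query itself and only the result set crosses the link.

```sql
-- Step 1: pull only the needed remote rows into a local temp table.
-- REMOTE_SRV and EtlDb.dbo.Orders are hypothetical names.
SELECT q.OrderId, q.CustomerId, q.Amount
INTO #remote_orders
FROM OPENQUERY(REMOTE_SRV,
     'SELECT OrderId, CustomerId, Amount FROM EtlDb.dbo.Orders') AS q;

-- Step 2: index the local copy on the join column.
CREATE CLUSTERED INDEX IX_remote_orders_CustomerId
    ON #remote_orders (CustomerId);

-- Step 3: do the join entirely on the local server.
-- dbo.Customers is a hypothetical local table.
SELECT c.CustomerId, c.Name, r.OrderId, r.Amount
FROM dbo.Customers AS c
JOIN #remote_orders AS r
  ON r.CustomerId = c.CustomerId;
```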

If you have a large amount of data, I suggest looking into using an SSIS package or bcp, probably bringing the data into a local staging table first and converting the code to run off local tables. This might involve a lot of surgery to your jobs, since you need a job step (or steps) to run the bcp code and/or packages.

If you are using SQL Server 2008 or later, MERGE should be available and it could provide more efficient operations. I am unsure whether MERGE works with tables on a linked server and, if it does, performance may not be much better than what you have. So, you are still stuck getting the data onto the local server.

Also, you could look into partition switching, but I think that would be the most costly option in terms of time spent getting things to work. You could view it as a learning experience.

Alternatively, you might consider increasing the query timeout, which is a server-level setting on the local server. That might help with the timeout error (until your data volume increases, then you would need to tweak it again), but I doubt that it will help with the TDS error problem.
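Assuming the setting meant here is the 'remote query timeout' server option (my reading of the paragraph above, not something stated explicitly), it can be inspected and changed with sp_configure. The value 1200 is only an illustrative number.

```sql
-- Show the current value (in seconds; the default is 600).
EXEC sp_configure 'remote query timeout';

-- Raise it to 20 minutes. 1200 is an arbitrary example value.
EXEC sp_configure 'remote query timeout', 1200;
RECONFIGURE;
```

This changes the timeout for all remote queries issued from the local server, so it is worth noting the old value before changing it.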

If you want to minimize the length of time that your production tables are locked (due to the INSERT, not the SELECT with nolock), you would want to get all of the data from the ETL system onto the local box first, then go and insert it all. You want to avoid pulling data for one table, inserting that, pulling the data for the next table, inserting that, etc. (I'm presuming that you are doing all of this in a declared transaction that I'm not seeing.)
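The idea can be sketched like this. All table and column names are hypothetical, and the two temp tables are assumed to have been fully loaded from the ETL system before this point, so the transaction contains only local inserts.

```sql
-- Assumes #stage_orders and #stage_order_lines (hypothetical names) are
-- already populated; no remote calls happen inside the transaction.
SET XACT_ABORT ON;  -- roll back the whole transaction on a run-time error

BEGIN TRANSACTION;

INSERT INTO dbo.Orders (OrderId, CustomerId, Amount)
SELECT OrderId, CustomerId, Amount
FROM #stage_orders;

INSERT INTO dbo.OrderLines (OrderId, LineNumber, Quantity)
SELECT OrderId, LineNumber, Quantity
FROM #stage_order_lines;

COMMIT TRANSACTION;
```

Because the slow linked-server reads are finished before BEGIN TRANSACTION, locks on the production tables are held only for the duration of the local inserts.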

Sql-server – SQL Server 2012 CPU spike due to LinkedServer – how to find the offending query

To answer the question: How to find the offending query:

Since the spikes in the graph you posted last for several minutes, you have plenty of time to use the following method:

Download Sysinternals Process Explorer.

Start Process Explorer and find the SQL Server process.
Right-click it and select Properties.
Look at the Threads tab.
Sort on the CPU column and note the thread ID (TID) that is consuming the most CPU.

Use this query to look up the query that is currently being executed by that thread:

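-- Set this to the TID noted in Process Explorer.
DECLARE @os_thread_id int = 0;  -- placeholder: replace 0 with the actual thread id

-- Map the OS thread to its task, then to the request, its SQL text, and its plan.
SELECT r.session_id, st.text, qp.query_plan
FROM sys.dm_os_threads AS ot
JOIN sys.dm_os_tasks AS t
ON t.worker_address = ot.worker_address
JOIN sys.dm_exec_requests AS r
ON t.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
WHERE ot.os_thread_id = @os_thread_id;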
