Sql-server – SQL Server 2019: Memory performance with graph queries (possible memory leak)

graph memory memory-manager sql-server-2019

I'm currently working on implementing an upgrade to SQL Server 2019 in order to make use of the graph features that are available in it. Our database stores records of files and their children, and the graph features allow us to quickly find all of a file's relations in either direction. Our current dev environment is using SQL Server 2019 Standard (15.0.4023.6) on a Linux server.

I'm noticing a concerning problem when I run my graph queries. The server's 'internal' resource pool does not appear to free all of its resources after a graph query. Left unchecked, this fills up the resource pool, and larger queries then fail until the SQL Server process is restarted. Depending on server load, this can happen in as little as 1-2 hours. It can also fill up tempdb and threaten to fill the storage drive; the tempdb files cannot be shrunk or truncated significantly until the server is restarted. In the configuration, 'memory.memorylimitmb' is not set, so the problem appears once the resource pool has used the better part of the default limit of 80% of system memory (12.8 GB of the 16 GB installed).
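As a quick sanity check on the 12.8 GB figure (0.80 x 16 GB), the instance can be asked what it believes it has to work with. This is only a sketch: it reads three columns from sys.dm_os_sys_info, and how closely physical_memory_kb tracks the 'memory.memorylimitmb' default on Linux is something to verify on your own server rather than assume.

```sql
-- What the instance reports about its memory budget.
-- committed_target_mb is how much memory the memory manager is aiming to use;
-- committed_mb is how much it has committed right now.
SELECT
    physical_memory_kb  / 1024.0 AS physical_memory_mb,
    committed_target_kb / 1024.0 AS committed_target_mb,
    committed_kb        / 1024.0 AS committed_mb
FROM sys.dm_os_sys_info;
```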

To set up the tables within a demo database:

CREATE TABLE FileNode (ID BIGINT NOT NULL CONSTRAINT PK_FileNode PRIMARY KEY) AS NODE
GO
CREATE TABLE FileNodeArchiveEdge AS EDGE
GO
CREATE INDEX [IX_FileNodeArchiveEdge_ChildFile] ON [dbo].[FileNodeArchiveEdge] ($from_id)
GO
CREATE INDEX [IX_FileNodeArchiveEdge_ParentFile] ON [dbo].[FileNodeArchiveEdge] ($to_id)
GO

To populate the demo database tables:

INSERT INTO [FileNode] (ID) VALUES
            (1),(2),(3),(4),(5),
            (6),(7),(8),(9),(10),
            (11),(12),(13),(14),(15)

-- Convenient intermediate table
DECLARE @bridge TABLE (f BIGINT, t BIGINT)
INSERT INTO @bridge (f, t) VALUES
    (1,4),
    (4,9),
    (4,10),
    (1,5),
    (5,11),
    (11,12),
    (2,5),
    (2,6),
    (6,13),
    (6,14),
    (13,15),
    (14,15),
    (15,12),
    (7,14),
    (3,7),
    (3,8)

INSERT INTO FileNodeArchiveEdge
($from_id, $to_id)
SELECT 
    (SELECT $node_id FROM FileNode WHERE ID = f),
    (SELECT $node_id FROM FileNode WHERE ID = t)
FROM @bridge

To get all child IDs of a file (repeating this same query will eat up memory resources and cause 'USERSTORE_SCHEMAMGR' to grow uncontrollably):

DECLARE @parentId BIGINT = 1
SELECT 
    LAST_VALUE(f2.ID) WITHIN GROUP (GRAPH PATH)
FROM
    FileNode f1,
    FileNodeArchiveEdge FOR PATH contains_file,
    FileNode FOR PATH f2
WHERE
    f1.ID = @parentId
    AND MATCH(SHORTEST_PATH(f1(-(contains_file)->f2)+))

Repeatedly re-running this query eventually causes the 'USERSTORE_SCHEMAMGR' memory clerk type to account for the bulk of used memory. The tempdb also grows uncontrollably.

The demonstration database queries are too small to trigger an explicit message reporting that the 'internal' resource pool is out of memory. However, running a larger query on the same server should trigger the warning, and performance should still be impacted.

The following queries are useful for monitoring server performance:

-- Memory clerk usage
SELECT TOP(10) mc.[type] AS [Memory Clerk Type],
   CAST((SUM(mc.pages_kb)/1024.0) AS DECIMAL (15,2)) AS [Memory Usage (MB)]
FROM sys.dm_os_memory_clerks AS mc WITH (NOLOCK)
GROUP BY mc.[type]
ORDER BY SUM(mc.pages_kb) DESC OPTION (RECOMPILE);

/*
Example output of above query:

Memory Clerk Type                                            Memory Usage (MB)
------------------------------------------------------------ -----------------
USERSTORE_SCHEMAMGR                                                    9224.26
MEMORYCLERK_SQLSTORENG                                                 1114.73
MEMORYCLERK_SQLBUFFERPOOL                                               471.50
CACHESTORE_SEHOBTCOLUMNATTRIBUTE                                        376.47
MEMORYCLERK_SOSNODE                                                     292.02
MEMORYCLERK_SQLGENERAL                                                   19.84
MEMORYCLERK_SQLCLR                                                       12.04
MEMORYCLERK_SQLQUERYPLAN                                                  2.99
MEMORYCLERK_SQLLOGPOOL                                                    2.61
MEMORYCLERK_SQLTRACE                                                      2.14

*/
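To see whether the growth in USERSTORE_SCHEMAMGR comes with a growing number of cache entries, the cache counters DMV can be sampled between runs of the loop. This is a sketch, not a diagnosis: it only reports size and entry counts for that one cache type.

```sql
-- Size and entry counts for the schema manager user store.
-- A large entries_count with a small entries_in_use_count would suggest
-- entries that are no longer in use but have not been evicted.
SELECT
    cc.[name],
    cc.[type],
    CAST(cc.pages_kb / 1024.0 AS DECIMAL(15, 2)) AS cache_size_mb,
    cc.entries_count,
    cc.entries_in_use_count
FROM sys.dm_os_memory_cache_counters AS cc
WHERE cc.[type] = N'USERSTORE_SCHEMAMGR'
ORDER BY cc.pages_kb DESC;
```

Running it every few thousand iterations and comparing the numbers should show whether the cache only ever grows.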

-- Resource pool usage
SELECT cache_memory_kb/1024.0 AS [cache_memory_MB],
   compile_memory_kb/1024.0 AS [compile_memory_MB],
   used_memory_kb/1024.0 AS [used_memory_MB]
FROM sys.dm_resource_governor_resource_pools
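Since tempdb growth is part of the symptom, it may help to watch which kind of allocation is growing there. A minimal sketch using tempdb's file space usage DMV (page counts are converted from 8 KB pages to MB):

```sql
-- Breakdown of tempdb space by allocation category.
-- Internal objects include work tables, sorts and spools created by query execution.
SELECT
    SUM(user_object_reserved_page_count)     * 8 / 1024.0 AS user_objects_mb,
    SUM(internal_object_reserved_page_count) * 8 / 1024.0 AS internal_objects_mb,
    SUM(version_store_reserved_page_count)   * 8 / 1024.0 AS version_store_mb,
    SUM(unallocated_extent_page_count)       * 8 / 1024.0 AS unallocated_mb
FROM tempdb.sys.dm_db_file_space_usage;
```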

To rapidly query the server and eat up resources, I'm using the following BASH loop:

l=1000 # Number of loops
# The loop will probably need to be run 2M times or so to start to see significant usage.

c=0 # Loop tracker

touch marker # Alternate emergency stop: Remove the marker file from another terminal session.
time while [ $c -lt $l ] && [ -f "marker" ]; do
    c="$((${c}+1))"
    echo ${c}/${l}
    # Notes: SQLCMDPASSWORD has been set in environment variable
    #        child-query.sql contains the above child query to loop for the children of file ID 1.
    time sqlcmd -U db_user -S localhost -d DemoDatabase -i child-query.sql > /dev/null || break
done
rm marker

The DROPCLEANBUFFERS/FREEPROCCACHE/FLUSHPROCINDB DBCC commands complete successfully, but don't appear to have an effect.
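The three commands above target the buffer pool and the plan cache. One more experiment, offered only as something to try, is DBCC FREESYSTEMCACHE, which asks all system caches to release unused entries. I have no evidence that it releases USERSTORE_SCHEMAMGR in this scenario, so the sketch measures the clerk before and after:

```sql
-- Size of the clerk before the flush.
SELECT CAST(SUM(pages_kb) / 1024.0 AS DECIMAL(15, 2)) AS schemamgr_mb_before
FROM sys.dm_os_memory_clerks
WHERE [type] = N'USERSTORE_SCHEMAMGR';

-- Release unused entries from all system caches, not only the plan cache.
DBCC FREESYSTEMCACHE ('ALL') WITH NO_INFOMSGS;

-- Size of the clerk after the flush.
SELECT CAST(SUM(pages_kb) / 1024.0 AS DECIMAL(15, 2)) AS schemamgr_mb_after
FROM sys.dm_os_memory_clerks
WHERE [type] = N'USERSTORE_SCHEMAMGR';
```

If the two numbers are essentially the same, the memory is not being held as evictable cache entries.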

Is there a configuration or procedure that could solve this problem, or is this a fundamental server problem?

Related Solutions

SQL Server Page Allocation Failure – Error FAIL_PAGE_ALLOCATION 1

The errorlog output contained a DBCC MEMORYSTATUS dump, and this is what I noticed:

Process/System Counts                         Value(in Bytes)
---------------------------------------- ----------
Available Physical Memory                1217605632---1.1 G
Available Virtual Memory                 140627167866880
Available Paging File                    5656502272
Working Set                               305238016
Percent of Committed Memory in WS                99
Page Faults                                27923310
System physical memory high                       0
System physical memory low                        0
Process physical memory low                       1--Memory Low
Process virtual memory low                        0
2016-06-14 04:28:27.41 Server

Please note that the available physical memory is very low. There was almost no memory in the buffer pool.

Regarding the clerk that is consuming the most memory:

MEMORYCLERK_SQLQERESERVATIONS (node 0)           KB
---------------------------------------- ----------
VM Reserved                                       0
VM Committed                                      0
Locked Pages Allocated                            0
SM Reserved                                       0
SM Committed                                      0
Pages Allocated                            22599824  --21.5 G

Page Life Expectancy                             64

Now, on a server where max server memory is 28 GB, MEMORYCLERK_SQLQERESERVATIONS taking 21.5 GB is definitely a problem. This is what is causing the OOM condition.

What is MEMORYCLERK_SQLQERESERVATIONS

This is a memory clerk in SQL Server that tracks memory allocated to queries that involve sort or hash operations during execution. These operators can be the largest memory consumers for a query.

Why the OOM error occurs because of this

When a query involving sort or hash operations is executed, it makes a reservation request based on the original query plan, which contained a sort or hash operator. Then, as the query executes, it requests the memory, and SQL Server grants that request partially or fully depending on memory availability. The memory clerk (accountant) named 'MEMORYCLERK_SQLQERESERVATIONS' tracks memory allocations for such requests. In your scenario, the following could be happening:

The query is requesting such a large memory grant that SQL Server can only provide a limited amount, just enough for it to start executing; this minimum is called "Required Memory". While the query executes it needs more memory (called "Additional Memory"), and because there is no memory left in the resource pool, SQL Server cannot provide it and the query fails with an OOM error.
A bug was fixed in SQL Server 2012 SP1 CU4 where a query requested a huge memory grant, causing it to be drastically slow or to fail with an OOM error. The possibility that this bug has resurfaced cannot be ruled out, considering that QEReservations hogged all of the buffer pool.
Since the clerk has already taken 90% of memory, the Required Memory for a new query is not available, and the query fails with an OOM error.
Your tables and indexes have skewed statistics, which force the optimizer to build a suboptimal plan that requests a much larger memory grant than it actually needs, which in turn creates issues.
Lastly, the queries running on SQL Server require some serious tuning.

As per this blogs.msdn article:

What Can a Developer Actually Do about Sort/Hash Operations?

Speaking of re-writing queries, here are some things to look for in a query that may lead to large memory grants.

Reasons why a query would use a SORT operator (not all inclusive list):
ORDER BY (T-SQL)
GROUP BY (T-SQL)
DISTINCT (T-SQL)
Merge Join operator selected by the optimizer and one of the inputs of the Merge join has to be sorted because a clustered index is not available on that column.

Reasons why a query would use a Hash Match operator (not all inclusive list):
JOIN (T-SQL) – if SQL ends up performing a Hash Join. Typically, lack of good indexes may lead to the most expensive of join operators – Hash Join. Look at query plan.
DISTINCT (T-SQL) – a Hash Aggregate could be used to perform the distinct. Look at query plan.
SUM/AVG/MAX/MIN (T-SQL) – any aggregate operation could potentially be performed as a Hash Aggregate. Look at query plan.
UNION – a Hash Aggregate could be used to remove the duplicates.

To understand the problem further, please add the output of the queries below to your question. I would also like you to add the output of Paul Randal's wait stats query. The source of the queries is this blog, which I suggest you read.

SELECT * FROM sys.dm_exec_query_memory_grants where grant_time is null

--Find who uses the most query memory grant:

SELECT mg.granted_memory_kb, mg.session_id, t.text, qp.query_plan
FROM sys.dm_exec_query_memory_grants AS mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) AS t
CROSS APPLY sys.dm_exec_query_plan(mg.plan_handle) AS qp
ORDER BY 1 DESC OPTION (MAXDOP 1)

--Search cache for queries with memory grants:

SELECT t.text, cp.objtype,qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
JOIN sys.dm_exec_query_stats AS qs ON cp.plan_handle = qs.plan_handle
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
WHERE qp.query_plan.exist('declare namespace n="http://schemas.microsoft.com/sqlserver/2004/07/showplan"; //n:MemoryFractions') = 1

There are a few other things I would like you to check for the queries running on the system.

Select granted_query_memory,session_id,command from sys.dm_exec_requests

This will show you how much memory is granted to queries running on the system.
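One detail worth knowing when reading that output: granted_query_memory in sys.dm_exec_requests is reported as a number of 8 KB pages, not in KB. A small sketch that converts it to MB and hides requests without a grant:

```sql
-- granted_query_memory is a page count (8 KB pages), so multiply by 8 to get KB.
SELECT
    r.session_id,
    r.command,
    r.granted_query_memory              AS granted_pages,
    r.granted_query_memory * 8 / 1024.0 AS granted_mb
FROM sys.dm_exec_requests AS r
WHERE r.granted_query_memory > 0
ORDER BY r.granted_query_memory DESC;
```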

If you can view the actual execution plan as XML, it contains MemoryGrant=xxxxx; please collect this value for the costly queries.

All of the above will show us whether the problem lies in the queries or whether something else is causing them to request so much memory for execution.

EDIT

From the various query outputs you pasted:

You can see that requested_memory_kb for a large number of queries is approximately 5 GB. That is a large memory grant; ideally it should be a few MB. Note that required_memory_kb is only around 5 MB and granted_query_memory is NULL: because of memory pressure, SQL Server can only provide the minimum memory needed to start the query, not the additional memory needed during execution, so the query fails with an OOM error.
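To put requested, required and granted memory side by side, a query along these lines can be used. It is a sketch over sys.dm_exec_query_memory_grants that only selects and converts columns:

```sql
-- Requested vs. required vs. granted memory per request, largest requests first.
-- grant_time and granted_mb stay NULL while a request is still waiting for its grant.
SELECT
    mg.session_id,
    mg.request_time,
    mg.grant_time,
    mg.requested_memory_kb / 1024.0 AS requested_mb,
    mg.required_memory_kb  / 1024.0 AS required_mb,
    mg.granted_memory_kb   / 1024.0 AS granted_mb,
    mg.max_used_memory_kb  / 1024.0 AS max_used_mb,
    mg.query_cost,
    mg.wait_time_ms
FROM sys.dm_exec_query_memory_grants AS mg
ORDER BY mg.requested_memory_kb DESC;
```

A large gap between requested_mb and max_used_mb is a hint that the grant estimate, and therefore the cardinality estimate behind it, is off.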

The query cost of the queries requesting huge amounts of memory is also high, which leads me to believe that either the statistics are skewed or the queries are poorly written. Another possibility is that the queries are not supported by proper indexes. A good number of queries are requesting such huge memory grants.
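One way to check the statistics theory is to list statistics with many modifications since their last update. This is a sketch for the current database; what counts as "too many" modifications is a judgment call and not something this query decides.

```sql
-- Statistics on user tables, most-modified first.
SELECT
    OBJECT_SCHEMA_NAME(s.object_id) AS schema_name,
    OBJECT_NAME(s.object_id)        AS table_name,
    s.name                          AS stats_name,
    sp.last_updated,
    sp.[rows],
    sp.rows_sampled,
    sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY sp.modification_counter DESC;

-- Refresh every statistic in the current database that has had row modifications.
EXEC sys.sp_updatestats;
```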

For the queries above, look at granted_query_memory: it is all in GB. The first 3 running queries used approximately 15 GB of memory, which is almost 50% of the total. Many other processes in SQL Server also need memory in some way, so if 3 queries are using 50% of the available memory, an OOM issue is bound to occur.

Solution

You should seriously consider tuning the first 4 queries in the screenshot above.

Make sure you run index rebuilds and statistics updates at least weekly so that skewed statistics do not force the optimizer to produce a bad plan.

Use Resource Governor: create a resource pool and a workload group, and run the queries that request large memory grants in this pool. You can limit the memory grant with the REQUEST_MAX_MEMORY_GRANT_PERCENT parameter. An example is shown in this blog. This is just a stopgap until you have tuned all your queries.
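The Resource Governor suggestion could look roughly like the sketch below. The pool, group and function names and the login 'reporting_user' are hypothetical placeholders, and the percentages are arbitrary starting values, not recommendations. Also note that, as far as I know, Resource Governor is an Enterprise edition feature in SQL Server 2019 and earlier.

```sql
USE master;
GO
-- Hypothetical pool: caps the memory available to the offending workload.
CREATE RESOURCE POOL LargeGrantPool
WITH (MAX_MEMORY_PERCENT = 40);
GO
-- Hypothetical group: a single request may take at most 10% of the pool's grant memory.
CREATE WORKLOAD GROUP LargeGrantGroup
WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 10)
USING LargeGrantPool;
GO
-- Classifier: routes sessions of a hypothetical login into the group.
CREATE FUNCTION dbo.fn_classify_large_grant()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @grp sysname = N'default';
    IF SUSER_SNAME() = N'reporting_user'  -- placeholder login name
        SET @grp = N'LargeGrantGroup';
    RETURN @grp;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_classify_large_grant);
GO
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO
```

The classifier only applies to sessions opened after the reconfigure, so existing connections stay in their current group until they reconnect.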

Sql-server – SQL Server Occupies more than allocated memory. Possible memory leak

My question is: is SQL Server leaking memory?

Highly unlikely, but you should plan for SQL Server 2012 SP4 ASAP. From my past experience, since SQL Server and other applications such as SSAS and SSRS are all running on the same machine, SQL Server might be facing memory pressure, and the same may be true for SSAS/SSRS. I have seen a lot of systems like this, and it all boils down to memory pressure.

You have 32 GB of RAM and have given just 16 GB to SQL Server; raise it to 20 GB and see if this helps. Adding more memory would definitely help if that is an option.
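Raising the limit to 20 GB (20 x 1024 = 20480 MB) can be done with sp_configure. A minimal sketch; the value is just the one suggested above, so adjust it to leave enough room for SSAS, SSRS and the operating system:

```sql
-- 'max server memory (MB)' is an advanced option, so expose advanced options first.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;

-- Set max server memory to 20 GB.
EXEC sys.sp_configure N'max server memory (MB)', 20480;
RECONFIGURE;

-- Confirm the configured value and the value currently in use.
SELECT [name], [value], value_in_use
FROM sys.configurations
WHERE [name] = N'max server memory (MB)';
```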

There can be a plethora of reasons why queries are timing out, and Troubleshooting SQL Server performance issues can help you find the root cause.

But when I select memory occupied by Percentage, it shows 80%-85% which is NOT RIGHT:

I would start by saying that Task Manager is not the right place to gauge SQL Server memory consumption; it will not show the correct value when the SQL Server service account has the Locked Pages in Memory (LPIM) privilege. Normally, Task Manager tracks process private bytes, which are pageable and allocated via the VirtualAlloc() function. With LPIM, however, a chunk of memory is allocated through the AWE API and is non-pageable, so Task Manager does not track it, which can lead to an incorrect value. The percentage you are looking at is actually the percentage of process private bytes, not of the complete memory; it does not provide any relevant information, so just stop looking at it.

It is quite normal for SQL Server to use the memory allocated to it. To find out how much physical memory SQL Server is using, run this query:

select * from sys.dm_os_process_memory
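If the full row is too noisy, the columns most relevant here can be selected directly. A sketch of the same DMV with sizes converted to MB:

```sql
-- physical_memory_in_use_mb includes locked pages, which Task Manager does not show.
SELECT
    physical_memory_in_use_kb  / 1024.0 AS physical_memory_in_use_mb,
    locked_page_allocations_kb / 1024.0 AS locked_pages_mb,
    large_page_allocations_kb  / 1024.0 AS large_pages_mb,
    memory_utilization_percentage,
    process_physical_memory_low,
    process_virtual_memory_low
FROM sys.dm_os_process_memory;
```

A non-zero locked_pages_mb confirms that LPIM is in effect and explains the gap between this figure and what Task Manager reports.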

PS: It is always recommended to move other applications to a different machine (if possible) and let SQL Server run alone on its own system; this will help SQL Server run faster and better.
