SQL Server – SSAS queries time out when Windows File System Cache drops

performance, sql-server, sql-server-2008, ssas

We have a SQL 2008 Analysis Services instance on Windows 2003 that is experiencing near-daily performance issues (query timeouts) during peak activity. A second, identical instance queried from the same application (using round-robin IP to load balance) doesn't have the same issue. We've even switched the two instances, and the problem seems to "stay" with the physical server, so we know it's not an SSAS configuration issue. The cube is rebuilt once per day.

Here's what we've been able to determine: the timeouts occur when the Memory\System Cache Resident Bytes perfmon counter drops unexpectedly (we've been calling this a "cache flush"). For example, this morning it dropped from about 14GB to 8.4GB within about 2 minutes. When this happens, Memory\Page Reads/sec jumps from 0 up to 800-1200, and things start going sideways.
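
For anyone who wants to watch for the same pattern, here is a minimal sketch of how the drop could be caught automatically; it just shells out to typeperf for the two counters above, and the 25% drop threshold and 60-second polling interval are arbitrary assumptions rather than tuned values.

# Minimal sketch: poll the two Memory counters via typeperf and flag a sudden
# drop in System Cache Resident Bytes. Threshold and interval are guesses.
import subprocess, time

COUNTERS = [
    r"\Memory\System Cache Resident Bytes",
    r"\Memory\Page Reads/sec",
]

def sample():
    # typeperf -sc 1 prints a CSV header row followed by one data row:
    # "timestamp","cache bytes","page reads/sec"
    out = subprocess.check_output(["typeperf", "-sc", "1"] + COUNTERS, text=True)
    row = [line for line in out.splitlines() if line.startswith('"')][-1]
    fields = [f.strip('"') for f in row.split('","')]
    return float(fields[1]), float(fields[2])

prev_cache = None
while True:
    cache_bytes, page_reads = sample()
    if prev_cache is not None and cache_bytes < 0.75 * prev_cache:
        print("cache dropped from %.1f GB to %.1f GB, Page Reads/sec = %.0f"
              % (prev_cache / 2**30, cache_bytes / 2**30, page_reads))
    prev_cache = cache_bytes
    time.sleep(60)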

You can see the immediate impact on the MSOLAP counters: Storage Engine Query\Data reads/sec drops sharply from 50,000 to 5,000 (these obviously weren't physical reads but reads from the system cache, which is confirmed by a similar drop in Process(msmdsrv)\IO Read Operations/sec). Storage Engine Query\Avg time/query climbs from 500-2000 up to 20,000-50,000 and queries start timing out. Threads\Processing pool busy threads jumps from ~50 up to the maximum configured value of 240, and the Threads\Processing pool job queue length starts climbing.

Queries continue timing out over the next 30 minutes or so, until (I presume) the file cache is re-filled with the missing data. Memory\Page Reads/sec goes back down to near zero, and things appear to return to "normal".

During this entire process, there is no significant change to Process(msmdsrv)\Working Set (~10.5GB), Process(Total)\Working Set (~13.9GB), or MSOLAP:Memory\Memory Usage KB (~50GB). This is a 48-core server with 128GB RAM, running only SSAS (the other server that isn't having trouble is a 24-core box; we haven't found any other significant differences between them).

Instance memory settings are as follows:

<TotalMemoryLimit>80000000000</TotalMemoryLimit>
<LowMemoryLimit>70000000000</LowMemoryLimit>
<PreAllocate>40000000000</PreAllocate>
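
For reference, SSAS interprets values above 100 in these properties as absolute byte counts rather than percentages of physical memory. A quick back-of-the-envelope conversion of the values above against the 128GB of RAM (plain arithmetic, nothing instance-specific):

# Rough conversion of the msmdsrv.ini values above to GiB and to a share of
# the 128 GB of physical RAM (values above 100 are bytes, not percentages).
RAM_BYTES = 128 * 2**30

settings = {
    "TotalMemoryLimit": 80000000000,
    "LowMemoryLimit":   70000000000,
    "PreAllocate":      40000000000,
}

for name, value in settings.items():
    print("%s: %.1f GiB (%.0f%% of RAM)"
          % (name, value / 2**30, 100.0 * value / RAM_BYTES))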

Based especially on the Memory Usage KB, I don't believe that memory is an issue, other than figuring out why SSAS isn't using its internal memory to cache the entire cube (the cube is about 32GB).

We are currently working to optimize the cube and query design (which I highly suspect will make the problem disappear entirely).

In the meantime, we need to

  • understand this Windows file cache behavior
  • figure out why it's consistently a problem on one physical server and not the other

Here are my questions:

  1. I assume this is "normal" behavior by SSAS? Is there any way to force the entire 32GB cube into SSAS memory, to eliminate the need to read from the Windows file cache entirely?

  2. Is there any way to determine what exactly is causing this cache flush, and why it doesn't occur on the (very similar) other host?

  3. Any other ways to mitigate this problem?

I found some other references to SSAS and the Windows file cache (the links found here, for example), but these seem to imply that file caching is a bad thing, and they talk about a utility to deliberately flush the cache, which seems like the opposite of what I need.

Edit: Added a 100pt bounty

Best Answer

I believe you are seeing the symptoms of an issue with Windows 2003: when something requires contiguous memory, the OS can force running processes, including Analysis Services, to trim their working set memory. With large amounts of memory allocated, this trim can take a significant amount of time to complete, and while it is running new allocations are blocked, causing the process to stall. The issue is fixed in Windows 2008 onwards. This support article is written for SQL Server but has links to all the relevant pages: http://support.microsoft.com/kb/918483. In particular, note this quote:

In Windows Server 2008, the allocation of physically contiguous memory is greatly enhanced. Requests to allocate contiguous memory are much more likely to succeed because the memory manager now dynamically replaces pages, typically without trimming the working set or performing I/O operations. In addition, many more types of pages—such as kernel stacks and file system metadata pages, among others—are now candidates for replacement. Consequently, more contiguous memory is generally available at any given time. In addition, the cost to obtain such allocations is greatly reduced.

I believe that to address this you will either have to upgrade to Windows 2008 or later, or test granting Lock Pages in Memory to the Analysis Services service account. I have also had memory allocation issues on Windows 2003 that required a reboot to clear.
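
If you want to script the Lock Pages in Memory change rather than click through secpol.msc (Local Policies > User Rights Assignment > Lock pages in memory), something along these lines should work; it assumes ntrights.exe from the Windows Server 2003 Resource Kit is available, and the account name is a placeholder.

# Rough sketch: grant the Lock Pages in Memory right (SeLockMemoryPrivilege)
# to the SSAS service account using ntrights.exe from the Windows Server 2003
# Resource Kit. "DOMAIN\ssas_service" is a placeholder - substitute the real
# service account before running.
import subprocess

SERVICE_ACCOUNT = r"DOMAIN\ssas_service"  # placeholder account name

subprocess.check_call([
    "ntrights", "+r", "SeLockMemoryPrivilege",
    "-u", SERVICE_ACCOUNT,
])

Either way, Analysis Services has to be restarted before the new privilege takes effect.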

This SQLCAT blog compares the changes between Windows Server 2003 & 2008.

This blog post mentions some of the issues with LPIM and Analysis Services (near the end).

Also see this knowledge base article on working set trimming causing performance problems, and this article on Identifying Drivers That Allocate Contiguous Memory.