I believe you are seeing the symptoms of an issue with Windows 2003 requiring contiguous memory, which causes running processes, including Analysis Services, to trim their working set memory. With large amounts of memory allocated, this trim can take a significant amount of time to complete, and while it is running new allocations are blocked, causing the process to stall.
This issue is fixed in Windows 2008 onwards. This support article is for SQL Server but has all the links to the relevant pages: http://support.microsoft.com/kb/918483. In particular, note this quote:
In Windows Server 2008, the allocation of physically contiguous memory
is greatly enhanced. Requests to allocate contiguous memory are much
more likely to succeed because the memory manager now dynamically
replaces pages, typically without trimming the working set or
performing I/O operations. In addition, many more types of pages—such
as kernel stacks and file system metadata pages, among others—are now
candidates for replacement. Consequently, more contiguous memory is
generally available at any given time. In addition, the cost to obtain
such allocations is greatly reduced.
I believe that to address this issue you will either have to upgrade to Windows 2008 or later, or test granting Lock Pages in Memory to the Analysis Services account. I have also had issues with memory allocation on Windows 2003 that required a reboot to clear.
This SQLCAT blog compares the changes between Windows Server 2003 & 2008.
This blog mentions some of the issues of LPIM with Analysis Services (near the end).
Also see this Knowledge Base article on working set trimming causing performance problems, and this article on how to identify drivers that allocate contiguous memory.
This is an awesome post.
To answer your final question, I'd speculate that your answer is "yes".
That said, I probably would have pursued soft-NUMA before resorting to the trace flags. I think you are right about the NUMA node allocation, and that could be at the root of your problem. Via soft-NUMA, you could scale out the requests across your NUMA nodes (4, if that's the correct count) and then assign each host, via IP address, to a specific NUMA node. In addition to that, I'd disable hyper-threading. Combined, the issue would likely decrease, though at the cost of fewer schedulers. A sketch for checking the node layout follows below.
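Before carving anything up with soft-NUMA, it's worth confirming the physical layout the instance actually sees. This is only a hedged sketch of that sanity check, not your exact configuration:

-- Hedged sketch: count visible online schedulers per NUMA node
-- so the soft-NUMA node definitions can match the hardware.
-- Hyper-threaded logical CPUs will inflate these counts.
SELECT parent_node_id,
       COUNT(*) AS scheduler_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE'
GROUP BY parent_node_id
ORDER BY parent_node_id;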
On a separate thought, I'd look at forced parameterization. The fact that your load is driving your CPU so high is very interesting, and it may be worth looking into that; a sketch follows below.
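If you do want to test it, forced parameterization is a database-level setting. The database name here is a placeholder:

-- Hedged sketch: enable forced parameterization for the database
-- taking the ad hoc load. YourDatabase is a placeholder name.
ALTER DATABASE [YourDatabase] SET PARAMETERIZATION FORCED;

-- To revert to the default behaviour:
ALTER DATABASE [YourDatabase] SET PARAMETERIZATION SIMPLE;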
Lastly, on multi-NUMA-node systems, I typically have the output of the following queries dumped to a table every N seconds (a minimal sketch of that wrapper follows after the queries). It makes for some interesting analysis when the workload changes or trace flags are implemented:
SELECT getdate() as poll_time, node_id, node_state_desc, memory_node_id, online_scheduler_count, active_worker_count, avg_load_balance, idle_scheduler_count
FROM sys.dm_os_nodes WITH (NOLOCK)
WHERE node_state_desc <> N'ONLINE DAC'
and
SELECT top 10 getdate() as sample_poll, wait_type, count (*)
FROM sys.dm_os_waiting_tasks
WHERE [wait_type] NOT IN
('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK','SLEEP_SYSTEMTASK',
'SQLTRACE_BUFFER_FLUSH','WAITFOR', 'BROKER_TASK_STOP',
'BROKER_RECEIVE_WAITFOR', 'OLEDB','CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT' )
GROUP BY wait_type
ORDER BY COUNT (*) DESC
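For the "dump to a table" part, this is a hedged sketch of the kind of wrapper I mean; the table name, loop, and 30-second interval are placeholders, and a SQL Agent job on a schedule works just as well:

-- Hedged sketch: persist the node snapshot every 30 seconds.
-- dbo.node_poll is a placeholder name.
CREATE TABLE dbo.node_poll
(
    poll_time datetime NOT NULL,
    node_id int NOT NULL,
    node_state_desc nvarchar(256) NOT NULL,
    memory_node_id int NULL,
    online_scheduler_count int NULL,
    active_worker_count int NULL,
    avg_load_balance int NULL,
    idle_scheduler_count int NULL
);

WHILE 1 = 1
BEGIN
    INSERT INTO dbo.node_poll
    SELECT getdate(), node_id, node_state_desc, memory_node_id,
           online_scheduler_count, active_worker_count,
           avg_load_balance, idle_scheduler_count
    FROM sys.dm_os_nodes
    WHERE node_state_desc <> N'ONLINE DAC';

    WAITFOR DELAY '00:00:30';
END;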
Best Answer
The metric is trying to measure the number of logical IO operations performed by each request, on average. If your server is, let's say, serving a busy web site, then it is probably a good metric (perhaps 100 is even too high for such a workload!).
The witness will only have internal activity. In other words, it will touch data files internally, due to various background processes, despite never seeing any actual batch. As such, all of that internal logical IO will be reported as if it were caused by the few batches this server has seen.
I would say that the metric Thomas is recommending has some value, but only in a workload similar to what Thomas had in mind. A server that has only background activity, like your witness, is not going to report anything meaningful with this ratio. A server that runs intensive DW reports may see only 1 batch request per hour and run millions of page lookups for that one request. An absolute number like 'must be below 100' is only valuable when accompanied by small-print disclaimers like 'must be below 100 if your workload has frequent enough batch requests to establish a significant statistical trend, and each batch request is expected to perform under 100 page lookups on average'. You get the idea... A sketch of how to compute the ratio yourself follows below.
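For reference, a hedged sketch of computing the ratio from the cumulative performance counters. Both counters accumulate since instance startup, so this gives a lifetime average rather than a point-in-time rate:

-- Hedged sketch: page lookups per batch request, averaged over
-- the instance's uptime. NULLIF guards against a server (like a
-- witness) that has seen essentially no batches.
SELECT
    (SELECT cntr_value
     FROM sys.dm_os_performance_counters
     WHERE counter_name = N'Page lookups/sec'
       AND object_name LIKE N'%Buffer Manager%') * 1.0
    /
    NULLIF((SELECT cntr_value
            FROM sys.dm_os_performance_counters
            WHERE counter_name = N'Batch Requests/sec'
              AND object_name LIKE N'%SQL Statistics%'), 0)
        AS page_lookups_per_batch;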