This is an awesome post.
To answer your final question, I'd speculate that the answer is "yes".
That said, I probably would have pursued soft-NUMA before resorting to the trace flags. I think you are right about the NUMA node allocation, and that could be at the root of your problem. Via soft-NUMA you could scale out the requests across your NUMA nodes (4?) and then assign each host to a specific NUMA node via its IP address/port mapping. In addition to that, I'd disable hyper-threading. Combined, these changes would likely reduce the issue, though at the cost of fewer schedulers.
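As a starting point while you experiment, here is a minimal sketch (my own, not from the original post) for watching how schedulers and tasks are spread across NUMA nodes before and after a soft-NUMA change:

-- Sketch: scheduler/task distribution per NUMA node
SELECT parent_node_id,
       COUNT(*)                  AS visible_schedulers,
       SUM(current_tasks_count)  AS current_tasks,
       SUM(runnable_tasks_count) AS runnable_tasks
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE'
GROUP BY parent_node_id
ORDER BY parent_node_id;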
On a separate note, I'd look at forced parameterization. The fact that your load is driving your CPU so high is very interesting, and heavy ad hoc compilation is one place forced parameterization can help.
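If you want to try it, enabling forced parameterization is a one-liner (the database name is a placeholder; test before applying to production):

-- Sketch: enable forced parameterization for a database
ALTER DATABASE [YourDatabase] SET PARAMETERIZATION FORCED;
-- Revert with:
-- ALTER DATABASE [YourDatabase] SET PARAMETERIZATION SIMPLE;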
Lastly, on multi-NUMA-node systems, I typically have the output of the following queries dumped to a table every N seconds; it makes for some interesting analysis when the workload changes or trace flags are implemented (a sketch of the table-dump piece follows the queries):
SELECT GETDATE() AS poll_time,
       node_id,
       node_state_desc,
       memory_node_id,
       online_scheduler_count,
       active_worker_count,
       avg_load_balance,
       idle_scheduler_count
FROM sys.dm_os_nodes WITH (NOLOCK)
WHERE node_state_desc <> N'ONLINE DAC'
and
SELECT TOP 10
       GETDATE() AS sample_poll,
       wait_type,
       COUNT(*)  AS waiting_tasks
FROM sys.dm_os_waiting_tasks
WHERE wait_type NOT IN
      ('CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK', 'SLEEP_SYSTEMTASK',
       'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'BROKER_TASK_STOP',
       'BROKER_RECEIVE_WAITFOR', 'OLEDB', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT')
GROUP BY wait_type
ORDER BY COUNT(*) DESC
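For the dump-to-table part, here's a minimal sketch for the first query (table name and column types are my own assumptions) that you could schedule via an Agent job every N seconds:

-- Sketch: persist each node snapshot; dbo.node_poll_log is an assumed name
CREATE TABLE dbo.node_poll_log (
    poll_time              datetime      NOT NULL,
    node_id                smallint,
    node_state_desc        nvarchar(256),
    memory_node_id         smallint,
    online_scheduler_count smallint,
    active_worker_count    int,
    avg_load_balance       int,
    idle_scheduler_count   smallint
);

INSERT INTO dbo.node_poll_log
SELECT GETDATE(), node_id, node_state_desc, memory_node_id,
       online_scheduler_count, active_worker_count,
       avg_load_balance, idle_scheduler_count
FROM sys.dm_os_nodes WITH (NOLOCK)
WHERE node_state_desc <> N'ONLINE DAC';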
In any system, if you make the part that is not the bottleneck more efficient, you can end up making the system as a whole less efficient. If, for example, your system's bottleneck is the CPU, making I/O more efficient by increasing the size of the buffer cache may cause many more threads to wait on the CPU. That can make the whole system much slower: the operating system now spends more time context-switching between the competing threads, the CPU caches get overwhelmed, and you get the classic hockey-stick graph of response time, where once you reach the "knee", adding a little more load adds a lot to the response time of the system.
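To make the "knee" concrete with the textbook M/M/1 approximation (an idealization, not a model of any particular database): response time is roughly R = S / (1 − ρ), where S is the service time and ρ is utilization. At ρ = 0.5 that's 2S, at ρ = 0.90 it's 10S, and at ρ = 0.95 it's 20S, so the same small increase in load costs far more response time the closer you already are to saturation.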
If you have a system where I/O requests are highly uncorrelated, increasing the size of the cache could mean that Oracle spends more time searching the cache for a particular block before being forced to do a physical read after all. That's probably relatively unlikely in an OLTP application, but it's possible in a data-warehouse-type system where various reports ask for blocks that no other report will want to reuse.
Of course, whether this is actually the problem that they encountered in the past is still an open question. It is entirely possible that it was simply a matter of a cold cache that needed to be warmed up, a perception problem where someone drew an incorrect conclusion from an earlier test, or a poorly constructed test where, for example, someone inadvertently increased the size of the buffer cache at the expense of the PGA rather than at the expense of the operating system's I/O cache. I tend to share your distrust of human recollections of the past that aren't corroborated by some nice AWR reports, so I'm not claiming that either of the above explanations is likely; merely that they are potential explanations if the recollections are correct.
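One way to replace recollection with data before re-testing: Oracle's V$DB_CACHE_ADVICE view estimates physical reads at alternative cache sizes. A sketch (assumes DB_CACHE_ADVICE is ON and you have access to the v$ views):

-- Sketch: estimate the effect of other cache sizes before resizing
SELECT size_for_estimate        AS cache_mb,
       buffers_for_estimate     AS buffers,
       estd_physical_read_factor,
       estd_physical_reads
FROM   v$db_cache_advice
WHERE  name = 'DEFAULT'
AND    advice_status = 'ON'
ORDER  BY size_for_estimate;
-- estd_physical_read_factor well below 1 at a larger size_for_estimate
-- suggests a bigger cache would cut physical reads; near 1, it would not.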
Best Answer
You can set a value for db_cache_size, and with AMM enabled the value you set will be treated as the minimum size the buffer cache can shrink to.
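For example (a sketch; 512M is purely illustrative, and SCOPE = BOTH assumes you are running with an spfile):

-- Sketch: pin a floor under the buffer cache while AMM manages the rest
ALTER SYSTEM SET db_cache_size = 512M SCOPE = BOTH;
-- AMM (MEMORY_TARGET) can still grow the cache above this value,
-- but it will not shrink it below the explicit floor.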