This is an awesome post.
To answer your final question, I'd speculate that your answer is "yes".
That said, I probably would have pursued soft numa before resorting to the trace flags. I think you are right about the numa node allocation and that's could be at the root of your problem. Via soft numa, you could scale out the requests, depending on your count of numa nodes (4?) - to 4, if that's the correct number, and then assign, via ip address, each host to a specific numa node, in addition to that, I'd disable hyper threading. Combined, the issue would likely decrease, however, it would do so at the cost of fewer schedulers.
On a seperate thought, I'd look at forced parameterization - the fact that your load is driving your CPU so high is very interesting and it may be worth looking into that.
Lastly, on multi-numa node systems, I typically have the output of the following queries dumping to a table every N seconds. Makes for some interesting analysis when workload changes or trace flags are implemented:
SELECT getdate() as poll_time, node_id, node_state_desc, memory_node_id, online_scheduler_count, active_worker_count, avg_load_balance, idle_scheduler_count
FROM sys.dm_os_nodes WITH (NOLOCK)
WHERE node_state_desc <> N'ONLINE DAC'
and
SELECT top 10 getdate() as sample_poll, wait_type, count (*)
FROM sys.dm_os_waiting_tasks
WHERE [wait_type] NOT IN
('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK','SLEEP_SYSTEMTASK',
'SQLTRACE_BUFFER_FLUSH','WAITFOR', 'BROKER_TASK_STOP',
'BROKER_RECEIVE_WAITFOR', 'OLEDB','CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT' )
GROUP BY wait_type
ORDER BY COUNT (*) DESC
Typically I recommend to my clients leaving the MAXDOP alone. SQL Server is NUMA aware so it knows which NUMA node the processors are in and which NUMA node the memory is in and it'll do it's best to assign the work to a scheduler on the correct NUMA node for the data that you are looking for.
Looking into this sort of stuff also requires knowing how many NUMA nodes you have. Is each physical processor in its own NUMA node or are they node in the same NUMA node?
Best Answer
While it is true that accessing memory in a remote NUMA node is slower than accessing memory in the local NUMA node, the difference is typically so slight as to be completely eclipsed by other performance or configuration issues.
Confirm you have configured the maximum degree of parallelism correctly, since that controls how many threads SQL Server will use for queries that will benefit from parallel processing. SQL Server will attempt to process an individual query within NUMA node memory boundaries, if the number of threads is less than or equal to the number of cores in a NUMA node.
You'll also want to ensure you have the maximum server memory configured correctly, leaving enough room for each SQL Server instance, and the operating system, etc.
You should also probably have lock pages in memory configured, to ensure memory is not being stolen from SQL Server.