This is an awesome post.
To answer your final question, I'd speculate that your answer is "yes".
That said, I probably would have pursued soft numa before resorting to the trace flags. I think you are right about the numa node allocation and that's could be at the root of your problem. Via soft numa, you could scale out the requests, depending on your count of numa nodes (4?) - to 4, if that's the correct number, and then assign, via ip address, each host to a specific numa node, in addition to that, I'd disable hyper threading. Combined, the issue would likely decrease, however, it would do so at the cost of fewer schedulers.
On a seperate thought, I'd look at forced parameterization - the fact that your load is driving your CPU so high is very interesting and it may be worth looking into that.
Lastly, on multi-numa node systems, I typically have the output of the following queries dumping to a table every N seconds. Makes for some interesting analysis when workload changes or trace flags are implemented:
SELECT getdate() as poll_time, node_id, node_state_desc, memory_node_id, online_scheduler_count, active_worker_count, avg_load_balance, idle_scheduler_count
FROM sys.dm_os_nodes WITH (NOLOCK)
WHERE node_state_desc <> N'ONLINE DAC'
and
SELECT top 10 getdate() as sample_poll, wait_type, count (*)
FROM sys.dm_os_waiting_tasks
WHERE [wait_type] NOT IN
('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK','SLEEP_SYSTEMTASK',
'SQLTRACE_BUFFER_FLUSH','WAITFOR', 'BROKER_TASK_STOP',
'BROKER_RECEIVE_WAITFOR', 'OLEDB','CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT' )
GROUP BY wait_type
ORDER BY COUNT (*) DESC
Best Answer
Large Pages change allocation of memory from 4KB to 2MB (normally), which means that the TLB (translation look-aside buffer) is improved in the CPU increasing the performance.
I've successfully used Trace flag 834 on production server in VMware for over a year. One important part before enable trace flag is that in VMware you should reserve at least the amount of memory dedicated to SQL Server, this is because the lock-pages in memory privilege required to enable large pages, "locks" the memory so it can't be swapped (paginated) to disk. So if the balloning driver kicks in, it won't try to force to reclaim memory from SQL Server.
Additionally the host should not be overcommitted in memory, if so it will degrade performance.
It should be carefully tested if there is a performance gain, so if unsure, just don't enable it. Also I have read that "Large Pages May Be Harmful on NUMA Systems" but I'm not sure if it applies to Windows/SQL Server
Trace flag 834 applies only to 64-bit versions of SQL Server. You must have the Lock pages in memory user right to turn on trace flag 834. You can turn on trace flag 834 only at startup.