This is an awesome post.
To answer your final question, I'd speculate that the answer is "yes".
That said, I probably would have pursued soft-NUMA before resorting to the trace flags. I think you are right about the NUMA node allocation, and that could be at the root of your problem. With soft-NUMA you could scale out the requests across your NUMA nodes (4, if that's the correct count) and then assign each host, by IP address, to a specific NUMA node. In addition, I'd disable hyper-threading. Combined, these changes would likely reduce the issue, though at the cost of fewer schedulers.
On a separate note, I'd look at forced parameterization. The fact that your load is driving your CPU so high is very interesting, and it may be worth looking into.
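For reference, forced parameterization is a per-database setting. A minimal sketch, assuming a database called YourDatabase (a placeholder name, not something from the original post); test it against a representative workload first, since it changes plan reuse for every ad hoc query in that database:

```sql
-- Check the current setting (1 = forced, 0 = simple).
SELECT name, is_parameterization_forced
FROM sys.databases;

-- Turn on forced parameterization for one database.
-- [YourDatabase] is a placeholder name.
ALTER DATABASE [YourDatabase] SET PARAMETERIZATION FORCED;

-- Revert to the default if it does not help.
ALTER DATABASE [YourDatabase] SET PARAMETERIZATION SIMPLE;
```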
Lastly, on multi-numa node systems, I typically have the output of the following queries dumping to a table every N seconds. Makes for some interesting analysis when workload changes or trace flags are implemented:
SELECT getdate() as poll_time, node_id, node_state_desc, memory_node_id, online_scheduler_count, active_worker_count, avg_load_balance, idle_scheduler_count
FROM sys.dm_os_nodes WITH (NOLOCK)
WHERE node_state_desc <> N'ONLINE DAC'
and
SELECT TOP 10 getdate() AS sample_poll, wait_type, COUNT(*) AS waiting_tasks
FROM sys.dm_os_waiting_tasks
WHERE [wait_type] NOT IN
('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE','SLEEP_TASK','SLEEP_SYSTEMTASK',
'SQLTRACE_BUFFER_FLUSH','WAITFOR', 'BROKER_TASK_STOP',
'BROKER_RECEIVE_WAITFOR', 'OLEDB','CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT' )
GROUP BY wait_type
ORDER BY COUNT (*) DESC
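As a rough sketch of the "dump to a table every N seconds" part, the sys.dm_os_nodes query above could be wrapped like this. The table name dbo.numa_node_poll, the 10-second interval and the 60-sample limit are all assumptions for illustration; in practice a SQL Agent job on a schedule would probably be a better fit than a long-running loop:

```sql
-- Hypothetical logging table; columns mirror the sys.dm_os_nodes query above.
IF OBJECT_ID(N'dbo.numa_node_poll', N'U') IS NULL
    CREATE TABLE dbo.numa_node_poll (
        poll_time              datetime,
        node_id                int,
        node_state_desc        nvarchar(256),
        memory_node_id         int,
        online_scheduler_count int,
        active_worker_count    int,
        avg_load_balance       int,
        idle_scheduler_count   int
    );

DECLARE @samples int = 60;          -- assumed number of polls to take

WHILE @samples > 0
BEGIN
    -- Take one snapshot of every node except the DAC node.
    INSERT INTO dbo.numa_node_poll
        (poll_time, node_id, node_state_desc, memory_node_id,
         online_scheduler_count, active_worker_count,
         avg_load_balance, idle_scheduler_count)
    SELECT getdate(), node_id, node_state_desc, memory_node_id,
           online_scheduler_count, active_worker_count,
           avg_load_balance, idle_scheduler_count
    FROM sys.dm_os_nodes
    WHERE node_state_desc <> N'ONLINE DAC';

    WAITFOR DELAY '00:00:10';       -- assumed N = 10 seconds
    SET @samples -= 1;
END;
```

The wait-type query can be logged the same way with a second table keyed on the sample time and wait_type.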
Best Answer
Changing CPU affinity was never a common practice, but it did have its uses on Windows NT and later on Windows Server 2000/2003.
The main issue was that processor load could be unevenly distributed on multiprocessor systems, and setting affinity allowed for freeing up resources. It could also be helpful on systems that were not dedicated to running SQL Server. So yes, this could be beneficial in some edge cases. Old Small Business Servers come to mind.
On later versions of Windows, the OS is more likely to distribute load across processors, so setting affinity will not help performance. It can still be beneficial on servers running multiple instances if you want to limit each instance to specific CPUs or a specific NUMA node.
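For the multi-instance case, a minimal sketch of what that restriction might look like, assuming SQL Server 2008 R2 or later (where ALTER SERVER CONFIGURATION is available). The node and CPU numbers are made up for illustration and would need to match the actual hardware layout:

```sql
-- Run on the instance you want to restrict.
-- Bind this instance to NUMA node 0 (node number is an assumption).
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0;

-- Alternatively, bind it to an explicit CPU range (range is an assumption).
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 3;

-- Revert to letting SQL Server manage affinity automatically.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;
```

Only one of the first two statements would be used at a time; a second instance on the same server would be pointed at a different node or CPU range.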