Windows – How does the Windows 10 scheduler deal with Hyper-Threading since Core Parking is disabled by default for Intel CPUs?

cpu, hyper-threading, performance, windows 10

I'm running Windows 10 (1607) on an Intel Xeon E3-1231v3 CPU (Haswell, 4 physical cores, 8 logical cores).

When I first had Windows 7 installed on this machine, I could observe that four out of eight logical cores were parked until an application needed more than 4 threads. One can check with Windows resource monitor whether cores are parked or not (example).
As far as I understand, this is an important technique to keep the threads balanced across physical cores, as explained on the Microsoft website: "the Core Parking algorithm and infrastructure is also used to balance processor performance between logical processors on Windows 7 client systems with processors that include Intel Hyper-Threading Technology."

However, after upgrading to Windows 10, I noticed that there is no core parking anymore. All logical cores are active all the time, and when you run an application using fewer than four threads you can see how the scheduler distributes them evenly across all logical CPU cores. Microsoft employees have confirmed that Core Parking is disabled in Windows 10.

But I wonder why? What was the reason for this? Is there a replacement, and if so, what does it look like? Has Microsoft implemented a new scheduler strategy that made core parking obsolete?


Appendix:

Here is an example of how core parking, introduced in Windows 7, can benefit performance (in comparison to Vista, which didn't have the core parking feature yet). What you can see is that on Vista, HT (Hyper-Threading) harms performance, while on Windows 7 it doesn't:

[Benchmark charts comparing performance with Hyper-Threading enabled and disabled under Windows Vista and Windows 7]

(source)

I tried to enable Core Parking as mentioned here, but what I observed was that the Core Parking algorithm isn't Hyper-Threading aware anymore. It parked cores 4, 5, 6, 7, while it should have parked cores 1, 3, 5, 7 to avoid threads being assigned to the same physical core: Windows enumerates cores in such a way that two successive indices belong to the same physical core. Very strange. It seems Microsoft has messed this up fundamentally. And no one noticed…
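For reference, the SMT pairing can be checked programmatically. The following is a minimal sketch (my own illustration, not part of the benchmark) that uses GetLogicalProcessorInformation to print which logical processor indices share a physical core; on a Hyper-Threading machine each physical core should list two sibling indices:

#include <Windows.h>
#include <stdio.h>
#include <vector>

int main() {
  // Query the required buffer size, then fetch the processor topology.
  DWORD len = 0;
  GetLogicalProcessorInformation(NULL, &len);
  std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
  if (!GetLogicalProcessorInformation(info.data(), &len)) return 1;
  for (size_t i = 0; i < info.size(); i++) {
    if (info[i].Relationship != RelationProcessorCore) continue;
    printf("Physical core: logical CPUs");
    for (int bit = 0; bit < 8 * (int)sizeof(ULONG_PTR); bit++) {
      if (info[i].ProcessorMask & ((ULONG_PTR)1 << bit)) printf(" %d", bit);
    }
    printf("\n");
  }
  return 0;
}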

Furthermore, I did some CPU benchmarks using exactly 4 threads (1000 runs each; times are in milliseconds).

CPU affinity set to all cores (Windows default):

Average running time: 17.094498, standard deviation: 2.472625

CPU affinity set to every other core (so that it runs on different physical cores, best possible scheduling):

Average running time: 15.014045, standard deviation: 1.302473

CPU affinity set to the worst possible scheduling (four logical cores on two physical cores):

Average running time: 20.811493, standard deviation: 1.405621

So there is a performance difference. And you can see that the Windows default scheduling ranks between the best and worst possible scheduling, which is what we would expect from a scheduler that is not Hyper-Threading aware. However, as pointed out in the comments, there may be other causes responsible for this, such as fewer context switches, interference from monitoring applications, etc. So we still don't have a definitive answer here.
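For reference, this is roughly how the three configurations map to affinity masks on this 4-core/8-thread CPU. The affinity can be set externally (e.g. via Task Manager) while the benchmark waits at the pause prompt; the sketch below (not part of the benchmark as run) would do the same from code:

#include <Windows.h>

// 0xFF -> all 8 logical cores (Windows default)
// 0x55 -> logical cores 0, 2, 4, 6 (one per physical core, best case)
// 0x0F -> logical cores 0, 1, 2, 3 (two physical cores, worst case)
void pinProcessTo(DWORD_PTR mask) {
  SetProcessAffinityMask(GetCurrentProcess(), mask);
}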

Source code for my benchmark:

// Build with OpenMP support enabled (e.g. /openmp with MSVC).
#include <stdlib.h>
#include <stdio.h>   // for printf
#include <Windows.h>
#include <math.h>

double runBenchmark(int num_cores) {
  int size = 1000;
  double** source = new double*[size];
  for (int x = 0; x < size; x++) {
    source[x] = new double[size];
  }
  double** target = new double*[size * 2];
  for (int x = 0; x < size * 2; x++) {
    target[x] = new double[size * 2];
  }
  #pragma omp parallel for num_threads(num_cores)
  for (int x = 0; x < size; x++) {
    for (int y = 0; y < size; y++) {
      source[y][x] = rand();
    }
  }
  #pragma omp parallel for num_threads(num_cores)
  for (int x = 0; x < size-1; x++) {
    for (int y = 0; y < size-1; y++) {
      target[x * 2][y * 2] = 0.25 * (source[x][y] + source[x + 1][y] + source[x][y + 1] + source[x + 1][y + 1]);
    }
  }
  double result = target[rand() % size][rand() % size];
  for (int x = 0; x < size * 2; x++) delete[] target[x];
  for (int x = 0; x < size; x++) delete[] source[x];
  delete[] target;
  delete[] source;
  return result;
}

int main(int argc, char** argv)
{
  int num_cores = 4;
  system("pause");  // So we can set cpu affinity before the benchmark starts 
  const int iters = 1000;
  double avgElapsedTime = 0.0;
  double elapsedTimes[iters];
  for (int i = 0; i < iters; i++) {
    LARGE_INTEGER frequency;
    LARGE_INTEGER t1, t2;
    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&t1);
    runBenchmark(num_cores);
    QueryPerformanceCounter(&t2);
    elapsedTimes[i] = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
    avgElapsedTime += elapsedTimes[i];
  }
  avgElapsedTime = avgElapsedTime / iters;
  double variance = 0;
  for (int i = 0; i < iters; i++) {
    variance += (elapsedTimes[i] - avgElapsedTime) * (elapsedTimes[i] - avgElapsedTime);
  }
  variance = sqrt(variance / iters);  // the variable now holds the standard deviation
  printf("Average running time: %f, standard deviation: %f", avgElapsedTime, variance);
  return 0;
}

Best Answer

Huh, I could tell you the story but you are going to hate it and I'm going to hate writing it :-)

Short version - Win10 screwed up everything it could and is in a perpetual state of starving cores due to a systemic problem known as CPU oversubscription (way too many threads; no one can ever service them; something is choking at any point, forever). That's why it desperately needs these fake CPUs, shortens the base scheduler timer to 1 ms, and can't let you park anything. It would just scorch the system. Open Process Explorer and add up the number of threads, then do the math :-)

The CPU Sets API was introduced to give at least a fighting chance to those who know about it and have the time to write the code to wrestle the beast. You can de facto park the fake CPUs by putting them in a CPU set that you never hand out to anyone, and creating a default set to throw to the piranhas. But you can't do it on client SKUs (technically you could, it's just not going to be honored), since the kernel would go into a panic state and either totally ignore CPU sets or other things would start crashing. It has to defend the system's integrity at any cost.
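To make that concrete, here is a rough sketch of what de facto parking the SMT siblings could look like with the CPU Sets API (an illustration only; requires Windows 10, error handling omitted): enumerate the CPU sets with GetSystemCpuSetInformation, keep one logical processor per physical core, and make those the process default set with SetProcessDefaultCpuSets.

#include <Windows.h>
#include <vector>

// Sketch: restrict the process default CPU set to one logical processor per
// physical core. Threads that are not given an explicit CPU set will then
// stay off the SMT siblings.
bool useOneLogicalPerCore() {
  ULONG len = 0;
  GetSystemCpuSetInformation(NULL, 0, &len, GetCurrentProcess(), 0);
  std::vector<char> buffer(len);
  PSYSTEM_CPU_SET_INFORMATION all = (PSYSTEM_CPU_SET_INFORMATION)buffer.data();
  if (!GetSystemCpuSetInformation(all, len, &len, GetCurrentProcess(), 0)) return false;
  std::vector<ULONG> ids;
  std::vector<bool> seenCore(256, false);
  for (char* p = buffer.data(); p < buffer.data() + len; ) {
    PSYSTEM_CPU_SET_INFORMATION info = (PSYSTEM_CPU_SET_INFORMATION)p;
    if (info->Type == CpuSetInformation && !seenCore[info->CpuSet.CoreIndex]) {
      ids.push_back(info->CpuSet.Id);           // first logical CPU of this core
      seenCore[info->CpuSet.CoreIndex] = true;  // skip its SMT sibling(s)
    }
    p += info->Size;
  }
  // Threads without an explicit CPU set will now only be scheduled on these.
  return SetProcessDefaultCpuSets(GetCurrentProcess(), ids.data(), (ULONG)ids.size()) != FALSE;
}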

The whole state of affairs is by and large a taboo, since fixing it would require major rewrites and everyone culling the number of frivolous threads and admitting that they messed up. Hyper-threads actually have to be permanently disabled (they heat up cores under real load, degrade performance and destabilize HTM, which is the principal reason it never became mainstream). Big SQL Server shops are doing it as a first setup step, and so is Azure. Bing is not; they run servers with a de facto client setup, since they'd need many more cores to dare to switch. The problem percolated into Server 2016.

SQL Server is the sole real user of CPU sets (as usual :-); 99% of the perf-advanced things in Windows have always been done just for SQL Server, starting with the super-efficient memory-mapped file handling that kills people coming from Linux, since they assume different semantics.

To play with this safely you'd need 16 cores minimum for a client box, 32 for a server (one that actually does something real :-). You have to put at least 4 cores in the default set so that the kernel and system services can barely breathe, but that's still just the equivalent of a dual-core laptop (you still have perpetual choking), meaning 6-8 cores to let the system breathe properly.

Win10 needs 4 cores and 16 GB just to barely breathe. Laptops get away with 2 cores and 2 fake "CPUs" if there's nothing demanding to do, since their usual work distribution is such that there are always enough things that have to wait anyway (the long queue on memory allocation "helps" a lot :-).

This is still not going to help you with OpenMP (or any automatic parallelization) unless you have a way of telling it explicitly to use your CPU set (individual threads have to be assigned to the CPU set) and nothing else. You still need the process affinity set as well; it's a precondition for CPU sets.
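A rough sketch of the per-thread part (realCoreIds is a hypothetical list of CPU set IDs, collected as in the sketch above): each OpenMP worker has to opt into the set itself, e.g. at the top of the parallel region, via SetThreadSelectedCpuSets.

#include <Windows.h>
#include <vector>

// Sketch: each OpenMP worker thread assigns itself to the chosen CPU sets.
// realCoreIds is assumed to come from GetSystemCpuSetInformation as above.
void runPinned(const std::vector<ULONG>& realCoreIds) {
  #pragma omp parallel
  {
    SetThreadSelectedCpuSets(GetCurrentThread(), realCoreIds.data(), (ULONG)realCoreIds.size());
    #pragma omp for
    for (int i = 0; i < 1000; i++) {
      // ... the actual work ...
    }
  }
}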

Server 2k8 was the last good one (yes, that means Win7 as well :-). People were bulk-loading a TB in 10 minutes with it and SQL Server. Now people brag if they can load it in one hour, under Linux :-) So chances are that the state of affairs is not much better "over there" either. Linux had CPU sets way before Windows.
