Which cores correspond to each "CPU" below?
Assuming we have cores 1, 2, 3, and 4, with eight logical processors numbered CPU 0 through CPU 7, CPU 4 and CPU 5 both represent core 3.
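For what it's worth, Windows usually (though this isn't guaranteed - the authoritative mapping comes from GetLogicalProcessorInformation) pairs consecutive logical CPUs onto one physical core. A minimal sketch of that assumed layout:

```python
# Assumed convention: with 2-way hyperthreading, logical CPUs 2n and
# 2n+1 share one physical core (cores numbered 1-4 as in the question).
def core_of(logical_cpu, threads_per_core=2):
    """Map a logical CPU number to the physical core it lives on."""
    return logical_cpu // threads_per_core + 1

for cpu in range(8):
    print(f"CPU{cpu} -> core {core_of(cpu)}")  # CPU4 and CPU5 -> core 3
```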
Do (say) CPU 6 and CPU 7 below represent one core: the HT core and the real core?
There is no distinction between the two - they are both logical processors with physical hardware interfaces to the CPU, and the logical interface is implemented in hardware (see the Intel Core Processor Datasheet, Volume 1, for more details). Basically, each core presents two separate hardware threads, but shares common execution resources between them. This is why in certain cases hyperthreading can actually reduce performance.
If, for example, CPU 6 represents a real core and CPU 7 an HT core, will a thread assigned only to CPU 7 get only the leftover resources of a real core? (assuming the core is running other tasks)
See above. A thread assigned to ONLY CPU6 or ONLY CPU7 will execute at the exact same speed (assuming the thread does the same work, and the other cores in the processor are at idle). Windows knows about HT-enabled processors, and the process scheduler takes these things into account.
Is hyperthreading managed entirely within the processor, such that threads are juggled internally? If so, is that at the CPU scope or the core scope? Example: if CPU 6 and 7 represent one core, does it not matter which one a process is assigned to, because the CPU will assign resources as appropriate to a running thread?
Both. The hardware itself does not schedule which cores run which programs; that's the operating system's job. The CPU itself, however, is responsible for sharing resources between the actual execution units, and Intel documents how to write code to make this as efficient as possible.
I notice that long-running single-threaded processes are bounced around cores quite a bit, at least according to task manager. Does this mean that assigning a process to a single core will improve performance by a little bit (by avoiding context switches and cache invalidations, etc.)? If so, can I know I am not assigning to "just a virtual core"?
That is normal behaviour, and no, assigning it to a single core will not improve performance. That being said, if for some reason you want to ensure a single process is only executed on a single, physical core, assign it to any single logical processor.
The reason the process "bounces around" is the process scheduler. This is normal behaviour, and you will most likely see reduced performance by limiting which cores the process can execute on (regardless of how many threads it has), since the scheduler now has to work harder to fit everything within your imposed restrictions. Yes, this penalty may be negligible in most cases, but the bottom line is: unless you have a reason to do this, don't!
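For anyone who does have a reason: affinity is expressed as a bitmask of logical processors - bit n allows logical CPU n. This is the kind of mask that Task Manager's "Set affinity" dialog and Win32's SetProcessAffinityMask work with. A small sketch of the mask arithmetic (the CPU numbers are just illustrative):

```python
def affinity_mask(logical_cpus):
    """Build an affinity bitmask: bit n set means logical CPU n is allowed."""
    mask = 0
    for cpu in logical_cpus:
        mask |= 1 << cpu
    return mask

# Pinning to CPUs 6 and 7 (both logical processors of one physical
# core on the 4-core example above):
print(hex(affinity_mask([6, 7])))  # 0xc0
```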
Huh, I could tell you the story but you are going to hate it and I'm going to hate writing it :-)
Short version - Win10 screwed up everything it could and is in a perpetual state of starving cores due to a systemic problem known as CPU oversubscription (way too many threads, no one can ever service them, something is choking at any point, forever). That's why it desperately needs these fake CPUs, shortens the base scheduler timer to 1 ms, and can't let you park anything. Doing so would just scorch the system. Open Process Explorer and add up the number of threads, now do the math :-)
The CPU Sets API was introduced to give at least some fighting chance to those who know about it and have the time to write the code to wrestle the beast. You can de facto park the fake CPUs by putting them in a CPU set that you don't give to anyone, and create a default set to throw to the piranhas. But you can't do it on client SKUs (technically you could, it's just not going to be honored), since the kernel would go into a panic state and either totally ignore CPU sets or other things would start crashing. It has to defend the system's integrity at any cost.
The whole state of affairs is by and large taboo, since fixing it would require major rewrites, everyone culling the number of frivolous threads, and admitting that they messed up. Hyperthreading actually has to be permanently disabled (it heats up cores under real load, degrades performance, and destabilizes HTM - the principal reason why it never became mainstream). Big SQL Server shops do this as a first setup step, and so does Azure. Bing does not; they run servers with a de facto client setup, since they'd need many more cores to dare to switch. The problem percolated into Server 2016.
SQL Server is the sole real user of CPU sets (as usual :-) - 99% of the performance-advanced things in Windows have always been done just for SQL Server, starting with the super-efficient memory-mapped file handling that kills people coming from Linux, since they assume different semantics.
To play with this safely you'd need 16 cores minimum for a client box, 32 for a server (one that actually does something real :-). You have to put at least 4 cores in the default set so that the kernel and system services can barely breathe, but that's still just the equivalent of a dual-core laptop (you still have perpetual choking), meaning 6-8 cores to let the system breathe properly.
Win10 needs 4 cores and 16 GB just to barely breathe. Laptops get away with 2 cores and 2 fake "CPUs" if there's nothing demanding to do, since their usual work distribution is such that there are always enough things that have to wait anyway (the long queue on memory allocation "helps" a lot :-).
This is still not going to help you with OpenMP (or any automatic parallelization) unless you have a way of telling it explicitly to use your CPU set (individual threads have to be assigned to the CPU set) and nothing else. You still need the process affinity set as well; it's a precondition for CPU sets.
Server 2k8 was the last good one (yes, that means Win7 as well :-). People were bulk-loading a TB in 10 minutes with it and SQL Server. Now people brag if they can load it in one hour - under Linux :-). So chances are that the state of affairs is not much better "over there" either. Linux had CPU sets way before Windows.
Best Answer
TL;DR version: if you were doing something highly CPU-intensive, such as transcoding video using HandBrake, then you wouldn't want to use more threads than CPUs, as there would be nowhere for the extra work to be done. In this case, where most threads will spend 90% of their time asleep waiting for reads or writes, having more threads works for you rather than against you.
Copying files is not a particularly CPU-bound task. While having more cores may help prevent other tasks from blocking out your copying tool it is unlikely that each thread is running anywhere near 100% on each core.
Each copying thread will send a read request to the hard disk and then go to sleep while waiting for the read request to be fulfilled. Your spinning rust disk generally has a seek time of 9 milliseconds, practically an eternity in CPU terms, and the copying task does not simply spin around asking "is it ready yet?", wasting CPU cycles. Doing so would lock that thread at 100% CPU and waste resources. No, what happens is that the thread issues a read and is put to sleep until the read completes and the data is ready for the next step.
In the meantime another thread does the same, gets blocked on a read and is put to sleep. This happens for all 16 of your threads. (In reality your reads and writes will be happening at random times as they get out of sync, but you get the idea)
Once one of the threads has data ready, Windows reschedules it and it starts preparing the data to be written. As far as the thread is concerned the process is the same. It says "write this data to file x at location y" and Windows takes the data and deschedules the thread. Windows does the background work to figure out where the file is, moves the data (potentially across the network, adding more milliseconds to the delay) and then returns control to the thread once the write has succeeded.
No one thread will be burning all the time on a CPU core and so more threads than you have CPUs is not a problem. No thread will be awake long enough for it to be a problem.
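The behaviour described above can be seen with a toy sketch, where time.sleep stands in for a blocking read or write (the thread counts and delays are made up for illustration):

```python
import threading
import time

def copy_worker(n_chunks=5, io_delay=0.01):
    # Each "read" blocks this thread; the scheduler runs others meanwhile.
    for _ in range(n_chunks):
        time.sleep(io_delay)  # stand-in for a blocking disk read/write

threads = [threading.Thread(target=copy_worker) for _ in range(16)]
start = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# 16 threads x 5 chunks x 10 ms = 0.8 s of total "I/O" wait, but because
# the threads all sleep concurrently the wall-clock time stays close to a
# single thread's 50 ms, regardless of how many cores are available.
print(f"elapsed: {elapsed:.3f}s")
```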
If you only had a single CPU with lots of other threads running then you could be bottlenecking on the CPU, but in a multicore system with this kind of workload I would be surprised if the CPU is the problem.
You are more likely to be bottlenecked on hard drive performance and are hitting the queue depth for the read or write buffers on the drives. By using more threads you are pushing something to its limits, be it disk or network, and the only way to find out what is the best number of threads is to do what you have done and experiment with it.
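One way to run that experiment is a small harness that times the same workload at different thread counts. The workload here is simulated with a sleep; swapping the sleep for real shutil.copy calls would measure your actual disks:

```python
import concurrent.futures
import time

def timed_copy(n_threads, n_files=32, io_delay=0.005):
    """Time a simulated bulk copy using the given number of worker threads."""
    def copy_one(_):
        time.sleep(io_delay)  # stand-in for per-file read+write latency
    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(copy_one, range(n_files)))
    return time.monotonic() - start

for n in (1, 4, 16):
    print(f"{n:2d} threads: {timed_copy(n):.3f}s")
```

With a purely latency-bound fake workload more threads always wins; against a real disk you will find the knee where queue depth is saturated.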
On a system with SSD to SSD copying I would suspect that a lower number of threads might be better as there would be less latency than copying files from spinning rust HDDs, pushing across the network and writing to spinning rust, but I have no evidence to support that supposition.