GPU cores can effectively run many threads at the same time, due to the way they switch between threads for latency hiding. In fact, you need to run many threads per core to fully utilize your GPU.
A GPU is deeply pipelined, which means that even though new instructions can start every cycle, each individual instruction may take many cycles to run. Sometimes an instruction depends on the result of a previous instruction, so it can't start (enter the pipeline) until that previous instruction finishes (exits the pipeline). Or it may depend on data from memory that will take many cycles to arrive. On a CPU, this results in a "pipeline stall" (or "bubble"), which leaves part of the pipeline sitting idle for a number of cycles, just waiting for the new instruction to start. This is a waste of computing resources, but it can be unavoidable.
Unlike a CPU, a GPU core is able to switch between threads very quickly — on the order of a cycle or two. So when one thread stalls for a few cycles because its next instruction can't start yet, the GPU can just switch over to some other thread and start its next instruction instead. If that thread stalls, the GPU switches threads again, and so on. These additional threads are doing useful work in pipeline stages that would otherwise have been idle during those cycles, so if there are enough threads to fill up each other's gaps, the GPU can do work in every pipeline stage on every cycle. Latency in any one thread is hidden by the other threads.
This is the same principle that underlies Intel's Hyper-Threading feature, which makes a single core appear as two logical cores. In the worst case, threads running on those two cores will compete with each other for hardware resources, and each run at half speed. But in many cases, one thread can utilize resources that the other can't — ALUs that aren't needed at the moment, pipeline stages that would be idle due to stalls — so that both threads run at more than 50% of the speed they'd achieve if running alone. The design of a GPU basically extends this benefit to more than two threads.
You might find it helpful to read NVIDIA's CUDA Best Practices Guide, specifically chapter 10 ("Execution Configuration Optimizations"), which provides more detailed information about how to arrange your threads to keep the GPU busy.
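To make "many threads per core" concrete, here is a minimal CUDA sketch (the kernel name, array size, and block size are illustrative, not taken from any guide): it launches far more threads than the GPU has cores, so whenever one warp stalls on a memory load, the hardware scheduler has other warps ready to issue.

    #include <cuda_runtime.h>

    // Each thread scales one element. The load from global memory can stall
    // for many cycles, during which the core simply issues other resident warps.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;                 // 1M elements (illustrative)
        float *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(float));

        // 256 threads per block and ~4096 blocks: far more threads in flight
        // than the GPU has cores, so there is always other work to issue
        // while some warps wait on memory.
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }

The exact block size that keeps the GPU busiest varies by kernel and hardware, which is what the execution configuration chapter of the Best Practices Guide covers.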
As another user commented, it's mostly OS-dependent.
"If a CPU has 2 logical cores, it can run two programs 100% concurrently, yes?"
Concurrently, yes; truly in parallel, not necessarily: two logical cores usually share one physical core's execution resources, so both programs make progress, but not each at full independent speed. See: https://softwareengineering.stackexchange.com/questions/190719/the-difference-between-concurrent-and-parallel-execution
"For example, say I have 100 processes running on 2 cores ... will the OS try and divide 50 on each core for load balance? Will they be randomly scattered?"
Each OS has its own scheduling algorithm. In general, the scheduler balances runnable processes across the cores dynamically rather than statically assigning 50 to each, and they are not scattered purely at random either.
"Say I launch mspaint.exe on a quad-core Intel chip ... where will it be executed from (core 1, 2, 3, 4?), and will it continue executing there until close?"
We can't know in advance which core it will run on, and it will most probably not stay on the same core from launch to close; the scheduler is free to migrate it between cores. Again, it depends on the OS scheduler.
"Is it truly possible to pick a specific core, or program for multi-cores directly, without having a transparent daemon or the OS doing it randomly for you?"
Apparently yes, by setting the process's or thread's CPU affinity: https://stackoverflow.com/questions/663958/how-to-control-which-core-a-process-runs-on
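As a hedged illustration only (Windows-specific, and the mask value is just an example), here is a minimal sketch of pinning the current process to one core with the Win32 SetProcessAffinityMask call; other systems have equivalents such as sched_setaffinity on Linux.

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // Restrict the current process to logical core 0.
        // Bit N of the mask corresponds to logical core N.
        DWORD_PTR mask = 0x1;
        if (!SetProcessAffinityMask(GetCurrentProcess(), mask))
            std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());

        // ... everything this process does from here on runs on core 0 ...
        return 0;
    }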
"How so, if all people say is 'just use threads'? Is using multi-threads mapped to cores? If so, how is using a thread tailored to a core without OS intervention if threads on a single-core do not concurrently work?"
I didn't quite understand the question here, but the basic idea with threads is that you create them and the OS runs them using its scheduling algorithm; there's no need for you to control which logical or physical core they run on (there may be cases where you'd want to do that, though I'm not sure why).
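As a tiny illustration (standard C++; the work function and thread count are just examples, not from the original answer): you create the threads and join them, and which cores they run on, and whether they migrate, is entirely up to the OS scheduler.

    #include <thread>
    #include <vector>
    #include <cstdio>

    void work(int id)
    {
        // Whatever this thread does; the OS decides which core it runs on,
        // and may move it between cores over its lifetime.
        std::printf("thread %d running\n", id);
    }

    int main()
    {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i)
            pool.emplace_back(work, i);   // just create the threads...
        for (auto &t : pool)
            t.join();                     // ...the scheduler maps them to cores
        return 0;
    }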
Best Answer
A CPU is a much more general-purpose machine than a GPU. We might talk about using a GPU for "general purpose" computation (GPGPU), but the two have different strengths.
CPU cores are capable of a wide variety of operations and deal with what can, for all intents and purposes, be considered a randomly branching instruction stream: multiple programs all vying for time on the processor, controlled by the operating system. They cache and predict as much as they can while still trying to remain capable of dealing with sudden changes in the instruction stream.
GPUs, on the other hand, are designed to deal with data streams: a small program (a shader) run across a potentially vast amount of data. HD, 2K, and 4K screens contain a huge number of pixels, and a shader must run over every pixel, in successive passes, to achieve particular effects. To that end their programs are (compared to a CPU's) smaller, their per-core caches similarly smaller, but their bandwidth to memory is phenomenally higher.
They might, with suitable programming, be able to achieve the same tasks, but this difference in focus, instruction processing versus data processing, is what separates a CPU from a GPU.
As such their cores are designed to work to those strengths. For a long while GPU shader cores have operated around 1-2 GHz (modern Intel graphics cores list their speeds as 500 MHz to 1.5 GHz), while CPU cores have run anywhere between 1.5 and 4 GHz and more.
Instruction processing benefits more from the speed of individual units, because it can be difficult or impossible to break an instruction stream down into multiple streams; hence CPUs need higher clock speeds to get through instructions quickly. The problem is that the faster you run a core, the more heat it generates, so you hit a limit on how fast you can run it. (There are other technical limitations that affect clock speed, but that's something for another story.)
Data processing, on the other hand, lends itself to parallelism: the same task (program) is run on different pieces of data, so the more cores you can throw at it, the better. Running cores at a slower speed generates less heat; less heat means you can fit in more cores, and more cores mean better data throughput. Hence data tasks benefit from a different (smaller, leaner) type of core than a CPU's.
The end result is that we have two distinct types of processor: one aimed at general-purpose instruction streams, and one aimed at bulk data handling.