Why can I run 23,000 CUDA threads on the GeForce GTX 480 GPU?

Someone has told me that the GeForce GTX 480 GPU can run 23,000 CUDA threads concurrently. However, I am confused about why.

Each core of this GPU contains 2 groups of 16 SIMD units. Each SIMD unit has 8 ALUs and instruction contexts. There are 15 cores on the GPU.

Hence, shouldn't this GPU be able to run only 2 * 16 * 8 * 15 = 3840 CUDA threads at once?

Best Answer

GPU cores can effectively run many more threads at the same time than they have ALUs, because of the way they switch between threads to hide latency. In fact, you need to run many threads per core to fully utilize your GPU. The figure of 23,000 counts the thread contexts the hardware can keep resident, not ALUs: each core (SM, in NVIDIA's terminology) on a Fermi-class GPU like the GTX 480 can hold up to 1536 resident threads, and 1536 × 15 SMs = 23,040, which is roughly 23,000.
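As a concrete illustration, here is a minimal sketch (the kernel, names, and sizes are my own, not from the question) of a CUDA program that launches about a million threads, vastly oversubscribing the hardware, which is exactly what the scheduler wants:

```cuda
// Trivial kernel: each thread handles one element. Launching far more
// threads than the GPU has ALUs is normal; the surplus threads are what
// the hardware switches among to hide latency.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;                // ~1M elements -> ~1M threads
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    int block = 256;                      // threads per block
    int grid = (n + block - 1) / block;   // enough blocks to cover n
    scale<<<grid, block>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```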

A GPU is deeply pipelined: even though a new instruction can start every cycle, each individual instruction takes many cycles to complete. Sometimes an instruction depends on the result of a previous instruction, so it can't start (enter the pipeline) until that previous instruction finishes (exits the pipeline). Or it may depend on data from RAM, which takes hundreds of cycles to access. On a CPU, this results in a "pipeline stall" (or "bubble"), which leaves part of the pipeline sitting idle for a number of cycles, just waiting for the next instruction to start. This wastes computing resources, but it is often unavoidable.
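To make the stall concrete, here is a toy kernel of my own (not from the answer) containing a serial dependency chain; within a single thread, each multiply must wait for the previous result, so a lone thread would leave most pipeline stages empty:

```cuda
__global__ void dependent_chain(float *out, float x)
{
    // Each line reads the result of the line above, so within one thread
    // these multiplies cannot overlap in the pipeline. While this thread
    // waits, other resident threads can issue into the idle stages.
    float a = x * 1.0001f;
    float b = a * 1.0001f;  // can't start until a is ready
    float c = b * 1.0001f;  // can't start until b is ready
    out[blockIdx.x * blockDim.x + threadIdx.x] = c;
}
```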

Unlike a CPU, a GPU core can switch between threads very quickly, on the order of a cycle or two. So when one thread stalls for a few cycles because its next instruction can't start yet, the GPU simply switches to some other thread and starts its next instruction instead. If that thread stalls, the GPU switches threads again, and so on. These additional threads do useful work in pipeline stages that would otherwise have sat idle during those cycles, so if there are enough threads to fill each other's gaps, the GPU can do work in every pipeline stage on every cycle. Latency in any one thread is hidden by the other threads.

This is the same principle that underlies Intel's Hyper-Threading feature, which makes a single core appear as two logical cores. In the worst case, threads running on those two logical cores compete with each other for hardware resources, and each runs at half speed. But in many cases, one thread can use resources that the other can't (ALUs that aren't needed at the moment, pipeline stages that would be idle due to stalls), so that both threads run at more than 50% of the speed they'd achieve running alone. The design of a GPU basically extends this benefit to far more than two threads.

You might find it helpful to read NVIDIA's CUDA Best Practices Guide, specifically chapter 10 ("Execution Configuration Optimizations"), which provides more detailed information about how to arrange your threads to keep the GPU busy.
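As a starting point, the CUDA runtime exposes an occupancy API that suggests a launch configuration. The sketch below is a minimal usage example (the scale kernel is my own stand-in): cudaOccupancyMaxPotentialBlockSize returns a block size that maximizes the number of resident threads per SM for a given kernel.

```cuda
#include <cstdio>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for a block size that maximizes occupancy for this
    // kernel, i.e. keeps as many threads resident per SM as possible.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);
    printf("suggested block size: %d, minimum grid size: %d\n",
           blockSize, minGridSize);
    return 0;
}
```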
