GPU Processors – Understanding Hundreds of Processors Inside a GPU

Tags: cpu, gpu, graphics card

I just started a parallel programming course on Udacity and already I am kind of confused. Here at this video segment: https://youtu.be/gbj0oauFFI8?t=52s

We're told that the average GPU has thousands of ALUs and hundreds of processors. I am confused by the "hundreds of processors" part. Why are there that many? Shouldn't it be just one? GPU does stand for graphics processing unit. Isn't a GPU like a CPU, a single processor with thousands of ALUs inside, but entirely specialized for certain tasks? How do these "processors" come into play?

If I'm wrong, then I assume each of those processors has perhaps about 10 ALUs inside it (because 10 ALUs × hundreds of processors = thousands of ALUs)? Is there a layout I can look at so I can verify this?

Thank you.

Best Answer

A modern graphics processor is a highly complex device and can have thousands of processing cores. The Nvidia GTX 970, for example, has 1664 cores. These cores are grouped into batches that work together.

For an Nvidia card, the cores are grouped together in batches of 16 or 32, depending on the underlying architecture (Kepler or Fermi), and each core in a batch runs the same task.

The distinction between a batch and a core is an important one, though: while every core in a batch must run the same task, each core can operate on its own separate data.
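To make that concrete, here is a minimal CUDA kernel sketch (the function and variable names are purely illustrative, not from the course): every core in a batch executes exactly the same instruction stream, but each thread derives its own index, so it reads and writes different elements.

```
// Illustrative sketch: same instructions for every thread, different data per thread.
__global__ void scaleAdd(const float *a, const float *b, float *out, int n)
{
    // Each thread computes a unique index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * a[i] + b[i];   // same operation, applied to this thread's element
}
```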

Your central processing unit is large and only has a few cores because it is a highly generalised processor, capable of large-scale decision making and flow control. The graphics card eschews a large amount of that control and switching logic in favour of the ability to run a massive number of tasks in parallel.
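As a sketch of what "a massive number of tasks in parallel" looks like in practice, the host code below launches the scaleAdd kernel from above with roughly a million threads; the values and pointer names are just examples, and the hardware decides how those threads are scheduled onto its batches of cores.

```
#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;                         // ~1 million elements, one thread each
    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a,   n * sizeof(float));
    cudaMalloc(&d_b,   n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // The problem is expressed as one lightweight thread per element;
    // the GPU schedules these threads onto its cores in batches.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}
```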

If you insist on having a picture to prove it, then the die shot below (from the GTX 660Ti DirectCU II TOP review) shows 5 largely similar green areas, each containing several hundred cores, for a total of 1344 active cores split across what look to me like 15 functional blocks:

[Annotated die shot from the GTX 660Ti DirectCU II TOP review]

Looking closely, each block appears to have 4 sets of control logic along the side, suggesting that each of the 15 larger blocks you can see contains 4 SMX units.

This gives us 15 × 4 = 60 processing blocks with 32 cores each, for a total of 1920 cores. Batches of them will be disabled, either because they malfunctioned or simply to segregate chips into different performance tiers, which brings us down to the correct number of active cores.

A good source of information on how these batches map onto the hardware is this Stack Overflow question: https://stackoverflow.com/questions/10460742/how-do-cuda-blocks-warps-threads-map-onto-cuda-cores
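If you have a CUDA-capable card to hand, you can also query how it is organised directly. A minimal sketch using the CUDA runtime API (note that the number of cores per multiprocessor depends on the architecture and is not reported directly):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of GPU 0

    // multiProcessorCount is the number of SM/SMX units on the chip;
    // warpSize is the size of the batch that executes in lockstep.
    printf("GPU: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Warp size: %d\n", prop.warpSize);
    return 0;
}
```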
