Can One CPU Core Perform Multiple Operations Per Tick?

cpu, cpu-architecture, cpu-cache, multi-core

A core has its own execution units and load/store buffers (an additional "cache" on top of L1).

  1. Do those execution units have their own registers? Do cores also have their own dedicated registers? Or are there just CPU registers shared by all cores (and their execution units)? Or are some registers shared while others are core-dedicated?

  2. Can multiple CPU machine instructions be executed during one tick on one core (on that core's different execution units, and also in parallel in hyper-threading mode)?

  3. Does each core really have its own (dedicated) FPU and ALU as its execution units? I thought that a CPU has a single FPU (regardless of number of cores).

Best Answer

To answer directly: modern x86 CPUs are indeed superscalar, capable of fetching, scheduling and executing multiple instructions per clock cycle.

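As a slightly extreme example, a modern i7 6950X core is reported to reach about 10.6 instructions per clock cycle (per core) on the Dhrystone MIPS benchmark. Keep in mind that a Dhrystone MIPS score is normalized against a reference machine (the VAX 11/780) rather than counted from retired x86 instructions, so the figure indicates throughput rather than a literal instruction count. Instruction fusion and other smart features in and around the core also make it more efficient than a simple 1:1 instruction stream.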

The front end of the CPU decodes instructions and passes uOPs (instructions that have been broken down, or even fused) to the execution engine, which then routes and dispatches them to the various units capable of handling different instruction types.

In a Skylake CPU there are multiple units capable of doing integer arithmetic and logic (INT ALU) and also vector processing as well as FP math. In theory an instruction could be dispatched to each one of those units at the same time for execution, but generally there is a limit on how many uOPs can be dispatched at once and to what units.

There is also the problem of instructions having different timings and not all processing units becoming available at the same time.
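
To make the effect of the dispatch limit and the differing timings concrete, here is a minimal Python sketch of a toy scheduler. It is not how Skylake works: the `Uop` class, the `schedule` function, the latencies and the width values are all made up for illustration, and execution units are assumed unlimited so that only the dispatch width and data dependencies matter.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    name: str
    deps: tuple = ()   # names of uops whose results this one needs
    latency: int = 1   # cycles until the result becomes available


def schedule(uops, width):
    """Return {name: start_cycle} for a toy out-of-order core.

    Each cycle, up to `width` uops whose inputs are ready may start.
    Assumes every dependency names a uop that is in the list.
    """
    done_at = {}       # name -> cycle at which the result is ready
    start = {}
    pending = list(uops)
    cycle = 0
    while pending:
        issued = 0
        for u in list(pending):
            if issued == width:
                break  # dispatch limit reached for this cycle
            if all(d in done_at and done_at[d] <= cycle for d in u.deps):
                start[u.name] = cycle
                done_at[u.name] = cycle + u.latency
                pending.remove(u)
                issued += 1
        cycle += 1
    return start


# Hypothetical uop stream: three independent uops, one dependent add,
# and one uop that must wait for a slow (3-cycle) multiply.
uops = [
    Uop("a"),
    Uop("b"),
    Uop("c", deps=("a", "b")),
    Uop("mul", latency=3),
    Uop("d", deps=("mul",)),
]
print(schedule(uops, width=4))
print(schedule(uops, width=1))
```

With a width of 4 the three independent uops all start in the first cycle, while `d` still has to wait for the slow `mul`. With a width of 1 the same work is spread over more cycles.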

As to registers, internally the CPU can rename (remap) the registers used by a program onto a larger pool of physical registers to better suit the actual execution units. In the image below you can see that Skylake has over 300 of them: 180 integer and 168 vector registers. These are allocated as required.
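
The remapping can be sketched in a few lines of Python. This is a simplified illustration, not Skylake's actual renamer: the `rename` function, the `p0`, `p1`, ... register names and the three-instruction program are hypothetical, and physical registers are never freed here, whereas a real core recycles them.

```python
def rename(instructions, num_physical):
    """Rewrite (dest, src1, src2, ...) tuples to use physical registers.

    Every write gets a fresh physical register, so a later write to the
    same architectural name no longer conflicts with earlier readers.
    Raises IndexError when the pool is exhausted (a real core would stall).
    """
    free = [f"p{i}" for i in range(num_physical)]
    mapping = {}                         # architectural name -> physical name
    out = []
    for dest, *srcs in instructions:
        phys_srcs = []
        for s in srcs:
            if s not in mapping:
                mapping[s] = free.pop(0)  # first-seen input value
            phys_srcs.append(mapping[s])  # read the current mapping
        mapping[dest] = free.pop(0)       # fresh register for each result
        out.append((mapping[dest], *phys_srcs))
    return out


# Hypothetical program: rax is written twice with an unrelated value.
program = [
    ("rax", "rbx", "rcx"),   # rax = rbx op rcx
    ("rdx", "rax", "rax"),   # rdx = rax op rax  (reads the first rax)
    ("rax", "rsi", "rdi"),   # rax = rsi op rdi  (reuses the name rax)
]
for line in rename(program, num_physical=16):
    print(line)
```

After renaming, the third instruction writes a different physical register than the first, so it does not have to wait for the second instruction to finish reading the old value of `rax`.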

Wikichip is an awesome place to find out more about CPU architecture in general. Below is an image showing the functional blocks in a Skylake CPU core.

[Image: functional blocks of a Skylake CPU core]

You cannot dispatch two instructions to the same port in one clock cycle, but instructions can be queued per port, or allocated to another port for execution as long as that port is capable of executing that instruction type.
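
A minimal Python sketch of that rule, under loud assumptions: the `dispatch` function, the two ports `A` and `B` and their capabilities are invented for the example and do not match Skylake's real port layout, and all uops are treated as independent and as occupying a port for one cycle.

```python
def dispatch(uops, ports):
    """Assign uops to ports cycle by cycle, one uop per port per cycle.

    uops:  list of (name, kind) in program order, with unique names
    ports: {port_name: set of kinds that port can execute}
    Returns a list of cycles; each cycle is {port_name: uop_name}.
    """
    waiting = list(uops)
    cycles = []
    while waiting:
        busy = {}                        # ports already used this cycle
        for name, kind in list(waiting):
            # Take the first free port that can run this kind of uop;
            # if none is free, the uop stays queued for a later cycle.
            for port, kinds in ports.items():
                if kind in kinds and port not in busy:
                    busy[port] = name
                    waiting.remove((name, kind))
                    break
        if not busy:
            raise ValueError("no port can execute the remaining uops")
        cycles.append(busy)
    return cycles


# Hypothetical core: port A handles integer and FP work, port B integer only.
ports = {"A": {"alu", "fp"}, "B": {"alu"}}
uops = [("add1", "alu"), ("mul1", "fp"), ("add2", "alu"), ("mul2", "fp")]
for n, cycle in enumerate(dispatch(uops, ports)):
    print(n, cycle)
```

In this run the two integer adds go out together on different ports, while the two FP multiplies have to take turns because only one port in this made-up layout can execute them.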
