How is the micro-op cache tagged?

cache cpu-architecture cpu-cache intel-core-i7

According to Real World Technologies’ article on “Intel’s Sandy Bridge Microarchitecture”:

“Sandy Bridge’s uop cache is organized into 32 sets and 8 ways, with 6 uops per line, for a total of 1.5K uops capacity. The uop cache is strictly included in the L1 instruction cache. Each line also holds metadata including the number of valid uops in the line and the length of the x86 instructions corresponding to the uop cache line. Each 32B window that is mapped into the uop cache can span 3 of the 8 ways in a set, for a maximum of 18 uops – roughly 1.8B/uop. If a 32B window has more than 18 uops, it cannot fit in the uop cache and must use the traditional front-end. Microcoded instructions are not held in the uop cache, and are instead represented by a pointer to the microcode ROM and optionally the first few uops.”
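
As a quick sanity check, the quoted numbers are self-consistent (nothing here beyond the quote's own arithmetic):

    SETS, WAYS, UOPS_PER_LINE = 32, 8, 6     # geometry from the quote

    print(SETS * WAYS * UOPS_PER_LINE)       # 1536, the "1.5K uops" capacity
    print(3 * UOPS_PER_LINE)                 # 18 uops max per 32B window (3 ways)
    print(32 / (3 * UOPS_PER_LINE))          # ~1.78, the "roughly 1.8B/uop"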

'Each 32B window (from the instruction cache) that is mapped into the uop cache can span 3 of the 8 ways of a set'

So assume we have a 32B instruction window, which would be half of a 64B L1 instruction cache line. Within that line, only the offset bits differ; the tag and set bits are the same for all bytes on the line.
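
For concreteness, here is that split in code, assuming Sandy Bridge's 32KiB / 8-way / 64B-line L1i (so 64 sets); those parameters are my assumption for illustration:

    def split_l1i_addr(vaddr, line_bytes=64, sets=64):
        """Conventional L1i split: offset within the 64B line,
        set index, and tag."""
        offset = vaddr % line_bytes               # low 6 bits
        index  = (vaddr // line_bytes) % sets     # next 6 bits
        tag    = vaddr // (line_bytes * sets)     # everything above
        return tag, index, offset

    # Both 32B halves of one 64B line: same tag, same index, different offset.
    print(split_l1i_addr(0x1000))   # (1, 0, 0)
    print(split_l1i_addr(0x1020))   # (1, 0, 32)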

Once a 32 byte window has been decoded, the uops are entered into the uop cache under the same virtual address that was used to retrieve the 16 byte fetch block from the L1 instruction cache (so that they can be probed in parallel at every 32B boundary).

It says that these uops can span 3 of the 8 ways in a set, but that would mean they would have to have the same set bits but different tag bits to end up in the same set (meaning they wouldn't have been on the same line in the L1I cache). Does this mean the uop cache is arranged slightly differently: a single virtual address at the start of a line, with the uops just filling up into the next way in the set, and the next? And how is it ensured that the next 32B instruction window, which would still have the same tag and set bits but different offset bits (the 2nd half of the 64B line in L1I), is mapped to the 4th way of that set?

Postulation: the first uop cache way is tagged with a virtual index / physical tag, the next way is tagged with nothing, the third with nothing, and the 4th is tagged with a virtual index / physical tag where the only difference is that the offset has changed from 0 to 32. In essence, a way can be selected using different offset bits, as opposed to the manner in which the L1I cache is tagged, where the offset bits function as an offset within the cache line.

Can anyone clarify the layout of uop caches or how this tagging actually works?

Best Answer

Note that AMD Zen also has a uop cache, but less is known about its internals. So you're asking specifically about Intel's uop cache in Sandybridge-family.

According to Agner Fog's testing (https://www.agner.org/optimize/, specifically his microarch pdf), it's virtually addressed (VIVT), saving the latency / power of iTLB lookups on uop-cache hits, and making it possible to keep the iTLB tightly integrated with the L1i cache, as is normal for a VIPT L1 cache.

(also related: Which cache mapping technique is used in intel core i7 processor? for a summary of that and other caches, and https://stackoverflow.com/tags/x86/info for more performance / uarch links.)

Once a 32 byte window has been decoded

This is where you went wrong in your thought process.

The uop cache only caches uops that get decoded along the path of (speculative) execution. x86 instructions can only be decoded correctly if you know the correct starting point. The bytes after an unconditional jmp might not be the start of an instruction at all.
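
To see why the starting point matters, here's a small sketch using the Capstone disassembler (Capstone is my choice for illustration, not something the hardware uses): the same bytes decode as completely different instruction streams depending on the start offset.

    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    code = bytes.fromhex("b801000000c3")     # mov eax, 1 ; ret
    md = Cs(CS_ARCH_X86, CS_MODE_64)

    for start in (0, 1):                     # decode from byte 0, then byte 1
        print(f"starting at offset {start}:")
        for insn in md.disasm(code[start:], 0x1000 + start):
            print(f"  {insn.address:#x}: {insn.mnemonic} {insn.op_str}")

Starting at offset 0 you get mov eax, 1 / ret; starting one byte later, the same bytes decode as two add instructions and a ret.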

Plus, you don't want to pollute the uop cache with many single-byte filler instructions between functions (e.g. 0x90 NOP or 0xcc int3 used by MSVC). Or in general, with "cold" instructions that aren't reached during normal execution following a taken branch. A uop-cache "line" / way ends early with an unconditional jump, or with a call.

The legacy decoders are either decoding instructions the CPU expects to actually execute (feeding them into the uop cache for reuse later, and into the IDQ directly for use right away), or they're powered down. Unlike P4's, the legacy decoders are not weak; they're similar to the decoders in Core2 / Nehalem, so execution from L1i is generally fine except in high-throughput code with a large average instruction size. They don't need to try to "build traces" ahead of time. (The uop cache is not a trace cache anyway; it doesn't follow jumps. And the CPU doesn't try to populate the uop cache for all 32 instruction bytes that could be cached right away.)

But interestingly, Agner says "The same piece of code can have multiple entries in the μop cache if it has multiple jump entries"


My best guess at how the cache-lookup machinery actually works:

Given a 64-bit virtual address to fetch code from:

  • The low 5 bits are the offset relative to a 32-byte boundary.
  • The next 5 bits are an index, selecting one of the 32 sets. (Not 6 bits as for 64-byte L1i lines; fetch from the uop cache doesn't directly care about L1i line size.)
  • The higher bits (up to bit 48) are the tag.

Use the 5-bit index to select a set.
Fetch all 8 ways from that set (tag + metadata, and also data in parallel because this is a high-performance cache).
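
In code form, the guessed address split might look like this (the bit positions follow the list above; the 48-bit virtual address width is the assumption already stated):

    def split_uop_fetch_addr(vaddr):
        """Guessed split of a 48-bit virtual fetch address:
        5 offset bits (position in a 32B window), 5 index bits
        (one of 32 sets), and the remaining bits as the tag."""
        offset = vaddr & 0x1f                      # bits [4:0]
        index  = (vaddr >> 5) & 0x1f               # bits [9:5]
        tag    = (vaddr >> 10) & ((1 << 38) - 1)   # bits [47:10]
        return tag, index, offset

    # Two consecutive 32B windows of the same 64B L1i line land in
    # *different* uop-cache sets (index differs), unlike in L1i:
    print(split_uop_fetch_addr(0x1000))   # (4, 0, 0)
    print(split_uop_fetch_addr(0x1020))   # (4, 1, 0)

Note how this already answers the question's puzzle: the second half of a 64B L1i line doesn't compete for a 4th way in the same set; it indexes a different set.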

Compare in parallel for all 8 ways:

  • tag bits all match
  • offset is within the start+length range of x86 machine code this way caches uops for. (A way can only cache uops for 1 contiguous block of x86 machine code).

At most 1 way in the set will have both conditions true for a given instruction address. If there is one, this is your hit, and you can fetch uops from the one way that matched. (Like with a regular byte cache, except you need to check the metadata to select which uop to start fetching from if you jumped into the middle of a way.)
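
Here's that check as code (the same guesswork as the prose; the Way fields are stand-ins for whatever metadata the hardware really stores):

    from dataclasses import dataclass, field

    @dataclass
    class Way:
        tag: int        # high bits of the virtual address of the cached code
        start: int      # offset of the first cached instruction in its 32B window
        length: int     # bytes of contiguous x86 code this way covers
        uops: list = field(default_factory=list)  # each uop tagged with its start offset

    def lookup(set_ways, tag, offset):
        """In hardware all 8 comparisons happen in parallel;
        at most one way can match both conditions."""
        for way in set_ways:
            if way.tag == tag and way.start <= offset < way.start + way.length:
                return way
        return None

On a hit, you'd then start fetching from the first uop whose recorded start offset equals the fetch offset; that's the metadata check for jumps into the middle of a way.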

This is guesswork based on how the uop cache performs and when it throws out ways. But it may help you get a useful mental model of it.


Note that the address doesn't need to be 16-byte aligned. It needs to efficiently support branch targets that aren't aligned, as well as straight-line code with instruction boundaries that don't line up with 32-byte boundaries. (As best I can tell, instructions that cross a 32-byte boundary are cached in a uop-cache way for the start address of the instruction, even if it ends in the next L1i cache line across a 64-byte boundary.)

L1i fetch blocks / pre-decode for instruction length are aligned, but full decoding in the legacy decoders works on up to 16 bytes of whatever alignment, taken from the queue between pre-decode and decode. Aligning loop entry points to particular boundaries is less important than it used to be.


Then I guess there's a check that the fetch address exactly matches one of the instruction start addresses in the selected way. Decoding the same bytes two different ways is not supported efficiently, because only obfuscated code does that.

The uop cache can't hold both decodings of the same bytes at the same time, so on detecting this the CPU has to fall back to the legacy decoders and throw out the uop-cache ways for this 32B block (which it already found with the tag comparator).

Then it can start re-populating the uop-cache as it decodes uops from this point.

A similar thing happens when 3 ways are already full but there are more uops from the same 32B block of x86 machine code: the uop cache throws out all 3 ways for that block. (I'm not sure whether it remembers not to try caching them next time, or whether it just rebuilds the ways every time and throws them away when it hits the limit, e.g. in a loop with 20 single-byte nop instructions.)
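
As a sketch of that give-up rule (the 18-uop limit comes from the RWT quote; the pack_window helper and the packing itself are just my illustration):

    MAX_WAYS_PER_WINDOW = 3
    UOPS_PER_WAY = 6

    def pack_window(uops):
        """Pack one 32B window's decoded uops into up to 3 ways,
        or give up (run from the legacy decoders) if they don't fit."""
        if len(uops) > MAX_WAYS_PER_WINDOW * UOPS_PER_WAY:
            return None                     # > 18 uops: evict the partial ways
        return [uops[i:i + UOPS_PER_WAY]
                for i in range(0, len(uops), UOPS_PER_WAY)]

    print(pack_window(["nop"] * 20))        # None: 20 uops don't fit in 18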

See Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs for some details about this case. Note that micro-coded instructions like div use up a whole way of the uop cache on their own, and can easily lead to filling up all 3 ways and triggering DSB-to-MITE switches (DSB is Intel's name for the uop cache, MITE for the legacy-decode pipeline; a switch can create a 1-cycle bubble in the front-end).

That Q&A has a lot of detailed experiments and conclusions about how uops get cached. Not so much about how the uop cache is physically implemented; that's purely guesswork on my part here.

Also note that Intel CPUs before Skylake can only add 4 uops per cycle to the IDQ from the uop cache, yet somehow don't bottleneck when ways in the uop cache hold 3 or 6 uops instead of 4. So IDK if there's some kind of buffering for non-branching uop fetch; this is a bit of a mystery. You'd expect fetch to go in a 4, 2, 4, 2 pattern when fetching from full lines of 6 uops each, but we don't see a front-end bottleneck like that for loops running from the uop cache with 2-byte instructions like xor eax,eax. Intel has stated that the uop cache can only fetch uops from 1 way per cycle, so maybe the 4-uop limit is just for adding to the IDQ, not for reading from the uop cache into some merge buffer.
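
As a toy model of that last hypothesis (entirely speculation, like the paragraph above): with a small merge buffer, reading one 6-uop way per cycle can sustain 4 uops per cycle into the IDQ, while with no spare buffer capacity you'd see the 4, 2, 4, 2 pattern.

    def sustained_uops_per_cycle(way_sizes, idq_width=4, buf_cap=8):
        """Toy model: read at most one whole way per cycle into a buffer
        (if it fits), then deliver up to idq_width uops to the IDQ."""
        buf = delivered = cycles = i = 0
        while i < len(way_sizes) or buf:
            if i < len(way_sizes) and buf + way_sizes[i] <= buf_cap:
                buf += way_sizes[i]         # fetch one full way this cycle
                i += 1
            take = min(idq_width, buf)
            buf -= take                     # hand up to 4 uops to the IDQ
            delivered += take
            cycles += 1
        return delivered / cycles

    print(sustained_uops_per_cycle([6] * 100))             # 4.0 with a merge buffer
    print(sustained_uops_per_cycle([6] * 100, buf_cap=6))  # 3.0: the 4, 2, 4, 2 pattern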
