“Sandy Bridge’s uop cache is organized into 32 sets and 8 ways, with 6 uops per line, for a total of 1.5K uops capacity. The uop cache is strictly included in the L1 instruction cache. Each line also holds metadata including the number of valid uops in the line and the length of the x86 instructions corresponding to the uop cache line. Each 32B window that is mapped into the uop cache can span 3 of the 8 ways in a set, for a maximum of 18 uops – roughly 1.8B/uop. If a 32B window has more than 18 uops, it cannot fit in the uop cache and must use the traditional front-end. Microcoded instructions are not held in the uop cache, and are instead represented by a pointer to the microcode ROM and optionally the first few uops.”
'Each 32B window (from the instruction cache) is mapped into the uop cache, can span 3 of the 8 ways of a set'
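To make the numbers in the quote concrete, here is a small sketch of the arithmetic. The constants come straight from the quoted text; the helper names (`total_capacity`, `ways_needed`) are just illustrative.

```python
import math

# Figures taken from the quoted description of Sandy Bridge's uop cache.
SETS = 32
WAYS = 8
UOPS_PER_LINE = 6
MAX_WAYS_PER_WINDOW = 3
WINDOW_BYTES = 32

def total_capacity():
    """Total number of uop slots in the cache."""
    return SETS * WAYS * UOPS_PER_LINE

def ways_needed(n_uops):
    """Ways a 32B window with n_uops would occupy, or None if it cannot be cached."""
    ways = math.ceil(n_uops / UOPS_PER_LINE)
    return ways if ways <= MAX_WAYS_PER_WINDOW else None

print(total_capacity())                                             # 1536
print(MAX_WAYS_PER_WINDOW * UOPS_PER_LINE)                          # 18
print(round(WINDOW_BYTES / (MAX_WAYS_PER_WINDOW * UOPS_PER_LINE), 2))  # 1.78
print(ways_needed(7), ways_needed(19))                              # 2 None
```

So "1.5K uops" is 1536 slots, and the "roughly 1.8B/uop" figure is just 32 bytes divided by the 18-uop limit.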
So assume we have a 32B instruction window, which is half of an L1 instruction cache line. For all bytes on that line, only the offset bits differ; the tag and set bits are the same.
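As a sketch of what I mean by the tag/set/offset split: the L1I geometry below (32 KiB, 8-way, 64B lines, so 64 sets) is my assumption and is not stated in the quote, and the example addresses are made up.

```python
# Assumed L1I geometry (not stated in the quote): 32 KiB, 8-way, 64B lines.
L1I_LINE_BYTES = 64
L1I_SETS = 32 * 1024 // (8 * L1I_LINE_BYTES)   # 64 sets

OFFSET_BITS = L1I_LINE_BYTES.bit_length() - 1  # 6 offset bits
SET_BITS = L1I_SETS.bit_length() - 1           # 6 set-index bits

def split_l1i(addr):
    """Split an address into (tag, set index, byte offset) for the assumed L1I."""
    offset = addr & (L1I_LINE_BYTES - 1)
    set_index = (addr >> OFFSET_BITS) & (L1I_SETS - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_index, offset

# Two 32B windows that are the two halves of one 64B line (made-up addresses).
print(split_l1i(0x401000))  # (1025, 0, 0)
print(split_l1i(0x401020))  # (1025, 0, 32)
```

Both halves get the same tag and set; only bit 5 of the offset tells them apart.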
Once a 32-byte window has been decoded, the uops are entered into the uop cache with the same virtual address that was used to retrieve the 16-byte fetch block from the L1 instruction cache (so that both can be probed in parallel at every 32B boundary).
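The relationship between 16-byte fetch blocks and the 32B window they belong to can be sketched like this. It is only alignment arithmetic; the function names and the example address are hypothetical.

```python
WINDOW_BYTES = 32  # uop cache window size from the quote
FETCH_BYTES = 16   # L1I fetch block size

def window_base(addr):
    """Address of the 32B-aligned window containing addr."""
    return addr & ~(WINDOW_BYTES - 1)

def fetch_blocks(addr):
    """Start addresses of the 16B fetch blocks that make up addr's 32B window."""
    base = window_base(addr)
    return [base + i for i in range(0, WINDOW_BYTES, FETCH_BYTES)]

print(hex(window_base(0x40102C)))               # 0x401020
print([hex(a) for a in fetch_blocks(0x40102C)])  # ['0x401020', '0x401030']
```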
It says that these uops can span 3 of the 8 ways in a set, but that would mean they would need the same set bits but different tag bits to end up in the same set (meaning they wouldn't have been on the same line in the L1I cache). Does this mean that the uop cache is arranged slightly differently, with a single virtual address at the start of a line and the uops simply filling the next way in the set, and then the one after that? And how is it ensured that the next 32B instruction window, which would still have the same tag and set bits but different offset bits (the 2nd half of the 64B line in L1I), is mapped to the 4th way of that set?
Postulation: the first uop cache way is tagged with a virtual index / physical tag, the next way is tagged with nothing, the third with nothing, and the 4th is tagged with a virtual index / physical tag in which the offset has changed from 0 to 32. In essence, a way would be selected using different offset bits, unlike in the L1I cache, where the offset bits only select a byte within the cache line.
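Here is a toy model of that postulation, purely to show the bookkeeping I have in mind. Everything in it is hypothetical: the choice of set-index bits, folding bit 5 into the tag, and the `fill_set` helper are my guesses, not documented Intel behaviour. For simplicity the model tags every way a window occupies, rather than leaving the continuation ways untagged.

```python
UOP_SETS = 32      # from the quote
UOP_WAYS = 8       # from the quote
UOPS_PER_WAY = 6   # from the quote
WINDOW_BITS = 5    # 32B window
LINE_BITS = 6      # 64B L1I line (assumed)

def split_uop_cache(addr):
    """Hypothetical split: set index taken above the 64B line offset, tag includes bit 5."""
    window_offset = addr & ((1 << WINDOW_BITS) - 1)
    set_index = (addr >> LINE_BITS) & (UOP_SETS - 1)
    tag = addr >> WINDOW_BITS  # whole window number, so bit 5 separates the two halves
    return tag, set_index, window_offset

def fill_set(windows):
    """Place (addr, n_uops) windows, all assumed to map to one set, into its 8 ways in order."""
    ways = [None] * UOP_WAYS
    next_free = 0
    for addr, n_uops in windows:
        tag, _, _ = split_uop_cache(addr)
        need = -(-n_uops // UOPS_PER_WAY)  # ceiling division
        if need > 3 or next_free + need > UOP_WAYS:
            continue                       # window does not fit; skipped in this toy model
        for i in range(need):
            ways[next_free + i] = tag
        next_free += need
    return ways

# Two halves of one 64B line: 18 uops in the first window, 10 in the second.
ways = fill_set([(0x401000, 18), (0x401020, 10)])
print([hex(t) if t is not None else None for t in ways])
```

In this model both windows land in the same set, the first takes ways 0 to 2, and the second starts at the 4th way because its tag differs in bit 5. Whether the real hardware does anything like this is exactly what I am asking.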
Can anyone clarify the layout of uop caches or how this tagging actually works?