Digital Design and Computer Architecture - Lecture 15b: Out-of-Order Execution, Dataflow & Load/Store Handling

This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.

You can watch this lecture on YouTube and see the PDF.

I write this summary for personal learning purposes.

Required Readings

  • Out-of-order execution
    • H&H, Chapter 7.8-7.9
  • Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proceedings of the IEEE, 1995
    • More advanced pipelining
    • Interrupt and exception handling
    • Out-of-order and superscalar execution concepts
  • Optional
    • Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999.

Agenda for Today & Next Few Lectures

  • Out-of-Order Execution
  • Other Execution Paradigms

Some Questions

  • What is needed in hardware to perform tag broadcast and value capture?
    • Lots of comparators, to match the broadcast tag against the source tags of every waiting instruction
    • Logic to capture the broadcast value and mark that source valid
    • Wakeup logic: if both sources are ready, the instruction can start executing in the next cycle. If multiple instructions for the same functional unit become ready, select logic must determine the order.
  • Does the tag have to be the ID of the Reservation Station Entry?
    • No. The tag can be any unique identifier. It ensures that a particular instance of a register is named uniquely.
  • What can potentially become the critical path?
    • Tag broadcast -> value capture -> instruction wake up
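The tag broadcast -> value capture -> wakeup path above can be sketched in code. This is my own minimal illustration, not from the lecture; the `Source`, `RSEntry`, and `broadcast` names are invented for the example.

```python
# Minimal sketch of tag broadcast, value capture, and wakeup/select.
# Each reservation-station source either holds a ready value or a tag
# it is waiting for.

class Source:
    def __init__(self, tag=None, value=None):
        self.tag = tag              # None once the value is captured
        self.value = value
        self.ready = tag is None

class RSEntry:
    def __init__(self, op, src1, src2):
        self.op = op
        self.sources = [src1, src2]

    def capture(self, tag, value):
        # One comparator per source: this is why tag broadcast needs
        # "lots of comparators" across all reservation station entries.
        for s in self.sources:
            if not s.ready and s.tag == tag:
                s.value, s.ready = value, True

    def woken_up(self):
        # An instruction wakes up when both sources are ready.
        return all(s.ready for s in self.sources)

def broadcast(rs_entries, tag, value):
    for e in rs_entries:
        e.capture(tag, value)
    # Select: among woken-up entries, pick one (here simply list order,
    # standing in for an oldest-first select policy).
    ready = [e for e in rs_entries if e.woken_up()]
    return ready[0] if ready else None
```

For example, broadcasting tag `"T1"` wakes up an ADD whose other source was already ready, while a MUL still waiting on `"T2"` stays asleep.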

Some More Questions (Design Choices)

We are not going to fully answer these questions, because there are many design choices you can make in an out-of-order machine.

  • When is a reservation station entry deallocated?
  • Should the reservation stations be dedicated to each functional unit or global across functional units? (E.g., should an adder have a different set of reservation stations than a multiplier, or should reservation stations be shared?)
    • Centralized vs. Distributed: What are the tradeoffs?
      • If it’s centralized, it’s a bigger data structure. Assuming you keep the same number of instructions in flight, it’s a bigger memory, so it may not be as easy to design. But you may get better utilization if many in-flight instructions are of the same type.
      • If it’s distributed, you can specialize each reservation station for its functional unit. For example, if the unit needs only one source register, its reservation stations don’t need to handle two tags.
  • Should reservation stations and ROB store data values or should there be a centralized physical register file where all data values are stored?
    • Storing values somewhere centralized reduces duplication. For example, if many instructions read R1, its value is duplicated across many reservation station entries. You can reduce storage by not storing values in the RS and keeping them in a physical register file instead.
  • Timing: Exactly when does an instruction broadcast its tag?
    • For example, an instruction could broadcast its tag a few cycles before its value is ready, so that dependent instructions wake up just in time to capture the forwarded result.
  • Many other design choices for OoO engines

Out-of-Order Execution with Precise Exceptions

Let’s work through an exercise on out-of-order execution with precise exceptions.

  • Assume ADD (4 cycle execute), MUL (6 cycle execute)
  • Assume one adder and one multiplier
  • How many cycles does it take
    • in a non-pipelined machine
    • in an in-order-dispatch pipelined machine with reorder buffer (no forwarding and full forwarding)
    • in an out-of-order dispatch pipelined machine with reorder buffer (full forwarding)

Idea: Use a reorder buffer to reorder instructions before committing them to architectural state. When an instruction finishes execution, it writes its value to the frontend register file (via the RAT) and also to its ROB entry. When the instruction becomes the oldest in the machine, its value is copied into the architectural register file (ARF). So the ARF is always updated in program order.

  • An instruction updates the RAT when it completes execution
    • Also called frontend register file. It is very similar to what we’ve shown.
  • An instruction updates a separate architectural register file when it retires
    • It will hold the architectural state and be used for precise exceptions
    • i.e., when it is the oldest in the machine and has completed execution
    • In other words, the architectural register file is always updated in program order
  • On an exception: flush the pipeline and copy the architectural register file into the frontend register file, because the ARF holds the correct (precise) state.
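A small sketch of the two-register-file idea: completion updates the frontend file out of order, while retirement updates the ARF strictly in program order, and an exception restores the frontend file from the ARF. This is my own simplification (the `Machine` class and its method names are invented for illustration).

```python
# Sketch of in-order retirement with a frontend register file (updated
# at completion) and an architectural register file (updated at retire).

from collections import deque

class Machine:
    def __init__(self):
        self.frontend_rf = {}     # speculative values, updated out of order
        self.arf = {}             # precise architectural state
        self.rob = deque()        # oldest instruction at the left

    def dispatch(self, dest):
        self.rob.append({"dest": dest, "value": None, "done": False})
        return self.rob[-1]

    def complete(self, entry, value):
        entry["value"], entry["done"] = value, True
        self.frontend_rf[entry["dest"]] = value   # may happen out of order

    def retire(self):
        # The ARF is updated strictly in program order: only the oldest
        # completed instructions leave the ROB.
        while self.rob and self.rob[0]["done"]:
            e = self.rob.popleft()
            self.arf[e["dest"]] = e["value"]

    def exception(self):
        # Flush the pipeline; the ARF holds the precise state, so copy
        # it back into the frontend register file.
        self.rob.clear()
        self.frontend_rf = dict(self.arf)
```

Note that a younger instruction that completes first cannot update the ARF until the older one ahead of it in the ROB also completes.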

Modern OoO Execution w/ Precise Exceptions

  • Most modern processors use the following
  • Reorder buffer to support in-order retirement of instructions
  • A single register file to store all registers
    • Both speculative(renamed) and architectural registers
    • INT and FP are still separate
  • Two register maps to distinguish between the front-end and the architectural states.
    • Future/frontend register map -> used for renaming
    • Architectural register map -> used for maintaining precise state

Enabling OoO Execution, Revisited

  1. How do we enable out-of-order execution?
    1. Link the consumer of a value to the producer
    2. Register renaming: Associate a “tag” with each data value
  2. Buffer instructions until they are ready
    1. Insert instruction into reservation stations after renaming
  3. Keep track of readiness of source values of an instruction
    1. Broadcast the “tag” when the value is produced
    2. Instructions compare their “source tags” to the broadcast tag
      1. if match, source value becomes ready
  4. When all source values of an instruction are ready, dispatch the instruction to functional unit (FU)
    1. Wakeup and select/schedule the instruction
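Step 1 above (linking consumers to producers via renaming) can be sketched as follows. This is an illustrative toy, assuming tags are simply increasing integers rather than reservation station entry IDs; the `Renamer` class is invented for the example.

```python
# Sketch of register renaming: each destination gets a fresh tag, and
# sources link to the tag of their latest in-flight producer.

class Renamer:
    def __init__(self):
        self.rat = {}        # arch register -> tag of latest producer
        self.next_tag = 0

    def rename(self, dest, srcs):
        # Sources read the tag of their producer, or the architectural
        # register name itself if no in-flight producer exists.
        src_tags = [self.rat.get(s, s) for s in srcs]
        # The destination gets a fresh tag, which eliminates WAR/WAW
        # (false) dependencies on the same architectural register.
        tag = f"T{self.next_tag}"
        self.next_tag += 1
        self.rat[dest] = tag
        return tag, src_tags
```

For example, renaming `MUL R3 <- R1, R2` and then `ADD R3 <- R3, R4` links the ADD’s source to the MUL’s tag, while the ADD’s own R3 result gets a new, distinct tag.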

Summary of OOO Execution Concepts

  • Register renaming eliminates false dependencies, enables linking of producer to consumers
  • Buffering in reservation stations enables the pipeline to move for independent instructions
  • Tag broadcast enables communication (of readiness of produced value) between instructions
  • Wakeup and select enables out-of-order dispatch

OOO Execution: Restricted Dataflow

  • An out-of-order engine dynamically builds the dataflow graph of a piece of the program
    • which piece? -> in the machine (reservation station / instruction window)
  • The dataflow graph is limited to the instruction window
    • Instruction window: all decoded but not yet retired instructions
  • Can we do it for the whole program? -> only if you have enough reservation stations to hold the whole program in the instruction window

Questions to Ponder

  • Why is OoO execution beneficial?
    • What if all operations take a single cycle? -> there’s no benefit
    • Out-of-order execution is a way of dealing with long latencies
    • Latency tolerance: OoO execution tolerates the latency of multi-cycle operations by executing independent operations concurrently
  • What if an instruction takes 1000 cycles?
    • How large of an instruction window do we need to continue decoding? -> To keep decoding past such an instruction, the window needs on the order of a thousand entries.
    • How many cycles of latency can OoO tolerate? -> as many as you have buffering for.
    • What limits the latency tolerance scalability of Tomasulo’s algorithm?
      • Instruction window size: how many decoded but not yet retired instructions you can keep in the machine.

Modern OoO Designs

Handling Out-of-Order Execution of Loads and Stores

Registers versus Memory

So far, we considered mainly registers as part of state. What about memory? What are the fundamental differences between registers and memory?

  • Register dependences known statically – memory dependences determined dynamically
    • If you look at an instruction, you know its sources (e.g., R3) and its destination. You can do renaming easily because you know everything at the front end of the machine, after decoding the instruction. But for memory, you need to partially execute the instruction to compute the address. So you don’t know the memory address of an instruction at the beginning of the pipeline, in the decode stage.
  • Register state is small – memory state is large
  • Register state is not visible to other threads/processors – memory state is shared between threads/processors (in a shared memory multiprocessor)

Memory Dependence Handling (I)

  • An out-of-order machine needs to obey memory dependences, and needs to do so while providing high performance. It’s not just about registers: a load may depend on a store, and we must ensure that the load gets the correct value from the correct store.
  • Observation and Problem: Memory address is not known until a load/store executes
  • Corollary 1: Renaming memory addresses is difficult
  • Corollary 2: Determining dependence or independence of loads/stores has to be handled after their (partial) execution
  • Corollary 3: When a load/store has its address ready, there may be older/younger stores/loads with unknown addresses in the machine

Memory Dependence Handling (II)

  • When do you schedule a load instruction in an OOO engine?
    • Problem: A younger load can have its address ready before an older store’s address is known
    • Known as the memory disambiguation problem or the unknown address problem
  • Approaches
    • Conservative: Stall the load until all previous stores have computed their addresses (or even retired from the machine). Once all of them have computed their addresses, you know which store the load depends on. If you’re even more conservative, you wait until all older stores have left the machine.
    • Aggressive: Assume the load is independent of unknown-address stores and schedule it right away. The prediction is that, hopefully, the load does not depend on any older store. You need to check later whether the prediction was correct.
    • Intelligent: Predict (with a more sophisticated predictor) if the load is dependent on any unknown address store

Handling of Store-Load Dependences

  • A load’s dependence status is not known until all previous store addresses are available.
  • How does the OOO engine detect dependence of a load instruction on a previous store?
    • Option 1: Wait until all previous stores committed (no need to check for address match)
    • Option 2: Keep a list of pending stores in a store buffer and check whether load address matches a previous store address
  • How does the OOO engine treat the scheduling of a load instruction wrt previous stores?
    • Option 1: Assume load dependent on all previous stores
    • Option 2: Assume load independent of all previous stores
    • Option 3: Predict the dependence of a load on an outstanding store
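The scheduling options above can be contrasted in a small sketch of my own (the function names are invented; older stores whose addresses are still unknown are modeled with `None`).

```python
# Sketch of load scheduling policies against older stores whose
# addresses may still be unknown (addr is None until computed).

def can_schedule_load(older_stores, policy):
    unknown = [s for s in older_stores if s["addr"] is None]
    if policy == "conservative":
        # Option 1: wait until every older store has a known address.
        return len(unknown) == 0
    if policy == "aggressive":
        # Option 2: go ahead now; a later check must catch a
        # mispredicted dependence and trigger re-execution.
        return True
    raise ValueError(policy)

def check_misprediction(load_addr, older_stores):
    # After all store addresses resolve: did the load actually depend
    # on one of them? If so, an aggressively scheduled load must be
    # squashed and re-executed (along with its dependents).
    return any(s["addr"] == load_addr for s in older_stores)
```

Option 3 (prediction) sits between the two: it behaves aggressively for loads predicted independent and conservatively for loads predicted dependent.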

Memory Disambiguation (I)

  • Option 1: Assume load is dependent on all previous stores
    • No need for recovery
    • Too conservative: delays independent loads unnecessarily
  • Option 2: Assume load is independent of all previous stores
    • Simple and can be common case: no delay for independent loads
    • Requires recovery and re-execution of load and dependents on misprediction
  • Option 3: Predict the dependence of a load on an outstanding store
    • More accurate. Load-store dependencies tend to persist over time
    • Still requires recovery/re-execution on misprediction
    • Alpha 21264 : Initially assume load independent, delay loads found to be dependent
    • Moshovos et al., “Dynamic speculation and synchronization of data dependences,” ISCA 1997.
    • Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA 1998.
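A heavily simplified predictor in the spirit of the store sets idea can be sketched like this. To be clear, this is not the actual store sets hardware (which uses PC-indexed SSIT/LFST tables); it only illustrates the core idea of learning load-store dependencies from mispredictions, and the class name is invented.

```python
# Toy dependence predictor: on a misprediction, remember that this
# load PC depended on that store PC; afterwards, the load waits
# whenever one of its remembered stores is still in flight.

class DependencePredictor:
    def __init__(self):
        self.store_set = {}   # load PC -> set of store PCs it depended on

    def train(self, load_pc, store_pc):
        # Called after a misprediction is detected.
        self.store_set.setdefault(load_pc, set()).add(store_pc)

    def must_wait(self, load_pc, inflight_store_pcs):
        # Predict "dependent" only if a remembered store is in flight.
        deps = self.store_set.get(load_pc, set())
        return bool(deps & set(inflight_store_pcs))
```

Because load-store dependencies tend to persist over time, even such simple past-history predictors capture most of the potential performance.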

Memory Disambiguation (II)

  • Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA 1998.
  • Predicting store-load dependencies important for performance
  • Simple predictors (based on past history) can achieve most of the potential performance

Data Forwarding Between Stores and Loads

  • We cannot update memory out of program order
    • Need to buffer all store and load instructions in instruction window
  • Even if we know all addresses of past stores when we generate the address of a load, two questions still remain:
    1. How do we check whether or not it is dependent on a store
    2. How do we forward data to the load if it is dependent on a store
  • Modern processors use a LQ (load queue) and a SQ for this
    • Can be combined or separate between loads and stores
    • A load searches the SQ after it computes its address.
    • A store searches the LQ after it computes its address.

Out-of-Order Completion of Memory Ops

  • When a store instruction finishes execution, it writes its address and data in its reorder buffer entry (or SQ entry)
  • When a later load instruction generates its address, it:
    • searches the SQ with its address
    • accesses memory with its address
    • receives the value from the youngest older instruction that wrote to that address (either from ROB or memory)
  • This is a complicated “search logic” implemented as a Content Addressable Memory
    • Content is “memory address” (but also need size and age)
    • Called store-to-load forwarding logic

Store-Load Forwarding Complexity

  • Content Addressable Search (based on Load Address)
  • Range Search (based on Address and Size of both the Load and earlier Stores)
    • Because accesses may partially overlap. For example, a load may be accessing bytes 8 to 12 while a store is writing bytes 11 to 12.
  • Age-Based Search (for last written values)
  • Load data can come from a combination of multiple places
    • One or more stores in the Store Buffer (SQ)
    • Memory/cache
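The three searches above (content, range, age) can be combined in one sketch. This is my own simplification: real hardware can merge load data from multiple stores plus the cache, while this toy forwards only on full coverage and signals `None` on a partial overlap; the `sq_search` name and the dict-based SQ/memory are invented for the example.

```python
# Sketch of the store-queue search for a load: find the youngest OLDER
# store whose written byte range fully covers the load's byte range.

def sq_search(load_addr, load_size, load_age, store_queue, memory):
    load_bytes = set(range(load_addr, load_addr + load_size))
    # Age-based search: walk older stores from youngest to oldest.
    for st in sorted(store_queue, key=lambda s: -s["age"]):
        if st["age"] >= load_age:
            continue                      # younger than the load: skip
        st_bytes = set(range(st["addr"], st["addr"] + st["size"]))
        if not (load_bytes & st_bytes):
            continue                      # range search: no overlap
        if load_bytes <= st_bytes:
            return st["data"]             # full coverage: forward
        return None                       # partial overlap: cannot forward
    return memory.get(load_addr)          # no match: read memory/cache
```

With the example from above, a store writing bytes 11-12 partially overlaps a load of bytes 8-12, so simple forwarding fails and the load must wait or merge data from multiple sources.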
