Digital Design and Computer Architecture - Lecture 15b: Out-of-Order Execution, Dataflow & Load/Store Handling
This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.
You can watch this lecture on YouTube and see the PDF.
I write this summary for personal learning purposes.
Required Readings
- Out-of-order execution
- H&H, Chapter 7.8-7.9
- Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proceedings of the IEEE, 1995
- More advanced pipelining
- Interrupt and exception handling
- Out-of-order and superscalar execution concepts
- Optional
- Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999.
Agenda for Today & Next Few Lectures
- Out-of-Order Execution
- Other Execution Paradigms
Some Questions
- What is needed in hardware to perform tag broadcast and value capture?
- lots of comparators
- make a value valid
- Wake up an instruction. If both sources are ready, the instruction can start executing in the next cycle. If multiple instructions waiting for the same functional unit become ready, you need to determine the order.
- Does the tag have to be the ID of the Reservation Station Entry?
- No. The tag can be any unique identifier. It ensures that a particular instance of a register is named uniquely.
- What can potentially become the critical path?
- Tag broadcast -> value capture -> instruction wake up
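The broadcast, capture, and wake-up steps above can be sketched in a few lines of Python. This is a minimal software model, not a hardware description: the `Source` and `RSEntry` layouts, the tag names, and the oldest-first selection policy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    """One source operand of a waiting instruction (hypothetical layout)."""
    tag: Optional[str] = None    # producer tag we wait on (None once ready)
    value: Optional[int] = None
    ready: bool = False


@dataclass
class RSEntry:
    """One reservation station entry (hypothetical layout)."""
    name: str
    age: int                     # smaller = older in program order
    src1: Source
    src2: Source

    def is_ready(self) -> bool:
        return self.src1.ready and self.src2.ready


def broadcast(entries, tag, value):
    """Compare the broadcast tag against every waiting source operand."""
    for e in entries:
        for s in (e.src1, e.src2):
            if not s.ready and s.tag == tag:
                # Value capture: store the value and mark the source valid.
                s.value, s.ready, s.tag = value, True, None


def select(entries):
    """Pick one ready entry; oldest-first is an assumed policy."""
    ready = [e for e in entries if e.is_ready()]
    return min(ready, key=lambda e: e.age) if ready else None


rs = [
    RSEntry("ADD", age=0, src1=Source(tag="t1"), src2=Source(value=5, ready=True)),
    RSEntry("SUB", age=1, src1=Source(tag="t1"), src2=Source(tag="t2")),
]
broadcast(rs, "t1", 7)      # the producer of t1 finishes
chosen = select(rs)         # ADD wakes up; SUB still waits for t2
```

In hardware the two nested loops in `broadcast` are parallel comparators, one per source field of every entry, which is why this path can become critical.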
Some More Questions (Design Choices)
We’re not going to fully answer these questions, because there are a lot of design choices to make in an out-of-order machine.
- When is a reservation station entry deallocated?
- Should the reservation stations be dedicated to each functional unit or global across functional units? (E.g., does an adder have a different set of reservation stations from a multiplier, or are the reservation stations shared?)
- Centralized vs. Distributed: What are the tradeoffs?
- If it’s centralized, it’s going to be a bigger data structure. Assuming you keep the same total number of entries, it is a bigger memory, so it may not be as easy to design. But you may get better load balance, for example when many instructions of the same type arrive together.
- If it’s distributed, you can specialize the reservation stations for each functional unit. For example, if the unit’s instructions need only one source register, its reservation stations don’t need to handle two tags.
- Should reservation stations and ROB store data values or should there be a centralized physical register file where all data values are stored?
- The values should be stored somewhere centralized. For example, if many instructions read R1, storing values in the reservation stations creates many duplicate copies of the same value. You can reduce the storage by keeping the values in a physical register file instead of in the RS.
- Timing: Exactly when does an instruction broadcast its tag?
- Think about exactly when an instruction should broadcast its tag.
- Many other design choices for OoO engines
Out-of-Order Execution with Precise Exceptions
An out-of-order execution exercise with precise exceptions.
- Assume ADD (4 cycle execute), MUL (6 cycle execute)
- Assume one adder and one multiplier
- How many cycles
- in a non-pipelined machine
- in an in-order-dispatch pipelined machine with reorder buffer (no forwarding and full forwarding)
- in an out-of-order dispatch pipelined machine with reorder buffer (full forwarding)
Idea: Use a reorder buffer to reorder instructions before committing them to architectural state. When an instruction finishes, it writes its value to the RAT and also to the ROB. When the instruction becomes the oldest in the machine, its value is taken from the ROB and written to the architectural register file (ARF). So the ARF is always updated in program order.
- An instruction updates the RAT when it completes execution
- Also called frontend register file. It is very similar to what we’ve shown.
- An instruction updates a separate architectural register file when it retires
- It will hold the architectural state and be used for precise exceptions
- i.e., when it is the oldest in the machine and has completed execution
- In other words, the architectural register file is always updated in program order
- On an exception: flush the pipeline and copy the architectural register file into the frontend register file, because the ARF holds the correct (precise) state.
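The interaction between the frontend register file, the ROB, and the ARF can be sketched as a toy model. The `Machine` class and its method names are hypothetical, and the model makes a loud simplification: the frontend file is written directly at completion, without the tag check a real design uses to avoid an older result overwriting a newer one.

```python
from collections import deque


class Machine:
    """Toy model: frontend register file + ROB + architectural register file."""

    def __init__(self, regs):
        self.arf = dict(regs)        # architectural state, updated at retirement
        self.frontend = dict(regs)   # speculative state, updated at completion
        self.rob = deque()           # entries kept in program order

    def dispatch(self, dest):
        """Allocate a ROB entry at the tail, in program order."""
        entry = {"dest": dest, "value": None, "done": False, "exception": False}
        self.rob.append(entry)
        return entry

    def complete(self, entry, value, exception=False):
        """Record a result; completion may happen out of program order."""
        entry.update(value=value, done=True, exception=exception)
        if not exception:
            self.frontend[entry["dest"]] = value

    def retire(self):
        """Retire from the ROB head while the oldest instruction is done."""
        while self.rob and self.rob[0]["done"]:
            head = self.rob.popleft()
            if head["exception"]:
                self.rob.clear()                    # flush younger instructions
                self.frontend = dict(self.arf)      # restore precise state
                return head                         # the excepting instruction
            self.arf[head["dest"]] = head["value"]  # in-order update
        return None
```

For example, if a younger instruction completes first, its result is visible in `frontend` but not in `arf`. If the older instruction then raises an exception, `retire()` discards the younger result and both register files match again.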
Modern OoO Execution w/ Precise Exceptions
- Most modern processors use the following
- Reorder buffer to support in-order retirement of instructions
- A single register file to store all registers
- Both speculative (renamed) and architectural registers
- INT and FP are still separate
- Two register maps to distinguish between the front-end and the architectural states.
- Future/frontend register map -> used for renaming
- Architectural register map -> used for maintaining precise state
Enabling OoO Execution, Revisited
- How do we enable out-of-order execution?
- Link the consumer of a value to the producer
- Register renaming: Associate a “tag” with each data value
- Buffer instructions until they are ready
- Insert instruction into reservation stations after renaming
- Keep track of readiness of source values of an instruction
- Broadcast the “tag” when the value is produced
- Instructions compare their “source tags” to the broadcast tag
- if match, source value becomes ready
- When all source values of an instruction are ready, dispatch the instruction to functional unit (FU)
- Wakeup and select/schedule the instruction
Summary of OOO Execution Concepts
- Register renaming eliminates false dependencies, enables linking of producer to consumers
- Buffering in reservation stations enables the pipeline to move for independent instructions
- Tag broadcast enables communication (of readiness of produced value) between instructions
- Wakeup and select enables out-of-order dispatch
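As a small illustration of the first point, the sketch below renames a short instruction sequence. The `(dest, src1, src2)` encoding, the `rename` function, and the unbounded supply of tags `t0, t1, ...` are assumptions for illustration; a real machine has a finite set of tags and must recycle them.

```python
from itertools import count


def rename(program):
    """Rename a list of (dest, src1, src2) triples.

    Each destination gets a fresh tag. Each source is replaced by the tag
    of its most recent producer, or keeps its architectural name if no
    earlier instruction in the list writes it.
    """
    rat = {}            # architectural register -> tag of latest producer
    fresh = count()     # unbounded tag supply (an assumption)
    renamed = []
    for dest, src1, src2 in program:
        s1 = rat.get(src1, src1)    # link consumer to producer
        s2 = rat.get(src2, src2)
        tag = f"t{next(fresh)}"
        rat[dest] = tag             # later readers of dest see this tag
        renamed.append((tag, s1, s2))
    return renamed


program = [
    ("R1", "R2", "R3"),
    ("R4", "R1", "R5"),   # true dependence on the first write of R1
    ("R1", "R6", "R7"),   # writes R1 again (output and anti dependence)
    ("R8", "R1", "R4"),   # must read the second R1
]
renamed = rename(program)
```

After renaming, the two writes to R1 have different tags, so the third instruction no longer has to wait for the first two, while the true dependences are still expressed through matching tags.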
OOO Execution: Restricted Dataflow
- An out-of-order engine dynamically builds the dataflow graph of a piece of the program
- which piece? -> in the machine (reservation station / instruction window)
- The dataflow graph is limited to the instruction window
- Instruction window: all decoded but not yet retired instructions
- Can we do it for the whole program? -> Only if you have a very large number of reservation stations
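The effect of the window limit on the dataflow graph can be sketched as follows. This is a simplified model under stated assumptions: the window is treated as the last `window_size` instructions in program order, and `dataflow_edges` and the `(dest, sources)` encoding are hypothetical names chosen for illustration.

```python
def dataflow_edges(program, window_size):
    """Return producer -> consumer edges visible within the window.

    `program` is a list of (dest, sources) pairs in program order.
    An edge (p, c) means instruction c reads the value produced by p.
    Only producers among the `window_size` instructions ending at c are
    tracked; older values are assumed to be in the register file already.
    """
    edges = []
    for c, (_, sources) in enumerate(program):
        start = max(0, c - window_size + 1)
        for src in sources:
            # Scan backwards for the most recent in-window writer of src.
            for p in range(c - 1, start - 1, -1):
                if program[p][0] == src:
                    edges.append((p, c))
                    break
    return edges


program = [
    ("R1", ["R2"]),
    ("R3", ["R1"]),
    ("R4", ["R5"]),
    ("R6", ["R1", "R3"]),
]
full = dataflow_edges(program, window_size=4)
small = dataflow_edges(program, window_size=2)
```

With a window of four instructions, all three producer-consumer edges are found. With a window of two, only the edge between the first two instructions remains, which is the sense in which the dataflow is restricted.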
Questions to Ponder
- Why is OoO execution beneficial?
- What if all operations take a single cycle? -> there’s no benefit
- Out-of-order execution is a way of dealing with long latency
- Latency tolerance: OoO execution tolerates the latency of multi-cycle operations by executing independent operations concurrently
- What if an instruction takes 1000 cycles?
- How large an instruction window do we need to continue decoding? -> To keep decoding for the whole 1000 cycles, the window needs about a thousand entries (assuming one instruction is decoded per cycle).
- How many cycles of latency can OoO tolerate? -> as many as you have buffering for.
- What limits the latency tolerance scalability of Tomasulo’s algorithm?
- Instruction window size: how many decoded but not yet retired instructions you can keep in the machine.
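A back-of-the-envelope version of this argument, as a sketch: assume a long-latency instruction sits at the head of the window and blocks retirement, and that the front end decodes `decode_width` instructions per cycle until the window is full. Both function names and the model itself are assumptions for illustration, not measurements of any real machine.

```python
import math


def cycles_until_window_full(window_size, decode_width=1):
    """Cycles the front end can keep decoding before the window fills."""
    return math.ceil(window_size / decode_width)


def decode_stall_cycles(latency, window_size, decode_width=1):
    """Cycles decode is stalled while the long-latency instruction blocks retirement."""
    return max(0, latency - cycles_until_window_full(window_size, decode_width))


no_stall = decode_stall_cycles(latency=1000, window_size=1000)
small_window = decode_stall_cycles(latency=1000, window_size=128)
wide_decode = decode_stall_cycles(latency=1000, window_size=128, decode_width=4)
```

Under this model a 1000-entry window hides a 1000-cycle latency, a 128-entry window leaves 872 stall cycles, and a wider front end makes it worse because the window fills sooner.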
Modern OoO Designs
Handling Out-of-Order Execution of Loads and Stores
Registers versus Memory
So far, we considered mainly registers as part of state. What about memory? What are the fundamental differences between registers and memory?
- Register dependences known statically – memory dependences determined dynamically
- If you look at an instruction, you know its source and destination registers (e.g., a source of R3) as soon as it is decoded, so renaming can be done easily at the front-end of the machine. For memory, you need to partially execute the instruction to get the address, so the memory address of an instruction is not known in the decode stage at the beginning of the pipeline.
- Register state is small – memory state is large
- Register state is not visible to other threads/processors – memory state is shared between threads/processors (in a shared memory multiprocessor)
Memory Dependence Handling (I)
- We need to obey memory dependences in an out-of-order machine, and do so while providing high performance. It’s not just about registers: a load may depend on an older store, and the machine must ensure that the load gets its value from the correct store.
- Observation and Problem: Memory address is not known until a load/store executes
- Corollary 1: Renaming memory addresses is difficult
- Corollary 2: Determining dependence or independence of loads/stores has to be handled after their (partial) execution
- Corollary 3: When a load/store has its address ready, there may be older/younger stores/loads with unknown addresses in the machine
Memory Dependence Handling (II)
- When do you schedule a load instruction in an OOO engine?
- Problem: A younger load can have its address ready before an older store’s address is known
- Known as the memory disambiguation problem or the unknown address problem
- Approaches
- Conservative: Stall the load until all previous stores have computed their addresses (or even retired from the machine). Once all of them have computed their addresses, you know which store, if any, the load depends on. If you’re even more conservative, you wait until all of the older stores have left the machine.
- Aggressive: Assume the load is independent of unknown-address stores and schedule the load right away. This is a prediction that the load does not depend on any of those stores, so you need to check later whether the prediction was correct.
- Intelligent: Predict (with a more sophisticated predictor) if the load is dependent on any unknown address store
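The three approaches differ only in what they do when some older store address is still unknown. A minimal sketch of that decision follows; the `Store` record, the policy strings, and the `predicted_dependent` flag are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Store:
    age: int                        # smaller = older in program order
    address: Optional[int] = None   # None until the address is computed


def can_issue_load(load_age, stores, policy, predicted_dependent=False):
    """Decide whether a load whose address is ready may be scheduled now."""
    older_unknown = [s for s in stores if s.age < load_age and s.address is None]
    if not older_unknown:
        return True                     # every older store address is known
    if policy == "conservative":
        return False                    # wait for all older store addresses
    if policy == "aggressive":
        return True                     # speculate: assume independence
    if policy == "predict":
        return not predicted_dependent  # trust the dependence predictor
    raise ValueError(f"unknown policy: {policy}")


stores = [Store(age=0, address=0x100), Store(age=1)]   # second address unknown
```

With these stores, a load of age 2 is held back by the conservative policy and issued by the aggressive one. In the two speculative policies the load must still be checked once the unknown store address resolves.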
Handling of Store-Load Dependences
- A load’s dependence status is not known until all previous store addresses are available.
- How does the OOO engine detect dependence of a load instruction on a previous store?
- Option 1: Wait until all previous stores committed (no need to check for address match)
- Option 2: Keep a list of pending stores in a store buffer and check whether load address matches a previous store address
- How does the OOO engine treat the scheduling of a load instruction wrt previous stores?
- Option 1: Assume load dependent on all previous stores
- Option 2: Assume load independent of all previous stores
- Option 3: Predict the dependence of a load on an outstanding store
Memory Disambiguation (I)
- Option 1: Assume load is dependent on all previous stores
- No need for recovery
- Too conservative: delays independent loads unnecessarily
- Option 2: Assume load is independent of all previous stores
- Simple and can be common case: no delay for independent loads
- Requires recovery and re-execution of load and dependents on misprediction
- Option 3: Predict the dependence of a load on an outstanding store
- More accurate. Load store dependencies persist over time
- Still requires recovery/re-execution on misprediction
- Alpha 21264: Initially assume load independent, delay loads found to be dependent
- Moshovos et al., “Dynamic speculation and synchronization of data dependences,” ISCA 1997.
- Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA 1998.
Memory Disambiguation (II)
- Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA 1998.
- Predicting store-load dependencies important for performance
- Simple predictors (based on past history) can achieve most of the potential performance
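To give a feel for what a simple history-based predictor can look like, here is a sketch with one "wait" bit per table slot, indexed by the load's PC. This is not the Store Sets algorithm from the paper; it is a much simpler scheme in the spirit of "initially assume independent, delay loads found to be dependent". The `WaitTable` class, the table size, and the 4-byte instruction assumption are all illustrative.

```python
class WaitTable:
    """One 'wait' bit per slot, indexed by load PC (hypothetical sizing)."""

    def __init__(self, size=1024):
        self.size = size
        self.wait = [False] * size

    def _index(self, load_pc):
        return (load_pc >> 2) % self.size   # assumes 4-byte instructions

    def predict_dependent(self, load_pc):
        """True if this load previously violated a store-load dependence."""
        return self.wait[self._index(load_pc)]

    def record_violation(self, load_pc):
        """Call when a speculatively issued load depended on an older store."""
        self.wait[self._index(load_pc)] = True

    def clear(self):
        """Periodic reset so stale predictions do not delay loads forever."""
        self.wait = [False] * self.size


table = WaitTable()
first_guess = table.predict_dependent(0x400)    # no history yet
table.record_violation(0x400)
second_guess = table.predict_dependent(0x400)   # now predicted dependent
```

A load that has never caused a violation is scheduled aggressively; after one violation it is delayed until older store addresses are known, until the table is cleared.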
Data Forwarding Between Stores and Loads
- We cannot update memory out of program order
- Need to buffer all store and load instructions in instruction window
- Even if we know all addresses of past stores when we generate the address of a load, two questions still remain:
- How do we check whether or not it is dependent on a store
- How do we forward data to the load if it is dependent on a store
- Modern processors use an LQ (load queue) and an SQ (store queue) for this
- Can be combined or separate between loads and stores
- A load searches the SQ after it computes its address.
- A store searches the LQ after it computes its address.
Out-of-Order Completion of Memory Ops
- When a store instruction finishes execution, it writes its address and data in its reorder buffer entry (or SQ entry)
- When a later load instruction generates its address, it:
- searches the SQ with its address
- accesses memory with its address
- receives the value from the youngest older store that wrote to that address (either from the ROB/SQ or from memory)
- This is a complicated “search logic” implemented as a Content Addressable Memory
- Content is “memory address” (but also need size and age)
- Called store-to-load forwarding logic
Store-Load Forwarding Complexity
- Content Addressable Search (based on Load Address)
- Range Search (based on Address and Size of both the Load and earlier Stores)
- Because the load may only partially overlap with a store’s address range. For example, the load may be accessing bytes 8 to 12 and the store may be writing bytes 11 to 12.
- Age-Based Search (for last written values)
- Load data can come from a combination of multiple places
- One or more stores in the Store Buffer (SQ)
- Memory/cache
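The range search and age-based search can be combined in a small software sketch that assembles a load value byte by byte. Hardware does this with a CAM rather than loops, and the `StoreEntry` layout, the `load_bytes` function, and the dictionary standing in for memory are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    age: int          # smaller = older in program order
    address: int      # first byte written
    data: bytes       # bytes written; len(data) is the store size


def load_bytes(load_age, address, size, store_queue, memory):
    """Assemble a load's value byte by byte.

    Each byte comes from the youngest store that is older than the load
    and covers that byte; if no such store exists, it comes from `memory`
    (a dict mapping byte address -> byte value, defaulting to 0).
    """
    older = sorted(
        (s for s in store_queue if s.age < load_age),
        key=lambda s: s.age,
        reverse=True,                 # youngest older store first
    )
    result = bytearray()
    for byte_addr in range(address, address + size):
        for s in older:
            if s.address <= byte_addr < s.address + len(s.data):
                result.append(s.data[byte_addr - s.address])
                break
        else:
            result.append(memory.get(byte_addr, 0))
    return bytes(result)


memory = {a: 0xAA for a in range(8, 13)}
sq = [
    StoreEntry(age=0, address=11, data=b"\x01\x02"),   # writes bytes 11 to 12
    StoreEntry(age=1, address=12, data=b"\x03"),       # younger, writes byte 12
    StoreEntry(age=5, address=8, data=b"\xff"),        # younger than the load
]
value = load_bytes(load_age=3, address=8, size=5, store_queue=sq, memory=memory)
```

Here the load of bytes 8 to 12 gets bytes 8 to 10 from memory, byte 11 from the oldest store, and byte 12 from the younger of the two older stores, while the store that follows the load in program order is ignored.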