Digital Design and Computer Architecture - Lecture 14: Pipelining
This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.
You can watch this lecture on YouTube and see the PDF.
I write this summary for personal learning purposes.
Agenda for Today
- Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …
Stalling Hardware
Stalls are supported by:
- adding enable inputs (EN) to the Fetch and Decode pipeline registers
- and a synchronous reset/clear (CLR) input to the Execute pipeline register
- or an INV bit associated with each pipeline register, which is also useful when debugging hardware
- When a lw stall occurs
- StallD(ecode) and StallF(etch) are asserted to force the Decode and Fetch stage pipeline registers to hold their old values.
- FlushE(xecute) is also asserted to clear the contents of the Execute stage pipeline register, introducing a bubble.
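As a rough illustration (not the actual hardware; class and signal names are my own), the EN/CLR behavior of these pipeline registers during a lw stall can be modeled in a few lines of Python:

```python
# Toy model of a pipeline register with enable (EN) and synchronous clear (CLR).
# None stands in for a bubble (nop). This is a sketch, not a hardware description.

class PipelineReg:
    def __init__(self):
        self.value = None  # starts empty (bubble)

    def clock(self, next_value, en=True, clr=False):
        """Update on the clock edge: CLR inserts a bubble, !EN holds the old value."""
        if clr:
            self.value = None        # FlushE: clear, introducing a bubble
        elif en:
            self.value = next_value  # normal operation: latch the new value
        # if en is False: hold the old value (StallF / StallD)

# Cycle in which a lw stall is detected:
fetch_reg, decode_reg, execute_reg = PipelineReg(), PipelineReg(), PipelineReg()
stall = True
fetch_reg.clock("beq ...", en=not stall)   # StallF: Fetch register holds
decode_reg.clock("and ...", en=not stall)  # StallD: Decode register holds
execute_reg.clock("lw ...", clr=stall)     # FlushE: Execute register gets a bubble
```

The key point the model captures: stalling is just "don't latch a new value," and flushing is "latch a bubble instead of the instruction."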
Control Dependences
- Special case of data dependence: dependence on PC
- beq:
- Branch is not resolved until the fourth stage of the pipeline, but you need the target address much earlier, to figure out what the next instruction to fetch is.
- Instructions after the branch are fetched before branch is resolved
- Always predict that the next sequential instruction is fetched
- Called “Always not taken” prediction
- These instructions must be flushed if the branch is taken
- This can waste a lot of cycles. Here it could be a few cycles, but if your pipeline is 20 stages deep and you determine the branch condition at the 19th stage, you lose nineteen cycles.
- Branch misprediction penalty
- number of instructions flushed when branch is taken
- May be reduced by determining branch earlier (not going to talk about it as much right now)
- If you do always-not-taken prediction, your prediction accuracy is only about 30%: roughly 70% of branches are actually taken, because many programs are structured as loops (go back into the loop and do it over). Always-taken prediction is actually much better than always-not-taken.
- You can also distinguish between different branches: this branch is always taken, that branch is always not taken. Based on the program counter, you can then keep separate predictions per branch.
- Reduce the effect of branches by
- doing better prediction
- Backward branches are usually taken (loops)
- Consider history of whether branch was previously taken to improve the guess
- reducing the branch misprediction penalty
- Early branch resolution: but it introduces another data dependency in Decode stage
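To make the prediction ideas concrete, here is a toy Python comparison (with a made-up branch trace, not real benchmark data) of always-not-taken, always-taken, and a per-PC "last outcome" predictor:

```python
# Toy branch-prediction comparison. The trace is invented for illustration:
# a loop branch at PC 0x40 that is mostly taken, and an error-check branch
# at PC 0x80 that is never taken.

def accuracy(trace, predict):
    """Fraction of branches whose outcome matches the prediction."""
    correct = sum(predict(pc) == taken for pc, taken in trace)
    return correct / len(trace)

trace = [(0x40, True)] * 9 + [(0x40, False)] + [(0x80, False)] * 5

print(accuracy(trace, lambda pc: False))  # always not-taken
print(accuracy(trace, lambda pc: True))   # always taken

# Per-PC predictor: predict whatever this particular branch did last time.
last = {}
correct = 0
for pc, taken in trace:
    correct += last.get(pc, False) == taken  # default prediction: not taken
    last[pc] = taken                         # remember this branch's outcome
print(correct / len(trace))
```

On this trace, distinguishing branches by PC and using their history beats both static schemes, which is the intuition behind the bullet points above.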
Pipelined Performance Example
SPECINT2006 benchmark:
- 25% loads
- 10% stores
- 11% branches
- 2% jumps
- 52% R-type
Suppose
- 40% of loads used by next instruction
- 25% of branches mispredicted
- All jumps flush next instruction
Load/Branch CPI = 1 when there is no stall/flush, 2 when there is a stall/flush.
- When a load is followed by an instruction that depends on it, there is a nop (a bubble) that you need to insert. That is why it takes two cycles.
- When you mispredict a branch, it takes two cycles.
- CPI(lw) = 1(0.6) + 2(0.4) = 1.4 Average CPI for load
- CPI(beq) = 1(0.75) + 2(0.25) = 1.25 Average CPI for branch
- Stores, jumps, and R-type instructions always have constant latency.
- The jump is always taken, and here we are predicting always not-taken, so we always mispredict the jump.
Average CPI = (0.25)(1.4) + // load
(0.1)(1) + // store
(0.11)(1.25) + // beq
(0.02)(2) + // jump
(0.52)(1) // r-type
= 1.15
There are 5 stages, and 5 different timing paths:
Tc = max {
  tpcq + tmem + tsetup,                            // fetch
  2(tRFread + tmux + teq + tAND + tmux + tsetup),  // decode
  tpcq + tmux + tmux + tALU + tsetup,              // execute
  tpcq + tmemwrite + tsetup,                       // memory
  2(tpcq + tmux + tRFwrite)                        // writeback
} = 550 ps (assumed)
- The operation speed depends on the slowest operation
- Decode and Writeback use the register file and have only half a clock cycle to complete; that is why there is a factor of 2 in front of them
For a program with 100 billion instructions executing on a pipelined MIPS processor:
- Execution Time = (# instructions) × CPI × Tc = (100 × 10^9)(1.15)(550 × 10^-12) = 63 seconds
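The arithmetic above can be reproduced in a few lines of Python (all numbers taken directly from the example):

```python
# SPECINT2006 instruction mix and per-class CPI from the example above.

mix = {"lw": 0.25, "sw": 0.10, "beq": 0.11, "j": 0.02, "rtype": 0.52}

cpi = {
    "lw":    1 * 0.6 + 2 * 0.4,    # 40% of loads stall -> 1.4
    "beq":   1 * 0.75 + 2 * 0.25,  # 25% of branches mispredicted -> 1.25
    "sw":    1,
    "j":     2,                    # jumps always flush the next instruction
    "rtype": 1,
}

avg_cpi = sum(mix[i] * cpi[i] for i in mix)  # 1.1475, i.e. ~1.15
tc = 550e-12                                 # 550 ps cycle time
exec_time = 100e9 * avg_cpi * tc             # ~63 seconds
print(round(avg_cpi, 4), round(exec_time, 1))
```

Note that the quoted 1.15 and 63 seconds are rounded; the exact values are 1.1475 and about 63.1 seconds.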
Questions to Ponder
What is the role of the hardware vs. the software in data dependence handling?
- Software based interlocking / Hardware based interlocking
- Who inserts/manages the pipeline bubbles?
- Who finds the independent instructions to fill “empty” pipeline slots?
- What are the advantages/disadvantages of each?
- Think of the performance equation as well
What is the role of the hardware vs. the software in the order in which instructions are executed in the pipeline
- Software based instruction scheduling -> static scheduling (before runtime)
- Compiler orders the instructions, hardware executes them in that order
- Contrast this with dynamic scheduling (in which hardware can execute instructions out of the compiler-specified order)
- How does the compiler know the latency of each instruction?
- Hardware based instruction scheduling -> dynamic scheduling (runtime)
- Static scheduling (before runtime) doesn't have information that is only available at runtime, e.g., branch outcomes. The hardware, however, can collect and use such information as the program is running.
What information does the compiler not know that makes static scheduling difficult?
Anything that is determined at runtime: variable-length operation latency (which can depend on the value, e.g., a multiply by 0, or on cache hits/misses), memory addresses, branch directions.
How can the compiler alleviate this (i.e., estimate the unknown)?
Profiling: the compiler runs the program, collects statistics, and optimizes the code accordingly.
How does each impact different metrics?
- Performance (and parts of the performance equation)
- Complexity
- Power consumption
- Reliability
- …
Pipelining and Precise Exceptions: Preserving Sequential Semantics
Multi-Cycle Execution
- Not all instructions take the same amount of time for “execution”
- Idea: Have multiple different functional units that take different numbers of cycles (integer add, multiply, floating-point multiply, divide, …)
- Can be pipelined or not pipelined
- Can let independent instructions start execution on a different functional unit before a previous long-latency instruction finishes execution.
- What is wrong if an ADD that follows an FP MUL completes before it in a von Neumann architecture?
- Sequential semantics of the ISA are NOT preserved!
- What if the FMUL incurs an exception (e.g., divide by 0)?
- This makes it hard for software programmers to debug.
Exceptions vs. Interrupts
Cause
- Exceptions: internal to the running thread
- E.g., an instruction got a value that it cannot operate on.
- Interrupts: external to the running thread
- E.g., a user decides to hit the keyboard.
When to Handle
- Exceptions: when detected (and known to be non-speculative)
- Interrupts: when convenient
- Except for very high priority ones
- Power failure
- Machine check (error)
Priority
- Exceptions: process; Interrupts: depends (on what kind of interrupt)
Handling Context
- Exceptions: process context; Interrupts: system context
Precise Exceptions/Interrupts
The architectural state should be consistent (precise) when the exception/interrupt is ready to be handled.
- All previous instructions should be completely retired (finished execution and updated the architectural registers and program counter).
- No later instruction should be retired
Retire = commit = finish execution and update arch. state
Checking for and Handling Exceptions in Pipelining
What happens when you decide to handle an exception? When the oldest instruction that is ready to be retired is detected to have caused an exception, the control logic
- Ensures architectural state is precise (register file, PC, memory)
- The register file, PC, and memory are not updated by any later instructions and are completely updated by all previous instructions.
- Flushes all younger instructions in the pipeline
- Saves PC and registers (as specified by the ISA)
- Redirects the fetch engine to the appropriate exception handling routine
Why Do We Want Precise Exceptions?
- Semantics of the von Neumann model ISA specifies it
- The contract needs to be obeyed
- Remember von Neumann vs. Dataflow
- There are no precise exceptions in dataflow, and that is why dataflow machines are very hard to debug
- Aids software debugging
- Enables (easy) recovery from exceptions
- Enables (easily) restartable processes
- Enables traps into software (e.g., software implemented opcodes)
Ensuring Precise Exceptions in Pipelining
Reorder buffer (ROB)
Complete instructions out-of-order, but reorder them before making results visible to architectural state.
Add another stage to the pipeline. After an instruction completes, it writes its result into this reorder buffer instead of directly into the register file, because the register file is visible to the programmer. When the instruction becomes the oldest one, you take the value and write it into the register file or memory at that point in time.
- When instruction is decoded it reserves the next-sequential entry in the ROB
- When instruction completes, it writes result into ROB entry
- When instruction oldest in ROB and it has completed without exceptions, its result moved to reg. file or memory
- Buffers information about all instructions that are decoded but not yet retired/committed
What’s in a ROB Entry?
You need all the information about an instruction: DestRegID, DestRegVal, StoreAddr, StoreData, PC, valid bits for reg/data + control bits, and exception status.
- Everything required to:
- correctly reorder instructions back into the program order
- update the architectural state with the instruction’s result(s), if instruction can retire without any issues
- handle an exception/interrupt precisely, if an exception/interrupt needs to be handled before retiring the instruction
- Need valid bits to keep track of readiness of the result(s) and find out if the instruction has completed execution
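As a sketch of the lifecycle described above (a heavily simplified model with my own naming, not the lecture's design): in-order allocation at decode, out-of-order completion, in-order retirement, with a flush on an excepting instruction:

```python
# Minimal reorder-buffer model. Entries hold a subset of the fields listed
# above (dest register, value, done/valid bit, exception bit).

from collections import deque

class ROB:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left

    def allocate(self, dest_reg):
        """Decode: reserve the next-sequential entry."""
        entry = {"dest": dest_reg, "value": None, "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value, exception=False):
        """Execution finished (possibly out of order): write result into the ROB."""
        entry.update(value=value, done=True, exception=exception)

    def retire(self, regfile):
        """Commit the oldest completed entries, in order, into architectural state."""
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            if e["exception"]:
                self.entries.clear()  # flush all younger instructions
                return "exception"
            regfile[e["dest"]] = e["value"]  # results become visible only here
        return "ok"

rob, regs = ROB(), {}
fmul = rob.allocate("f2")  # older, long-latency instruction
add = rob.allocate("r3")   # younger, fast instruction
rob.complete(add, 7)       # ADD completes first (out of order)...
rob.retire(regs)           # ...but cannot retire past the unfinished FMUL
assert "r3" not in regs
rob.complete(fmul, 2.5)
rob.retire(regs)           # now both retire, in program order
```

The younger ADD's result sits in the ROB until the older FMUL retires, which is exactly what makes the exception point precise.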
Independent Operations
- Result first written to ROB on instruction completion
- Result written to register file at commit time
- What if a later instruction needs a value in the reorder buffer?
- One option: stall the operation -> stall the pipeline
- Better: Read the value from the reorder buffer.
How to access the reorder buffer?
A register value can be in the register file, the reorder buffer, or the bypass/forwarding paths. You should read the youngest copy of the data.
Idea: Use indirection
- Access register file first (check if the register is valid)
- If register not valid, register file stores the ID of the reorder buffer entry that contains (or will contain) the value of the register
- Mapping of the register to a ROB entry: Register file maps the register to a reorder buffer entry if there is an in-flight instruction writing to the register
- Access reorder buffer next
- Now, reorder buffer does not need to be content addressable
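A minimal sketch of this indirection scheme (the data structures and names are my own assumptions, chosen to mirror the bullets above):

```python
# Each register-file entry is either a valid value or a pointer (rob_id) to
# the ROB entry that contains, or will contain, the register's value.

rob = [
    {"done": True, "value": 42},    # entry 0: result computed, not yet committed
    {"done": False, "value": None}, # entry 1: still executing
]

regfile = {
    "r1": {"valid": True, "value": 10},   # value lives in the register file
    "r2": {"valid": False, "rob_id": 0},  # value lives in the ROB
    "r3": {"valid": False, "rob_id": 1},  # value is still being produced
}

def read_operand(reg):
    """Return (ready, value-or-rob_id): register file first, then the ROB."""
    e = regfile[reg]
    if e["valid"]:
        return True, e["value"]
    entry = rob[e["rob_id"]]  # direct index: no content-addressable search
    if entry["done"]:
        return True, entry["value"]
    return False, e["rob_id"]  # must wait for (or forward from) this entry

print(read_operand("r1"))
print(read_operand("r2"))
print(read_operand("r3"))
```

Because the register file stores the ROB entry ID, the second lookup is a plain array index, which is the point of the "no CAM needed" bullet.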
Important: Register Renaming with a Reorder Buffer
- Output and anti dependencies are not true dependencies
- WHY? The same register refers to values that have nothing to do with each other
- They exist due to a lack of register IDs (i.e., names) in the ISA
- The register ID is renamed to the reorder buffer entry that will hold the register’s value
- Register ID -> ROB entry ID
- Architectural register ID -> Physical register ID
- After renaming, ROB entry ID used to refer to the register
- This eliminates anti and output dependencies
- Gives the illusion that there are a large number of registers
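A toy illustration (with an assumed instruction sequence, not one from the lecture) of how renaming destinations to ROB entry IDs removes the false dependences:

```python
# I1 and I3 both write r1 (output/WAW dependence), and I2 reads r1 before I3
# overwrites it (anti/WAR dependence). Only I2's read of I1's result is a
# true (RAW) dependence.

program = [
    ("mul", "r1", ["r2", "r3"]),  # I1: r1 <- r2 * r3
    ("add", "r4", ["r1", "r5"]),  # I2: true dependence on I1's r1
    ("sub", "r1", ["r6", "r7"]),  # I3: WAW with I1, WAR with I2
]

rename_map, renamed = {}, []
for i, (op, dest, srcs) in enumerate(program):
    srcs = [rename_map.get(s, s) for s in srcs]  # read the newest name of each source
    rename_map[dest] = f"ROB{i}"                 # destination gets a fresh ROB entry ID
    renamed.append((op, f"ROB{i}", srcs))

for ins in renamed:
    print(ins)
```

After renaming, I1 and I3 write different ROB entries (ROB0 vs. ROB2), so the anti and output dependences are gone; only I2's read of ROB0 remains, exactly as the bullets above describe.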
In-Order Pipeline with Reorder Buffer
- Decode (D): Access regfile/ROB, allocate entry in ROB, check if instruction can execute, if so dispatch instruction
- Execute (E): Instructions can complete out-of-order
- Completion (R): Write result to reorder buffer
- Retirement/Commit (W): Check for exceptions; if none, write result to architectural register file or memory; else, flush pipeline and start from exception handler
- In-order dispatch/execution, out-of-order completion, in-order retirement
Reorder Buffer Tradeoffs
- Advantages
- Conceptually simple for supporting precise exceptions
- Can eliminate false dependences
- Disadvantages
- Reorder buffer needs to be accessed to get the results that are yet to be written to the register file
- CAM or indirection increases latency and complexity
- Other solutions aim to eliminate the disadvantages (We will not cover these)
- History buffer
- Future file
- Checkpointing
And others
- Make each operation take the same amount of time
- Downside
- Worst-case instruction latency determines all instructions' latency
- What about memory operations?
- Each functional unit takes the worst-case number of cycles?
- Suggested reading
- Smith and Plezskun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans on Computers 1988 and ISCA 1985.