Digital Design and Computer Architecture - Lecture 17: Branch Prediction 2
This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.
You can watch this lecture on YouTube and see the PDF.
I write this summary for personal learning purposes.
Required Readings
This week
- Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proceedings of the IEEE, 1995
- H&H Chapters 7.8 and 7.9
- McFarling, “Combining Branch Predictors,” DEC WRL Technical Report, 1993.
Recall: How to Handle Control Dependences
- Critical to keep the pipeline full with correct sequence of dynamic instructions.
- Potential solutions if the instruction is a control-flow instruction:
- Stall the pipeline until we know the next fetch address
- Guess the next fetch address (branch prediction)
- Employ delayed branching (branch delay slot)
- Do something else (fine-grained multithreading)
- Eliminate control-flow instructions (predicated execution)
- Fetch from both possible paths (if you know the addresses of both possible paths) (multipath execution)
Recall: Fetch Stage with BTB and Direction Prediction
Recall: More Sophisticated Direction Prediction
- Compile time (static)
- Always not taken
- Always taken
- BTFN (Backward taken, forward not taken)
- Profile based (likely direction)
- Program analysis based (likely direction)
- Run time (dynamic)
- Last time prediction (single-bit)
- Two-bit counter based prediction
- Two-level prediction (global vs. local)
- Hybrid
- Advanced algorithms (e.g., using perceptrons)
Last Time Predictor
Remember the direction the branch took the last time it was executed.
- Last time predictor
- Single bit per branch (stored in BTB)
- Indicates which direction the branch went the last time it executed
- Example: TTTTTTTTTTNNNNNNNNNN => 90% accuracy
- Always mispredicts the last iteration and the first iteration of a loop branch
- Accuracy for a loop with N iterations = (N-2)/N
- + Loop branches for loops with large N (number of iterations)
- – Loop branches for loops with small N (number of iterations)
- Example: TNTNTNTNTNTNTNTNTNTN => 0% accuracy
Implementing the Last-Time Predictor
State Machine for Last-Time Prediction
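To make the mechanism concrete, here is a minimal C sketch of a last-time predictor. The table size and PC-based indexing are assumptions for illustration; in real hardware the single prediction bit is stored as part of the BTB entry, as noted above.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 1024   /* assumed table size for illustration */

/* One prediction bit per entry: 1 = predict taken, 0 = predict not taken. */
static uint8_t last_outcome[BTB_ENTRIES];

/* Index with low-order PC bits (simplified stand-in for the BTB lookup). */
static unsigned btb_index(uint32_t pc) { return (pc >> 2) % BTB_ENTRIES; }

/* Predict: whatever the branch did last time. */
bool predict_last_time(uint32_t pc) {
    return last_outcome[btb_index(pc)] != 0;
}

/* Update: overwrite the single bit with the actual outcome. */
void update_last_time(uint32_t pc, bool taken) {
    last_outcome[btb_index(pc)] = taken ? 1 : 0;
}
```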
Improving the Last Time Predictor
- Problem: A last-time predictor changes its prediction from T→NT or NT→T too quickly
- even though the branch may be mostly taken or mostly not taken
- Solution Idea: Add hysteresis to the predictor so that the prediction does not change on a single different outcome. Hysteresis is basically some inertia added to the system so that it does not change its state too quickly. This was the state of the art in computer architecture for a long time, at least a decade.
- Use two bits to track the history of predictions for a branch instead of a single bit
- Can have 2 states for T or NT instead of 1 state for each
- Smith, “A Study of Branch Prediction Strategies,” ISCA 1981.
Two-Bit Counter Based Prediction
- Each branch associated with a two-bit counter
- One more bit provides hysteresis
- A strong prediction does not change with one single different outcome
State Machine for 2-bit Saturating Counter
Hysteresis Using a 2-bit Counter
Two-Bit Counter Based Prediction
- Accuracy for a loop with N iterations = (N-1)/N
- Example: TNTNTNTNTNTNTNTNTNTN => 50% accuracy (assuming the counter is initialized to weakly taken)
- + Better prediction accuracy
- – More hardware cost (but counter can be part of a BTB entry)
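A minimal C sketch of the 2-bit saturating counter follows. The state encoding (0 = strongly not taken ... 3 = strongly taken) is one common convention, not the only possible one.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counter states:
   0 = strongly not taken, 1 = weakly not taken,
   2 = weakly taken,       3 = strongly taken. */
typedef uint8_t counter2_t;

/* Predict taken when the counter is in one of the two "taken" states. */
bool predict_2bc(counter2_t c) {
    return c >= 2;
}

/* Update: move one step toward the actual outcome, saturating at 0 and 3.
   A strong state tolerates one wrong outcome without flipping the
   prediction -- this is the hysteresis. */
counter2_t update_2bc(counter2_t c, bool taken) {
    if (taken)  return (c < 3) ? c + 1 : 3;
    else        return (c > 0) ? c - 1 : 0;
}
```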
Is This Good Enough?
- ~85-90% accuracy for many programs with 2-bit counter based prediction (also called bimodal prediction)
- Is this good enough?
- How big is the branch problem?
Let’s Do the Exercise Again
- Assume N = 20 (20 pipe stages), W = 5 (5 wide fetch)
- Assume: 1 out of 5 instructions is a branch
- Assume: Each 5 instruction-block ends with a branch
- How long does it take to fetch 500 instructions?
- 100% accuracy
- 100 cycles (all instructions fetched on the correct path)
- No wasted work; IPC = 500/100
- 90% accuracy
- 100 (correct path) + 20 * 10 (wrong path) = 300 cycles
- 200% extra instructions fetched; IPC = 500/300
- 85% accuracy
- 100 (correct path) + 20 * 15 (wrong path) = 400 cycles
- 300% extra instructions fetched; IPC = 500/400
- 80% accuracy
- 100 (correct path) + 20 * 20 (wrong path) = 500 cycles
- 400% extra instructions fetched; IPC = 500/500
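The pattern behind these numbers is: cycles = (correct-path fetch cycles) + (misprediction penalty × number of mispredicted branches). A small sketch that reproduces the arithmetic above, under the simplified assumptions of the exercise (no startup effects, exactly one branch per fetched block):

```c
#include <stdio.h>

int main(void) {
    const int N = 20;        /* pipeline depth = misprediction penalty in cycles */
    const int W = 5;         /* fetch width: one 5-instruction block per cycle   */
    const int insts = 500;   /* instructions to fetch on the correct path        */
    const int blocks = insts / W;   /* 100 blocks, each ending in a branch       */
    const double accuracies[] = {1.00, 0.90, 0.85, 0.80};

    for (int i = 0; i < 4; i++) {
        int mispredicts = (int)((1.0 - accuracies[i]) * blocks + 0.5);
        int cycles = blocks + N * mispredicts;  /* 1 cycle per block + N per flush */
        printf("accuracy %.0f%%: %d cycles, IPC = %.2f\n",
               accuracies[i] * 100, cycles, (double)insts / cycles);
    }
    return 0;
}
```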
Can We Do Better: Two-Level Prediction
- Last-time and 2BC predictors exploit only “last-time” predictability. They still look only at the last few instances of the same branch.
- People have found that there are other types of correlation in programs that lead to better predictability of branches.
- Realization 1: A branch’s outcome can be correlated with other branches’ outcomes
- Global branch correlation
- By checking the taken/not-taken outcomes of other, previously executed branches, you can guess where this branch will go with much higher accuracy.
- Realization 2: A branch’s outcome can be correlated with past outcomes of the same branch (other than the outcome of the branch “last-time” it was executed)
- Local branch correlation
- By looking across iterations, you may find patterns that are more predictable than what you see by looking back only one or two iterations of the same branch.
Realization 1
Global Branch Correlation (I)
- (data independent) Recently executed branch outcomes in the execution path are correlated with the outcome of the next branch.
if (cond1)
// ...
if (cond1 && cond2)
These two branches are correlated: the first branch tests cond1 alone, while the second tests cond1 && cond2. If you know the first branch is not taken, you can guess that the second branch is also not taken.
- (data dependent) If the first branch is taken, the second is definitely not taken.
branch Y: if (cond1) a = 2;
// ...
branch X: if (a == 0)
Global Branch Correlation (II)
branch Y: if (cond1)
// ...
branch Z: if (cond2)
// ...
branch X: if (cond1 && cond2)
- If Y and Z both taken, then X also taken.
- If Y or Z not taken, then X also not taken.
Global Branch Correlation (III)
- Eqntott, SPEC’92: Generates truth table from Boolean expr.
if (aa==2)    // B1
    aa=0;
if (bb==2)    // B2
    bb=0;
if (aa!=bb) { // B3
    // ...
}
- If B1 is not taken (i.e., aa==0 @B3) and B2 is not taken (i.e., bb==0 @B3), then B3 is certainly taken.
Capturing Global Branch Correlation
- Idea: Associate branch outcomes with the “global T/NT (taken/not taken) history” of all branches. We keep a history of the taken/not-taken behavior of some number of recent branches.
- Make a prediction based on the outcome of the branch the last time the same global branch history was encountered
- Implementation:
- Keep track of the “global T/NT history” of all branches in a register => Global History Register (GHR). This is an N-bit register that encodes the outcomes of the last N branches.
- Use GHR to index into a table that recorded the outcome that was seen for each GHR value in the recent past => Pattern History Table (table of 2-bit counters)
- Global history/branch predictor
- Uses two levels of history (GHR + history at that GHR)
- You don’t just look at the outcome of this branch the last time it executed; you look at the global history register to see where the past N branches went, and at the outcome this branch had the last time that particular history was seen.
Two Level Global Branch Prediction
Basically, you remember what the branch did the last time you saw this history, and you predict based on that remembrance. If you see the same history again, you’re going to do whatever you did the last time.
- First level: Global branch history register (N bits)
- The direction of last N branches
- Second level: Table of saturating counters for each history entry
- The direction the branch took the last time the same history was seen
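A minimal C model of this two-level global scheme is sketched below; the 12-bit history length and table size are arbitrary choices for illustration, not values from the lecture.

```c
#include <stdbool.h>
#include <stdint.h>

#define GHR_BITS  12
#define PHT_SIZE  (1u << GHR_BITS)

static uint16_t ghr;            /* global history register: last GHR_BITS outcomes */
static uint8_t  pht[PHT_SIZE];  /* pattern history table of 2-bit counters         */

/* First level: the GHR selects a PHT entry; second level: the 2-bit counter predicts. */
bool predict_global(void) {
    return pht[ghr & (PHT_SIZE - 1)] >= 2;
}

/* After the branch resolves: train the selected counter and shift the
   outcome into the global history. */
void update_global(bool taken) {
    uint8_t *c = &pht[ghr & (PHT_SIZE - 1)];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    ghr = (uint16_t)((ghr << 1) | (taken ? 1 : 0));
}
```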
How Does the Global Predictor Work?
for (int i = 0; i < 100; i++)
    for (int j = 0; j < 3; j++)
After the initial startup time, the conditional branches have the following behavior, assuming GR is shifted to the left:
Test | Value | GR | Result |
---|---|---|---|
j < 3 | j = 1 | 1101 | taken |
j < 3 | j = 2 | 1011 | taken |
j < 3 | j = 3 | 0111 | not taken |
i < 100 | | 1110 | usually taken |
- for (int i = 0; i < 100; i++)
- This branch tests i
- Last 4 branches test j
- History: TTTN
- Predict taken for i
- Next history: TTNT (shift in last outcome)
- McFarling, “Combining Branch Predictors,” DEC WRL TR 1993.
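To connect the table above to the update rule, here is a small sketch that traces the 4-bit GR through this nested-loop pattern, shifting the newest outcome in from the right (steady state is assumed, matching the table):

```c
#include <stdio.h>
#include <stdint.h>

/* Print a 4-bit history as a bit string, oldest outcome first. */
static void print_gr(uint8_t gr) {
    printf("%d%d%d%d", (gr >> 3) & 1, (gr >> 2) & 1, (gr >> 1) & 1, gr & 1);
}

/* Shift the newest outcome into the 4-bit GR from the right (1 = taken). */
static uint8_t shift_gr(uint8_t gr, int taken) {
    return (uint8_t)(((gr << 1) | (taken ? 1 : 0)) & 0xF);
}

int main(void) {
    uint8_t gr = 0xD;  /* 1101: steady-state history seen before the j=1 test */
    for (int i = 0; i < 2; i++) {              /* a couple of outer iterations */
        for (int j = 1; j <= 3; j++) {         /* inner test: T, T, N          */
            int taken = (j < 3);
            printf("j-branch (j=%d): GR = ", j); print_gr(gr);
            printf(" -> %s\n", taken ? "taken" : "not taken");
            gr = shift_gr(gr, taken);
        }
        printf("i-branch:       GR = "); print_gr(gr);
        printf(" -> taken\n");
        gr = shift_gr(gr, 1);                  /* outer test usually taken     */
    }
    return 0;
}
```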
Intel Pentium Pro Branch Predictor
- Two level global branch predictor
- 4-bit global history register
- Multiple pattern history tables (of 2 bit counters)
- Which pattern history table to use is determined by lower order bits of the branch address
- First widely commercially successful out-of-order execution machine
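The following is a simplified sketch of this style of organization: low-order branch-address bits select one of several pattern history tables, and a short global history then indexes into the selected table. The table count and indexing details are assumptions for illustration, not Intel’s actual implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PHTS  4                   /* assumed number of pattern history tables */
#define GHR_BITS  4
#define PHT_SIZE  (1u << GHR_BITS)

static uint8_t ghr;                        /* 4-bit global history    */
static uint8_t pht[NUM_PHTS][PHT_SIZE];    /* tables of 2-bit counters */

/* Low-order branch-address bits pick the table; the GHR picks the entry. */
bool predict_ppro_style(uint32_t pc) {
    unsigned table = (pc >> 2) & (NUM_PHTS - 1);
    return pht[table][ghr & (PHT_SIZE - 1)] >= 2;
}

void update_ppro_style(uint32_t pc, bool taken) {
    unsigned table = (pc >> 2) & (NUM_PHTS - 1);
    uint8_t *c = &pht[table][ghr & (PHT_SIZE - 1)];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    ghr = (uint8_t)(((ghr << 1) | (taken ? 1 : 0)) & (PHT_SIZE - 1));
}
```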
Improving Global Predictor Accuracy
- Idea: Add more context information to the global predictor to take into account which branch is being predicted. Without this, different branches may see the same global history yet behave differently in execution, because a single global history is shared by all branches. Adding more context information to the global predictor improves accuracy.
- Gshare predictor: GHR hashed with the Branch PC. You don’t just rely on the branch history register but you rely on the branch address as well.
- + More context information used for prediction
- + Better utilization of the two-bit counter array
- – Increases access latency
- McFarling, “Combining Branch Predictors,” DEC WRL Tech Report, 1993.
Review: One-Level Branch Predictor
Two-Level Global History Branch Predictor
Two-Level Gshare Branch Predictor
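A minimal gshare sketch, assuming a 12-bit index and a simple XOR of the GHR with low-order PC bits (McFarling’s report discusses the actual design trade-offs):

```c
#include <stdbool.h>
#include <stdint.h>

#define INDEX_BITS 12
#define PHT_SIZE   (1u << INDEX_BITS)

static uint16_t ghr;            /* global history register        */
static uint8_t  pht[PHT_SIZE];  /* shared table of 2-bit counters */

/* Gshare: XOR the global history with low-order branch-address bits, so
   different branches that see the same history tend to use different counters. */
static unsigned gshare_index(uint32_t pc) {
    return ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);
}

bool predict_gshare(uint32_t pc) {
    return pht[gshare_index(pc)] >= 2;
}

void update_gshare(uint32_t pc, bool taken) {
    uint8_t *c = &pht[gshare_index(pc)];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    ghr = (uint16_t)((ghr << 1) | (taken ? 1 : 0));
}
```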
Realization 2
Local Branch Correlation
for(int i = 1; i <= 4; i++)
If the loop test is done at the end of the body, the corresponding branch will execute the pattern 0b1110, where 1 and 0 represent taken and not taken respectively. Clearly, if we knew the direction this branch had gone on the previous three executions, then we would always be able to predict the next branch direction.
More Motivation for Local History
- To predict a loop branch “perfectly”, we want to identify the last iteration of the loop
- By having a separate PHT (Pattern History Table) entry for each local history, we can distinguish different iterations of a loop: each iteration sees a different local history, and these different histories map to different PHT entries.
- Works for “short” loops
Capturing Local Branch Correlation
- Idea: Have a per-branch history register
- Associate the predicted outcome of a branch with “T/NT history” of the same branch
- Make a prediction based on the outcome of the branch the last time the same local branch history was encountered
- Called the local history/branch predictor
- Uses two levels of history (Per-branch history register + history at that history register value)
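A minimal C sketch of such a local predictor; the history length and table sizes are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BHRS    1024   /* per-branch history registers (assumed size) */
#define LOCAL_BITS  10     /* local history length (assumed)              */
#define PHT_SIZE    (1u << LOCAL_BITS)

static uint16_t bhr[NUM_BHRS];  /* first level: per-branch local histories */
static uint8_t  pht[PHT_SIZE];  /* second level: 2-bit counters            */

static unsigned bhr_index(uint32_t pc) { return (pc >> 2) % NUM_BHRS; }

/* Predict from the counter selected by this branch's own recent history. */
bool predict_local(uint32_t pc) {
    uint16_t hist = bhr[bhr_index(pc)] & (PHT_SIZE - 1);
    return pht[hist] >= 2;
}

/* Train the counter, then shift the outcome into this branch's local history. */
void update_local(uint32_t pc, bool taken) {
    unsigned b = bhr_index(pc);
    uint16_t hist = bhr[b] & (PHT_SIZE - 1);
    if (taken)  { if (pht[hist] < 3) pht[hist]++; }
    else        { if (pht[hist] > 0) pht[hist]--; }
    bhr[b] = (uint16_t)((bhr[b] << 1) | (taken ? 1 : 0));
}
```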
Can We Do Even Better?
- Predictability of branches varies
- Some branches are more predictable using local history
- Some branches are more predictable using global history
- For others, a simple two-bit counter is enough
- Yet for others, a single bit is enough
- Observation: There is heterogeneity in predictability behavior of branches
- No one-size fits all branch prediction algorithm for all branches
- Idea: Exploit that heterogeneity by designing heterogeneous (hybrid) branch predictors
Hybrid Branch Predictors
- Idea: Use more than one type of predictor (i.e., multiple algorithms) and select the “best” prediction
- E.g., hybrid of 2-bit counters and global predictor
- Advantages:
- + Better accuracy: different predictors are better for different branches
- + Reduced warmup time (faster-warmup predictor used until the slower-warmup predictor warms up)
- Disadvantages:
- – Need “meta-predictor” or “selector” to decide which predictor to use
- – Longer access latency
- McFarling, “Combining Branch Predictors,” DEC WRL Tech Report, 1993.
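In the spirit of McFarling’s combining predictors, here is a minimal tournament-style sketch: a bimodal component, a global component, and a 2-bit “chooser” counter per entry that is trained toward whichever component was correct when they disagree. All table sizes are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 4096
static uint16_t ghr;                  /* global history                          */
static uint8_t  gpht[TABLE_SIZE];     /* global component: 2-bit counters        */
static uint8_t  bim[TABLE_SIZE];      /* bimodal component: 2-bit counters       */
static uint8_t  chooser[TABLE_SIZE];  /* meta-predictor: which component to use  */

static unsigned pcidx(uint32_t pc) { return (pc >> 2) % TABLE_SIZE; }

/* Component predictions. */
static bool pred_global(void)         { return gpht[ghr % TABLE_SIZE] >= 2; }
static bool pred_bimodal(uint32_t pc) { return bim[pcidx(pc)] >= 2; }

/* Final prediction: the chooser's 2-bit counter selects a component
   (>= 2 means "trust the global predictor"). */
bool predict_hybrid(uint32_t pc) {
    return (chooser[pcidx(pc)] >= 2) ? pred_global() : pred_bimodal(pc);
}

/* Saturating-counter helper. */
static void train(uint8_t *c, bool up) {
    if (up)  { if (*c < 3) (*c)++; }
    else     { if (*c > 0) (*c)--; }
}

/* Update: train both components, and train the chooser toward whichever
   component was correct when the two disagree. */
void update_hybrid(uint32_t pc, bool taken) {
    bool g_ok = (pred_global() == taken);
    bool b_ok = (pred_bimodal(pc) == taken);
    if (g_ok != b_ok) train(&chooser[pcidx(pc)], g_ok);
    train(&gpht[ghr % TABLE_SIZE], taken);
    train(&bim[pcidx(pc)], taken);
    ghr = (uint16_t)((ghr << 1) | (taken ? 1 : 0));
}
```

Training the chooser only on disagreement keeps it stable when both components already agree.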
Alpha 21264 Tournament Predictor
- Minimum branch penalty: 7 cycles
- Typical branch penalty: 11+ cycles
- 48K bits of target addresses stored in I-cache
- Predictor tables are reset on a context switch
- Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999.
Are We Done w/ Branch Prediction?
- Hybrid branch predictors work well
- E.g., 90-97% prediction accuracy on average
- Some “difficult” workloads still suffer, though!
- E.g., gcc
- Max IPC with tournament prediction: 9
- Max IPC with perfect prediction: 35
Some Other Branch Predictor Types
- Loop branch detector and predictor
- Loop iteration count detector/predictor
- Works well for loops with small number of iterations, where iteration count is predictable
- Used in Intel Pentium M
- Perceptron branch predictor
- Learns the direction correlations between individual branches
- Assigns weights to correlations
- Jimenez and Lin, “Dynamic Branch Prediction with Perceptrons,” HPCA 2001.
- Hybrid history length based predictor
- Uses different tables with different history lengths
- Seznec, “Analysis of the O-Geometric History Length branch predictor,” ISCA 2005.
Intel Pentium M Predictors: Loop and Jump
Perceptrons for Learning Linear Functions
A successful example of use of machine learning in processor design
- A perceptron is a simplified model of a biological neuron
- It is also a simple binary classifier
- A perceptron maps an input vector X to a 0 or 1
- Input = Vector X
- Perceptron learns the linear function (if one exists) of how each element of the vector affects the output (stored in an internal Weight vector)
- Output = (Weight · X + Bias) > 0
- In the branch prediction context
- Vector X: Branch history register bits
- Output: Prediction for the current branch
- Idea: Use a perceptron to learn the correlations between branch history register bits and branch outcome
- A perceptron learns a target Boolean function of N inputs
- Advantages
- + More sophisticated learning mechanism => better accuracy
- Disadvantages
- – Hard to implement (adder tree to compute perceptron output)
- – Can learn only linearly separable functions; e.g., it cannot learn an XOR-type correlation between two history bits and the branch outcome
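A minimal sketch in the spirit of the perceptron predictor; the number of perceptrons, history length, training threshold, and weight width are made-up values for illustration, not the tuned parameters from the paper.

```c
#include <stdbool.h>
#include <stdint.h>

#define HIST_LEN   16    /* history bits per perceptron (assumed) */
#define NUM_PERC   512   /* number of perceptrons (assumed)       */
#define THRESHOLD  40    /* training threshold (assumed)          */

static int8_t weights[NUM_PERC][HIST_LEN + 1];  /* [0] is the bias weight      */
static int8_t ghr[HIST_LEN];                    /* history as +1 (T) / -1 (NT) */

static unsigned pidx(uint32_t pc) { return (pc >> 2) % NUM_PERC; }

/* Output = bias + sum(weight[i] * history[i]); predict taken if output >= 0. */
static int perceptron_output(uint32_t pc) {
    int8_t *w = weights[pidx(pc)];
    int y = w[0];
    for (int i = 0; i < HIST_LEN; i++) y += w[i + 1] * ghr[i];
    return y;
}

bool predict_perceptron(uint32_t pc) {
    return perceptron_output(pc) >= 0;
}

/* Train on a misprediction, or when the output magnitude is below the
   threshold: nudge each weight toward agreement between its history bit
   and the outcome. Then shift the new outcome into the history. */
void update_perceptron(uint32_t pc, bool taken) {
    int y = perceptron_output(pc);
    int t = taken ? 1 : -1;
    int8_t *w = weights[pidx(pc)];
    if ((y >= 0) != taken || (y < 0 ? -y : y) <= THRESHOLD) {
        if (w[0] < 127 && w[0] > -127) w[0] += (int8_t)t;
        for (int i = 0; i < HIST_LEN; i++) {
            if (w[i + 1] < 127 && w[i + 1] > -127) w[i + 1] += (int8_t)(t * ghr[i]);
        }
    }
    for (int i = HIST_LEN - 1; i > 0; i--) ghr[i] = ghr[i - 1];
    ghr[0] = (int8_t)t;
}
```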
Other Ways of Handling Branches
How to Handle Control Dependences
- Critical to keep the pipeline full with correct sequence of dynamic instructions.
- Potential solutions if the instruction is a control-flow instruction:
- Stall the pipeline until we know the next fetch address
- Guess the next fetch address (branch prediction)
- Employ delayed branching (branch delay slot) (terrible idea)
- Do something else (fine-grained multithreading)
- Eliminate control-flow instructions (predicated execution)
- Fetch from both possible paths (if you know the addresses of both possible paths) (multipath execution)