Digital Design and Computer Architecture - Lecture 16b: Branch Prediction 1

This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.

You can watch this lecture on YouTube and see the PDF.

I write this summary for personal learning purposes.

Approaches to (Instruction-Level) Concurrency

  • Pipelining
  • Out-of-order execution
  • Dataflow (at the ISA level)
  • Superscalar Execution
  • VLIW
  • Fine-Grained Multithreading
  • SIMD Processing (Vector and array processors, GPUs)
  • Decoupled Access Execute
  • Systolic Arrays

Control Dependence Handling

Control Dependence

  • Question: What should the fetch PC be in the next cycle?
  • Answer: The address of the next instruction
    • All instructions are control dependent on previous ones. If execution simply continues with the next sequential instruction, the next fetch address is PC + 4.
  • If the fetched instruction is a non-control-flow instruction:
    • Next Fetch PC is the address of the next-sequential instruction
    • Easy to determine if we know the size of the fetched instruction
  • If the instruction that is fetched is a control-flow instruction:
    • How do we determine the next Fetch PC?
  • In fact, how do we even know whether or not the fetched instruction is a control-flow instruction? (You haven’t decoded it yet.) If you want to treat control-flow and non-control-flow instructions differently, then you need to guess that, too.

Branch Types

| Type | Direction known at fetch? | Number of possible next fetch addresses | When is next fetch address resolved? |
| --- | --- | --- | --- |
| Conditional | Unknown | 2 | Execution (register dependent) |
| Unconditional | Always taken | 1 | Decode (PC + offset) |
| Call | Always taken | 1 | Decode (PC + offset) |
| Return | Always taken | Many | Execution (register dependent) |
| Indirect | Always taken | Many | Execution (register dependent) |
  • Conditional
    • When you fetch a conditional branch, you don’t know its direction: whether you’re going to the next sequential instruction or to the target. The next fetch address is resolved only after you execute the instruction, e.g., beq in MIPS.
  • Unconditional
    • E.g., the jump instruction in MIPS. There is only one possible next fetch address, so this is easier to handle: you know the target address after decode.
  • Return
    • They’re not conditional but they could return to many possible locations because you may call a function from many different sites in your program. You need to figure out what is the next instruction after return. There are specific structures in a machine that try to predict calls and returns.
  • Indirect
    • Take a register and jump to it. How do you figure out what’s your next fetch PC when you fetched this indirect branch? It could be any value in the register. In this case, predicting that it’s going to be the next instruction doesn’t work because usually it doesn’t jump to the next instruction.

Different branch types can be handled differently. The hardest ones are conditional and indirect branches.

How to Handle Control Dependences

  • Critical to keep the pipeline full with the correct sequence of dynamic instructions. If the pipeline stalls, your IPC (instructions per cycle) goes down.
  • Potential solutions if the instruction is a control-flow instruction:
    • Stall the pipeline until we know the next fetch address
    • Guess the next fetch address (branch prediction)
    • Employ delayed branching (branch delay slot)
      • A delayed branch does not jump to the target immediately; instead, the next N instructions (the delay slots) are always executed. So if your pipeline has delay slots after a branch, you can put independent instructions in them, and they will always be executed. The problem is how to find those N independent instructions. You should not solve the branch problem this way, because it ties the pipeline length to the ISA, which is not good.
    • Do something else (fine-grained multithreading)
    • Eliminate control-flow instructions (predicated execution)
    • Fetch from both possible paths (if you know the addresses of both possible paths) (multipath execution)

Stall Fetch Until Next PC is Known: Good Idea?

Our maximum throughput is 0.5 instructions per cycle even though we can fetch one instruction per cycle. That’s why you need to predict branches or handle them some other way.

The Branch Problem

  • Control flow instructions (branches) are frequent
    • 15-25% of all instructions
  • Problem: The next fetch address after a control-flow instruction is not determined for N cycles in a pipelined processor
    • N cycles: (minimum) branch resolution latency
  • If we are fetching W (superscalar width) instructions per cycle (i.e., if the pipeline is W wide)
    • A branch misprediction leads to N x W wasted instruction slots, which is a lot of wasted work.

Importance of The Branch Problem

  • Assume
    • N = 20 (20 pipe stages), W = 5 (5 wide fetch)
    • 1 out of 5 instructions is a branch
    • Each 5 instruction-block ends with a branch
    • There is no data dependency
  • How long does it take to fetch 500 instructions? (100 branches)
    • 100% accuracy
      • 100 cycles (all instructions fetched on the correct path)
      • No wasted work; IPC = 500/100
    • 99% accuracy
      • 100 (correct path) + 20 * 1 (wrong path) = 120 cycles
      • 20% extra instructions fetched; IPC = 500/120
    • 90% accuracy
      • 100 (correct path) + 20 * 10 (wrong path) = 300 cycles
      • 200% extra instructions fetched; IPC = 500/300
    • 60% accuracy
      • 100 (correct path) + 20 * 40 (wrong path) = 900 cycles
      • 800% extra instructions fetched; IPC = 500/900
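The arithmetic above can be reproduced with a few lines. Assumptions are exactly those in the bullet list: N = 20 pipeline stages, W = 5 fetch width, 100 branches in 500 instructions, each misprediction wasting N cycles.

```python
# Sketch of the fetch-cycle arithmetic from the example above.
def fetch_cycles(total_insts, num_branches, accuracy, n_stages=20, width=5):
    correct_path_cycles = total_insts // width          # 500 / 5 = 100 cycles
    mispredictions = round(num_branches * (1 - accuracy))
    wasted_cycles = n_stages * mispredictions           # N cycles flushed each time
    return correct_path_cycles + wasted_cycles

for acc in (1.00, 0.99, 0.90, 0.60):
    cycles = fetch_cycles(500, 100, acc)
    print(f"accuracy {acc:.0%}: {cycles} cycles, IPC = {500 / cycles:.2f}")
# accuracy 100%: 100 cycles, IPC = 5.00
# accuracy  99%: 120 cycles, IPC = 4.17
# accuracy  90%: 300 cycles, IPC = 1.67
# accuracy  60%: 900 cycles, IPC = 0.56
```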

Branch Prediction

Branch Prediction: Guess the Next Instruction to Fetch

Idea: guess the next instruction to fetch.

Simplest: Always Guess NextPC = PC + 4

  • Always predict the next sequential instruction is the next instruction to be executed
  • This is a form of next fetch address prediction (and branch prediction)
  • How can you make this more effective?
  • Idea: Maximize the chances that the next sequential instruction is the next instruction to be executed
    • Software: Lay out the control flow graph such that the “likely next instruction” is on the not-taken path of a branch
      • Profile guided code positioning -> Pettis & Hansen, PLDI 1990.
    • Hardware: ??? (how can you do this in hardware…)
      • Hardware reorganizes the code such that you get good prediction. It’s internal, not visible to the software.
      • Cache traces of executed instructions -> Trace cache

Guessing NextPC = PC + 4

  • How else can you make this more effective?

  • Idea: Get rid of control flow instructions (or minimize their occurrence)

  • How?

    1. Get rid of unnecessary control flow instructions -> combine predicates (predicate combining). If you have a lot of branches in your code, maybe you can combine them into a single branch.

    2. Convert control dependences into data dependences -> predicated execution

Branch Prediction: Always PC+4

The branch resolves in the ALU stage, and you fetch the next instruction in the next cycle.

Performance Analysis:

  • correct guess: no penalty (~86% of the time)
  • incorrect guess: 2 bubbles (in this case)
  • Assume
    • no data dependency related stalls
    • 20% control flow instructions
    • 70% of control flow instructions are taken
    • CPI = [ 1 + (0.20*0.7) * 2 ] = [ 1 + 0.14 (probability of a wrong guess) * 2 (penalty for a wrong guess) ] = 1.28
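The CPI formula above, written out. Assumptions are those from the text: 20% of instructions are branches, 70% of them are taken, and each wrong always-not-taken guess costs 2 bubbles.

```python
# CPI under always-PC+4 prediction, per the formula above.
def cpi(branch_frac=0.20, taken_frac=0.70, mispredict_penalty=2):
    wrong_guess_prob = branch_frac * taken_frac  # PC+4 is wrong when the branch is taken
    return 1 + wrong_guess_prob * mispredict_penalty

print(round(cpi(), 2))  # 1.28
```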

Can we reduce either of the two penalty terms? Resolve branch condition and target address early. As opposed to finishing resolving the branch in the ALU stage, you can add extra logic to resolve the branch earlier.

Branch Prediction (A Bit More Enhanced)

  • Idea: Predict the next fetch address (to be used in the next cycle)
  • Requires three things to be predicted at fetch stage:
    • Whether the fetched instruction is a branch
    • (Conditional) branch direction
    • Branch target address (if taken)
  • Observation: Target address remains the same for a conditional direct branch across dynamic instances
    • Idea: Store the target address from previous instance and access it with the PC
    • Called Branch Target Buffer (BTB) or Branch Target Address Cache
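The BTB idea above can be sketched as a small tagged table indexed by the PC. This is a hypothetical Python model (entry count and address layout are made-up), not any real machine's design:

```python
# Minimal sketch of a direct-mapped Branch Target Buffer.
# A hit returns the target seen last time AND implicitly tells fetch
# "this PC is a branch" before the instruction is decoded.
class BTB:
    def __init__(self, num_entries=1024):
        self.num_entries = num_entries
        self.entries = {}  # index -> (tag, target)

    def _split(self, pc):
        index = (pc >> 2) % self.num_entries      # drop byte offset, take low bits
        tag = pc // (self.num_entries * 4)        # remaining high bits
        return index, tag

    def lookup(self, pc):
        index, tag = self._split(pc)
        entry = self.entries.get(index)
        if entry and entry[0] == tag:
            return entry[1]   # predicted target address
        return None           # miss: fall back to PC + 4

    def update(self, pc, target):  # called when a taken branch resolves
        index, tag = self._split(pc)
        self.entries[index] = (tag, target)

btb = BTB()
btb.update(0x400100, 0x400040)    # a loop branch resolved taken
print(hex(btb.lookup(0x400100)))  # 0x400040 (hit: predict this target)
print(btb.lookup(0x400104))       # None (miss: predict PC + 4)
```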

Fetch Stage with BTB and Direction Prediction

More Sophisticated Branch Direction Prediction

We’re going to look at which way other branches went. This is called global branch history, and it makes sense because there is correlation between the directions different branches take, due to program structure.

Three Things to Be Predicted

  • Requires three things to be predicted at fetch stage:
    1. Whether the fetched instruction is a branch
    2. (Conditional) branch direction
    3. Branch target address (if taken)
  • Third (3.) can be accomplished using a BTB
    • Remember target address computed last time branch was executed
  • First (1.) can also be accomplished using a BTB
    • If BTB provides a target address for the program counter, then it must be a branch
    • Or, we can store “branch metadata” bits in instruction cache/memory -> partially decoded instruction stored in I-cache
  • Second (2.): How do we predict the direction?

Branch Direction Prediction Schemes

  • Compile time (static)
    • Always not taken
    • Always taken
    • BTFN (Backward taken, forward not taken)
    • Profile based (likely direction)
  • Run time (dynamic)
    • Last time prediction (single-bit)

Static Branch Prediction

Static Branch Prediction (I)
  • Always not-taken
    • The address of the next instruction fetch is equal to PC + 4.
    • Simple to implement: no need for BTB, no direction prediction
    • Low accuracy: ~30-40% (for conditional branches)
    • Remember: Compiler can layout code such that the likely path is the “not-taken” path -> more effective prediction
  • Always taken
    • No direction prediction. Whenever you see a branch, you assume that’s going to be taken.
    • Better accuracy: ~60-70% (for conditional branches)
      • Backward branches (i.e., loop branches) are usually taken
      • Backward branch: target address lower than branch PC
  • Backward taken, forward not taken (BTFN)
    • Predict backward (loop) branches as taken, others not-taken
    • You need to figure out which branches are backward and which are forward. You can store that information in the branch target buffer: the BTB can have an extra bit saying whether each branch is backward or forward, because once you have decoded the branch you know both the target address and the program counter.
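BTFN itself is just a comparison between the branch PC and the target address. A minimal sketch (the addresses in the usage lines are made-up):

```python
# BTFN: Backward Taken, Forward Not-taken.
# A backward branch targets a lower address than its own PC (a loop back-edge).
def btfn_predict(branch_pc, target_addr):
    return target_addr < branch_pc  # True = predict taken

print(btfn_predict(0x400200, 0x4001C0))  # True  (backward: loop back-edge, predict taken)
print(btfn_predict(0x400200, 0x400280))  # False (forward: skip, predict not-taken)
```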
Static Branch Prediction (II)

This prediction is interesting because the compiler plays a bigger role.

  • Profile-based
    • Idea: Compiler determines likely direction for each branch using a profile run. Encodes that direction as a hint bit in the branch instruction format.
    • Let’s assume that we change the branch instruction format. We have a bit that is set by the compiler or the programmer. For example, the compiler says, this branch whenever I executed it in my profiling runs it’s 99% taken. So I’m going to set that bit as 1 indicating to the machine that this branch is likely going to be taken. Now, the compiler is giving a hint to the machine based on all the information that it has about the program.
    • This is good because the machine has more information: it didn’t do anything itself, but it got this information from the compiler or the programmer.

So, how do you make use of the hint bit? The first time you execute the branch, maybe you don’t use it, but you store the hint bit in the BTB. The next time you fetch this branch, you quickly access the BTB, take that bit, and use it as part of your prediction.

  • + Per-branch prediction (more accurate than the schemes above) -> accurate if the profile is representative!
  • – Requires hint bits in the branch instruction format
  • – Accuracy depends on dynamic branch behavior:
    • TTTTTTTTTTNNNNNNNNNN -> 50% accuracy
    • TNTNTNTNTNTNTNTNTNTN -> 50% accuracy
    • If it’s always taken or always not taken, you get 100% accuracy.
  • – Accuracy depends on the representativeness of profile input set
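As a sanity check on the two 50% numbers above: a static hint predicts one fixed direction for every dynamic instance of a branch, so the best either hint can do on those sequences is half.

```python
# A static per-branch hint is a single fixed direction; accuracy is just
# the fraction of dynamic instances that match it.
def static_accuracy(history, hint):
    return sum(outcome == hint for outcome in history) / len(history)

for seq in ("TTTTTTTTTTNNNNNNNNNN", "TNTNTNTNTNTNTNTNTNTN"):
    best = max(static_accuracy(seq, "T"), static_accuracy(seq, "N"))
    print(seq, "->", best)  # 0.5 for both sequences
```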
Static Branch Prediction (III)
  • Program-based (or, program analysis based)
    • Idea: Use heuristics based on program analysis to determine the statically-predicted direction. You don’t even need to profile the program.
    • Example opcode heuristic: Predict BLEZ as NT (negative integers used as error values in many programs).
      • If the branch tests “less than or equal to zero”, it turns out that most of the time the branch is not taken. You don’t need to profile the particular program; based on many programs seen in the past, you predict not-taken whenever a branch tests less-than-or-equal-to-zero.
      • It turns out negative integers are used as error values in many programs, and if you’re checking for a negative integer, predicting that you’re not going to take the branch is a good idea.
    • Example loop heuristic: Predict a branch guarding a loop execution as taken (i.e., execute the loop)
      • The branch that decides whether to execute the loop body is usually taken.
    • Pointer and FP comparisons: Predict not equal
      • If you’re comparing two pointers in a program, they’re not equal most of the time.
  • + Does not require profiling
  • – Heuristics might be not representative or good
  • – Requires compiler analysis and ISA support (ditto for other static methods)
  • Ball and Larus, ”Branch prediction for free,” PLDI 1993.
    • 20% misprediction rate
Static Branch Prediction (IV)
  • Programmer-based
    • Idea: Programmer provides the statically-predicted direction
    • Via pragmas in the programming language that qualify a branch as likely-taken versus likely-not-taken
  • + Does not require profiling or program analysis
  • + Programmer may know their program and some branches better than analysis techniques can
  • – Requires programming language, compiler, ISA support
  • – Burdens the programmer?
Pragmas

Idea: Keywords that enable a programmer to convey hints to lower levels of the transformation hierarchy. But hardware may ignore that information, because the hardware designer may think the hardware can predict the branch much better.

  • if (likely(x)) { … }
  • if (unlikely(error)) { … }
  • Many other hints and optimizations can be enabled with pragmas
    • E.g., whether a loop can be parallelized
    • #pragma omp parallel
    • Description
      • The omp parallel directive explicitly instructs the compiler to parallelize the chosen segment of code.
(Conclusion) Static Branch Prediction
  • All previous techniques can be combined
    • Profile based
    • Program based
    • Programmer based
  • What is the common disadvantage of all three techniques?
    • Static. They make use of no information that’s happening while the program is running.
    • Cannot adapt to dynamic changes in branch behavior
      • This can be mitigated by a dynamic compiler, but not at a fine granularity (and a dynamic compiler has its overheads…)
      • What is a Dynamic Compiler?
        • A compiler that generates code at runtime
        • Java JIT (just in time) compiler, Microsoft CLR (common lang. runtime)

Dynamic Branch Prediction

Idea: Predict branches based on dynamic information (collected at run-time)

  • Advantages
    • + Prediction based on history of the execution of branches
      • + It can adapt to dynamic changes in branch behavior
    • + No need for static profiling: input set representativeness problem goes away
  • Disadvantages
    • – More complex (requires additional hardware)
Last Time Predictor
  • Last time predictor
    • Single bit per branch (stored in BTB)
    • Indicates which direction the branch went the last time it executed
      • TTTTTTTTTTNNNNNNNNNN => 90% accuracy
  • Always mispredicts the last iteration and the first iteration of a loop branch
    • Accuracy for a loop with N iterations = (N-2)/N
  • + Loop branches for loops with large N (number of iterations)
  • – Loop branches for loops with small N (number of iterations)
    • TNTNTNTNTNTNTNTNTNTN => 0% accuracy
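The two accuracy claims above are easy to verify by simulating the single-bit predictor. The predictor's state is simply the branch's previous outcome; the initial state (assumed "N" here) accounts for the first-iteration misprediction.

```python
# Single-bit "last time" predictor: predict whatever the branch did last time.
def last_time_accuracy(history, initial="N"):
    state, correct = initial, 0
    for outcome in history:
        correct += (outcome == state)  # predict the previous outcome
        state = outcome                # then remember this outcome
    return correct / len(history)

print(last_time_accuracy("TTTTTTTTTTNNNNNNNNNN"))  # 0.9 (mispredicts first T and first N)
print(last_time_accuracy("TNTNTNTNTNTNTNTNTNTN"))  # 0.0 (always one step behind)
```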
