Digital Design and Computer Architecture - Lecture 14: Pipelining

January 14, 2022

This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.

You can watch this lecture on Youtube and see pdf.

I write this summary for personal learning purposes.

Agenda for Today

Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …

Stalling Hardware

Stalls are supported by:

adding enable inputs (EN) to the Fetch and Decode pipeline registers
and a synchronous reset/clear (CLR) input to the Execute pipeline register
- or an INV bit associated with each pipeline register - it is good when you debug hardware
When a lw stall occurs
- StallD(ecode) and stallF(etch) are asserted to force the Decode and Fetch stage pipeline registers to hlod their old values.
- FlushE(xecute) is also asserted to clear the contents of the Execute stage pipeline register, introducing a bubble.

Control Dependences

Special case of data dependence: dependence on PC
beq:
- Branch is not determined until the fourth stage of the pipeline. It means, you do need the target address much earlier because you need to figure out what’s the next instruction to fetch.
- Instructions after the branch are fetched before branch is resolved
  - Always predict that the next sequential instruction is fetched
  - Called “Always not taken” prediction
- These instructions must be flushed if the branch is taken
  - This can involve a lot of wasted cycles in this case. It could be four cycles but if your pipeline is 20 deep and if you determine the branch condition at the 19th stage, you lose nineteen cycles.
- Branch misprediction penalty
  - number of instructions flushed when branch is taken
  - May be reduced by determining branch earlier (not going to talk about it as much right now)
    - If you do always not-taken prediction, your prediction accuracy is about 30% so 70% of the branches are actually taken because a lot of programs are structured as loops (go back into the loop and do it over). Actually always taking predictions much better than always not taken.
    - Also, you can distinguish between different branches. This branch is always taken and this branch is always not taken, so now based on the program counter, you have seperate predictions.
- Reduce the effect of branches by
  - doing better prediction
    - Backward branches are usually taken (loops)
    - Consider history of whether branch was previously taken to improve the guess
  - reducing the branch misprediction penalty
    - Early branch resolution: but it introduces another data dependency in Decode stage

Pipelined Performance Example

SPECINT2006 benchmark:
- 25% loads
- 10% stores
- 11% branches
- 2% jumps
- 52% R-type
Suppose
- 40% of loads used by next instruction
- 25% of branches mispredicted
- All jumps flush next instruction
Load/Branch CPI = 1 when no stall/flush, 2 when stall/flush.
- When it is followed by an instruction that’s depend on it, there is a nop or a bubble that you need to insert. That’s why it takes two cycles.
- When you mispredicted, it takes two cycles.
- CPI(lw) = 1(0.6) + 2(0.4) = 1.4 Average CPI for load
- CPI(beq) = 1(0.75) + 2(0.25) = 1.25 Average CPI for branch
And
- Store, jumps and r-type instructions are always contant latency
  - The jump is always taken and right here we’re predicting always not taken, So we’re always mispricting the jump.
- Average CPI = (0.25)(1.4) + // load
  
  (0.1)(1) + // store
  
  (0.11)(1.25) + // beq
  
  (0.02)(2) + // jump
  
  (0.52)(1) // r-type
  
  = 1.15
There are 5 stages, and 5 different timing paths:

Tc = max { tpcq + tmem + tsetup // fetch 2(tRFread + tmux + teq + tAND + tmux + tsetup ) // decode tpcq + tmux + tmux + tALU + tsetup // execute tpcq + tmemwrite + tsetup // memory 2(tpcq + tmux + tRFwrite) // writeback } = 550ps (assume)
- The operation speed depends on the slowest operation
- Decode and Writeback use register file and have only half a clock cycle to complete, that is why there is a 2 in front of them
For a program with 100 billion instructions executing on a pipelined MIPS processor
- Execution Time = (# instructions) x CPI x Tc = (100 x 109^9)(1.15)(550 x 10^-12) = 63 seconds

Questions to Ponder

What is the role of the hardware vs. the software in data dependence handling?

Software based interlocking / Hardware based interlocking
- Who inserts/manages the pipeline bubbles?
- Who finds the independent instructions to fill “empty” pipeline slots?
What are the advantages/disadvantages of each?
Think of the performance equation as well

What is the role of the hardware vs. the software in the order in which instructions are executed in the pipeline

Software based instruction scheduling -> static scheduling (before runtime)
- Compiler orders the instructions, hardware executes them in that order
- Contrast this with dynamic scheduling (in which hardware can execute instructions out of the compiler-specified order)
- How does the compiler know the latency of each instruction?
Hardware based instruction scheduling -> dynamic scheduling (runtime)
- Static (before runtime) doesn’t have information that’s available at runtime. (E.g,) branch predictions. But the hardware can collect and use information as the program is running.

What information does the compiler not know that makes static scheduling difficult?

Anything that is determined at run time. Variable-length operation latency (depend on value (multiply by 0), use of cache memory…), memory addr, branch direction

How can the compiler alleviate this (i.e., estimate the unknown)?

Profiling - Compiler runs the program and optimize the code.

How does each impact different metrics?

Performance (and parts of the performance equation)
Complexity
Power consumption
Reliability
…

Pipelining and Precise Exceptions: Preserving Sequential Semantics

Multi-Cycle Execution

Not all instructions take the same amount of time for “execution”
Idea: Have multiple different functional units that take different number of cycles. (Integer add, mul, floating point multi, div…)
- Can be pipelined or not pipelined
- Can let independent instructions start execution on a different functional unit before a previous long-latency instruction finishes execution.
What is wrong if an ADD start before FP MUL in a Von Neumann architecture?
- Sequential semantics of the ISA NOT preserved!
- What if FMUL incurs an exception? (divide by 0)
  - It makes it hard for software programmers to debug.

Exceptions vs. Interrupts

Cause

Exceptions: internal to the running thread
- E.g, an instruction got a value that you cannot operate on.
Interrupts: external to the running thread
- E.g, some user decideds to hit the keyboard

When to Handle

Exceptions: when detected (and known to be non-speculative)
Interrupts: when convenient
- Except for very high priority ones
  - Power failure
  - Machine check (error)

Priority

process (of exception), depends (on what kind of interrupt)

Handling Context

process (of exception), system (interrupt)

Precise Exceptions/Interrupts

The architectural state should be consistent (precise) when the exception/interrupt is ready to be handled.

All previous instructions should be completely retired (finished and updated architectural registers and program counter).
No later instruction should be retired

Retire = commit = finish execution and update arch. state

Checking for and Handling Exceptions in Pipelining

What happens when you decide to handle an exception? When the oldest instruction ready-to-be-retired is detected to have caused an exception, the control logic

Ensures architectural state is precise (register file, PC, memory)
- Register file , PC and memory are not updated by any later instructions and are completely updated by all previous instructions.
Flushes all younger instructions in the pipeline
Saves PC and registers (as specified by the ISA)
Redirects the fetch engine to the appropriate exception handling routine

Why Do We Want Precise Exceptions?

Semantics of the von Neumann model ISA specifies it
- Contract needs to be obey
- Remember von Neumann vs. Dataflow
  - There is no precise exeption in the dataflow and that’s why dataflow machine are very hard to debug
Aids software debugging
Enables (easy) recovery from exceptions
Enables (easily) restartable processes
Enables traps into software (e.g., software implemented opcodes)

Ensuring Precise Exceptions in Pipelining

Reorder buffer (ROB)

Complete instructions out-of-order, but reorder them before making results visible to architectural state.

Add another instruction into the pipeline. After an instruction completes it writes as a result into this reordering buffer as opposed to directly writing to the register file because the register file is visible to the programmer. And when the instruction becomes the oldest one you take the value and write into ther register file or memory at that point in time.

When instruction is decoded it reserves the next-sequential entry in the ROB
When instruction completes, it writes result into ROB entry
When instruction oldest in ROB and it has completed without exceptions, its result moved to reg. file or memory
Buffers information about all instructions that are decoded but not yet retired/committed

What’s in a ROB Entry?

You need all the information about an instruction. DestRegID, DestRegVal, StoreAddr, StoreData, PC, Valid bits for reg/data + control bits, Exception.

Everything required to:
- correctly reorder instructions back into the program order
- update the architectural state with the instruction’s result(s), if instruction can retire without any issues
- handle an exception/interrupt precisely, if an exception/interrupt needs to be handled before retiring the instruction
Need valid bits to keep track of readiness of the result(s) and find out if the instruction has completed execution

Independent Operations

Result first written to ROB on instruction completion
Result written to register file at commit time
What if a later instruction needs a value in the reorder buffer?
- One option: stall the operation -> stall the pipeline
- Better: Read the value from the reorder buffer.

How to access the reorder buffer?

A register value can be in the register file, reorder buffer, (or bypass/forwarding paths). You should read the youngest data.

Idea: Use indirection

Access register file first (check if the register is valid)
- If register not valid, register file stores the ID of the reorder buffer entry that contains (or will contain) the value of the register
- Mapping of the register to a ROB entry: Register file maps the register to a reorder buffer entry if there is an in-flight instruction writing to the register
Access reorder buffer next
Now, reorder buffer does not need to be content addressable

Important: Register Renaming with a Reorder Buffer

Output and anti dependencies are not true dependencies
- WHY? The same register refers to values that have nothing to do with each other
- They exist due to lack of register ID’s (i.e. names) in the ISA
The register ID is renamed to the reorder buffer entry that will hold the register’s value
- Register ID -> ROB entry ID
- Architectural register ID -> Physical register ID
- After renaming, ROB entry ID used to refer to the register
This eliminates anti and output dependencies
- Gives the illusion that there are a large number of registers

In-Order Pipeline with Reorder Buffer

Decode (D): Access regfile/ROB, allocate entry in ROB, check if instruction can execute, if so dispatch instruction
Execute (E): Instructions can complete out-of-order
Completion (R): Write result to reorder buffer
Retirement/Commit (W): Check for exceptions; if none, write result to architectural register file or memory; else, flush pipeline and start from exception handler
In-order dispatch/execution, out-of-order completion, in-order retirement

Reorder Buffer Tradeoffs

Advantages
- Conceptually simple for supporting precise exceptions
- Can eliminate false dependences
Disadvantages
- Reorder buffer needs to be accessed to get the results that are yet to be written to the register file
  - CAM or indirection increased latency and complexity
Other solutions aim to eliminate the disadvantages (We will not cover these)
- History buffer
- Future file
- Checkpointing

And others

Make each operation take the same amount of time
- Downside
  - Worst-case instruction latency determines all instructions’ latency
  - What about memory operations?
  - Each functional unit takes worst-case number of cycles?
(History buffer, Future register file, Checkpointing)
Suggested reading
- Smith and Plezskun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans on Computers 1988 and ISCA 1985.

Share on

Twitter Facebook LinkedIn

Juhyung Kim