Digital Design and Computer Architecture - Lecture 18c: Fine-Grained Multithreading

This post is a derivative of the Digital Design and Computer Architecture lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.

You can watch this lecture on YouTube and see the PDF.

I wrote this summary for personal learning purposes.

Approaches to (Instruction-Level) Concurrency

  • Pipelining
  • Out-of-order execution
  • Dataflow (at the ISA level)
  • Superscalar Execution
  • VLIW
  • Systolic Arrays
  • Decoupled Access Execute
  • Fine-Grained Multithreading
    • Eliminates the need for dependence checking
  • SIMD Processing (Vector and array processors, GPUs)

Recall: How to Handle Data Dependences

  • Anti and output dependences are easier to handle
    • write to the destination in one stage and in program order
  • Flow dependences are more interesting
  • Five fundamental ways of handling flow dependences
    • Detect and wait until value is available in register file
    • Detect and forward/bypass data to dependent instruction
    • Detect and eliminate the dependence at the software level
      • No need for the hardware to detect dependence
    • Predict the needed value(s), execute “speculatively”, and verify
    • Do something else (fine-grained multithreading)
      • No need to detect the dependence; just switch to another thread instead.

Recall: How to Handle Control Dependences

  • It is critical to keep the pipeline full with the correct sequence of dynamic instructions.
  • Potential solutions if the instruction is a control-flow instruction:
    • Stall the pipeline until we know the next fetch address
    • Guess the next fetch address (branch prediction)
    • Employ delayed branching (branch delay slot)
    • Do something else (fine-grained multithreading)
      • Don’t try to figure out the next instruction; just fetch an instruction from another thread that you know is independent.
    • Eliminate control-flow instructions (predicated execution)
    • Fetch from both possible paths (if you know the addresses of both possible paths) (multipath execution)

Fine-Grained Multithreading

Fine-Grained Multithreading (I)

This works when you have many threads in the machine. The idea is that the hardware has multiple thread contexts: multiple program counters and multiple register files. Each cycle, the fetch engine fetches from a different thread, and you never fetch from the same thread again until its instruction has left the pipeline. So at any time there is only one instruction from a given thread in the entire pipeline. A minimal simulation of this fetch policy follows the list below.

  • Idea: Hardware has multiple thread contexts (PC+registers). Each cycle, fetch engine fetches from a different thread.
    • By the time the fetched branch/instruction resolves, no instruction is fetched from the same thread
    • Branch/instruction resolution latency overlapped with execution of other threads’ instructions
  • + No logic needed for handling control and data dependences within a thread
  • – Single thread performance suffers
  • – Extra logic for keeping thread contexts
  • – Does not overlap latency if not enough threads to cover the whole pipeline
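To make the fetch policy concrete, here is a minimal Python sketch (my own illustration, not from the lecture) with as many threads as pipeline stages, so each thread has at most one instruction in flight:

```python
# Minimal sketch of fine-grained multithreaded fetch (illustrative only).
# With NUM_THREADS >= PIPELINE_DEPTH, a thread's previous instruction
# leaves the pipeline before the thread is fetched from again, so no
# intra-thread dependence checking is needed.

PIPELINE_DEPTH = 4
NUM_THREADS = 4                  # one context (PC + register file) per thread

pcs = [0] * NUM_THREADS          # per-thread program counters
in_flight = []                   # (thread_id, pc) pairs in the pipeline

for cycle in range(12):
    tid = cycle % NUM_THREADS    # round-robin fetch: different thread each cycle
    in_flight.append((tid, pcs[tid]))
    pcs[tid] += 1                # next sequential instruction for this thread
    if len(in_flight) > PIPELINE_DEPTH:
        in_flight.pop(0)         # oldest instruction leaves the pipeline
    # Invariant: at most one instruction per thread in the pipeline
    assert len({t for t, _ in in_flight}) == len(in_flight)
    print(f"cycle {cycle:2}: fetched thread {tid}, in flight: {in_flight}")
```

With fewer threads than pipeline stages, the round-robin policy would re-fetch from a thread whose previous instruction is still in flight, which is exactly the case the last bullet above warns about.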

Fine-Grained Multithreading (II)

  • Idea: Switch to another thread every cycle such that no two instructions from a thread are in the pipeline concurrently
  • Tolerates the control and data dependency latencies by overlapping the latency with useful work from other threads
  • Improves pipeline utilization by taking advantage of multiple threads
  • Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.
  • Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.

For example, if a memory access takes ten cycles and you have a ten-cycle pipeline for it, and you start a memory access from a different thread every cycle, then you need ten threads to tolerate the latency of memory.
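As a quick check of that arithmetic (a sketch; the variable names are mine):

```python
import math

# Illustrative numbers from the paragraph above.
memory_latency_cycles = 10   # one memory access takes 10 cycles
accesses_per_cycle = 1       # one new access is started each cycle

# Threads needed so that a thread's access has returned by the time
# it is that thread's turn to issue again:
threads_needed = math.ceil(memory_latency_cycles / accesses_per_cycle)
print(threads_needed)        # -> 10
```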

Fine-Grained Multithreading: History

  • CDC 6600’s peripheral processing unit is fine-grained multithreaded
    • Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.
    • Processor executes a different I/O thread every cycle, so a different memory access is started each cycle; these are called I/O threads.
    • An operation from the same thread is executed every 10 cycles
  • Denelcor HEP (Heterogeneous Element Processor)
    • Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
    • 120 threads/processor
    • available queue vs. unavailable (waiting) queue for threads
    • each thread can have only 1 instruction in the processor pipeline; each thread independent
    • to each thread, the processor looks like a non-pipelined machine
    • system throughput vs. single-thread performance tradeoff: if you have only a single thread to execute, this buys you no performance

Fine-Grained Multithreading in HEP

  • Cycle time: 100ns
  • 8 stages -> 800 ns to complete an instruction
    • assuming no memory access
  • No control and data dependency checking
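A quick sanity check of these numbers (my own sketch; the assumption that 8 or more ready threads keep the pipeline full follows from the 8-stage depth, not from the slide):

```python
cycle_time_ns = 100
pipeline_stages = 8

# Latency seen by a single thread (no memory access): to each thread,
# the machine looks non-pipelined.
latency_ns = cycle_time_ns * pipeline_stages   # 800 ns per instruction

# With at least 8 ready threads the pipeline stays full, so one
# instruction completes every cycle across all threads.
throughput_per_second = 1e9 / cycle_time_ns    # 1e7 instructions/s

print(latency_ns, throughput_per_second)       # 800, 10000000.0
```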

Multithreaded Pipeline Example

You have multiple program counters and general-purpose register files, one per thread, plus thread selection logic; a sketch of this selection logic follows below. Existing processors are actually quite similar to this: superscalar and out-of-order processors employ multithreading, but not fine-grained multithreading. Fine-grained means you fetch from a different thread every cycle, and if you do that on a general-purpose machine, single-thread performance suffers. Fine-grained multithreading also requires separate register files per thread.
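Below is a minimal Python sketch of this thread selection logic (illustrative only; the ready/waiting split mirrors the HEP-style available vs. unavailable queues mentioned earlier):

```python
from collections import deque

# Per-thread hardware context: only a PC here; a real design also keeps
# a separate register file per thread.
ready = deque({"tid": t, "pc": 0} for t in range(4))  # "available" queue
# Threads stalled on a long-latency event would move to an "unavailable"
# (waiting) queue and rejoin `ready` once the result returns.

for cycle in range(8):
    if not ready:
        print(f"cycle {cycle}: bubble (no ready thread)")
        continue
    ctx = ready.popleft()    # thread selection: pick the next ready thread
    print(f"cycle {cycle}: fetch thread {ctx['tid']} at pc {ctx['pc']}")
    ctx["pc"] += 1
    ready.append(ctx)        # round-robin among ready threads
```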

Sun Niagara Multithreaded Pipeline

Fine-grained Multithreading

  • Advantages
    • + No need for dependency checking between instructions (only one instruction in pipeline from a single thread)
    • + No need for branch prediction logic
    • + Otherwise-bubble cycles used for executing useful instructions from different threads
    • + Improved system throughput, latency tolerance, utilization
  • Disadvantages
    • - Extra hardware complexity: multiple hardware contexts (PCs, register files, …), thread selection logic
    • - Reduced single thread performance (one instruction fetched every N cycles from the same thread)
    • - Resource contention between threads in caches and memory
    • - Some dependency checking logic between threads remains (load/store)

Modern GPUs are FGMT Machines

NVIDIA GeForce GTX 285 “core”

  • Groups of 32 threads share an instruction stream (each group is a warp): they execute the same instruction on different data
  • Up to 32 warps are interleaved in an FGMT manner
  • Up to 1024 thread contexts can be stored
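The thread-context count follows directly from the warp parameters; a quick check (illustrative Python):

```python
threads_per_warp = 32    # one instruction stream shared per warp
warps_per_core = 32      # warps interleaved in an FGMT manner
thread_contexts = threads_per_warp * warps_per_core
print(thread_contexts)   # 1024, matching the figure above
```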
