Digital Design and Computer Architecture - Lecture 16a: Dataflow & Superscalar Execution

This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.

You can watch this lecture on YouTube and see the PDF.

I write this summary for personal learning purposes.

Required Readings

  • Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proceedings of the IEEE, 1995.
  • H&H Chapters 7.8 and 7.9
  • McFarling, “Combining Branch Predictors,” DEC WRL Technical Report, 1993.

Agenda for Today & Next Few Lectures

  • Single-cycle Microarchitectures
  • Multi-cycle and Microprogrammed Microarchitectures
  • Pipelining
  • Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …
  • Out-of-Order Execution
  • Other Execution Paradigms

Recall: OOO Execution: Restricted Dataflow

  • An out-of-order engine dynamically builds the dataflow graph of a piece of the program
    • You can see this by reverse engineering the dataflow graph of a program just from the state of the reservation stations and the register file.
  • The dataflow graph is limited to the instruction window and the reorder buffer
    • Instruction window: all decoded but not yet retired instructions
    • Reorder buffer: buffers instruction results so they can be retired in program order
  • Can we do it for the whole program?
    • In other words, how can we have a large instruction window?
    • A large window helps tolerate long latencies: if an instruction takes 500 cycles and you fetch one instruction per cycle, then you need an instruction window of about 500 entries to keep decoding.
    • Scaling the window this way is not very efficient, because the wakeup-and-select loop requires broadcasting the completing instruction’s tag to every part of the machine where instructions may need to wake up, and that could be a lot of places (see the sketch after this list).
  • Load and store scheduling becomes complex with out-of-order execution
    • A younger load may be ready to be scheduled while an older store’s address is still unknown.
  • Can we do it efficiently with Tomasulo’s algorithm?
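To make the wakeup-and-select cost concrete, here is a minimal Python sketch of tag broadcast in an out-of-order engine. The names (RSEntry, broadcast) are purely illustrative, not from any real design; the point is that every reservation station entry must compare the broadcast tag against all of its source tags, every cycle.

```python
# Minimal sketch of dataflow-style wakeup in an OoO engine.
# All names are illustrative, not from any real machine.

class RSEntry:
    """One reservation station entry waiting on source tags."""
    def __init__(self, op, src_tags):
        self.op = op
        self.waiting = set(src_tags)  # producers not yet complete

    def wakeup(self, tag):
        # Every entry compares the broadcast tag to its sources.
        self.waiting.discard(tag)

    @property
    def ready(self):
        return not self.waiting

def broadcast(entries, tag):
    # The completing instruction's tag goes to *every* entry;
    # this all-to-all comparison is what limits window size.
    for e in entries:
        e.wakeup(tag)

entries = [RSEntry("add", ["t1", "t2"]), RSEntry("mul", ["t2", "t3"])]
broadcast(entries, "t2")
broadcast(entries, "t1")
print([e.ready for e in entries])  # [True, False]: mul still waits on t3
```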

Other Approaches to Concurrency (or Instruction Level Parallelism)

  • Pipelining
  • Out-of-order execution
  • Dataflow (at the ISA level)
  • Superscalar Execution
  • VLIW
  • Fine-Grained Multithreading
  • SIMD Processing (Vector and array processors, GPUs)
  • Decoupled Access Execute
  • Systolic Arrays

Today we cover just dataflow and superscalar execution. We have already seen dataflow at the ISA level, but we cover it again to contrast it with out-of-order execution, which assumes a sequential ISA. An out-of-order engine dispatches instructions out of order internally, so it gets part of the benefit of a dataflow ISA without exposing the dataflow style of execution to the programmer.

Review: Data Flow: Exploiting Irregular Parallelism

If you are processing an image and want to apply some operation to it, for example resizing it (zooming in or out), you basically apply an operation to every single pixel. That is very regular parallelism, because you do the same operation on every single pixel in the image.

Dataflow, by contrast, is very good at handling irregular parallelism, where you do not know statically how to parallelize the code; the sketch below contrasts the two cases.
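As a hypothetical illustration (this code is mine, not from the lecture), compare a regular pixel loop, which a programmer or compiler can parallelize trivially, with a pointer-chasing loop whose dependence structure is only known at run time:

```python
# Regular parallelism: the same operation on every pixel.
# Every iteration is independent, so it parallelizes statically.
def brighten(image, delta):
    return [[min(255, p + delta) for p in row] for row in image]

# Irregular parallelism: the dependence chain runs through 'next'
# pointers known only at run time, so static parallelization is
# hard. A dataflow machine simply fires each node's computation
# as soon as its inputs arrive.
class Node:
    def __init__(self, value, next=None):
        self.value, self.next = value, next

def sum_list(head):
    total, node = 0, head
    while node is not None:   # serial chain through the pointers
        total += node.value
        node = node.next
    return total
```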

Data Flow Summary

  • Availability of data determines order of execution
  • A data flow node fires when its sources are ready
  • Programs represented as data flow graphs (of nodes)
    • This is the key difference from out-of-order execution: the program itself is a dataflow graph, not a sequential instruction stream
    • Partly for this reason, dataflow at the ISA level has not been (as) successful
  • Dataflow implementations at the microarchitecture level (while preserving von Neumann semantics) have been very successful
  • Out-of-order execution is the prime example
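The firing rule is easy to state in code. Below is a minimal sketch (the graph encoding and the three-node example are my own, purely illustrative): a node fires as soon as all of its inputs are available, and no program counter is involved.

```python
# Sketch of dataflow firing: availability of data determines
# execution order. The graph format here is illustrative only.
import operator

OPS = {"+": operator.add, "*": operator.mul}

# c = a + b; d = a * b; e = c * d
# c and d may fire in either order; e must fire last.
graph = {
    "c": ("+", ["a", "b"]),
    "d": ("*", ["a", "b"]),
    "e": ("*", ["c", "d"]),
}

values = {"a": 2, "b": 3}  # initial data tokens

fired = True
while fired:
    fired = False
    for node, (op, srcs) in graph.items():
        # A node fires when all of its sources are ready.
        if node not in values and all(s in values for s in srcs):
            values[node] = OPS[op](*(values[s] for s in srcs))
            fired = True

print(values["e"])  # (2+3) * (2*3) = 30
```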

Pure Data Flow Advantages/Disadvantages (ISA level data flow)

Advantages

  • Very good at exploiting irregular parallelism
    • Execution proceeds in a dataflow manner: instructions go into (what look like) reservation stations, wait, and fire whenever their source operands are ready. You get parallelism that you would not otherwise be able to extract from the program as a programmer or as a compiler.
  • Only real dependencies constrain processing
  • More parallelism can be exposed than in the von Neumann model

Disadvantages

  • No precise state semantics
    • Debugging very difficult
    • Interrupt/exception handling is difficult (precise state means the architectural state seen at an interrupt corresponds to a single point in the sequential instruction stream; a pure dataflow machine has no such point)
  • Too much parallelism? (Parallelism control needed)
    • If you make the machine really huge, you can fit very large dataflow graphs, but it turns out the machine must be extremely large to handle real-world code. People did not have enough hardware or power budget to do all of that tag matching and searching, and to provide enough space for what are effectively reservation stations.
    • If you do not have enough buffering in the machine, you cannot keep track of which instructions are ready, i.e., the source readiness of the instructions. As a result, you may not be able to build a correct machine.
  • High bookkeeping overhead (tag matching, data storage)

Superscalar Execution

  • Idea: Fetch, decode, execute, retire multiple instructions per cycle
    • N-wide superscalar -> N instructions per cycle
  • Need to add the hardware resources for doing so
  • Hardware performs the dependence checking between concurrently-fetched instructions
    • When you decode 4 instructions per cycle, the hardware needs to figure out the dependences among those instructions, so that an instruction that depends on an earlier one in the same group waits for it (see the sketch after this list).
  • Superscalar execution and out-of-order execution are orthogonal concepts
    • Can have all four combinations of processors: [in-order, out-of-order] x [scalar, superscalar]
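A minimal sketch of that same-cycle checking, using a made-up (dest, src1, src2) instruction format: every instruction’s sources are compared against the destinations of all earlier instructions in the fetch group, so the number of comparators grows roughly quadratically with the width N.

```python
# Sketch of intra-group dependence checking in an N-wide machine.
# The (dest, src1, src2) instruction format is illustrative.
from itertools import combinations

def intra_group_hazards(group):
    """Return (younger, older) index pairs with a RAW dependence."""
    hazards = []
    for older, younger in combinations(range(len(group)), 2):
        dest = group[older][0]
        if dest in group[younger][1:]:
            hazards.append((younger, older))
    return hazards

# Four instructions fetched in the same cycle:
group = [
    ("r1", "r2", "r3"),  # 0: r1 <- r2 op r3
    ("r4", "r1", "r5"),  # 1: depends on 0
    ("r6", "r7", "r8"),  # 2: independent
    ("r9", "r4", "r6"),  # 3: depends on 1 and 2
]
print(intra_group_hazards(group))  # [(1, 0), (3, 1), (3, 2)]
```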

In-Order Superscalar Processor Example

In-Order Superscalar Performance Example

Superscalar Performance with Dependencies
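The figures for these examples are not reproduced here. As a stand-in, here is a hypothetical 2-wide in-order issue model (my own simplification: at most two instructions issue per cycle, and a pair cannot dual-issue if the second depends on the first) showing how dependences pull IPC below the ideal width:

```python
# Hypothetical 2-wide in-order issue model (not from the slides).
# Instructions are (dest, {sources}); results are assumed ready
# one cycle after issue, so a dependent pair cannot dual-issue.
def ipc_2wide(program):
    cycles, i = 0, 0
    while i < len(program):
        cycles += 1
        dest, _ = program[i]
        # Co-issue the next instruction only if it is independent.
        if i + 1 < len(program) and dest not in program[i + 1][1]:
            i += 2
        else:
            i += 1
    return len(program) / cycles

independent = [("r1", {"r9"}), ("r2", {"r9"}), ("r3", {"r9"}), ("r4", {"r9"})]
chained     = [("r1", {"r9"}), ("r2", {"r1"}), ("r3", {"r2"}), ("r4", {"r3"})]
print(ipc_2wide(independent))  # 2.0: ideal IPC for a 2-wide machine
print(ipc_2wide(chained))      # 1.0: dependences serialize issue
```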

Superscalar Execution Tradeoffs

  • Advantages
    • Higher IPC (instructions per cycle)
  • Disadvantages
    • Higher complexity for dependency checking
      • Requires checking within a single pipeline stage
        • It is very similar to the dependence checking logic of a pipelined machine, except you do it within the same cycle. That is actually harder than a pipeline’s cross-stage dependency checking, because you need to do it very quickly.
      • Renaming becomes more complex in an OoO processor
        • Renaming one instruction per cycle vs. renaming four instructions per cycle
        • If you decode four instructions per cycle, the fourth instruction you fetched can depend on the first, second, and third. So you need to compare the source registers of the fourth instruction against the destination registers of all of the earlier instructions in the group (see the rename sketch after this list).
    • More hardware resources needed
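A minimal sketch of that group renaming, with a made-up map table and free list (none of this is from the lecture): each instruction’s sources are looked up in the map table, but they must also pick up destinations allocated by earlier instructions in the same cycle. The sequential loop below models what real hardware does with a cascade of comparators.

```python
# Sketch of renaming a whole fetch group in one cycle.
# map_table: architectural register -> current physical register.
def rename_group(group, map_table, free_list):
    renamed = []
    for dest, src1, src2 in group:
        # Sources see updates made by earlier instructions in this
        # group because map_table is updated as we walk the group.
        p1, p2 = map_table[src1], map_table[src2]
        pdest = free_list.pop()      # allocate a fresh physical reg
        map_table[dest] = pdest
        renamed.append((pdest, p1, p2))
    return renamed

map_table = {f"r{i}": f"p{i}" for i in range(8)}
free_list = ["p11", "p10", "p9", "p8"]   # pop() takes from the end
group = [("r1", "r2", "r3"), ("r4", "r1", "r5")]  # 2nd reads new r1
print(rename_group(group, map_table, free_list))
# [('p8', 'p2', 'p3'), ('p9', 'p8', 'p5')]: r1 forwarded in-group
```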
