Digital Design and Computer Architecture - Lecture 18b: Systolic Arrays and Beyond

This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.

You can watch this lecture on YouTube and see the PDF.

I wrote this summary for personal learning purposes.

Broader Agenda

  • Single-cycle Microarchitectures
  • Multi-cycle and Microprogrammed Microarchitectures
  • Pipelining
  • Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …
  • Out-of-Order Execution
  • Other Execution Paradigms

Approaches to (Instruction-Level) Concurrency

  • Pipelining
  • Out-of-order execution
  • Dataflow (at the ISA level)
  • Superscalar Execution
  • VLIW
  • Systolic Arrays
  • Decoupled Access Execute
  • Fine-Grained Multithreading
  • SIMD Processing (Vector and array processors, GPUs)

Readings for Today

  • Required
    • H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982.
  • Recommended
    • Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, ISCA 2017.

Readings for Next Week

  • Required
    • Lindholm et al., “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro 2008.
  • Recommended
    • Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro 1996.

Systolic Arrays

Systolic Arrays: Motivation

  • Goal: design an accelerator. We don’t want to design a general-purpose machine; we want an accelerator that is specialized for particular tasks and that has
    • Simple, regular design (keep # unique parts small and regular)
    • High concurrency => high performance
    • Balanced computation and I/O (memory) bandwidth
      • It turns out memory has always been a problem and it’s a bigger problem when you have a lot of data.
  • Idea: Replace a single processing element (PE) with a regular array of PEs and carefully orchestrate flow of data between the PEs
    • such that they collectively transform a piece of input data before outputting it to memory
  • Benefit: Maximizes computation done on a single piece of data element brought from memory

Systolic Arrays

  • Memory: heart
  • Data: blood
  • PEs: cells

Memory pulses data through PEs

  • H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982.

Why Systolic Architectures?

  • Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
  • Similar to blood flow: heart -> many cells -> heart
    • Different cells “process” the blood
    • Many veins operate simultaneously; there is a network connecting these cells to each other
    • Can be many-dimensional
  • Why? Special purpose accelerators/architectures need
    • Simple, regular design (keep # unique parts small and regular)
    • High concurrency => high performance
    • Balanced computation and I/O (memory) bandwidth

Systolic Architectures

  • Basic principle: Replace a single PE with a regular array of PEs and carefully orchestrate flow of data between the PEs
    • Balance computation and memory bandwidth: you fetch the data once and do many operations on it, as opposed to fetching it once and doing only one operation.
  • Differences from pipelining:
    • These are individual PEs, not stages of instruction execution.
    • Array structure can be non-linear and multi-dimensional
    • PE connections can be multidirectional (and different speed)
    • PEs can have local memory and execute kernels (rather than a piece of the instruction)

Systolic Computation Example

  • Convolution
    • Used in filtering, pattern matching, correlation, polynomial evaluation, etc …
    • Many image processing tasks
    • Machine learning: up to hundreds of convolutional layers in Convolutional Neural Networks (CNN)

Implementing a Convolutional Layer with Matrix Multiplication

You can implement the convolution layer with matrix multiplication.
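
A minimal sketch of this lowering in C (the names x, w, X and the sizes are illustrative, not from the lecture): each input window of a 1D convolution is copied into one row of a matrix, and the convolution then becomes a matrix-vector multiply; the same idea extends to 2D convolutional layers.

/* Minimal sketch: lowering a 1D convolution to a matrix-vector product.
 * y[i] = sum_j w[j] * x[i + j], for i = 0 .. N-K. */
#include <stdio.h>

#define N 8   /* input length  */
#define K 3   /* filter length */

int main(void) {
    float x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float w[K] = {0.25f, 0.5f, 0.25f};

    /* "im2col": each output position gets one row holding its input window. */
    float X[N - K + 1][K];
    for (int i = 0; i <= N - K; i++)
        for (int j = 0; j < K; j++)
            X[i][j] = x[i + j];

    /* The convolution is now a matrix-vector multiply: y = X * w. */
    for (int i = 0; i <= N - K; i++) {
        float y = 0.0f;
        for (int j = 0; j < K; j++)
            y += X[i][j] * w[j];
        printf("y[%d] = %g\n", i, y);
    }
    return 0;
}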

Power of Convolutions and Applied Courses

  • In 2010, Prof. Andreas Moshovos adopted Professor Hwu’s ECE498AL Programming Massively Parallel Processors Class
  • Several of Prof. Geoffrey Hinton’s graduate students took the course
  • These students developed the GPU implementation of the Deep CNN that was trained with 1.2M images to win the ImageNet competition
  • Examples:
    • AlexNet (2012)
    • GoogLeNet (2014)
    • ResNet (2015)

Systolic Computation Example

  • Convolution
    • Used in filtering, pattern matching, correlation, polynomial evaluation, etc …
    • Many image processing tasks
    • Machine learning: up to hundreds of convolutional layers in Convolutional Neural Networks (CNN)

This is exactly what H. T. Kung at Carnegie Mellon University wanted to accelerate. He said convolution is important for many tasks, many image processing tasks among them, so convolution support had to be in there, and this is the architecture that he built.

  • Worthwhile to implement adder and multiplier separately to allow overlapping of add/mul executions
  • One needs to carefully orchestrate when data elements are input to the array
  • … and when the output is buffered: you need to ensure that the output arrives at the right time and that it gets buffered (the cycle-by-cycle sketch after this list illustrates this timing)
  • This gets more involved when
    • Array dimensionality increases. If you want to do different operations, it becomes more complicated.
    • PEs are less predictable in terms of latency. If everything takes one cycle, you can clearly orchestrate when you should input the Xs and when you will get the Ys out. But if your latencies become more variable and less predictable, then you have a problem.
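
A minimal cycle-by-cycle sketch of such orchestration, assuming a simplified weight-stationary array in which each fetched x is broadcast to all PEs and partial sums flow from PE to PE (this is not Kung's exact design; his paper discusses pure-systolic variants that pump the x values through the array instead):

/* Minimal sketch: K PEs compute the 1D convolution y[i] = sum_j w[j]*x[i+j].
 * Each cycle, one x is fetched from memory once and used by every PE;
 * PE j adds w[j]*x to the partial sum handed over by PE j-1, and PE K-1
 * emits one finished y per cycle once the array is full. */
#include <stdio.h>

#define N 8   /* input length  */
#define K 3   /* filter length */

int main(void) {
    float x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float w[K] = {0.25f, 0.5f, 0.25f};
    float r[K] = {0};                /* partial sum latched at each PE */

    for (int t = 0; t < N; t++) {    /* one clock cycle per input element */
        float xin = x[t];            /* fetched from memory exactly once  */

        /* All PEs fire in the same cycle; updating right-to-left models
         * the registers between neighboring PEs. */
        for (int j = K - 1; j > 0; j--)
            r[j] = r[j - 1] + w[j] * xin;
        r[0] = w[0] * xin;

        if (t >= K - 1)              /* output must be buffered/consumed now */
            printf("cycle %d: y[%d] = %g\n", t, t - K + 1, r[K - 1]);
    }
    return 0;
}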

Combinations

  • Systolic arrays can be chained together to form powerful systems
  • This systolic array is capable of producing, on the fly, a least-squares fit to all the data that has arrived up to any given moment

Systolic Arrays: Pros and Cons

  • Advantages:
    • Principled: Efficiently makes use of limited memory bandwidth, balances computation to I/O bandwidth availability
    • Specialized (computation needs to fit PE organization/functions) -> improved efficiency, simple design, high concurrency/performance -> can do more with less memory bandwidth
  • Downside:
    • Specialized -> not generally applicable because computation needs to fit the PE functions/organization

More Programmability in Systolic Arrays

  • Each PE in a systolic array
    • Can store multiple “weights”
    • Weights can be selected on the fly
    • Eases implementation of, e.g., adaptive filtering
  • Taken further
    • Each PE can have its own data and instruction memory
    • Data memory -> to store partial/temporary results, constants
    • Leads to stream processing, pipeline parallelism
      • More generally, staged execution

Stages of Pipelined Programs

  • Loop iterations are divided into code segments called stages
  • Threads execute stages on different cores

Pipelined File Compression Example

If you’re doing file compression and you want to parallelize it, you can break it down into different stages.
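
A minimal sketch of this staging, assuming POSIX threads and a toy stand-in for the actual compression step (a real compressor would slot in the same way): stage 1 "compresses" blocks, stage 2 writes them out, and the two stages run concurrently, handing blocks through a small bounded queue.

/* Minimal sketch of pipeline (staged) parallelism for file compression.
 * The compression step is a toy stand-in; block ids play the role of data. */
#include <pthread.h>
#include <stdio.h>

#define NBLOCKS 8
#define QCAP    2

static int q[QCAP];
static int head, tail, count;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

static void push(int v) {                    /* blocks if the queue is full  */
    pthread_mutex_lock(&lock);
    while (count == QCAP) pthread_cond_wait(&not_full, &lock);
    q[tail] = v; tail = (tail + 1) % QCAP; count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

static int pop(void) {                       /* blocks if the queue is empty */
    pthread_mutex_lock(&lock);
    while (count == 0) pthread_cond_wait(&not_empty, &lock);
    int v = q[head]; head = (head + 1) % QCAP; count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return v;
}

static void *compress_stage(void *arg) {     /* stage 1: read + "compress"   */
    (void)arg;
    for (int i = 0; i < NBLOCKS; i++)
        push(100 + i);                       /* pretend 100+i is the compressed block */
    return NULL;
}

static void *write_stage(void *arg) {        /* stage 2: write compressed data */
    (void)arg;
    for (int i = 0; i < NBLOCKS; i++)
        printf("writing compressed block %d\n", pop());
    return NULL;
}

int main(void) {
    pthread_t s1, s2;
    pthread_create(&s1, NULL, compress_stage, NULL);
    pthread_create(&s2, NULL, write_stage, NULL);
    pthread_join(s1, NULL);
    pthread_join(s2, NULL);
    return 0;
}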

Systolic Array: Advantages & Disadvantages

  • Advantages
    • Makes multiple uses of each data item -> reduced need for fetching/refetching -> better use of memory bandwidth
    • High concurrency
    • Regular design (both data and control flow)
  • Disadvantages
    • Not good at exploiting irregular parallelism
    • Relatively special purpose -> need software, programmer support to be a general purpose model

Example Systolic Array: The WARP Computer

  • HT Kung, CMU, 1984-1988
  • Linear array of 10 cells, each cell a 10 Mflop programmable processor
  • Attached to a general purpose host machine
  • HLL and optimizing compiler to program the systolic array
  • Used extensively to accelerate vision and robotics tasks
  • Annaratone et al., “Warp Architecture and Implementation,” ISCA 1986.
  • Annaratone et al., “The Warp Computer: Architecture, Implementation, and Performance,” IEEE TC 1987.

An Example Modern Systolic Array: TPU

Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, ISCA 2017.

Decoupled Access/Execute (DAE)

This is a paradigm that is not used much in existing systems, but it has some principles that are actually very good and can enable systems to be much more efficient going forward.

  • Motivation: Tomasulo’s algorithm too complex to implement
    • 1980s before Pentium Pro
  • Idea: Decouple operand access and execution via two separate instruction streams that communicate via ISA-visible queues.
    • This is kind of in between in-order and out-of-order execution, and it has a systolic principle in it as well: you have two processors, one specialized for memory access and the other specialized for execution, and they communicate with each other through queues (see the sketch after this list).
  • Smith, “Decoupled Access/Execute Computer Architectures,” ISCA 1982, ACM TOCS 1984.
  • Compiler generates two instruction streams (Access and Execute)
    • Synchronizes the two upon control flow instructions (using branch queues)
  • Advantages:
    • Execute stream can run ahead of the access stream and vice versa
      • If A is waiting for memory, E can perform useful work
      • If A hits in cache, it supplies data to lagging E
      • Queues reduce the number of required registers
    • Limited out-of-order execution without wakeup/select complexity
  • Disadvantages:
    • Compiler support to partition the program and manage queues
      • Determines the amount of decoupling
    • Branch instructions require synchronization between A and E, because if you get a branch in the access engine, you need to ensure that the execute engine doesn’t run too far ahead.
    • Multiple instruction streams (can be done with a single one, though)
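
A minimal single-threaded sketch of the idea for the loop C[i] = A[i] + B[i] (the array names and queue sizes are illustrative): the Access stream performs all loads and stores, the Execute stream performs all arithmetic, and they communicate only through FIFO queues, so one stream can slip ahead of the other by up to the queue capacity.

/* Decoupled access/execute sketch: two "instruction streams" interleaved in
 * one simulation loop, communicating only through FIFO queues. */
#include <stdio.h>

#define N    8
#define QCAP 4          /* queue capacity bounds how far Access can run ahead */

int main(void) {
    int A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 10 * i; }

    int laq[QCAP], lbq[QCAP], sq[QCAP];   /* load queues (A, B) and store queue */
    int l_head = 0, l_tail = 0;
    int s_head = 0, s_tail = 0;
    int a_i = 0;        /* next element the Access stream will load  */
    int s_i = 0;        /* next element the Access stream will store */

    while (s_i < N) {   /* one iteration = one simulated "cycle" */
        /* Access stream: issue loads while the load queues have space. */
        if (a_i < N && l_tail - l_head < QCAP) {
            laq[l_tail % QCAP] = A[a_i];
            lbq[l_tail % QCAP] = B[a_i];
            l_tail++; a_i++;
        }
        /* Execute stream: pop operands, compute, push the result. */
        if (l_head < l_tail && s_tail - s_head < QCAP) {
            sq[s_tail % QCAP] = laq[l_head % QCAP] + lbq[l_head % QCAP];
            l_head++; s_tail++;
        }
        /* Access stream: drain the store queue back to memory. */
        if (s_head < s_tail) {
            C[s_i] = sq[s_head % QCAP];
            s_head++; s_i++;
        }
    }

    for (int i = 0; i < N; i++) printf("C[%d] = %d\n", i, C[i]);
    return 0;
}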

Astronautics ZS-1

  • Single stream steered into A and X pipelines
  • Each pipeline is in-order
  • Smith et al., “The ZS-1 central processor,” ASPLOS 1987.
  • Smith, “Dynamic Instruction Scheduling and the Astronautics ZS-1,” IEEE Computer 1989.

Loop Unrolling to Eliminate Branches

Loop unrolling is a way of getting rid of branches.

/* Original loop */
for (int i = 0; i < N; i++) {
    A[i] = A[i] + B[i];
}

/* Unrolled by a factor of 4 (assumes N is a multiple of 4) */
for (int i = 0; i < N; i += 4) {
    A[i]   = A[i]   + B[i];
    A[i+1] = A[i+1] + B[i+1];
    A[i+2] = A[i+2] + B[i+2];
    A[i+3] = A[i+3] + B[i+3];
}
  • Idea: Replicate loop body multiple times within an iteration
  • + Reduces loop maintenance overhead
    • Induction variable increment or loop condition test
  • + Enlarges basic block (and analysis scope)
    • Enables code optimization and scheduling opportunities
  • – What if the iteration count is not a multiple of the unroll factor? (need extra code to detect this; see the remainder-loop sketch after this list)
  • – Increases code size
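
One common way to handle an iteration count that is not a multiple of the unroll factor is a short remainder (epilogue) loop, sketched here with the same A, B, and N as above:

/* Unrolled main loop plus a remainder loop for the leftover iterations
 * when N is not a multiple of the unroll factor (4 here). */
int i;
for (i = 0; i + 3 < N; i += 4) {
    A[i]   = A[i]   + B[i];
    A[i+1] = A[i+1] + B[i+1];
    A[i+2] = A[i+2] + B[i+2];
    A[i+3] = A[i+3] + B[i+3];
}
for (; i < N; i++)          /* at most 3 leftover iterations */
    A[i] = A[i] + B[i];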

A Modern DAE Example: Pentium 4

Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.
