Digital Design and Computer Architecture - Lecture 19: SIMD Processors

This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.

You can watch this lecture on YouTube and see the PDF.

I wrote this summary for personal learning purposes.

Approaches to (Instruction-Level) Concurrency

  • Pipelining
  • Out-of-order execution
  • Dataflow (at the ISA level)
  • Superscalar Execution
  • VLIW
  • Systolic Arrays
  • Decoupled Access Execute
  • Fine-Grained Multithreading
  • SIMD Processing (Vector and array processors, GPUs)

Readings for this Week

  • Required
    • Lindholm et al., “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro 2008.
  • Recommended
    • Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro 1996.

Exploiting Data Parallelism: SIMD Processors and GPUs

SIMD Processing: Exploiting Regular (Data) Parallelism

Usually when you are programming in C, you will often write a for loop that goes over an array and does the same thing many times. For example, you have two vectors and you want to add them element-wise: you add A[0] to B[0] and store the result in C[0], add A[1] to B[1] and store the result in C[1], and so on. The operation on element 0 and the operation on element 1 are completely independent. So, why don’t we design a machine that can operate on all these elements at the same time?
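
In C, that loop is just the sketch below (a minimal illustration of my own): every iteration touches only its own elements, which is exactly the pattern SIMD hardware targets.

/* Element-wise vector add: iteration i reads only A[i] and B[i] and
   writes only C[i], so all iterations could run at the same time. */
void vec_add(int *C, const int *A, const int *B, int n) {
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}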

Flynn’s Taxonomy of Computers

  • Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966
  • SISD: Single instruction operates on single data element
  • SIMD: Single instruction operates on multiple data elements
    • Array processor
    • Vector processor
  • MISD: Multiple instructions operate on single data element
    • Closest form: systolic array processor, streaming processor
  • MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
    • Multiprocessor
    • Multithreaded processor

Data Parallelism

  • Concurrency arises from performing the same operation on different pieces of data
    • Single instruction multiple data (SIMD)
    • E.g., dot product of two vectors (see the sketch after this list)
  • Contrast with data flow
    • Concurrency arises from executing different operations in parallel (in a data driven manner)
  • Contrast with thread (or “control”) parallelism
    • In multi-core processors, we can have individual threads running on each of the cores and they can be doing completely different things.
    • Concurrency arises from executing different threads of control in parallel
  • SIMD exploits operation-level parallelism on different data
    • Same operation concurrently applied to different pieces of data
    • A form of ILP (Instruction Level Parallelism) where instruction happens to be the same across data
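
For reference, a dot product in plain C (a sketch of mine): the per-element multiplies are fully data-parallel, while the additions into sum form a reduction, which needs extra care to vectorize.

/* Dot product: per-element multiplies are independent (SIMD-friendly);
   the accumulation into sum is a reduction across iterations. */
int dot(const int *A, const int *B, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += A[i] * B[i];
    return sum;
}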

SIMD Processing

  • Single instruction operates on multiple data elements
    • In time or in space
  • Multiple processing elements
  • Time-space duality
    • Array processor: Instruction operates on multiple data elements at the same time using different spaces
    • Vector processor: Instruction operates on multiple data elements in consecutive time steps using the same space

Array vs. Vector Processors

LD VR <- A[3:0]
ADD VR <- VR, 1
MUL VR <- VR, 2
ST A[3:0] <- VR

When we execute this one single instruction, we are loading more than one element from memory. We’re getting four elements a0, a1, a2 and a3, and we’re storing them in a vector register. You can think about it like a very long register which is divided into four slots. The second instruction is the vector add, which adds one to every element. The next instruction is a vector multiply by two. And finally, after having done all the computation, we store the four elements in memory at the same time.

LD0 LD1 LD2 LD3
AD0 AD1 AD2 AD3
MU0 MU1 MU2 MU3
ST0 ST1 ST2 ST3

In the array processor, each of the processing elements can execute any type of instruction. In the first cycle, all four loads execute at the same time, one per processing element. One cycle later, we do the add operation on the four elements. And so on…

LD0
LD1 AD0
LD2 AD1 MU0
LD3 AD2 MU1 ST0
    AD3 MU2 ST1
        MU3 ST2
            ST3

In the vector processor, we have specialized functional units. All load operations must execute in the same unit, so in the very first cycle we issue the load of a0, then the load of a1, then a2, and so on. As soon as we have read element a0, we can start executing the addition.

Real-world SIMD processors are a combination of both.

SIMD Array Processing vs. VLIW

What is the main difference between SIMD and Very Long Instruction Word (VLIW)?

  • VLIW: Multiple independent operations packed together by the compiler. We have a very smart compiler that is able to extract parallelism from the code and generate packed instruction words, each containing several independent operations.
  • Array processor: Single operation on multiple (different) data elements

Vector Processors (I)

  • A vector is a one-dimensional array of numbers

  • Many scientific/commercial programs use vectors

    for (i= 0; i<=49; i++)
        C[i] = (A[i] + B[i]) / 2;
    
    • All iterations of the loop are independent. That’s why it’s vectorizable.
  • A vector processor is one whose instructions operate on vectors rather than scalar (single data) values

  • Basic requirements

    • Need to load/store vectors -> vector registers (contain vectors)
    • Need to operate on vectors of different lengths -> vector length register (VLEN)
    • Elements of a vector might be stored apart from each other in memory -> vector stride register (VSTR)
      • Stride: distance in memory between two elements of a vector
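
To make the stride concrete, here is a small C sketch (my own illustration, not from the lecture): reading a column of a row-major matrix touches elements that are a constant distance apart, which is exactly what VSTR would be set to.

/* Strided access sketch: column j of a row-major nrows x ncols matrix.
   Consecutive elements of this "vector" are ncols apart in memory,
   so a vector load would use VSTR = ncols. */
void load_column(float *dst, const float *m, int nrows, int ncols, int j) {
    for (int i = 0; i < nrows; i++)
        dst[i] = m[i * ncols + j];   /* stride = ncols elements */
}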

Vector Processors (II)

  • A vector instruction performs an operation on each element in consecutive cycles
    • Vector functional units are pipelined
    • Each pipeline stage operates on a different data element
  • Vector instructions allow deeper pipelines
    • No intra-vector dependencies -> no hardware interlocking needed within a vector
    • No control flow within a vector
    • Known stride allows easy address calculation for all vector elements
      • Enables prefetching of vectors into registers/cache/memory
      • Prefetching: If you know exactly what you are going to access in memory, you can start the access ahead of time.

The operation that we execute on each element stored in a vector register is independent from the operations on the other elements of the same vector register. That’s why we can allow deeper pipelines, which is very good for increasing the throughput of the machine.

Vector Processor Advantages

  • No dependencies within a vector
    • Pipelining & parallelization work really well
    • Can have very deep pipelines, no dependencies!
  • Each instruction generates a lot of work
    • Reduces instruction fetch bandwidth requirements
  • Highly regular memory access pattern
  • No need to explicitly code loops
    • Fewer branches in the instruction sequence

Vector Processor Disadvantages

  • – Works (only) if parallelism is regular (data/SIMD parallelism)
    • ++ Vector operations
    • – Very inefficient if parallelism is irregular
      • – How about searching for a key in a linked list?

To program a vector machine, the compiler or hand coder must make the data structure in the code fit nearly exactly the regular structure built into the hardware. That’s hard to do in the first place, and just as hard to change. One tweak, and the low-level code has to be rewritten by a very smart and dedicated programmer who knows the hardware and often the subtleties of the application area.

Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983.

Vector Processor Limitations

Memory (bandwidth) can easily become a bottleneck, especially if

  1. compute/memory operation balance is not maintained
  2. data is not mapped appropriately to memory banks

Vector Processing in More Depth

Vector Registers

  • Each vector data register holds N M-bit values
  • Vector control registers: VLEN, VSTR, VMASK
  • Maximum VLEN can be N
    • Maximum number of elements stored in a vector register
  • Vector Mask Register (VMASK)
    • Why do we need this vector mask? Because sometimes we cannot exploit all the lanes or functional units at the same time, so we need to be able to decide whether we really want to operate on a specific data element or not.
    • Indicates which elements of vector to operate on
    • Set by vector test instructions
      • e.g., VMASK[i] = (Vk[i] == 0)

Predication

Vector Functional Units

  • Use a deep pipeline to execute element operations -> fast clock cycle
  • Control of deep pipeline is simple because elements in vector are independent

Loading/Storing Vectors from/to Memory

Why do we need memory banks? Why do we need interleaved memory in order to access memory efficiently?

-> We need to access many elements at the same time, for both reads and writes.

  • Requires loading/storing multiple elements
  • Elements separated from each other by a constant distance (stride)
    • Assume stride = 1 for now
  • Elements can be loaded in consecutive cycles if we can start the load of one element per cycle
    • Can sustain a throughput of one element per cycle
  • Question: How do we achieve this with a memory that takes more than 1 cycle to access?
  • Answer: Bank the memory; interleave the elements across banks. It is not possible without banks because accessing memory usually takes much longer than a single cycle.

Memory Banking

Banked memory or interleaved memory: instead of having one single memory with one single MDR and MAR, we’re going to have a number of banks, each with its own MDR and MAR. They still share some important components, such as the data bus and the address bus.

  • Memory is divided into banks that can be accessed independently; banks share address and data buses (to minimize pin cost)
  • Can start and complete one bank access per cycle
  • Can sustain N parallel accesses if all N go to different banks

Q. How are memory addresses going to be mapped to these banks?
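
A common answer, and an assumption in this sketch, is low-order interleaving: the bank index comes from the low bits of the word address, so consecutive words land in consecutive banks.

/* Low-order interleaving sketch: word address -> (bank, row within bank).
   With 16 banks, word addresses 0..15 map to banks 0..15, then wrap. */
#include <stdio.h>

#define NBANKS 16

int main(void) {
    for (unsigned addr = 0; addr < 20; addr++)
        printf("word %2u -> bank %2u, row %u\n",
               addr, addr % NBANKS, addr / NBANKS);
    return 0;
}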

Vector Memory System

  • Next address = Previous address + Stride
  • If (stride == 1) && (consecutive elements interleaved across banks) && (number of banks >= bank latency), then
    • we can sustain 1 element/cycle throughput

Scalar Code vs. Vector Code vs. Vector Chaining

Scalar Code Example: Element-Wise Avg.

  • for i = 0 to 49

    • C[i] = (A[i] + B[i]) / 2
  • Scalar code (instruction and its latency)

    MOVI R0 = 50            1
    MOVA R1 = A             1
    MOVA R2 = B             1
    MOVA R3 = C             1
    X: LD R4 = MEM[R1++]    11 ;autoincrement addressing
    LD R5 = MEM[R2++]       11
    ADD R6 = R4 + R5        4
    SHFR R7 = R6 >> 1       1
    ST MEM[R3++] = R7       11
    DECBNZ R0, X            2  ;decrement and branch if NZ
    

Scalar Code Execution Time (In Order)

  • Scalar execution time on an in-order processor with 1 bank
    • First two loads in the loop cannot be pipelined: 2*11 cycles
    • 4 + 50*40 = 2004 cycles (each iteration takes 11 + 11 + 4 + 1 + 11 + 2 = 40 cycles)
  • Scalar execution time on an in-order processor with 16 banks (word-interleaved: consecutive words are stored in consecutive banks)
    • First two loads in the loop can be pipelined
    • 4 + 50*30 = 1504 cycles (the second load overlaps the first, so each iteration takes 11 + 1 + 4 + 1 + 11 + 2 = 30 cycles)
  • Why 16 banks?
    • 11-cycle memory access latency
    • Having 16 (>11) banks ensures there are enough banks to overlap enough memory operations to cover memory latency

Vectorizable Loops

  • A loop is vectorizable if each iteration is independent of any other

  • For i = 0 to 49

    • C[i] = (A[i] + B[i]) / 2
  • Vectorized loop (each instruction and its latency):

    MOVI VLEN = 50        1
    MOVI VSTR = 1         1
    VLD V0 = A            11 + VLEN - 1
    VLD V1 = B            11 + VLEN - 1
    VADD V2 = V0 + V1     4 + VLEN - 1
    VSHFR V3 = V2 >> 1    1 + VLEN - 1
    VST C = V3            11 + VLEN - 1
    

Basic Vector Code Performance

  • Assume no chaining (no vector data forwarding)
    • Without this, in this case, we have to wait until the very first vector load has completely finished for all 50 elements, and then until the second vector load has completely finished, before the addition can start.
    • i.e., output of a vector functional unit cannot be used as the direct input of another
    • The entire vector register needs to be ready before any element of it can be used as part of another operation
  • One memory port (one address generator)
  • 16 memory banks (word-interleaved)
  • 285 cycles (2 + 60 + 60 + 53 + 50 + 60: each dependent vector instruction must wait for the previous one to complete entirely)

Vector Chaining

As soon as I get the first element, I can forward it to the next functional unit.

  • Vector chaining: Data forwarding from one vector functional unit to another

Vector Code Performance - Chaining

  • Vector chaining: Data forwarding from one vector functional unit to another
  • 182 cycles

Vector Code Performance - Multiple Memory Ports

  • Chaining and 2 load ports, 1 store port in each bank
  • 79 cycles
  • 19X perf. improvement!

Questions (I)

  • What if # data elements > # elements in a vector register?
    • Idea: Break loops so that each iteration operates on # elements in a vector register
      • E.g., 527 data elements, 64-element VREGs
      • 8 iterations where VLEN = 64
      • 1 iteration where VLEN = 15 (need to change value of VLEN)
    • Called vector stripmining
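
As a rough C-level illustration (my own sketch), stripmining looks like the following, with the inner loop standing in for a single vector instruction executed with VLEN = vlen.

/* Stripmining sketch: process n elements using 64-element vector registers.
   The inner loop models one vector instruction of length vlen. */
void vadd_stripmined(int *c, const int *a, const int *b, int n) {
    int vlen = n % 64;            /* first, possibly partial, strip */
    if (vlen == 0) vlen = 64;
    for (int i = 0; i < n; ) {
        for (int j = 0; j < vlen; j++)
            c[i + j] = a[i + j] + b[i + j];   /* one vector add */
        i += vlen;
        vlen = 64;                /* all remaining strips are full */
    }
}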

Questions (II)

  • What if vector data is not stored in a strided fashion in memory? (irregular memory access to a vector)

    • Idea: Use indirection to combine/pack elements into vector registers

    • Called scatter/gather operations

      • Want to vectorize loops with indirect accesses:

        for (i=0; i<N; i++)
            A[i] = B[i] + C[D[i]];
        
        LV vD, rD        # Load indices in D vector
        LVI vC, rC, vD   # Load indirect from rC base
        LV vB, rB        # Load B vector
        ADDV.D vA, vB, vC # Do add
        SV vA, rA        # Store result
        
      • Gather/scatter operations are often implemented in hardware to handle sparse vectors (matrices)

      • Vector loads and stores use an index vector which is added to the base register to generate the addresses
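
To make the gather concrete, here is a plain-C emulation (a sketch of mine, with the vector register modeled as a local buffer and N assumed to be at most 64):

/* Gather sketch: pack the indirectly addressed elements of C into a
   contiguous buffer (the "vector register"), then operate on it densely. */
void gather_add(int *A, const int *B, const int *C, const int *D, int N) {
    int vC[64];                  /* stand-in for a vector register; N <= 64 */
    for (int i = 0; i < N; i++)
        vC[i] = C[D[i]];         /* LVI: gather C[D[i]] via index vector D */
    for (int i = 0; i < N; i++)
        A[i] = B[i] + vC[i];     /* ADDV followed by SV */
}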

Conditional Operations in a Loop

  • What if some operations should not be executed on a vector (based on a dynamically-determined condition)?

    for (i=0; i<N; i++)
        if (a[i] != 0)
            b[i]=a[i]*b[i];
    
  • Idea: Masked operations

    • VMASK register is a bit mask determining which data element should not be acted upon

      VLD V0 = A
      VLD V1 = B
      VMASK = (V0 != 0)
      VMUL V1 = V0 * V1
      VST B = V1
      
    • This is predicated execution. Execution is predicated on mask bit.

Another Example with Masking

for (i = 0; i < 64; ++i)
    if (a[i] >= b[i])
        c[i] = a[i];
    else
        c[i] = b[i];

Steps to execute the loop in SIMD code

  1. Compare A, B to get VMASK
  2. Masked store of A into C
  3. Complement VMASK
  4. Masked store of B into C
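
In plain C (my own sketch), the four steps map onto the original if/else as follows, with each loop modeling one vector instruction over 64 elements.

/* Emulating the masked-store sequence with an explicit mask array. */
void masked_select(int *c, const int *a, const int *b) {
    int vmask[64];
    for (int i = 0; i < 64; i++) vmask[i] = (a[i] >= b[i]); /* 1. compare A, B */
    for (int i = 0; i < 64; i++) if (vmask[i]) c[i] = a[i]; /* 2. masked store of A */
    for (int i = 0; i < 64; i++) vmask[i] = !vmask[i];      /* 3. complement VMASK */
    for (int i = 0; i < 64; i++) if (vmask[i]) c[i] = b[i]; /* 4. masked store of B */
}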

Masked Vector Instructions

We know the bit of the mask: if it’s zero, we don’t do anything; if it’s one, we execute the multiplication or addition. But how do we actually implement that in hardware?

  • Simple Implementation
    • execute all N operations, turn off result writeback according to mask
  • Density-Time Implementation
    • scan mask vector and only execute elements with non-zero masks

The Density-Time implementation is better because it saves cycles. The drawback is that a single addition or multiplication takes a little longer, because you first need to go over the entire mask and scan it before starting the computation.

Some Issues

Stride and banking

  • As long as the stride and the number of banks are relatively prime to each other and there are enough banks to cover the bank access latency, we can sustain 1 element/cycle throughput
  • For example, say the number of banks is 16. 16 is a power of two, so its only prime divisor is 2. If my stride is an odd number, it is relatively prime to 16, so I will never have a bank conflict and my accesses run at maximum efficiency. But if the stride and the number of banks have divisors in common, for example 10 and 16, I will have some bank conflicts (see the sketch below).
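
A quick way to see this (a sketch of mine, assuming simple modulo mapping): print which bank each access in a strided stream hits.

/* Bank-conflict sketch with bank = address % 16: a stride of 7 (odd, so
   relatively prime to 16) cycles through all 16 banks, while a stride of
   10 (shares the divisor 2 with 16) reuses only 8 of them. */
#include <stdio.h>

int main(void) {
    int strides[] = {7, 10};
    for (int s = 0; s < 2; s++) {
        printf("stride %2d: banks ", strides[s]);
        for (int i = 0; i < 16; i++)
            printf("%d ", (i * strides[s]) % 16);
        printf("\n");
    }
    return 0;
}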

Storage of a matrix

A two-dimensional matrix can be organized in memory in two different ways: row major and column major.

  • Row major: Consecutive elements in a row are laid out consecutively in memory
  • Column major: Consecutive elements in a column are laid out consecutively in memory
  • You need to change the stride when accessing a row versus a column (see the sketch below)
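
For concreteness, a small sketch of the two address calculations (element offsets, not byte addresses):

/* Offset of element (i, j) in a rows x cols matrix under each layout.
   Row major: rows have stride 1, columns have stride cols.
   Column major: columns have stride 1, rows have stride rows. */
int offset_row_major(int i, int j, int cols) { return i * cols + j; }
int offset_col_major(int i, int j, int rows) { return j * rows + i; }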

Matrix Multiplication

  • A and B, both in row-major order
  • A: Load row A0 into vector register V1
    • Each time, increment the address by one to access the next column within the row
    • Accesses have a stride of 1
  • B: Load column B0 into vector register V2
    • Each time, increment the address by 10 (the row length in this example)
    • Accesses have a stride of 10

Different strides can lead to bank conflicts. How do we minimize them?

Minimizing Bank Conflicts

  • More banks
  • Better data layout to match the access pattern
    • It is not always possible but in some cases it is
    • E.g., transpose
  • Better mapping of address to bank
    • E.g., randomized mapping
    • Rau, “Pseudo-randomly interleaved memory,” ISCA 1991.
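
As a toy illustration of randomized mapping (my own sketch, not the actual scheme from Rau's paper), one can hash the address before taking the bank index so that regular strides spread across banks:

/* Toy randomized bank mapping: XOR-fold higher address bits into the
   low bits before selecting among 16 banks. Illustrative only. */
unsigned bank_of(unsigned addr) {
    return (addr ^ (addr >> 4)) % 16;
}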

Array vs. Vector Processors, Revisited

  • Array vs. vector processor distinction is a “purist’s” distinction
  • Most “modern” SIMD processors are a combination of both
    • They exploit data parallelism in both time and space
    • GPUs are a prime example we will cover in a bit more detail
