Digital Design and Computer Architecture - Lecture 18a: VLIW

This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.

You can watch this lecture on YouTube and see the PDF slides.

I write this summary for personal learning purposes.

Approaches to (Instruction-Level) Concurrency

  • Pipelining
  • Out-of-order execution
  • Dataflow (at the ISA level)
  • Superscalar Execution
  • VLIW
  • Systolic Arrays
  • Decoupled Access Execute
  • Fine-Grained Multithreading
  • SIMD Processing (Vector and array processors, GPUs)

VLIW Concept

Superscalar vs VLIW

  • Superscalar
    • Hardware fetches multiple instructions and checks dependencies between them
    • Fetch multiple instructions per cycle and decode multiple instructions per cycle.
    • Width n could be eight today: fetch, decode, execute, and finish eight instructions per cycle.
    • But the key thing is the hardware has all the burden.
  • VLIW (Very Long Instruction Word)
    • Software (compiler) packs independent instructions in a larger “instruction bundle” to be fetched and executed concurrently. If the machine fetches n instructions per cycle, the compiler guarantees that those n instructions are independent of each other, so the hardware doesn’t need to do dependence checking.
    • Hardware fetches and executes the instructions in the bundle concurrently
    • No need for hardware dependency checking between concurrently-fetched instructions in the VLIW model
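To make the contrast concrete, here is a minimal sketch (not from the lecture) of the pairwise dependence check a superscalar front-end performs in hardware every cycle. In a VLIW machine the compiler guarantees that every bundle already passes this check, so the hardware omits it entirely. The instruction representation is a simplification of my own.

```python
# Hypothetical sketch: the runtime dependence check a superscalar
# front-end does in hardware; VLIW pushes this work to the compiler.

def has_dependence(earlier, later):
    """True if 'later' conflicts with 'earlier' (RAW, WAR, or WAW)."""
    raw = earlier["dst"] is not None and earlier["dst"] in later["srcs"]
    war = later["dst"] is not None and later["dst"] in earlier["srcs"]
    waw = earlier["dst"] is not None and earlier["dst"] == later["dst"]
    return raw or war or waw

def can_issue_together(instrs):
    """A group may issue in the same cycle only if no pair conflicts."""
    return not any(has_dependence(instrs[i], instrs[j])
                   for i in range(len(instrs))
                   for j in range(i + 1, len(instrs)))

add = {"op": "add", "dst": "$t1", "srcs": ["$s1", "$s2"]}
sub = {"op": "sub", "dst": "$t2", "srcs": ["$t1", "$s3"]}  # RAW on $t1
and_ = {"op": "and", "dst": "$t3", "srcs": ["$s3", "$s4"]}

print(can_issue_together([add, sub]))   # False: sub reads $t1
print(can_issue_together([add, and_]))  # True: fully independent
```

An n-wide superscalar must do roughly n*(n-1)/2 such comparisons per cycle in hardware; removing them is exactly the VLIW simplification.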

You have multiple execution units (processing elements). When the program counter addresses a location in memory, you fetch multiple instructions, and the compiler guarantees that these instructions are independent of each other. The compiler also aligns the instructions to the functional units that can execute them. For example, if not all of the functional units can execute a load, the compiler ensures the load appears in the right place in the long instruction word, so that it can go directly into the matching execution unit. That eliminates a distribution network after fetch.
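The slot-alignment idea above can be sketched as follows. The 2-slot machine and its slot capabilities are hypothetical (assume only slot 0's unit can execute memory operations); the point is that the compiler, not the hardware, routes each instruction to a compatible slot.

```python
# Hypothetical 2-slot VLIW machine: slot 0's unit handles memory and
# ALU ops, slot 1's unit handles ALU ops only. The compiler places
# each instruction in a slot whose unit supports it, so the hardware
# can feed slots straight into functional units with no crossbar.

SLOT_CAPABILITIES = [
    {"mem", "alu"},  # slot 0 -> load/store + ALU unit
    {"alu"},         # slot 1 -> ALU-only unit
]

def align_bundle(instrs):
    """Assign each instruction to a compatible slot, or fail."""
    assert len(instrs) <= len(SLOT_CAPABILITIES)
    slots = [None] * len(SLOT_CAPABILITIES)
    # Place memory ops first: they have fewer legal slots.
    for ins in sorted(instrs, key=lambda i: i["kind"] != "mem"):
        for s, caps in enumerate(SLOT_CAPABILITIES):
            if slots[s] is None and ins["kind"] in caps:
                slots[s] = ins["op"]
                break
        else:
            raise ValueError(f"no slot for {ins['op']}")
    return slots

bundle = align_bundle([{"op": "add $t1, $s1, $s2", "kind": "alu"},
                       {"op": "lw $t0, 40($s0)", "kind": "mem"}])
print(bundle)  # ['lw $t0, 40($s0)', 'add $t1, $s1, $s2']
```

The load lands in slot 0 even though it was listed second; in a real VLIW compiler this placement happens at code-generation time.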

VLIW (Very Long Instruction Word)

  • A very long instruction word consists of multiple independent instructions packed together by the compiler
    • Packed instructions can be logically unrelated (contrast with SIMD/vector processors, which we will see soon)
  • Idea: Compiler finds independent instructions and statically schedules (i.e. packs/bundles) them into a single VLIW instruction
  • Traditional Characteristics
    • Multiple functional units, to be able to execute multiple operations concurrently
    • All instructions in a bundle are executed in lock step
    • Instructions in a bundle statically aligned to be directly fed into the functional units
      • The compiler needs intimate knowledge of the machine: what the pipeline looks like, where the execution units are, how many there are, what kinds of dependencies exist, …
      • If you’re an assembly programmer, then you need to know the machine extremely well.

VLIW Performance Example (2-wide bundles)

lw $t0, 40($s0)    ; bundle 1 — Ideal IPC = 2
add $t1, $s1, $s2  ; bundle 1
sub $t2, $s1, $s3  ; bundle 2
and $t3, $s3, $s4  ; bundle 2
or $t4, $s1, $s5   ; bundle 3
sw $s5, 80($s0)    ; bundle 3

It’s very similar to superscalar, except the compiler needs to bundle these instructions ahead of time. There is no dependency checking logic, unlike in the superscalar, because the compiler has ensured that these instructions are actually independent. If the compiler does a good job scheduling, you get a true two instructions per cycle. If the compiler is not able to find an instruction to bundle together, it inserts a no-op.
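A greedy in-order 2-wide bundler can be sketched as below. To show NOP insertion I modify the example program so the add reads $t0 (the load's result); this dependent variant is my own, not from the lecture, and the dependence test covers registers only (no memory disambiguation).

```python
# Sketch of a greedy 2-wide VLIW bundler: pair an instruction with the
# next one only if they are independent; otherwise pad with a NOP
# (parallelism loss AND code size increase).

def independent(a, b):
    """No RAW/WAR/WAW between the two instructions (registers only)."""
    return (a["dst"] not in b["srcs"] and b["dst"] not in a["srcs"]
            and (a["dst"] is None or a["dst"] != b["dst"]))

def bundle_2wide(instrs):
    bundles, i = [], 0
    while i < len(instrs):
        if i + 1 < len(instrs) and independent(instrs[i], instrs[i + 1]):
            bundles.append((instrs[i]["op"], instrs[i + 1]["op"]))
            i += 2
        else:
            bundles.append((instrs[i]["op"], "nop"))  # no partner found
            i += 1
    return bundles

prog = [
    {"op": "lw  $t0, 40($s0)", "dst": "$t0", "srcs": ["$s0"]},
    {"op": "add $t1, $s1, $t0", "dst": "$t1", "srcs": ["$s1", "$t0"]},  # RAW on $t0
    {"op": "sub $t2, $s1, $s3", "dst": "$t2", "srcs": ["$s1", "$s3"]},
]

for b in bundle_2wide(prog):
    print(b)
# ('lw  $t0, 40($s0)', 'nop')
# ('add $t1, $s1, $t0', 'sub $t2, $s1, $s3')
```

Three instructions end up in two bundles with one NOP, so achieved IPC drops to 1.5 even though the machine is 2-wide. A real VLIW compiler would also reorder instructions across larger windows to avoid the NOP where possible.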

VLIW Lock-Step Execution

  • Lock-step (all or none) execution: If any operation in a VLIW instruction stalls, all instructions stall
    • E.g., if you have load, add, and multiply instructions in a bundle and the load stalls for some reason, all of the others also stall
  • In a truly VLIW machine, the compiler handles all dependency-related stalls, hardware does not perform dependency checking
    • What about variable latency operations?
    • Sometimes your load takes one cycle because you hit in the cache, but sometimes it takes 100 cycles. Now you have a problem when you’re statically scheduling code.

VLIW Philosophy

  • Philosophy similar to RISC (simple instructions and hardware)
    • Except multiple instructions in parallel
  • RISC (John Cocke, 1970s, IBM 801 minicomputer)
    • Compiler does the hard work to translate high-level language code to simple instructions (John Cocke: control signals)
      • And, to reorder simple instructions for high performance
    • Hardware does little translation/decoding => very simple
  • VLIW (Josh Fisher, ISCA 1983)
    • Compiler does the hard work to find instruction level parallelism
    • Hardware stays as simple and streamlined as possible
      • Executes each instruction in a bundle in lock step
      • Simple => higher frequency, easier to design

Commercial VLIW Machines

VLIW has been successful in some domains, but not so successful in the general-purpose domain.

  • Multiflow TRACE, Josh Fisher (7-wide, 28-wide)
  • Cydrome Cydra 5, Bob Rau
  • Transmeta Crusoe: x86 binary-translated into internal VLIW
  • TI C6000, Trimedia, STMicro (DSP & embedded processors)
    • Most successful commercially
  • Intel IA-64
    • Not fully VLIW, but based on VLIW principles
    • EPIC (Explicitly Parallel Instruction Computing)
    • Instruction bundles can have dependent instructions
    • A few bits in the instruction format specify explicitly which instructions in the bundle are dependent on which other ones

VLIW Tradeoffs

  • Advantages
    • + No need for dynamic scheduling hardware -> simple hardware
    • + No need for dependency checking within a VLIW instruction -> simple hardware for multiple instruction issue + no renaming
    • + No need for instruction alignment/distribution after fetch to different functional units -> simple hardware
  • Disadvantages
    • – Compiler needs to find N independent operations per cycle
      • – If it cannot, inserts NOPs in a VLIW instruction
      • – Parallelism loss AND code size increase
    • – Recompilation required when execution width (N), instruction latencies, functional units change (Unlike superscalar processing)
    • – Lockstep execution causes independent operations to stall
      • – No instruction can progress until the longest-latency instruction completes

VLIW Summary

  • VLIW simplifies hardware, but requires complex compiler techniques
  • The compiler-only approach of VLIW has several downsides that reduce performance
    • – Too many NOPs (not enough parallelism discovered)
    • – Static schedule intimately tied to microarchitecture
      • – Code optimized for one generation performs poorly for next
    • – No tolerance for variable or long-latency operations (lock step)

++ Most compiler optimizations developed for VLIW are employed in optimizing compilers (for superscalar compilation) -> enable code optimizations

++ VLIW successful when parallelism is easier to find by the compiler (traditionally embedded markets, DSPs)
