Digital Design and Computer Architecture - Lecture 23a: Multiprocessor Caches

This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.

You can watch this lecture on YouTube and see the PDF.

I write this summary for personal learning purposes.

Most of the content is the same as Lecture 22.

Cache Coherence

Whose Responsibility?

  • Software
    • Can the programmer ensure coherence if caches are invisible to software?
    • What if the ISA provided cache flush instructions? (See the usage sketch after this list.)
      • FLUSH-LOCAL A: Flushes/invalidates the cache block containing address A from a processor’s local cache.
      • FLUSH-GLOBAL A: Flushes/invalidates the cache block containing address A from all other processors’ caches.
      • FLUSH-CACHE X: Flushes/invalidates all blocks in cache X.
  • Hardware
    • Simplifies software’s job
    • If you don’t solve this problem in hardware, programming becomes very, very hard. If you don’t provide cache coherence, you need to make the caches completely visible to software, as GPUs partially do today, so that the programmer controls what goes into every cache (i.e., scratchpad memory)
    • One idea: Invalidate all other copies of block A when a processor writes to it
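
To see what software-managed coherence asks of the programmer, here is a minimal sketch of a producer/consumer pair built on the flush instructions above. The `flush_local`/`flush_global` functions are hypothetical stand-ins for FLUSH-LOCAL and FLUSH-GLOBAL, not real intrinsics; their no-op bodies are placeholders so the sketch compiles.

```cpp
#include <cstddef>

// Hypothetical wrappers for the ISA flush instructions above. A real
// ISA would expose these as instructions or compiler intrinsics.
void flush_local(void* addr)  { (void)addr; }   // FLUSH-LOCAL A
void flush_global(void* addr) { (void)addr; }   // FLUSH-GLOBAL A

// Producer running on P0: write the data, then invalidate every other
// processor's copy so nobody can keep reading a stale version.
void produce(int* shared_buf, std::size_t n, int value) {
    for (std::size_t i = 0; i < n; ++i)
        shared_buf[i] = value;
    for (std::size_t i = 0; i < n; ++i)
        flush_global(&shared_buf[i]);  // per element here; once per block would suffice
}

// Consumer running on P1: drop any stale local copy first, forcing
// each read to fetch up-to-date data.
int consume(int* shared_buf, std::size_t n) {
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        flush_local(&shared_buf[i]);
        sum += shared_buf[i];
    }
    return sum;
}
```

Forgetting a single flush makes the consumer silently read stale data, which is exactly why hardware coherence simplifies software’s job.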

A Very Simple Coherence Scheme (VI)

Caches “snoop” (observe) each other’s write/read operations via a shared bus. If a processor writes to a block, all others invalidate the block.

It has only two states, Valid and Invalid (hence “VI”), and no Modified state, because these are write-through caches. Remember, in a write-through cache, whenever you write to the cache you also write to all of the other levels. That’s not good, because you waste a lot of bandwidth and don’t exploit locality.
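
As a rough sketch (my own illustration, not code from the lecture), the per-block behavior of such a VI controller can be written as a tiny state machine; the bus and memory hooks below are placeholder stubs:

```cpp
#include <cstdint>

// Two states only: Valid or Invalid. No Modified state is needed,
// because the cache is write-through, so memory is always up to date.
enum class VIState { Invalid, Valid };

struct CacheBlock {
    VIState  state = VIState::Invalid;
    uint64_t tag   = 0;
    uint64_t data  = 0;
};

// Placeholder bus/memory hooks so the sketch is self-contained.
void write_through_to_memory(uint64_t /*addr*/, uint64_t /*data*/) {}
void broadcast_write_on_bus(uint64_t /*addr*/) {}
uint64_t fetch_from_memory(uint64_t /*addr*/) { return 0; }

// Local write: update the block, write through to memory, and
// broadcast on the shared bus so every other cache can snoop it.
void on_processor_write(CacheBlock& b, uint64_t addr, uint64_t value) {
    b.state = VIState::Valid;
    b.tag   = addr;
    b.data  = value;
    write_through_to_memory(addr, value);
    broadcast_write_on_bus(addr);
}

// Snooped write from another processor: if this cache holds the
// block, its copy is now stale, so invalidate it.
void on_snooped_write(CacheBlock& b, uint64_t addr) {
    if (b.state == VIState::Valid && b.tag == addr)
        b.state = VIState::Invalid;
}

// Local read: serve Valid hits from the cache; on a miss, memory is
// guaranteed fresh (write-through), so simply refill from it.
uint64_t on_processor_read(CacheBlock& b, uint64_t addr) {
    if (b.state != VIState::Valid || b.tag != addr) {
        b.state = VIState::Valid;
        b.tag   = addr;
        b.data  = fetch_from_memory(addr);
    }
    return b.data;
}
```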

(Non-)Solutions to Cache Coherence

  • No hardware-based coherence
    • Keeping caches coherent is software’s responsibility
    • + Makes microarchitect’s life easier
    • – Makes average programmer’s life much harder
      • Programmers need to worry about hardware caches just to maintain program correctness
    • – Overhead in ensuring coherence in software (e.g., page protection and page-based software coherence; see the sketch after this list)
  • All caches are shared between all processors
    • + No need for coherence
    • – Shared cache becomes the bandwidth bottleneck
    • – Very hard to design a scalable system with low-latency cache access this way
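
To make the software-coherence overhead concrete, here is a minimal POSIX sketch of the page-protection technique mentioned above, in the spirit of software distributed-shared-memory systems. Everything here, including the `invalidate_remote_copies` hook, is an illustrative assumption, not an API from the lecture:

```cpp
#include <csignal>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

static long g_page_size;

// Placeholder: a real runtime would message other nodes/processors
// here so they drop their copies of this page.
void invalidate_remote_copies(void* /*page*/) {}

// Shared pages are mapped read-only; the first write traps here, the
// handler does the coherence work, then the page becomes writable.
void write_fault_handler(int, siginfo_t* info, void*) {
    auto addr  = reinterpret_cast<uintptr_t>(info->si_addr);
    void* page = reinterpret_cast<void*>(
        addr & ~static_cast<uintptr_t>(g_page_size - 1));
    invalidate_remote_copies(page);                       // coherence action
    mprotect(page, g_page_size, PROT_READ | PROT_WRITE);  // then allow the write
}

int main() {
    g_page_size  = sysconf(_SC_PAGESIZE);
    void* shared = mmap(nullptr, g_page_size, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa {};
    sa.sa_sigaction = write_fault_handler;
    sa.sa_flags     = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);

    // This write faults (the page is read-only), the handler runs the
    // coherence work and upgrades the page, and the write retries.
    static_cast<int*>(shared)[0] = 42;
    return static_cast<int*>(shared)[0];
}
```

Every first write to a shared page costs a trap plus handler work, and coherence is tracked only at page granularity; that is the overhead the bullet above refers to.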

Maintaining Coherence

  • Need to guarantee that all processors see a consistent value (i.e., consistent updates) for the same memory location
  • Writes to location A by P0 should be seen by P1 (eventually), and all writes to A should appear in some order
  • Coherence needs to provide:
    • Write propagation: guarantee that updates will propagate
    • Write serialization: provide a consistent order, seen by all processors, of the writes to the same memory location (illustrated in the sketch after this list)
  • Need a global point of serialization for this store ordering
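
A minimal C++ sketch (my own example, not from the lecture) of what these two guarantees mean for a single location:

```cpp
#include <atomic>
#include <thread>

// Two processors race to write the same location x. Coherence does
// not say which write wins, but it does guarantee:
//   - write propagation: the winning value eventually becomes
//     visible to every processor, and
//   - write serialization: all processors agree on the order of the
//     two writes to x, i.e., on which one won.
std::atomic<int> x{0};

int main() {
    std::thread p0([] { x.store(1, std::memory_order_relaxed); });
    std::thread p1([] { x.store(2, std::memory_order_relaxed); });
    p0.join();
    p1.join();
    // Legal final values: 1 or 2. What coherence forbids is one
    // observer seeing the writes as "1 then 2" while another sees
    // "2 then 1" for this same location x.
    return x.load();
}
```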

Hardware Cache Coherence

  • Basic idea:
    • A processor/cache broadcasts its write/update to a memory location to all other processors
    • Another cache that has the location either updates or invalidates its local copy
  • Two major approaches
    • Snoopy bus (all operations are broadcast on a shared bus)
      • The problem: a shared bus works for maybe 4, 8, 16, or 32 processors. What happens if you want hundreds or thousands of processors on a single chip? The bus becomes a bottleneck: multiple processors want to access memory and write, but every access is serialized through the bus, so you cannot support multiple transactions on it at the same time. The second issue is electrical: a bus that must accept requests from hundreds or thousands of links carries a lot of electrical loading, so you can only run it at very low frequencies. So this does not scale very well.
    • Directory based (a mediator gives permission to each request)
      • You don’t have a shared bus to broadcast requests on; instead you go through a mediator. You basically ask the mediator, “I want to write to this block,” and the mediator checks whether somebody else is already trying to write to that block. The mediator contains information about all of the caches in the system.
      • This is more scalable because you can distribute these mediators across the address space, in different parts of the system. There is no need for a single bus or a single interconnect; a mediator can sit anywhere in the network.
      • The problem is latency: you have to go through the mediator every single time you want to write to a cache block. (A sketch of such a mediator follows this list.)
  • To learn more, take the Graduate Comp Arch class
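
To make the mediator idea concrete, here is a minimal sketch of a directory entry and the permission check it performs on a write request. The structure and names are my own simplification of a directory-based protocol, not the lecture’s design:

```cpp
#include <bitset>
#include <cstdint>

constexpr int kNumProcessors = 64;

// One directory entry per memory block: which caches currently hold
// a copy, and whether one of them holds it exclusively for writing.
struct DirectoryEntry {
    std::bitset<kNumProcessors> sharers;  // caches holding a copy
    bool exclusive = false;               // one cache owns it writable
    int  owner     = -1;
};

// Placeholder network hook: the directory tells a cache to drop its copy.
void send_invalidation(int /*processor*/, uint64_t /*block_addr*/) {}

// The mediator. A processor asks "may I write this block?"; the
// directory invalidates every other copy before granting permission.
// Because all writers to a block must pass through its directory
// entry, writes to that block are serialized without any shared bus.
void handle_write_request(DirectoryEntry& dir, int requester,
                          uint64_t block_addr) {
    for (int p = 0; p < kNumProcessors; ++p)
        if (p != requester && dir.sharers.test(p))
            send_invalidation(p, block_addr);  // drop stale copies
    dir.sharers.reset();
    dir.sharers.set(requester);
    dir.exclusive = true;   // requester is now the sole writable owner
    dir.owner     = requester;
}
```

Because different blocks can map to directory slices in different parts of the system, there is no single point of contention; the price is the extra trip to the directory on every write, which is the latency problem noted above.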
