Digital Design and Computer Architecture - Lecture 22: More Caches

This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.

You can watch this lecture on YouTube and see the PDF.

I write this summary for personal learning purposes.

Readings

  • Caches
  • Required
    • H&H Chapters 8.1-8.3
    • Refresh: P&P Chapter 3.5
  • Recommended
    • An early cache paper by Maurice Wilkes
      • Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. On Electronic Computers, 1965.

Recall

Cache Structure

  • Tag store
  • Data store

Set Associativity

  • Addresses 0 and 8 always conflict in direct mapped cache
  • Instead of having one column of 8 blocks, have 2 columns of 4 blocks (a small worked example follows)
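
As a small worked example (assuming a toy cache with 8 blocks): in a direct-mapped cache the set index is the block address modulo the number of blocks, so blocks 0 and 8 both map to set 0 (8 mod 8 = 0) and keep evicting each other. In a 2-way set-associative cache with 4 sets of 2 blocks each, both still map to set 0 (8 mod 4 = 0), but they can now reside in the two ways of that set at the same time.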

What’s In A Tag Store Entry?

  • Valid bit
  • Tag
  • Replacement policy bits

  • Dirty bit?
    • It indicates that the block you have in the cache is dirty, meaning that you have written to it but you haven’t updated memory (a sketch of a tag store entry follows this list).
    • Write back vs. write through caches
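
Putting these fields together, a minimal sketch of what a tag store entry might hold is below; the field widths and the single replacement bit are illustrative assumptions, not values from the lecture.

struct tag_store_entry {
    unsigned valid : 1;   /* block holds valid data                          */
    unsigned dirty : 1;   /* block was written, memory not yet updated       */
    unsigned repl  : 1;   /* replacement policy bit(s), e.g., LRU            */
    unsigned tag   : 20;  /* high-order address bits compared on each access */
};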

Handling Writes

Handling Writes (I)

Writes cause some problems in caches. Let’s see how to handle them.

  • When do we write the modified data in a cache to the next level? Do you write it only to this first-level cache, or do you write it to all of the levels underneath?
    • Write through: At the time the write happens. Whenever the write happens, update all of the caches in the multi-level hierarchy, including main memory. Of course, only a subset of the caches could obey the write-through policy; it doesn’t have to be all of them.
    • Write back: When the block is evicted. Update only this cache and don’t update anything else. Only when the block needs to be kicked out of the cache (evicted) do you take it and write it back to the next level (a small sketch contrasting the two policies follows this list).
  • Write-back
    • + Can combine multiple writes to the same block before eviction
      • Potentially saves bandwidth between cache levels + saves energy
    • – Need a bit in the tag store indicating the block is “dirty/modified”. Whenever you do the write, you set the dirty bit; whenever the block gets evicted, the cache checks the dirty bit and, if it is set, writes the block back to the next level in the hierarchy.
    • If you have multiple writes or multiple stores that are writing to the same block, you could combine those.
  • Write-through
    • + Simpler
    • + All levels are up to date. Consistency: Simpler cache coherence because no need to check close-to-processor caches’ tag stores for presence
    • + Suppose that there is some other processor that needs the same block. If you do write-through, this other processor can get the data from the main memory without any problem.
    • – More bandwidth intensive; no combining of writes
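
A minimal, self-contained sketch contrasting the two policies is below; it models just one cache block and one memory word, and all names in it are assumptions for illustration, not real hardware or lecture code.

#include <stdint.h>

/* Toy model: one cache block caching one memory word. */
struct line { uint32_t data; int valid; int dirty; };

uint32_t memory_word;     /* stands in for the next level of the hierarchy */
struct line cache_line;   /* stands in for one cache block                 */
int write_through;        /* 1 = write-through policy, 0 = write-back      */

void write_hit(uint32_t value)
{
    cache_line.data = value;           /* always update this cache          */
    if (write_through)
        memory_word = value;           /* propagate the write immediately   */
    else
        cache_line.dirty = 1;          /* defer the update until eviction   */
}

void evict(void)
{
    if (!write_through && cache_line.dirty)
        memory_word = cache_line.data; /* one write-back combines all the
                                          earlier writes to this block      */
    cache_line.valid = 0;
    cache_line.dirty = 0;
}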

Handling Writes (II)

Let’s say you have a store instruction. You want to store, say, 4 bytes to a memory location. Do you bring the entire cache block into the cache?

  • Do we allocate a cache block on a write miss?
    • Allocate on write miss: Yes. You try to write it into the cache and if it misses, you bring the entire block into the cache.
    • No-allocate on write miss: No
  • Allocate on write miss
    • Whenever you do the first write, you bring the cache block into the cache first and then you do the write. Whenever you access some data, you bring it into the cache; you do the same thing for reads and writes (see the sketch after this list).
    • + Can combine writes instead of writing each of them individually to next level
    • + Simpler because write misses can be treated the same way as read misses
    • – Requires transfer of the whole cache block
  • No-allocate
    • + Conserves cache space if locality of writes is low (potentially better cache hit rate)
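
Continuing the toy one-block model sketched above, a rough illustration of the two write-miss policies (again, every name here is an assumption for illustration):

int allocate_on_write_miss;   /* 1 = allocate on write miss, 0 = no-allocate  */

void fetch_block(void)        /* bring the whole block in from the next level */
{
    cache_line.data  = memory_word;
    cache_line.valid = 1;
    cache_line.dirty = 0;
}

void write_miss(uint32_t value)
{
    if (allocate_on_write_miss) {
        fetch_block();        /* transfer the whole block first ...           */
        write_hit(value);     /* ... then treat it exactly like a write hit   */
    } else {
        memory_word = value;  /* no-allocate: write around the cache,
                                 directly to the next level                   */
    }
}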

Handling Writes (III)

  • What if the processor writes to an entire block over a small amount of time? (This is called streaming write)
  • Is there any need to bring the block into the cache from memory in the first place?
  • Why do we not simply write to only a portion of the block, i.e., a subblock?
    • E.g., 4 bytes out of 64 bytes
    • Problem: Valid and dirty bits are associated with the entire 64 bytes, not with each individual 4 bytes

Subblocked (Sectored) Caches

  • Idea: Divide a block into subblocks (or sectors)
    • Have separate valid and dirty bits for each subblock (sector)
    • Allocate only a subblock (or a subset of subblocks) on a request (see the tag entry sketch after this list)
  • ++ No need to transfer the entire cache block into the cache (A write simply validates and updates a subblock)
  • ++ More freedom in transferring subblocks into the cache (a cache block does not need to be in the cache fully) (How many subblocks do you transfer on a read?)
  • – More complex design
  • – May not exploit spatial locality fully
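
A sketch of what a tag store entry might look like in a subblocked cache; the subblock count and field layout are assumptions chosen for illustration.

#define SUBBLOCKS_PER_BLOCK 4   /* e.g., a 64-byte block with 16-byte subblocks */

struct sectored_tag_entry {
    unsigned tag;                              /* one tag for the whole block  */
    unsigned char valid[SUBBLOCKS_PER_BLOCK];  /* valid bit per subblock       */
    unsigned char dirty[SUBBLOCKS_PER_BLOCK];  /* dirty bit per subblock       */
};

/* A write to one subblock just sets its valid and dirty bits and updates its
   data; the remaining subblocks need not be fetched from the next level.     */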

Instruction vs. Data Caches

  • Separate or Unified?
    • Unified means instructions and data are placed in the same cache.
  • Pros and Cons of Unified:
    • + Dynamic sharing of cache space: no overprovisioning that might happen with static partitioning (i.e., separate I and D caches)
    • – Instructions and data can thrash each other (i.e., no guaranteed space for either). They can kick each other out of the cache.
    • – I and D are accessed in different places in the pipeline. Where do we place the unified cache for fast access?
  • First level caches are almost always split
    • Mainly for the last reason above
  • Higher level caches are almost always unified

Multi-level Caching in a Pipelined Design

  • First-level caches (instruction and data)
    • Decisions very much affected by cycle time (frequency of the processor)
    • Small, lower associativity; latency is critical
    • Tag store and data store accessed in parallel
  • Second-level caches
    • Decisions need to balance hit rate and access latency
    • Usually large and highly associative; latency not as important
    • Tag store and data store accessed serially
  • Serial vs. Parallel access of levels
    • Serial: Second level cache accessed only if first-level misses
    • Second level does not see the same accesses as the first
      • First level acts as a filter (filters some temporal and spatial locality)
      • Management policies are therefore different

Cache Performance

Cache Parameters that affect the Miss/Hit Rate

Also affect the latency

  • Cache size
  • Block size
  • Associativity
  • Replacement policy
  • Insertion/Placement policy

Cache Size

  • Cache size: total data (not including tag) capacity. Whenever we refer to the size of a cache, we normally talk about the data store size, not the tag store size.
    • bigger can exploit temporal locality better
    • at some point, the hit rate stops increasing because the working set already fits in the cache (and when you first start accessing the cache, there is nothing in it, so those initial misses remain).
    • not ALWAYS better
  • Too large a cache adversely affects hit and miss latency
    • smaller is faster => bigger is slower
    • access time may degrade critical path
  • Too small a cache
    • doesn’t exploit temporal locality well
    • useful data replaced often
  • Working set: the whole set of data the executing application references
    • Within a time interval
    • If your time interval is specified, you can calculate how much data the application actually touches during that interval.

Block Size

  • Block size is the data that is associated with an address tag
    • not necessarily the unit of transfer between hierarchies
      • Sub-blocking: A block divided into multiple pieces (each w/ V/D bits)
  • Too small blocks
    • if you reduce the block size, you increase the tag store size because you need more tags to identify the now more numerous blocks (a worked example follows this list).
    • don’t exploit spatial locality well
    • have larger tag overhead
  • Too large blocks
    • too few total # of blocks -> less temporal locality exploitation
    • waste of cache space and bandwidth/energy if spatial locality is not high
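
A rough worked example of the tag overhead point (the parameters are chosen for illustration): with 32-bit addresses and a 32KB direct-mapped cache, 64-byte blocks give 512 blocks, i.e., 6 offset bits, 9 index bits, and 17 tag bits per block, or 512 × 17 = 8,704 tag bits in total. Shrinking the block size to 8 bytes gives 4096 blocks, i.e., 3 offset bits, 12 index bits, and again 17 tag bits per block, but now 4096 × 17 = 69,632 tag bits: 8× the tag overhead for the same data capacity (not counting valid, dirty, and replacement bits).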

Large Blocks: Critical-Word and Subblocking

  • Large cache blocks can take a long time to fill into the cache
    • fill cache line critical word first
    • restart cache access before complete fill
  • Large cache blocks can waste bus bandwidth
    • divide a block into subblocks
    • associate separate valid and dirty bits for each subblock
    • Recall: When is this useful?

Associativity

  • How many blocks can be present in the same index (i.e., set)?
  • If you have larger associativity
    • lower miss rate (reduced conflicts)
    • higher hit latency and area cost (plus diminishing returns)
  • Smaller associativity
    • lower cost
    • lower hit latency
      • Especially important for L1 caches
  • Is power of 2 associativity required?

Classification of Cache Misses

  • Compulsory miss
    • first reference to an address (block) always results in a miss
    • subsequent references should hit unless the cache block is displaced for the reasons below
  • Capacity miss
    • cache is too small to hold everything needed
    • defined as the misses that would occur even in a fully associative cache (with optimal replacement) of the same capacity
  • Conflict miss
    • defined as any miss that is neither a compulsory nor a capacity miss

How to Reduce Each Miss Type

  • Compulsory
    • Caching cannot help
    • Prefetching can: Anticipate which blocks will be needed soon
      • For example, a browser anticipates that you are going to click a certain link soon, so it prefetches those pages from the web into its cache.
  • Conflict
    • More associativity
    • Other ways to get more associativity without making the cache associative
      • Victim cache
      • Better, randomized indexing
      • Software hints?
  • Capacity
    • Utilize cache space better: keep blocks that will be referenced
    • Software management: divide working set and computation such that each “computation phase” fits in cache
      • You don’t touch the entire 128KB or the entire 1MB at once; you touch only 64KB chunks at a time, assuming your cache size is 64KB.
      • Restructure how you access memory.
      • This is called blocking or tiling.

How to Improve Cache Performance

  • Three fundamental goals
    • Reducing miss rate
      • Caveat: reducing miss rate can reduce performance if more costly-to-refetch blocks are evicted
    • Reducing miss latency or miss cost
    • Reducing hit latency or hit cost
  • The above three together affect performance

Improving Basic Cache Performance

  • Reducing miss rate
    • More associativity
    • Alternatives/enhancements to associativity
      • Victim caches, hashing, pseudo-associativity, skewed associativity
    • Better replacement/insertion policies
    • Software approaches
  • Reducing miss latency/cost
    • Multi-level caches
    • Critical word first
    • Subblocking/sectoring
    • Better replacement/insertion policies
    • Non-blocking caches (multiple cache misses in parallel)
    • Multiple accesses per cycle
    • Software approaches

Software Approaches for Higher Hit Rate

  • Restructuring data access patterns
  • Restructuring data layout

  • Loop interchange
  • Data structure separation/merging
  • Blocking

Restructuring Data Access Patterns (I)

  • Idea: Restructure data layout or data access patterns

  • Example: If column-major

    • x[i+1,j] follows x[i,j] in memory

    • x[i,j+1] is far away from x[i,j]

    • Poor code

      for i = 1, rows
          for j = 1, columns
              sum = sum + x[i, j]
      
    • Better code

      for j = 1, columns
          for i = 1, rows
              sum = sum + x[i, j]
      
  • This is called loop interchange

  • Other optimizations can also increase hit rate

    • Loop fusion, array merging, …
  • A lot of compilers also do this.

Restructuring Data Access Patterns (II)

  • Blocking
    • Divide loops operating on arrays into computation chunks so that each chunk can hold its data in the cache
    • Avoids cache conflicts between different chunks of computation
    • Essentially: Divide the working set so that each piece fits in the cache
  • Also called Tiling (see the sketch below)
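
A minimal sketch of the idea on a matrix-multiply-style loop nest; the tile size B is an assumption and would be tuned so that the tiles touched by the inner loops fit in the cache.

#define N 1024
#define B 32      /* tile size: assumed so ~3 * B * B doubles fit in the cache */

double a[N][N], b[N][N], c[N][N];

/* Blocked (tiled) matrix multiply: the inner three loops work on B x B tiles
   of a, b, and c, so the data touched in one chunk stays cache-resident.     */
void matmul_blocked(void)
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        for (int k = kk; k < kk + B; k++)
                            c[i][j] += a[i][k] * b[k][j];
}

Compared to the untiled i-j-k loop nest, each B × B tile of b is reused for B different rows of c while it is still in the cache, instead of being re-fetched from memory every time.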

Restructuring Data Layout

1.
struct Node {
    struct Node* next;
    int key;
    char name[256];
    char school[256];
};

while (node) {
    if (node->key == input_key) {
        // access other fields of node
    }
    node = node->next;
}
  • Pointer based traversal (e.g., of a linked list)
  • Assume a huge linked list (1B nodes) and unique keys
  • Why does the code above have a poor cache hit rate?
    • “Other fields” occupy most of the cache line even though rarely accessed!
2.
struct Node {
    struct Node* next;
    int key;
    struct Node_data* node_data;
};
struct Node_data {
    char name[256];
    char school[256];
};
while (node) {
    if (node->key == input_key) {
        // access node->node_data
    }
    node = node->next;
}
  • Idea: separate frequently used fields of a data structure and pack them into a separate data structure
  • Who should do this?
    • Programmer
    • Compiler
      • Profiling vs. dynamic
    • Hardware?
    • Who can determine what is frequently used?

Multi-Core Issues in Caching

Caches in a Multi-Core System

  • Cache efficiency becomes even more important in a multicore/multi-threaded system
    • Because more threads or more cores require data, you need to cache things very well. If you’re not being intelligent about it, one core can destroy the locality of another core, or one thread can destroy the locality of another thread. If this happens, you have not only a locality problem but also a bandwidth problem.
    • Memory bandwidth is at premium
    • Cache space is a limited resource across cores/threads
  • How do we design the caches in a multi-core system?

  • Many decisions
    • Shared vs. private caches
    • How to maximize performance of the entire system?
    • How to provide QoS to different threads in a shared cache?
    • Should cache management algorithms be aware of threads?
    • How should space be allocated to threads in a shared cache?

Private vs. Shared Caches

Resource Sharing Concept and Advantages

  • Idea: Instead of dedicating a hardware resource to a hardware context, allow multiple contexts to use it
    • Example resources: functional units, pipeline, caches, buses, memory
  • Why?
    • + Resource sharing improves utilization/efficiency -> throughput
      • When a resource is left idle by one thread, another thread can use it; no need to replicate shared data
    • + Reduces communication latency
      • For example, data shared between multiple threads can be kept in the same cache in multithreaded processors
    • + Compatible with the shared memory programming model

Resource Sharing Disadvantages

  • Resource sharing results in contention for resources
    • When the resource is not idle, another thread cannot use it
    • If space is occupied by one thread, another thread needs to reoccupy it
  • Sometimes reduces each or some thread’s performance
    • Thread performance can be worse than when it is run alone
  • Eliminates performance isolation -> inconsistent performance across runs
    • Thread performance depends on co-executing threads
  • Uncontrolled (free-for-all) sharing degrades QoS
    • Causes unfairness, starvation

Need to efficiently and fairly utilize shared resources

Shared Caches Between Cores

  • Advantages
    • High effective capacity
    • Dynamic partitioning of available cache space
      • No fragmentation due to static partitioning
      • If one core does not utilize some space, another core can
    • Easier to maintain coherence (a cache block is in a single location)
  • Disadvantages
    • Slower access (cache not tightly coupled with the core)
    • Cores incur conflict misses due to other cores’ accesses
      • Misses due to inter-core interference
      • Some cores can destroy the hit rate of other cores
    • Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)

Cache Coherence

  • If multiple processors cache the same block, how do they ensure they all see a consistent state?
  • Many processors today implement hardware cache coherence protocols. Whenever a processor writes some data into a block, hardware automatically searches all of the other caches and invalidates that block in those caches (a rough sketch of this write-invalidate idea follows).
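
A very rough software sketch of that invalidate-on-write idea is below; real coherence is implemented in hardware, and the structures and names here are purely illustrative.

#include <stdint.h>

#define NUM_CACHES 4

/* Each cache's copy of one block, with a valid bit. */
struct copy { int valid; uint32_t data; };
struct copy copies[NUM_CACHES];

/* Write-invalidate: when cache `writer` writes the block, every other cached
   copy is invalidated, so no cache can later read a stale value.             */
void coherent_write(int writer, uint32_t value)
{
    for (int i = 0; i < NUM_CACHES; i++)
        if (i != writer)
            copies[i].valid = 0;    /* invalidate other caches' copies          */

    copies[writer].valid = 1;
    copies[writer].data  = value;   /* the writer now holds the up-to-date copy */
}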
