Digital Design and Computer Architecture - Lecture 24: Virtual Memory II

March 19, 2022

This post is a derivative of Digital Design and Computer Architecture Lecture by Prof. Onur Mutlu, used under CC BY-NC-SA 4.0.

You can watch this lecture on Youtube and see pdf.

I write this summary for personal learning purposes.

Readings

Virtual Memory
Required
- H&H Chapter 8.4

Recall

A System with Physical Memory Only

Virtual Memory

Idea: Give the programmer the illusion of a large address space while having a small physical memory
- So that the programmer does not worry about managing physical memory
Programmer can assume he/she has “infinite” amount of physical memory
Hardware and software cooperatively and automatically manage the physical memory space to provide the illusion
- Illusion is maintained for each independent process

Virtual Memory: Conceptual View

A System with Virtual Memory (Page-based)

The page table is the mechanism that enables address translation. It’s an os managed lookup table but hardware has access to it through the memory accesses and the hardware caches it. It tells you whether the virtual page you are accessing is in physical memory or not. And if not, it helps you at least the virtual memory system helps bring the page from the disk into the physical memory. As a result, physical memory is a cache for the disk.

Four Issues in Indirection and Mapping

When to map a virtual address to a physical address?
- When the virtual address is first referenced by the program
What is the mapping granularity?
- What is your page size?
- Byte? Kilo-byte? Mega-byte? …
- Multiple granularities?
- There are clear trade-offs between large pages and small pages.
  - If you have a very small page size, then you may need very large page tables and you may not get very good locality also when you bring things into the physical memory.
  - If you have a very large page size, your page table size becomes smaller because you need to store fewer pages in the page table. And you may get very good locality when you transfer a large page into your physical memory if you have very good locality. But if you don’t have very good spatial locality, you may transferring a one gigabyte page into your physical memory and using only 8 bytes of it. As a result, it doesn’t work very well.
Where and how to store the virtual physical mappings?
- Operating system data structures? Hardware? Cooperative?
What to do when physical address space is full?
- Evict an unlikely-to-be-needed virtual address from physical memory

Physical Memory as a Cache

In other words…
Physical memory is a cache for pages stored on disk
- In fact, it is a fully-associative cache in modern systems (a virtual page can potentially be mapped to any physical frame)
Similar caching issues exist as we have covered earlier:
- Placement: where and how to place/find a page in cache?
- Replacement: what page to remove to make room in cache?
- Granularity of management: large, small, uniform pages?
- Write policy: what do we do about writes? Write back?
- + Prefetching

Virtual Memory Definitions

Page size: the mapping granularity of virtual -> physical address spaces
- dictates the amount of data transferred from hard disk to DRAM at once
- you could sub page things just like we did sub-blocking if you remember in caches but you need to be very careful about this because now things are exposed to software
Page table: table that stores virtual physical page mappings
- lookup table used to translate virtual page addresses to physical frame addresses (and find where the associated data is)
Address translation: the process of determining the physical address from the virtual address

Virtual to Physical Mapping

Most accesses hit in physical memory
Programs see the large capacity of virtual memory

Address Translation

Page Table Address Translation Example

Page Table Size

Suppose 64-bit VA and 40-bit PA, how large is the page table?
- 2^52 entries x ~4 bytes ~= 2^54 bytes and that is for just one process! and the process may not be using the entire VM space!

Page Table Challenges

Challenge 1: Page table is large — Multi-level page tables

Solution: multi-level (hierarchical) page tables
Idea: Organize page table in a hierarchical manner such that only a small first-level page table has to be in physical memory. And all of the other page tables can be in virtual memory.

Multi-Level Page Table Example

The real page table is actually second-level page table. But to be able to access their second-level page table, you need to go through another level so that you can actually get the page table entry you’re looking for that corresponds to the full virtual page number.

First-level page table has to be in physical memory
Only the needed second-level page tables can be kept in physical memory

The downsize is each translation requires additional latency.

Multi-Level Page Table: Address Translation

Multi-Level Page Tables from x86 Manual

x86 Page Tables (I): Small Pages

x86 Page Tables (II): Large Pages

Four-level Paging in x86-64

We solved the problem. We made it multi-level so that’s only a small fraction has to be in physical memory. We don’t waste physical memory address space and hierarchy page table also enables us with the ability to not allocate page tables in virtual memory.

Challenge 2: Each instruction fetch or load/store requires at least two memory accesses

one for address translation (page table read)
one to access data with the physical address (after translation)

Two memory accesses to service an instruction fetch or load/store greatly degrades execution time
- Num. of memory accesses increases with multi-level page tables
- Unless we are clever… -> speed up the translation…

Translation Lookaside Buffer (TLB)

These translation could have very good locality because whenever we access particular location in virtual memory, it’s likely that we’re going to access location again or locations around it. So we can get both good temporal locality and space locality. Since if the data has locality, translations should have locality as well because these are coupled with each other.

Idea: Cache the page table entries (PTEs) in a hardware structure in the processor to speed up address translation
Translation lookaside buffer (TLB)
- We’re not caching the data, we’re caching the translations.
- Small cache of most recently used translations (PTEs)
- Reduces number of memory accesses required for most instruction fetches and loads/stores to only one
Page table accesses have a lot of temporal locality
- Memory accesses have temporal and spatial locality
- Large page sizes aid spatial locality (4KB, 8KB, MBs, GBs)
- Consecutive instructions and loads/stores are likely to access same page
TLB: cache of page table entries (i.e., translations)
- Small: accessed in ~1 cycle
- Typically 16 - 512 entries at level 1
- Usually high associativity
- 90-99 % hit rates typical (depends on workload)
- Reduces number of memory accesses for most instruction fetches and loads/stores to only one

Example Two-Entry TLB

TLB is a Translation (PTE) Cache

All issues we discussed in caching and prefetching lectures apply to TLBs
Example issues:
- Instruction vs. Data TLBs
- Multi-level TLBs
- Associativity and size choices and tradeoffs
- Insertion, promotion, replacement policies\
- What to keep in which TLB and how to decide that
- Prefetching into the TLBs
- TLB coherence
- Shared vs. private TLBs across cores/threads
- …

Virtual Memory Support and Examples

Supporting Virtual Memory

Virtual memory requires both HW+SW support
- If you do it in just software, your performance is terrible
- Page Table is in memory
- Can be cached in special hardware structures called Translation Lookaside Buffers (TLBs)
The hardware component is called the MMU (memory management unit, usually a per core)
- Includes Page Table Base Register(s), TLBs, page walkers
It is the job of the software to leverage the MMU to
- Populate page tables, decide what to replace in physical memory
- Change the Page Table Base Register on context switch (to use the running thread’s page table)
- Handle page faults and ensure correct mapping

Address Translation

How to obtain the physical address from a virtual address?
Page size specified by the ISA
- VAX: 512 bytes
- Today: 4KB, 8KB, 2GB, … (small and large pages mixed together)
- Trade-offs? (remember cache lectures)
Page Table contains an entry for each virtual page
- Called Page Table Entry (PTE)
- What is in a PTE?

What Is in a Page Table Entry (PTE)?

Page table is the “tag store” for the physical memory data store
- A mapping table between virtual memory and physical memory
PTE is the “tag store entry” for a virtual page in memory
- Need a valid bit -> to indicate validity/presence in physical memory
- Need tag bits (PFN) -> to support translation
- Need bits to support replacement
- Need a dirty bit to support “write back caching”
- Need protection bits to enable access control and protection

Recall: Address Translation

Address Translation: Page Hit

Let’s take a look at what the mmu does when you get a page hit meaning address translation hit.

Processor sends virtual address to MMU
MMU fetches PTE from page table in memory
MMU fetches PTE from page table in memory
MMU sends physical address to L1 cache
L1 cache sends data word to processor

Address Translation: Page Fault

Processor sends virtual address to MMU
MMU fetches PTE from page table in memory
MMU fetches PTE from page table in memory
Valid bit is zero, so MMU triggers page fault exception (Software needs to handle)
Handler identifies victim, and if dirty pages it ou to disk
Handler pages in new page and updates PTE in memory
Handler returns to original process, restarting faulting instruction

Page Fault (“A Miss in Physical Memory”)

If a page is not in physical memory but disk
- Page table entry indicates virtual page not in memory
- Access to such a page triggers a page fault exception
- OS trap handler invoked to move data from disk into memory
  - Other processes can continue executing
  - OS has full control over placement

Servicing a Page Fault

Processor signals I/O controller
- Read block of length P starting at disk address X and store starting at memory address Y
Disk-to-mem read occurs
- Direct Memory Access (DMA)
- Under control of I/O controller
Controller signals completion
- Interrupts processor
- OS resumes suspended process

Page Replacement Algorithms

This is similar to page replacement algorithms but a little bit different in the sense that physical memory is huge

If physical memory is full (i.e., list of free physical pages is empty), which physical frame to replace on a page fault?
Is True LRU feasible?
- 4GB memory, 4KB pages, how many possibilities of ordering?
Modern systems use approximations of LRU
- E.g., the CLOCK algorithm
And, more sophisticated algorithms to take into account “frequency” of use
- E.g., the ARC algorithm
- Megiddo and Modha, “ARC: A Self-Tuning, Low Overhead Replacement Cache,” FAST 2003.

CLOCK Page Replacement Algorithm

Keep a circular list of physical frames in memory (OS does)
Keep a pointer (hand) to the last-examined frame in the list
When a page is accessed, set the R bit in the PTE
When a frame needs to be replaced, replace the first frame that has the reference (R) bit not set, traversing the circular list starting from the pointer (hand) clockwise
- During traversal, clear the R bits of examined frames
- Set the hand pointer to the next frame in the list

Cache versus Page Replacement

Physical memory (DRAM) is a cache for disk
- Managed by system software via the virtual memory subsystem
Page replacement is similar to cache replacement
Page table is the “tag store” for physical memory data store
What is the difference?
- Required speed of access to cache vs. physical memory
  - Cache needs to be fast, usually physical memory is slow. So you have more freedom.
- Number of blocks in a cache vs. physical memory
  - A cache is smaller than number of blocks in physical memory
- “Tolerable” amount of time to find a replacement candidate (disk versus memory access latency)
- Role of hardware versus software

Memory Protection

Multiple programs (i.e., processes) run concurrently
- Each process has its own page table
- Each process can use its entire virtual address space without worrying about where other programs are
A process can only access physical pages mapped in its page table – cannot overwrite memory of another process
- Provides protection and isolation between processes
- Enables access control mechanisms per page

Page Table is Per Process

Each process has its own virtual address space
- Full address space for each program, but they can also share the pages (e.g., read-only library code or shared data)
- Simplifies memory allocation, sharing, linking and loading

Access Protection/Control via Virtual Memory

Page-Level Access Control (Protection)

Not every process is allowed to access every page
- E.g., need supervisor (i.e., kernel) level privilege to access system pages
- E.g., may not be able to execute “instructions” in some pages
Idea: Store access control information on a page basis in the process’s page table
Enforce access control at the same time as translation
Virtual memory system serves two functions today
- Address translation (for illusion of large physical memory)
- Access control (protection)

VM as a Tool for Memory Access Protection

Extend Page Table Entries (PTEs) with permission bits and it can say whether you’re allowed to read, write or execute the page
Check bits on each access and during a page fault
- If violated, generate exception (Access Protection exception)

Privilege Levels in x86

Basically you have different levels of protection called protection rings

Four privilege levels in x86 (referred to as rings)
- Ring 0: Highest privilege (operating system)
- Ring 1: Not widely used
- Ring 2: Not widely used
- Ring 3: Lowest privilege (user applications)
Supervisor = Kernel (in modern terminology)

x86: A Closer Look at the PDE/PTE

PDE: Page Directory Entry (32 bits)
PTE: Page Table Entry (32 bits)
[Protection: PDE’s Flags](https://youtu.be/7-TyqWqE3RE?t=2910
- Protects all 1024 pages in a page table
Protection: PTE’s Flags
- Protects one page at a time
Page Level Protection in x86
Protection: PDE + PTE = ???

Food for Thought: What If?

Your hardware is unreliable and someone can flip the access protection bits
- such that a user-level program can gain supervisor-level access (i.e., access to all data on the system)
- by flipping the access control bit from user to supervisor!
Can this happen?

Remember RowHammer?

One can predictably induce errors in most DRAM memory chips

One can predictably induce bit flips in commodity DRAM chips
- 80% of the tested DRAM chips are vulnerable
First example of how a simple hardware failure mechanism can create a widespread system security vulnerability

Google’s Original RowHammer Attack

If you want more…

Onur Mutlu, “The Story of RowHammer”
Computer Architecture, Fall 2020, Lecture 4b
- RowHammer (ETH Zürich, Fall 2020)
Computer Architecture, Fall 2020, Lecture 5a
- RowHammer in 2020: TRRespass (ETH Zürich, Fall 2020)
Computer Architecture, Fall 2020, Lecture 5b
- RowHammer in 2020: Revisiting RowHammer (ETH Zürich, Fall 2020)
Computer Architecture, Fall 2020, Lecture 5c
- Secure and Reliable Memory (ETH Zürich, Fall 2020)

Takeaway and Food for Thought

If hardware is unreliable, higher-level security and protection mechanisms (as in virtual memory) may be compromised
The root of security and trust is at the very low levels…
- in the hardware itself
- RowHammer, Spectre, Meltdown are recent key examples…
What should we assume the hardware provides?
How do we keep hardware reliable?
How do we design secure hardware?
How do we design secure hardware with high performance, high energy efficiency, low cost, convenient programming?

Some Issues in Virtual Memory

Three Major Issues in Virtual Memory

There are many other issues we will not cover in detail
- What happens on a context switch?
- How can you handle multiple page sizes?
- …

A Modern Example Virtual Memory System

Share on

Twitter Facebook LinkedIn

Juhyung Kim