Memory Hierarchy and Virtual Memory Systems

Memory Hierarchy of Existing Systems

Trends in Current Processors

  • Integration of two, and even three, levels of cache on the processor chip.
  • Growth in cache size and level of integration.
  • Separate caches and TLBs for data and instructions.
  • Common features: out-of-order fetch, non-blocking caches (hit under miss, miss under miss), and hardware prefetching of instructions and data.
  • Software data prefetching; first-level caches can service multiple accesses in the same cycle.

Virtual Memory (T.4)

Definition

Virtual memory management provides the illusion of unlimited space: the addressable space is limited not by the main memory allocated to the program (physical space) but by the system's virtual address range.

Advantages

  • Virtual space can be much larger than physical space.
  • Facilitates multiprogramming.
  • Better use of main memory.
  • Facilitates program protection.
  • Transparent to the programmer.

Disadvantages

  • Time overhead in memory management (address translation, block replacement, etc.).
  • Overhead in exception handling.
  • Hardware cost for fast and efficient memory management (the MMU – Memory Management Unit).

Hardware-Software Translation

Processor -> Virtual Address -> Mapper -> Physical Address (page fault if page not present) -> Memory Hierarchy.
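
A minimal sketch of this flow in C (names and sizes are hypothetical; a real MMU does this in hardware):

    #include <stdio.h>

    #define NUM_VPAGES 64
    #define NO_FRAME   (-1)

    static int frame_of[NUM_VPAGES];          /* vpage -> frame, or NO_FRAME */

    static int translate(int vpage) {
        if (frame_of[vpage] == NO_FRAME) {
            /* page fault: the OS loads the page from secondary memory,
               updates frame_of[vpage], and the access is retried */
            return NO_FRAME;
        }
        return frame_of[vpage];               /* physical frame number */
    }

    int main(void) {
        for (int v = 0; v < NUM_VPAGES; v++) frame_of[v] = NO_FRAME;
        frame_of[3] = 17;                     /* pretend page 3 is resident */
        printf("vpage 3 -> frame %d\n", translate(3));
        printf("vpage 5 -> %d (page fault)\n", translate(5));
        return 0;
    }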

Strategies for Implementing Virtual Memory

1) Internal MMU

MMU in the same integrated circuit as the processor. Common in current processors.

Advantages: reduced access times, greater portability, and hardware sharing between the processor and the MMU.

2) External MMU

MMU in a separate integrated circuit.

Advantages: Saves space on the processor’s integrated circuit for other resources (cache, etc.).

Definitions

  • Virtual Space: the namespace, the set of addresses a process can generate.
  • Physical Space: the memory space reserved in main (primary) memory for the process.
  • Address Translator: maps V (virtual space) into M (physical space). On receiving x ∈ V, it returns y if x resides in M at position y, or φ otherwise, raising an exception or address fault: the referenced item x must then be transferred from secondary to primary memory.

Addressing Rules for Resolving Faults

  1. Load Rule: when to transfer x (normally on demand, at the fault).
  2. Placement Rule: where to place x in main memory.
  3. Replacement Rule: which virtual reference should be evicted from main memory to make room for x (if main memory is fully occupied)? A minimal sketch of the three rules follows this list.
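
A minimal sketch of the three rules in C, assuming demand loading, placement in any free frame, and FIFO replacement (all names hypothetical):

    #include <stdio.h>

    #define NFRAMES 3
    #define EMPTY  (-1)

    static int frame[NFRAMES];   /* which virtual page each frame holds */
    static int next_victim = 0;  /* FIFO pointer for the replacement rule */

    /* Returns the frame holding page p, loading it on a fault. */
    static int access_page(int p) {
        for (int f = 0; f < NFRAMES; f++)
            if (frame[f] == p) return f;                    /* already resident */
        /* Load rule: transfer p now, on demand (at the fault). */
        for (int f = 0; f < NFRAMES; f++)
            if (frame[f] == EMPTY) { frame[f] = p; return f; }  /* placement rule */
        /* Replacement rule: FIFO victim when memory is full. */
        int f = next_victim;
        next_victim = (next_victim + 1) % NFRAMES;
        frame[f] = p;
        return f;
    }

    int main(void) {
        for (int f = 0; f < NFRAMES; f++) frame[f] = EMPTY;
        int refs[] = {0, 1, 2, 0, 3, 1};                    /* a reference string R */
        for (int i = 0; i < 6; i++)
            printf("page %d -> frame %d\n", refs[i], access_page(refs[i]));
        return 0;
    }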

Classification of Virtual Memory Systems

Virtual references are grouped into blocks. A virtual reference consists of two fields: the block number and the displacement within the block (blocks are the transfer unit between secondary and primary memory).

The address translator translates only the block field, leaving the displacement invariant. The translator's size is proportional to the number of blocks in virtual (or physical) space.

Types of Virtual Memory Systems by Block Size

  1. Pages: All blocks are the same size. Blocks are called pages, and an exception is a ‘page fault’.
  2. Segments: Blocks are of different sizes. Blocks are called segments, and an exception is a ‘segment fault’.
  3. Paged Segmented System: Blocks (segments) are of unequal size but are multiples of a unit size (page).

Paging System (4.1)

Representation

Program P: P = (p1, p2, …, pN), with pi a virtual page.

Normally size(pi) = p = 2^k.

Virtual address within pi: aij = (pi, dj), with pi ∈ P, 0 <= dj < p.

Reference string (the page references generated by running P): R = r(1) r(2) … r(n), with r(i) = pj, 1 <= i <= n, 1 <= j <= N.
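
Because p = 2^k, splitting a virtual address into (pi, dj) is a shift and a mask; a quick illustration in C (k = 12, i.e. 4 KB pages, chosen arbitrarily):

    #include <stdio.h>
    #include <stdint.h>

    #define K 12                                   /* page size p = 2^12 bytes */

    int main(void) {
        uint32_t addr   = 0x12345;
        uint32_t page   = addr >> K;               /* pi: virtual page number */
        uint32_t offset = addr & ((1u << K) - 1);  /* dj: 0 <= dj < p */
        printf("page=%u offset=%u\n", (unsigned)page, (unsigned)offset);
        return 0;
    }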

Address Translation

The translator maps the virtual page number to a physical page; the physical page's base address plus the displacement, taken unchanged from the virtual address, gives the physical address (virtual page -> physical page, displacement invariant).

Basic Translation Implementation Schemes

A) Direct Translation

The translator uses a page table indexed by the virtual page number, with bits for residence (validity), modification (for write-back), replacement, and protection. The page table is stored either in fast registers (fast but expensive translation) or in main memory (slower but cheaper).
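
A sketch of such a one-level table in C, with the bits named above (the field layout is illustrative, not that of any particular architecture):

    #include <stdint.h>

    /* One page-table entry with the management bits named above. */
    typedef struct {
        uint32_t frame      : 20;  /* physical page number */
        uint32_t valid      : 1;   /* residence bit */
        uint32_t dirty      : 1;   /* modification bit (write-back) */
        uint32_t referenced : 1;   /* replacement bit */
        uint32_t prot       : 3;   /* protection bits */
    } pte_t;

    /* Direct translation: the table is indexed by the virtual page number. */
    static int translate_direct(const pte_t *table, uint32_t vpage,
                                uint32_t *frame_out) {
        pte_t e = table[vpage];
        if (!e.valid) return -1;   /* page fault */
        *frame_out = e.frame;
        return 0;
    }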

B) Associative Translation

Translation uses a page table stored in associative memory, with one entry per physical page frame: |M| / p entries.

C) Multi-Level Direct Translation

The high cost of fast registers and associative memories, together with the large size of page tables, restricts the use of purely direct translation to small systems and keeps associative schemes small. Most systems use a 2-level direct scheme (DCIR), storing the page table in virtual space (a page table of page tables). Some systems extend this to 3 levels.

Disadvantages: translation is slower, requiring 3 or 4 accesses to main memory, and page-fault handling is more complex.
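
A sketch of a two-level walk in C, splitting the virtual page number into an outer and an inner index (bit widths hypothetical); each level costs one memory access, plus the access to the data itself:

    #include <stdint.h>
    #include <stddef.h>

    #define INNER_BITS 10
    #define INNER_MASK ((1u << INNER_BITS) - 1)

    typedef struct { uint32_t frame; int valid; } pte2_t;

    /* dir[i] points to an inner page table, or is NULL if not resident. */
    static int walk_2level(pte2_t **dir, uint32_t vpage, uint32_t *frame_out) {
        const pte2_t *inner = dir[vpage >> INNER_BITS];  /* 1st memory access */
        if (!inner) return -1;                           /* fault */
        pte2_t e = inner[vpage & INNER_MASK];            /* 2nd memory access */
        if (!e.valid) return -1;                         /* fault */
        *frame_out = e.frame;                            /* then the data access */
        return 0;
    }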

D) Combined Direct and Associative Translation

Combines the low cost of direct translation with the high speed of associative translation. A Translation Lookaside Buffer (TLB) stores recently referenced [virtual page, physical page] pairs together with their management bits. Its effectiveness rests on the principle of locality. Typical TLB sizes are 32 to 256 entries.
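
A sketch of the combined scheme: the TLB is searched first, and only a TLB miss falls back to the page table (names hypothetical; real TLBs are usually set-associative or fully associative, but a direct-mapped one keeps the sketch short):

    #include <stdint.h>

    #define TLB_SIZE 64

    typedef struct { uint32_t vpage, frame; int valid; } tlb_entry_t;
    static tlb_entry_t tlb[TLB_SIZE];

    /* Stands in for any of the page-table schemes above. */
    extern int walk_page_table(uint32_t vpage, uint32_t *frame_out);

    static int translate_tlb(uint32_t vpage, uint32_t *frame_out) {
        tlb_entry_t *e = &tlb[vpage % TLB_SIZE];   /* direct-mapped TLB */
        if (e->valid && e->vpage == vpage) {       /* TLB hit: fast path */
            *frame_out = e->frame;
            return 0;
        }
        if (walk_page_table(vpage, frame_out) != 0)
            return -1;                             /* page fault */
        e->vpage = vpage;                          /* refill the TLB entry */
        e->frame = *frame_out;
        e->valid = 1;
        return 0;
    }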

Reducing the Miss Rate (3.6)

Victim Cache

A small, fully associative cache that holds lines evicted from the main cache on a miss. If a referenced line is found there, it is swapped with a line in the main cache. Most useful for small, direct-mapped data caches.

Pseudo-Associative Caches

On a miss, other lines in the cache (the 'pseudo-set') are checked before going to lower memory levels. Requires careful line placement to avoid performance degradation.

Prefetching of Instructions and Data

Overlaps prefetching with execution. Performance may decrease if prefetches interfere with demand misses. Prefetching can be hardware-based (into the cache or into an external buffer) or compiler-controlled (software prefetch).

Compile-Time Optimization

Rearranging code or relocating data reduces miss rates. Techniques include array merging, loop interchange, and blocking; loop interchange is sketched below.
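
For example, loop interchange turns a column-order traversal of a row-major C array into a row-order one, so consecutive references fall in the same cache line:

    #define N 512
    double a[N][N];

    /* Before: column-order traversal, stride N, poor spatial locality. */
    void sum_cols(void) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] += 1.0;
    }

    /* After interchange: row-order traversal, one miss per cache line. */
    void sum_rows(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] += 1.0;
    }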

Reducing the Miss Penalty (3.7)

Giving Read Misses Priority over Writes

In a write-through cache, add a write buffer (a read miss must check it, or wait for it to drain). In a write-back cache, add a buffer to hold the dirty block being written back, stalling the CPU if a new miss occurs before the buffer is emptied.

Subblock Placement

In direct-mapped caches, subblocks carry their own validity bits. Reduces both the miss penalty (only a subblock is fetched) and the tag size.

Not Waiting for the Entire Line to Arrive

Techniques include early restart and out-of-order fetch (critical word first).

Non-Blocking Cache

Increases controller complexity but does not significantly affect hit time. Options include hit under miss, hit under multiple misses, and miss under miss.

Second-Level Cache

Larger and slower than the level-1 cache, aimed at reducing the miss penalty. Uses large capacity, high associativity, and long lines. The resulting average memory access time can be computed as sketched below.
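
A small numeric sketch of the two-level formula AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + MissRate_L2 x MissPenalty_L2); all cycle counts and rates below are illustrative:

    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1.0,   miss_rate_l1 = 0.04;  /* illustrative values */
        double hit_l2 = 10.0,  miss_rate_l2 = 0.25;  /* local L2 miss rate */
        double penalty_l2 = 100.0;                   /* main-memory access */

        double amat = hit_l1
                    + miss_rate_l1 * (hit_l2 + miss_rate_l2 * penalty_l2);
        printf("AMAT = %.2f cycles\n", amat);        /* 1 + 0.04*35 = 2.40 */
        return 0;
    }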

Reducing the Hit Time (3.8)

Small and Simple Cache

Smaller hardware is faster. Small caches can be included on the processor chip.

Pipelined Writes

The tag comparison for a write is overlapped with the data write of the previous one, allowing one write per cycle without affecting reads.

Avoiding Address Translation During Cache Indexing

Virtual caches index (and tag) with virtual addresses, avoiding translation on a hit; issues include process switches (flushing or process identifiers), synonyms/aliases, and I/O, which uses physical addresses. Physical caches index with physical addresses. A common compromise, sketched below, indexes with page-offset bits, which are identical in both.
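
A sketch of the constraint that makes this compromise work (sizes hypothetical): if the set index plus the line offset fit within the page offset, the index bits are untranslated, so the cache can be indexed while the TLB translates:

    #include <stdio.h>

    #define PAGE_BITS   12   /* 4 KB pages */
    #define OFFSET_BITS 5    /* 32-byte lines */
    #define INDEX_BITS  7    /* 128 sets */

    int main(void) {
        if (INDEX_BITS + OFFSET_BITS <= PAGE_BITS)
            printf("virtually indexed, physically tagged: OK\n");
        else
            printf("index uses translated bits: synonyms possible\n");
        return 0;
    }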

Note: Sections 3.6 to 3.8 cover cache performance improvements.

Cache Coherence (3.9)

Input/Output

Device I/O can create inconsistent copies between the cache and main memory. Making I/O work on main memory (through an I/O buffer) is the preferred solution.

Multiprocessor

Multiple processors require coherent cache copies. Cache coherence protocols:

  1. Directory-Based: the sharing state of each line is kept in a single directory. Scales to large numbers of processors.
  2. Snooping: each cache keeps the state of the lines it holds and monitors a common (broadcast) bus.

Protocols are write-invalidate or write-broadcast (write-update). Most multiprocessors use write-back caches to reduce bus traffic. Line length matters because of false sharing. A minimal write-invalidate sketch follows.
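
A minimal sketch of write-invalidate snooping with MSI-style states (a simplification; real protocols such as MESI add more states and a dedicated read-for-ownership bus transaction):

    /* Per-line state in each cache. */
    typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

    /* Bus transaction that a local access may need to broadcast. */
    typedef enum { BUS_NONE, BUS_READ, BUS_INVALIDATE } bus_op_t;

    /* Transition on a local access; sets *bus_op when a broadcast is needed. */
    line_state_t on_cpu_access(line_state_t s, int is_write, bus_op_t *bus_op) {
        *bus_op = BUS_NONE;
        if (s == INVALID) {
            /* miss: fetch the line (read-for-ownership on a write) */
            *bus_op = is_write ? BUS_INVALIDATE : BUS_READ;
            return is_write ? MODIFIED : SHARED;
        }
        if (s == SHARED && is_write) {
            *bus_op = BUS_INVALIDATE;      /* invalidate the other copies */
            return MODIFIED;
        }
        return s;                          /* hit, no bus traffic */
    }

    /* Transition when another cache's operation is snooped on the bus. */
    line_state_t on_snoop(line_state_t s, bus_op_t op) {
        if (op == BUS_INVALIDATE) return INVALID;
        if (op == BUS_READ && s == MODIFIED) return SHARED; /* write back */
        return s;
    }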