Memory Hierarchy and Virtual Memory Systems
Memory Hierarchy of Existing Systems
Trends in Current Processors
- Integration of two and even three levels of cache on the processor chip.
- Growth of cache size and integration level.
- Separate caches and TLBs for data and instructions.
- Common features: out-of-order fetch (critical word first), non-blocking caches (hit under miss and miss under miss), and hardware prefetch of instructions and data.
- Software prefetch of data; first-level caches that support multiple accesses in the same cycle.
Virtual Memory (T.4)
Definition
Virtual memory management provides the illusion of unlimited space, where the addressable space is not limited by the main memory reserved for the program (physical space) but by the system’s virtual address range.
Advantages
- Virtual space can be much larger than physical space.
- Facilitates multiprogramming.
- Better use of main memory.
- Facilitates program protection.
- Transparent to the programmer.
Disadvantages
- Time overhead in memory management (address translation, block replacement, etc.).
- Overhead in exception handling.
- Hardware cost for fast and efficient memory management (MMU, Memory Management Unit).
Hardware-Software Translation
Processor -> Virtual Address -> Mapper -> Physical Address (page fault if page not present) -> Memory Hierarchy.
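A minimal sketch of this path in C, assuming 4 KB pages and a tiny flat page map (`frame_of`, `NUM_VPAGES`, and the fault convention are illustrative, not any particular MMU's interface):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS  12                 /* 4 KB pages (assumed size)       */
#define PAGE_SIZE  (1u << PAGE_BITS)
#define NUM_VPAGES 16                 /* tiny virtual space for the demo */

/* frame_of[v] is the physical frame of virtual page v, or -1 if the
 * page is not resident (referencing it raises a page fault). */
static int frame_of[NUM_VPAGES] = {
    3, -1, 7, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
};

/* Translate a virtual address; returns 1 on success, 0 on page fault. */
static int translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpage  = vaddr >> PAGE_BITS;       /* page-number field    */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);  /* displacement field   */
    if (vpage >= NUM_VPAGES || frame_of[vpage] < 0)
        return 0;                     /* fault: OS must load the page   */
    *paddr = ((uint32_t)frame_of[vpage] << PAGE_BITS) | offset;
    return 1;                         /* displacement is left unchanged */
}

int main(void)
{
    uint32_t pa;
    if (translate(0x2345, &pa))       /* vpage 2 is resident in frame 7 */
        printf("PA = 0x%x\n", pa);    /* -> PA = 0x7345                 */
    if (!translate(0x1000, &pa))      /* vpage 1 is not resident        */
        printf("page fault\n");
    return 0;
}
```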
Strategies for Implementing Virtual Memory
1) Internal MMU
MMU in the same integrated circuit as the processor. Common in current processors.
Advantages: Reduced access times, high portability, hardware sharing between processor and MMU.
2) External MMU
MMU in a separate integrated circuit.
Advantages: Saves space on the processor’s integrated circuit for other resources (cache, etc.).
Definitions
- Virtual Space: the name space, i.e., the set of addresses a process can generate (V).
- Physical Space: the space reserved in primary (main) memory for the process (M).
- Address Translator: maps V (virtual space) to M (physical space). On receiving x ∈ V, it returns y if x resides in M at position y, or ? otherwise, raising an exception (address fault): the reference x must be transferred from secondary to primary memory.
Management Rules for Resolving Faults
- Load Rule: when to transfer x to main memory.
- Placement Rule: where to place x in main memory.
- Replacement Rule: which virtual reference resident in main memory to evict to make room for x when main memory is fully occupied (a minimal sketch follows below).
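As an illustration of the placement and replacement rules, a minimal frame chooser in C (the structures and the FIFO policy are illustrative; real systems typically approximate LRU):

```c
#define NUM_FRAMES 8   /* physical page frames (assumed) */

static int frame_used[NUM_FRAMES];   /* 1 if the frame holds a page */
static int fifo_next = 0;            /* oldest frame, for eviction  */

/* Placement rule: use the first free frame.
 * Replacement rule: if memory is full, evict the oldest (FIFO). */
int choose_frame(void)
{
    for (int f = 0; f < NUM_FRAMES; f++)
        if (!frame_used[f]) {        /* a free frame is available */
            frame_used[f] = 1;
            return f;
        }
    int victim = fifo_next;          /* main memory fully occupied */
    fifo_next = (fifo_next + 1) % NUM_FRAMES;
    return victim;                   /* its page is written back if dirty */
}
```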
Classification of Virtual Memory Systems
Virtual references are grouped into blocks. A virtual reference consists of two fields: Block Number and Displacement within the Block (these blocks are transfer units between secondary and primary memory).
The address translator only translates the block field, leaving the displacement invariant. The translator’s size is proportional to the number of blocks in virtual (or physical) space.
Types of Virtual Memory Systems by Block Size
- Pages: All blocks are the same size. Blocks are called pages, and an exception is a ‘page fault’.
- Segments: Blocks are of different sizes. Blocks are called segments, and an exception is a ‘segment fault’.
- Paged Segmented System: Blocks (segments) are of unequal size but are multiples of a unit size (page).
Paging System (4.1)
Representation
Program P: P = (p1, p2, …, pN), where each pi is a virtual page.
Normally, size(pi) = p = 2^k for all pages.
A virtual address within pi is a_ij = (pi, dj), with pi ∈ P and 0 <= dj < p.
Reference string (the sequence of page references generated by running P): R = r(1) r(2) … r(n), with r(i) = pj, 1 <= i <= n, 1 <= j <= N.
Address Translation
Translation maps the virtual page number to the base address of the corresponding physical page (frame); adding the displacement, taken unchanged from the virtual address, yields the physical address (virtual page -> physical page).
Basic Translation Implementation Schemes
A) Direct Translation
The translator uses a page table indexed by the page number, whose entries carry a residence (valid) bit, a modification bit (for write-back), replacement-support bits, and protection bits. The page table is stored either in fast registers (fast but expensive translation) or in main memory (slower but cheaper).
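A sketch in C of such an entry and the direct lookup (field names and widths are illustrative, not those of any particular architecture):

```c
#include <stdint.h>

/* One page-table entry; bit widths are illustrative. */
typedef struct {
    uint32_t frame : 20;  /* physical page (frame) number       */
    uint32_t valid : 1;   /* residence bit: page is in memory   */
    uint32_t dirty : 1;   /* modification bit (for write-back)  */
    uint32_t ref   : 1;   /* reference bit (for replacement)    */
    uint32_t prot  : 3;   /* protection bits (r/w/x)            */
} pte_t;

/* Direct translation: the table is indexed by the page number. */
int lookup(const pte_t *table, uint32_t vpage, uint32_t *frame)
{
    pte_t e = table[vpage];   /* one extra memory access if the */
    if (!e.valid)             /* table lives in main memory     */
        return 0;             /* page fault                     */
    *frame = e.frame;
    return 1;
}
```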
B) Associative Translation
Translation uses a page table stored in associative (content-addressable) memory, searched by virtual page number. Only resident pages need entries, so its size is |M| / p entries, one per physical page frame.
C) Multi-Level Direct Translation
The high cost of fast registers and associative memories, together with the large size of page tables, restricts translation in registers and purely associative schemes to small systems. Most systems use a 2-level direct scheme, storing the page table itself in virtual space (a page table of page tables); some systems extend this to 3 levels.
Disadvantages: translation is slower, since a reference may require 3 or 4 accesses to main memory, and page-fault management is more complex.
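A minimal two-level walk in C, assuming a 32-bit address split into 10 directory bits, 10 table bits, and a 12-bit offset (a common split in classic 32-bit designs; all names are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

#define TBL_BITS  10
#define OFF_BITS  12

typedef struct { uint32_t frame; int valid; } pte_t;

/* Level 1: directory of pointers to level-2 tables (NULL if absent). */
typedef struct { pte_t *table; } pde_t;

/* Returns 1 and fills *paddr, or 0 on a fault at either level. */
int walk(const pde_t *dir, uint32_t va, uint32_t *paddr)
{
    uint32_t d = va >> (TBL_BITS + OFF_BITS);            /* bits 31..22 */
    uint32_t t = (va >> OFF_BITS) & ((1u << TBL_BITS) - 1);
    uint32_t o = va & ((1u << OFF_BITS) - 1);

    if (dir[d].table == NULL) return 0;   /* level-2 table not resident */
    pte_t e = dir[d].table[t];            /* second memory access       */
    if (!e.valid) return 0;               /* page itself not resident   */
    *paddr = (e.frame << OFF_BITS) | o;
    return 1;
}
```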
D) Combined Direct and Associative Translation
Combines the low cost of direct translation with the high speed of associative translation. The Translation Lookaside Buffer (TLB) stores recently referenced [virtual page, physical page] pairs with management bits. Its success is based on the principle of locality. Typical TLB size is 32 to 256 entries.
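A sketch in C of a fully associative TLB lookup; the linear search stands in for the parallel tag comparison done in hardware, and the size is an assumption within the typical range:

```c
#include <stdint.h>

#define TLB_ENTRIES 64   /* within the typical 32-256 range */

typedef struct {
    uint32_t vpage;   /* tag: virtual page number */
    uint32_t frame;   /* data: physical frame     */
    int      valid;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Hit: frame returned without touching the page table.
 * Miss: fall back to the (multi-level) page-table walk,
 * then refill one TLB entry with the [vpage, frame] pair. */
int tlb_lookup(uint32_t vpage, uint32_t *frame)
{
    for (int i = 0; i < TLB_ENTRIES; i++)     /* hardware compares */
        if (tlb[i].valid && tlb[i].vpage == vpage) { /* all tags   */
            *frame = tlb[i].frame;                   /* at once    */
            return 1;
        }
    return 0;   /* TLB miss */
}
```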
Cache Optimization
Victim Cache
A small, fully associative cache that holds lines evicted from the main cache on misses. If an evicted line is referenced again, it is swapped with the conflicting line in the main cache. Most useful for small, direct-mapped data caches.
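A sketch in C of the probe-and-swap step on a main-cache miss (structures are illustrative; a real victim cache holds full lines and swaps in hardware):

```c
#include <stdint.h>

#define VICTIM_ENTRIES 4   /* victim caches are small (assumed size) */

typedef struct { uint32_t tag; int valid; /* plus line data */ } line_t;

static line_t victim[VICTIM_ENTRIES];

/* On a main-cache miss, probe the victim cache; on a hit there,
 * swap the victim line with the conflicting main-cache line. */
int victim_probe(uint32_t tag, line_t *main_slot)
{
    for (int i = 0; i < VICTIM_ENTRIES; i++)
        if (victim[i].valid && victim[i].tag == tag) {
            line_t tmp = victim[i];   /* swap: the evicted line goes  */
            victim[i]  = *main_slot;  /* to the victim cache, and the */
            *main_slot = tmp;         /* found line returns           */
            return 1;                 /* served without going below   */
        }
    return 0;   /* true miss: go to the next memory level */
}
```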
Pseudo-Associative Caches
Before going down to lower memory levels, one or more alternative lines in the same cache (the 'pseudo-set') are probed. Requires careful line placement so that the slower second probe does not degrade performance.
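One common way to form the pseudo-set, shown as a sketch in C, is to probe the line whose index differs in the most significant index bit (a direct-mapped cache and this particular hash are assumptions):

```c
#include <stdint.h>

#define INDEX_BITS 10   /* 1024 sets (assumed) */

/* Second probe: flip the top index bit to pick the pseudo-set line. */
uint32_t pseudo_index(uint32_t index)
{
    return index ^ (1u << (INDEX_BITS - 1));
}
/* Access order matters: a line found only on the second probe costs
 * an extra cycle, so frequently used lines should sit in their
 * primary location (hence the careful placement noted above). */
```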
Prefetching of Instructions and Data
Overlaps prefetching with execution. Performance can decrease if prefetches interfere with demand misses. Prefetching can be hardware-based (into the cache or an external buffer) or compiler-controlled, as in the sketch below.
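A sketch of compiler-controlled (software) prefetch using the GCC/Clang builtin `__builtin_prefetch`; the lookahead distance of 8 elements is an assumption to be tuned per machine:

```c
/* Software prefetch: hint the hardware to start bringing a[i + 8]
 * toward the cache while a[i] is being consumed. */
double sum(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)   /* stay in bounds; distance is machine-tuned */
            __builtin_prefetch(&a[i + 8], /* rw = */ 0, /* locality = */ 1);
        s += a[i];
    }
    return s;
}
```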
Compile-Time Optimization
Rearranging code or relocating data reduces the miss rate. Techniques include merging arrays, loop interchange, and blocking (tiling), as in the sketch below.
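For example, a blocked matrix multiply in C; the tile size B is an assumption that must be tuned to the cache size:

```c
#define N 512
#define B 32   /* tile size, tuned so a tile set fits in cache (assumed) */

/* Blocked matrix multiply: C += A * B_. Each B x B tile of the
 * operands is reused many times while it is still cache-resident,
 * cutting capacity misses compared with the naive i-j-k loops. */
void matmul_blocked(double C[N][N], double A[N][N], double B_[N][N])
{
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double a = A[i][k];        /* reused across j */
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += a * B_[k][j];
                    }
}
```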
Reducing the Miss Penalty (3.7)
Giving Read Misses Priority over Writes
In a write-through cache, add a write buffer so reads need not wait for writes; a read miss must first check the buffer for a conflicting address (see the sketch below). In a write-back cache, add a buffer for the dirty block being replaced; if a new miss occurs before it drains, the CPU stalls until the buffer is emptied.
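A sketch in C of the conflict check that lets a read miss bypass pending writes safely (structure and sizes are assumptions):

```c
#include <stdint.h>

#define WB_ENTRIES 4   /* write-buffer depth (assumed) */

typedef struct { uint32_t addr; uint32_t data; int valid; } wbuf_t;

static wbuf_t wbuf[WB_ENTRIES];

/* On a read miss, scan the write buffer: if the wanted word is
 * still waiting to be written, forward it instead of stalling
 * until the whole buffer drains. */
int read_miss(uint32_t addr, uint32_t *data)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wbuf[i].valid && wbuf[i].addr == addr) {
            *data = wbuf[i].data;   /* forward from the buffer */
            return 1;
        }
    return 0;   /* no conflict: the read may bypass pending writes */
}
```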
Sub-Block Placement
In direct-mapped caches, each sub-block of a line has its own valid bit, so only a sub-block need be fetched on a miss. Reduces both the miss penalty and the tag size.
Not Waiting for the Entire Line to Arrive
Techniques include early restart (resume execution as soon as the requested word arrives) and critical word first (out-of-order fetch of the line, starting with the needed word).
Non-Blocking Caches
The cache keeps serving accesses while one or more misses are outstanding. Increases controller complexity but does not significantly affect hit time. Options: hit under one miss, hit under multiple misses, and miss under miss.
Second-Level Cache
Larger and slower than the first-level cache; its goal is to reduce the miss penalty seen by the first level. Typical design choices: large capacity, high associativity, and long lines.
Reducing Hit Time (3.8)
Small and Simple Caches
Smaller hardware is faster, and a small cache can be placed on the processor chip.
Pipelined Writes
Tag comparison and the actual write are pipelined, allowing one write per cycle without slowing down reads.
Avoiding Address Translation During Cache Indexing
Virtual caches are indexed (and tagged) with virtual addresses, so hits need no translation; their problems include process switches (requiring flushes or process identifiers in the tags), synonyms/aliases, and I/O, which uses physical addresses. Physical caches index and tag with physical addresses.
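A common compromise is a virtually indexed, physically tagged cache: if every index bit falls within the page offset, indexing can start in parallel with the TLB lookup and synonyms cannot land in different sets. The constraint as a small check in C (all sizes are assumptions):

```c
#include <stdio.h>

int main(void)
{
    unsigned page_size  = 4096;    /* bytes (assumed) */
    unsigned cache_size = 32768;   /* bytes           */
    unsigned assoc      = 8;       /* ways            */

    /* Index + offset bits stay within the page offset iff
     * cache_size / assoc <= page_size. */
    if (cache_size / assoc <= page_size)
        printf("can index with the (untranslated) page offset\n");
    else
        printf("index needs translated bits: alias problems\n");
    return 0;
}
```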
** Sections 3.6 to 3.8 cover cache performance improvements **
Cache Coherence (3.9)
Input/Output
I/O transfers (typically via DMA) can create inconsistent copies between the cache and main memory. Performing I/O on buffers in main memory is the preferred solution.
Multiprocessor
Multiple processors, each with its own cache, require the copies of shared lines to be kept coherent. Cache coherence protocols:
- Directory-Based: the sharing state of each line is kept in a single location (the directory). Scales to many processors.
- Snooping: each cache keeps the state of its own lines and monitors a common bus; requires a broadcast medium.
Protocols are either write-invalidate or write-update (write-broadcast). Most multiprocessors use write-back caches to reduce bus traffic. Line length matters because of false sharing. A minimal write-invalidate sketch follows.
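A sketch in C of write-invalidate state transitions for one line, using three MSI-style states over an assumed snooping bus (the bus actions are left as comments):

```c
typedef enum { INVALID, SHARED, MODIFIED } state_t;

/* Local processor write: gain exclusive ownership first. */
state_t on_cpu_write(state_t s)
{
    if (s != MODIFIED) {
        /* broadcast an invalidate: every other copy -> INVALID */
    }
    return MODIFIED;
}

/* Snooped write (or invalidate) from another processor. */
state_t on_bus_write(state_t s)
{
    if (s == MODIFIED) {
        /* write-back cache: flush the dirty line to memory first */
    }
    return INVALID;   /* our copy is now stale */
}

/* Snooped read: a MODIFIED line is written back and demoted. */
state_t on_bus_read(state_t s)
{
    return (s == MODIFIED) ? SHARED : s;
}
```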