Operating System Concepts: Memory, Concurrency, and Distributed Systems
Dynamic Memory Allocation Fundamentals
- Highly parallel applications are common.
- Dynamic memory allocation is ubiquitous.
- Serial memory allocators are inadequate for parallel systems.
Parallel Allocator Requirements
- Speed: As fast as a serial allocator on a single-processor system.
- Scalability: Performance scales linearly with the number of processors.
- False Sharing Avoidance: Does not introduce false sharing of cache lines.
- Low Fragmentation Ratio: Keep the ratio of memory consumed by the allocator to the memory actually requested by the application low.
Understanding False Sharing
Multiple processors share bytes on the same cache line without truly sharing variables or data.
Active False Sharing Explained
malloc() returns heap objects on the same cache line to different threads.
Passive False Sharing Explained
free() allows a future malloc() to produce false sharing. Each thread free()s memory to its own freelist, so a chunk of memory migrates from Thread 1’s freelist to Thread 2’s.
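To make this concrete, here is a hedged sketch (not from the original notes): two threads update two separately malloc()'d counters that a serial allocator may well place on the same cache line, so each write can invalidate the other thread's cached copy. The iteration count and the 64-byte line size are illustrative assumptions.

```c
/* Hypothetical illustration of active false sharing: two small heap objects
 * that a serial allocator may place on the same cache line are updated by
 * two threads, causing the line to ping-pong between caches. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 10000000

static void *bump(void *arg) {
    volatile long *counter = arg;
    for (long i = 0; i < ITERS; i++)
        (*counter)++;   /* each increment may invalidate the other thread's cached line */
    return NULL;
}

int main(void) {
    /* Two 8-byte objects allocated back to back often share a 64-byte cache line. */
    long *a = malloc(sizeof *a);
    long *b = malloc(sizeof *b);
    *a = *b = 0;

    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, a);
    pthread_create(&t2, NULL, bump, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("a=%ld b=%ld\n", *a, *b);
    free(a);
    free(b);
    return 0;
}
```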
Memory Blowup in Parallel Allocators
Blowup: Unbounded fragmentation, or fragmentation that grows linearly with the number of CPUs. This is caused by a parallel allocator not using freed memory to satisfy future allocation requests.
Hoard: A Scalable Memory Allocator
- Per-processor heaps and a single global heap.
- Threads are mapped to a processor’s heap.
- Heaps are divided into page-aligned superblocks.
- Superblocks are divided into blocks.
Bounding Memory Blowup with Hoard
A heap owns some superblocks, which are assigned from the global heap on allocation. A heap only allocates from superblocks it owns. When no memory is available in any superblock on a thread’s heap:
- Obtain a superblock from the global heap, if available.
- If not (the global heap is empty too), request a new superblock from the OS and add it to the thread’s heap.
Hoard does not return empty superblocks to the OS. A superblock is returned to the global heap when at least the empty fraction f of its blocks are not in use.
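The following is a simplified, hypothetical model of this allocation policy, not Hoard’s actual code: size classes, per-heap locking, and the empty-fraction bookkeeping are all omitted.

```c
/* Simplified, hypothetical model of Hoard's allocation path: a per-processor
 * heap tries its own superblocks first, then the global heap, and only then
 * asks the "OS" for a fresh superblock. */
#include <stdlib.h>

#define BLOCKS_PER_SUPERBLOCK 64

struct superblock {
    struct superblock *next;
    int free_blocks;              /* blocks not currently in use */
};

struct heap {
    struct superblock *owned;     /* superblocks owned by this heap */
};

static struct heap global_heap;   /* shared fallback heap (locking omitted) */

static struct superblock *pop(struct heap *h) {
    struct superblock *sb = h->owned;
    if (sb) h->owned = sb->next;
    return sb;
}

static void push(struct heap *h, struct superblock *sb) {
    sb->next = h->owned;
    h->owned = sb;
}

/* Find a superblock with a free block for this thread's heap. */
struct superblock *get_superblock(struct heap *mine) {
    for (struct superblock *owned = mine->owned; owned; owned = owned->next)
        if (owned->free_blocks > 0)
            return owned;                      /* 1. allocate only from owned superblocks */

    struct superblock *sb = pop(&global_heap); /* 2. otherwise take one from the global heap */
    if (!sb) {
        sb = calloc(1, sizeof *sb);            /* 3. otherwise get a new superblock from the OS */
        if (!sb) return NULL;
        sb->free_blocks = BLOCKS_PER_SUPERBLOCK;
    }
    push(mine, sb);
    return sb;
}

int main(void) {
    struct heap mine = {0};
    return get_superblock(&mine) ? 0 : 1;
}
```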
Hoard’s Approach to Avoiding False Sharing
- Heap allocations are made from superblocks.
- Each superblock is owned by one heap.
- Freed memory returns to the allocating superblock.
- However, false sharing is not completely avoided: superblocks returned to the global heap are not necessarily empty.
- Multiple running threads can also be mapped to the same heap.
Minimizing Contention in Hoard
- Allocation by one thread and freeing by another is uncommon.
- Lock contention is low for scalable applications.
- Steady-state memory use is within a constant factor of maximum memory use.
Optimizing Physical Page Placement
Careful placement of physical pages can be better than random placement due to:
- Cache conflicts.
- NUMA multiprocessors.
Virtual Address Considerations
- The virtual address does not need to be translated before checking the cache.
- Application programmers can reason about conflicts.
- Cache needs to be flushed on context switch.
Physical Address Considerations
- Data may stay in cache across context switches.
- Virtual address must be translated before checking the cache.
- Conflicts depend on what physical page is allocated.
Page Coloring for Cache Optimization
Number of Colors = (Cache Size) / (Page Size * Associativity)
A page’s color is (page number) % (number of colors).
Page Coloring: Assign a color to virtual and physical pages. On a page fault, allocate a physical page with the same color as the virtual page.
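For example, assuming a 2 MB, 4-way set-associative cache with 4 KB pages (illustrative numbers, not from the notes), the arithmetic looks like this:

```c
/* Illustrative page-coloring arithmetic; the cache parameters and the sample
 * address are assumptions. */
#include <stdio.h>

int main(void) {
    const unsigned long cache_size    = 2UL * 1024 * 1024; /* 2 MB cache (assumed) */
    const unsigned long page_size     = 4096;              /* 4 KB pages */
    const unsigned long associativity = 4;                 /* 4-way set associative (assumed) */

    unsigned long num_colors = cache_size / (page_size * associativity); /* = 128 */

    unsigned long vaddr = 0x7f3a12345000UL;                /* arbitrary example address */
    unsigned long page_number = vaddr / page_size;
    unsigned long color = page_number % num_colors;

    printf("colors = %lu, page %lu has color %lu\n", num_colors, page_number, color);
    return 0;
}
```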
Bin Hopping Technique
Assign colors to physical pages and keep per-color free lists, as with page coloring. On a page fault, allocate a physical page of the next color after the one previously allocated. However, if the number of cache misses is consistent across trials, one could hypothesize that the OS is not altering the mapping of pages to cache lines between runs, which would suggest a static technique like page coloring rather than bin hopping.
Understanding the Translation Lookaside Buffer (TLB)
The TLB caches virtual-to-physical translations and is consulted on every virtual address access. TLB coverage is the portion of the application’s working set that can be mapped in the TLB at one time. Misses require a slow walk of the page tables in memory; high coverage implies a low miss rate.
Strategies to Increase TLB Coverage
- Use multiple page sizes at the same time.
- Increase TLB coverage without increasing TLB size.
- Keep internal fragmentation and disk traffic low (use superpage TLB).
Superpages: Definition and Characteristics
- Superpage sizes must be power-of-two multiples of the base page size.
- Must be contiguous and aligned in both virtual and physical memory.
- Requires only a single TLB entry for the entire superpage.
- Must be supported by the Memory Management Unit (MMU) of that processor.
Reservation-Based Superpage Creation
- A superpage is allocated at page fault time.
- Size is specified by the user, making it non-transparent.
Navarro’s Approach to Superpage Management
Navarro balances the need for large contiguous physical pages to create superpages against the uncertainty of whether a given range of virtual address space will benefit from large pages. Marking a large reservation as early as possible increases the likelihood that superpages can be created if the application is actually using a lot of pages in the vicinity of the faulting address. However, allowing reservations to be pre-empted makes it possible to use that physical memory for other needs if the application does not make use of other virtual pages within the reserved space.
Addressing HugeTLBFS Limitations
- Enabled by default for all applications.
- Transparent to the application; no configuration required.
- No pre-allocation of huge pages that are unusable for anything else.
- Swappable, by breaking a hugepage into 4KB base pages.
Non-Uniform Memory Access (NUMA)
NUMA: A multiprocessor design where each processor (or small set of processors) has a bank of local memory, but can also access remote memory.
NUMA Memory Policies: MPOL_BIND
Allocate only from nodes specified in the nodemask.
NUMA Memory Policies: MPOL_INTERLEAVE
Interleave allocations from nodes specified in the nodemask.
NUMA Memory Policies: MPOL_PREFERRED
Prefer allocation from the first node in the nodemask, falling back to “nearby” nodes if the preferred node is low on free memory.
NUMA Memory Policies: MPOL_LOCAL
Allocate from the node local to the CPU that triggered the allocation; use other nodes if the local node is low on memory.
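A hedged sketch of how a process might request one of these policies on Linux, using set_mempolicy() from <numaif.h> (libnuma, link with -lnuma); the choice of MPOL_INTERLEAVE over nodes 0 and 1 is an assumption for illustration.

```c
/* Sketch: request MPOL_INTERLEAVE over NUMA nodes 0 and 1 for this process.
 * Assumes a Linux system with at least two nodes and libnuma headers. */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    unsigned long nodemask = 0;
    nodemask |= 1UL << 0;   /* node 0 */
    nodemask |= 1UL << 1;   /* node 1 (assumed to exist) */

    /* Apply the policy to all future allocations made by this process. */
    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8) != 0) {
        perror("set_mempolicy");
        return 1;
    }

    /* Pages backing this buffer are interleaved across nodes 0 and 1
     * when they are first touched. */
    size_t len = 64 * 1024 * 1024;
    char *buf = malloc(len);
    if (!buf) return 1;
    memset(buf, 0, len);

    printf("allocated %zu bytes with MPOL_INTERLEAVE\n", len);
    free(buf);
    return 0;
}
```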
Understanding System Interrupts
Interrupts: An event external to the currently executing process that causes a change in the normal flow of instruction execution; usually generated by hardware devices external to the CPU.
Hardware Interrupt Handling Process
- The interrupt controller signals the CPU that an interrupt has occurred and passes the interrupt number.
- The CPU senses (checks) the interrupt request line after every instruction.
- The CPU saves the current program state and transfers control to the interrupt handler.
- When the interrupt has been handled, the program state is reloaded and the program resumes.
Software Interrupt Handling: Immediate vs. Deferred
- The part that has to be done immediately.
- The part that can be deferred until later.
Example: Network Interrupt
- Immediate: Copy packet off the network card, respond to the card.
- Deferred: Process packet header and pass to the destination application.
Example: Timer Interrupt
- Immediate: Update system timers.
- Deferred: Recalculate process priorities, check for expiry of timeslice, schedule a new process.
Interrupt handling is split this way because the device that generated the interrupt is typically disabled (cannot generate any more interrupts) until it gets a response. Doing all the work associated with the interrupt immediately may take too long, leading to dropped network packets, inaccurate time accounting, and so on.
Signals in Operating Systems
Signals: Allow a process to respond to asynchronous external events (or synchronous internal events).
The Role of Trampoline Functions in Signals
The trampoline function manages the return from a signal handler: it ensures that the handler executes as expected and that, upon completion, the process safely returns to its original execution flow by issuing the sigreturn syscall. Classically, signal delivery involves the kernel injecting this trampoline code onto the user stack.
Security Risks of Trampoline Functions
Requires the user stack to be executable, which is a security risk. FreeBSD and Linux both now put the trampoline on a separate page that is read-only to the user process.
Signal Context (Sigcontext) and Sigreturn
The signal context (sigcontext), which contains the CPU registers and other state information prior to signal handling, is copied to the user stack and then restored by the sigreturn syscall. This step ensures that after signal handling, the process can resume its state correctly and continue execution.
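As a user-level illustration only (the trampoline and sigreturn are supplied by the kernel and are not visible here), this hedged sketch installs a SIGINT handler with sigaction(); when the handler returns, control passes through the trampoline, which issues sigreturn to restore the saved context.

```c
/* Sketch: installing a signal handler.  When handle_sigint() returns, the
 * kernel-provided trampoline runs and issues sigreturn to restore the
 * interrupted context; none of that is visible at this level. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;

static void handle_sigint(int signo) {
    (void)signo;
    got_signal = 1;     /* only async-signal-safe work belongs here */
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = handle_sigint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;

    if (sigaction(SIGINT, &sa, NULL) != 0) {
        perror("sigaction");
        return 1;
    }

    printf("waiting for SIGINT (Ctrl-C)...\n");
    while (!got_signal)
        pause();        /* returns after the handler (and trampoline) complete */

    printf("resumed normal execution after signal handling\n");
    return 0;
}
```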
Inter-Process Communication: FIFO (Named Pipes)
- Can be opened with the open() syscall as needed (not just prior to fork()).
- Can be used by processes not directly related to each other (see the sketch after this list).
- Unidirectional.
- No message boundaries.
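A hedged sketch of the writer side of a FIFO; the path /tmp/demo_fifo is an arbitrary example, and an unrelated reader process would open the same path with O_RDONLY.

```c
/* Sketch: writer side of a named pipe (FIFO).  A separate, unrelated process
 * can open the same path for reading and receive the bytes. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "/tmp/demo_fifo";   /* arbitrary example path */

    /* Create the FIFO if it does not already exist. */
    if (mkfifo(path, 0600) != 0 && errno != EEXIST) {
        perror("mkfifo");
        return 1;
    }

    /* open() blocks until a reader opens the other end. */
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char *msg = "hello via FIFO\n";
    write(fd, msg, strlen(msg));   /* byte stream: no message boundaries */
    close(fd);
    return 0;
}
```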
Inter-Process Communication: Sockets
- Sockets support various semantics.
- Communicating processes each create a socket and connect them together.
- Can be used for local or remote communication.
Evolution of I/O Multiplexing: Select and Poll
The design of select() and poll() was reasonable at the time due to:
- System Requirements: Not many file descriptors (or connections, or sockets) to check for activity.
- Hardware Characteristics: Physical memory was quite limited, so there was no desire to keep OS memory allocated between checks for activity.
In modern systems, the number of connections (and hence file descriptors to check) has increased dramatically (e.g., thousands for high-performance web servers). This leads to high overheads copying pollfd structs, hard limits imposed by the number of bits in select()’s descriptor sets, and further overhead from scanning the lists.
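To make the per-call cost concrete, here is a hedged sketch of a poll() loop; it watches only stdin, but a busy server would rebuild and pass an array with thousands of entries on every call.

```c
/* Sketch of the per-call cost of poll(): the whole pollfd array is passed to
 * the kernel on every call and rescanned by the application afterwards. */
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct pollfd fds[1];
    fds[0].fd = STDIN_FILENO;
    fds[0].events = POLLIN;

    for (;;) {
        /* The entire array is copied in and examined by the kernel each time. */
        int ready = poll(fds, 1, 5000 /* ms */);
        if (ready < 0) { perror("poll"); return 1; }
        if (ready == 0) { printf("timeout\n"); continue; }

        /* The application then scans the array again to find ready entries. */
        if (fds[0].revents & POLLIN) {
            char buf[256];
            ssize_t n = read(fds[0].fd, buf, sizeof buf);
            if (n <= 0) break;
            write(STDOUT_FILENO, buf, (size_t)n);
        }
    }
    return 0;
}
```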
Limitations of Select and Poll
- Inefficient for large sets of descriptors, though acceptable for small sets.
- The application must pass the full list of descriptors for every call.
- Up to three scans over the set are required: the application builds and passes the set, the kernel checks every descriptor, and the application scans the results to find the ready ones.
How Event Sources Wake Up Polling Threads
When a thread calls poll(), the kernel looks up the object associated with each file descriptor and calls that object’s poll() method, which adds the thread to a list of waiters for that event source. When the event happens, the object associated with that event source walks its list and wakes each waiting thread.
Kqueue: A Scalable I/O Event Notification Interface
- Efficient and scalable to a large number of descriptors (thousands).
- Expands the information conveyed to the application, e.g., readiness of descriptors, file-system changes, signals, and process events (see the sketch below).
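A hedged sketch of the kqueue/kevent API (FreeBSD/macOS): interest is registered once, and later calls pass only changes rather than the full descriptor set; watching stdin is just for illustration.

```c
/* Sketch: register interest in stdin once with kqueue, then repeatedly wait
 * for events without re-passing the full descriptor set. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int kq = kqueue();
    if (kq < 0) { perror("kqueue"); return 1; }

    /* Register once: readable events on stdin. */
    struct kevent change;
    EV_SET(&change, STDIN_FILENO, EVFILT_READ, EV_ADD, 0, 0, NULL);
    if (kevent(kq, &change, 1, NULL, 0, NULL) < 0) {
        perror("kevent register");
        return 1;
    }

    for (;;) {
        /* Only new events come back; no per-call copy of the whole set. */
        struct kevent event;
        int n = kevent(kq, NULL, 0, &event, 1, NULL);
        if (n < 0) { perror("kevent wait"); return 1; }
        if (n > 0 && event.filter == EVFILT_READ) {
            char buf[256];
            ssize_t r = read((int)event.ident, buf, sizeof buf);
            if (r <= 0) break;
            write(STDOUT_FILENO, buf, (size_t)r);
        }
    }
    close(kq);
    return 0;
}
```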
Distributed Systems: Cristian’s Algorithm (1989)
- The client requests the time from the server, measures the round trip time (RTT), and sets its clock to the server’s reply plus RTT/2.
- Assumes propagation delay is the same for send and receive.
- Accuracy can be improved by making multiple requests and using the minimum RTT, as in the sketch below.
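A hedged sketch of the client-side computation; the timestamps are made-up sample values.

```c
/* Hedged sketch of Cristian's adjustment: given the client's send and receive
 * timestamps and the server's reported time, estimate the correct local time. */
#include <stdio.h>

/* All times in milliseconds. */
static double cristian_time(double t_send, double t_recv, double t_server) {
    double rtt = t_recv - t_send;
    /* Assume the reply took half the round trip to arrive. */
    return t_server + rtt / 2.0;
}

int main(void) {
    /* Example: request sent at 1000 ms, reply received at 1040 ms,
     * server reported 5000 ms when it replied. */
    double estimate = cristian_time(1000.0, 1040.0, 5000.0);
    printf("estimated current time: %.1f ms (RTT = 40.0 ms)\n", estimate);
    /* With multiple requests, keeping the sample with the minimum RTT
     * tightens the error bound to roughly +/- RTT_min / 2. */
    return 0;
}
```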
Distributed Systems: Lamport Clocks
Condition: If A ⇒ B then T(A) < T(B)
- Each event is assigned a Lamport timestamp, initialized to 0.
- If an event is a send event, the local process’s timestamp is incremented by 1 and included in the message.
- If an event is a receive event, the local process’s timestamp = Max(local timestamp, message timestamp) + 1 (see the sketch below).
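A hedged sketch of these rules; the process and message types are invented for illustration.

```c
/* Hedged sketch of Lamport clock maintenance. */
#include <stdio.h>

struct process {
    long clock;   /* Lamport timestamp, initialized to 0 */
};

struct message {
    long timestamp;
};

/* Rule for a send event: increment, then stamp the message. */
static struct message lamport_send(struct process *p) {
    p->clock += 1;
    struct message m = { .timestamp = p->clock };
    return m;
}

/* Rule for a receive event: take the max of local and message time, plus 1. */
static void lamport_receive(struct process *p, const struct message *m) {
    long max = p->clock > m->timestamp ? p->clock : m->timestamp;
    p->clock = max + 1;
}

int main(void) {
    struct process p1 = { .clock = 0 }, p2 = { .clock = 0 };

    struct message m = lamport_send(&p1);   /* p1.clock = 1, m.timestamp = 1 */
    lamport_receive(&p2, &m);               /* p2.clock = max(0, 1) + 1 = 2  */

    printf("p1 = %ld, p2 = %ld\n", p1.clock, p2.clock);
    return 0;
}
```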
Distributed Systems: Vector Clock Algorithm
VCi[j] represents the local logical clock value of process Pj as known by process Pi.
- Initialize all values in VCi to 0: VCi = [0, …, 0].
- Whenever process Pi experiences an event, VCi[i] is incremented by 1.
- When process Pi sends a message to process Pj, it must include its vector clock VCi.
- When process Pj receives a message, it performs two operations:
- For each value VCj[k] in the VCj vector, update it to max(VCi[k], VCj[k]).
- Increment its own corresponding clock value in VCj, i.e., VCj[j] is incremented by 1 (see the sketch below).
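A hedged sketch of the vector clock rules for an assumed N = 3 processes; the event sequence is illustrative.

```c
/* Hedged sketch of vector clock maintenance. */
#include <stdio.h>
#include <string.h>

#define N 3   /* number of processes (assumed) */

/* Local event (including send): increment own entry. */
static void vc_event(long vc[N], int self) {
    vc[self] += 1;
}

/* Receive: element-wise max with the message's clock, then increment own entry. */
static void vc_receive(long vc[N], const long msg_vc[N], int self) {
    for (int k = 0; k < N; k++)
        if (msg_vc[k] > vc[k])
            vc[k] = msg_vc[k];
    vc[self] += 1;
}

int main(void) {
    long vc0[N] = {0}, vc1[N] = {0};

    vc_event(vc0, 0);                 /* P0 sends: VC0 = [1,0,0] */
    long msg[N];
    memcpy(msg, vc0, sizeof msg);     /* message carries P0's vector clock */

    vc_receive(vc1, msg, 1);          /* P1 receives: VC1 = [1,1,0] */

    printf("VC1 = [%ld, %ld, %ld]\n", vc1[0], vc1[1], vc1[2]);
    return 0;
}
```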
Fundamentals of Distributed Algorithms
A distributed algorithm is an algorithm that runs on more than one process.
Synchronous Timing Assumption
- Processes share a clock.
- Communication can be guaranteed to occur in some known number of clock cycles.
- Computation steps can be guaranteed to occur in some known number of clock cycles.
Asynchronous Timing Assumption
Processes operate asynchronously from one another. No claims can be made about whether another process is running slowly or has failed. There is no time bound on how long it takes for a message to be delivered.
Types of Failures: Fail-Stop
A failure results in the process, p, stopping.
Types of Failures: Byzantine
Process p fails in an arbitrary manner. It can send messages and perform actions that will have the worst impact on other processes. It can collaborate with other “failed” processes.
Fault tolerance is the ability of a system to continue operating in the presence of faults.
Reliable Broadcast Properties
- Validity: If a correct process broadcasts a message m, then all correct processes eventually deliver m.
- Agreement: If a correct process delivers a message m, then all correct processes eventually deliver m.
- Integrity: For any message m, every correct process delivers m at most once, and only if m was previously broadcast by sender(m).
Diffusion Algorithm Considerations
In the diffusion algorithm, a process relays each message to all of its neighbours the first time it receives it. Main issue: this floods the network, especially if processes are highly connected (see the simulation sketch below).
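A hedged, single-process simulation of the diffusion idea over a made-up four-node topology, counting how many messages the flood generates; the adjacency matrix is an assumption.

```c
/* Hedged simulation of diffusion (flooding) broadcast over a small fixed
 * topology. */
#include <stdio.h>
#include <stdbool.h>

#define NPROC 4

/* Which processes are directly connected (assumed topology). */
static const bool link_to[NPROC][NPROC] = {
    {0, 1, 1, 0},
    {1, 0, 1, 1},
    {1, 1, 0, 1},
    {0, 1, 1, 0},
};

static bool delivered[NPROC];
static int  messages_sent;

/* On first receipt, deliver the message and relay it to every neighbour. */
static void receive(int p) {
    if (delivered[p])
        return;             /* integrity: deliver at most once */
    delivered[p] = true;
    printf("process %d delivers m\n", p);
    for (int q = 0; q < NPROC; q++) {
        if (link_to[p][q]) {
            messages_sent++;
            receive(q);     /* simulate sending m to neighbour q */
        }
    }
}

int main(void) {
    receive(0);             /* process 0 broadcasts m */
    printf("total messages sent: %d (the flooding overhead)\n", messages_sent);
    return 0;
}
```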
Challenges in Asynchronous Distributed Consensus
The asynchronous assumption makes it impossible to differentiate between failed and slow processes. Therefore, termination (liveness) cannot be guaranteed. Even if an algorithm terminates, it may violate agreement (safety). A slow process may decide differently than other processes, thus violating the agreement property.
State Machine Commands
A command is a message that the state machine receives. Commands must execute atomically with respect to other commands.
Replicated State Machines and Distributed Consensus
This context is different from general distributed agreement. One client must decide on the result, but replicas do not have to agree with each other about the result. Each correct replica:
- Must receive every request.
- Must execute the same commands in the same order.
Therefore, Replicated State Machines (RSMs) require distributed consensus to agree on the order of commands. Each server in the RSM starts in the same initial state, and correct replicas must apply commands in the same order, so that they are all in the same state when a command is handled and all reach the same next state after handling it. This ensures that all correct replicas perform the same action and send the same reply to the client; distributed consensus is how the servers reach agreement on the order in which to apply the clients’ commands.
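A hedged sketch of the RSM idea with the consensus step stubbed out as an already-agreed command log: every replica that applies the same commands in the same order ends in the same state. The command set and replica count are invented for illustration.

```c
/* Hedged sketch of the replicated-state-machine idea. */
#include <stdio.h>

enum command { INCREMENT, DOUBLE };

struct replica {
    long state;
};

/* Commands execute atomically with respect to each other (single-threaded here). */
static void apply(struct replica *r, enum command c) {
    switch (c) {
    case INCREMENT: r->state += 1; break;
    case DOUBLE:    r->state *= 2; break;
    }
}

int main(void) {
    /* The agreed order of commands (what consensus would establish). */
    enum command cmd_log[] = { INCREMENT, DOUBLE, INCREMENT };
    int log_len = sizeof cmd_log / sizeof cmd_log[0];

    struct replica replicas[3] = { {0}, {0}, {0} };

    /* Every correct replica applies the same commands in the same order... */
    for (int r = 0; r < 3; r++)
        for (int i = 0; i < log_len; i++)
            apply(&replicas[r], cmd_log[i]);

    /* ...and therefore ends in the same state and would send the same reply. */
    for (int r = 0; r < 3; r++)
        printf("replica %d state = %ld\n", r, replicas[r].state);
    return 0;
}
```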