Parallel Computing Architectures, Models, and Performance Laws

Temporal Parallelism

Temporal parallelism is a clever way to speed things up in parallel computing by thinking about tasks in terms of stages or a pipeline. Instead of having one processor do everything for one piece of data before moving to the next, the work is broken down into sequential steps, and different processors work on different stages of the pipeline simultaneously.

Consider an image-processing pipeline as an example. With temporal parallelism, while one processor is loading the next image, another processor can be preprocessing the previous image, a third can be extracting features from an even earlier image, and so on. Each processor works on a different stage of the overall task, and the stages operate on different pieces of data at the same time. Temporal parallelism thus lets you leverage multiple processors by having them specialize in different parts of a larger task while working on a continuous flow of data, making it a powerful technique for improving the efficiency and throughput of many parallel applications.
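
As a minimal sketch of this idea, the following plain-C program simulates a three-stage image pipeline with discrete time steps; the stage names and item count are made up, and in a real system each stage would run on its own processor rather than inside one loop. It shows that once the pipeline is full, every stage is busy on a different data item during the same time step.

```c
#include <stdio.h>

#define N_ITEMS  6   /* data items flowing through the pipeline */
#define N_STAGES 3   /* load -> preprocess -> extract */

/* Hypothetical stage work; in a real pipeline each stage runs on its own processor. */
static void run_stage(int stage, int item) {
    const char *names[N_STAGES] = { "load", "preprocess", "extract" };
    printf("  %-10s works on item %d\n", names[stage], item);
}

int main(void) {
    /* At time step t, stage s handles item (t - s), so different stages
     * process different items concurrently once the pipeline has filled. */
    for (int t = 0; t < N_ITEMS + N_STAGES - 1; t++) {
        printf("time step %d:\n", t);
        for (int s = 0; s < N_STAGES; s++) {
            int item = t - s;
            if (item >= 0 && item < N_ITEMS)
                run_stage(s, item);
        }
    }
    return 0;
}
```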

Flynn’s Classification

  1. SISD (Single Instruction, Single Data): This represents a traditional sequential computer with one processor executing a single instruction at a time on a single data stream.
  2. SIMD (Single Instruction, Multiple Data): This architecture features a single control unit that broadcasts the same instruction to multiple processing units, each operating on different data elements simultaneously.
  3. MISD (Multiple Instruction, Single Data): This type involves multiple processing units executing different instructions on the same single data stream.
  4. MIMD (Multiple Instruction, Multiple Data): This is the most common type of parallel computer, where multiple processors independently execute different instructions on different data streams.

Bus Network Architecture

The bus network in parallel computing refers to a type of interconnection network where all the processing nodes (processors, memory modules, etc.) are connected to a shared communication medium, typically a set of parallel electrical wires or optical fibers. This shared medium is the “bus.”

A bus network in parallel computing is a simple and cost-effective interconnection strategy for a small number of nodes that rely on a shared communication medium. However, its inherent limitations in terms of scalability and potential for bottlenecks make it less suitable for large-scale parallel systems where more sophisticated interconnection networks like meshes, hypercubes, or multistage networks are typically preferred.

Uses of Bus Networks

While bus networks are not a primary choice for large-scale parallel computing due to their inherent limitations, they still play a role in smaller shared-memory systems, internal communication within compute nodes, and some specialized or legacy architectures due to their simplicity and cost-effectiveness for a limited number of processors.

Pthread Functions and Data Types

  1. pthread_t: This is an opaque data type representing a thread’s identifier. Variables of this type are used to refer to specific threads created within the program.
  2. pthread_create(): This function creates a new thread. It requires a pointer to a pthread_t variable to store the new thread’s ID, thread attributes (or NULL for defaults), the starting function for the thread, and an argument to pass to that function.
  3. pthread_kill(): This function sends a specified signal to a target thread identified by its pthread_t value. It can be used for inter-thread signaling or to request termination; use it with caution, since signals whose default action is process-wide (such as SIGKILL) terminate the entire process, not just the target thread.
  4. pthread_exit(): This function is called by a thread to terminate its execution. It accepts a void* argument that represents the thread’s exit status, which can be retrieved by another thread using pthread_join(). A thread also implicitly exits when its starting function returns.
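
A minimal sketch that ties these calls together is shown below; the worker function, its argument, and the thread count are illustrative only, and pthread_join() (mentioned under pthread_exit()) is used to collect each thread’s exit status. Compile with the pthread library, e.g. gcc demo.c -lpthread.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *worker(void *arg) {
    long id = (long)arg;                 /* argument passed via pthread_create() */
    printf("thread %ld running\n", id);
    pthread_exit((void *)(id * 10));     /* exit status retrievable by pthread_join() */
}

int main(void) {
    pthread_t tid[2];
    for (long i = 0; i < 2; i++)
        if (pthread_create(&tid[i], NULL, worker, (void *)i) != 0) {
            perror("pthread_create");
            exit(EXIT_FAILURE);
        }

    for (int i = 0; i < 2; i++) {
        void *status;
        pthread_join(tid[i], &status);   /* wait for thread i and fetch its exit status */
        printf("thread %d exited with status %ld\n", i, (long)status);
    }
    return 0;
}
```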

Multi-threaded Programming

Multi-threaded programming is a parallel programming technique where a single process is divided into multiple, concurrently executing units called threads. These threads share the same memory space and resources of the parent process but have their own independent flow of control, including a program counter, stack, and registers.

In the context of parallel computing, the primary goal of multi-threading is to achieve parallelism by allowing different parts of a program to run simultaneously. This is particularly effective on multi-core processors, where each thread can potentially execute on a separate core, leading to significant performance improvements for suitable tasks.

Multi-threaded programming is a powerful approach to achieving parallelism on shared-memory systems. When implemented correctly, it can lead to significant performance gains and improved application responsiveness. However, it also introduces complexities related to concurrency management that developers must carefully address to create robust and efficient parallel applications.
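
One of those complexities is the data race: two threads updating shared memory without coordination. The sketch below (thread and iteration counts are arbitrary) shows the standard remedy, a mutex guarding a shared counter; removing the lock/unlock pair would make the final value unpredictable.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* protect the shared counter */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* always 2000000 with the mutex in place */
    return 0;
}
```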

Direct Networks

Direct networks establish fixed, point-to-point connections between processing nodes and their immediate neighbors. Communication between non-adjacent nodes necessitates multi-hop routing, where messages are forwarded through intermediate nodes. Topologies like meshes, tori, and hypercubes exemplify this approach. While offering simpler node interfaces and lower latency for local communication, distant communication can suffer from higher latency, and scalability may be limited by the network’s diameter and node degree. These static connections define the communication pathways.
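
As a concrete illustration of multi-hop routing in one of these topologies, the sketch below performs dimension-order (e-cube) routing in a hypercube, assuming node IDs are plain integers whose bits give the node’s coordinates; each hop corrects one differing address bit, so the hop count equals the Hamming distance between source and destination.

```c
#include <stdio.h>

/* Print the dimension-order route from src to dst in a dims-dimensional hypercube. */
static void route(unsigned src, unsigned dst, int dims) {
    unsigned cur = src;
    printf("%u", cur);
    for (int d = 0; d < dims; d++) {
        unsigned bit = 1u << d;
        if ((cur ^ dst) & bit) {        /* this dimension still differs */
            cur ^= bit;                 /* hop to the neighbor along dimension d */
            printf(" -> %u", cur);
        }
    }
    printf("\n");
}

int main(void) {
    route(1, 6, 3);   /* 001 -> 000 -> 010 -> 110 in a 3-cube: 3 hops = Hamming distance */
    return 0;
}
```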

Indirect Networks

In parallel computing, indirect networks employ a dedicated switching network to facilitate communication between processing nodes. Unlike direct networks, nodes do not have direct links to each other; instead, they connect to switches that dynamically route messages. Topologies such as crossbar switches and multistage interconnection networks (e.g., fat trees) fall under this category. While offering more flexible communication and potentially lower latency for certain patterns, they introduce the cost and complexity of switches and can experience contention.

Direct vs. Indirect Network Comparison

  1. Connectivity: Direct networks feature direct, fixed connections between neighboring processing nodes, while indirect networks use a separate switching network to connect processing nodes.
  2. Routing Responsibility: In direct networks, each processing node handles routing decisions, forwarding messages to neighbors. In indirect networks, dedicated switches manage the routing of messages.
  3. Communication Path: Direct networks have static communication paths determined by the fixed connections. Indirect networks offer dynamic paths established through the switching network.
  4. Latency for Distant Nodes: Communication between distant nodes in direct networks often incurs higher latency due to multiple hops. Indirect networks can potentially offer lower and more consistent latency depending on the switch architecture.
  5. Cost and Complexity: Direct networks can be more cost-effective for simpler topologies but might have scalability limitations. Indirect networks generally involve higher costs and complexity due to the switching infrastructure.

Task Dependencies

Task dependencies define the order in which different tasks or units of work must be executed. Understanding these dependencies is crucial for correctly designing and implementing parallel algorithms to ensure the desired outcome and maximize efficiency.

  1. Data Dependencies: These occur when the result of one task is required as input by another task.
  2. Control Dependencies: These arise from control flow statements like conditional branches (if-else), loops, and function calls. The execution of a task depends on the outcome of a control decision made by a preceding task.
  3. Resource Dependencies: These occur when multiple tasks need to access the same non-sharable resource (e.g., a specific hardware unit, a critical section of code protected by a lock, a file).
  4. Temporal Dependencies (or Ordering Constraints): These are dependencies imposed by the problem itself or by design choices, where one task must be completed before another for logical correctness, even if there is not a direct data flow or control dependency.
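
The short, self-contained C fragment below (hypothetical task names and a made-up threshold) illustrates the first two kinds of dependency.

```c
#include <stdio.h>

#define THRESHOLD 10                    /* hypothetical cutoff used by the control dependency */

static int load_a(void) { return 7; }   /* stand-ins for two independent input tasks */
static int load_b(void) { return 5; }

int main(void) {
    int a = load_a();         /* task T1 */
    int b = load_b();         /* task T2: independent of T1, so T1 and T2 may run in parallel */
    int c = a + b;            /* task T3: data dependency on T1 and T2 (needs their results) */
    if (c > THRESHOLD)        /* control dependency: whether T4 executes depends on T3's outcome */
        printf("processing %d\n", c);   /* task T4 */
    return 0;
}
```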

Multi-Cache Memory Systems

Multi-cache memory refers to a system where each processing core (CPU) has its own local cache memory. This is the prevalent architecture in modern multi-core processors. The working principle revolves around reducing memory access latency and increasing overall system performance by exploiting the principle of locality.

  1. Locality of Reference: Programs tend to access the same data or instructions repeatedly (temporal locality) or access data/instructions that are located near each other in memory (spatial locality).
  2. Local Caches: Each core has a small, fast cache that stores recently accessed data and instructions.
  3. Cache Hit: If the requested data is found in the local cache (a “cache hit”), the core can access it much faster than going to the main memory.
  4. Cache Miss: If the data is not in the local cache (a “cache miss”), the core must fetch it from the main memory (or a shared higher-level cache, if present).

Task Graphs

Task graphs in parallel computing are Directed Acyclic Graphs (DAGs) where nodes represent tasks and directed edges represent dependencies. They visually depict the execution order and potential parallelism. Analyzing task graphs helps in identifying independent tasks for concurrent execution, scheduling tasks onto processors, understanding performance bottlenecks (critical path), and designing efficient parallel algorithms by explicitly showing task dependencies.
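
A minimal sketch of one such analysis, computing the critical-path length of a small, made-up DAG, is shown below; task costs and edges are illustrative, and the task IDs are assumed to be listed in topological order.

```c
#include <stdio.h>

#define N_TASKS 5

int main(void) {
    /* Hypothetical task costs (execution times). */
    int cost[N_TASKS] = { 2, 3, 1, 4, 2 };

    /* Edges (u -> v): task v depends on task u.  IDs are already topologically sorted. */
    int edges[][2] = { {0,1}, {0,2}, {1,3}, {2,3}, {3,4} };
    int n_edges = sizeof edges / sizeof edges[0];

    int finish[N_TASKS];
    for (int v = 0; v < N_TASKS; v++) {
        int start = 0;                       /* earliest start = latest finish of predecessors */
        for (int e = 0; e < n_edges; e++)
            if (edges[e][1] == v && finish[edges[e][0]] > start)
                start = finish[edges[e][0]];
        finish[v] = start + cost[v];
    }

    int cp = 0;                              /* critical path = largest earliest-finish time */
    for (int v = 0; v < N_TASKS; v++)
        if (finish[v] > cp)
            cp = finish[v];
    printf("critical path length = %d\n", cp);
    return 0;
}
```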

Parallel Processing System Development Lifecycle

  1. Analyze Problem & Parallelism: Understand the problem and identify opportunities for parallel execution.
  2. Define Goals & Algorithm: Set performance objectives and choose or design a parallel-friendly algorithm.
  3. Decompose into Tasks: Break the problem into smaller, concurrent tasks (data or functional).
  4. Analyze Dependencies: Identify relationships between tasks to ensure correct execution order.
  5. Determine Granularity: Decide the size and complexity of individual tasks (coarse or fine).
  6. Map Tasks to Processors: Assign tasks to processing units, considering load balancing.
  7. Choose Programming Model: Select the appropriate parallel programming approach (threads, message passing, etc.).
  8. Implement & Debug: Write the parallel code and thoroughly debug for correctness (race conditions, deadlocks).
  9. Measure & Optimize Performance: Profile the application, identify bottlenecks, and tune for speedup and efficiency.
  10. Deploy & Maintain: Deploy the system on the target architecture and provide ongoing support.

Scheduling Concurrency

Scheduling concurrency is the intelligent management of parallel tasks on parallel resources to achieve the desired performance goals while respecting constraints like task dependencies and aiming for efficient resource utilization and load balance. The choice of scheduling algorithm and whether to use static or dynamic approaches depends heavily on the characteristics of the parallel application and the underlying hardware architecture.

Scheduling concurrency is a crucial aspect that determines how effectively the available parallel resources (like CPU cores or processing units) are utilized to execute concurrent tasks and achieve speedup. Essentially, it is the process of deciding when, where, and for how long each concurrent task gets to run on a processing resource.

Agglomeration in Parallel Computing

Agglomeration in parallel computing is the process of grouping initially fine-grained tasks into larger chunks. This crucial step aims to enhance performance by reducing communication overhead between processors, increasing the granularity of tasks for better processor utilization, and improving data locality by keeping related computations together. While it reduces the number of parallel units, effective agglomeration minimizes scheduling overhead and communication costs, leading to more efficient parallel execution. Careful balancing is needed to avoid limiting parallelism or creating load imbalances.
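
One simple form of agglomeration is to hand each thread contiguous chunks of loop iterations rather than single elements. The OpenMP sketch below (array and chunk sizes are arbitrary) does this with schedule(static, CHUNK).

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000
#define CHUNK 10000        /* illustrative chunk size; tuning it is part of agglomeration */

static double a[N], b[N];

int main(void) {
    for (int i = 0; i < N; i++)
        b[i] = i;

    /* Each thread receives contiguous blocks of CHUNK iterations instead of
     * being scheduled one fine-grained element at a time. */
    #pragma omp parallel for schedule(static, CHUNK)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```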

Systolic Processors

Systolic processors are a type of parallel computer architecture that uses a network of interconnected processing elements (PEs) to perform computations. These PEs are arranged in a regular pattern, typically a grid, and each PE receives data from its neighbors, processes it, and then passes the result on to its downstream neighbors. This rhythmic flow of data, analogous to the flow of blood in a heart, gives these processors their name, and it allows for highly efficient parallel processing, particularly for compute-bound algorithms like matrix multiplication.

Gustafson’s Law

Gustafson’s Law (also known as Gustafson-Barsis’s Law) is a principle in parallel computing that offers a more optimistic perspective on the potential speedup achievable through parallelization compared to Amdahl’s Law. While Amdahl’s Law focuses on the speedup for a fixed problem size as the number of processors increases, Gustafson’s Law considers the scenario where the problem size can scale with the number of processors, aiming to keep the execution time constant.
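
In its usual statement, if s is the serial fraction of the execution time measured on the N-processor run (so the parallel fraction is p = 1 - s), the scaled speedup is

    S_{\text{scaled}}(N) = s + pN = N - (N - 1)\,s

For example, with s = 0.05 and N = 100 processors the scaled speedup is 100 - 99 × 0.05 = 95.05, far above the ceiling of 20 that Amdahl’s Law would impose on a fixed-size problem with a 5% serial fraction.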

Amdahl’s Law

Amdahl’s Law is a fundamental principle in parallel computing that describes the theoretical maximum speedup achievable by parallelizing a workload on multiple processors. It states that the speedup is limited by the portion of the workload that cannot be parallelized (the sequential part).

Amdahl’s Law provides a crucial reminder that the speedup achievable through parallelization is ultimately limited by the sequential parts of a program. It highlights the importance of minimizing the serial fraction to effectively leverage the power of parallel computing.
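
In its usual statement, if f is the fraction of the work that must remain serial and N is the number of processors, the speedup is

    S(N) = \frac{1}{\,f + (1 - f)/N\,}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{f}

For example, if 10% of a program is serial (f = 0.1), the speedup can never exceed 10 regardless of processor count; with N = 16 it is only 1 / (0.1 + 0.9/16) ≈ 6.4.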

Point-to-Point Communications

Point-to-point communication involves a direct data exchange between exactly two specific processes in a parallel system. The sender explicitly targets a receiver (by rank/ID), and the receiver can specify the sender. Messages contain data, type, and a tag. Send and receive operations can be blocking (waiting for completion) or non-blocking (returning immediately). This fundamental communication method enables targeted data transfer and forms the basis for more complex parallel algorithms requiring direct process interaction.
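
A minimal MPI sketch of a blocking send/receive pair is shown below; the value, tag, and ranks are arbitrary, and it assumes an MPI installation with at least two processes (e.g. launched with mpirun -np 2).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);        /* dest = 1, tag = 0 */
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                               /* source = 0, tag = 0 */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```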

Interconnection Networks

Interconnection networks are crucial for efficient communication among all processors within a system. There are two main approaches for interconnecting these processors: static high-speed interconnection networks and dynamic interconnection networks.

  • Static Interconnection Network: Static interconnection networks are fixed. In a unidirectional static interconnection network, connections between nodes allow communication to occur in only one direction. Data can be transmitted from one node to another node but not in the reverse direction. However, in a bidirectional static interconnection network, the connection between nodes allows communication to occur in both directions.
  • Dynamic Interconnection Network: Unlike Static Interconnection Networks, where connections are fixed between nodes, dynamic networks enable the dynamic reconfiguration of connections to adapt to changing communication requirements.

Life Cycle of a Parallel Process

  1. Initialize: Parallel processes are created and allocated initial resources.
  2. Ready/Runnable: The process is waiting to be scheduled on a processing element.
  3. Running/Executing in Parallel: The process performs its computations, potentially with internal threads.
  4. Waiting/Blocked (Parallel Context): The process waits for synchronization (barriers), communication (messages), or resource access.
  5. Terminate/Complete: The process finishes its tasks and releases resources, possibly synchronizing with others before final exit.

Actions for Parallel Process Creation

  1. System Initialization (Parallel Bootstrapping): When the parallel system starts up, initial sets of processes needed for the parallel environment (e.g., resource managers, initial computation starters) are created.
  2. Parallel Process Spawning: A running parallel process explicitly creates new processes to distribute the workload or manage sub-tasks. This is common using parallel programming APIs (like MPI_Comm_spawn).
  3. User Request (Parallel Launch): A user initiates the execution of a parallel program, leading to the creation of multiple processes across the available parallel resources (e.g., using commands like mpirun or through job submission systems).
  4. Batch Job Initiation (Parallel Batch Processing): In parallel batch systems, the scheduler creates processes to execute parallel jobs submitted by users.

VLIW (Very-Long Instruction Word) Architecture

VLIW stands for Very Long Instruction Word. VLIW architectures are an effective alternative for exploiting Instruction-Level Parallelism (ILP) in programs, that is, for executing more than one basic (primitive) instruction at a time.

These processors contain multiple functional units, fetch a very long instruction word from the instruction cache containing several primitive instructions, and dispatch the entire word for parallel execution across those units.

The main goal of VLIW is to remove the complicated instruction scheduling and parallel dispatch hardware found in most modern superscalar microprocessors. Because the compiler performs this scheduling ahead of time, a VLIW processor can be simpler, quicker, and less costly than a comparable superscalar RISC chip.

Condition for Compacting Instructions in a VLIW Word

The compiler must ensure that the operations packed into a VLIW instruction are truly independent and can be executed concurrently without causing any correctness issues due to data or control flow. If these conditions are met, the compiler can then schedule these independent operations into the different slots of the VLIW instruction word, targeting the available functional units of the processor. If sufficient independent operations are not found, “no-operation” (NOP) slots might be inserted to fill the instruction word.
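
As a rough source-level illustration (the compiler actually packs machine operations rather than C statements, and the slot assignments below are hypothetical):

```c
/* Inside the function, the first three statements have no data dependences on one
 * another, so a VLIW compiler could pack their operations into a single long
 * instruction word (one slot per functional unit).  The last statement reads a, b,
 * and c, so it must be scheduled into a later instruction word. */
int pack_example(int x, int y, int p, int q, int m, int n) {
    int a = x + y;        /* slot 1: integer ALU      */
    int b = p * q;        /* slot 2: multiply unit    */
    int c = m - n;        /* slot 3: second ALU       */
    return a + b + c;     /* depends on a, b, c: goes into a subsequent word */
}
```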

Cluster Computing

Cluster computing is a type of parallel computing that utilizes multiple independent, off-the-shelf computers (nodes) networked together to work as a single, cohesive computing resource. Each node typically has its own processor(s), memory, operating system, and storage. These individual computers are interconnected using high-speed networks and often managed by specialized cluster management software.

Cluster computing leverages the power of interconnected commodity computers to create a scalable and cost-effective parallel computing platform. Parallelism is typically achieved through message passing, and the performance is highly influenced by the network interconnect and the design of the parallel applications.

Granularity in Parallel Computing

Granularity refers to the size and complexity of the individual tasks into which a parallel program is divided. It essentially describes the amount of computation performed by a task relative to the amount of communication or synchronization required with other tasks.

Finding the right balance in granularity is crucial for achieving good performance in parallel computing. Overly fine-grained parallelism can be limited by communication costs, while overly coarse-grained parallelism might not exploit the available parallelism effectively, leading to idle processors.

Grain Size and Parallelism

Parallelism achieved using grain size involves breaking a problem into tasks of varying sizes for concurrent execution. Fine-grained decomposition yields many small tasks with high potential parallelism but high overhead. Coarse-grained decomposition offers fewer, larger tasks with low overhead but limited parallelism. Medium-grained decomposition aims for a balance between the two. The chosen grain size determines the trade-off between the degree of parallelism and the communication/synchronization overhead, and it significantly affects how efficiently the parallel resources are utilized.

Scalar Processor

A scalar processor is a type of processor that processes one data element at a time. Scalar processors are typically used for general-purpose computing tasks, such as word processing and spreadsheets. Compared to vector processors, scalar processors deliver far less throughput on data-parallel workloads, but they are simpler, cheaper, and more energy-efficient.

Vector Processor

A vector processor is a type of processor that can process multiple data elements at once. It is capable of performing operations on a vector of data elements in parallel. Vector processors are particularly useful for tasks such as image and video processing, where large amounts of data need to be processed in parallel. Vector processors are also used in scientific computing, where they are used to accelerate the processing of complex algorithms.
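
Modern CPUs expose a narrow form of this idea through SIMD extensions. The sketch below assumes an x86 processor with SSE and uses compiler intrinsics to add four floats with a single instruction; a classic vector machine applies the same principle to much longer vectors with a single vector instruction.

```c
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);              /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));     /* 4 additions in one instruction */
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```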

Fat Tree Interconnection Topology

A Fat Tree is a hierarchical, multi-rooted tree-based interconnection network topology commonly used in High-Performance Computing (HPC) and data centers. Unlike traditional tree topologies that can suffer from bandwidth bottlenecks higher up the hierarchy, a Fat Tree is designed to have increasing bandwidth capacity closer to the root. This “fatter” structure higher up allows for more efficient communication as traffic aggregates.

  • It is organized in levels, with processing nodes (or compute units) typically at the leaves.
  • There are multiple paths between any two nodes in the network, enhancing fault tolerance and enabling load balancing.
  • Fat Trees are often designed with a high degree of symmetry.

Asymptotic Notation for Algorithm Analysis

Asymptotic notation is a way to describe the limiting behavior of a function as its input size grows. In parallel computing, we use it to analyze the time complexity and space complexity of parallel algorithms as the problem size (e.g., the size of a matrix, the number of data points) and the number of processors increase. It helps us understand how the algorithm’s performance will scale with larger inputs and more resources.

  1. Big O (O): An asymptotic upper bound, most often quoted for worst-case performance. For example, an algorithm with O(n) time complexity has a runtime that grows at most linearly with the input size n.
  2. Big Theta (Θ): A tight bound: the runtime grows at the same rate as the reference function, bounded both above and below. An algorithm with Θ(n) time complexity has a runtime that grows linearly with n, neither faster nor slower asymptotically.
  3. Big Omega (Ω): An asymptotic lower bound, often quoted for best-case performance. An algorithm with Ω(n) time complexity has a runtime that grows at least linearly with the input size n.
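
For reference, the standard formal definitions are:

    f(n) = O(g(n))      \iff \exists\, c > 0,\ n_0 : 0 \le f(n) \le c\,g(n) \ \text{for all } n \ge n_0
    f(n) = \Omega(g(n)) \iff \exists\, c > 0,\ n_0 : f(n) \ge c\,g(n) \ \text{for all } n \ge n_0
    f(n) = \Theta(g(n)) \iff f(n) = O(g(n)) \ \text{and} \ f(n) = \Omega(g(n))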

Instruction Flow in Parallel Systems

Instruction flow in parallel computing is about orchestrating the concurrent execution of multiple instructions, whether they are part of the same instruction stream (ILP, SIMD) or different instruction streams (TLP, MIMD, Dataflow), to achieve faster computation. The specific mechanisms depend on the underlying hardware architecture and the parallel programming paradigm employed.

Unlike sequential computing where instructions are executed one after another, parallel computing aims to execute multiple instructions simultaneously. The way these instructions are managed and executed concurrently varies based on the parallel architecture and programming model.

Collective Communication

Collective communication in parallel computing involves data exchange and coordination among a group of processes within a defined communicator. Unlike point-to-point, a single collective operation involves multiple participants. These operations are often optimized for the underlying parallel architecture. Collective communication simplifies parallel programming by providing high-level abstractions for common multi-process interactions, often with better performance than implementing the same logic using individual point-to-point messages.

  1. Broadcast: Sends the same data from one process to all other processes.
  2. Gather: Collects data from all processes to one process.
  3. Scatter: Distributes distinct portions of a data buffer from one process to all processes, so each process receives a different piece.
  4. Reduce: Combines data from all processes using an operation (e.g., sum, min) and returns the result to one or all processes.
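
A minimal sketch combining two of these collectives is shown below (the broadcast value is arbitrary; an MPI installation is assumed).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int value = (rank == 0) ? 100 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);     /* now every rank holds 100 */

    int sum = 0;
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);   /* sum of all ranks at rank 0 */
    if (rank == 0)
        printf("broadcast value = %d, sum of ranks = %d (expected %d)\n",
               value, sum, size * (size - 1) / 2);

    MPI_Finalize();
    return 0;
}
```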

Message Passing Systems

Message Passing Systems are a fundamental paradigm in parallel computing, particularly prevalent in distributed memory architectures where each processor has its own private memory and cannot directly access the memory of other processors. In these systems, parallel processes interact and exchange data by explicitly sending and receiving messages.

Message Passing Systems are a cornerstone of parallel computing on distributed memory architectures, enabling parallel processes to collaborate by explicitly exchanging data-carrying messages. While requiring careful programming, they offer scalability and control necessary for tackling complex computational problems on large parallel machines.

Message Passing Interface (MPI)

The Message Passing Interface (MPI) is a standardized means of exchanging messages between multiple computers running a parallel program across distributed memory. In parallel computing, multiple computers—or even multiple processor cores within the same computer—are called nodes. Each node in the parallel arrangement typically works on a portion of the overall computing problem.

The challenge then is to synchronize the actions of each parallel node, exchange data between nodes, and provide command and control over the entire parallel cluster. The Message Passing Interface defines a standard suite of functions for these tasks. The term message passing itself refers to the explicit sending of a message to another parallel process, thread, or routine, which then acts on the data it receives.

Unix Workstation Clusters

Unix workstation clusters in parallel computing refer to the use of multiple networked Unix-based workstations to work together as a single, powerful parallel computing system. This approach leverages the existing infrastructure of individual workstations to tackle computationally intensive tasks that would be too time-consuming for a single machine.

Unix workstation clusters harness the collective processing power of individual Unix machines to solve complex problems in scientific research, engineering, and other computationally demanding fields. They offer a balance between performance and cost-effectiveness, making parallel computing accessible to a wider range of users and institutions.

Data Parallelism

Data parallelism in parallel computing involves distributing a large dataset across multiple processing units and performing the same operation on each subset concurrently. This approach is effective for tasks with regular data structures like arrays and matrices. Speedup is achieved by processing different parts of the data simultaneously. While scalable and often straightforward to implement, it is not suitable for all problems and can be limited by data distribution overhead and load imbalance if data partitions or processing times vary significantly.

Specialist Data Parallelism

Specialist data parallelism is an advanced form where data is partitioned, and while the type of operation across partitions might be similar, the specific parameters or instruction sequences applied to each segment are tailored. This allows for data-dependent computations or the use of optimized kernels for specific data characteristics within the larger dataset. It offers more flexibility than strict SIMD, enabling efficient handling of heterogeneous data or computations with localized variations, often leveraging capabilities of modern parallel architectures like GPUs.

Parallel Programming Models

  1. Shared Memory Model: Multiple threads within a single process share a common address space. They can access and modify the same memory locations.
  2. Message Passing Model: Multiple independent processes, each with its own private memory space, communicate by explicitly sending and receiving messages.
  3. Threads Model: This programming model is a type of shared memory programming. In the threads model of parallel programming, a single “heavyweight” process can have multiple “lightweight,” concurrent execution paths.
  4. Data Parallel Model: The same operation is performed simultaneously on different elements of a data structure; the focus is on distributing data across multiple processing units.

Modes of Memory Access

  1. Shared Memory: In this model, multiple processors have direct access to a common pool of physical memory. Processors can communicate by reading and writing to shared memory locations.
  2. Distributed Memory: Each processor has its own private local memory, which is not directly accessible by other processors. Communication between processors occurs explicitly through message passing over an interconnection network.
  3. Hybrid Shared-Distributed Memory: This model combines features of both shared and distributed memory architectures. Nodes in the system are typically shared-memory multiprocessors, and these nodes are interconnected by a distributed memory network.

Vector Supercomputers

Vector supercomputers are a class of parallel computers that were dominant in high-performance computing from the 1970s to the 1990s. Their architecture is specifically designed to efficiently process vectors, which are ordered arrays of data. Instead of performing an operation on a single data element at a time (as in scalar processors), a single vector instruction in a vector supercomputer can perform the same operation on multiple data elements in parallel.

Vector supercomputers like the Cray-1, Cray X-MP, NEC SX series, and Fujitsu VP series played a crucial role in advancing scientific discovery and engineering in the late 20th century. While they are less common as stand-alone systems today, the principles of vector processing and SIMD execution have been incorporated into modern CPUs (through SIMD extensions) and are a fundamental aspect of the architecture of Graphics Processing Units (GPUs), which now dominate many areas of high-performance computing.

Coarse-Grained Specialized Temporal Parallelism

Coarse-grained specialized temporal parallelism is a parallel computing approach where a large task is divided into a sequence of significant processing stages. Each stage is handled by a specialized processing unit (hardware or software optimized for that specific operation), and these units work concurrently in a pipeline fashion on different data items. The “coarse-grained” aspect means that the amount of work done in each stage is substantial, aiming to minimize the overhead of passing data between the specialized units.

This approach is effective when the overall task can be naturally decomposed into a sequence of distinct and computationally intensive stages, and when specialized processing units can be efficiently utilized for each stage.

OpenMP

OpenMP is a set of compiler directives as well as an API for programs written in C, C++, or FORTRAN that provides support for parallel programming in shared-memory environments. OpenMP identifies parallel regions as blocks of code that may run in parallel. Application developers insert compiler directives into their code at parallel regions, and these directives instruct the OpenMP run-time library to execute the region in parallel.

By default, it creates as many threads as there are processing cores in the system. Thus, for a dual-core system, two threads are created; for a quad-core system, four are created; and so forth. All the threads then execute the parallel region simultaneously; when they complete the region, they synchronize and only the original thread continues past it. OpenMP provides several additional directives for running code regions in parallel, including parallelizing loops.
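
A minimal sketch of what this looks like in C is shown below (array size is arbitrary; compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp demo.c).

```c
#include <omp.h>
#include <stdio.h>

#define N 8

int main(void) {
    #pragma omp parallel                      /* parallel region: one thread per core by default */
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }

    int a[N];
    #pragma omp parallel for                  /* loop iterations divided among the threads */
    for (int i = 0; i < N; i++)
        a[i] = i * i;

    for (int i = 0; i < N; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```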