Distributed Systems: Concepts, Protocols, and Architectures
Fundamentals of Distributed Systems
Distributed Mutual Exclusion
Distributed mutual exclusion algorithms coordinate access to shared resources in a distributed system, ensuring only one process uses the resource at a time. Unlike centralized systems, there is no single coordinator. These algorithms rely on message passing among processes to achieve mutual exclusion.
Approaches include:
- Token-Based: A unique token grants access to the critical section.
- Non-Token-Based: Processes request permission using timestamps.
- Quorum-Based: Access is granted if a majority of processes agree.
Key challenges include message delays, partial failures, and maintaining fairness across the distributed nodes.
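The non-token-based approach can be sketched with the rule at the heart of Ricart-Agrawala-style algorithms: requests are ordered by (timestamp, process ID), and a process defers its reply to any request that is later than its own pending one. The `Process` class below is a hypothetical illustration of that decision logic only, not a full protocol implementation.

```python
# Toy sketch of timestamp-ordered mutual exclusion (Ricart-Agrawala style):
# a node replies to a request immediately unless its own pending request
# is earlier; deferred replies are released when it leaves the critical section.

def has_priority(req_a, req_b):
    """True if request a = (timestamp, pid) precedes request b."""
    return req_a < req_b  # tuple comparison: timestamp first, then pid

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.requesting = None   # (timestamp, pid) while waiting for the CS
        self.deferred = []       # replies withheld until we exit the CS

    def request_cs(self, timestamp):
        self.requesting = (timestamp, self.pid)

    def on_request(self, other_req):
        """Decide whether to reply now (True) or defer (False)."""
        if self.requesting is None or has_priority(other_req, self.requesting):
            return True
        self.deferred.append(other_req)
        return False

    def release_cs(self):
        granted, self.deferred = self.deferred, []
        self.requesting = None
        return granted  # replies now sent to everyone we deferred

p1, p2 = Process(1), Process(2)
p1.request_cs(timestamp=5)
p2.request_cs(timestamp=3)       # earlier timestamp: higher priority
print(p1.on_request((3, 2)))     # True  -- p2's request is earlier, reply now
print(p2.on_request((5, 1)))     # False -- p2's own request is earlier, defer
```

Because all processes apply the same total order on requests, at most one process can have collected replies from everyone at a time.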
Threads and Multithreading
Threads in a distributed system are units of execution that can run concurrently within a process, which itself can be distributed across multiple machines. While the concept of a thread as a unit of execution remains the same, its application in a distributed system introduces complexities related to communication, synchronization, and data sharing across network boundaries.
Multithreading is a valuable technique for improving the performance and responsiveness of individual processes within a distributed system. However, achieving effective concurrency across the entire distributed application requires careful consideration of inter-process communication, distributed synchronization, data consistency, and fault tolerance. Developers must employ appropriate distributed computing paradigms and tools to manage these complexities.
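Within a single process, the standard synchronization tools still apply; the sketch below shows a shared counter protected by a lock. This only guards threads on one machine, which is exactly why the distributed coordination mechanisms discussed above are needed once state is shared across nodes.

```python
# Minimal sketch: a shared counter protected by a lock within one process.
# Across machines, the same guarantee requires a distributed lock service.
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:              # local mutual exclusion only
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 with the lock; without it, often less due to lost updates
```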
Clock Synchronization
Clock synchronization in distributed systems refers to the process of ensuring that all clocks across various nodes or computers in the system are set to the same time or at least have their times closely aligned. In a distributed system, where multiple computers communicate and collaborate over a network, each computer typically has its own local clock. Due to factors such as hardware differences, network delays, and clock drift (inaccuracies in timekeeping), these local clocks can drift apart over time.
Physical Clock Synchronization
In distributed systems, each node operates with its own clock, which can lead to time differences. The goal of physical clock synchronization is to overcome this challenge by aligning the actual time readings.
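A classic physical-synchronization technique is Cristian's algorithm: the client asks a time server for its clock value and, assuming network delay is roughly symmetric, splits the round-trip time to estimate its offset. The function below is a minimal sketch of that arithmetic with made-up clock readings.

```python
# Sketch of Cristian's algorithm: estimate the server's current time by
# assuming the network delay is symmetric in both directions.
def cristian_offset(t_request, t_server, t_reply):
    """Estimate the client clock's offset from one request/reply exchange.

    t_request, t_reply: client clock at send and receive.
    t_server: server clock value carried in the reply.
    """
    round_trip = t_reply - t_request
    estimated_server_now = t_server + round_trip / 2
    return estimated_server_now - t_reply  # add this to the client clock

# Client sends at 100.0, server stamps 105.2, reply arrives at 100.4:
print(cristian_offset(100.0, 105.2, 100.4))  # ~5.0 -- client is ~5s behind
```

The error of the estimate is bounded by half the round-trip time, which is why short, repeated exchanges give better synchronization.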
Logical Clock Synchronization
In distributed systems, exact physical time often matters less than the relative order of events. Logical clocks (such as Lamport clocks and vector clocks) capture this ordering, tracking causality between events rather than wall-clock time.
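The simplest logical clock is Lamport's: each process keeps a counter, increments it on every local event, stamps outgoing messages, and on receipt advances to one past the maximum of its own counter and the message's stamp. A minimal sketch:

```python
# Minimal Lamport logical clock: counters that respect "happened-before".
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                 # local event
        self.time += 1
        return self.time

    def send(self):                 # stamp an outgoing message
        return self.tick()

    def receive(self, msg_time):    # merge the sender's timestamp
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()                 # a: 1
stamp = a.send()         # a: 2; the message carries timestamp 2
b.tick()                 # b: 1
print(b.receive(stamp))  # b: max(1, 2) + 1 = 3
```

Note the guarantee is one-directional: if event x causally precedes y, then x's timestamp is smaller, but equal or ordered timestamps alone do not prove causality; vector clocks are needed for that.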
Distributed File Systems (DFS)
Goals of DFS
The primary goals of a Distributed File System (DFS) are to enable users of geographically dispersed systems to seamlessly share data and resources. This involves presenting a unified view of files, regardless of their actual physical location. Key goals include:
- Transparency: Hiding the distributed nature from users, making access feel local.
- Availability: Ensuring continuous access to files even in the face of server or network failures through techniques like data replication.
- Reliability: Protecting against data loss through redundancy and fault tolerance mechanisms.
- Scalability: Allowing the system to expand to accommodate increasing data and user loads without significant performance degradation.
- Performance: Providing acceptable access times, often achieved through caching and data distribution strategies.
Design Issues of DFS
- Naming and Name Resolution: How files are named and how these names are translated to their physical locations in a potentially large and dynamic system.
- Caching: Caching data at client machines improves performance but introduces the challenge of maintaining cache consistency when data is modified by other clients.
- File Access Semantics: Defining the behavior of concurrent read and write operations.
- Fault Tolerance: Designing the system to handle server failures, network partitions, and data corruption. Replication, redundancy, and robust recovery mechanisms are crucial.
- Scalability: Designing the system architecture and algorithms to efficiently handle a growing number of clients, servers, and data. This involves considering issues like load balancing and data partitioning.
Types of Distributed File Systems
Client-Server Architectures
- Network File System (NFS): A widely used protocol where a server provides file access to clients over a network, allowing users to view, store, and update files remotely.
- Server Message Block (SMB): Also known as Common Internet File System (CIFS), SMB is another protocol for file sharing, allowing computers to perform read and write operations on files on a remote host.
Peer-to-Peer Distributed File Systems
All nodes in the system act as both clients and servers, sharing their local storage. An example is the InterPlanetary File System (IPFS).
Cluster-Based Distributed File Systems
A group of interconnected storage nodes work together to provide a single, high-performance file system. Examples include Lustre, GlusterFS, and Ceph.
Cloud-Based Distributed File Systems
Leverage cloud storage services provided by vendors like Amazon (S3), Microsoft (Azure Blob Storage), and Google (Cloud Storage). These often have distributed architectures underneath.
Distributed Algorithms and Protocols
Leader Election Algorithms
Bully Algorithm
The Bully Algorithm is a leader election algorithm used in a distributed system where processes have unique IDs and can communicate with each other. It aims to elect a coordinator (leader), with the process having the highest ID ultimately winning the election.
- Initiation: When a node detects that the current leader has failed (usually through a timeout mechanism), it initiates an election.
- Election Process:
- The initiating node sends an “election” message to all other nodes with higher priority (higher IDs).
- Any live node with a higher ID responds with an acknowledgement and takes over by starting its own election.
- If no higher-ID node responds within a timeout, the initiating node declares itself the leader.
- Notification: The newly elected leader informs all nodes of its leadership status, ensuring consistency across the distributed system.
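The steps above can be condensed into a toy simulation: each election hands control to a higher live node until no higher node answers, so the highest live ID always wins. This sketch models reachability with a plain set of live IDs rather than real messages.

```python
# Toy simulation of the Bully algorithm over a set of live node IDs.
def bully_election(initiator, alive_ids):
    """Return the elected leader: the highest live ID."""
    higher = [pid for pid in alive_ids if pid > initiator]
    if not higher:
        return initiator            # no higher ID answered: initiator wins
    # A responding higher node takes over and runs its own election;
    # this unwinds until the highest live ID declares itself leader.
    return bully_election(min(higher), alive_ids)

alive = {1, 2, 4, 5}                # node 6 (the old leader) has crashed
print(bully_election(1, alive))     # 5
```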
Ring Algorithm
The Ring Election Algorithm is a method used in distributed systems to elect a leader among a group of interconnected nodes arranged in a logical ring structure. It ensures that only one node in the network becomes the leader, facilitating coordination and decision-making.
When a coordinator fails, a process initiates an election by sending a message with its ID to its successor. The message circulates, with each process adding its ID. When the message returns to the initiator, the highest ID in the message is declared the new coordinator. A COORDINATOR message then circulates to inform all processes. This ensures the highest active process becomes the leader, with message complexity proportional to the number of processes.
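The circulation step can be sketched as a single pass around the ring in which every process appends its ID, after which the initiator picks the maximum. The function below is a simplified model that assumes all listed processes are alive.

```python
# Toy ring election: the ELECTION message circulates once, collecting IDs.
def ring_election(ids_in_ring, initiator_index):
    """ids_in_ring: live process IDs in ring order. Returns the coordinator."""
    n = len(ids_in_ring)
    message = []                        # IDs collected as the message circulates
    for step in range(n):
        holder = ids_in_ring[(initiator_index + step) % n]
        message.append(holder)          # each process appends its own ID
    coordinator = max(message)          # back at the initiator: pick highest
    return coordinator                  # a COORDINATOR message then circulates

print(ring_election([3, 7, 2, 9, 5], initiator_index=2))  # 9
```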
Remote Procedure Call (RPC)
Remote Procedure Call (RPC) is a protocol used in distributed systems that allows a program to execute a procedure (subroutine) on a remote server or system as if it were a local procedure call. RPC enables a client to invoke methods on a server residing in a different address space (often on a different machine) as if they were local procedures. The client and server communicate over a network, allowing for remote interaction and computation. RPC plays a crucial role in distributed systems by enabling seamless communication and interaction between different components or services that reside on separate machines or servers.
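The "as if it were local" property can be demonstrated with Python's built-in `xmlrpc` modules, one of many RPC frameworks; here both endpoints run on the local machine purely for illustration.

```python
# Minimal RPC demonstration using Python's standard xmlrpc modules:
# the client invokes add() like a local function; the call actually
# travels over the network to the server process.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

client = ServerProxy(f"http://127.0.0.1:{port}")
result = client.add(2, 3)   # looks local; marshalled, sent, executed remotely
print(result)               # 5
server.shutdown()
```

Under the hood, the client stub marshals the arguments into a request, the server stub unmarshals them, runs the procedure, and ships the result back, which is exactly the transparency RPC aims for.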
Client-Server Model
The Client-Server model is a distributed application structure that partitions tasks or workloads between the providers of a resource or service, called servers, and service requesters, called clients. In this architecture, when the client computer sends a request for data to the server through the internet, the server accepts the requested process and delivers the requested data packets back to the client. Clients typically do not share any of their resources. Examples include Email and the World Wide Web.
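At the lowest level, this request/response exchange is just a socket conversation. The sketch below runs a one-shot TCP server and a client in the same process to show the pattern; real deployments put the two ends on different machines.

```python
# Bare-bones client-server exchange over TCP sockets: the client sends a
# request, the server processes it and returns a response.
import socket
import threading

def serve_once(listener):
    conn, _ = listener.accept()
    with conn:
        request = conn.recv(1024)
        conn.sendall(b"echo: " + request)   # the requested data, sent back

listener = socket.create_server(("127.0.0.1", 0))
port = listener.getsockname()[1]
threading.Thread(target=serve_once, args=(listener,), daemon=True).start()

with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello")                # client request
    reply = client.recv(1024).decode()
print(reply)                                # echo: hello
listener.close()
```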
Distributed System Middleware and Technologies
Object Linking and Embedding (OLE)
OLE (Object Linking and Embedding) in a distributed system allows applications to share and manipulate data across different processes or even machines. It enables a client application to embed or link objects (like documents, spreadsheets, or images) created by a server application. In a distributed context, OLE can facilitate rich data integration across a network. However, its tight coupling with the Windows operating system historically limited its widespread use in heterogeneous distributed environments. Modern approaches often favor platform-independent technologies for cross-system data sharing.
ActiveX
ActiveX is a software framework from Microsoft that allows applications to share functionality and data with one another, often through web browsers, regardless of the programming language they are written in. ActiveX add-ons allowed early web browsers to embed multimedia files or deliver software updates to users. Many ActiveX controls run only on Windows and with Microsoft products such as Internet Explorer, Word, and Excel. JavaScript and similar platforms are now far more widely used than ActiveX.
ActiveX controls are pre-coded software similar to web browser plug-ins. For example, a web page displaying a Flash file might require a user to download a Flash ActiveX control so the file can be played directly in-browser without opening a new application.
Orbix (CORBA Implementation)
Orbix is a software platform that facilitates communication and integration between distributed, object-oriented applications. It is a full implementation of the Object Management Group (OMG) Common Object Request Broker Architecture (CORBA), enabling programs to communicate across a network, regardless of the programming languages or operating systems used. Orbix acts as a middleware, abstracting the complexities of network programming so that developers can treat objects as if they were local.
Kerberos Authentication Protocol
Kerberos is an authentication protocol for computer networks, developed at MIT in the 1980s for Project Athena. It involves three actors: the client, the application server, and a trusted third party, the Key Distribution Center (KDC), which mediates authentication between the client and the network. This mechanism prevents unauthorized network access and is strengthened by Kerberos's use of symmetric-key cryptography. Because domain controllers in Active Directory Domain Services (AD DS) require authentication for host access, Kerberos is used extensively in Windows networks.
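The core idea, a trusted third party distributing a session key sealed separately for each side, can be modeled with a toy exchange. The `seal`/`unseal` helpers below are stand-ins for real symmetric encryption, and the key names are invented; this illustrates the trust structure only, not the actual Kerberos message formats.

```python
# Toy model of the Kerberos trust structure (no real cryptography):
# the KDC shares a long-term key with each principal and hands out a fresh
# session key two ways -- sealed for the client, and sealed inside the
# "ticket" that only the server can open.
import secrets

def seal(key, payload):        # stand-in for symmetric encryption
    return ("sealed", key, payload)

def unseal(key, box):
    tag, box_key, payload = box
    assert tag == "sealed" and box_key == key, "wrong key"
    return payload

kdc_keys = {"alice": "K_alice", "fileserver": "K_fs"}   # long-term keys

# 1. Client asks the KDC for access to "fileserver"; KDC mints a session key.
session_key = secrets.token_hex(8)
for_client = seal(kdc_keys["alice"], session_key)
ticket = seal(kdc_keys["fileserver"], ("alice", session_key))

# 2. The client recovers the session key; it cannot open the ticket itself.
client_session = unseal("K_alice", for_client)

# 3. The server opens the ticket, learning the client's identity and the key.
client_name, server_session = unseal("K_fs", ticket)
print(client_name, client_session == server_session)    # alice True
```

Both sides now hold the same session key without ever having exchanged a secret directly, which is what lets the KDC act as the intermediary described above.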
Distributed Shared Memory (DSM)
DSM (Distributed Shared Memory) servers are components in a distributed system that implement the shared memory model across nodes that do not physically share memory. These servers manage a portion of the virtual shared address space and handle requests from client nodes to access data within this space.
When a process on a client node attempts to access a memory location in the shared space, and that location is not locally available (e.g., in its cache), a request is sent to the DSM server responsible for that memory region. The server then fetches the data from the node that owns it and delivers it to the requesting client. Different DSM architectures exist, including centralized server, migration-based, and replication-based approaches, each with its own strategies for managing data location, consistency, and coherence across the distributed memory.
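The centralized-server variant can be sketched as a single server that owns the page table for the shared address space and answers read/write requests from clients. The class below is a hypothetical simplification: one server, no caching, no coherence protocol.

```python
# Sketch of a centralized-server DSM: one server owns the page table and
# services all read/write requests for the shared address space.
class DSMServer:
    def __init__(self, page_size=4):
        self.page_size = page_size
        self.pages = {}                      # page number -> list of words

    def _locate(self, address):
        page, offset = divmod(address, self.page_size)
        if page not in self.pages:
            self.pages[page] = [0] * self.page_size   # zero-filled on demand
        return page, offset

    def read(self, address):
        page, offset = self._locate(address)
        return self.pages[page][offset]

    def write(self, address, value):
        page, offset = self._locate(address)
        self.pages[page][offset] = value

server = DSMServer()
server.write(10, 42)        # a client on node A writes to the shared space
print(server.read(10))      # a client on node B reads the same address: 42
print(server.read(3))       # untouched locations read as 0
```

Migration- and replication-based designs distribute these pages across nodes instead, trading the central server's simplicity for lower latency and the coherence machinery that comes with it.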
Common Object Request Broker Architecture (CORBA)
CORBA (Common Object Request Broker Architecture) is a middleware technology that allows distributed objects in a networked environment to communicate with one another. It is a specification that describes how objects written in different programming languages communicate. CORBA is a platform- and language-independent approach to distributed computing.
The core idea behind CORBA is that a client application sends a service request to a server object, which provides the requested service and returns the results to the client. The Object Request Broker (ORB) is a middleware component that achieves this, managing communication between the client and the server.
Distributed Component Object Model (DCOM)
Distributed Component Object Model (DCOM) is a Microsoft proprietary technology that allows software components to communicate and interact across networked computers running Windows operating systems. It extends the Component Object Model (COM), which enables communication between software components on a single machine, to a distributed environment. DCOM simplifies the development of client-server applications in a Windows-centric environment by abstracting away the complexities of network communication.
Sun Network File System (NFS)
The Sun Network File System (NFS) is a distributed file system protocol enabling clients to access remote files on a server as if they were local. In a distributed system, NFS facilitates shared access to data across multiple nodes, offering location transparency. It simplifies centralized storage management, allowing administrators to manage files on dedicated servers accessible by various clients, regardless of their operating systems. NFS employs a client-server model using RPC for communication, making resource sharing and collaboration more efficient within the distributed environment.
Defining Distributed Systems
Definition and Characteristics
A distributed system is a collection of computer programs that utilize computational resources across multiple, separate computation nodes to achieve a common, shared goal. Also known as distributed computing, it relies on separate nodes that communicate and synchronize over a common network. These nodes typically represent separate physical hardware devices but can also represent separate software processes or other recursively encapsulated systems.
Properties of Distributed Systems
Key properties include:
- Transparency: Hiding the details of distribution (such as a resource's location or access mechanism) from users and from other parts of the system, so that remote resources appear local.
- Performance: This metric measures the time needed to process user file access requests and includes processor time, network transmission time, and time needed to access the storage device and deliver the requested content.
- Scalability: As storage requirements increase, users typically deploy additional storage resources, allowing the system to grow efficiently.
Goals of Distributed Systems
The main goal of a distributed system is to make it easy for users to access remote resources and to share them with others in a controlled way. Other important goals include:
- Hiding the location of processes and resources (Location Transparency).
- Openness: Offering services according to standard rules that describe the syntax and semantics of those services.
- Scalability: A system can be scalable in several ways:
- Size scalability (adding more users and resources).
- Geographical scalability (users and resources can be geographically apart).
- Administrative scalability (possible to manage even if many administrative organizations are spanned).
Distributed Security Architecture
A distributed security architecture addresses the challenges of protecting resources and data spread across multiple interconnected nodes. It employs mechanisms like decentralized authentication and authorization, where trust management is distributed rather than centralized. Encryption, secure communication protocols, and distributed firewalls are crucial components, as are intrusion detection/prevention systems that monitor and filter network traffic at different points. Managing identities and access control across diverse services, and ensuring data confidentiality and integrity in a decentralized environment, are key considerations. Centralized security management tools provide a unified view for policy enforcement, monitoring, and incident response across the distributed infrastructure.
Distributed Coordination Based Systems
Distributed Coordination-Based Systems are complex networks of independent computers (nodes) working together to achieve common goals. These systems rely on coordination mechanisms to manage interactions and ensure consistent, reliable operations. Key coordination methods include consensus protocols (like Paxos and Raft), which help nodes agree on shared data or states, and distributed algorithms that handle tasks such as leader election and distributed transactions. In these systems, there is no central control; nodes communicate directly, share data, and synchronize their activities. Fault tolerance is achieved through data replication, redundancy, and failover mechanisms, allowing the system to function correctly even if some nodes fail.
Communication Models
Message-Oriented Communication
Message-oriented communication in a distributed system involves the exchange of discrete, self-contained units of data called messages between processes. These messages are typically addressed to a specific recipient and can include both data and control information. Systems employing message-oriented communication often provide mechanisms for message queuing, routing, and delivery guarantees, enabling asynchronous communication where senders and receivers do not need to interact simultaneously. Examples include message queues (like RabbitMQ or Kafka) and message-oriented middleware (MOM).
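The decoupling of senders and receivers can be shown in miniature with an in-process queue standing in for a broker; the producer threads and the consumer never interact directly, and messages wait in the queue until the consumer retrieves them.

```python
# In-process sketch of message-oriented communication: producers put
# self-contained messages into a queue; the consumer drains them later.
import queue
import threading

mailbox = queue.Queue()     # stand-in for a broker such as RabbitMQ

def producer(name, payloads):
    for p in payloads:
        mailbox.put({"from": name, "data": p})   # fire-and-forget send

threading.Thread(target=producer, args=("sensor-1", [21.5, 21.7])).start()
threading.Thread(target=producer, args=("sensor-2", [19.0])).start()

# The consumer runs whenever it likes; get() blocks until a message arrives.
received = [mailbox.get(timeout=1) for _ in range(3)]
print(sorted(msg["from"] for msg in received))
# ['sensor-1', 'sensor-1', 'sensor-2']
```

A real broker adds durability, routing, and delivery guarantees on top of this basic asynchronous pattern.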
Stream-Oriented Communication
Stream-oriented communication focuses on transmitting a continuous flow of data, often used for multimedia applications like audio and video streaming. Timing and ordering of data units are crucial in this model to maintain the integrity and quality of the stream. Protocols in stream-oriented communication often address issues like bandwidth management, latency, jitter, and synchronization to ensure a smooth and coherent data flow. Examples include protocols used in video conferencing or live streaming services.
