Understanding Distributed Systems: Concepts and Architectures
Network Fundamentals
Bandwidth (BW): Bits transmitted per unit time. Latency: Propagation + Transmission + Queueing. Propagation Delay: Distance / Propagation Speed (speed of light in the medium). Transmission Delay: Packet Size / BW.
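The delay components above can be combined in a quick calculation; the link parameters below (distance, packet size, bandwidth) are illustrative assumptions:

```python
# Hedged sketch: end-to-end latency for one packet, per the formulas above.
def latency(distance_m, packet_bits, bw_bps,
            prop_speed=2e8, queueing_s=0.0):
    """Latency = propagation + transmission + queueing."""
    propagation = distance_m / prop_speed   # distance / signal speed
    transmission = packet_bits / bw_bps     # packet size / bandwidth
    return propagation + transmission + queueing_s

# 3000 km fiber link, 12,000-bit packet, 1 Gbps, no queueing:
total = latency(3_000_000, 12_000, 1e9)
print(round(total * 1000, 3), "ms")   # 15.012 ms
```

Note how propagation dominates on long links: 15 ms of propagation versus 12 microseconds of transmission here.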
IPv4: 32 bits (4 x 8 bits). IPv6: 128 bits (8 x 16-bit blocks).
IP Addressing
- Class A: 0.0.0.0 – 127.255.255.255 (127 networks / 16M hosts per network)
- Class B: 128.0.0.0 – 191.255.255.255 (16K networks / 64K hosts per network)
- Class C: 192.0.0.0 – 223.255.255.255 (2M networks / 254 hosts per network)
- Class D: 224.0.0.0 – 239.255.255.255 (Multicast)
- Class E: 240.0.0.0 – 255.255.255.255 (Reserved)
Classless Inter-Domain Routing (CIDR): Described by a variable-length prefix and its length, written a.b.c.d/L; the first L bits identify the network and the remaining 32 - L bits identify the host. Ports: Identify processes on a host.
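The network/host split can be sketched with Python's standard-library ipaddress module (the example prefix is an arbitrary choice):

```python
# A /L prefix has L network bits and 32 - L host bits.
import ipaddress

net = ipaddress.ip_network("192.168.16.0/22")
print(net.prefixlen)          # 22 network bits (L)
print(32 - net.prefixlen)     # 10 host bits
print(net.num_addresses)      # 2**10 = 1024 addresses
print(ipaddress.ip_address("192.168.17.5") in net)  # True
```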
IP and Sockets
IP: Best-effort delivery. Socket: Point where an application attaches to the network. Interface between application and transport protocol.
Blocking send semantics (send queue size SQS, receive queue size RQS): if n > SQS, send(n) blocks until n - SQS bytes have been transferred to the receive queue. If n > SQS + RQS, it blocks until the receiver calls recv() enough times to drain the remaining n - (SQS + RQS) bytes.
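The blocking rule can be modeled directly; the SQS/RQS values below are illustrative:

```python
# Hedged model of the rule above: how many bytes must the receiver
# recv() before send(n) can complete? (0 = no dependence on recv.)
def bytes_receiver_must_read(n, sqs, rqs):
    if n <= sqs:
        return 0            # fits entirely in the send queue
    if n <= sqs + rqs:
        return 0            # blocks only until n - SQS bytes move to the receive queue
    return n - (sqs + rqs)  # receiver must drain the excess

print(bytes_receiver_must_read(10_000, 4096, 4096))  # 1808
print(bytes_receiver_must_read(100, 4096, 4096))     # 0
```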
HTTP and DNS
HTTP/1.0: Non-persistent connections (a new TCP connection per request). HTTP/1.1: Persistent connections (the TCP connection is reused across requests).
DNS Hierarchy:
- Root Servers: Identity hardwired into other servers.
- Top-Level Domains (TLD): .org, .edu, .net, etc.
- Authoritative Servers: Information about an organization.
Local DNS: Located near clients. Resolver: Software running on clients. DNS Caching: Time-to-Live (TTL) – cache responses to queries.
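TTL-based caching can be sketched as a local resolver might do it; the class and method names below are hypothetical, not a real DNS library:

```python
# Minimal TTL cache sketch: cache responses, evict after TTL seconds.
import time

class DnsCache:
    def __init__(self):
        self._entries = {}              # name -> (address, expiry time)

    def put(self, name, address, ttl, now=None):
        now = time.time() if now is None else now
        self._entries[name] = (address, now + ttl)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        entry = self._entries.get(name)
        if entry is None or now >= entry[1]:
            return None                 # miss or expired: re-query upstream
        return entry[0]

cache = DnsCache()
cache.put("example.org", "93.184.216.34", ttl=60, now=0)
print(cache.get("example.org", now=30))   # 93.184.216.34 (within TTL)
print(cache.get("example.org", now=61))   # None (TTL expired)
```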
Content Delivery Networks (CDNs)
Components: Proxies, Caching, Load Balancing, Availability.
Reverse Proxy:
- DNS (Round Robin)
- Load Balancing (Network Address Translation)
- Static Content Caching (the proxy remembers responses so origin servers need not resend them)
Forward Proxies: Cache close to clients – less network traffic, less latency. Use ETag to identify changes.
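ETag revalidation can be sketched as follows; the origin's behavior is simulated here in plain Python, not a real HTTP client or server:

```python
# Sketch of conditional GET: the proxy sends If-None-Match with its
# cached ETag; a 304 means the cached copy is still fresh.
def revalidate(cached_etag, current_etag, current_body):
    """Simulated origin: 304 if the ETag still matches, else 200 + body."""
    if cached_etag == current_etag:
        return 304, None            # Not Modified: reuse cached copy
    return 200, current_body        # changed: new body (and new ETag)

status, body = revalidate('"v1"', '"v1"', b"...")
print(status)          # 304
status, body = revalidate('"v1"', '"v2"', b"new content")
print(status, body)    # 200 b'new content'
```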
Replica Goals
- Live Server: For availability.
- Lowest Load: Balance load.
- Closest: Round Trip Time (RTT).
- Performance: Throughput, Latency, Reliability.
CDN Example (Akamai): the origin's www hostname is rewritten to an Akamai alias, so requests resolve to a nearby edge server.
Remote Procedure Calls (RPC)
RPC: Easy-to-program network communication that makes client-server communication transparent. Makes communication appear like a local procedure call.
Interface Description Language (IDL): Defines the API for procedure calls: names, parameters, return types, marshaling, unmarshaling.
Server Stub: Two parts:
- Dispatcher: Receives a client’s RPC request.
- Skeleton: Unmarshals the request, calls the local procedure, marshals the response.
Protocol Buffers (Protobuf): An IDL.
- Data Structures: Called messages (Protocol Buffers).
- Services/Procedures: gRPC.
Each field is labeled with a number, so the message definition can change later while remaining wire-compatible with legacy code.
File Calculation
Per-file metadata is calculated as follows:
- Number of hashes = file size / block size, rounded up (one hash per block), plus 64 bytes for the filename and version.
- Then compute the total across all files.
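Assuming the calculation means one hash per block plus a fixed 64-byte filename/version overhead (the block and hash sizes below are assumptions, not given in the notes), a sketch:

```python
# Hedged sketch of the per-file metadata calculation above.
import math

def file_metadata_bytes(file_size, block_size=4096, hash_size=32):
    num_hashes = math.ceil(file_size / block_size)   # one hash per block
    return num_hashes * hash_size + 64               # + filename and version

print(file_metadata_bytes(1_000_000))   # 245 hashes -> 7904 bytes
print(file_metadata_bytes(4096))        # 1 hash -> 96 bytes
```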
Peer-to-Peer (P2P) Networks
Characteristics:
- No centralized control.
- Nodes are roughly symmetric.
Flooded Queries: O(N) lookup.
Distributed Hash Tables (DHT)
Key: Hash(data). Lookup(key): Returns IP address.
Modulo Hashing (Hash(key) mod n): changing n remaps almost every key, leading to heavy transfers. Consistent Hashing: Desired features:
- Balance: No bucket has too many objects.
- Smoothness: Addition/removal of a token minimizes object movements for other buckets.
Each physical node maintains v > 1 tokens, where each token corresponds to a virtual node. Each virtual node owns an expected 1/(vn) of the ID space. On join/leave, the redistributed load is spread across v small ranges rather than one large one => better load balancing with larger v.
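A minimal consistent-hashing ring with v virtual nodes (tokens) per physical node; using SHA-256 as the hash is an arbitrary choice for the sketch:

```python
# Each physical node places v tokens on the ring; a key is owned by
# the node whose token is first clockwise of the key's hash.
import bisect
import hashlib

def _h(s):
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, v=8):
        self._ring = sorted((_h(f"{n}#{i}"), n)
                            for n in nodes for i in range(v))
        self._keys = [t for t, _ in self._ring]

    def owner(self, key):
        i = bisect.bisect(self._keys, _h(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["A", "B", "C"], v=8)
print(ring.owner("photo-123"))   # one of A, B, C
```

Smoothness shows up when a node joins: the only keys that change owner are those the new node's tokens claim; everything else stays put.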
Chord Protocol
Key Space: [0, 2^m - 1]. Each node N keeps m finger-table entries; entry k (1 <= k <= m) points to the successor of (N + 2^(k-1)) mod 2^m.
Finger Table Implications:
- A binary lookup tree rooted at every node, threaded through other nodes' finger tables.
- Better than arranging nodes in a single tree.
- Each node acts as a root, so there’s no root hotspot, no single point of failure, and a lot more state in total.
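The finger-table targets can be computed directly; the node IDs below form a small m = 6 ring for illustration:

```python
# Finger entry k of node N targets (N + 2**(k-1)) mod 2**m; the entry
# stores the successor (first live node clockwise) of that target.
def finger_targets(n, m):
    return [(n + 2 ** (k - 1)) % 2 ** m for k in range(1, m + 1)]

def successor(ident, nodes):
    """First node at or clockwise of ident on the ring (with wraparound)."""
    return min((x for x in sorted(nodes) if x >= ident),
               default=min(nodes))

nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # node IDs on an m = 6 ring
print(finger_targets(8, 6))                       # [9, 10, 12, 16, 24, 40]
print([successor(t, nodes) for t in finger_targets(8, 6)])
# [14, 14, 14, 21, 32, 42]
```

Each hop from a finger table at least halves the remaining distance to the key, which is where the O(log N) lookup bound comes from.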
Chord Algorithm Properties:
- Interface: lookup(key) -> IP address of the node responsible for the key.
- Efficient Lookup: O(log N)
- Scalable State: O(log N) state per node
- Robust: Survives massive failures.
Availability Metrics
MTBF: Mean Time Between Failures. MTTR: Mean Time To Repair. Availability: (MTBF – MTTR) / MTBF.
Yield: Queries completed / Queries offered. Harvest: Data available / Complete data. Data per query x queries per second -> constant.
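The metrics above as quick computations; the input numbers are illustrative:

```python
# Availability, yield, and harvest, per the definitions above.
def availability(mtbf, mttr):
    return (mtbf - mttr) / mtbf

print(availability(mtbf=1000.0, mttr=1.0))   # 0.999
print(9990 / 10000)    # yield: queries completed / queries offered
print(95 / 100)        # harvest: data available / complete data
```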
Canary Request: Send a request to one or two leaf servers for testing. The remaining servers are queried if the root gets a successful response from the canary in a reasonable period of time.
Two-Phase Commit (2PC)
2PC: All commit or none does.
Safety:
- If one commits, no one aborts.
- If one aborts, no one commits.
Liveness:
- If no failures and A and B can commit, the action commits.
- If failures, reach a conclusion ASAP.
If everyone rebooted and is reachable, the Transaction Coordinator (TC) can just check its disk for the commit record and resend the outcome. If the TC has no commit record -> abort. If a participant has no YES record on disk -> abort.
If a participant's YES record is on disk, it executes the termination protocol.
- This recovery protocol with non-volatile logging is called Two-Phase Commit (2PC).
- Safety: All hosts that decide reach the same decision.
- No commit unless everyone says "yes".
- Liveness: If no failures and all say "yes", then commit.
- But if failures then 2PC might block.
- TC must be up to decide.
- Doesn’t tolerate faults well: must wait for repair.
In general, entities only need to record promises they make; anything that is NOT a promise need not be recorded. Prepare inquiries from the TC are not promises, nor are NO responses. A participant can switch from NO to YES before responding, but never from YES to NO afterward.
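The decision rule can be stated compactly; in this sketch a timed-out participant is modeled as a None vote:

```python
# 2PC decision rule: commit only if every participant votes YES;
# any NO vote or timeout (None) forces abort.
def two_phase_commit(votes):
    if votes and all(v == "YES" for v in votes):
        return "COMMIT"
    return "ABORT"

print(two_phase_commit(["YES", "YES", "YES"]))   # COMMIT
print(two_phase_commit(["YES", "NO", "YES"]))    # ABORT
print(two_phase_commit(["YES", None]))           # ABORT (timeout)
```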
Quorum-Based Replication
Read Quorum: Contact Nr servers and select the highest version as the correct version.
Write Quorum: Contact Nw servers and increment the highest version -> write out the new version to the servers in the write quorum.
Constraints:
- Nr + Nw > N
- Nw > N/2
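A sketch of the constraints and the read rule: Nr + Nw > N forces every read quorum to intersect the latest write quorum, so a read always sees the newest version.

```python
# Quorum validity check plus the "highest version wins" read rule.
def valid_quorum(n, nr, nw):
    return nr + nw > n and nw > n / 2

def read(replicas, nr):
    """Contact Nr replicas; the highest version is the correct one."""
    return max(replicas[:nr], key=lambda r: r["version"])

print(valid_quorum(n=3, nr=2, nw=2))   # True
print(valid_quorum(n=3, nr=1, nw=2))   # False: quorums may not overlap

replicas = [{"version": 2, "value": "new"},
            {"version": 1, "value": "old"},
            {"version": 2, "value": "new"}]
print(read(replicas, nr=2)["value"])   # new
```

Nw > N/2 additionally prevents two concurrent write quorums from succeeding on disjoint replica sets.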
Raft Consensus Algorithm
AppendEntries():
- The leader uses this to "push" new operations to the replicated state machines.
- Also used by the leader to tell the other nodes it is the leader.
RequestVote():
- Used when the system starts up to select a leader.
- Used when the leader fails to elect a new leader.
- Used when the leader is unreachable due to a network partition to elect a new leader.
If electionTimeout elapses with no RPCs (100-500ms), the follower assumes the leader has crashed and starts a new election.
Election Properties
Safety: Allow at most one winner per term.
- Each server votes only once per term (persists on disk).
- Two different candidates can’t get majorities in the same term.
Liveness: Some candidate must eventually win.
- Each chooses election timeouts randomly in [T, 2T].
- One usually initiates and wins the election before others start.
- Works well if T >> network RTT.
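The randomized timeouts can be sketched as follows; T = 150 ms is an illustrative value from the 100-500 ms range mentioned above:

```python
# Each follower draws an election timeout uniformly from [T, 2T]; the
# follower whose timer fires first usually becomes candidate and wins
# before the others' timers expire.
import random

def election_timeouts(num_followers, t, rng):
    return [rng.uniform(t, 2 * t) for _ in range(num_followers)]

rng = random.Random(42)   # seeded for repeatability (arbitrary seed)
timeouts = election_timeouts(5, t=0.150, rng=rng)
first = min(range(5), key=lambda i: timeouts[i])
print(f"follower {first} times out first and becomes candidate")
print(all(0.150 <= x <= 0.300 for x in timeouts))   # True
```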
Dynamo
Gossip: Once per second, each node contacts a randomly chosen other node. They exchange their lists of known nodes (including virtual node IDs).
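A toy gossip round over in-memory membership sets; this sketch assumes every known node is also a key in the map:

```python
# One gossip round: each node contacts a randomly chosen peer and the
# two merge their lists of known nodes.
import random

def gossip_round(membership, rng):
    """membership: node -> set of known node IDs (including itself)."""
    for node in list(membership):
        peer = rng.choice([p for p in membership[node] if p != node])
        merged = membership[node] | membership[peer]
        membership[node] = membership[peer] = merged

views = {"A": {"A", "B"}, "B": {"B", "C"}, "C": {"C", "A"}}
rng = random.Random(1)
for _ in range(3):
    gossip_round(views, rng)
print(views["A"] == views["B"] == views["C"] == {"A", "B", "C"})  # True
```

Because information spreads epidemically, full membership typically converges in O(log N) rounds.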
Eventual Consistency
- Dynamo emphasizes availability over consistency when there are partitions.
- Tell the client the write is complete when only some replicas have stored it.
- Propagate to other replicas in the background.
- Allows writes in both partitions…but risks:
- Returning stale data.
- Write conflicts when the partition heals.
Sloppy Quorums
- If no failure, reap consistency benefits of a single master.
- Else sacrifice consistency to allow progress.
- Dynamo tries to store all values put() under a key on the first N live nodes of the coordinator's preference list.
- BUT to speed up get() and put():
  - Coordinator returns "success" for put when W nodes have acknowledged the write.
  - Coordinator returns "success" for get when R nodes have replied.
Suppose C fails. Node E is in the preference list. It needs to receive a replica of the data. Hinted Handoff: The replica at E points to node C. When C comes back, E forwards the replicated data back to C.
Dynamo: Take-Away Ideas
- Consistent hashing is broadly useful for replication.
- Extreme emphasis on availability and low latency, unusually, at the cost of some inconsistency.
- Eventual consistency lets writes and reads return quickly, even when partitions and failures occur.
- Context allows some conflicts to be resolved automatically; others are left to the application.
