Understanding Distributed Systems: Concepts and Architectures
Network Fundamentals
Bandwidth (BW): Bits transmitted per unit time. Latency: Propagation + Transmission + Queueing. Propagation Delay: Distance / Propagation Speed (speed of light in the medium). Transmission Delay: Packet Size / BW.
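The delay components above can be combined in a quick calculation; the link parameters below (distance, packet size, bandwidth) are illustrative assumptions:

```python
# Hedged sketch: end-to-end latency for one packet, per the formulas above.
def latency(distance_m, packet_bits, bw_bps,
            prop_speed=2e8, queueing_s=0.0):
    """Latency = propagation + transmission + queueing."""
    propagation = distance_m / prop_speed   # distance / signal speed
    transmission = packet_bits / bw_bps     # packet size / bandwidth
    return propagation + transmission + queueing_s

# 3000 km fiber link, 12,000-bit packet, 1 Gbps, no queueing:
total = latency(3_000_000, 12_000, 1e9)
print(round(total * 1000, 3), "ms")   # 15.012 ms
```

Note how propagation dominates on long links: 15 ms of propagation versus 12 microseconds of transmission here.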
IPv4: 32 bits (4 x 8 bits). IPv6: 128 bits (8 x 16-bit blocks).
IP Addressing
- Class A: 0.0.0.0 – 127.255.255.255 (127 networks / 16M hosts per network)
- Class B: 128.0.0.0 – 191.255.255.255 (16K networks / 64K hosts per network)
- Class C: 192.0.0.0 – 223.255.255.255 (2M networks / 254 hosts per network)
- Class D: 224.0.0.0 – 239.255.255.255 (Multicast)
- Class E: 240.0.0.0 – 255.255.255.255 (Reserved)
Classless Inter-Domain Routing (CIDR): Described by a variable-length prefix and its length, written a.b.c.d/L; the first L bits identify the network and the remaining 32 - L bits identify the host. Ports: Identify processes on a host.
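The network/host split can be sketched with Python's standard-library ipaddress module (the example prefix is an arbitrary choice):

```python
# A /L prefix has L network bits and 32 - L host bits.
import ipaddress

net = ipaddress.ip_network("192.168.16.0/22")
print(net.prefixlen)          # 22 network bits (L)
print(32 - net.prefixlen)     # 10 host bits
print(net.num_addresses)      # 2**10 = 1024 addresses
print(ipaddress.ip_address("192.168.17.5") in net)  # True
```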
IP and Sockets
IP: Best-effort delivery. Socket: Point where an application attaches to the network. Interface between application and transport protocol.
Blocking send semantics (send queue size SQS, receive queue size RQS): if n > SQS, send(n) blocks until n - SQS bytes have been transferred to the receive queue. If n > SQS + RQS, it blocks until the receiver calls recv() enough times to drain the remaining n - (SQS + RQS) bytes.
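The blocking rule can be modeled directly; the SQS/RQS values below are illustrative:

```python
# Hedged model of the rule above: how many bytes must the receiver
# recv() before send(n) can complete? (0 = no dependence on recv.)
def bytes_receiver_must_read(n, sqs, rqs):
    if n <= sqs:
        return 0            # fits entirely in the send queue
    if n <= sqs + rqs:
        return 0            # blocks only until n - SQS bytes move to the receive queue
    return n - (sqs + rqs)  # receiver must drain the excess

print(bytes_receiver_must_read(10_000, 4096, 4096))  # 1808
print(bytes_receiver_must_read(100, 4096, 4096))     # 0
```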
HTTP and DNS
HTTP/1.0: Non-persistent connections (a new TCP connection per request). HTTP/1.1: Persistent connections (the TCP connection is reused across requests).
DNS Hierarchy:
- Root Servers: Identity hardwired into other servers.
- Top-Level Domains (TLD): .org, .edu, .net, etc.
- Authoritative Servers: Information about an organization.
Local DNS: Located near clients. Resolver: Software running on clients. DNS Caching: Time-to-Live (TTL) – cache responses to queries.
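TTL-based caching can be sketched as a local resolver might do it; the class and method names below are hypothetical, not a real DNS library:

```python
# Minimal TTL cache sketch: cache responses, evict after TTL seconds.
import time

class DnsCache:
    def __init__(self):
        self._entries = {}              # name -> (address, expiry time)

    def put(self, name, address, ttl, now=None):
        now = time.time() if now is None else now
        self._entries[name] = (address, now + ttl)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        entry = self._entries.get(name)
        if entry is None or now >= entry[1]:
            return None                 # miss or expired: re-query upstream
        return entry[0]

cache = DnsCache()
cache.put("example.org", "93.184.216.34", ttl=60, now=0)
print(cache.get("example.org", now=30))   # 93.184.216.34 (within TTL)
print(cache.get("example.org", now=61))   # None (TTL expired)
```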
Content Delivery Networks (CDNs)
Components: Proxies, Caching, Load Balancing, Availability.
Reverse Proxy:
- DNS (Round Robin)
- Load Balancing (Network Address Translation)
- Static Content Caching (the proxy remembers responses so origin servers need not resend them)
Forward Proxies: Cache close to clients – less network traffic, less latency. Use ETag to identify changes.
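ETag revalidation can be sketched as follows; the origin's behavior is simulated here in plain Python, not a real HTTP client or server:

```python
# Sketch of conditional GET: the proxy sends If-None-Match with its
# cached ETag; a 304 means the cached copy is still fresh.
def revalidate(cached_etag, current_etag, current_body):
    """Simulated origin: 304 if the ETag still matches, else 200 + body."""
    if cached_etag == current_etag:
        return 304, None            # Not Modified: reuse cached copy
    return 200, current_body        # changed: new body (and new ETag)

status, body = revalidate('"v1"', '"v1"', b"...")
print(status)          # 304
status, body = revalidate('"v1"', '"v2"', b"new content")
print(status, body)    # 200 b'new content'
```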
Replica Goals
- Live Server: For availability.
- Lowest Load: Balance load.
- Closest: Round Trip Time (RTT).
- Performance: Throughput, Latency, Reliability.
CDN Example (Akamai): the origin's www hostname is rewritten to an Akamai alias, so requests resolve to a nearby edge server.
Remote Procedure Calls (RPC)
RPC: Easy-to-program network communication that makes client-server communication transparent. Makes communication appear like a local procedure call.
Interface Description Language (IDL): Defines the API for procedure calls: names, parameters, return types, marshaling, unmarshaling.
Server Stub: Two parts:
- Dispatcher: Receives a client’s RPC request.
- Skeleton: Unmarshals the request, calls the local procedure, marshals the response.
Protocol Buffers (Protobuf): An IDL.
- Data Structures: Called messages (Protocol Buffers).
- Services/Procedures: gRPC.
Each field is labeled with a number, so the message definition can change later while remaining wire-compatible with legacy code.
File Calculation
Per-file metadata is calculated as follows:
- Number of hashes = file size / block size, rounded up (one hash per block), plus 64 bytes for the filename and version.
- Then compute the total across all files.
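Assuming the calculation means one hash per block plus a fixed 64-byte filename/version overhead (the block and hash sizes below are assumptions, not given in the notes), a sketch:

```python
# Hedged sketch of the per-file metadata calculation above.
import math

def file_metadata_bytes(file_size, block_size=4096, hash_size=32):
    num_hashes = math.ceil(file_size / block_size)   # one hash per block
    return num_hashes * hash_size + 64               # + filename and version

print(file_metadata_bytes(1_000_000))   # 245 hashes -> 7904 bytes
print(file_metadata_bytes(4096))        # 1 hash -> 96 bytes
```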
Peer-to-Peer (P2P) Networks
Characteristics:
- No centralized control.
- Nodes are roughly symmetric.
Flooded Queries: O(N) lookup.
Distributed Hash Tables (DHT)
Key: Hash(data). Lookup(key): Returns IP address.
Modulo Hashing (Hash(key) mod n): changing n remaps almost every key, leading to heavy transfers. Consistent Hashing: Desired features:
- Balance: No bucket has too many objects.
- Smoothness: Addition/removal of a token minimizes object movements for other buckets.
Each physical node maintains v > 1 tokens, where each token corresponds to a virtual node. Each virtual node owns an expected 1/(vn) of the ID space. On join/leave, the redistributed load is spread across v small ranges rather than one large one => better load balancing with larger v.
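A minimal consistent-hashing ring with v virtual nodes (tokens) per physical node; using SHA-256 as the hash is an arbitrary choice for the sketch:

```python
# Each physical node places v tokens on the ring; a key is owned by
# the node whose token is first clockwise of the key's hash.
import bisect
import hashlib

def _h(s):
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, v=8):
        self._ring = sorted((_h(f"{n}#{i}"), n)
                            for n in nodes for i in range(v))
        self._keys = [t for t, _ in self._ring]

    def owner(self, key):
        i = bisect.bisect(self._keys, _h(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["A", "B", "C"], v=8)
print(ring.owner("photo-123"))   # one of A, B, C
```

Smoothness shows up when a node joins: the only keys that change owner are those the new node's tokens claim; everything else stays put.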
Chord Protocol
Key Space: [0, 2^m - 1]. Each node N keeps m finger-table entries; entry k (1 <= k <= m) points to the successor of (N + 2^(k-1)) mod 2^m.
Finger Table Implications:
- A binary lookup tree rooted at every node, threaded through other nodes' finger tables.
- Better than arranging nodes in a single tree.
- Each node acts as a root, so there’s no root hotspot, no single point of failure, and a lot more state in total.
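The finger-table targets can be computed directly; the node IDs below form a small m = 6 ring for illustration:

```python
# Finger entry k of node N targets (N + 2**(k-1)) mod 2**m; the entry
# stores the successor (first live node clockwise) of that target.
def finger_targets(n, m):
    return [(n + 2 ** (k - 1)) % 2 ** m for k in range(1, m + 1)]

def successor(ident, nodes):
    """First node at or clockwise of ident on the ring (with wraparound)."""
    return min((x for x in sorted(nodes) if x >= ident),
               default=min(nodes))

nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # node IDs on an m = 6 ring
print(finger_targets(8, 6))                       # [9, 10, 12, 16, 24, 40]
print([successor(t, nodes) for t in finger_targets(8, 6)])
# [14, 14, 14, 21, 32, 42]
```

Each hop from a finger table at least halves the remaining distance to the key, which is where the O(log N) lookup bound comes from.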
Chord Algorithm Properties:
- Interface: lookup(key) -> IP address of the node responsible for the key.
- Efficient Lookup: O(log N)
- Scalable State: O(log N) state per node
- Robust: Survives massive failures.
Availability Metrics
MTBF: Mean Time Between Failures. MTTR: Mean Time To Repair. Availability: (MTBF – MTTR) / MTBF.
Yield: Queries completed / Queries offered. Harvest: Data available / Complete data. Data per query x queries per second -> constant.
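The metrics above as quick computations; the input numbers are illustrative:

```python
# Availability, yield, and harvest, per the definitions above.
def availability(mtbf, mttr):
    return (mtbf - mttr) / mtbf

print(availability(mtbf=1000.0, mttr=1.0))   # 0.999
print(9990 / 10000)    # yield: queries completed / queries offered
print(95 / 100)        # harvest: data available / complete data
```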
Canary Request: Send a request to one or two leaf servers for testing. The remaining servers are queried if the root gets a successful response from the canary in a reasonable period of time.
Two-Phase Commit (2PC)
2PC: All commit or none does.
Safety:
- If one commits, no one aborts.
- If one aborts, no one commits.
Liveness:
- If no failures and A and B can commit, the action commits.
- If failures, reach a conclusion ASAP.
If everyone rebooted and is reachable, the Transaction Coordinator (TC) can just check its disk for the commit record and resend the outcome. If the TC has no commit record -> abort. If a participant has no YES record on disk -> abort.
If a participant's YES record is on disk, it executes the termination protocol.
- This recovery protocol with non-volatile logging is called Two-Phase Commit (2PC).
- Safety: All hosts that decide reach the same decision.
- No commit unless everyone says "yes".
- Liveness: If no failures and all say "yes", then commit.
- But if failures then 2PC might block.
- TC must be up to decide.
- Doesn’t tolerate faults well: must wait for repair.
In general, entities only need to record promises they make; anything that is NOT a promise need not be recorded. Prepare inquiries from the TC are not promises, nor are NO responses. A participant can switch from NO to YES before responding, but never from YES to NO afterward.
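The decision rule can be stated compactly; in this sketch a timed-out participant is modeled as a None vote:

```python
# 2PC decision rule: commit only if every participant votes YES;
# any NO vote or timeout (None) forces abort.
def two_phase_commit(votes):
    if votes and all(v == "YES" for v in votes):
        return "COMMIT"
    return "ABORT"

print(two_phase_commit(["YES", "YES", "YES"]))   # COMMIT
print(two_phase_commit(["YES", "NO", "YES"]))    # ABORT
print(two_phase_commit(["YES", None]))           # ABORT (timeout)
```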
Quorum-Based Replication
Read Quorum: Contact Nr servers and select the highest version as the correct version.
Write Quorum: Contact Nw servers and increment the highest version -> write out the new version to the servers in the write quorum.
Constraints:
- Nr + Nw > N
- Nw > N/2
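A sketch of the constraints and the read rule: Nr + Nw > N forces every read quorum to intersect the latest write quorum, so a read always sees the newest version.

```python
# Quorum validity check plus the "highest version wins" read rule.
def valid_quorum(n, nr, nw):
    return nr + nw > n and nw > n / 2

def read(replicas, nr):
    """Contact Nr replicas; the highest version is the correct one."""
    return max(replicas[:nr], key=lambda r: r["version"])

print(valid_quorum(n=3, nr=2, nw=2))   # True
print(valid_quorum(n=3, nr=1, nw=2))   # False: quorums may not overlap

replicas = [{"version": 2, "value": "new"},
            {"version": 1, "value": "old"},
            {"version": 2, "value": "new"}]
print(read(replicas, nr=2)["value"])   # new
```

Nw > N/2 additionally prevents two concurrent write quorums from succeeding on disjoint replica sets.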
Raft Consensus Algorithm
AppendEntries():
- The leader uses this to "push" new operations to the replicated state machines.
- Also used by the leader to tell the other nodes it is the leader.
RequestVote():
- Used when the system starts up to select a leader.
- Used when the leader fails to elect a new leader.
- Used when the leader is unreachable due to a network partition to elect a new leader.
If electionTimeout elapses with no RPCs (100-500ms), the follower assumes the leader has crashed and starts a new election.
Election Properties
Safety: Allow at most one winner per term.
- Each server votes only once per term (persists on disk).
- Two different candidates can’t get majorities in the same term.
Liveness: Some candidate must eventually win.
- Each chooses election timeouts randomly in [T, 2T].
- One usually initiates and wins the election before others start.
- Works well if T >> network RTT.
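The randomized timeouts can be sketched as follows; T = 150 ms is an illustrative value from the 100-500 ms range mentioned above:

```python
# Each follower draws an election timeout uniformly from [T, 2T]; the
# follower whose timer fires first usually becomes candidate and wins
# before the others' timers expire.
import random

def election_timeouts(num_followers, t, rng):
    return [rng.uniform(t, 2 * t) for _ in range(num_followers)]

rng = random.Random(42)   # seeded for repeatability (arbitrary seed)
timeouts = election_timeouts(5, t=0.150, rng=rng)
first = min(range(5), key=lambda i: timeouts[i])
print(f"follower {first} times out first and becomes candidate")
print(all(0.150 <= x <= 0.300 for x in timeouts))   # True
```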
Dynamo
Gossip: Once per second, each node contacts a randomly chosen other node. They exchange their lists of known nodes (including virtual node IDs).
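A toy gossip round over in-memory membership sets; this sketch assumes every known node is also a key in the map:

```python
# One gossip round: each node contacts a randomly chosen peer and the
# two merge their lists of known nodes.
import random

def gossip_round(membership, rng):
    """membership: node -> set of known node IDs (including itself)."""
    for node in list(membership):
        peer = rng.choice([p for p in membership[node] if p != node])
        merged = membership[node] | membership[peer]
        membership[node] = membership[peer] = merged

views = {"A": {"A", "B"}, "B": {"B", "C"}, "C": {"C", "A"}}
rng = random.Random(1)
for _ in range(3):
    gossip_round(views, rng)
print(views["A"] == views["B"] == views["C"] == {"A", "B", "C"})  # True
```

Because information spreads epidemically, full membership typically converges in O(log N) rounds.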
Eventual Consistency
- Dynamo emphasizes availability over consistency when there are partitions.
- Tell the client the write is complete when only some replicas have stored it.
- Propagate to other replicas in the background.
- Allows writes in both partitions…but risks:
- Returning stale data.
- Write conflicts when the partition heals.
Sloppy Quorums
- If no failure, reap consistency benefits of a single master.
- Else sacrifice consistency to allow progress.
- Dynamo tries to store all values put() under a key on the first N live nodes of the coordinator's preference list.
- BUT to speed up get() and put():
  - Coordinator returns "success" for put when W nodes have acknowledged the write.
  - Coordinator returns "success" for get when R nodes have replied.
Suppose C fails. Node E is in the preference list. It needs to receive a replica of the data. Hinted Handoff: The replica at E points to node C. When C comes back, E forwards the replicated data back to C.
Dynamo: Take-Away Ideas
- Consistent hashing is broadly useful for replication.
- Extreme emphasis on availability and low latency, unusually, at the cost of some inconsistency.
- Eventual consistency lets writes and reads return quickly, even when partitions and failures occur.
- Context allows some conflicts to be resolved automatically; others are left to the application.
