Understanding Distributed Systems: Concepts and Architectures

Network Fundamentals

Bandwidth (BW): Bits transmitted per unit time. Latency = Propagation + Transmission + Queueing. Propagation Delay = Distance / Speed of Light (in the medium). Transmission Delay = Packet Size / BW.
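The latency components above can be sketched numerically (the link distance, bandwidth, and packet size below are made-up example values):

```python
# Illustrative latency calculation (all values are made up for the example).
SPEED_OF_LIGHT = 3e8  # m/s in vacuum; roughly 2e8 m/s in fiber

def propagation_delay(distance_m, speed=SPEED_OF_LIGHT):
    """Propagation delay = distance / signal speed."""
    return distance_m / speed

def transmission_delay(packet_bits, bw_bps):
    """Transmission delay = packet size / bandwidth."""
    return packet_bits / bw_bps

# A 1500-byte packet over a 100 Mbps link spanning 3000 km:
prop = propagation_delay(3_000_000)           # 0.01 s
trans = transmission_delay(1500 * 8, 100e6)   # 0.00012 s
latency = prop + trans                        # ignoring queueing delay here
```

Note that for long links the propagation term dominates; for slow links the transmission term does.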

IPv4: 32 bits (4 x 8 bits). IPv6: 128 bits (8 x 16-bit blocks).

IP Addressing

  • Class A: 0.0.0.0 – 127.255.255.255 (128 networks / ~16M hosts per network)
  • Class B: 128.0.0.0 – 191.255.255.255 (16K networks / 64K hosts per network)
  • Class C: 192.0.0.0 – 223.255.255.255 (2M networks / 254 hosts per network)
  • Class D: 224.0.0.0 – 239.255.255.255 (Multicast)
  • Class E: 240.0.0.0 – 255.255.255.255 (Reserved)

Classless Inter-Domain Routing (CIDR): An address is described by a variable-length prefix: the first L bits identify the network, the remaining 32 − L bits identify the host. Ports: Identify processes on a host.
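The prefix/host split can be inspected with Python's standard `ipaddress` module (the address block below is just an example):

```python
import ipaddress

# A /24 prefix: L = 24 network bits, 32 - 24 = 8 host bits.
net = ipaddress.ip_network("192.168.1.0/24")
print(net.prefixlen)        # 24 network bits
print(32 - net.prefixlen)   # 8 host bits
print(net.num_addresses)    # 2**8 = 256 addresses in the block
```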

IP and Sockets

IP: Best-effort delivery. Socket: Point where an application attaches to the network. Interface between application and transport protocol.

Blocking send() of n bytes, with send-queue size SQS and receive-queue size RQS: if n > SQS, send() blocks until n − SQS bytes have been transferred to the receive queue. If n > SQS + RQS, it blocks until the receiver calls recv() enough to drain n − (SQS + RQS) bytes.
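A minimal sketch of this arithmetic, assuming SQS and RQS denote the send and receive queue sizes in bytes (the function name and values are illustrative):

```python
def bytes_receiver_must_drain(n, sqs, rqs):
    """Bytes the receiver must recv() before a blocking send() of n
    bytes can complete (sqs/rqs: send/receive queue sizes, in bytes)."""
    if n <= sqs + rqs:
        # n <= SQS: fits entirely in the send queue, send() returns at once.
        # SQS < n <= SQS + RQS: the overflow fits in the receive queue, so
        # the sender blocks on the transfer but the receiver need not drain.
        return 0
    return n - (sqs + rqs)   # receiver must free this many bytes first

print(bytes_receiver_must_drain(1000, 300, 400))  # 300
```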

HTTP and DNS

HTTP/1.0: Non-persistent connections (one TCP connection per request). HTTP/1.1: Persistent connections (one connection reused across requests).

DNS Hierarchy:

  • Root Servers: Identity hardwired into other servers.
  • Top-Level Domains (TLD): .org, .edu, .net, etc.
  • Authoritative Servers: Information about an organization.

Local DNS: Located near clients. Resolver: Software running on clients. DNS Caching: Time-to-Live (TTL) – cache responses to queries.
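DNS caching with TTLs can be sketched as a small expiring map (the class name and the injected clock are illustrative, not a real resolver API):

```python
import time

class TTLCache:
    """Minimal DNS-style cache: each entry expires after its TTL."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}            # name -> (answer, expiry time)

    def put(self, name, answer, ttl):
        self._entries[name] = (answer, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        answer, expiry = entry
        if self._clock() >= expiry:   # TTL elapsed: must re-query upstream
            del self._entries[name]
            return None
        return answer
```

Injecting a fake clock makes the expiry behavior easy to demonstrate deterministically.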

Content Delivery Networks (CDNs)

Components: Proxies, Caching, Load Balancing, Availability.

Reverse Proxy:

  1. DNS (Round Robin)
  2. Load Balancing (Network Address Translation)
  3. Static Content Caching (the proxy serves cached content so origin servers don't have to)

Forward Proxies: Cache close to clients – less network traffic, less latency. Use ETag to identify changes.
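The ETag check is a conditional GET: the proxy sends the ETag of its cached copy, and the server replies 304 Not Modified if it still matches. A sketch of the server-side rule (function name and values are illustrative):

```python
def conditional_get(if_none_match, current_etag, body):
    """Handle a conditional GET: 304 if the cached ETag still matches,
    else 200 with the fresh body (and, in practice, a new ETag)."""
    if if_none_match == current_etag:
        return 304, None      # Not Modified: proxy keeps using its copy
    return 200, body          # content changed: send the new body

print(conditional_get('"v2"', '"v2"', b"..."))  # (304, None)
```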

Replica Goals

  1. Live Server: For availability.
  2. Lowest Load: Balance load.
  3. Closest: Round Trip Time (RTT).
  4. Performance: Throughput, Latency, Reliability.

CDN Example (Akamai): the site's URLs are rewritten, e.g., replacing the www. prefix with ak., so DNS resolves requests to Akamai's replica servers.

Remote Procedure Calls (RPC)

RPC: Easy-to-program network communication that makes client-server communication transparent. Makes communication appear like a local procedure call.

Interface Description Language (IDL): Defines the API for procedure calls: names, parameters, return types, marshaling, unmarshaling.

Server Stub: Two parts:

  1. Dispatcher: Receives a client’s RPC request.
  2. Skeleton: Unmarshals the request, calls the local procedure, marshals the response.

Protocol Buffers (Protobuf): An IDL.

  1. Data Structures: Called messages (Protocol Buffers).
  2. Services/Procedures: gRPC.

Each field is labeled with a number (tag), so the message definition can evolve (fields added or removed) while remaining compatible with legacy code.
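A toy illustration of why numbered tags give compatibility (this is NOT the real protobuf wire format, just the idea: a decoder keeps the tags it knows and skips the rest):

```python
# Tags this (old) decoder knows; a newer schema may add more.
KNOWN_FIELDS = {1: "name", 2: "id"}

def decode(fields):
    """fields: dict of tag -> value, as sent by a (possibly newer) peer."""
    out = {}
    for tag, value in fields.items():
        if tag in KNOWN_FIELDS:
            out[KNOWN_FIELDS[tag]] = value
        # Unknown tags (fields added later) are silently skipped.
    return out

# A newer sender added field 3; the legacy decoder still works:
print(decode({1: "alice", 2: 7, 3: "new-field"}))  # {'name': 'alice', 'id': 7}
```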

File Calculation

Number of files calculated as follows:

  1. Number of hashes depends on file size / block size + 64 bytes for filename and version.
  2. Then compute.

Peer-to-Peer (P2P) Networks

Characteristics:

  • No centralized control.
  • Nodes are roughly symmetric.

Flooded Queries: O(N) lookup.

Distributed Hash Tables (DHT)

Key: Hash(data). Lookup(key): Returns IP address.

Modulo Hashing: Changing the number of servers remaps almost every key, leading to heavy transfers. Consistent Hashing: Desired features:

  1. Balance: No bucket has too many objects.
  2. Smoothness: Addition/removal of a token minimizes object movements for other buckets.

Each physical node maintains v > 1 tokens, where each token corresponds to a virtual node. Each virtual node owns an expected 1/(vn) of the ID space. When a node joins or leaves, its load is redistributed in 1/(vn)-sized pieces across many physical nodes, so larger v gives better load balancing.
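The token scheme can be sketched as a sorted ring of hashed tokens (node names, the hash truncation, and v are illustrative choices):

```python
import bisect
import hashlib

def _h(s):
    """Map a string onto the ring [0, 2**32)."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:4], "big")

class ConsistentHash:
    def __init__(self, nodes, v=8):
        # v tokens (virtual nodes) per physical node, sorted around the ring.
        self._ring = sorted(
            (_h(f"{node}#{i}"), node) for node in nodes for i in range(v)
        )
        self._keys = [k for k, _ in self._ring]

    def owner(self, key):
        """The first token clockwise of hash(key) owns the key."""
        i = bisect.bisect(self._keys, _h(key)) % len(self._keys)
        return self._ring[i][1]
```

Removing a node only moves the keys that node owned; every other key keeps its owner, which is the "smoothness" property above.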

Chord Protocol

Key Space: [0, 2^m − 1]. Each node N keeps m finger-table entries; entry k points to the successor of (N + 2^(k−1)) mod 2^m, for k = 1, …, m.
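The finger-table formula can be checked with a small sketch (m = 6 and the node IDs below are illustrative):

```python
M = 6                                                    # key space [0, 63]
NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])   # example node IDs

def successor(ident):
    """First node clockwise of ident on the ring."""
    for n in NODES:
        if n >= ident:
            return n
    return NODES[0]          # wrap around past the top of the ring

def finger_table(n):
    """Entry k points to successor((n + 2**(k-1)) mod 2**M), k = 1..M."""
    return [successor((n + 2 ** (k - 1)) % 2 ** M) for k in range(1, M + 1)]

print(finger_table(8))       # [14, 14, 14, 21, 32, 42]
```

Note how the targets 9, 10, 12, 16, 24, 40 double their distance from node 8, which is what gives the O(log N) lookup below.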

Finger Table Implications:

  • Binary lookup tree rooted at every node – threaded through other node’s finger tables.
  • Better than arranging nodes in a single tree.
  • Each node acts as a root, so there’s no root hotspot, no single point of failure, and a lot more state in total.

Chord Algorithm Properties:

  1. Interface: lookup(key) returns the IP address of the node responsible for the key.
  2. Efficient Lookup: O(log N)
  3. Scalable State: O(log N) state per node
  4. Robust: Survives massive failures.

Availability Metrics

MTBF: Mean Time Between Failures. MTTR: Mean Time To Repair. Availability: (MTBF – MTTR) / MTBF.
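A quick arithmetic check of the formula, reading MTBF as measured failure-to-failure so uptime per cycle is MTBF − MTTR (the numbers are illustrative):

```python
def availability(mtbf, mttr):
    """Fraction of time the system is up: (MTBF - MTTR) / MTBF."""
    return (mtbf - mttr) / mtbf

# e.g. a failure every 1000 hours, each taking 1 hour to repair:
print(availability(1000, 1))   # 0.999 -> "three nines"
```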

Yield: Queries completed / Queries offered. Harvest: Data available / Complete data. Tradeoff: data per query x queries per second tends toward a constant, so raising harvest can lower yield and vice versa.

Canary Request: Send a request to one or two leaf servers for testing. The remaining servers are queried if the root gets a successful response from the canary in a reasonable period of time.

Two-Phase Commit (2PC)

2PC: All commit or none does.

Safety:

  • If one commits, no one aborts.
  • If one aborts, no one commits.

Liveness:

  • If no failures and A and B can commit, the action commits.
  • If failures, reach a conclusion ASAP.

If everyone rebooted and is reachable, the Transaction Coordinator (TC) can just check for the commit record on disk and resend the action. If no commit record -> abort. If no yes record on disk -> abort.

If the yes record is on disk, execute the termination protocol.

  • This recovery protocol with non-volatile logging is called Two-Phase Commit (2PC).
  • Safety: All hosts that decide reach the same decision.
  • No commit unless everyone says "yes".
  • Liveness: If no failures and all say "yes", then commit.
  • But if failures then 2PC might block.
  • TC must be up to decide.
  • Doesn’t tolerate faults well: must wait for repair.

In general, entities only need to record promises they make; anything that is not a promise need not be recorded. Prepare inquiries from the TC are not promises, nor are NO responses. A participant may change its vote from NO to YES before responding, but never from YES to NO once it has promised.
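The coordinator's decision rule is simple to state in code (a sketch of the rule only, not a full protocol with logging and timeouts):

```python
def tc_decide(votes):
    """Transaction coordinator's rule: commit only if every participant
    voted YES; any NO (or an empty/missing vote set) means abort."""
    return "COMMIT" if votes and all(v == "YES" for v in votes) else "ABORT"

print(tc_decide(["YES", "YES", "YES"]))  # COMMIT
print(tc_decide(["YES", "NO", "YES"]))   # ABORT
```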

Quorum-Based Replication

Read Quorum: Contact Nr servers and select the highest version as the correct version.

Write Quorum: Contact Nw servers and increment the highest version -> write out the new version to the servers in the write quorum.

Constraints:

  1. Nr + Nw > N
  2. Nw > N/2
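A sketch of versioned quorum reads and writes under these constraints (the in-memory replica list and quorum choices are illustrative; real systems contact remote servers):

```python
N = 5                          # replicas, each holding (version, value)
replicas = [(0, None)] * N

def quorum_write(replica_ids, value):
    """Write quorum (Nw ids): read versions, increment the highest,
    store the new version at every replica in the quorum."""
    new_version = max(replicas[i][0] for i in replica_ids) + 1
    for i in replica_ids:
        replicas[i] = (new_version, value)

def quorum_read(replica_ids):
    """Read quorum (Nr ids): the highest version wins."""
    i = max(replica_ids, key=lambda j: replicas[j][0])
    return replicas[i][1]

# Nw = 3, Nr = 3 with N = 5 satisfies Nr + Nw > N and Nw > N/2:
quorum_write([0, 1, 2], "v1")
print(quorum_read([2, 3, 4]))  # v1 -- the quorums overlap at replica 2
```

Because Nr + Nw > N, any read quorum intersects any write quorum, so the reader always sees the latest version.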

Raft Consensus Algorithm

AppendEntries():

  • The leader uses this to "push" new operations to the replicated state machines.
  • Also used by the leader to tell the other nodes it is the leader.

RequestVote():

  • Used when the system starts up to select a leader.
  • Used when the leader fails to elect a new leader.
  • Used when the leader is unreachable due to a network partition to elect a new leader.

If electionTimeout elapses with no RPCs (100-500ms), the follower assumes the leader has crashed and starts a new election.

Election Properties

Safety: Allow at most one winner per term.

  • Each server votes only once per term (persists on disk).
  • Two different candidates can’t get majorities in the same term.

Liveness: Some candidate must eventually win.

  • Each chooses election timeouts randomly in [T, 2T].
  • One usually initiates and wins the election before others start.
  • Works well if T >> network RTT.
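The two mechanisms above, randomized timeouts in [T, 2T] and one vote per term, can be sketched as follows (T's value and the class shape are illustrative, not Raft's actual interfaces):

```python
import random

T = 0.15   # base election timeout, e.g. 150 ms

def election_timeout():
    """Each server picks its timeout uniformly in [T, 2T], so one
    candidate usually times out and wins before the others start."""
    return random.uniform(T, 2 * T)

class Server:
    def __init__(self):
        self.voted_for = {}   # term -> candidate (persisted on disk in Raft)

    def request_vote(self, term, candidate):
        """Grant at most one vote per term."""
        if term not in self.voted_for:
            self.voted_for[term] = candidate
            return True
        return self.voted_for[term] == candidate

s = Server()
print(s.request_vote(5, "A"))  # True  -- first vote in term 5
print(s.request_vote(5, "B"))  # False -- already voted for A this term
```

One vote per term is what makes two majorities in the same term impossible: the quorums would have to share a voter.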

DynamoDB

Gossip: Once per second, each node contacts a randomly chosen other node. They exchange their lists of known nodes (including virtual node IDs).

Eventual Consistency

  • Dynamo emphasizes availability over consistency when there are partitions.
  • Tell the client the write is complete when only some replicas have stored it.
  • Propagate to other replicas in the background.
  • Allows writes in both partitions…but risks:
    • Returning stale data.
    • Write conflicts when the partition heals.

Sloppy Quorums

  • If no failure, reap consistency benefits of a single master.
  • Else sacrifice consistency to allow progress.
  • Dynamo tries to store all values put() under a key on the first N live nodes of the coordinator’s preference list.
  • BUT to speed up get() and put():
    • Coordinator returns "success" for put when W nodes have acknowledged the write.
    • Coordinator returns "success" for get when R nodes have replied.

Suppose C fails. Node E is in the preference list. It needs to receive a replica of the data. Hinted Handoff: The replica at E points to node C. When C comes back, E forwards the replicated data back to C.
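Hinted handoff can be sketched as: take the first N live nodes from the preference list, and have any stand-in record a hint naming the failed node it covers for (the function and node names are illustrative):

```python
def storage_nodes(preference_list, live, n):
    """First n live nodes from the preference list; stand-ins record
    a hint naming the failed node they should hand data back to."""
    intended = preference_list[:n]     # where the replicas belong
    chosen, hints = [], {}
    for node in preference_list:
        if len(chosen) == n:
            break
        if node in live:
            chosen.append(node)
            if node not in intended:
                # Stand-in: hint points at a failed intended node.
                failed = next(x for x in intended
                              if x not in live and x not in hints.values())
                hints[node] = failed
    return chosen, hints

# Preference list A,B,C,D,E with C failed and n = 3:
print(storage_nodes(["A", "B", "C", "D", "E"], {"A", "B", "D", "E"}, 3))
# (['A', 'B', 'D'], {'D': 'C'})
```

When C recovers, D uses its hint to forward the replica back and then deletes its copy.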

Dynamo: Take-Away Ideas

  • Consistent hashing is broadly useful for replication.
  • Extreme emphasis on availability and low latency, unusually, at the cost of some inconsistency.
  • Eventual consistency lets writes and reads return quickly, even when partitions and failures occur.
  • Context allows some conflicts to be resolved automatically; others are left to the application.
