Cloud and Distributed Systems: Fundamental Concepts and Platform Insights

Essential Cloud Platform Capabilities

  1. On-Demand Self-Service: Users can provision computing resources (like servers, storage) whenever they need them, without human interaction with the provider.
  2. Broad Network Access: Services are available over the internet and can be used from laptops, phones, or tablets.
  3. Resource Pooling: Cloud providers share resources (like storage, memory) among many users using virtualization.
  4. Rapid Elasticity: Resources can be increased or decreased quickly based on need (auto-scaling).
  5. Measured Service (Pay-as-You-Go): Usage is metered and users pay only for what they consume, much like a prepaid mobile plan, which helps control costs.
  6. High Availability: Cloud platforms aim to keep services running 24/7 with minimal downtime, using redundancy, backups, and load balancing.
  7. Security and Compliance: Provides user authentication, data encryption, firewalls, and follows government policies.
  8. Automation: Many tasks like backups, updates, and scaling can be done automatically without manual work.
  9. Multi-Tenancy: Multiple tenants share the same physical infrastructure while their data and workloads remain securely isolated from one another.
  10. APIs and Developer Tools: Easy-to-use tools for developers to build, test, and deploy applications on the cloud.

Understanding MapReduce Processing

MapReduce Workflow Steps

  1. Input Splitting: The input data is split into small parts (blocks) for parallel processing.
  2. Map Function: Each part is processed in parallel by the Map function, converting data into key-value pairs. (Example: “apple” becomes <apple, 1>)
  3. Shuffling: The system groups all values with the same key together. (All <apple, 1> pairs are brought together.)
  4. Sorting: Keys are sorted before passing them to the reducer.
  5. Reduce Function: All grouped key-value pairs are processed by the Reduce function, performing operations like counting, summing, or averaging. (Example: <apple, [1,1,1]> becomes <apple, 3>)
  6. Output Generation: The final output is stored in the file system in key-value format.

Simple MapReduce Example

Input: A list of words → [“apple”, “apple”, “banana”]

Map Output: <apple, 1>, <apple, 1>, <banana, 1>

Shuffle + Reduce Output: <apple, 2>, <banana, 1>
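The word-count flow above can be sketched in plain Python. This is a single-process illustration of the Map, Shuffle/Sort, and Reduce steps, not a distributed implementation:

```python
from collections import defaultdict

def map_fn(word):
    # Map: emit a <word, 1> pair for each input word
    return (word, 1)

def reduce_fn(key, values):
    # Reduce: sum the grouped counts for one key
    return (key, sum(values))

def mapreduce(words):
    # Map phase: every word becomes a key-value pair
    pairs = [map_fn(w) for w in words]
    # Shuffle phase: group all values with the same key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sort + Reduce phase: aggregate each group in key order
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]

print(mapreduce(["apple", "apple", "banana"]))
# [('apple', 2), ('banana', 1)]
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the data flow, however, is exactly this.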

Hadoop MapReduce Data Flow Explained

Hadoop executes MapReduce jobs via a well-structured data flow across various task trackers:

Phase 1: Setup and Input

  1. Data Partitioning: Input files stored in HDFS are split into M pieces. Each split is assigned to a Map Task.
  2. Computation Partitioning: Users write Map() and Reduce() functions. The Hadoop system forks user programs and distributes them to workers.
  3. Master and Workers Setup: One instance becomes the Master (the JobTracker in Hadoop 1; the ResourceManager under YARN). The others become Workers (TaskTrackers in Hadoop 1; NodeManagers under YARN). The Master assigns Map/Reduce Tasks to workers.

Phase 2: Map Side Processing

  1. Input Reading: Each Map worker reads its assigned split and passes it to the Map() function.
  2. Map Function Execution: Produces intermediate (key, value) pairs.
  3. Combiner (Optional): Combines values locally (e.g., local sum) to reduce network data transfer.
  4. Partitioning Function: Intermediate data is split into R partitions (one per Reduce task) using: Hash(key) mod R.
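The partitioning step can be sketched as follows. Hadoop's default HashPartitioner uses the key's Java hashCode(); the CRC32 below is a deterministic stand-in so the sketch behaves the same on every run:

```python
import zlib

R = 3  # number of Reduce tasks (illustrative)

def partition(key: str, num_reducers: int = R) -> int:
    # Hash(key) mod R selects the Reduce task that will receive
    # this intermediate pair. CRC32 stands in for Java's hashCode().
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Every occurrence of a key maps to the same partition, so one
# reducer ends up with all values for that key.
buckets = {}
for key in ["apple", "banana", "apple", "cherry"]:
    buckets.setdefault(partition(key), []).append(key)
```

The essential property is that the function is deterministic: all Map workers route a given key to the same Reduce task.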

Phase 3: Shuffle and Reduce

  1. Synchronization: Reduce workers wait for all Map tasks to complete.
  2. Communication: Reduce workers fetch partitions from all Map workers using RPC.
  3. Sorting & Grouping: Keys are sorted and grouped (all values with the same key together).
  4. Reduce Function Execution: Final results are written to HDFS output files.
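Steps 3 and 4 can be illustrated locally with `itertools.groupby` (in Hadoop this grouping is a disk-backed merge sort over the fetched partitions):

```python
from itertools import groupby

# Intermediate pairs as fetched from several Map workers (toy data)
fetched = [("banana", 1), ("apple", 1), ("apple", 1), ("banana", 1)]

# Sorting: order pairs by key so equal keys become adjacent
fetched.sort(key=lambda kv: kv[0])

# Grouping: collect all values for each key, ready for Reduce()
grouped = {k: [v for _, v in pairs]
           for k, pairs in groupby(fetched, key=lambda kv: kv[0])}
# grouped == {'apple': [1, 1], 'banana': [1, 1]}
```

Note that `groupby` only merges adjacent equal keys, which is why the sort must happen first.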

Google File System (GFS) Data Mutation Sequence

  1. Client Requests Chunk Info: The client contacts the Master to ask which chunk server has the lease for the chunk and where other replicas are located.
  2. Master Responds: The Master replies with the identity of the Primary replica and the locations of Secondary replicas. The client caches this information for future requests.
  3. Client Pushes Data: The client sends the data to all replicas (primary + secondary). Data is stored in a buffer cache at each chunk server. This step is decoupled from control flow for better performance.
  4. Client Sends Write Request to Primary: After all servers receive the data, the client informs the Primary to begin the mutation. The Primary assigns serial numbers to maintain write order.
  5. Primary Forwards to Secondaries: The Primary sends the write request to all secondary replicas, enforcing the same serial order.
  6. Secondaries Acknowledge: Secondary replicas confirm that the mutation is applied successfully.
  7. Primary Responds to Client: After all secondaries reply, the Primary responds to the client. If any error occurs, the client marks the write as failed and retries the mutation (steps 3–7 or restarts from step 1 if needed).
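A toy in-memory model of the serial-ordering guarantee in steps 4 through 7 (class names and methods here are illustrative, not GFS's actual interfaces):

```python
class SecondaryReplica:
    def __init__(self):
        self.log = []

    def apply(self, serial, data):
        # Step 6: apply the mutation and acknowledge
        self.log.append((serial, data))
        return True

class PrimaryReplica:
    """Toy model of the GFS primary: it assigns serial numbers so every
    replica applies mutations in the same order. Real GFS adds leases,
    chunk servers, a separate data-push phase, and RPC."""

    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.next_serial = 0
        self.log = []

    def write(self, data):
        # Step 4: primary assigns a serial number to the mutation
        serial = self.next_serial
        self.next_serial += 1
        self.log.append((serial, data))
        # Step 5: forward the write, in serial order, to all secondaries
        acks = [s.apply(serial, data) for s in self.secondaries]
        # Steps 6-7: report success only if every secondary acknowledged
        return all(acks)

secondaries = [SecondaryReplica(), SecondaryReplica()]
primary = PrimaryReplica(secondaries)
assert primary.write("A") and primary.write("B")
# All replicas hold the mutations in the same serial order
assert all(s.log == primary.log for s in secondaries)
```

The key idea the model captures is that ordering decisions are made once, at the primary, and merely replayed everywhere else.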

Cloud Computing vs. Grid Computing

Cloud Computing

  • Delivers computing services (servers, storage, databases, networking, software, etc.) over the internet (the cloud).
  • Supports service models like IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service).
  • Highly scalable, on-demand, and cost-efficient.
  • Centralized resource management with virtualization support.
  • Widely used in business applications, social platforms, and big data analytics.

Grid Computing

  • Connects geographically dispersed computers to work on a common task by sharing processing power.
  • Typically used for scientific, technical, and academic computations requiring high-performance computing.
  • Resources are decentralized and may belong to multiple organizations.
  • Less focus on virtualization; more on raw computational capability.

Comparison: Cloud vs. Grid

  • Management:
    • Cloud Computing: Centralized
    • Grid Computing: Decentralized
  • Virtualization:
    • Cloud Computing: Fully supported
    • Grid Computing: Rare or limited
  • Scalability:
    • Cloud Computing: Elastic (automatic scaling)
    • Grid Computing: Static, manual provisioning
  • Use Case:
    • Cloud Computing: Web applications, enterprise software
    • Grid Computing: Scientific simulations, data analysis
  • Resource Ownership:
    • Cloud Computing: Single provider (e.g., AWS, Azure)
    • Grid Computing: Multiple entities

AWS and Azure Developer Ecosystems

Amazon Web Services (AWS)

  • Offers core services like:
    • EC2 for compute
    • S3 for object storage
    • RDS for databases
    • Lambda for serverless functions
  • Programming Tools:
    • AWS CLI (Command Line Interface)
    • SDKs (Python, Java, .NET, etc.)
    • CloudFormation (infrastructure as code)
    • Cloud9 IDE

Microsoft Azure

  • Core services include:
    • Azure VMs for compute
    • Blob Storage
    • SQL Database
    • Azure Functions (serverless)
  • Developer Tools:
    • Azure CLI & PowerShell
    • Visual Studio + Azure SDK
    • Azure Resource Manager templates

Comparison: AWS vs. Azure

  • Integration:
    • AWS: Open-source friendly
    • Azure: Seamless Microsoft integration
  • Tooling:
    • AWS: CLI, SDKs, CloudFormation
    • Azure: CLI, SDKs, Visual Studio, ARM
  • Cloud Strength:
    • AWS: Largest public-cloud service catalog and market share
    • Azure: Strongest hybrid-cloud support (e.g., Azure Stack, Azure Arc)
  • DevOps Tools:
    • AWS: CodePipeline, CodeBuild
    • Azure: Azure DevOps, GitHub Actions

Emerging Cloud Technologies and Applications

Emerging Cloud Software Environments

  • Kubernetes: Open-source system for automating deployment and scaling of containerized applications. Ideal for microservices architecture.
  • Serverless Computing: Execution model where cloud providers manage server infrastructure. Services like AWS Lambda, Azure Functions.
  • Edge Computing: Brings computation closer to data sources like IoT devices. Reduces latency, crucial for time-sensitive applications.
  • Multi-cloud/Hybrid Cloud: Combining services from multiple cloud providers or public/private environments. Ensures flexibility, redundancy, and compliance.

Real-World Applications

  • Healthcare: Remote monitoring using edge computing for fast data analysis.
  • Banking: Secure hybrid cloud setups to comply with regulations.
  • Retail & E-commerce: Serverless platforms auto-scale during sales or holiday seasons.
  • IoT & Smart Cities: Kubernetes with edge computing helps manage millions of devices with real-time insights.

Deploying Web Apps on Google App Engine (GAE)

Key Features of Google App Engine

  • Fully managed Platform-as-a-Service (PaaS).
  • Supports multiple runtimes, including Python, Java, Go, PHP, and Node.js.
  • Developers upload code via command-line (e.g., gcloud app deploy).
  • Auto-scaling: Automatically handles load without manual configuration.
  • Provides NoSQL (Datastore), Task Queues, Cron jobs, and Memcache.
  • Apps are sandboxed for security, with built-in monitoring and runtime support.
  • Ideal for web applications requiring rapid scaling and minimal infrastructure management.

Developing and Deploying on GAE

  1. Write Your App: Develop your application in supported languages like Python, Java, Go, or PHP. Use frameworks such as Flask (Python) or Spring Boot (Java).
  2. Prepare Configuration (app.yaml): Define the entry point, scaling mode, and instance type for your application.
    runtime: python39
    entrypoint: gunicorn -b :$PORT main:app
    automatic_scaling:
      min_instances: 1
      max_instances: 5
  3. Set up Google Cloud Project: Create a project on the GCP Console and enable the App Engine API.
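The entrypoint above assumes a module named main exposing a WSGI callable named app (with Flask this would be `app = Flask(__name__)`). A dependency-free sketch of such a main.py:

```python
# main.py - minimal WSGI application that gunicorn can serve via
# "gunicorn -b :$PORT main:app". A framework such as Flask is more
# typical on App Engine; this bare-WSGI version is for illustration.

def app(environ, start_response):
    # Return a plain-text greeting for every request
    body = b"Hello from App Engine"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

Gunicorn imports the `main` module and serves whatever WSGI callable is bound to `app`, which is exactly what the `entrypoint` line in app.yaml names.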

Deploying and Auto-scaling in Action

  1. Install Google Cloud SDK: Authenticate using gcloud auth login and set your project with gcloud config set project [PROJECT_ID].
  2. Deploy Application: Use the command gcloud app deploy. Wait for the deployment URL to be generated.
  3. Auto-scaling: GAE allocates instances dynamically based on load, reducing idle costs and improving performance.
  4. Monitoring and Logs: Utilize GCP’s built-in monitoring and logging tools. Access your live URL: https://<project-id>.appspot.com.

Challenges in Parallel and Distributed Programs

Running a parallel program in a distributed system involves multiple challenges:

  1. Computation Partitioning: Splitting program logic into tasks that can be executed concurrently.
  2. Data Partitioning: Dividing data into chunks for parallel processing across nodes.
  3. Mapping: Assigning tasks/data chunks to suitable computing resources to ensure balanced workload.
  4. Synchronization: Ensuring coordination between tasks, especially where shared data exists.
  5. Communication: Managing intermediate data exchange using efficient communication protocols.
  6. Scheduling: Proper task sequencing when resources are limited to improve throughput and avoid delays.
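These concerns appear in miniature even on one machine. The sketch below uses a thread pool as a stand-in for distributed workers: the data is partitioned into chunks, the chunks are mapped onto workers, and partial results are communicated back and combined:

```python
from concurrent.futures import ThreadPoolExecutor

def work(chunk):
    # Computation partitioning: each task sums its own chunk independently
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Data partitioning: divide the input into roughly equal chunks
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Mapping + scheduling: the executor assigns chunks to worker threads;
    # map() also synchronizes, yielding results in submission order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(work, chunks))
    # Communication: partial results are combined at the coordinator
    return sum(partials)

print(parallel_sum(list(range(1, 101))))  # 5050
```

In a true distributed system each of these steps gets harder: chunks travel over a network, workers fail independently, and the scheduler must balance load across heterogeneous nodes.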

OpenStack Nova Architecture Explained

OpenStack Nova is responsible for provisioning and managing compute instances:

  • Nova API: The interface for user requests (e.g., create VM).
  • Scheduler: Allocates the best available compute node for a new instance.
  • Nova Compute: Executes virtual machines (VMs) using hypervisors like KVM.
  • Conductor: Handles internal communication between services and the database.
  • RabbitMQ: A message queue used for inter-service messaging within OpenStack.
  • Database: Stores VM metadata, state, and configuration.

Nova supports scalable and fault-tolerant infrastructure for cloud environments.