Hadoop HDFS and MapReduce Core Concepts Explained

HDFS NameNode vs. DataNode Differences

  • NameNode is the master node in HDFS, while DataNode is the slave node.
  • NameNode stores metadata like file names, directory structure, permissions, etc.
  • DataNode stores the actual data blocks of the files.
  • NameNode manages the namespace and regulates access to files.
  • DataNode performs read and write operations on HDFS blocks.
  • There is only one active NameNode (with a standby for high availability).
  • There are usually many DataNodes in a Hadoop cluster.
  • If NameNode fails and no standby takes over, the whole file system becomes inaccessible.
  • If a DataNode fails, data can still be accessed from other replicas.
  • NameNode keeps track of DataNodes and the blocks they store.
  • DataNode regularly sends a heartbeat signal to NameNode.
  • NameNode is critical for the operation of HDFS.
  • DataNodes are responsible for block creation, deletion, and replication based on NameNode instructions.

NameNode never handles actual data, only metadata. Heartbeats and block reports maintain cluster health. This architecture ensures fault tolerance and high throughput.
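
Because the NameNode serves only metadata, a client can ask it where a file's blocks live without ever touching the data itself. A minimal Java sketch using the Hadoop FileSystem API (the path is a hypothetical example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path("/user/data/file.txt"));

            // Metadata request answered by the NameNode: which DataNodes hold each block.
            // The block contents themselves would be streamed from those DataNodes.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }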

HDFS File Replication Factor Explained

The replication factor defines how many copies of a data block are stored in the cluster.

  • The default replication factor in Hadoop is 3.
  • This means each block of data will be copied to three different DataNodes.
  • Replication ensures data availability even if one or two nodes fail.
  • It provides fault tolerance and high reliability.
  • The replication factor can be changed as per requirements.
  • Higher replication increases storage usage but improves availability.
  • Lower replication saves space but increases the risk of data loss.
  • Hadoop balances replicas across DataNodes for load distribution.
  • Replication is handled automatically by the NameNode.
  • If a replica is lost, NameNode creates a new one.
  • Commands like hdfs dfs -setrep can change the replication factor (a programmatic equivalent is sketched after this list).
  • The replication factor is set on a per-file basis in HDFS.
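
A minimal Java sketch of the programmatic equivalent of hdfs dfs -setrep, using the Hadoop FileSystem API (the path and factor are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Equivalent to: hdfs dfs -setrep 2 /user/data/file.txt
            // Returns true if the request was accepted; the NameNode then
            // schedules replica additions or removals in the background.
            boolean accepted = fs.setReplication(new Path("/user/data/file.txt"), (short) 2);
            System.out.println("replication change accepted: " + accepted);
            fs.close();
        }
    }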

HDFS Command Line Interface (CLI)

HDFS can be accessed through a command-line interface (CLI). The CLI allows users to interact with HDFS like a traditional file system.

  • Common commands begin with hdfs dfs or hadoop fs.
  • Example: hdfs dfs -ls / to list contents in the root directory.
  • hdfs dfs -mkdir /user/data creates a new directory.
  • hdfs dfs -put localfile.txt /user/data uploads a file to HDFS.
  • hdfs dfs -get /user/data/file.txt ./file.txt downloads a file from HDFS to the local file system.
  • hdfs dfs -rm /user/data/file.txt deletes a file.
  • The CLI supports file read, write, copy, move, and delete operations.
  • It can also change file permissions and ownership.
  • The CLI is powerful for automation using scripts (a Java equivalent is sketched after this list).
  • It does not require a graphical interface.
  • Ideal for administrators and developers working on data processing tasks.
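
For automation in Java rather than shell scripts, the same operations are available through the Hadoop FileSystem API. A minimal sketch mirroring the commands above (the paths are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsFileOps {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            fs.mkdirs(new Path("/user/data"));                           // hdfs dfs -mkdir
            fs.copyFromLocalFile(new Path("localfile.txt"),
                                 new Path("/user/data/localfile.txt"));  // hdfs dfs -put
            fs.copyToLocalFile(new Path("/user/data/localfile.txt"),
                               new Path("downloaded.txt"));              // hdfs dfs -get
            fs.delete(new Path("/user/data/localfile.txt"), false);      // hdfs dfs -rm

            fs.close();
        }
    }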

MapReduce: A Brief Explanation

MapReduce is a programming model in Hadoop used to process large datasets in parallel. It has two main functions: Map and Reduce.

  • Map phase: Processes input data and outputs key-value pairs.
  • Reduce phase: Groups and processes those pairs to get the final output.
  • Example: In word count, Map emits (word, 1) and Reduce sums them (see the sketch after this list).
  • Map tasks run in parallel on different nodes.
  • Data is shuffled and sorted between Map and Reduce phases.
  • Intermediate data is written to local disk.
  • Reduce tasks aggregate and write final output to HDFS.
  • MapReduce is fault-tolerant and scalable.
  • It can run on clusters of commodity hardware.
  • Ideal for tasks like log analysis, search indexing, and analytics.
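
The word count mentioned above is the canonical MapReduce example. A minimal sketch of its Mapper and Reducer using the org.apache.hadoop.mapreduce API (a separate Job driver would set the input/output paths and submit the job):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in a line of input
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word after shuffle and sort
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : values) {
                    sum += count.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }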

Hadoop Data Serialization and File Structures

Serialization converts structured data into a byte stream, essential for data storage and transfer in Hadoop. It enables communication between different components.

Common formats include:

  • Avro: Row-based and supports schema evolution.
  • Parquet: Columnar, suitable for read-heavy jobs.
  • ORC: Optimized for Hive with high compression.
  • SequenceFile: Stores binary key-value pairs.

Serialized data is compact and efficient and also aids interoperability between languages. File-based structures like these improve read/write speed, reduce storage and processing time, and help big data systems scale effectively.
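
As one concrete file-based structure, a SequenceFile stores binary key-value pairs. A minimal Java sketch of writing one (the path, keys, and values are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class WriteSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/user/data/events.seq");

            // Keys and values are serialized Writable objects stored as binary records
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class))) {
                writer.append(new Text("clicks"), new IntWritable(42));
                writer.append(new Text("views"), new IntWritable(128));
            }
        }
    }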

Hadoop Distributed File System Data Flow

Data flow in HDFS includes both write and read processes.

HDFS Write Process

  • When a client writes a file, it contacts the NameNode.
  • The file is split into blocks on the client side (128 MB by default).
  • For each block, the NameNode selects DataNodes according to the replication factor (three by default).
  • Data is written to the first DataNode, which forwards it to the next two. This is called pipeline replication.
  • Once all replicas are written, the client receives acknowledgment.

HDFS Read Process

  • For reading, the client requests block locations from NameNode.
  • It chooses the nearest DataNode to fetch the data.
  • Data is transferred directly from the chosen DataNode to the client (a client-side sketch of both processes follows).
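
From a client program, both processes are simple stream operations; the write pipeline and nearest-replica selection happen transparently inside the HDFS client library. A minimal Java sketch (the path and contents are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/user/data/notes.txt");

            // Write: the client library asks the NameNode for target DataNodes
            // and streams each block down the replication pipeline.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello hdfs");
            }

            // Read: the client library gets block locations from the NameNode
            // and pulls the data directly from a nearby DataNode.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }

            fs.close();
        }
    }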

Hadoop Ecosystem Tools: Flume, Sqoop, and Archives

Apache Flume

  • Apache Flume is a tool for collecting and transferring log data.
  • It moves large volumes of streaming data to HDFS or HBase.
  • Commonly used for ingesting data from servers or social media.
  • Uses a simple architecture: Source → Channel → Sink.
  • Reliable and fault-tolerant design.
  • Suitable for real-time log collection.

Apache Sqoop

  • Apache Sqoop transfers data between Hadoop and Relational Database Management Systems (RDBMS).
  • Used to import data from databases such as MySQL and Oracle into HDFS.
  • Also exports data from HDFS to databases.
  • Supports incremental load and compression.
  • Easy integration with Hive and HBase.

Hadoop Archives (HAR)

  • HAR files are used to combine many small files into a single archive.
  • Helps reduce NameNode memory usage and manage small files better.

Hadoop Input/Output Compression Methods

Hadoop supports compression to reduce file size, which improves performance by reducing disk I/O. Compression can be applied during input, output, or intermediate steps.

Supported codecs include:

  • Gzip: Good compression but slower.
  • Bzip2: Compresses better but is the slowest.
  • Snappy: Fast but gives moderate compression.
  • LZO: Fast compression and decompression with moderate ratios; splittable once an index is built.

Key points about compression:

  • Compressed files take less space in HDFS.
  • Splittable formats (like Bzip2) allow parallel processing.
  • MapReduce jobs need the correct codec to read compressed files.
  • Hadoop uses InputFormat classes to support compression.
  • OutputFormat can also write compressed output files (see the configuration sketch after this list).
  • Compression is crucial for handling big data efficiently.
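
In a MapReduce job, compression is usually enabled through the job configuration. A minimal Java sketch (the choice of Snappy for intermediate data and Gzip for final output is illustrative, not prescriptive):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressionConfig {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Compress intermediate map output (the shuffled data) with Snappy
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed output example");

            // Compress the final job output with Gzip
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            // ... set mapper, reducer, and input/output paths, then submit the job
        }
    }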

Avro File-Based Data Structure in Hadoop

Avro is a row-oriented data serialization framework from the Apache Hadoop ecosystem, used for data exchange between systems; a short writing example follows the list below.

  • It stores data along with its schema in the same file.
  • The schema is written in JSON format.
  • Data is stored in a compact binary format.
  • Supports schema evolution (new versions of schemas).
  • Useful in data serialization and Remote Procedure Calls (RPC).
  • Compatible with many programming languages.
  • Ideal for long-term data storage.
  • Works well with Hive, Pig, and MapReduce.
  • Can be compressed using supported codecs.
  • It is a self-describing file format.
  • Commonly used in data pipelines and batch processing.
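
A minimal Java sketch of writing an Avro data file with the generic API (the schema, field names, and records are hypothetical):

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class WriteAvroFile {
        public static void main(String[] args) throws Exception {
            // The schema is plain JSON and is embedded in the file it describes
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("age", 30);

            // Records are serialized in compact binary form after the schema header
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("users.avro"));
                writer.append(user);
            }
        }
    }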