Big Data Fundamentals: Concepts, Technologies, & Use Cases
What is Big Data?
Data that is very large in size is called Big Data. Normally, we work with data sizes in megabytes (MB) for documents like Word or Excel, or a maximum of gigabytes (GB) for files like movies or code. However, data in petabytes (PB), which is 10^15 bytes, is considered Big Data. It is estimated that almost 90% of today’s data has been generated in the past three years.
Sources of Big Data
These data come from many sources, including:
- Social Networking Sites: Facebook, Google, and LinkedIn generate a huge amount of data daily due to their billions of users worldwide.
- E-commerce Sites: Platforms like Amazon, Flipkart, and Alibaba generate vast amounts of logs, which can be analyzed to trace user buying trends.
- Weather Stations: All weather stations and satellites provide massive datasets that are stored and manipulated to forecast weather.
- Telecom Companies: Telecom giants such as Airtel and Vodafone store data from millions of users and analyze usage trends to design and publish their plans.
- Stock Markets: Stock exchanges worldwide generate a huge amount of data through daily transactions.
The 3 Vs of Big Data
- Velocity: The speed at which data is generated and must be processed is increasing rapidly; the volume of data is estimated to double roughly every two years.
- Variety: Data is no longer stored only in rows and columns; it can be structured or unstructured. Log files and CCTV footage are examples of unstructured data, while data that fits neatly into tables, such as bank transaction records, is structured.
- Volume: The amount of data we deal with is very large, often in petabytes.
Big Data Use Case: E-commerce
An e-commerce site (with 100 million users) wants to offer a $100 gift voucher to its top 10 customers who spent the most in the previous year. Additionally, they want to identify the buying trends of these customers so the company can suggest more relevant items.
Challenges in Big Data Management
Managing a huge amount of unstructured data requires robust solutions for storage, processing, and analysis.
Big Data Solutions with Hadoop
- Storage: For this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System). HDFS utilizes commodity hardware to form clusters and store data in a distributed fashion. It operates on the ‘write once, read many times’ principle.
- Processing: The MapReduce paradigm is applied to the data distributed across the cluster to compute the required output (see the sketch after this list).
- Analysis: Tools like Pig and Hive can be used to analyze the data.
- Cost-Effectiveness: Hadoop is open source, making cost less of an issue.
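To make the gift-voucher use case concrete, here is a minimal sketch of a Hadoop MapReduce job in Java that totals each customer’s spend from purchase records stored in HDFS; sorting the output by total and keeping the first 10 rows would then identify the voucher winners. The tab-separated input layout, class names, and paths are illustrative assumptions, not details from the scenario above.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomerSpend {

  // Map: for each "customerId<TAB>orderAmount" record, emit (customerId, amount).
  public static class SpendMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length == 2) {
        context.write(new Text(fields[0]),
            new DoubleWritable(Double.parseDouble(fields[1])));
      }
    }
  }

  // Reduce: sum all purchase amounts for each customer.
  public static class SpendReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double total = 0;
      for (DoubleWritable amount : values) {
        total += amount.get();
      }
      context.write(key, new DoubleWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "customer spend");
    job.setJarByClass(CustomerSpend.class);
    job.setMapperClass(SpendMapper.class);
    job.setCombinerClass(SpendReducer.class); // summing is associative, so reuse as combiner
    job.setReducerClass(SpendReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a JAR, the job would be submitted with the standard hadoop jar command, passing the HDFS input and output directories as the two arguments.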
Types of Big Data Analytics
Using customer data as an example, the different branches of analytics that can be done with sets of big data include the following:
- Comparative Analysis: This examines customer behavior metrics and real-time customer engagement to compare a company’s products, services, and branding with those of its competitors.
- Social Media Listening: This analyzes what people are saying on social media about a business or product, helping to identify potential problems and target audiences for marketing campaigns.
- Marketing Analytics: This provides information that can be used to improve marketing campaigns and promotional offers for products, services, and business initiatives.
- Sentiment Analysis: All data gathered on customers can be analyzed to reveal their feelings about a company or brand, customer satisfaction levels, potential issues, and how customer service could be improved.
Big Data Management Technologies
Hadoop, an open-source distributed processing framework released in 2006, was initially central to most big data architectures. However, the development of Spark and other processing engines has pushed MapReduce, the engine built into Hadoop, into a more peripheral role. The result is a rich ecosystem of big data technologies that can be used for various applications and are often deployed together.
Cloud Big Data Platforms & Managed Services
Big data platforms and managed services offered by IT vendors combine many of those technologies in a single package, primarily for use in the cloud. Currently, that includes these offerings, listed alphabetically:
- Amazon EMR (formerly Elastic MapReduce)
- Cloudera Data Platform
- Google Cloud Dataproc
- HPE Ezmeral Data Fabric (formerly MapR Data Platform)
- Microsoft Azure HDInsight
On-Premises & Cloud Deployment Tools
For organizations that want to deploy big data systems themselves, either on-premises or in the cloud, the technologies available to them in addition to Hadoop and Spark include the following categories of tools:
- Storage Repositories: Such as the Hadoop Distributed File System (HDFS) and cloud object storage services including Amazon Simple Storage Service (S3), Google Cloud Storage, and Azure Blob Storage.
- Cluster Management Frameworks: Like Kubernetes, Mesos, and YARN (Hadoop’s built-in resource manager and job scheduler, standing for Yet Another Resource Negotiator).
- Stream Processing Engines: Such as Flink, Hudi, Kafka, Samza, Storm, and the Spark Streaming and Structured Streaming modules built into Spark.
- NoSQL Databases: Including Cassandra, Couchbase, CouchDB, HBase, MarkLogic Data Hub, MongoDB, Neo4j, Redis, and various other technologies.
- Data Lake and Data Warehouse Platforms: Among them Amazon Redshift, Delta Lake, Google BigQuery, Kylin, and Snowflake.
- SQL Query Engines: Like Drill, Hive, Impala, Presto, and Trino.
Key Challenges in Big Data
Beyond processing capacity issues, designing a big data architecture is a common challenge. Big data systems must be tailored to an organization’s particular needs, a do-it-yourself (DIY) undertaking that requires IT and data management teams to piece together a customized set of technologies and tools. Deploying and managing big data systems also demands skills that database administrators and developers focused on relational software typically do not have.
Both of these issues can be eased by using a managed cloud service. However, IT managers need to closely monitor cloud usage to ensure costs do not get out of hand. Additionally, migrating on-premises datasets and processing workloads to the cloud is often a complex process.
Other challenges in managing big data systems include making the data accessible to data scientists and analysts, especially in distributed environments that include a mix of different platforms and data stores. To help analysts find relevant data, data management and analytics teams are increasingly building data catalogs that incorporate metadata management and data lineage functions. The process of integrating big data sets is often complicated, particularly when data variety and velocity are significant factors.
Keys to an Effective Big Data Strategy
Developing an effective big data strategy within an organization requires a clear understanding of business goals and the data currently available. It also involves assessing the need for additional data to help meet objectives. Key steps include:
- Prioritizing planned use cases and applications.
- Identifying new systems and tools required.
- Creating a deployment roadmap.
- Evaluating internal skills to determine if retraining or hiring is necessary.
To ensure big data sets are clean, consistent, and used properly, a data governance program and associated data quality management processes must also be priorities. Other best practices for managing and analyzing big data include focusing on business needs for information over available technologies and using data visualization to aid in data discovery and analysis.
Big Data Collection & Regulations
As the collection and use of big data have increased, so has the potential for data misuse. Public outcry about data breaches and other personal privacy violations led the European Union to approve the General Data Protection Regulation (GDPR), a data privacy law that took effect in May 2018. GDPR limits the types of data organizations can collect and requires opt-in consent from individuals or compliance with other specified reasons for collecting personal data. It also includes a ‘right-to-be-forgotten’ provision, which allows EU residents to ask companies to delete their data.
While there are no similar federal laws in the U.S., the California Consumer Privacy Act (CCPA) aims to give California residents more control over the collection and use of their personal information by companies doing business in the state. CCPA was signed into law in 2018 and took effect on January 1, 2020.
To ensure compliance with such laws, businesses need to carefully manage big data collection. Controls must be implemented to identify regulated data and prevent unauthorized employee access.
Human Aspect of Big Data Analytics
Ultimately, the business value and benefits of big data initiatives depend on the workers tasked with managing and analyzing the data. Some big data tools enable less technical users to run predictive analytics applications or help businesses deploy suitable infrastructure for big data projects, minimizing the need for extensive hardware and distributed software knowledge.
Big data can be contrasted with small data, a term sometimes used to describe datasets that can be easily used for self-service Business Intelligence (BI) and analytics. A commonly quoted axiom states: “Big data is for machines; small data is for people.”
Big Data Storage & Analysis
While the storage capacities of hard drives have increased massively over the years, access speeds—the rate at which data can be read from drives—have not kept pace. For instance, a typical drive from 1990 could store 1,370 MB of data with a transfer speed of 4.4 MB/s, allowing all data to be read in about five minutes. Over 20 years later, one-terabyte drives are common, but transfer speeds are around 100 MB/s, meaning it takes over two and a half hours to read all data from a single disk. This is a significant time for a single drive, and writing is even slower. The obvious solution to reduce this time is to read from multiple disks simultaneously. Imagine having 100 drives, each holding one-hundredth of the data. Working in parallel, we could read the data in under two minutes.
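Working through the arithmetic behind those figures (the drive specifications come from the text above; the rounding is ours):

$$
\frac{1{,}370\ \text{MB}}{4.4\ \text{MB/s}} \approx 311\ \text{s} \approx 5\ \text{min}, \qquad
\frac{1{,}000{,}000\ \text{MB}}{100\ \text{MB/s}} = 10{,}000\ \text{s} \approx 2.8\ \text{h}, \qquad
\frac{10{,}000\ \text{MB}}{100\ \text{MB/s}} = 100\ \text{s} \approx 1.7\ \text{min}.
$$

The last term is the time for each of 100 drives to read its one-hundredth share of the terabyte (about 10 GB) in parallel.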
Using only one-hundredth of a disk might seem wasteful. However, we can store one hundred datasets, each one terabyte, and provide shared access. Users of such a system would likely appreciate shared access in return for shorter analysis times, and statistically, their analysis jobs would be spread over time, minimizing interference.
However, there’s more to parallel data reading and writing across multiple disks. The first challenge is hardware failure: when using many hardware components, the chance of one failing is relatively high. A common method to prevent data loss is replication, where redundant copies of data are maintained by the system. This ensures that in the event of a failure, another copy is available. This principle is similar to how RAID works, and it’s also fundamental to Hadoop’s filesystem, the Hadoop Distributed File System (HDFS).
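As a minimal sketch of what replication looks like from an HDFS client’s point of view (the file path and target factor below are hypothetical; the cluster-wide default comes from the dfs.replication property), a Java client can inspect and change a file’s replication factor:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);           // connects to the default filesystem (HDFS)

    Path file = new Path("/data/transactions.log"); // hypothetical file
    short current = fs.getFileStatus(file).getReplication();
    System.out.println("Current replication factor: " + current);

    // Ask HDFS to keep three copies of each block of this file.
    fs.setReplication(file, (short) 3);
  }
}
```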
The second challenge is that most analysis tasks require combining data; data read from one disk may need to be combined with data from any of the other 99 disks. While various distributed systems allow data to be combined from multiple sources, doing so correctly is notoriously challenging. MapReduce provides a programming model that abstracts this problem from low-level disk reads and writes.
Big Data: Comparison with Other Systems
Big Data vs. RDBMS
In many ways, MapReduce can be seen as a complement to a Relational Database Management System (RDBMS). MapReduce is well-suited for problems requiring batch analysis of entire datasets, especially for ad-hoc analysis. An RDBMS, conversely, excels at point queries or updates where the dataset is indexed for low-latency retrieval and updates of relatively small amounts of data. MapReduce is ideal for applications where data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
Another key difference between MapReduce and an RDBMS lies in the data structure they operate on. Structured data is organized into entities with a defined format, such as XML documents or database tables conforming to a predefined schema. This is the primary domain of an RDBMS. Semi-structured data, on the other hand, is looser; while a schema may exist, it’s often flexible or ignored, serving only as a guide to the data’s structure (e.g., a spreadsheet where the grid is the structure, but cells can hold various data types). Unstructured data lacks any particular internal structure, such as plain text or image data. MapReduce works exceptionally well with unstructured or semi-structured data, as it’s designed to interpret the data at processing time. The input keys and values for MapReduce are not an intrinsic property of the data but are chosen by the person analyzing it.
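That choice shows up in the general shape of the model, written here in the standard MapReduce notation (the analyst picks the key and value types at every stage):

$$
\text{map}: (k_1, v_1) \rightarrow \text{list}(k_2, v_2), \qquad
\text{reduce}: (k_2, \text{list}(v_2)) \rightarrow \text{list}(k_3, v_3)
$$

For plain text input, for example, $k_1$ is typically a byte offset and $v_1$ a line of text, and it is up to the analyst to decide what $k_2$ and $v_2$ should represent.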
Big Data vs. Grid Computing
High-Performance Computing (HPC) and grid computing communities have handled large-scale data processing for decades, utilizing Application Program Interfaces (APIs) like the Message Passing Interface (MPI). Broadly, the HPC approach distributes work across a cluster of machines that access a shared filesystem, often hosted by a Storage Area Network (SAN). This works well for compute-intensive jobs. However, it becomes problematic when nodes need to access larger data volumes (hundreds of gigabytes, the point at which Hadoop truly shines), because network bandwidth becomes the bottleneck and processing nodes sit idle.
Hadoop attempts to co-locate the data with the processing nodes, ensuring fast data access because it is local. This feature, known as data locality, is at the core of data processing in Hadoop and is the reason for its excellent performance. Recognizing that network bandwidth is the most valuable resource in a data center environment (it’s easy to saturate network links by copying data around), Hadoop strives to mitigate this by explicitly modeling network topology. This arrangement does not preclude high-CPU analyses in Hadoop. MPI provides significant control to programmers but requires them to explicitly handle the mechanics of the data stream, exposed via low-level C routines and constructs like sockets, as well as higher-level algorithms for analyses. Processing in Hadoop works at a higher level: the developer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data stream remains implicit.
A Brief History of Hadoop
Hadoop is an open-source software programming framework designed for storing large amounts of data and performing computations. Its framework is primarily based on Java programming, supplemented with some native C code and shell scripts.
Hadoop’s Origins
Hadoop was created by Doug Cutting and Mike Cafarella and is developed under the Apache Software Foundation.
Cutting named it after his son’s toy elephant. In October 2003, Google published the first paper on the Google File System. In January 2006, MapReduce development started within Apache Nutch, comprising approximately 6,000 lines of code for MapReduce and 5,000 lines for HDFS. Hadoop 0.1.0 was released in April 2006.
Hadoop features a distributed file system known as HDFS. HDFS splits files into blocks and distributes them across the nodes of large clusters. In the event of a node failure, the system continues to operate, re-replicating the affected blocks across the remaining nodes thanks to HDFS’s fault tolerance.
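As an illustrative sketch (the file path is hypothetical), a Java client can ask HDFS where the blocks of a file have been placed across the cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/clickstream.log"));

    // One BlockLocation per block: its offset, length, and the hosts holding replicas.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
  }
}
```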
Advantages of HDFS
- Inexpensive
- Immutable in nature
- Stores data reliably
- Ability to tolerate faults
- Scalable
- Block-structured
- Can process large amounts of data simultaneously, and more.
Disadvantages of HDFS
- Not suitable for small quantities of data.
- Potential stability issues.
- Restrictive and somewhat rigid in nature.
Hadoop also supports a wide range of software packages, including Apache Flume, Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.
Common Hadoop Ecosystem Frameworks
- Hive: Provides HiveQL, a SQL-like query language, for structuring and querying data stored in HDFS without hand-writing complex MapReduce jobs (see the sketch after this list).
- Drill: A distributed SQL query engine used for interactive data exploration; it also supports user-defined functions.
- Storm: Allows real-time processing and streaming of data.
- Spark: Contains a Machine Learning Library (MLlib) for enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.
- Pig: Features Pig Latin, a high-level data flow language, and performs transformations on large, often unstructured datasets.
- Tez: A DAG-based execution framework on which Hive and Pig can run instead of MapReduce, helping their jobs complete faster.
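As a small, hypothetical example of how Hive fits in (the HiveServer2 host, database, and orders table below are assumptions), the top-spenders question from the earlier use case can be expressed declaratively in HiveQL over JDBC, and Hive translates it into MapReduce (or Tez/Spark) jobs behind the scenes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopSpendersHive {
  public static void main(String[] args) throws Exception {
    // Standard HiveServer2 JDBC driver and connection URL format.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://analytics-host:10000/default");
         Statement stmt = conn.createStatement();
         // Total spend per customer, highest first, limited to the top 10.
         ResultSet rs = stmt.executeQuery(
             "SELECT customer_id, SUM(amount) AS total_spend "
                 + "FROM orders GROUP BY customer_id "
                 + "ORDER BY total_spend DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
      }
    }
  }
}
```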
Core Hadoop Modules
The Hadoop framework is made up of the following modules:
- Hadoop MapReduce: A programming model for handling and processing large datasets.
- Hadoop Distributed File System (HDFS): Distributes files in clusters among nodes.
- Hadoop YARN: A platform that manages computing resources.
- Hadoop Common: Contains packages and libraries used by other modules.
Hadoop: Advantages & Disadvantages
Advantages of Hadoop
- Ability to store a large amount of data.
- High flexibility.
- Cost-effective.
- High computational power.
- Tasks are independent.
- Linear scaling.
Disadvantages of Hadoop
- Not very effective for small data.
- Hard cluster management.
- Has stability issues.
- Security concerns.
Apache Hadoop & Its Ecosystem
Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the term also refers to a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.
Most core projects are hosted by the Apache Software Foundation, which supports a community of open-source software projects, including the original HTTP Server from which it gets its name. As the Hadoop ecosystem grows, more projects emerge, not necessarily hosted at Apache, providing complementary services to Hadoop or building on its core to add higher-level abstractions.
Key Hadoop Ecosystem Projects
The Hadoop projects described briefly here include:
- Common: A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
- Avro: A serialization system for efficient, cross-language RPC and persistent data storage.
- MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
- HDFS: A distributed filesystem that runs on large clusters of commodity machines.
- Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
- Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (translated by the runtime engine to MapReduce jobs) for querying data.
- HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
- ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives like distributed locks for building distributed applications.
- Sqoop: A tool for efficiently moving data between relational databases and HDFS.