Algorithmic Paradigms: Greedy, DP, and Backtracking
Greedy Approach
A Greedy Algorithm is a problem-solving strategy that, at each stage, makes the choice that looks best at the moment, hoping this local optimum will lead to a global optimum.
Key Characteristics
1. Local Optimal Choice: It focuses on making the locally best decision without considering the consequences for future steps.
2. No Reconsideration: Once a choice is made, it is permanent and never revisited or revised (it is "short-sighted").
3. Speed: Greedy algorithms are often simpler and faster to implement than Dynamic Programming.
4. Optimality: A greedy approach does not always guarantee the globally optimal solution. It only works for problems with the "greedy choice property" and "optimal substructure."
Example: A greedy strategy for the Coin Change Problem (giving change using the fewest coins) might be to always pick the largest denomination coin possible. For standard currency, this often works, but for a custom coin system, it might fail to find the minimum number of coins.
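The largest-denomination-first strategy above can be sketched in a few lines of Python. This is a minimal illustration (the function name `greedy_change` and the coin sets are chosen for the example); it shows both the case where greedy succeeds and a custom coin system where it returns more coins than necessary.

```python
def greedy_change(coins, amount):
    """Greedy: always take the largest coin that still fits."""
    coins = sorted(coins, reverse=True)
    used = []
    for c in coins:
        while amount >= c:      # keep taking the biggest coin possible
            amount -= c
            used.append(c)
    return used if amount == 0 else None

# Standard denominations: greedy finds the optimum (25+10+1+1 = 4 coins).
print(greedy_change([1, 10, 25], 37))
# Custom system {1, 3, 4}: greedy gives [4, 1, 1] (3 coins) for 6,
# but the optimum is [3, 3] (2 coins) -- greedy is not always optimal.
print(greedy_change([1, 3, 4], 6))
```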
Dynamic Programming (DP)
Dynamic Programming is an optimization method that solves complex problems by breaking them down into simpler, overlapping subproblems. It solves each subproblem only once and stores the solution to avoid redundant calculations.
Key Characteristics:
Optimal Substructure:
An optimal solution to the overall problem can be constructed from the optimal solutions of its subproblems.
Overlapping Subproblems:
The same subproblems are encountered multiple times. DP solves these once and stores the result (using memoization or tabulation).
Bottom-Up or Top-Down:
It solves problems either by starting with the smallest subproblems and building up (bottom-up/tabulation) or by starting with the main problem and recursively solving subproblems while storing results (top-down/memoization).
Optimality:
If applicable, Dynamic Programming guarantees the globally optimal solution.
Example: The Fibonacci sequence calculation (where $F(n) = F(n-1) + F(n-2)$) is a classic example. A naive recursive solution recalculates $F(i)$ multiple times. DP solves each $F(i)$ once and reuses the result.
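The Fibonacci example above, in its top-down (memoized) form, is a one-decorator sketch in Python: `lru_cache` stores each $F(i)$ the first time it is computed, so the naive exponential recursion becomes linear.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Each F(i) is computed once and cached (top-down DP / memoization).
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025 -- instant; naive recursion would take minutes
```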
Backtracking Algorithm
How Backtracking Works
Imagine you are in a maze, trying to find the exit.
1. Trial and Decision: You start at the beginning (the root of the search tree) and choose one path (a decision), marking your current position.
2. Forward Step: You move forward (make a recursive call) to the next decision point.
3. Constraint Check (Pruning): At the new position, you check if the path is still valid (does it satisfy the problem's constraints?). If it is valid, you continue moving forward (another recursive call). If it is a dead end or violates a constraint, this is a pruning step: the path is useless.
4. Backtrack: If you hit a dead end, you undo your last move and return to the previous decision point. This is the backtracking step. You must revert the state change made by the failed move so the next trial starts from a clean slate.
5. New Trial: From the previous decision point, you try the next available path.
6. Termination: The process repeats until you find a valid solution or all possible paths have been explored.
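The trial/forward/undo cycle above maps directly onto code. Here is a minimal sketch (permutation generation is chosen as the simplest example; the names `permutations` and `backtrack` are illustrative): the append is the forward step, the `used` check is the constraint check, and the pop is the backtrack that reverts the state.

```python
def permutations(items):
    """Backtracking template: choose, recurse, then undo the choice."""
    result, current, used = [], [], [False] * len(items)

    def backtrack():
        if len(current) == len(items):        # complete solution reached
            result.append(current[:])
            return
        for i, x in enumerate(items):
            if used[i]:                       # constraint check (prune)
                continue
            used[i] = True; current.append(x) # forward step (trial)
            backtrack()                       # recursive call
            current.pop(); used[i] = False    # backtrack: undo the move

    backtrack()
    return result

print(permutations([1, 2, 3]))  # all 6 orderings
```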
Key Components
State-Space Tree (Decision Tree):
The conceptual structure representing all possible sequences of decisions (paths). Backtracking performs a Depth-First Search (DFS) on this tree.
Constraints (Bounding Function):
Rules that determine if a partial solution is valid and can be extended further. This is what allows pruning and makes backtracking more efficient than a brute-force search.
Recursion
Backtracking is almost always implemented recursively, as recursion naturally handles the move forward (the recursive call) and the move back (returning from the call and undoing the action).
Job Scheduling Algorithms (CPU Scheduling)
CPU scheduling decides which process in the ready queue the CPU executes next, aiming to optimize metrics such as CPU utilization, waiting time, and turnaround time. Algorithms are categorized as either Preemptive or Non-Preemptive.
Non-Preemptive:
Once a process starts execution, it runs to completion or until it enters a waiting state (e.g., for I/O). It cannot be interrupted.
Preemptive:
The CPU can be taken away from a running process before it finishes (e.g., when a higher-priority process arrives or its time slice expires).
Here are some of the most common CPU scheduling algorithms:
1. First-Come, First-Served (FCFS)
Principle
The process that arrives first in the ready queue is executed first. It is the simplest form of scheduling, analogous to a FIFO (First-In, First-Out) queue.
Type: Non-Preemptive.
Drawback: Suffers from the Convoy Effect, where a long job arriving first makes all subsequent short jobs wait for a long time, leading to a high average waiting time.
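The convoy effect is easy to demonstrate numerically. This sketch (the function `fcfs_waiting_times` is illustrative, and all processes are assumed to arrive at time 0) computes per-process waiting times under FCFS and shows how arrival order alone changes the average.

```python
def fcfs_waiting_times(burst_times):
    """FCFS: each process waits for the sum of all bursts before it."""
    waits, elapsed = [], 0
    for b in burst_times:
        waits.append(elapsed)   # waiting time = time already consumed
        elapsed += b
    return waits

# Convoy effect: a long job (24) first makes both short jobs wait.
w = fcfs_waiting_times([24, 3, 3])
print(w, sum(w) / len(w))   # [0, 24, 27], average 17.0
# Same jobs, short ones first: average waiting time drops to 3.0.
w = fcfs_waiting_times([3, 3, 24])
print(w, sum(w) / len(w))   # [0, 3, 6], average 3.0
```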
2. Shortest Job First (SJF)
Principle
The CPU is assigned to the process that has the smallest estimated execution time (burst time).
Type
Non-Preemptive SJF:
Once the job starts, it runs to completion.
Preemptive SJF (Shortest Remaining Time First, SRTF):
If a new process arrives with a burst time shorter than the remaining time of the running process, the running process is preempted.
Advantage: Provides the minimum average waiting time for a given set of processes.
Drawback: It requires knowing the next CPU burst time in advance, which is practically impossible. It can also cause starvation for very long jobs if short jobs keep arriving.
3. Priority Scheduling
Principle
Each process is assigned a priority level, and the CPU is allocated to the process with the highest priority.
Type
Can be either Preemptive or Non-Preemptive. In the preemptive variant, a newly arrived higher-priority process immediately preempts the running process.
Drawback:
Starvation is a major issue, where low-priority processes may wait indefinitely if a continuous stream of high-priority processes keeps arriving. This is typically solved using Aging (gradually increasing the priority of processes that have waited a long time).
4. Round Robin (RR)
Principle
Designed for time-sharing systems, it is similar to FCFS but with preemption. Each process is assigned a small unit of CPU time called a time quantum (or time slice).
Type
Preemptive.
Mechanism: When a process executes for one time quantum, it is preempted and put back at the end of the ready queue. The scheduler then moves to the next process in the queue.
Advantage
Fair: every process gets a regular share of the CPU, which gives good response times in interactive, time-sharing systems.
Drawback
Performance heavily depends on the size of the time quantum. A very small quantum leads to excessive context switching overhead.
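The preempt-and-requeue mechanism above can be simulated directly. A minimal sketch (the function name `round_robin` is illustrative, and all processes are assumed to arrive at time 0): each process runs for at most one quantum, then goes to the back of the queue if unfinished.

```python
from collections import deque

def round_robin(bursts, quantum):
    """Simulate RR; returns the completion time of each process."""
    remaining = bursts[:]
    queue = deque(range(len(bursts)))
    t, finish = 0, [0] * len(bursts)
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        t += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)    # preempted: back to the end of the ready queue
        else:
            finish[i] = t      # process done
    return finish

print(round_robin([5, 3, 1], quantum=2))  # [9, 8, 5]
```

With a tiny quantum the loop iterates far more often, which models the context-switching overhead mentioned above.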
Multilevel Queue Scheduling:
Processes are permanently divided into groups (e.g., system processes, interactive processes, batch processes), each with its own ready queue and scheduling algorithm (e.g., Round Robin for interactive, FCFS for batch).
Multilevel Feedback Queue (MLFQ):
Similar to multilevel queue scheduling, but processes can move between queues: a process that uses too much CPU time is demoted to a lower-priority queue, while a process that has waited too long can be promoted (a form of aging).
The Sum of Subset Problem (SSP)
The Sum of Subset Problem is a classic decision problem in computer science and mathematics.
Problem
Definition
Given a set of non-negative integers S and a target sum T, the goal of the decision version of the problem is to determine whether there exists any subset of S whose elements sum to exactly T.
Input:
1. A set of numbers: S = {w1, w2, …, wn}
2. A target sum: T (or m)
Question
Does there exist a subset S' ⊆ S such that the elements of S' sum to exactly T?
Example:
1. Set S: {2, 3, 7, 8, 10}
2. Target T: 11
3. Answer: True, because the subset {3, 8} sums to 11.
Computational Complexity
The Sum of Subset Problem is a well-known NP-complete problem. This means that no polynomial-time algorithm is known for solving it, and it is generally considered difficult to solve efficiently for large inputs.
Algorithmic Approaches
There are two main strategies used to solve the Sum of Subset Problem:
1. Backtracking (Search/Enumeration)
This approach systematically explores the entire solution space using a Depth-First Search (DFS) of a decision tree (or state-space tree).
Mechanism: For each element in the set, the algorithm faces a choice: include the element in the current subset, or exclude it.
Pruning (Optimization): The backtracking approach uses bounding functions to stop exploring a path early if it is impossible to reach the target sum T. For example, a branch is pruned if:
1. The current sum already exceeds the target T.
2. The current sum, plus the sum of all remaining unexamined elements, is less than T.
Goal: The backtracking approach can be adapted to find all possible subsets that sum to T.
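The include/exclude branching and both pruning rules above can be sketched as follows (a minimal Python sketch; the names `subset_sums` and `explore` are illustrative). Weights are sorted ascending so that rule 1 can safely cut the whole branch, and a suffix-sum array implements rule 2.

```python
def subset_sums(weights, target):
    """Find all subsets summing to target, pruning impossible branches."""
    weights = sorted(weights)
    suffix = [0] * (len(weights) + 1)
    for i in range(len(weights) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + weights[i]   # sum of remaining elements

    solutions = []

    def explore(i, current, total):
        if total == target:
            solutions.append(current[:])
            return
        # Prune: next element overshoots (all later ones are larger),
        # or even taking everything left cannot reach the target.
        if i == len(weights) or total + weights[i] > target \
                or total + suffix[i] < target:
            return
        current.append(weights[i])               # include weights[i]
        explore(i + 1, current, total + weights[i])
        current.pop()                            # exclude weights[i]
        explore(i + 1, current, total)

    explore(0, [], 0)
    return solutions

print(subset_sums([2, 3, 7, 8, 10], 11))  # [[3, 8]]
```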
2. Dynamic Programming (Optimization)
This approach solves the problem in pseudo-polynomial time (meaning its complexity is polynomial in the magnitude of the target sum T, not just the size of the input set n). It is a great example of solving a problem with overlapping subproblems.
Mechanism
It builds a 2D boolean table, typically $DP[i][j]$, where:
$DP[i][j] = \text{True}$ if a sum of $j$ can be achieved using a subset of the first $i$ elements of the input set.
$DP[i][j] = DP[i-1][j]$ (if the $i^{th}$ element is excluded) OR $DP[i-1][j - w_i]$ (if the $i^{th}$ element is included).
Complexity
The time complexity is $O(n \cdot T)$, where n is the number of elements and T is the target sum. This is considered efficient when T is small (i.e., not exponential in n), which is why the algorithm is called pseudo-polynomial.
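The $DP[i][j]$ recurrence above can be sketched in its common space-optimized 1D form (a sketch, not the only formulation; `reachable[j]` plays the role of the current table row). Iterating the sums downward ensures each element is used at most once, matching the exclude/include cases of the recurrence.

```python
def subset_sum(weights, target):
    """Bottom-up DP: reachable[j] is True iff some subset sums to j."""
    reachable = [False] * (target + 1)
    reachable[0] = True                  # the empty subset sums to 0
    for w in weights:
        # Iterate downward so each element is counted at most once.
        for j in range(target, w - 1, -1):
            if reachable[j - w]:         # DP[i-1][j - w_i] case (include)
                reachable[j] = True      # otherwise DP[i-1][j] is kept (exclude)
    return reachable[target]

print(subset_sum([2, 3, 7, 8, 10], 11))  # True  ({3, 8})
print(subset_sum([2, 3, 7, 8, 10], 14))  # False (no subset sums to 14)
```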
Branch and Bound (B&B)
is a general algorithmic paradigm used for solving optimization problems, particularly in discrete and combinatorial optimization (finding the best solution from a large, finite set). It is an extension of the backtracking technique.
How Branch and Bound Works
The core idea is to search through the entire space of candidate solutions, represented as a state-space tree, while using bounds to eliminate (prune) large portions of the search space that are guaranteed not to contain the optimal solution.
The method consists of three main components:
Branching
The set of all feasible solutions is recursively divided (partitioned) into smaller, non-overlapping subsets, creating the search tree. Each node in the tree represents a subproblem (a subset of the solution space).
Bounding
For each node, an optimistic bound is computed on the best solution obtainable within that subproblem (an upper bound for maximization problems, a lower bound for minimization).
Pruning (Bounding Function):
The key step. The current best complete solution found so far (the "incumbent") is tracked. If the optimistic bound of a subproblem node is worse than the value of the current incumbent solution, that entire subproblem and all its branches are safely discarded (pruned), because they cannot possibly lead to a better solution than the one already found.
Search Strategy:
FIFO (Breadth-First Search): Useful for finding the optimal solution quickly, but memory-intensive.
LIFO (Depth-First Search): Memory-efficient, since only the current path must be stored, but it may explore unpromising branches deeply before finding a good incumbent.
Least Cost Search (Best-First): Explores the node with the most promising bound first, which often leads to the optimal solution faster.
Applications
B&B is the most commonly used tool for solving NP-hard optimization problems, such as:
Traveling Salesman Problem (TSP): Find the shortest tour that visits every city exactly once.
0/1 Knapsack Problem: Select items to maximize total value without exceeding a weight limit.
Integer Programming: Optimize a linear objective subject to constraints where some or all variables must take integer values.
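The branching/bounding/incumbent machinery can be seen end to end on the 0/1 knapsack. This is a sketch of one common variant, best-first B&B with the fractional (greedy) relaxation as the optimistic bound; the names `knapsack_bb` and `bound` are illustrative, and values are assumed positive.

```python
import heapq

def knapsack_bb(values, weights, capacity):
    """Best-first branch and bound for 0/1 knapsack."""
    n = len(values)
    # Sort by value density so the fractional bound is easy to compute.
    order = sorted(range(n), key=lambda i: values[i] / weights[i], reverse=True)
    values = [values[i] for i in order]
    weights = [weights[i] for i in order]

    def bound(i, value, room):
        """Optimistic bound: fill the remaining room fractionally."""
        b = value
        while i < n and weights[i] <= room:
            room -= weights[i]; b += values[i]; i += 1
        if i < n:
            b += values[i] * room / weights[i]   # fractional part of next item
        return b

    best = 0                                     # the incumbent
    heap = [(-bound(0, 0, capacity), 0, 0, capacity)]  # max-heap via negation
    while heap:
        neg_b, i, value, room = heapq.heappop(heap)
        if -neg_b <= best:        # prune: bound cannot beat the incumbent
            continue
        if i == n:
            continue
        if weights[i] <= room:    # branch 1: include item i
            v = value + values[i]
            best = max(best, v)   # update incumbent
            heapq.heappush(heap, (-bound(i + 1, v, room - weights[i]),
                                  i + 1, v, room - weights[i]))
        # Branch 2: exclude item i.
        heapq.heappush(heap, (-bound(i + 1, value, room), i + 1, value, room))
    return best

print(knapsack_bb([60, 100, 120], [10, 20, 30], 50))  # 220 (items 2 and 3)
```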
Drawbacks of the Branch and Bound Method
While the Branch and Bound (B&B) method is a powerful paradigm for solving complex optimization problems (especially NP-hard ones), it has several practical limitations and drawbacks:
1. Worst-Case Exponential Time Complexity
The Problem: In the worst case, the algorithm may still have to explore the entire state-space tree. The number of nodes in this tree can grow exponentially with the size of the problem (e.g., O(2^n)).
Implication: For large or highly complex problem instances, the B&B method can be extremely time-consuming and computationally infeasible, despite its pruning techniques. It avoids exponential behavior only when the bounds are highly effective.
2. High Dependency on the Bounding Function
The Problem: The efficiency of B&B is critically dependent on the quality of the bounding function. A weak or loose bound will fail to effectively prune non-optimal branches, causing the algorithm to explore too much of the search space and degenerate into a slow exhaustive search.
Implication: Developing a tight (close to the optimal value) and computationally efficient bounding function is often the hardest and most problem-specific part of implementing B&B. A complex but tight bound might take too long to calculate, negating the speed-up from pruning.
3. High Memory Consumption (Depending on Strategy)
The Problem: If a Best-First Search strategy (exploring the most promising node based on the bound) is used, the algorithm must store all active (live) subproblem nodes in a priority queue.
Implication: For problems with large branching factors or a vast search space, the memory required to store these nodes can exceed available resources, making the algorithm memory-intensive.
4. Complexity of Implementation
The Problem: Implementing a B&B algorithm is complex due to the need to integrate three distinct components: a systematic branching mechanism, a tightly defined bounding function, and efficient incumbent management (tracking the best solution found so far).
Implication: It is significantly more difficult to implement and debug than simpler techniques like a greedy algorithm or a standard recursive backtracking search.
Uses of Control Abstraction in Algorithms
Control Abstraction is the process of defining a procedure whose flow of control is clear but whose primary operations (the "what") are specified by other procedures or variables whose precise meanings (the "how") are left undefined or abstracted away.
In essence, it is the fundamental tool used to structure and simplify complex algorithms, allowing programmers to focus on high-level logic rather than low-level execution details. The control abstraction algorithm (or framework) defines the template for a specific algorithmic approach. Its primary uses are seen across the major algorithmic paradigms:
1. Defining Algorithmic Paradigms (The Framework)
Control abstraction provides the reusable template for an entire class of algorithms.
Divide and Conquer — Framework: DANDC(P). Operations left abstracted: SMALL(P) (stopping condition), S(P) (solution for small problems), COMBINE() (merging sub-solutions).
Greedy Method — Framework: GreedyMethod(a, n). Operations left abstracted: SELECT() (choosing the next optimal element), FEASIBLE() (checking if the choice is valid), UNION() (adding the choice to the solution).
Backtracking — Framework: REC_BACKTRACK(k). Operations left abstracted: T() (generating the next possible state/candidate), Bk() (the bounding/feasibility function used to prune invalid paths).
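The GreedyMethod(a, n) framework above can be sketched as code whose control flow is fixed while SELECT, FEASIBLE, and UNION are passed in as functions. This is only an illustration of the idea (all names and the budget example are invented for the sketch).

```python
def greedy_method(candidates, select, feasible, union, solution=None):
    """The control flow is fixed; SELECT/FEASIBLE/UNION are plugged in."""
    solution = solution if solution is not None else []
    remaining = list(candidates)
    while remaining:
        x = select(remaining)          # SELECT: locally best candidate
        remaining.remove(x)
        if feasible(solution, x):      # FEASIBLE: does x keep things valid?
            solution = union(solution, x)  # UNION: commit the choice
    return solution

# Example instantiation: pick items greedily (largest first) without
# exceeding a budget of 10.
budget = {"left": 10}

def feasible(sol, x):
    return x <= budget["left"]

def union(sol, x):
    budget["left"] -= x
    return sol + [x]

result = greedy_method([7, 5, 4, 2, 1], select=max,
                       feasible=feasible, union=union)
print(result)  # [7, 2, 1]
```

Note that the same `greedy_method` shell could host coin change, activity selection, and so on, which is exactly the point of the abstraction.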
2. Enhancing Code Quality and Reusability (Software Engineering)
In general programming and software development, control abstraction is used to manage complexity and improve code quality.
1. Simplification and Readability: Complex control flows (like nested loops or conditional logic) are encapsulated into reusable functions or control structures (like for, while, if/else, try-catch). This simplifies the main program, making it cleaner and easier to understand.
2. Modular Control: By breaking the flow into reusable functions, the control logic is modularized. This allows easier debugging, testing, and maintenance, as changes can be isolated to specific functions. This aligns with the Don't Repeat Yourself (DRY) principle.
3. Encapsulation of Control Logic: Control structures hide the intricate details of how an operation is repeated or conditionally executed, exposing only the higher-level purpose. For example, a sort() function abstracts away whether the underlying implementation uses Merge Sort or Quick Sort.
3. Implementing Advanced Control Structures
Control abstraction is crucial for building custom or specialized control constructs, especially in advanced programming contexts:
1. Handling Resources (e.g., C#'s using block, Java's try-with-resources): These language features are, at heart, control abstractions. They define a sequence of steps (acquire resource, execute block, guarantee resource release) that happens automatically, abstracting away explicit try…finally boilerplate.
2. Parallel and Asynchronous Programming: In parallel languages, control abstraction lets programmers define new constructs for expressing parallelism (e.g., promises, futures, async/await). This separates the logical ordering of statements from how that parallelism is exploited on different hardware architectures.
3. Embedded Domain-Specific Languages (DSLs): Control abstractions enable a base language to grow user-defined control constructs, allowing developers to create highly readable, domain-specific code (e.g., a finance DSL can define a CalculateAnnualInterest construct).
Amortized Analysis
Amortized Analysis is a technique used to analyze the performance of a sequence of operations on a data structure. It determines the worst-case average cost per operation over that sequence, even if some individual operations are very expensive.
The core idea is to “smooth out” the cost: the expensive operation is so infrequent that its high cost, when averaged over all preceding and subsequent cheap operations, results in a low, constant average time.
Key Points:
Worst-Case Guarantee:
Unlike average-case analysis (which uses probability assumptions), amortized analysis provides a guaranteed upper bound on the total cost of any sequence of operations.
Cost Spreading
It uses the fact that an expensive operation often changes the data structure’s state in a way that prevents another expensive operation from happening immediately, thus “paying” for itself.
Methods
The three main techniques are:
Aggregate Method
Compute the total cost T(n) for n operations, then the amortized cost per operation is T(n)/n.
Accounting Method:
Overcharge cheap operations and store the excess cost as “credit” to pay for later expensive operations.
Potential Method:
(The most formal approach.) Assigns a "potential energy" to the data structure's state. The amortized cost is $\hat{c}_i = c_i + \Delta\Phi_i$ (actual cost plus change in potential).
Amortized Cost of Stack Operations
We analyze the stack operations PUSH, POP, and MULTIPOP (which removes k items at once) on an initially empty stack using the Aggregate Method and the Potential Method.
1. Operation Definitions
PUSH(S, x): adds item x to stack S. Actual worst-case cost $c_i$: 1, i.e. O(1).
POP(S): removes the top item from stack S. Actual worst-case cost $c_i$: 1, i.e. O(1).
MULTIPOP(S, k): pops $\min(k, |S|)$ items from stack S. Actual worst-case cost $c_i$: $\min(k, |S|)$.
2. Analysis using the Aggregate Method
We consider an arbitrary sequence of n operations (PUSH, POP, MULTIPOP) starting with an empty stack.
Total Actual Cost T(n): The total number of items popped in the entire sequence (by POP or MULTIPOP) can be at most the total number of items pushed. Since there are at most n PUSH operations in a sequence of n operations, the POP primitive can execute at most O(n) times across the whole sequence. Since the actual cost of each primitive PUSH or POP is 1, the total cost of the entire sequence is T(n) = O(n).
Amortized Cost per Operation: T(n)/n = O(n)/n = O(1).
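The aggregate bound above (total cost at most 2n, so O(1) per operation) can be checked empirically. This sketch (the harness `run_stack_ops` and its cost model are invented for the demonstration) runs random PUSH/POP/MULTIPOP sequences and reports the average primitive cost per operation.

```python
import random

def run_stack_ops(n, seed=0):
    """Run n random PUSH/POP/MULTIPOP ops, counting primitive steps."""
    rng = random.Random(seed)
    stack, total_cost = [], 0
    for _ in range(n):
        op = rng.choice(["push", "pop", "multipop"])
        if op == "push":
            stack.append(1); total_cost += 1
        elif op == "pop" and stack:
            stack.pop(); total_cost += 1
        elif op == "multipop" and stack:
            k = rng.randint(1, 10)
            popped = min(k, len(stack))     # MULTIPOP costs min(k, |S|)
            del stack[-popped:]; total_cost += popped
        # Pops on an empty stack do nothing and cost 0 in this model.
    return total_cost

for n in (1000, 10000, 100000):
    print(n, run_stack_ops(n) / n)   # average cost per op stays below 2
```

No item can be popped more often than it was pushed, so the total cost never exceeds 2n regardless of how expensive individual MULTIPOPs are.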
3. Analysis using the Potential Method
We define a potential function Φ that represents the "credit" stored in the stack.
Potential Function: Φ(S) = |S|, the number of elements currently in the stack.
Φ(S) is always ≥ 0, and $\Phi(D_0) = 0$ for the initially empty stack.
We calculate the amortized cost $\hat{c}_i = c_i + \Phi(D_i) - \Phi(D_{i-1})$ of each operation:
PUSH: actual cost 1, potential increases by 1, so $\hat{c} = 1 + 1 = 2 = O(1)$.
POP: actual cost 1, potential decreases by 1, so $\hat{c} = 1 - 1 = 0$.
MULTIPOP(S, k): actual cost $k' = \min(k, |S|)$, potential decreases by $k'$, so $\hat{c} = k' - k' = 0$.
Since every amortized cost is O(1), the total actual cost of any sequence of n operations is O(n).
Short Notes on Advanced Algorithmic Concepts
Here is a brief overview of several advanced concepts, focusing on complexity analysis and problem-solving paradigms for difficult problems.
1. Aggregate Analysis
Definition: A method of Amortized Analysis that computes the total actual cost, T(n), of a sequence of n operations. The amortized cost per operation is then calculated as the average:
Purpose
To show that the average (amortized) cost per operation is low, even though some individual operations in the sequence are expensive.
Example
Analyzing n operations on a dynamic array (like a Java ArrayList or C++ std::vector), where resizing is O(n), but the total cost of n insertions is only O(n), leading to an O(1) amortized cost per insertion.
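The dynamic-array example above can be made concrete with a small cost model (a sketch; the function `append_costs` and the doubling-from-capacity-1 policy are illustrative assumptions): each append costs 1, plus the number of elements copied whenever the capacity doubles.

```python
def append_costs(n):
    """Cost model: 1 per append, plus the copy cost when capacity doubles."""
    capacity, size, costs = 1, 0, []
    for _ in range(n):
        if size == capacity:
            costs.append(1 + size)   # resize: copy `size` elements, then insert
            capacity *= 2
        else:
            costs.append(1)          # ordinary O(1) insert
        size += 1
    return costs

c = append_costs(1024)
print(max(c), sum(c) / len(c))   # worst single op is ~n, but average < 3
```

The copies form a geometric series (1 + 2 + 4 + … < 2n), so the total for n appends is O(n) and the amortized cost per append is O(1).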
2. Accounting Method (Banker’s Method)
Definition: A method of Amortized Analysis that assigns different costs to different operations, charging some operations more than their actual cost and storing the surplus as “credit” (or potential).
Mechanism:
Cheap Operations: Are overcharged, and the excess is deposited as credit.
Expensive Operations: Are undercharged, and the deficit is covered by the stored credit.
Example
In a dynamic array, each insertion can be charged a constant amount (e.g., 3 units): 1 pays for the insertion itself, and the surplus credit pays for copying elements during a future resize.
3. Tractable and Non-Tractable Problems
Tractable Problem: A computational problem that can be solved by an algorithm whose worst-case time complexity is bounded by a polynomial function of the input size n.
Complexity Class: P (problems solvable in polynomial time).
Examples
Sorting, searching, finding the shortest path in a graph.
Non-Tractable Problem (Intractable Problem): A computational problem for which no polynomial-time algorithm exists (or is known to exist). Their best-known algorithms have exponential time complexity.
Complexity Class:
Many problems in the class NP-Complete are generally considered non-tractable. Examples: Traveling Salesman Problem (TSP), Boolean Satisfiability (SAT).
4. Potential Function
Definition: The most formal technique for Amortized Analysis. It uses a potential function Φ(D) that maps the state D of a data structure to a non-negative real number, representing the “prepaid work” stored.
Amortized Cost: The amortized cost $\hat{c}_i$ of an operation is calculated as:
$\hat{c}_i = c_i + \Delta\Phi_i = c_i + \Phi(D_i) - \Phi(D_{i-1})$
where $c_i$ is the actual cost and $\Delta\Phi_i$ is the change in potential.
Goal: To define Φ such that $\hat{c}_i$ is small (often O(1)) for all operations, proving that the total actual cost is low.
5. Randomized Algorithm
Definition: An algorithm that makes random choices during execution, so its behavior can differ between runs on the same input. Two main flavors:
Las Vegas Algorithm:
Always produces the correct result, but its running time is random (e.g., Quick Sort with a random pivot). The expected running time is analyzed.
Monte Carlo Algorithm:
Runs within a deterministic time bound, but may return an incorrect answer with some (usually small, controllable) probability (e.g., the Miller-Rabin primality test).
Use
Often simpler, faster in practice, or the only way to solve certain problems compared to their deterministic counterparts.
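The Las Vegas flavor is easy to see in randomized quicksort: the output is always the correctly sorted list; only the recursion depth (and hence running time) depends on the random pivots. A minimal sketch (this out-of-place three-way partition is chosen for clarity, not efficiency):

```python
import random

def randomized_quicksort(a):
    """Las Vegas: always correct; randomness only affects running time."""
    if len(a) <= 1:
        return a
    pivot = random.choice(a)                     # random pivot
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return randomized_quicksort(less) + equal + randomized_quicksort(greater)

print(randomized_quicksort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9] every run
```

The random pivot makes the $O(n^2)$ worst case depend on coin flips rather than on any particular adversarial input, giving $O(n \log n)$ expected time.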
6. Approximation Algorithms
Definition
A technique used to solve NP-hard optimization problems by finding a solution that is close to the optimal solution within a provable guarantee, rather than spending exponential time searching for the absolute optimal.
Performance Guarantee: The algorithm has an approximation ratio $\rho(n)$ if, for any input, the cost C of the solution it produces is within a factor $\rho(n)$ of the optimal cost $C^*$.
Use
Essential for solving practical instances of intractable problems like the Traveling Salesman Problem (TSP) or the Vertex Cover Problem, where finding the exact optimal solution is too time-consuming.
Embedded Algorithm
An embedded algorithm is a set of defined rules, calculations, or steps implemented in software to achieve a specific task within an embedded system. Essentially, it is the computational logic that runs on a specialized, non-general-purpose computer system designed to perform one or a few dedicated functions.
Key Characteristics:
Resource Constrained: They must operate efficiently within tight limits on memory, processing power (CPU speed), and power consumption (especially for battery-operated devices).
Real-Time Requirements: Many must produce results within strict timing deadlines, where a late answer can be as bad as a wrong one.
Interaction with Hardware: They directly interact with sensors, actuators, and hardware interfaces (I/O) to monitor physical processes and exert control over them.
Optimized for a Single Task: They are tuned for one dedicated function rather than general-purpose computing.
Examples of Embedded Algorithms
1. A PID (Proportional-Integral-Derivative) controller algorithm used in a thermostat to maintain a stable temperature.
2. A Kalman filter algorithm used in GPS systems to estimate a vehicle's true position despite noisy sensor data.
3. A signal processing algorithm (like the FFT) used in a hearing aid to filter noise and enhance voice.
Embedded System Scheduling
Objectives of Embedded Scheduling
1. Meet Deadlines: The primary goal is to ensure all tasks, especially hard real-time tasks, meet their defined deadlines.
2. Maximize CPU Utilization: Keep the processor busy whenever there is work to do.
3. Minimize Latency: Reduce the time between a trigger event (e.g., a sensor reading) and the corresponding action.
4. Predictability: The system's behavior must be predictable and verifiable under all operating conditions.
Types of Real-Time Systems
The scheduling requirements depend on the system’s criticality:
Hard Real-Time: the deadline constraint is absolute and critical; missing it means catastrophic failure (e.g., avionics, medical implants).
Soft Real-Time: the deadline is desirable, but missing one is tolerable; the consequence is degraded performance (e.g., video streaming, gaming).
Common Scheduling Algorithms
Embedded schedulers are almost always priority-driven and preemptive to handle critical events instantly.
1. Rate Monotonic Scheduling (RMS)
Mechanism
A static priority algorithm where priorities are assigned before runtime and do not change.
Priority Rule
The task with the highest rate (shortest period/most frequent) is given the highest priority.
Optimality
RMS is optimal among all fixed-priority scheduling algorithms. If a set of tasks can be scheduled by any fixed-priority algorithm, it can be scheduled by RMS.
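RMS schedulability is often checked with the classical Liu-Layland utilization test: a task set is guaranteed schedulable under RMS if total utilization $U = \sum C_i/P_i \le n(2^{1/n} - 1)$. This is a sufficient (not necessary) condition; the sketch below (function name `rms_schedulable` is illustrative) implements just that check.

```python
def rms_schedulable(tasks):
    """Liu-Layland sufficient test for RMS: U <= n * (2^(1/n) - 1).

    tasks: list of (compute_time, period) pairs.
    """
    n = len(tasks)
    utilization = sum(c / p for c, p in tasks)
    bound = n * (2 ** (1 / n) - 1)       # ~0.828 for n=2, -> ln 2 as n grows
    return utilization <= bound, utilization, bound

# Two tasks: C1=1, P1=4 and C2=2, P2=6 -> U ~ 0.583 <= 0.828, schedulable.
print(rms_schedulable([(1, 4), (2, 6)]))
```

Task sets that fail this test may still be schedulable; an exact answer requires response-time analysis.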
2. Earliest Deadline First (EDF)
Mechanism: A dynamic priority algorithm where priorities change at runtime.
Priority Rule: The task whose deadline is closest is given the highest priority.
Optimality: EDF is optimal among all dynamic priority scheduling algorithms. If a set of tasks can be scheduled by any algorithm, it can be scheduled by EDF.
Drawback: Difficult to implement and analyze under system overload, as a large number of tasks missing deadlines can make the system unstable.
3. Priority Ceiling Protocol (PCP)
Mechanism: An extension of priority-based scheduling used specifically to solve the problem of priority inversion and deadlocks when tasks share resources (like mutexes). Principle: When a task locks a shared resource, its priority is temporarily raised to the ceiling priority of that resource (the priority of the highest-priority task that might ever use it). This prevents medium-priority tasks from preempting the task holding the lock, thereby minimizing the time a high-priority task is blocked.
Spawn and sync keywords used in multithreaded programming
1. The spawn Keyword (or Primitive)
Purpose
To delegate a sub-task to a worker or separate thread, allowing the original thread (the parent) to continue execution without waiting for the sub-task to finish. This is the mechanism for introducing parallelism.
Characteristics
Non-Blocking: When the parent thread executes spawn, it immediately proceeds to the next instruction without waiting for the spawned child task to complete.
Asynchronous: The spawned task runs concurrently with, and independently of, its parent until a synchronization point.
Context: spawn typically creates a new, lightweight task or thread context capable of running on any available core.
Example (Conceptual)
In a parallel algorithm, if you need to process two halves of an array simultaneously:
// Parent thread starts
spawn process_half(array_A); // Child 1 starts processing A, parent continues
spawn process_half(array_B); // Child 2 starts processing B, parent continues
// Parent performs other work while A and B are being processed
2. The sync Keyword (or Primitive)
Purpose
To ensure that all necessary concurrent work has finished before the parent thread proceeds to a step that requires the results of those concurrent tasks (e.g., merging the results). This is the mechanism for re-introducing sequential execution and ensuring data integrity.
Characteristics
Blocking: When the parent thread executes sync, it blocks (pauses its own execution) until all tasks spawned within the same logical block have terminated.
Barrier: sync acts as a synchronization barrier, a point the parent does not pass until all the relevant spawned tasks have completed.
Result Integration: It allows the parent thread to safely collect and combine the results produced by the children.
Relationship: The Fork-Join Model
Together, spawn (fork) and sync (join) implement the Fork-Join model: a parent forks parallel children, optionally does its own work, then joins before combining the results.
Multithreaded Merge Sort Algorithm
The Multithreaded Merge Sort algorithm is an adaptation of the classic Divide and Conquer Merge Sort designed to leverage multiple processing cores (threads) to reduce overall execution time. It follows the standard Merge Sort structure but executes the recursive sorting steps in parallel.
1. The Merge Sort Structure
The standard (sequential) Merge Sort works in three main steps:
Divide: The input list/array of n elements is divided into two halves of n/2 elements each.
Conquer (Recurse): The two halves are recursively sorted.
Combine (Merge): The two sorted halves are merged back together to form the final sorted list.
2. Multithreaded Adaptation
The parallelism is introduced in the Conquer step using the Fork-Join model:
A. Parallel Divide (Fork)
Instead of the parent thread recursively calling the sort function on both halves sequentially, it uses the spawn primitive (or a thread-creation mechanism) to initiate the two recursive calls concurrently.
Small Subproblems: For very small subarrays (e.g., size < 16), the algorithm typically stops spawning new threads and reverts to standard sequential Merge Sort. This is crucial because thread-creation overhead can quickly negate the benefits of parallelism for small tasks.
B. Sequential Combine (Merge)
The sync operation ensures that the parent thread blocks until both spawned sorting subroutines have completed. Once synchronized, the final MERGE operation is typically performed sequentially because merging two sorted lists is inherently difficult to parallelize efficiently.
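The fork-join structure above can be sketched with Python threads, where `Thread.start()` plays the role of spawn and `Thread.join()` plays the role of sync. Note this only illustrates the structure: in CPython the GIL prevents real CPU parallelism for pure-Python code, and the cutoff value 16 is an arbitrary illustrative choice.

```python
import threading

CUTOFF = 16   # below this size, fall back to sequential sorting

def merge(left, right):
    """Sequential merge of two sorted lists (the Combine step)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def parallel_merge_sort(a):
    if len(a) <= CUTOFF:
        return sorted(a)                      # small subproblem: sequential
    mid = len(a) // 2
    result = {}
    # "spawn": sort the left half in a child thread while the parent
    # sorts the right half itself.
    t = threading.Thread(
        target=lambda: result.update(left=parallel_merge_sort(a[:mid])))
    t.start()
    right = parallel_merge_sort(a[mid:])
    t.join()                                  # "sync": wait for the child
    return merge(result["left"], right)       # sequential Combine

import random
data = [random.randint(0, 999) for _ in range(1000)]
print(parallel_merge_sort(data) == sorted(data))  # True
```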
3. Analysis of Multithreaded Merge Sort
The performance of a parallel algorithm is analyzed using two key metrics: Work and Span.
A. Work
Work is the total time spent by a single processor, which is equivalent to the complexity of the sequential algorithm.
Work Recurrence: Since the steps are the same as in sequential Merge Sort, the work is unchanged.
Work Complexity: $\Theta(n \log n)$.
B. Span
Span is the running time on an unlimited number of processors, i.e., the longest chain of dependent operations. With a sequential merge, the span recurrence is $S(n) = S(n/2) + \Theta(n)$, which gives a span of $\Theta(n)$.
C. Speedup and Parallelism
Speedup (S): How much faster the parallel algorithm runs compared to the sequential one.
Maximum Possible Speedup: Limited by the span.
Parallelism (work divided by span): $\Theta(n \log n) / \Theta(n) = \Theta(\log n)$. This is relatively low because the $\Theta(n)$ merge at each level is a sequential bottleneck, an effect related to Amdahl's Law.
Rabin-Karp Algorithm for String Matching
The Rabin-Karp algorithm is a string-matching algorithm that uses hashing to efficiently find all occurrences of a pattern of length $m$ within a text of length n. Unlike algorithms that perform character-by-character comparisons (like the brute-force method), Rabin-Karp primarily relies on comparing the hash value of the pattern with the hash values of all possible substrings of the text.
1. The Core Idea: Hashing
The algorithm calculates a unique hash value for the pattern $P$ and then compares this value to the hash value of every substring of length $m$ in the text T.
Pattern Hash
Calculate the hash value h_p for the pattern P.
Text Hashes
Calculate the hash value $h_{t,i}$ for the text substring $T[i \dots i+m-1]$.
A. Initial Match Check (Hash Comparison)
If $h_p \ne h_{t,i}$, the pattern cannot match the substring, and the algorithm shifts to the next position.
If $h_p = h_{t,i}$, there are two possibilities:
A true match: the pattern perfectly matches the substring.
A spurious hit (collision): the pattern does not match, but the hash values happen to be identical.
B. Verification (Character Comparison)
To rule out a spurious hit, if the hash values match, the algorithm performs a character-by-character comparison between $P$ and $T[i \dots i+m-1]$.
2. The Efficiency Driver: Rolling Hash
The key to Rabin-Karp's efficiency is the rolling hash technique. Instead of recalculating the entire hash for every new $m$-length substring, the algorithm computes the next hash $h_{t,i+1}$ from the previous hash $h_{t,i}$ in $O(1)$ time.
A common method uses a base $b$ and a modulus $q$ (often a large prime number), treating the substring as a number in base $b$. The hash value for the substring $T[i \dots i+m-1]$ is:
$$h_{t,i} = \left( \sum_{j=0}^{m-1} T[i+j] \cdot b^{m-1-j} \right) \pmod q$$
To get the next hash $h_{t,i+1}$ (for $T[i+1 \dots i+m]$), we:
1. Subtract the contribution of the leading character $T[i]$.
2. Multiply the result by $b$ to shift all powers up.
3. Add the contribution of the trailing character $T[i+m]$.
$$h_{t,i+1} = \left[ b \left( h_{t,i} - T[i] \cdot b^{m-1} \right) + T[i+m] \right] \pmod q$$
This update takes $O(1)$ time, making the overall scan very fast.
3. Analysis
Worst-case running time: $\Theta(nm)$. This occurs when there are many spurious hits (hash collisions); the algorithm is then forced to perform $\Theta(m)$ character comparisons at every position, matching the brute-force complexity.
Average-case running time: $\Theta(n+m)$. By choosing a good large prime modulus $q$ and a suitable base $b$, spurious hits become rare; the total time is then dominated by the initial hash calculations ($O(m)$) plus the $O(1)$ rolling update over the $n-m+1$ positions ($O(n)$ total), giving linear time.
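The whole algorithm, including the rolling-hash update derived above, fits in a short function. A minimal sketch (the base 256 and the prime modulus 1,000,003 are illustrative choices; `rabin_karp` returns all match positions):

```python
def rabin_karp(text, pattern, base=256, q=1_000_003):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(base, m - 1, q)          # base^(m-1) mod q, for the rolling update
    hp = ht = 0
    for i in range(m):               # initial hashes: pattern and first window
        hp = (hp * base + ord(pattern[i])) % q
        ht = (ht * base + ord(text[i])) % q
    matches = []
    for i in range(n - m + 1):
        # Verify on hash match to rule out spurious hits.
        if hp == ht and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:                # rolling hash: drop text[i], add text[i+m]
            ht = ((ht - ord(text[i]) * h) * base + ord(text[i + m])) % q
    return matches

print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

Python's `%` always yields a non-negative result, so the subtraction in the rolling update needs no extra correction step.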
