Data Structures and Algorithms: Complexity, Hashing, and Trees
Algorithm Analysis: Time, Space, and Efficiency
Algorithm analysis evaluates how efficient an algorithm is by measuring its resource usage. The three main resources analyzed are:
Programming Requirements
This refers to how easy or complex it is to implement the algorithm in code. It involves understanding the logic, writing the correct syntax, and debugging. Simpler algorithms may require less programming effort but might be less efficient.
Time Requirements (Time Complexity)
This measures how much time an algorithm takes to complete as the input size grows. It is usually expressed using Big O notation (e.g., O(n), O(log n)) and helps predict the speed of an algorithm.
Space Requirements (Space Complexity)
This measures how much memory an algorithm needs during its execution, including input storage, temporary variables, and recursion stack. It also uses Big O notation (e.g., O(1), O(n)).
Types of Algorithm Complexity Analysis
Best Case Complexity: The minimum time or space required by the algorithm for any input of size n.
Worst Case Complexity: The maximum time or space required, ensuring performance guarantees.
Average Case Complexity: The expected time or space for a typical input.
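For example, a linear search over an array shows all three cases: the best case is O(1) when the target is the first element, while the worst and average cases are O(n). A minimal C sketch (the array contents are illustrative):

```c
#include <stdio.h>

/* Linear search: returns the index of target, or -1 if not found.
   Best case O(1) (target at index 0), worst/average case O(n).
   Space complexity is O(1): only a loop counter is used. */
int linear_search(const int arr[], int n, int target) {
    for (int i = 0; i < n; i++) {
        if (arr[i] == target)
            return i;
    }
    return -1;
}

int main(void) {
    int data[] = {4, 8, 15, 16, 23, 42};
    int n = sizeof data / sizeof data[0];
    printf("best case:  index %d\n", linear_search(data, n, 4));   /* 1 comparison */
    printf("worst case: index %d\n", linear_search(data, n, 99));  /* n comparisons */
    return 0;
}
```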
Fundamental Data Structures and Classification
Classification of Data Structures
Data structures are broadly classified into two main categories:
1. Primitive Data Structures
These are the basic data types provided by a programming language. They are directly operated upon by machine-level instructions.
Examples: Integer, Float, Character, Boolean, etc.
2. Non-Primitive Data Structures
These are more complex data structures that are built using primitive data types. They are further classified into:
Linear Data Structures
Elements are arranged in a sequential manner. Each element is connected to its previous and next element.
Examples: Array, Linked List, Stack, Queue.
Non-Linear Data Structures
Elements are arranged in a hierarchical or interconnected manner, not sequentially.
Examples: Tree, Graph.
Arrays and Stack Operations
Advantages and Disadvantages of Arrays
Advantages of Arrays:
Simple and easy to use for storing multiple elements of the same type.
Fast access to elements using index (constant time O(1)).
Efficient memory utilization when the number of elements is known in advance.
Supports random access which helps in quick searching and sorting.
Disadvantages of Arrays:
Fixed size — the size must be declared before use and cannot be changed dynamically.
Insertion and deletion are costly as they may require shifting elements (O(n) time).
Wastes memory if allocated size is much larger than the number of elements stored.
Does not support flexible memory management like linked lists.
Stack Operations (LIFO)
A stack follows the Last-In, First-Out (LIFO) principle.
1. PUSH Operation (Insertion)
Adds an element to the top of the stack.
Steps:
Check if the stack is full (stack overflow).
If not full, increment the top pointer.
Insert the new element at the top position.
2. POP Operation (Deletion)
Removes the element from the top of the stack.
Steps:
Check if the stack is empty (stack underflow).
If not empty, retrieve the element at the top.
Decrement the top pointer to remove the element.
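The two operations as a minimal array-based C sketch (assuming a fixed capacity MAX and an integer stack; the names are illustrative):

```c
#include <stdio.h>

#define MAX 100

int stack[MAX];
int top = -1;           /* -1 means the stack is empty */

/* PUSH: check for overflow, increment top, store the element. */
int push(int value) {
    if (top == MAX - 1) {
        printf("Stack overflow\n");
        return 0;
    }
    stack[++top] = value;
    return 1;
}

/* POP: check for underflow, read the top element, decrement top. */
int pop(int *value) {
    if (top == -1) {
        printf("Stack underflow\n");
        return 0;
    }
    *value = stack[top--];
    return 1;
}

int main(void) {
    int v;
    push(10);
    push(20);
    if (pop(&v)) printf("Popped %d\n", v);   /* prints 20 (LIFO) */
    return 0;
}
```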
Queues and Deques
Circular Queue Implementation
A Circular Queue is a linear data structure that follows the FIFO (First In First Out) principle but connects the end of the queue back to the front, forming a circle. This allows efficient use of storage by reusing spaces freed by dequeued elements.
Unlike a simple linear queue, it avoids the “queue full” problem when there is free space at the front.
It uses two pointers: Front (points to the first element) and Rear (points to the last element).
Algorithm to Insert an Element Data into a Circular Queue
Preconditions: Let MAX be the size of the queue array, let queue be the array that stores the elements, and let front and rear be integer pointers initialized to -1 when the queue is empty.
Check if the queue is full: if (rear + 1) % MAX == front, the queue is full and insertion is not possible; report overflow.
If the queue is empty (front == -1), set front = 0 and rear = 0.
Otherwise, update rear = (rear + 1) % MAX to move the rear pointer circularly.
Insert the new element: queue[rear] = Data.
End of insertion.
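The same steps as a C sketch (an integer circular queue of size MAX; a matching dequeue is included for completeness):

```c
#include <stdio.h>

#define MAX 5

int queue[MAX];
int front = -1, rear = -1;

/* Insert data at the rear, wrapping around with modulo arithmetic. */
int enqueue(int data) {
    if ((rear + 1) % MAX == front) {        /* queue full */
        printf("Queue overflow\n");
        return 0;
    }
    if (front == -1)                        /* queue was empty */
        front = 0;
    rear = (rear + 1) % MAX;                /* -1 -> 0 on the first insert */
    queue[rear] = data;
    return 1;
}

/* Remove the element at the front; reset both pointers when the queue empties. */
int dequeue(int *data) {
    if (front == -1) {                      /* queue empty */
        printf("Queue underflow\n");
        return 0;
    }
    *data = queue[front];
    if (front == rear)                      /* last element removed */
        front = rear = -1;
    else
        front = (front + 1) % MAX;
    return 1;
}
```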
Priority Queue
A Priority Queue is an abstract data type similar to a regular queue, but with an added feature: each element has a priority associated with it.
Key Points of Priority Queues
Elements are served based on priority, not just the order they arrive.
The element with the highest priority is dequeued before others.
If two elements have the same priority, they are served in the order they arrived (depending on implementation).
Can be implemented using arrays, linked lists, or more efficiently with heaps (especially binary heaps).
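A compact sketch of a heap-based priority queue in C, assuming the common convention that a smaller number means a higher priority (a binary min-heap); both operations run in O(log n):

```c
#include <stdio.h>

#define CAP 100

int heap[CAP];
int size = 0;

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Insert: append at the end, then sift up while smaller than the parent. */
void pq_insert(int priority) {
    int i = size++;
    heap[i] = priority;
    while (i > 0 && heap[i] < heap[(i - 1) / 2]) {
        swap(&heap[i], &heap[(i - 1) / 2]);
        i = (i - 1) / 2;
    }
}

/* Remove the highest-priority (smallest) element: take the root, move the
   last element to the root, then sift it down. Caller ensures size > 0. */
int pq_extract_min(void) {
    int min = heap[0];
    heap[0] = heap[--size];
    int i = 0;
    for (;;) {
        int l = 2 * i + 1, r = 2 * i + 2, smallest = i;
        if (l < size && heap[l] < heap[smallest]) smallest = l;
        if (r < size && heap[r] < heap[smallest]) smallest = r;
        if (smallest == i) break;
        swap(&heap[i], &heap[smallest]);
        i = smallest;
    }
    return min;
}
```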
Priority Queue Applications
Task scheduling in operating systems.
Handling emergencies in hospital triage systems.
Managing bandwidth in network routers.
Algorithms like Dijkstra’s shortest path use priority queues.
Queue vs. Deque Comparison
| Aspect | Queue (FIFO) | Deque (Double-Ended Queue) |
|---|---|---|
| Definition | A linear data structure that follows FIFO (First In First Out) order. | A linear data structure where insertion and deletion can happen at both ends. |
| Insertion | Insert only at the rear (enqueue). | Insert at both front and rear ends. |
| Deletion | Delete only from the front (dequeue). | Delete from both front and rear ends. |
| Operations | Two main operations: enqueue and dequeue. | Four operations: insert front, insert rear, delete front, delete rear. |
| Use Cases | Scheduling, buffering, resource management. | More flexible; used in palindrome checking and undo functionality. |
Linked Lists: Types and Features
Definition of a Linked List
A Linked List is a linear data structure where elements, called nodes, are stored in a sequence. Each node contains two parts:
Data: The value stored in the node.
Pointer (or Link): A reference to the next node in the sequence.
Unlike arrays, linked lists do not store elements in contiguous memory locations, allowing dynamic memory allocation and efficient insertions/deletions.
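A node and a traversal as a C sketch (names are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

/* One node of a singly linked list: data plus a link to the next node. */
struct Node {
    int data;
    struct Node *next;
};

/* Insert a new node at the front of the list; returns the new head. */
struct Node *push_front(struct Node *head, int value) {
    struct Node *node = malloc(sizeof *node);
    node->data = value;
    node->next = head;
    return node;
}

/* Visit every node by following the next pointers until NULL. */
void traverse(const struct Node *head) {
    for (const struct Node *p = head; p != NULL; p = p->next)
        printf("%d -> ", p->data);
    printf("NULL\n");
}
```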
Types of Linked Lists
Singly Linked List
Each node has a pointer to the next node only.
Traversal is one-way (forward).
Doubly Linked List (Two-Way Linked List)
A Doubly Linked List is a type of linked list in which each node contains three fields:
Data — stores the actual value.
Next Pointer — points to the next node in the list.
Previous Pointer — points to the previous node in the list.
Key Features of Doubly Linked Lists
Allows traversal in both directions: forward and backward.
Easier to implement operations like insertion and deletion from both ends or at any position, compared to a singly linked list.
Requires extra memory to store the previous pointer in each node.
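A sketch of the node layout and an insert-at-front that maintains both links (illustrative C):

```c
#include <stdlib.h>

/* Doubly linked list node: data plus links to both neighbours. */
struct DNode {
    int data;
    struct DNode *prev;
    struct DNode *next;
};

/* Insert at the front: the new node becomes the head, and the old
   head's prev pointer is updated to point back to it. */
struct DNode *insert_front(struct DNode *head, int value) {
    struct DNode *node = malloc(sizeof *node);
    node->data = value;
    node->prev = NULL;
    node->next = head;
    if (head != NULL)
        head->prev = node;
    return node;
}
```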
Circular Linked List
The last node points back to the first node, forming a circle.
Can be singly or doubly linked.
Header Linked List
A Header Linked List is a type of linked list where each list has a special node called a header node. This header node contains information about the list such as the count of nodes or pointers to the first and last nodes. The header node itself may or may not hold actual data.
Features of Header Linked List
Simplifies list operations by storing metadata in the header node.
Helps in easily managing the list size or other attributes.
The header node acts as the starting point of the list.
Categories of Header Linked List
Single Header Linked List: Has a single header node pointing to the first node of the list. The header stores information like the count of elements.
Double Header Linked List: Has two header nodes, typically pointing to the first and last nodes. Useful for efficient operations at both ends (like in a double-ended queue).
Circular Header Linked List: The list is circular and has a header node. The last node points back to the header node or the first node, enabling circular traversal.
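A minimal sketch of a single header linked list in C, where the header node stores the element count (names are illustrative):

```c
#include <stdlib.h>

struct Node {
    int data;
    struct Node *next;
};

/* The header node carries metadata (here, a count) instead of list data. */
struct HeaderList {
    int count;
    struct Node *first;
};

/* Insert at the front and keep the count in the header up to date. */
void hl_insert_front(struct HeaderList *list, int value) {
    struct Node *node = malloc(sizeof *node);
    node->data = value;
    node->next = list->first;
    list->first = node;
    list->count++;
}
```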
Applications of Linked Lists
Dynamic Memory Allocation: Easily grows or shrinks during program execution.
Implementing Stacks and Queues: Using linked lists provides efficient insertion and deletion.
Graph Adjacency Lists: Represent graphs efficiently.
Undo/Redo Functionality: Doubly linked lists help move forward and backward through states.
Navigation Systems: Like browser history where back and forward moves are needed.
Trees and Graphs
AVL Tree (Self-Balancing Binary Search Tree)
An AVL Tree is a type of self-balancing binary search tree named after its inventors Adelson-Velsky and Landis.
Key Features of AVL Trees
For every node in the tree, the height difference (balance factor) between its left and right subtrees is at most 1.
The balance factor = height of left subtree − height of right subtree.
If the balance factor becomes less than -1 or greater than 1 after insertion or deletion, the tree is rebalanced using rotations (single or double rotations).
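A sketch of the balance factor computation and a right rotation (used when a subtree becomes left-heavy); a full AVL insertion also needs the mirrored left rotation and the double-rotation cases:

```c
#include <stdlib.h>

struct AVLNode {
    int key;
    int height;                 /* height of the subtree rooted here */
    struct AVLNode *left, *right;
};

static int height(const struct AVLNode *n) { return n ? n->height : 0; }
static int max(int a, int b) { return a > b ? a : b; }

/* Balance factor = height(left subtree) - height(right subtree). */
static int balance_factor(const struct AVLNode *n) {
    return n ? height(n->left) - height(n->right) : 0;
}

/* Right rotation around y: y's left child x becomes the subtree root. */
static struct AVLNode *rotate_right(struct AVLNode *y) {
    struct AVLNode *x = y->left;
    struct AVLNode *t2 = x->right;

    x->right = y;
    y->left = t2;

    y->height = 1 + max(height(y->left), height(y->right));
    x->height = 1 + max(height(x->left), height(x->right));
    return x;                   /* new root of this subtree */
}
```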
Advantages of AVL Trees
Guarantees O(log n) time complexity for search, insertion, and deletion.
Maintains balanced height, improving efficiency compared to ordinary binary search trees which can become skewed.
AVL Tree Applications
Used in databases and file systems where quick lookups, insertions, and deletions are required.
Graph Traversal Algorithms
Depth-First Search (DFS) Traversal Algorithm
Start at the given vertex, mark it as visited, and process it.
Recursively visit all unvisited adjacent vertices of the current vertex.
Repeat step 2 until all vertices reachable from the starting vertex are visited.
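The same traversal as recursive C code, using an adjacency matrix and a visited array (a sketch for a small graph with V vertices; the example edges are illustrative):

```c
#include <stdio.h>

#define V 5

int adj[V][V];          /* adjacency matrix: adj[u][v] = 1 if edge u-v exists */
int visited[V];

/* Visit u, then recursively visit every unvisited neighbour of u. */
void dfs(int u) {
    visited[u] = 1;
    printf("%d ", u);                   /* process the vertex */
    for (int v = 0; v < V; v++) {
        if (adj[u][v] && !visited[v])
            dfs(v);
    }
}

int main(void) {
    /* small example graph: 0-1, 0-2, 1-3, 2-4 */
    adj[0][1] = adj[1][0] = 1;
    adj[0][2] = adj[2][0] = 1;
    adj[1][3] = adj[3][1] = 1;
    adj[2][4] = adj[4][2] = 1;
    dfs(0);                             /* prints 0 1 3 2 4 */
    printf("\n");
    return 0;
}
```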
BFS vs. DFS Comparison
| Aspect | BFS (Breadth-First Search) | DFS (Depth-First Search) |
|---|---|---|
| Traversal Order | Explores all neighbors at the current level before moving to the next level (level by level). | Explores as far as possible along one branch before backtracking. |
| Data Structure Used | Queue | Stack (or recursion) |
| Use Case | Finds shortest paths in unweighted graphs. | Used for pathfinding, topological sorting, and cycle detection. |
| Memory Usage | Often uses more memory (the queue holds a whole frontier of vertices). | Often uses less memory (follows one path deeply). |
| Implementation | Iterative | Recursive or iterative |
Types of Graphs
Weakly Connected Graph: A directed graph where replacing all edges with undirected edges makes it connected.
Strongly Connected Graph: A directed graph where there is a path from every vertex to every other vertex.
Multigraph: A graph that may have multiple edges between the same pair of vertices.
Unconnected (Disconnected) Graph: A graph in which at least two vertices have no path connecting them.
Weighted Graph: A graph where each edge has an associated numerical value (weight).
Hashing and Compression
What is a Hash Function?
A Hash Function is a function that takes input data (called a key) and converts it into a fixed-size integer value called a hash code or hash value. This value is used to index into a hash table, allowing for fast data retrieval.
Properties of a Good Hash Function
Deterministic: The same input key should always produce the same hash value.
Uniform Distribution: It should distribute hash values uniformly across the hash table to minimize collisions.
Fast Computation: The function should be efficient to compute to ensure quick access.
Minimizes Collisions: Different keys should rarely produce the same hash value (collision).
Avalanche Effect: A small change in the input key should produce a significantly different hash value.
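A simple illustrative string hash function in C; the multiplier 31 and the prime table size are common but arbitrary choices:

```c
#include <stdio.h>

#define TABLE_SIZE 101      /* a prime table size helps spread the values */

/* Polynomial rolling hash: deterministic, cheap to compute, and
   reasonably uniform for short strings. */
unsigned int hash(const char *key) {
    unsigned int h = 0;
    while (*key) {
        h = h * 31 + (unsigned char)*key++;
    }
    return h % TABLE_SIZE;  /* map to a valid table index */
}

int main(void) {
    printf("index of \"apple\":  %u\n", hash("apple"));
    printf("index of \"apples\": %u\n", hash("apples"));   /* small change, different index */
    return 0;
}
```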
Applications of Hash Functions
Hash Tables: Used for efficient data retrieval in databases, caches, and associative arrays.
Data Integrity: In checksums and digital signatures to verify data authenticity.
Cryptography: Cryptographic hash functions (such as the SHA family; older ones like MD5 are no longer considered secure) protect passwords and sensitive data.
Caching: Quickly locate stored data.
Load Balancing: Distribute requests evenly across servers.
Collision Resolution Techniques
When two different keys map to the same hash index, a collision occurs. These techniques resolve collisions:
Double Hashing
Double hashing is a collision resolution technique where two different hash functions are used. When a collision occurs, the second hash function computes a step size to find the next available slot, reducing clustering and improving distribution.
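The probe sequence can be sketched as follows, assuming two independent hash functions hash1 and hash2 where hash2 never returns 0 (names and constants are illustrative):

```c
/* Double hashing: the i-th probe for a key is
   index_i = (hash1(key) + i * hash2(key)) % TABLE_SIZE
   hash2 supplies a key-dependent step size, so keys that collide on
   hash1 still follow different probe sequences, reducing clustering. */

#define TABLE_SIZE 101

unsigned int hash1(unsigned int key) { return key % TABLE_SIZE; }

/* Must never return 0, otherwise probing would not move at all. */
unsigned int hash2(unsigned int key) { return 7 - (key % 7); }

unsigned int probe(unsigned int key, unsigned int i) {
    return (hash1(key) + i * hash2(key)) % TABLE_SIZE;
}
```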
Bucket Hashing
Bucket hashing stores multiple elements in each slot of the hash table using a bucket (a small container like a linked list or an array). This way, collisions are handled by grouping all colliding keys in the same bucket. It is efficient when the number of collisions is small and supports dynamic growth of buckets.
Bucket Hashing Applications
Used in database indexing and situations where collisions are frequent but manageable.
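A sketch of bucket hashing with linked-list buckets (separate chaining); colliding keys simply share a bucket:

```c
#include <stdio.h>
#include <stdlib.h>

#define BUCKETS 10

struct Entry {
    int key;
    struct Entry *next;
};

struct Entry *table[BUCKETS];   /* each slot is the head of a bucket list */

/* Insert: hash the key to a bucket and prepend it to that bucket's list. */
void insert(int key) {
    int b = key % BUCKETS;
    struct Entry *e = malloc(sizeof *e);
    e->key = key;
    e->next = table[b];
    table[b] = e;
}

/* Search: only the entries in one bucket need to be scanned. */
int contains(int key) {
    for (struct Entry *e = table[key % BUCKETS]; e; e = e->next)
        if (e->key == key)
            return 1;
    return 0;
}
```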
Huffman Algorithm for Data Compression
The Huffman algorithm is a greedy algorithm used for lossless data compression. It assigns variable-length binary codes to the input characters, giving shorter codes to more frequent characters, which reduces the total number of bits needed to represent the data.
Steps of Huffman Algorithm
Calculate frequency of each character in the input.
Create leaf nodes for each character and build a priority queue (min-heap) based on frequencies.
Extract two nodes with the smallest frequencies and combine them into a new node with frequency equal to their sum.
Insert the new node back into the priority queue.
Repeat steps 3 and 4 until only one node remains — this becomes the root of the Huffman tree.
Assign binary codes by traversing the tree: left edge = 0, right edge = 1.
Huffman Example
Given characters with frequencies: A: 5, B: 9, C: 12, D: 13, E: 16, F: 45.
Combine smallest: A(5) + B(9) = 14
New frequencies: 14, 12, 13, 16, 45
Combine smallest: 12 + 13 = 25
New frequencies: 14, 16, 25, 45
Combine smallest: 14 + 16 = 30
New frequencies: 25, 30, 45
Combine smallest: 25 + 30 = 55
New frequencies: 45, 55
Combine smallest: 45 + 55 = 100. Only one node remains, so it becomes the root of the Huffman tree.
Reading codes off the tree (left = 0, right = 1), the most frequent character F receives a 1-bit code, while the least frequent characters A and B receive 4-bit codes.
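A compact C sketch of the whole procedure for the frequencies above. A linear scan for the two minima stands in for the min-heap, which is fine for a handful of symbols; the exact codes printed depend on tie-breaking, but the code lengths match the example:

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct HNode {
    char ch;                    /* '\0' for internal (combined) nodes */
    int freq;
    struct HNode *left, *right;
} HNode;

static HNode *new_node(char ch, int freq, HNode *l, HNode *r) {
    HNode *n = malloc(sizeof *n);
    n->ch = ch; n->freq = freq; n->left = l; n->right = r;
    return n;
}

/* Repeatedly combine the two lowest-frequency nodes until one remains. */
static HNode *build_tree(HNode *nodes[], int n) {
    while (n > 1) {
        int a = 0, b = 1, t;
        if (nodes[b]->freq < nodes[a]->freq) { t = a; a = b; b = t; }
        for (int i = 2; i < n; i++) {           /* find the two minima */
            if (nodes[i]->freq < nodes[a]->freq)      { b = a; a = i; }
            else if (nodes[i]->freq < nodes[b]->freq) { b = i; }
        }
        HNode *merged = new_node('\0', nodes[a]->freq + nodes[b]->freq,
                                 nodes[a], nodes[b]);
        if (a > b) { t = a; a = b; b = t; }
        nodes[a] = merged;                      /* replace one slot with the merged node */
        nodes[b] = nodes[n - 1];                /* drop the other by moving the last in */
        n--;
    }
    return nodes[0];
}

/* Walk the tree: left edge = 0, right edge = 1; leaves carry the characters. */
static void print_codes(const HNode *t, char *buf, int depth) {
    if (!t->left && !t->right) {
        buf[depth] = '\0';
        printf("%c: %s\n", t->ch, buf);
        return;
    }
    buf[depth] = '0'; print_codes(t->left,  buf, depth + 1);
    buf[depth] = '1'; print_codes(t->right, buf, depth + 1);
}

int main(void) {
    char chars[] = {'A', 'B', 'C', 'D', 'E', 'F'};
    int  freqs[] = {5, 9, 12, 13, 16, 45};
    HNode *nodes[6];
    char buf[16];
    for (int i = 0; i < 6; i++)
        nodes[i] = new_node(chars[i], freqs[i], NULL, NULL);
    print_codes(build_tree(nodes, 6), buf, 0);
    return 0;
}
```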
Specialized Data Structures and Techniques
Sparse Matrix Representation
A Sparse Matrix is a matrix in which most of the elements are zero. Typically, if more than half of the elements are zero, the matrix is considered sparse.
Concept and Storage Efficiency
Storing all elements, including zeros, wastes memory. Special techniques store only the non-zero elements and their positions to save space.
Common storage methods include Coordinate List (COO) or Compressed Sparse Row (CSR) format.
Sparse Matrix Example
Consider a 4×4 matrix:
[0 0 3 0]
[0 0 0 0]
[0 7 0 0]
[0 0 0 0]
Here, only 2 out of 16 elements are non-zero.
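A sketch of the coordinate (COO) triplet representation for this matrix; only the non-zero entries and their positions are stored:

```c
#include <stdio.h>

/* One triplet per non-zero element: (row, column, value). */
struct Triplet {
    int row, col, value;
};

int main(void) {
    /* The 4x4 matrix above has only two non-zero elements,
       so two triplets replace sixteen stored values. */
    struct Triplet sparse[] = {
        {0, 2, 3},      /* row 0, column 2, value 3 */
        {2, 1, 7}       /* row 2, column 1, value 7 */
    };
    int count = sizeof sparse / sizeof sparse[0];

    for (int i = 0; i < count; i++)
        printf("(%d, %d) = %d\n", sparse[i].row, sparse[i].col, sparse[i].value);
    return 0;
}
```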
Sparse Matrix Applications
Used in scientific computing, graph algorithms, and machine learning where large datasets often contain many zeros.
Recursion
Recursion is a programming technique where a function calls itself to solve a smaller instance of the same problem until it reaches a base case (a condition where the recursion stops).
Recursion Example
Calculating the factorial of a number n:
factorial(n) = n * factorial(n - 1), with factorial(0) = 1 (base case)
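In C, with the base case stopping the recursion:

```c
#include <stdio.h>

/* Each call works on a smaller n until the base case n == 0 is reached. */
unsigned long factorial(unsigned int n) {
    if (n == 0)                     /* base case */
        return 1;
    return n * factorial(n - 1);    /* recursive case */
}

int main(void) {
    printf("5! = %lu\n", factorial(5));     /* prints 120 */
    return 0;
}
```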
Demerits of Recursion
High Memory Usage: Each recursive call adds a new layer to the call stack, consuming more memory. This can lead to stack overflow if recursion is too deep.
Slower Execution: Function calls have overhead. Recursive solutions can be slower than iterative ones because of this overhead and repeated function calls.
Complex Debugging: Recursive code can be harder to debug and understand, especially if the recursion logic is complex.
Risk of Infinite Recursion: If the base case is not properly defined or never reached, the recursion continues indefinitely, eventually causing a stack overflow and a program crash.
Search Algorithm Comparison
| Aspect | Linear Search | Binary Search |
|---|---|---|
| Definition | Searches elements sequentially one by one. | Searches by repeatedly dividing the sorted list in half. |
| Data Requirement | Works on both sorted and unsorted lists. | Requires the list to be sorted beforehand. |
| Search Method | Checks each element until the target is found or the list ends. | Compares the target with the middle element and narrows the search to one half. |
| Time Complexity | O(n): worst case, may need to check all elements. | O(log n): halves the search space each time. |
| Efficiency | Less efficient for large data sets. | Much more efficient for large, sorted data. |
| Implementation | Simple to implement. | Slightly more complex due to recursion or a loop with indices. |
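An iterative binary search in C for comparison (the array must already be sorted; values are illustrative):

```c
#include <stdio.h>

/* Returns the index of target in a sorted array, or -1 if absent.
   Each iteration halves the remaining search range: O(log n). */
int binary_search(const int arr[], int n, int target) {
    int low = 0, high = n - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;   /* avoids overflow of (low + high) */
        if (arr[mid] == target)
            return mid;
        else if (arr[mid] < target)
            low = mid + 1;
        else
            high = mid - 1;
    }
    return -1;
}

int main(void) {
    int data[] = {2, 5, 8, 12, 16, 23, 38, 56};
    int n = sizeof data / sizeof data[0];
    printf("index of 23: %d\n", binary_search(data, n, 23));   /* prints 5 */
    return 0;
}
```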