Recommendation Systems, Similarity Metrics & Graph Algorithms
Design Recommendation System — Collaborative Filtering
124. Design recommendation system using collaborative filtering (movie streaming) (10 marks — 10 points)
Input: users, movies, ratings → construct utility matrix.
Step 1: Compute similarity between users (user-based) or movies (item-based).
Step 2: Identify nearest neighbors using cosine or Pearson similarity.
Step 3: Predict ratings for unseen movies via a weighted average of neighbors’ ratings.
Step 4: Recommend top-N movies with the highest predicted ratings.
Handle sparsity with matrix factorization (SVD) if necessary.
Update the system continuously with new user ratings.
Evaluate using RMSE or precision/recall metrics.
Incorporate hybrid features by combining content-based metadata.
Deploy in a streaming service interface with personalization.
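The steps above can be sketched in a few lines of Python. The tiny ratings dictionary, user names, and movie IDs below are illustrative, not from the notes; the sketch shows user-based similarity (Step 1–2) and the weighted-average prediction (Step 3).

```python
# A minimal user-based collaborative-filtering sketch (toy data).
import math

# Utility matrix: rows = users, columns = movies; missing key = not yet rated.
ratings = {
    "alice": {"M1": 5, "M2": 3, "M3": 4},
    "bob":   {"M1": 4, "M2": 2, "M3": 5, "M4": 4},
    "carol": {"M1": 1, "M2": 5, "M4": 2},
}

def cosine(u, v):
    """Cosine similarity over the movies both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    nu = math.sqrt(sum(u[m] ** 2 for m in common))
    nv = math.sqrt(sum(v[m] ** 2 for m in common))
    return dot / (nu * nv)

def predict(user, movie, k=2):
    """Weighted average of the k most similar users' ratings for `movie`."""
    neighbors = [(cosine(ratings[user], r), r[movie])
                 for other, r in ratings.items()
                 if other != user and movie in r]
    neighbors.sort(reverse=True)
    top = neighbors[:k]
    num = sum(sim * rating for sim, rating in top)
    den = sum(sim for sim, _ in top)
    return num / den if den else 0.0

print(round(predict("alice", "M4"), 2))  # ≈ 3.18 for this toy matrix
```

Step 4 (top-N) is then just ranking all unseen movies by `predict` and taking the N highest.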
Girvan–Newman Algorithm with Example
125. Girvan-Newman Algorithm with Example (5 marks — 6 points)
Used for community detection in networks.
Compute edge betweenness for all edges.
Remove the edge with the highest betweenness.
Recompute betweenness iteratively until the graph breaks into communities.
Example: Social network with 6 nodes (A–F) and edges with varying betweenness → remove edges sequentially → identify two clusters {A, B, C} and {D, E, F}.
Detects communities without prior knowledge of the number of communities. Works well for medium-sized networks but has high complexity for very large graphs.
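The 6-node example can be run end to end with a self-contained sketch: Brandes' algorithm computes edge betweenness, and the highest-betweenness edge is removed until the graph splits. The edge list (two triangles joined by a bridge C–D) is an assumed concrete version of the example above.

```python
# Girvan–Newman sketch on the 6-node example {A,B,C} + {D,E,F} with bridge C–D.
from collections import deque

edges = [("A","B"), ("A","C"), ("B","C"),
         ("C","D"),                       # the bridge: highest betweenness
         ("D","E"), ("D","F"), ("E","F")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def edge_betweenness(adj):
    """Brandes' algorithm for edge betweenness in an unweighted graph."""
    bet = {}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, q = [], deque([s])
        while q:                          # BFS counting shortest paths
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):         # back-propagate path dependencies
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] = bet.get(frozenset((v, w)), 0.0) + c
                delta[v] += c
    return {e: b / 2 for e, b in bet.items()}   # each pair counted twice

def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s not in seen:
            comp, q = set(), deque([s])
            while q:
                v = q.popleft()
                if v not in comp:
                    comp.add(v)
                    q.extend(adj[v] - comp)
            seen |= comp
            comps.append(comp)
    return comps

# Remove the highest-betweenness edge until the graph breaks apart.
while len(components(adj)) == 1:
    bet = edge_betweenness(adj)
    u, v = max(bet, key=bet.get)
    adj[u].discard(v)
    adj[v].discard(u)

print(sorted(sorted(c) for c in components(adj)))
```

The first removal is the bridge C–D (betweenness 9: all 3 × 3 cross-cluster shortest paths use it), which immediately yields the clusters {A, B, C} and {D, E, F}.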
Nearest Neighbor Experiment — Retail Collaborative Filtering
126. Experiment using nearest neighbor technique in collaborative filtering (retail website) (10 marks)
Steps:
Data: Customer–item purchase matrix (utility matrix).
Preprocessing: Normalize ratings and handle missing values.
Compute similarities: User-based: similarity between customers using cosine or Pearson. Item-based: similarity between products.
Nearest neighbor selection: Choose top-k similar users/items.
Prediction: Weighted average of neighbors’ ratings for unseen products.
Recommendation: Top-N products with highest predicted ratings.
Evaluation: Split data into training/test → compute RMSE, precision, and recall.
Experiment variations: Change k, similarity metric, or use a hybrid approach.
Observation: Analyze accuracy and coverage of recommendations.
Scaling: For large datasets, implement approximate nearest neighbor methods (LSH, MinHash).
Conclusion: Identify best parameter settings and report top recommended items per user.
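The "vary k" part of the experiment can be sketched as follows: one known rating is held out, predicted via a k-NN weighted average, and the RMSE is compared across k. The toy matrix and user/product names are assumptions for illustration.

```python
# Evaluation sketch: hold out a known rating, predict it with k-NN, compare RMSE.
import math

# Toy customer–item matrix; u1's rating for p3 is held out as the test set.
train = {
    "u1": {"p1": 5, "p2": 3},
    "u2": {"p1": 4, "p2": 2, "p3": 5},
    "u3": {"p1": 1, "p2": 5, "p3": 2},
}
held_out = {("u1", "p3"): 4}

def cosine(a, b):
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb)

def predict(user, item, k):
    sims = sorted(((cosine(train[user], r), r[item])
                   for u, r in train.items() if u != user and item in r),
                  reverse=True)[:k]
    den = sum(s for s, _ in sims)
    return sum(s * r for s, r in sims) / den if den else 0.0

def rmse(k):
    errs = [(predict(u, i, k) - actual) ** 2
            for (u, i), actual in held_out.items()]
    return math.sqrt(sum(errs) / len(errs))

for k in (1, 2):                 # the "vary k" experiment variation
    print(f"k={k}: RMSE={rmse(k):.3f}")
```

On this toy data, k=2 gives a lower RMSE than k=1, illustrating how the observation step compares parameter settings.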
Triangle Counting — MapReduce vs Graph Methods
122. Efficient approach for counting triangles — MapReduce vs Graph-specific
Graph-specific algorithms (like Node-Iterator, Edge-Iterator) are faster for smaller graphs. MapReduce is scalable and handles huge distributed graphs but has overhead (shuffle, reduce).
Conclusion: Use graph-specific algorithms for efficiency; use MapReduce for very large distributed datasets.
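The graph-specific Node-Iterator approach mentioned above fits in a few lines: for each node, count neighbor pairs that are themselves connected, then divide by three. The toy edge list is an assumption for illustration.

```python
# Node-Iterator triangle counting on a small toy graph.
from itertools import combinations

edges = {("A","B"), ("A","C"), ("B","C"), ("C","D"), ("D","E")}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def count_triangles(adj):
    total = 0
    for v in adj:
        for u, w in combinations(adj[v], 2):   # each pair of v's neighbors
            if w in adj[u]:                    # pair is itself an edge
                total += 1
    return total // 3        # each triangle is seen once per vertex

print(count_triangles(adj))  # the toy graph has exactly one triangle (A,B,C)
```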
Influencer Detection in Social Networks
120. Algorithm to detect influencers in social network (5 marks — 7 points)
Represent the social network as a graph G(V, E).
Compute centrality metrics: degree, closeness, betweenness, PageRank.
Rank nodes by a combined score (weighted or normalized).
Identify top-k nodes as influencers.
Optionally detect communities to find local influencers.
Validate using engagement or reach metrics.
Output the top influencers list.
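The ranking step above can be sketched by combining two of the listed metrics: normalized degree centrality and a plain power-iteration PageRank. The graph, the 0.5/0.5 weighting, and k=2 are illustrative assumptions.

```python
# Influencer ranking sketch: combined degree + PageRank score on a toy graph.
adj = {
    "A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"},
    "D": {"A", "E"}, "E": {"D"},
}

def pagerank(adj, d=0.85, iters=50):
    """Plain power-iteration PageRank over the (symmetric) adjacency dict."""
    n = len(adj)
    pr = {v: 1 / n for v in adj}
    for _ in range(iters):
        pr = {v: (1 - d) / n + d * sum(pr[u] / len(adj[u])
                                       for u in adj if v in adj[u])
              for v in adj}
    return pr

degree = {v: len(adj[v]) / (len(adj) - 1) for v in adj}  # normalized degree
pr = pagerank(adj)
score = {v: 0.5 * degree[v] + 0.5 * pr[v] / max(pr.values()) for v in adj}
top_k = sorted(score, key=score.get, reverse=True)[:2]   # top-k influencers
print(top_k)
```

On this toy graph the hub node A comes out on top, matching the intuition that high-degree, high-PageRank nodes are the influencers.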
MapReduce for Triangle Counting
118. MapReduce & counting triangles in a graph (5 marks — 6 points)
Triangles: Three nodes mutually connected.
Map phase: For each node, emit every pair of its neighbors, keyed by that pair (a candidate "open triad").
Reduce phase: For each candidate pair, check whether an edge connects the two nodes → if so, the triad closes into a triangle.
Scales to large graphs using distributed computation.
Handles sparse graphs efficiently.
Duplicate counting is avoided by ordering nodes (emit a pair only from its lowest-ordered member) or by dividing the final count by 3, since each triangle is otherwise found once per vertex.
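The two phases can be simulated in-process; mappers and reducers are just loops here, whereas a real MapReduce job would shard them across workers. The toy edge list is an assumption.

```python
# Simulated MapReduce triangle count: map emits candidate triads, reduce
# checks which candidates are closed by an actual edge.
from collections import defaultdict
from itertools import combinations

edges = {frozenset(e) for e in [("A","B"), ("A","C"), ("B","C"), ("C","D")]}
adj = defaultdict(set)
for e in edges:
    u, v = e
    adj[u].add(v)
    adj[v].add(u)

# Map phase: each node emits every pair of its neighbors as a candidate triad.
candidates = defaultdict(int)
for v, nbrs in adj.items():
    for u, w in combinations(sorted(nbrs), 2):
        candidates[frozenset((u, w))] += 1

# Reduce phase: a candidate pair closes a triangle iff it is itself an edge.
triangles = sum(count for pair, count in candidates.items() if pair in edges)
print(triangles // 3)   # each triangle is emitted once by each of its 3 nodes
```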
Long Tail Phenomenon in Recommendations
114. Long Tail Phenomenon (5 marks)
The majority of sales come from a small number of popular items (the head). A large number of niche items (the long tail) collectively contribute significantly. Example: Netflix → popular shows (head) vs rare indie films (long tail).
Recommender systems leverage the long tail to suggest niche content, enabling platforms to monetize rare items and encourage variety and personalization in content delivery.
Components of a Social Network Graph
113. Components of a Social Network Graph (5 marks — 6 points)
Nodes / Vertices: Represent users or entities.
Edges / Links: Represent relationships (friendship, follow, interaction).
Weights (optional): Strength or frequency of interaction.
Neighborhoods: Local clusters of connected nodes.
Betweenness / Centrality metrics: Detect influencers or critical nodes.
Interactions reveal patterns: communities, influence, and connectivity.
Content-Based Filtering Recommendation System
112. Content-Based Filtering Recommendation System (5 marks — 7 points)
Recommend items similar to those a user liked in the past.
Use item features (genre, keywords, tags).
Example: Movie “Inception” → recommend “Interstellar” (both sci‑fi, same director).
Compute similarity between item profiles (cosine, Euclidean).
Build a user profile from previously liked items.
Suggest items with the highest similarity scores.
Works independently of other users (unlike collaborative filtering).
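The steps above can be sketched with binary feature vectors; the feature set, movie titles, and averaging scheme for the user profile are illustrative assumptions.

```python
# Content-based filtering sketch: items as binary feature vectors.
import math

features = ["sci-fi", "nolan", "romance", "comedy"]
items = {
    "Inception":    [1, 1, 0, 0],
    "Interstellar": [1, 1, 0, 0],
    "Notting Hill": [0, 0, 1, 1],
}
liked = ["Inception"]

# User profile = mean of the liked items' feature vectors.
profile = [sum(items[t][i] for t in liked) / len(liked)
           for i in range(len(features))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Rank unseen items by similarity to the user profile.
recs = sorted(((cosine(profile, vec), name) for name, vec in items.items()
               if name not in liked), reverse=True)
print(recs[0][1])   # most similar unseen item
```

"Interstellar" wins because it shares every feature with the profile built from "Inception", matching the example above.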
What Is a Recommendation System?
106. What is a Recommendation System? (2 marks — 3 points)
A system that suggests relevant items to users based on their preferences or behavior.
Helps users navigate large datasets efficiently (e.g., movies, products).
Commonly used in e-commerce (Amazon), streaming services (Netflix), and social media.
Collaborative Filtering and Nearest Neighbor
107. How does collaborative filtering relate to nearest neighbor technique? (2 marks — 3 points)
Collaborative filtering predicts user preferences based on similar users or items.
Nearest neighbor identifies the most similar users or items using similarity metrics like cosine or Pearson.
The prediction is often a weighted average of neighbors’ ratings.
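The weighted average fits in one line; the similarity/rating pairs below are toy values, not from the notes.

```python
# Neighbors' ratings weighted by their similarity to the target user.
neighbors = [(0.9, 4), (0.6, 5), (0.3, 2)]   # (similarity, rating) pairs
pred = sum(s * r for s, r in neighbors) / sum(s for s, _ in neighbors)
print(round(pred, 2))   # 4.0 for these toy values
```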
Definition of a Utility Matrix
108. What is a Utility Matrix? (2 marks — 3 points)
A matrix representing users (rows) vs items (columns).
Each cell shows a user’s rating for an item (explicit or implicit).
Sparse for large systems; used as input for collaborative filtering.
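A toy utility matrix makes the sparsity point concrete (user/item names are illustrative):

```python
# Utility matrix: rows = users, columns = items, None = unrated cell.
matrix = {
    "u1": {"item1": 5,    "item2": None, "item3": 3},
    "u2": {"item1": None, "item2": 4,    "item3": None},
}
cells = [r for row in matrix.values() for r in row.values()]
sparsity = cells.count(None) / len(cells)
print(sparsity)   # 0.5 → half the cells are empty
```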
Understanding Betweenness
109. What do you understand by Betweenness? (2 marks — 3 points)
A graph metric measuring how often a node or edge lies on the shortest paths between other nodes.
High betweenness → node or edge is critical for network connectivity.
Useful for detecting influencers or critical links in social networks.
Why Use Clustering Techniques?
110. Why do we use clustering techniques? (2 marks — 4 points)
To group similar data points automatically.
Reduce complexity for analysis.
Discover hidden patterns.
Useful in recommendation, marketing segmentation, and anomaly detection.
Rounding Data in a Utility Matrix
111. Rounding the Data in Utility Matrix (2 marks — 3 points)
Convert continuous or predicted ratings into discrete values.
Example: 3.6 → 4 stars.
Helps simplify predictions and decision-making for recommendations.
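In Python this is just `round()`; one caveat worth noting is that `round()` uses banker's rounding at exact .5 (so `round(4.5)` gives 4, the nearest even value). The predicted-ratings list is illustrative.

```python
# Rounding predicted ratings to whole stars.
predicted = [3.6, 2.2, 4.5]
stars = [round(p) for p in predicted]
print(stars)   # [4, 2, 4] — note 4.5 rounds to the nearest even integer
```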
Cosine Similarity & Sentence Distance
119. Cosine Similarity & Cosine Distance of sentences (2 marks — 4 points)
Sentences:
S1: “AI is our friend and it has been friendly”
S2: “AI and humans have always been friendly”
Steps:
Convert to vectors using word counts (bag of words).
Compute the dot product and norms.
With simple whitespace tokenization, 4 words are shared (“AI”, “and”, “been”, “friendly”), so cosine similarity = 4 / (√9 · √7) ≈ 0.504.
Cosine distance = 1 − similarity ≈ 0.496.
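The steps above can be reproduced directly; with plain whitespace tokenization and raw word counts, the similarity works out to about 0.504.

```python
# Bag-of-words cosine similarity between the two example sentences.
import math
from collections import Counter

s1 = "AI is our friend and it has been friendly"
s2 = "AI and humans have always been friendly"
c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())

vocab = set(c1) | set(c2)
dot = sum(c1[w] * c2[w] for w in vocab)     # Counter returns 0 for missing words
n1 = math.sqrt(sum(v * v for v in c1.values()))
n2 = math.sqrt(sum(v * v for v in c2.values()))
sim = dot / (n1 * n2)
print(f"similarity={sim:.3f}, distance={1 - sim:.3f}")
```

The four shared words give a dot product of 4; the norms are √9 and √7, so similarity = 4 / (3·√7) ≈ 0.504 and distance ≈ 0.496.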
