# Beyond Set Similarity: Spatial Similarity Search, and Graph Algorithms (DS-GA 1004: Big Data)
## Key Topics
– Locality-sensitive hashing (LSH)
– Bags/multi-sets similarity
– Spatial similarity search (vector database)
– Cosine similarity and LSH
– Graph-based relevance: PageRank
# 1. Locality-Sensitive Hashing (LSH)
## Key Use Cases
– Search: Content relevance via query-document similarity.
– Recommendation: Personalization from feedback.
– Graph algorithms: Relevance from network structure.
# 2. Beyond Simple Sets
## 1) Bags (Multi-Sets)
– Counts matter: how often an item appears, not just whether it appears.
– Ruzicka Similarity: Jaccard-like for bags.
– Expand sets:
– {dog} → {dog1}
– {dog, dog} → {dog1, dog2}
– Formula:
\[
R(A,B) = \frac{\sum_i \min(A[i], B[i])}{\sum_i \max(A[i], B[i])}
\]
Example:
– X = [1, 3, 5, 7]
– Y = [2, 3, 1, 6]
– R(X, Y) = 11 / 17 ≈ 0.65 (verified in the sketch below)
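A minimal Python sketch of this computation (the function name `ruzicka` is ours, not from the lecture):

```python
def ruzicka(a, b):
    """Jaccard-like similarity for bags: sum of minima over sum of maxima."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 0.0

X = [1, 3, 5, 7]
Y = [2, 3, 1, 6]
print(ruzicka(X, Y))  # 11/17 ≈ 0.647
```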
## 2) Spatial Similarity Search (Vector Databases)
– Compare vectors (e.g., document embeddings).
– Suitable metric: Cosine similarity.
Formula:
\[
\cos(\theta) = \frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}
\]
– cos(0°) = 1, cos(90°) = 0, cos(180°) = -1.
Example:
– doc0 = [1, 1]
– doc1 = [3, 5] → cSim = 0.97
– doc2 = [4, 2] → cSim = 0.95
– doc3 = [1, 5] → cSim = 0.83
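A quick NumPy sketch reproducing these numbers (the helper name `cos_sim` is illustrative):

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

doc0 = np.array([1.0, 1.0])
for name, v in [("doc1", [3.0, 5.0]), ("doc2", [4.0, 2.0]), ("doc3", [1.0, 5.0])]:
    print(name, round(cos_sim(doc0, np.array(v)), 2))
# doc1 0.97
# doc2 0.95
# doc3 0.83
```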
# 3. LSH for Cosine Similarity
## Problem
– Comparing a query against all N vectors costs O(Nd); computing all pairwise cosine similarities costs O(N²d).
– Idea: Use LSH to prune search space.
## Charikar's Approach (2002)
– Random hyperplanes partition vector space.
– Collision if two vectors fall on the same side.
– Probability of collision relates to the angle θ.
Mechanism:
– Random vector w.
– Hash:
\[
h_w(x) =
\begin{cases}
+1 & \text{if } w \cdot x \ge 0 \\
-1 & \text{if } w \cdot x < 0
\end{cases}
\]
– Collision probability:
\[
P[\text{collision}] = 1 - \frac{\theta}{\pi}
\]
Not exactly cos(θ), but monotonically decreasing with the angle.
## Signal Amplification
– Use multiple projections.
– Probability of no collision across m projections:
\[
(\theta/\pi)^m
\]
– Probability of at least one collision:
\[
1 - (\theta/\pi)^m
\]
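A NumPy sketch of the whole scheme (the vectors, seed, and helper name are illustrative): hash both vectors with m random hyperplanes, use the fraction of matching signs to estimate 1 - θ/π, and recover the cosine similarity from the estimated angle.

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash_signature(x, W):
    # One sign per hyperplane: which side of each hyperplane x falls on.
    return np.sign(W @ x)

d, m = 2, 1000                    # dimension, number of random projections
W = rng.standard_normal((m, d))   # one random hyperplane normal per row

a = np.array([1.0, 1.0])
b = np.array([3.0, 5.0])

# Fraction of matching signs estimates P[collision] = 1 - theta/pi.
collisions = np.mean(simhash_signature(a, W) == simhash_signature(b, W))
theta_hat = np.pi * (1 - collisions)
print(np.cos(theta_hat))          # ≈ 0.97, the true cosine similarity
```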
# 4. Relevance from Graphs
## PageRank and Web Search
## Early Web Search (Pre-Google)
– Matching query text with page text.
– Problems: Spam attacks (keyword stuffing).
## PageRank Insight
– Structure of the web = trust signal.
– Links from trusted pages matter.
## Random Surfer Model
– Web = directed graph (nodes = pages, edges = links).
– Model user as random walk:
\[
P[v \mid u] = \frac{\mathbf{1}[u \to v]}{\text{out-degree}(u)}
\]
– Pages with high in-degree attract more traffic.
# 5. Markov Chains
## Basics
– M[v, u] = transition probability from u to v.
– M is a stochastic matrix (columns sum to 1).
– p = probability vector over pages.
– One step:
\[
p' = M p
\]
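A NumPy sketch of these basics on a made-up 4-page link graph: build the column-stochastic M, then take one step of the chain.

```python
import numpy as np

# Toy link graph (invented for illustration): u -> pages that u links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
N = len(links)

# M[v, u] = 1/out-degree(u) if u links to v, else 0.
M = np.zeros((N, N))
for u, outs in links.items():
    for v in outs:
        M[v, u] = 1 / len(outs)

print(M.sum(axis=0))    # every column sums to 1: M is column-stochastic

p = np.full(N, 1 / N)   # uniform probability vector over pages
print(M @ p)            # one step of the chain: p' = Mp
```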
## Steady-State
– Solve:
\[
p = Mp
\]
– p = eigenvector of M with eigenvalue 1.
## Example: Relationships/Dating Dynamics
– The dynamics are captured by a Markov transition matrix.
– The system stabilizes at the dominant eigenvector (see the sketch below).
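A small sketch of the steady-state computation; the 2-state matrix is invented for illustration, not the matrix used in lecture.

```python
import numpy as np

M = np.array([[0.9, 0.2],     # hypothetical 2-state transition matrix
              [0.1, 0.8]])    # (columns sum to 1)

vals, vecs = np.linalg.eig(M)
k = np.argmin(np.abs(vals - 1.0))   # pick the eigenvalue closest to 1
p = np.real(vecs[:, k])
p /= p.sum()                        # normalize into a probability vector
print(p)        # steady state: [2/3, 1/3]
print(M @ p)    # unchanged by one more step: p = Mp
```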
# 6. Power Iteration
## Computation
– p0 = uniform distribution.
– Repeatedly apply p ← M p.
– Converges to steady state.
Small Example:
– After a few iterations, p stabilizes at the steady-state vector (see the sketch below).
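A sketch of power iteration on the toy 4-page matrix from the earlier sketch:

```python
import numpy as np

M = np.array([[0.0, 0.0, 1.0, 0.5],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.0]])

p = np.full(4, 0.25)          # p0: uniform distribution
for _ in range(200):          # repeatedly apply p <- Mp
    p_next = M @ p
    if np.allclose(p_next, p, atol=1e-12):
        break                 # stop once p no longer changes
    p = p_next
print(p)   # -> [0.4, 0.2, 0.4, 0.0]; page 3 has no in-links, so it drains
```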
# 7. Practical Considerations
## Requirements
– Graph must be connected.
– No “sinks” (nodes without outgoing links).
– Fix: Add self-loops.
## Spider Traps
– Groups of nodes with no escape trap the random walk's probability mass.
– Fix: Teleportation (Random Restart).
Teleportation Adjustment:
\[
M[v, u] \to a \times M[v, u] + (1-a) \times \frac{1}{N}
\]
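The adjustment in matrix form (a = 0.85 is a commonly used damping value; M is the toy matrix from the earlier sketches):

```python
import numpy as np

a, N = 0.85, 4
M = np.array([[0.0, 0.0, 1.0, 0.5],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.0]])

# With prob. a follow a link; with prob. 1-a teleport to a random page.
G = a * M + (1 - a) / N
print(G.sum(axis=0))   # columns still sum to 1
print(G.min())         # every transition now has positive probability
```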
# 8. PageRank Applications
## Using PageRank
– Rank pages based on network topology.
– Use PageRank to reorder search results.
## Personalizing PageRank
– Uniform teleportation is unrealistic.
– Replace uniform jump with personalized distribution q.
Personalized PageRank iteration:
\[
p = a \times Mp + (1-a) \times q
\]
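The same iteration in code, with an invented preference vector q that mostly restarts at page 0:

```python
import numpy as np

a = 0.85
M = np.array([[0.0, 0.0, 1.0, 0.5],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.0]])
q = np.array([0.7, 0.1, 0.1, 0.1])   # personalized teleport distribution

p = np.full(4, 0.25)
for _ in range(100):                 # fixed-point iteration p = aMp + (1-a)q
    p = a * (M @ p) + (1 - a) * q
print(p)   # mass is biased toward page 0 and the pages it links to
```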
# 9. Distributed PageRank with Spark
– Matrix multiplication via MapReduce/Spark.
– Use GraphFrames API for easier graph handling.
Official resource: GraphFrames Documentation
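A minimal GraphFrames sketch (assumes an existing SparkSession `spark` and the graphframes package installed; the tiny graph is invented):

```python
from graphframes import GraphFrame

# GraphFrames expects a vertex DataFrame with an "id" column and an edge
# DataFrame with "src" and "dst" columns.
vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# resetProbability corresponds to 1 - a, the teleportation probability.
results = g.pageRank(resetProbability=0.15, maxIter=10)
results.vertices.select("id", "pagerank").show()
```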
# Coming Up Next
– Socio-cultural impact of Big Data
– Filter bubbles
– Polarization
– Differential privacy
– The future of Big Data
# Summary
– LSH helps with large-scale similarity search.
– Cosine similarity is the cosine of the angle between vectors.
– PageRank models the web as a Markov chain.
– Power iteration approximates the stationary distribution.
– Teleportation solves real-world web graph issues.
– Personalized PageRank adapts to user preferences.
– PageRank can be distributed using Spark.