
# 📚 Beyond Set Similarity, Spatial Similarity Search, and Graph Algorithms – DS-GA 1004: Big Data


## 🕛 Key Topics
– Locality-sensitive hashing (LSH)
– Bags/multi-sets similarity
– Spatial similarity search (vector database)
– Cosine similarity and LSH
– Graph-based relevance: PageRank


# 1. Locality-Sensitive Hashing (LSH)


## 💡 Key Use Cases
– Search: Content relevance via query-document similarity.
– Recommendation: Personalization from feedback.
– Graph algorithms: Relevance from network structure.


# 2. Beyond Simple Sets


## 1) Bags (Multi-Sets)

– In a bag, how often each item appears matters, not just whether it appears.
– Ruzicka similarity: a Jaccard-like measure for bags.
– Equivalent view, expanding bags into sets by numbering duplicates:
– {dog} → {dog1}
– {dog, dog} → {dog1, dog2}
– Formula:
\[
R(A,B) = \frac{\sum_i \min(A[i], B[i])}{\sum_j \max(A[j], B[j])}
\]

Example:
– X = [1, 3, 5, 7]
– Y = [2, 3, 1, 6]
– R(X, Y) = 11 / 17 ≈ 0.65
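
A minimal Python sketch of the Ruzicka formula, reproducing the numbers above:

```python
def ruzicka(a, b):
    """Ruzicka similarity between two bags given as count vectors."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 1.0

X = [1, 3, 5, 7]
Y = [2, 3, 1, 6]
print(ruzicka(X, Y))  # 11 / 17 ≈ 0.647
```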


## 2) Spatial Similarity Search (Vector Databases)

– Compare vectors (e.g., document embeddings).
– Suitable metric: Cosine similarity.

Formula:
\[
\cos(\theta) = \frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}
\]
– cos(0°) = 1, cos(90°) = 0, cos(180°) = -1.

Example (cSim = cosine similarity to doc0):
– doc0 = [1, 1]
– doc1 = [3, 5] ➜ cSim = 0.97
– doc2 = [4, 2] ➜ cSim = 0.95
– doc3 = [1, 5] ➜ cSim = 0.83
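
A short numpy sketch reproducing these similarities to doc0 (numpy is an implementation choice, not part of the notes):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product over the product of norms."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

doc0 = [1, 1]
for name, doc in [("doc1", [3, 5]), ("doc2", [4, 2]), ("doc3", [1, 5])]:
    print(name, round(cosine_sim(doc0, doc), 2))
# doc1 0.97, doc2 0.95, doc3 0.83
```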


# 3. LSH for Cosine Similarity


## πŸ” Problem
– Computing all pairwise cosine similarities costs O(N²d); even a single query against N documents costs O(Nd).
– Idea: use LSH to prune the search space.

## 🌐 Charikar's Approach (2002)
– Random hyperplanes partition the vector space.
– Two vectors collide if they fall on the same side of a hyperplane.
– The probability of collision depends on the angle θ between the vectors.

Mechanism:
– Random vector w.
– Hash:
– hw(x) = 1 if wβ‹…x β‰₯ 0
– hw(x) = -1 if wβ‹…x < 0
– Collision probability:
\[
P[\text{collision}] = 1 - \frac{\theta}{\pi}
\]

Not exactly cos(θ), but monotonically decreasing in the angle: the closer two vectors are, the more likely they collide.
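
A hedged Python sketch of the one-bit hash; the vectors a and b and the trial count are invented illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)

def hyperplane_hash(x, w):
    # One-bit hash: which side of the hyperplane with normal w is x on?
    return 1 if w @ x >= 0 else -1

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])  # angle between a and b is pi/4
trials = 100_000
collisions = sum(
    hyperplane_hash(a, w) == hyperplane_hash(b, w)
    for w in rng.standard_normal((trials, 2))
)
print(collisions / trials)  # empirically ≈ 1 - (pi/4)/pi = 0.75
```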

## 🔄 Signal Amplification
– Use multiple projections.
– Probability of no collision across m independent projections:
\[
(\theta/\pi)^m
\]
– Probability of at least one collision:
\[
1 - (\theta/\pi)^m
\]
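
For example, with θ = π/3 and m = 5 independent projections (illustrative values):
\[
P[\text{no collision}] = \left(\tfrac{1}{3}\right)^{5} = \tfrac{1}{243}, \qquad
P[\text{at least one}] = 1 - \tfrac{1}{243} \approx 0.996
\]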


# 4. Relevance from Graphs


# PageRank and Web Search


## 📚 Early Web Search (Pre-Google)
– Matching query text with page text.
– Problems: Spam attacks (keyword stuffing).

## 🌀 PageRank Insight
– Structure of the web = trust signal.
– Links from trusted pages matter.

## πŸ– Random Surfer Model
– Web = directed graph (nodes = pages, edges = links).
– Model user as random walk:
\[
P[v \mid u] = \frac{\mathbf{1}[u \to v]}{\text{out-degree}(u)}
\]
– Pages with high in-degree attract more traffic.
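
As a sketch, the transition matrix for a hypothetical three-page web (pages and links invented for illustration):

```python
import numpy as np

# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
links = {0: [1, 2], 1: [2], 2: [0]}
N = 3

# M[v, u] = 1 / out-degree(u) if u links to v, else 0
M = np.zeros((N, N))
for u, outs in links.items():
    for v in outs:
        M[v, u] = 1.0 / len(outs)

print(M)              # column u is the distribution over u's successors
print(M.sum(axis=0))  # each column sums to 1
```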


# 5. Markov Chains


## βš–οΈ Basics
– M[v, u] = transition probability from u to v.
– M is a stochastic matrix (columns sum to 1).
– p = probability vector over pages.
– One step:
\[
p' = M p
\]

## 🧪 Steady-State
– Solve:
\[
p = Mp
\]
– p = eigenvector of M with eigenvalue 1.
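
A numpy sketch extracting that eigenvector for the same hypothetical three-page matrix used above:

```python
import numpy as np

M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])  # hypothetical 3-page web from above

vals, vecs = np.linalg.eig(M)
i = np.argmin(np.abs(vals - 1.0))  # eigenvalue closest to 1
p = np.real(vecs[:, i])
p /= p.sum()                       # rescale to a probability vector
print(p)                           # steady state: p = M p, here [0.4, 0.2, 0.4]
```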

## 📊 Example: Relationships/Dating Dynamics
– Markov transition matrix example.
– System stabilizes at dominant eigenvector.


# 6. Power Iteration


## 💡 Computation
– p0 = uniform distribution.
– Repeatedly apply p ← M p.
– Converges to steady state.

Small example (see the sketch below):
– After enough iterations, p stabilizes at the steady-state vector.
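
A short Python sketch of power iteration on the same hypothetical matrix:

```python
import numpy as np

M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])  # hypothetical 3-page web from above

p = np.full(3, 1 / 3)  # p0 = uniform distribution
for _ in range(100):
    p = M @ p          # one step of the walk: p <- M p
print(p)               # converges to the steady state [0.4, 0.2, 0.4]
```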


# 7. Practical Considerations


## 🔗 Requirements
– The graph must be strongly connected.
– No "sinks" (nodes with no outgoing links).
– Fix: add self-loops to sinks.

## πŸ•΅οΈ Spider Traps
– A group of pages with links into the group but none out traps the random walk, absorbing all the probability mass.
– Fix: Teleportation (Random Restart).

Teleportation adjustment (a is the damping factor, commonly a ≈ 0.85):
\[
M[v, u] \to a \times M[v, u] + (1-a) \times \frac{1}{N}
\]
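
A Python sketch of the adjustment; with_teleportation is a hypothetical helper name:

```python
import numpy as np

def with_teleportation(M, a=0.85):
    """Mix M with uniform jumps: a*M + (1-a)/N (a = damping factor)."""
    N = M.shape[0]
    return a * M + (1 - a) / N * np.ones((N, N))

M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])  # hypothetical 3-page web from above
G = with_teleportation(M)
print(G.sum(axis=0))  # still column-stochastic: every column sums to 1
```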


# 8. PageRank Applications


## πŸ‘οΈ Using PageRank
– Rank pages based on network topology.
– Use PageRank to reorder search results.

## 👀 Personalizing PageRank
– Uniform teleportation is unrealistic: real users do not restart at a uniformly random page.
– Fix: replace the uniform jump with a personalized distribution q.

Personalized PageRank iteration:
\[
p = a \times Mp + (1-a) \times q
\]
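
A sketch of the iteration in Python; q concentrated on page 0 is an invented example of a user's restart distribution:

```python
import numpy as np

def personalized_pagerank(M, q, a=0.85, iters=100):
    """Iterate p <- a*M p + (1-a)*q (minimal sketch)."""
    p = q.copy()
    for _ in range(iters):
        p = a * (M @ p) + (1 - a) * q
    return p

M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])  # hypothetical 3-page web from above
q = np.array([1.0, 0.0, 0.0])   # this user always restarts at page 0
print(personalized_pagerank(M, q))
```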


# 9. Distributed PageRank with Spark


– Power iteration reduces to repeated matrix multiplication, which maps naturally onto MapReduce/Spark.
– The GraphFrames API makes graph handling easier.

Official resource: GraphFrames Documentation
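
A minimal PySpark sketch, assuming an existing SparkSession named spark and the GraphFrames package installed; the tiny graph is invented for illustration:

```python
# Assumes `spark` is an existing SparkSession with GraphFrames available.
from graphframes import GraphFrame

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame(
    [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# resetProbability plays the role of 1 - a (the teleportation probability)
results = g.pageRank(resetProbability=0.15, maxIter=20)
results.vertices.select("id", "pagerank").show()
```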


# 📅 Coming Up Next

– Socio-cultural impact of Big Data
– Filter bubbles
– Polarization
– Differential privacy
– The future of Big Data


# 💡 Summary

– LSH helps with large-scale similarity search.
– Cosine similarity measures the angle between vectors.
– PageRank models the web as a Markov chain.
– Power iteration approximates the stationary distribution.
– Teleportation fixes real-world web-graph issues (sinks, spider traps).
– Personalized PageRank adapts to user preferences.
– PageRank can be distributed using Spark.