
# 📚 Beyond Set Similarity, Spatial Similarity Search, and Graph Algorithms – DS-GA 1004: Big Data


## 🕛 Key Topics
– Locality-sensitive hashing (LSH)
– Bags/multi-sets similarity
– Spatial similarity search (vector database)
– Cosine similarity and LSH
– Graph-based relevance: PageRank


# 1. Locality-Sensitive Hashing (LSH)


## 💡 Key Use Cases
– Search: Content relevance via query-document similarity.
– Recommendation: Personalization from feedback.
– Graph algorithms: Relevance from network structure.


# 2. Beyond Simple Sets


## 1) Bags (Multi-Sets)

– In a bag, how often each item appears matters, not just whether it appears.
– Ruzicka similarity: a Jaccard-like measure for bags.
– Equivalent view, expanding bags into sets by numbering duplicates:
– {dog} → {dog1}
– {dog, dog} → {dog1, dog2}
– Formula:
\[
R(A,B) = \frac{\sum_i \min(A[i], B[i])}{\sum_j \max(A[j], B[j])}
\]

Example:
– X = [1, 3, 5, 7]
– Y = [2, 3, 1, 6]
– R(X, Y) = 11 / 17 ≈ 0.65
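
A minimal Python sketch of the Ruzicka formula, reproducing the numbers above:

```python
def ruzicka(a, b):
    """Ruzicka similarity between two bags given as count vectors."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 1.0

X = [1, 3, 5, 7]
Y = [2, 3, 1, 6]
print(ruzicka(X, Y))  # 11 / 17 ≈ 0.647
```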


## 2) Spatial Similarity Search (Vector Databases)

– Compare vectors (e.g., document embeddings).
– Suitable metric: Cosine similarity.

Formula:
\[
\cos(\theta) = \frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}
\]
– cos(0°) = 1, cos(90°) = 0, cos(180°) = -1.

Example (cSim = cosine similarity to doc0):
– doc0 = [1, 1]
– doc1 = [3, 5] ➜ cSim = 0.97
– doc2 = [4, 2] ➜ cSim = 0.95
– doc3 = [1, 5] ➜ cSim = 0.83
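
A short numpy sketch reproducing these similarities to doc0 (numpy is an implementation choice, not part of the notes):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product over the product of norms."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

doc0 = [1, 1]
for name, doc in [("doc1", [3, 5]), ("doc2", [4, 2]), ("doc3", [1, 5])]:
    print(name, round(cosine_sim(doc0, doc), 2))
# doc1 0.97, doc2 0.95, doc3 0.83
```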


# 3. LSH for Cosine Similarity


## πŸ” Problem
– Computing all pairwise cosine similarities costs O(N²d); even a single query against N documents costs O(Nd).
– Idea: use LSH to prune the search space.

## 🌐 Charikar's Approach (2002)
– Random hyperplanes partition the vector space.
– Two vectors collide if they fall on the same side of a hyperplane.
– The probability of collision depends on the angle θ between the vectors.

Mechanism:
– Random vector w.
– Hash:
– hw(x) = 1 if wβ‹…x β‰₯ 0
– hw(x) = -1 if wβ‹…x < 0
– Collision probability:
\[
P[\text{collision}] = 1 - \frac{\theta}{\pi}
\]

Not exactly cos(θ), but monotonically decreasing in the angle: the closer two vectors are, the more likely they collide.
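
A hedged Python sketch of the one-bit hash; the vectors a and b and the trial count are invented illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)

def hyperplane_hash(x, w):
    # One-bit hash: which side of the hyperplane with normal w is x on?
    return 1 if w @ x >= 0 else -1

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])  # angle between a and b is pi/4
trials = 100_000
collisions = sum(
    hyperplane_hash(a, w) == hyperplane_hash(b, w)
    for w in rng.standard_normal((trials, 2))
)
print(collisions / trials)  # empirically ≈ 1 - (pi/4)/pi = 0.75
```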

## 🔄 Signal Amplification
– Use multiple projections.
– Probability of no collision across m independent projections:
\[
(\theta/\pi)^m
\]
– Probability of at least one collision:
\[
1 - (\theta/\pi)^m
\]
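
For example, with θ = π/3 and m = 5 independent projections (illustrative values):
\[
P[\text{no collision}] = \left(\tfrac{1}{3}\right)^{5} = \tfrac{1}{243}, \qquad
P[\text{at least one}] = 1 - \tfrac{1}{243} \approx 0.996
\]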


# 4. Relevance from Graphs


# PageRank and Web Search


## 📚 Early Web Search (Pre-Google)
– Matching query text with page text.
– Problems: Spam attacks (keyword stuffing).

## 🌀 PageRank Insight
– Structure of the web = trust signal.
– Links from trusted pages matter.

## πŸ– Random Surfer Model
– Web = directed graph (nodes = pages, edges = links).
– Model user as random walk:
\[
P[v \mid u] = \frac{\mathbf{1}[u \to v]}{\text{out-degree}(u)}
\]
– Pages with high in-degree attract more traffic.
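
As a sketch, the transition matrix for a hypothetical three-page web (pages and links invented for illustration):

```python
import numpy as np

# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
links = {0: [1, 2], 1: [2], 2: [0]}
N = 3

# M[v, u] = 1 / out-degree(u) if u links to v, else 0
M = np.zeros((N, N))
for u, outs in links.items():
    for v in outs:
        M[v, u] = 1.0 / len(outs)

print(M)              # column u is the distribution over u's successors
print(M.sum(axis=0))  # each column sums to 1
```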


# 5. Markov Chains


## βš–οΈ Basics
– M[v, u] = transition probability from u to v.
– M is a stochastic matrix (columns sum to 1).
– p = probability vector over pages.
– One step:
\[
p' = M p
\]

## 🧪 Steady-State
– Solve:
\[
p = Mp
\]
– p = eigenvector of M with eigenvalue 1.
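
A numpy sketch extracting that eigenvector for the same hypothetical three-page matrix used above:

```python
import numpy as np

M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])  # hypothetical 3-page web from above

vals, vecs = np.linalg.eig(M)
i = np.argmin(np.abs(vals - 1.0))  # eigenvalue closest to 1
p = np.real(vecs[:, i])
p /= p.sum()                       # rescale to a probability vector
print(p)                           # steady state: p = M p, here [0.4, 0.2, 0.4]
```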

## 📊 Example: Relationships/Dating Dynamics
– Markov transition matrix example.
– System stabilizes at dominant eigenvector.


# 6. Power Iteration


## 💡 Computation
– p0 = uniform distribution.
– Repeatedly apply p ← M p.
– Converges to steady state.

Small example (see the sketch below):
– After enough iterations, p stabilizes at the steady-state vector.
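
A short Python sketch of power iteration on the same hypothetical matrix:

```python
import numpy as np

M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])  # hypothetical 3-page web from above

p = np.full(3, 1 / 3)  # p0 = uniform distribution
for _ in range(100):
    p = M @ p          # one step of the walk: p <- M p
print(p)               # converges to the steady state [0.4, 0.2, 0.4]
```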


# 7. Practical Considerations


## 🔗 Requirements
– The graph must be strongly connected.
– No "sinks" (nodes with no outgoing links).
– Fix: add self-loops to sinks.

## πŸ•΅οΈ Spider Traps
– A group of pages with links into the group but none out traps the random walk, absorbing all the probability mass.
– Fix: Teleportation (Random Restart).

Teleportation adjustment (a is the damping factor, commonly a ≈ 0.85):
\[
M[v, u] \to a \times M[v, u] + (1-a) \times \frac{1}{N}
\]
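
A Python sketch of the adjustment; with_teleportation is a hypothetical helper name:

```python
import numpy as np

def with_teleportation(M, a=0.85):
    """Mix M with uniform jumps: a*M + (1-a)/N (a = damping factor)."""
    N = M.shape[0]
    return a * M + (1 - a) / N * np.ones((N, N))

M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])  # hypothetical 3-page web from above
G = with_teleportation(M)
print(G.sum(axis=0))  # still column-stochastic: every column sums to 1
```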


# 8. PageRank Applications


## πŸ‘οΈ Using PageRank
– Rank pages based on network topology.
– Use PageRank to reorder search results.

## 👀 Personalizing PageRank
– Uniform teleportation is unrealistic: real users do not restart at a uniformly random page.
– Fix: replace the uniform jump with a personalized distribution q.

Personalized PageRank iteration:
\[
p = a \times Mp + (1-a) \times q
\]
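
A sketch of the iteration in Python; q concentrated on page 0 is an invented example of a user's restart distribution:

```python
import numpy as np

def personalized_pagerank(M, q, a=0.85, iters=100):
    """Iterate p <- a*M p + (1-a)*q (minimal sketch)."""
    p = q.copy()
    for _ in range(iters):
        p = a * (M @ p) + (1 - a) * q
    return p

M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])  # hypothetical 3-page web from above
q = np.array([1.0, 0.0, 0.0])   # this user always restarts at page 0
print(personalized_pagerank(M, q))
```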


# 9. Distributed PageRank with Spark


– Power iteration reduces to repeated matrix multiplication, which maps naturally onto MapReduce/Spark.
– The GraphFrames API makes graph handling easier.

Official resource: GraphFrames Documentation
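
A minimal PySpark sketch, assuming an existing SparkSession named spark and the GraphFrames package installed; the tiny graph is invented for illustration:

```python
# Assumes `spark` is an existing SparkSession with GraphFrames available.
from graphframes import GraphFrame

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame(
    [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# resetProbability plays the role of 1 - a (the teleportation probability)
results = g.pageRank(resetProbability=0.15, maxIter=20)
results.vertices.select("id", "pagerank").show()
```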


# 📅 Coming Up Next

– Socio-cultural impact of Big Data
– Filter bubbles
– Polarization
– Differential privacy
– The future of Big Data


# 💡 Summary

– LSH helps with large-scale similarity search.
– Cosine similarity measures the angle between vectors.
– PageRank models the web as a Markov chain.
– Power iteration approximates the stationary distribution.
– Teleportation fixes real-world web-graph issues (sinks, spider traps).
– Personalized PageRank adapts to user preferences.
– PageRank can be distributed using Spark.