Churn Prediction with Logistic Regression, Choice Models & NLP
⚙️ 1️⃣ Logistic Regression — Churn Prediction
Predict P(churn = 1) via the logit link: ln(p/(1−p)) = β0 + β1x1 + … + βkxk
| Metric | Meaning | Trade-off / Exam Tip |
|---|---|---|
| Precision = TP/(TP+FP) | How accurate are my churn flags? | Higher precision → fewer false alarms |
| Recall = TP/(TP+FN) | How many real churners are caught? | Higher recall → catch more churners |
| AUC | Probability the model ranks a churner higher than a non‑churner | 0.5 = random, 1 = perfect |
| ROC Curve | TPR (Recall) vs FPR (1 − Specificity) | Curves that bow toward the top‑left are better |
| Threshold | 0.3 → higher recall, lower precision; 0.7 → lower recall, higher precision | Choose threshold using business context |
- FN = missed churner (big loss) → prefer high recall.
- β > 0 → increased log‑odds → increased churn probability.
- exp(β) = odds ratio.
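A minimal sketch with scikit-learn, assuming synthetic data; the three features are hypothetical placeholders (e.g., tenure, spend, support calls), not from the notes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with a known logit relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
logits = 0.8 * X[:, 0] - 0.5 * X[:, 1]
y = (rng.random(1000) < 1 / (1 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

proba = model.predict_proba(X_te)[:, 1]        # P(churn = 1)
print("AUC:", roc_auc_score(y_te, proba))      # ranking ability
print("odds ratios:", np.exp(model.coef_))     # exp(beta) per feature

for thr in (0.3, 0.5, 0.7):                    # lower threshold -> higher recall
    pred = (proba >= thr).astype(int)
    print(f"thr={thr}  precision={precision_score(y_te, pred, zero_division=0):.2f}"
          f"  recall={recall_score(y_te, pred):.2f}")
```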
🚗 2️⃣ Choice Model (MNL) — Long Format Structure
Rows = individual × alternative.
Example: 151 people × 3 routes = 453 rows.
| Type | Example | Include Rule | Meaning |
|---|---|---|---|
| ASC | asc_rural, asc_freeway | Drop base (arterial) | Alternative specific constant: baseline bias vs base |
| ASV | dist_arter, dist_rural, dist_freew | Keep all | Route attributes |
| ISV | male_rural, male_freew | Drop base (male_arter) | Individual trait × route effect |
Utility functions (base = arterial, normalized to 0):
U_arterial = 0
U_rural = ASC_rural + β1·dist_rural + β2·vehage_rural + β3·male_rural
U_freeway = ASC_freeway + β4·dist_freeway + β5·vehage_freeway + β6·male_freeway
P_j = exp(U_j) / Σ_k exp(U_k)
Interpretation:
ASC > 0 → preferred versus base; β < 0 (dist) → dislike long routes. Drop the base for ASC & ISV to avoid collinearity.
✅ Exam Checklist:
☑ Include ASCs (non‑base) ☑ Keep all ASVs ☑ Interact ISVs ☑ Drop 1 baseline ☑ Rows = ind × alts.
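A toy sketch of the probability formula: the utilities follow the spec above (base arterial normalized to 0), but every coefficient value below is invented for illustration only:

```python
import numpy as np

# Assumed coefficient values, for illustration only
ASC_r, ASC_f = 0.4, 0.9
b1, b2, b3 = -0.10, 0.05, 0.30   # rural:   dist, vehage, male
b4, b5, b6 = -0.08, 0.02, 0.50   # freeway: dist, vehage, male

def route_probs(dist_r, dist_f, vehage, male):
    """MNL choice probabilities: P_j = exp(U_j) / sum_k exp(U_k)."""
    U = np.array([
        0.0,                                            # arterial (base)
        ASC_r + b1 * dist_r + b2 * vehage + b3 * male,  # rural
        ASC_f + b4 * dist_f + b5 * vehage + b6 * male,  # freeway
    ])
    expU = np.exp(U)
    return expU / expU.sum()

print(dict(zip(["arterial", "rural", "freeway"],
               route_probs(dist_r=14, dist_f=8, vehage=5, male=1))))
```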
🔤 3️⃣ Tokenization — Regex vs spaCy
| Pattern / Tool | Keeps | Loses | Use for |
|---|---|---|---|
| \w+ | Letters, digits, underscore | Splits “don’t” → “don”, “t” | Keep numbers (e.g., prices, IDs) |
| [A-Za-z']+ | Letters, apostrophe | Drops numbers | Keep contractions (don’t, I’m) |
| spaCy | Tokens + POS + lemma | — | Linguistic pipeline |
| split() | Whitespace‑separated chunks | Punctuation stays attached (“twice!”) | Quick, simple use |
Trade‑offs:
Regex = rule‑based and fast ⚡ (no POS or lemma).
spaCy = linguistically aware 🧠 (gives tokenization, POS, and lemma in one call).
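A minimal side-by-side run, assuming spaCy and its en_core_web_sm model are installed:

```python
import re
import spacy

text = "Don't pay $5 twice!"

print(re.findall(r"\w+", text))         # ['Don', 't', 'pay', '5', 'twice']
print(re.findall(r"[A-Za-z']+", text))  # ["Don't", 'pay', 'twice']  (drops the 5)

nlp = spacy.load("en_core_web_sm")
for tok in nlp(text):
    print(tok.text, tok.pos_, tok.lemma_)  # token + POS + lemma in one pass
```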
🔠 4️⃣ Lemma vs Stem
| Feature | Stemming | Lemmatization |
|---|---|---|
| Rule | Cuts suffix (heuristic) | Uses dictionary + POS |
| POS‑aware | ❌ | ✅ |
| Example | “running” → “run” | “running” → “run” |
| | “better” → “better” | “better” → “good” |
| Speed | Fast | Slower |
| Accuracy | Low | High |
| Use for | Quick counting | Model inputs (TF‑IDF, sentiment) |
Purpose: unify word variants → smaller vocabulary → better generalization.
Not compulsory but recommended for TF‑IDF and ML models.
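A quick contrast sketch using NLTK's PorterStemmer against spaCy lemmas; both libraries (plus the en_core_web_sm model) are assumed installed, and note that spaCy's lemma for an isolated word depends on the POS it assigns:

```python
from nltk.stem import PorterStemmer
import spacy

stemmer = PorterStemmer()
nlp = spacy.load("en_core_web_sm")

for word in ["running", "better", "studies"]:
    lemma = nlp(word)[0].lemma_   # POS-aware, dictionary-based
    print(word, "| stem:", stemmer.stem(word), "| lemma:", lemma)
# Stemming is heuristic: "studies" -> "studi" (not a real word),
# "better" stays "better"; lemmatization maps to dictionary forms.
```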
🧩 5️⃣ TF‑IDF — Formulae & Behavior
TF: term frequency in a document
Raw = count. Sublinear = 1 + log(TF).
IDF: inverse document frequency
= log((1 + N) / (1 + df)) + 1
TF‑IDF = TF × IDF
| Concept | Meaning |
|---|---|
| High TF, High IDF | Distinctive keyword |
| High TF, Low IDF | Common word (reduced weight) |
| Sublinear TF | Reduces gap between 100 vs 10 (repetition ≠ 10× meaning) |
| Normalization | Makes long and short documents comparable (L2 norm) |
| POS filter before TF‑IDF | Keep NOUN/VERB/ADJ → less noise |
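The same behavior in scikit-learn's TfidfVectorizer, whose default smooth IDF matches the formula above (log((1 + N)/(1 + df)) + 1); the documents are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the service was great great great",
        "the delivery was late",
        "great delivery"]

vec = TfidfVectorizer(sublinear_tf=True,  # TF -> 1 + log(TF)
                      norm="l2")          # length-normalize each document
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
print(X.toarray().round(2))  # common words ("the") get low weight
```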
📊 6️⃣ Long Format & Interactions (Choice Model)
Structure: each person has multiple rows (one per alternative).
- choice = 1 if that alternative is chosen.
- id links rows belonging to the same person.
- alt = arterial / rural / freeway.
Interactions: vehage_freew = vehage × (alt == freeway)
Creates within‑person variation → required for ISVs.
✅ Drop base alternative for ASC & ISV (arterial).
✅ Keep all ASVs.
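A hypothetical long-format frame in pandas showing the interaction construction (2 people × 3 alternatives = 6 rows; all values invented):

```python
import pandas as pd

df = pd.DataFrame({
    "id":     [1, 1, 1, 2, 2, 2],
    "alt":    ["arterial", "rural", "freeway"] * 2,
    "choice": [1, 0, 0, 0, 0, 1],
    "dist":   [10, 14, 8, 12, 9, 15],   # ASV: varies across alternatives
    "vehage": [5, 5, 5, 2, 2, 2],       # ISV: constant within person
})

# ISV x alternative interactions (base 'arterial' gets no column)
df["vehage_rural"] = df["vehage"] * (df["alt"] == "rural")
df["vehage_freew"] = df["vehage"] * (df["alt"] == "freeway")
print(df)
```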
📈 7️⃣ ROC / Precision–Recall Quick Block
- ROC: TPR vs FPR. AUC = model ranking ability.
- PR curve: Precision vs Recall — best for imbalanced data.
- High recall → catch more positives. High precision → few false alarms.
- Lower threshold → higher recall, lower precision.
- In churn (cost of missed churners high) → prefer recall.
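A short sketch computing both curves with scikit-learn; the labels and scores below are toy arrays:

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.65, 0.50, 0.90])

fpr, tpr, _ = roc_curve(y_true, scores)   # TPR vs FPR per threshold
print("ROC AUC:", auc(fpr, tpr))

prec, rec, thr = precision_recall_curve(y_true, scores)
# Walking the threshold down trades precision for recall
for p, r, t in zip(prec, rec, np.append(thr, np.nan)):
    print(f"thr={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```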
🔍 8️⃣ Regex Pattern Reminders
| Pattern | Matches | Use case |
|---|---|---|
| \w+ | letters + digits + underscore | Keep numbers |
| [A-Za-z']+ | letters + apostrophe | Keep contractions |
| \d+ | digits | Extract numbers |
| [A-Za-z0-9_]+ | letters + digits + underscore | Identifiers |
| [^A-Za-z0-9]+ | non‑alphanumeric | Split on punctuation |
Trade‑off:
Preserve apostrophes for English contractions; decide whether numbers matter (years, prices).
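One made-up sample string run through each pattern (the comments show what survives):

```python
import re

text = "Order #42: don't ship 2 items to user_7!"

print(re.findall(r"\w+", text))            # keeps digits and user_7 intact
print(re.findall(r"[A-Za-z']+", text))     # keeps don't; drops 42, 2, 7
print(re.findall(r"\d+", text))            # ['42', '2', '7']
print(re.findall(r"[A-Za-z0-9_]+", text))  # identifiers like 'user_7'
print(re.split(r"[^A-Za-z0-9]+", text))    # splits on punctuation; note the
                                           # underscore also splits here
```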
