Churn Prediction with Logistic Regression, Choice Models & NLP

⚙️ 1️⃣ Logistic Regression — Churn Prediction

Predict P(churn = 1) via the logit link: ln(p/(1−p)) = β₀ + β₁x₁ + … + βₖxₖ

| Metric | Meaning | Trade-off / Exam Tip |
| --- | --- | --- |
| Precision = TP/(TP+FP) | How accurate are my churn flags? | Higher precision → fewer false alarms |
| Recall = TP/(TP+FN) | How many real churners are caught? | Higher recall → catch more churners |
| AUC | Probability the model ranks a churner higher than a non-churner | 0.5 = random, 1 = perfect |
| ROC Curve | TPR (Recall) vs FPR (1 − Specificity) | Curves that bow toward the top-left are better |
| Threshold | 0.3 → higher recall, lower precision; 0.7 → lower recall, higher precision | Choose threshold using business context |
  • FN = missed churner (big loss) → prefer high recall.
  • β > 0 → increased log‑odds → increased churn probability.
  • exp(β) = odds ratio.
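
A minimal scikit-learn sketch of the ideas above. The data come from make_classification as a synthetic stand-in for a real churn table, and the 0.3 / 0.5 / 0.7 thresholds are only illustrative:

```python
# Minimal sketch: logistic regression churn model, precision / recall / AUC,
# and a business-driven decision threshold. Data are synthetic stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced "churn" data (~20% churners)
X, y = make_classification(n_samples=2000, n_features=5, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("odds ratios exp(beta):", np.exp(model.coef_[0]).round(2))

p_churn = model.predict_proba(X_test)[:, 1]          # P(churn = 1)
print("AUC:", roc_auc_score(y_test, p_churn).round(3))

for t in (0.3, 0.5, 0.7):                            # lower t -> higher recall, lower precision
    flag = (p_churn >= t).astype(int)
    print(f"t={t}  precision={precision_score(y_test, flag):.2f}  "
          f"recall={recall_score(y_test, flag):.2f}")
```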

🚗 2️⃣ Choice Model (MNL) — Long Format Structure

Rows = individual × alternative.
Example: 151 people × 3 routes = 453 rows.

| Type | Example | Include Rule | Meaning |
| --- | --- | --- | --- |
| ASC | asc_rural, asc_freeway | Drop base (arterial) | Alternative-specific constant: baseline bias vs base |
| ASV | dist_arter, dist_rural, dist_freew | Keep all | Route attributes |
| ISV | male_rural, male_freew | Drop base (male_arter) | Individual trait × route effect |

Utility functions:
U_arterial = 0
U_rural = ASC_rural + β1·dist_rural + β2·vehage_rural + β3·male_rural
U_freew = ASC_freew + β4·dist_freew + β5·vehage_freew + β6·male_freew

P_j = exp(U_j) / Σ_k exp(U_k)

Interpretation:
ASC > 0 → preferred versus base; β < 0 (dist) → dislike long routes. Drop the base for ASC & ISV to avoid collinearity.
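
A minimal numpy sketch of these utilities and the choice probability formula for a single traveller. All coefficient values and attribute levels below are invented for illustration only:

```python
# Minimal sketch: MNL choice probabilities for one traveller.
# All coefficient values and attribute levels are invented for illustration.
import numpy as np

# Illustrative "estimates" (arterial is the base: ASC = 0, no ISV terms)
asc_rural, asc_freew = 0.4, 1.1
b1, b4 = -0.3, -0.25          # dist coefficients (rural, freeway)
b2, b5 = 0.05, 0.10           # vehage × route interactions
b3, b6 = 0.20, 0.35           # male × route interactions

# One traveller's data
dist_rural, dist_freew = 12.0, 8.0
vehage, male = 6.0, 1.0

U_arterial = 0.0
U_rural = asc_rural + b1 * dist_rural + b2 * vehage + b3 * male
U_freew = asc_freew + b4 * dist_freew + b5 * vehage + b6 * male

# P_j = exp(U_j) / sum_k exp(U_k)
U = np.array([U_arterial, U_rural, U_freew])
P = np.exp(U) / np.exp(U).sum()
print(dict(zip(["arterial", "rural", "freeway"], P.round(3))))
```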

✅ Exam Checklist:
☑ Include ASCs (non‑base) ☑ Keep all ASVs ☑ Interact ISVs ☑ Drop 1 baseline ☑ Rows = ind × alts.


🔤 3️⃣ Tokenization — Regex vs spaCy

| Pattern / Tool | Keeps | Loses | Use for |
| --- | --- | --- | --- |
| \w+ | Letters, digits, underscore | Splits “don’t” → “don”, “t” | Keep numbers (e.g., prices, IDs) |
| [A-Za-z']+ | Letters, apostrophe | Drops numbers | Keep contractions (don’t, I’m) |
| spaCy | Tokens + POS + lemma | Slower, heavier than regex | Linguistic pipeline |
| split() | Whitespace-separated chunks | Punctuation stays attached to words | Quick, simple use |

Trade‑offs:
Regex = rule‑based and fast ⚡ (no POS or lemma).
spaCy = linguistically aware 🧠 (gives tokenization, POS, and lemma in one call).
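
A minimal sketch contrasting the regex patterns, split(), and spaCy on one sentence; it assumes the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm):

```python
# Minimal sketch: regex vs spaCy tokenization of the same sentence.
import re
import spacy

text = "I don't like the 2024 plan, it costs $59."

print(re.findall(r"\w+", text))        # keeps digits; splits "don't" -> "don", "t"
print(re.findall(r"[A-Za-z']+", text)) # keeps "don't"; drops 2024 and 59
print(text.split())                    # whitespace only; punctuation stays attached

nlp = spacy.load("en_core_web_sm")     # assumes the small English model is installed
for token in nlp(text):
    print(token.text, token.pos_, token.lemma_)  # token + POS + lemma in one pass
```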


🔠 4️⃣ Lemma vs Stem

| Feature | Stemming | Lemmatization |
| --- | --- | --- |
| Rule | Cuts suffix (heuristic) | Uses dictionary + POS |
| POS-aware | No | Yes |
| Example | “running” → “run” | “running” → “run” |
|  | “better” → “better” | “better” → “good” |
| Speed | Fast | Slower |
| Accuracy | Low | High |
| Use for | Quick counting | Model inputs (TF-IDF, sentiment) |

Purpose: unify word variants → smaller vocabulary → better generalization.
Not compulsory but recommended for TF‑IDF and ML models.
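
A minimal NLTK sketch of the contrast (Porter stemmer vs WordNet lemmatizer); it assumes the WordNet data has already been fetched with nltk.download("wordnet"):

```python
# Minimal sketch: Porter stemming vs WordNet lemmatization (NLTK).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                   # 'run'    (suffix chopped by rule)
print(stemmer.stem("better"))                    # 'better' (no rule applies)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'    (needs the verb POS)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'   (dictionary lookup)
```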


🧩 5️⃣ TF‑IDF — Formulae & Behavior

TF: term frequency in a document
Raw = count. Sublinear = 1 + log(TF).

IDF: inverse document frequency
= log((1 + N) / (1 + df)) + 1, where N = number of documents and df = number of documents containing the term.

TF‑IDF = TF × IDF

| Concept | Meaning |
| --- | --- |
| High TF, High IDF | Distinctive keyword |
| High TF, Low IDF | Common word (reduced weight) |
| Sublinear TF | Reduces the gap between 100 and 10 occurrences (repetition ≠ 10× meaning) |
| Normalization | Makes long and short documents comparable (L2 norm) |
| POS filter before TF-IDF | Keep NOUN/VERB/ADJ → less noise |
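
A minimal scikit-learn sketch wiring these settings into TfidfVectorizer; the three documents are invented:

```python
# Minimal sketch: TF-IDF with the settings described above
# (sublinear TF, smoothed IDF, L2 normalization).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "customer churn churn churn risk",
    "customer support ticket",
    "pricing plan for customer",
]

vec = TfidfVectorizer(
    sublinear_tf=True,   # TF -> 1 + log(TF)
    smooth_idf=True,     # IDF = log((1 + N) / (1 + df)) + 1
    norm="l2",           # rows scaled to unit length -> long/short docs comparable
)
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
print(X.toarray().round(2))  # "churn" = high TF, high IDF; "customer" is down-weighted
```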

📊 6️⃣ Long Format & Interactions (Choice Model)

Structure: each person has multiple rows (one per alternative).

  • choice = 1 if that alternative is chosen.
  • ids link rows for the same person.
  • alt = arterial / rural / freeway.

Interactions: vehage_freew = vehage × (alt == freeway)
Creates within‑person variation → required for ISVs.

✅ Drop base alternative for ASC & ISV (arterial).
✅ Keep all ASVs.
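
A minimal pandas sketch of the long-format layout and one ISV interaction; the ids, distances, and vehicle ages are invented:

```python
# Minimal sketch: long-format rows (one per person × alternative) and an ISV interaction.
import pandas as pd

long = pd.DataFrame({
    "id":     [1, 1, 1, 2, 2, 2],                  # links the rows for the same person
    "alt":    ["arterial", "rural", "freeway"] * 2,
    "choice": [0, 0, 1, 1, 0, 0],                  # 1 on the chosen alternative
    "dist":   [10.0, 12.0, 8.0, 6.0, 9.0, 7.0],    # ASV: varies across alternatives
    "vehage": [6, 6, 6, 2, 2, 2],                  # ISV: constant within a person
})

# Interactions give ISVs within-person variation (arterial base is dropped)
long["vehage_rural"] = long["vehage"] * (long["alt"] == "rural").astype(int)
long["vehage_freew"] = long["vehage"] * (long["alt"] == "freeway").astype(int)
print(long)
```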


📈 7️⃣ ROC / Precision–Recall Quick Block

  • ROC: TPR vs FPR. AUC = model ranking ability.
  • PR curve: Precision vs Recall — best for imbalanced data.
  • High recall → catch more positives. High precision → few false alarms.
  • Lower threshold → higher recall, lower precision.
  • In churn (cost of missed churners high) → prefer recall.
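
A minimal sketch computing both curves with scikit-learn on synthetic imbalanced data; in practice you would pass your own labels and predicted probabilities:

```python
# Minimal sketch: ROC and precision-recall curves on synthetic imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
p = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, p)                   # ROC: TPR vs FPR
prec, rec, thr = precision_recall_curve(y_te, p)   # PR: clearer view when classes are imbalanced
print("ROC AUC:", round(auc(fpr, tpr), 3))
# Sweeping the threshold down moves along the PR curve toward higher recall, lower precision.
```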

🔍 8️⃣ Regex Pattern Reminders

| Pattern | Matches | Use case |
| --- | --- | --- |
| \w+ | letters + digits + underscore | Keep numbers |
| [A-Za-z']+ | letters + apostrophe | Keep contractions |
| \d+ | digits | Extract numbers |
| [A-Za-z0-9_]+ | letters + digits + underscore | Identifiers |
| [^A-Za-z0-9]+ | non-alphanumeric | Split on punctuation |

Trade‑off:
Preserve apostrophes for English contractions; decide whether numbers matter (years, prices).
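
A minimal sketch applying each pattern from the table to one invented string:

```python
# Minimal sketch: the regex patterns above applied to one string.
import re

s = "Order #4821: don't renew the $59 plan in 2025!"

print(re.findall(r"\w+", s))           # letters + digits + underscore
print(re.findall(r"[A-Za-z']+", s))    # keeps don't; drops the numbers
print(re.findall(r"\d+", s))           # ['4821', '59', '2025']
print(re.split(r"[^A-Za-z0-9]+", s))   # split on runs of punctuation / spaces
```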