Data Structuring, Acquisition, Cleaning, and ETL Cheatsheet

Module 1 Cheatsheet — Foundations, Acquisition, DOM/XPath, Structuring


1) What “Big Data” means:


No single size cutoff. Think:
Volume (lots of rows/large objects), Variety (heterogeneous), Velocity (fast-changing), Veracity (quality).
Implication: choose tools/storage/pipelines that match data scale and shape.

2) Three canonical encodings:


Tables / Relations / DataFrames

Heterogeneous columns; ideal for SQL/Pandas joins, filters, aggregations.

Arrays / Matrices / Tensors

Uniform numeric cells; used for vectorized math and ML.

Nested (JSON/XML)

Flexible schemas (dicts/lists); natural for web/APIs; can be flattened to tables.

When to pick: Tables for stable schemas & heavy joins;
Nested for evolving, sparse fields;
Arrays for numerical ML.

3) End-to-end structuring pipeline:


Acquisition → Wrangling → Integration → Cleaning → (Analysis later)

Be able to justify the ordering and its cost/error impact.

Example “three building blocks” you might assemble:
company_data_df (from CSV categories)
company_ceos_df (from an HTML table)
execs_df (from profile pages via XPath birthdays)

4) Acquisition patterns & tools

Structured files


urllib.request + pandas.read_csv(...). Apply projection (keep only the needed columns) as early as possible.

HTML tables

pandas.read_html(url)[0] to grab the first table quickly.

DBMS snapshots

Store intermediate/cleaned tables in SQLite/Postgres; query back later for reproducibility.

Why persist snapshots?

Re-runs are cheaper (no re-scrape/re-parse), and you keep an auditable, reproducible “cleaned input” state.
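The acquire → project → persist pattern above can be sketched end to end. This is a minimal illustration: the in-memory CSV and the `company_ceos` table name are hypothetical stand-ins for a real URL fetch and a real database file.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical in-memory CSV standing in for urllib.request.urlopen(url).
raw = io.StringIO("Company,CEO,Revenue,HQ\nAcme,J. Doe,10,NYC\nGlobex,H. Smith,20,LA\n")

# Project early: keep only the columns the later joins need.
company_data_df = pd.read_csv(raw, usecols=["Company", "CEO"])

# Persist a snapshot so re-runs skip the fetch/parse step.
con = sqlite3.connect(":memory:")  # a real pipeline would use a file path
company_data_df.to_sql("company_ceos", con, index=False, if_exists="replace")
snapshot = pd.read_sql("SELECT * FROM company_ceos", con)
print(snapshot.equals(company_data_df))  # True
```

Reading the snapshot back and comparing it to the original is a cheap sanity check that the persisted state is a faithful, reproducible input for later steps.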

5) Web data theory: DOM & XPath

HTML/XML parses into a DOM tree (hierarchical nodes).
XPath is a concise path/query language over the DOM; returns ordered node sets.

XPath essentials

// = search anywhere; / = from root; [] = filters (predicates); @attr = attribute access; contains(@class, "vcard") = substring match on attributes.

Robust patterns for birthdays (examples):

//span[@class="bday"]/text()
//table[contains(@class,"vcard")]//span[@class="bday"]/text()

Why XPath (not regex)?

It respects tree structure and attributes, surviving many layout changes; regex over raw HTML is brittle. Avoid brittle absolute XPaths like /html/body/div[3]/…; prefer semantic predicates (@class, contains).
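A small illustration of attribute-based selection on a DOM tree. The profile snippet is hypothetical; note that the stdlib `xml.etree.ElementTree` supports only an XPath subset (attribute predicates, but not `contains()`), so full patterns like the ones above need a library such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile snippet; a real page would be fetched and parsed first.
doc = ET.fromstring(
    '<table class="infobox vcard"><tr><td>'
    '<span class="bday">1955-10-28</span>'
    '</td></tr></table>'
)

# Semantic predicate on @class: survives layout shuffles that break
# absolute paths like /html/body/div[3]/...
bday = [el.text for el in doc.iterfind('.//span[@class="bday"]')]
print(bday)  # ['1955-10-28']
```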

6) Completing acquisition: building execs_df

Concept: Map executive → profile URL → fetch HTML → parse DOM → run XPath → extract DOB → collect rows into execs_df(Executive, birth_date, profile_url). Now you have three coherent inputs to integrate.

7) Wrangling choices you must justify

Project & filter early


Reduce I/O and memory; shrink inputs before joins.

Normalize keys before joining

Trim whitespace, casefold, and apply Unicode normalization (e.g., NFKC) to reduce mismatches.
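The trim + casefold + NFKC recipe fits in one helper. A minimal sketch (the function name `normalize_key` is illustrative):

```python
import unicodedata

def normalize_key(s: str) -> str:
    """NFKC-normalize, trim, and casefold a join key (minimal sketch)."""
    return unicodedata.normalize("NFKC", s).strip().casefold()

# Fullwidth letters and non-breaking spaces collapse to a canonical form.
print(normalize_key("  Ａcme\u00a0"))  # acme
```

Applying the same helper to both sides of a join before merging removes a large class of silent mismatches.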

Persist snapshots

Makes steps idempotent (re-running on the same inputs yields the same outputs) and faster on repeat.

8) Structuring trade-offs (tables vs nested vs arrays)

Tables


Stable schema, heavy relational ops, leverage indexes/optimizers.

Nested

Evolving, sparse optional fields; store raw JSON, then flatten views for analytics.

Arrays

Numeric features for ML; not ideal for heterogeneous attributes.

“Given evolving nested event logs, what store + why?”

Answer: Ingest as JSON (document store), keep raw, expose relational views (e.g., daily active users, funnels) for stable analytics.

9) Acquisition pitfalls & defenses

Fragile selectors → use robust, scoped XPath with semantic predicates.
Network/markup variance → timeouts, retries/backoff, try/except, accept None gracefully.
Mismatched keys → normalize early; after integration, generate an unmatched report (e.g., left join + indicator) for manual fixes.

10) What to memorize (rapid-fire prompts)

Two robust XPath patterns for DOB and why they’re robust.
The three building blocks and how each was acquired (CSV, HTML table, XPath on profiles).
The three encodings and a one-liner use case for each.
A one-sentence reason to persist snapshots (cost + reproducibility) and a tool (SQLite/Postgres).
The pipeline order and why project/filter early.

Tiny exemplar phrasings

CSV acquisition:

“Fetch with urllib, parse with pandas.read_csv.”

HTML table:

“pandas.read_html(url)[0] grabs the first table.”

XPath essence:

“//table[contains(@class,"vcard")]//span[@class="bday"]/text() gets birthday text inside a vcard.”

DOM vs regex:

“DOM+XPath is tree-aware and robust; regex over HTML is brittle.”

Module 2 Cheatsheet — Data Cleaning, Validation, Record Linking, ETL, Aggregation

1) Data quality & cleaning (concepts you must name)


Dimensions:

accuracy, completeness, consistency, timeliness, validity, uniqueness.

Common issues:

missing values, inconsistent encodings (case/whitespace/Unicode), typos, duplicates, outliers, malformed URLs/emails, mixed units/timezones.

Normalization checklist:

trim, casefold, Unicode normalize (NFKC), canonicalize whitespace, standardize date/time/units, map categorical synonyms, dedupe keys.

Why cleaning ≠ validation:

Validation detects problems (rules, reference lists, cross-table checks). Cleaning decides fixes (transform/drop/impute) and must be documented (can bias data).

2) Validation

Pattern rules:


regex for emails/URLs/phone; domain-specific formats (e.g., ZIP, SSN where allowed).

Reference rules:

lookup against ISO lists, country/state codes, product catalogs.

Cross-field rules:

birth_date ≤ hire_date; start_time < end_time; state matches country.

Cross-table rules:

foreign keys present in dimension tables.

Statistical sanity:

range checks, z-score/IQR outlier flags (flag, don’t auto-drop).

Row-level status:

valid, invalid, suspect; keep counts for audit.

Diagnostics you should produce:

Invalid-value report (by column + reason), null-rate summary, outlier counts, referential-integrity violations, duplicate keys.
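A row-level status column plus a counts summary can be produced in a few lines. A minimal sketch: the email regex is deliberately simple and illustrative, not a full RFC validator, and the data is hypothetical.

```python
import re

import pandas as pd

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # illustrative pattern rule

def email_status(v):
    """Tag each value as valid / invalid / missing; never drop rows here."""
    if pd.isna(v):
        return "missing"
    return "valid" if EMAIL_RE.match(v) else "invalid"

df = pd.DataFrame({"email": ["a@x.com", "not-an-email", None]})
df["email_status"] = df["email"].map(email_status)

# Validation summary: counts by status, kept for audit.
report = df["email_status"].value_counts().to_dict()
print(report)  # counts for valid / invalid / missing (key order may vary)
```

Invalid rows stay in the frame with their status tag; the fix/drop decision is a separate, documented cleaning step.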

3) Record linking / entity resolution (theory you’ll be asked)


Goal:

decide whether two records refer to the same real-world entity when exact keys don’t match.

Similarity functions (know at least 3):

Jaccard (sets):

J(A,B) = |A∩B| / |A∪B| (use on token sets, q-grams).

Edit distance (Levenshtein):

min edits to transform string A → B; often use a normalized version.

Jaro/Jaro–Winkler:

good for short names.
Cosine on TF-IDF/q-gram vectors for longer fields.

Features to compare:

name, address parts, email/phone, date of birth, domain (for emails), city/state/country.

Blocking (for scalability):

restrict candidate pairs before scoring. Keys: first letter of last name; Soundex/Metaphone; email domain; ZIP prefix; q-gram prefix; canopy clustering (cheap similarity, e.g., TF-IDF with loose threshold).

Decision:

threshold on a weighted similarity or a classifier; include review zone (manual) near boundary.

Metrics:

precision/recall/F1 on labeled pairs; ROC/PR curves for threshold choice.

Dedup vs linking:

within one table vs across multiple sources.

Pitfalls:

transposed names, nicknames, accent marks, multi-byte whitespace, apartment/unit fields, PO boxes vs street addresses, international address formats.

4) ETL (Extract → Transform → Load) & workflows

Extract:


files/APIs/DBs; parameterize dates/paths; idempotent downloads; checksum/log sizes.

Transform:

cleaning, normalization, type casting, unit conversion, validation tagging, enrichment, record linking.

Load:

write to DB/warehouse/object storage;
partition by date/key; ensure atomic writes (temp → rename).

Reproducibility:

deterministic steps, seeds set, versioned code/config, environment pinned.

Idempotence:

re-running the job on the same inputs yields the same outputs (important for retries).

Orchestration:

schedule, backfill windows, retries with exponential backoff, dead-letter queues.
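Retries with exponential backoff can be wrapped around any extract step. A minimal sketch: `with_retries` and `flaky_fetch` are hypothetical names, and the flaky callable stands in for a real network fetch.

```python
import random
import time

def with_retries(task, attempts=4, base=0.01):
    """Retry a callable with exponential backoff plus jitter (sketch)."""
    for i in range(attempts):
        try:
            return task()
        except Exception:
            if i == attempts - 1:
                raise  # in an orchestrator, this is where dead-lettering happens
            time.sleep(base * (2 ** i) * (1 + random.random()))

calls = {"n": 0}

def flaky_fetch():
    """Stand-in for a network call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "payload"

print(with_retries(flaky_fetch))  # payload
```

The jitter term spreads out retries from concurrent jobs so they do not hammer a recovering service in lockstep.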

Lineage & logging:

source → staging → curated; record run IDs, input counts, invalid counts, output counts.

Schema evolution:

additive changes preferred; use views to provide stable analytical schema.

5) Columnwise ops & row filtering

Project early:


keep only needed columns to reduce I/O and memory before joins.

Vectorized transforms:

prefer column ops or apply (Pandas) / SQL expressions over Python loops.

Row filtering semantics:

Pandas: boolean masks; beware NaN in comparisons (evaluates as False, so those rows are silently dropped).
SQL: WHERE filters before grouping; HAVING filters after GROUP BY.

Examples (conceptual):

Normalize names: trim, casefold, Unicode normalize. Validate URLs/emails: apply a validator; produce boolean/enum status. Derived columns: parse dates → date, year, month, week.

6) Joins (theory & gotchas)


Types:

inner, left outer, right outer, full outer, semi, anti.

When to use:
Left outer to keep all rows from the left and bring attributes when present.
Inner to enforce matching keys only.
Anti to find non-matches (for a diagnostics/unmatched report).

Key prep:

standardize/normalize join keys before the join.

Diagnostics:

include a merge indicator (both, left_only, right_only) or use anti-joins to list orphan rows.
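The merge indicator in pandas makes the unmatched report a one-liner. A minimal sketch with hypothetical fact/dimension frames:

```python
import pandas as pd

fact = pd.DataFrame({"company": ["acme", "globex", "initech"]})
dim = pd.DataFrame({"company": ["acme", "globex"], "ceo": ["J. Doe", "H. Smith"]})

# indicator=True adds a _merge column: both / left_only / right_only.
merged = fact.merge(dim, on="company", how="left", indicator=True)

# left_only rows are the orphans — export these for remediation.
unmatched = merged.loc[merged["_merge"] == "left_only", "company"].tolist()
print(unmatched)  # ['initech']
```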

Join planning:

push selection/projection before joins; join smaller tables early to reduce intermediate size; index selective predicates (SQL).

7) Aggregation for analytics

Group & aggregate:
count/sum/avg/min/max; distinct counts; multi-agg with named outputs.

Null handling:

SQL aggregates skip NULLs (except COUNT(*)); Pandas skips NaN by default (skipna=True).

Bin/derive features:

decade of birth, price buckets, day/week/month, rolling windows (if allowed).

Post-aggregation filters:

use HAVING in SQL to filter on aggregated values.
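The pandas equivalent of HAVING is filtering the grouped result after aggregation. A minimal sketch with hypothetical sales data:

```python
import pandas as pd

sales = pd.DataFrame({"region": ["E", "E", "W"], "amt": [5, 7, 3]})

# SQL: SELECT region, SUM(amt) FROM sales GROUP BY region HAVING SUM(amt) > 6
totals = sales.groupby("region")["amt"].sum()
print(totals[totals > 6].to_dict())  # {'E': 12}
```

Filtering `sales` before the groupby would be WHERE semantics; masking `totals` afterward is HAVING semantics.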

Approximate techniques (optional mentions):

HyperLogLog for approximate distinct counts; sampling for sanity checks.

8) Practical patterns you can quote

A) Validation (pattern)

Rule set per column with status and reason fields.
Produce a validation summary table with counts by (column, status, reason).
Keep invalid rows but mark them; decide later whether to fix or exclude.

B) Unmatched report after joins

Do a left join from fact → dimension; rows with nulls on the right are unmatched;
Export that list for remediation.

C) Record linking (lightweight baseline)

Normalize names (NFKC, trim, casefold). Block by key (e.g., email domain or ZIP prefix). Within blocks, compute Jaccard on q-grams (q=3–5)
or Jaro–Winkler; accept ≥ τ, review the τ±ε band.
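The blocking step of this baseline can be sketched with a dictionary keyed on the blocking key. The records here are hypothetical; only pairs sharing an email domain are ever scored.

```python
from collections import defaultdict

# Hypothetical records; block by email domain to restrict candidate pairs.
records = [
    ("a@acme.com", "Jon Doe"),
    ("b@acme.com", "John Doe"),
    ("c@globex.com", "Jane Roe"),
]

blocks = defaultdict(list)
for email, name in records:
    blocks[email.split("@")[1]].append(name)

# Only score pairs within a block: 1 comparison instead of 3.
pairs = [(x, y)
         for names in blocks.values()
         for i, x in enumerate(names)
         for y in names[i + 1:]]
print(pairs)  # [('Jon Doe', 'John Doe')]
```

Each surviving pair would then be scored with Jaccard-on-q-grams or Jaro–Winkler and compared against the threshold τ.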

D) ETL idempotence

Write outputs to a temp path/table, validate counts/checksums, then atomic rename/swap to final location.
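The temp-then-rename trick in Python uses `os.replace`, which is atomic within a filesystem. A minimal sketch (the `atomic_write` helper name is illustrative):

```python
import os
import tempfile

def atomic_write(path: str, data: str) -> None:
    """Write to a temp file in the target directory, then atomically swap it in."""
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp, path)  # atomic on POSIX: readers see old or new, never partial

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "out.csv")
    atomic_write(target, "a,b\n1,2\n")
    atomic_write(target, "a,b\n1,2\n")  # idempotent re-run: same output, no leftovers
    with open(target) as f:
        content = f.read()
print(content)
```

Creating the temp file in the same directory as the target matters: `os.replace` across filesystems is not atomic.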

9) Common exam prompts & crisp answers

Q: Validation vs cleaning?


A: Validation detects and labels issues; cleaning changes data (transform/drop/impute). Always log both.

Q: Why project early?

A: Cuts I/O and memory footprint; shrinks join/agg inputs; fewer bugs on unused columns.

Q: When left vs inner join?

A: Left keeps all left rows (good for audits/unmatched); inner keeps only exact matches.

Q: Define Jaccard and where to use it

A: J(A,B) = |A∩B| / |A∪B|; use on token sets or q-grams for name/address similarity.

Q: What is blocking in record linkage?

A: A prefilter that limits candidate pairs (e.g., same ZIP prefix), drastically reducing comparisons.

Q: WHERE vs HAVING?

A: WHERE filters before grouping; HAVING filters after grouping on aggregate results.

Q: What makes an ETL step idempotent?

A: Re-running on identical inputs yields identical outputs (no duplicates/side-effects).

Q: Top 3 join pitfalls?

A: Dirty keys (trim/Unicode), many-to-many explosions, silent null mismatches (fix with indicators/anti-joins).