Machine Learning & AI Foundations: Definitions, Lifecycle, and Tools
CRISP-ML(Q) Project Lifecycle
- Definition: CRISP-ML(Q) (Cross-Industry Standard Process for Machine Learning with Quality assurance) is a six-phase framework for managing machine learning projects, with a focus on quality at each step.
- Phases and Examples:
Business & Data Understanding
- Definition: Define the business problem and assess available data.
- Example: Goal: Reduce customer churn by 15%. Data: Purchase history, support tickets.
Data Preparation
- Definition: Clean, organize, and transform raw data for modeling.
- Example: Create “age” from “date of birth”; unify country codes like “USA” and “U.S.A.”
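The two transformations above can be sketched in a few lines of Python. This is a minimal, illustrative sketch; the record fields (`dob`, `country`) and the `COUNTRY_CODES` mapping are hypothetical stand-ins for whatever your raw data actually contains.

```python
from datetime import date

# Hypothetical raw customer records; field names are illustrative.
raw = [
    {"dob": date(1990, 6, 15), "country": "U.S.A."},
    {"dob": date(1985, 1, 2), "country": "USA"},
]

# Map common spelling variants to one canonical country code.
COUNTRY_CODES = {"USA": "US", "U.S.A.": "US", "United States": "US"}

def prepare(record, today=date(2024, 1, 1)):
    """Derive an 'age' feature and normalize the country field."""
    # Subtract one year if the birthday has not yet occurred this year.
    age = today.year - record["dob"].year - (
        (today.month, today.day) < (record["dob"].month, record["dob"].day)
    )
    country = COUNTRY_CODES.get(record["country"], record["country"])
    return {"age": age, "country": country}

cleaned = [prepare(r) for r in raw]
print(cleaned)  # both rows now share the country code "US"
```

The same idea scales up in tools like pandas or dbt; the point is that derived features and unified codes are created before modeling, not during it.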
Modeling
- Definition: Select an algorithm and train the machine learning model.
- Example: Train a random forest to predict who might leave the company.
Evaluation
- Definition: Test the accuracy and fairness of the model.
- Example: Target 90% accuracy; check for potential bias.
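Evaluation means more than one overall number. The sketch below, using made-up predictions, computes overall accuracy and then a simple fairness check: accuracy per customer segment, plus the gap between the best- and worst-served groups. The data and segment labels are invented for illustration.

```python
# Toy predictions vs. ground truth for a churn model (illustrative data).
# Each row: (predicted, actual, group) — 'group' is a customer segment.
results = [
    (1, 1, "A"), (0, 0, "A"), (1, 0, "A"), (0, 0, "A"),
    (1, 1, "B"), (0, 1, "B"), (0, 1, "B"), (1, 1, "B"),
]

def accuracy(rows):
    return sum(pred == actual for pred, actual, _ in rows) / len(rows)

overall = accuracy(results)

# Simple fairness check: compare accuracy across segments.
by_group = {}
for g in sorted({g for _, _, g in results}):
    by_group[g] = accuracy([r for r in results if r[2] == g])

# A large gap between groups is a bias warning sign worth investigating.
gap = max(by_group.values()) - min(by_group.values())
print(overall, by_group, gap)
```

Here group B is served noticeably worse than group A, which an overall accuracy figure alone would hide.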
Deployment
- Definition: Put the model into use for real-world tasks.
- Example: Add the churn prediction model to a customer dashboard.
Monitoring & Maintenance
- Definition: Watch for model drift and retrain with new data as needed.
- Example: Retrain the model after a new product launch changes customer behavior.
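A very small drift check can be sketched as follows: compare the mean of a key feature in incoming data against the baseline seen at training time, and flag retraining when the relative shift exceeds a tolerance. The baseline value, tolerance, and feature are all assumed for illustration; production systems typically use statistical tests over many features.

```python
# Minimal drift check (illustrative parameters).
TRAIN_MEAN = 42.0   # e.g., average monthly spend observed during training
TOLERANCE = 0.15    # allow a 15% relative shift before flagging

def needs_retraining(new_values):
    """Flag retraining if the new data's mean drifts too far from baseline."""
    new_mean = sum(new_values) / len(new_values)
    shift = abs(new_mean - TRAIN_MEAN) / TRAIN_MEAN
    return shift > TOLERANCE

print(needs_retraining([41, 43, 40, 44]))  # small shift -> False
print(needs_retraining([60, 65, 58, 70]))  # behavior changed -> True
```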
Data Engineering Essentials
Data Pipeline
- Definition: An automated “assembly line” that moves, cleans, and prepares data.
- Example: A nightly pipeline combines sales data from all stores.
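The nightly-pipeline idea can be sketched as a plain function: extract per-store feeds, then combine them into one summary. The store names, SKUs, and amounts below are invented; a real pipeline would read from files or databases and be scheduled by a tool such as Airflow.

```python
# Hypothetical per-store sales feeds collected overnight.
store_feeds = {
    "store_1": [{"sku": "A", "amount": 10.0}, {"sku": "B", "amount": 5.5}],
    "store_2": [{"sku": "A", "amount": 7.25}],
}

def run_pipeline(feeds):
    """Combine all stores' sales into one total per SKU."""
    combined = {}
    for store, rows in feeds.items():
        for row in rows:
            combined[row["sku"]] = combined.get(row["sku"], 0.0) + row["amount"]
    return combined

print(run_pipeline(store_feeds))  # {'A': 17.25, 'B': 5.5}
```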
Data Storage Types
Data Warehouse
- Definition: Stores clean, structured data optimized for business analytics.
- Example: Quarterly sales reports.
Data Lake
- Definition: Large storage for all raw data, including images and text.
- Example: Customer reviews and social media posts.
Data Lakehouse
- Definition: Combines the flexibility of a data lake with the structure of a data warehouse.
- Example: Run business reports and ML workloads on the same structured and unstructured data.

Key ML Concepts & Roles
How AI Recognizes Objects
- Definition: Learns from thousands of labeled examples to spot patterns.
- Example: Trained on cat images, it identifies features like fur and whiskers to recognize new cats.
Random Forest
- Definition: An ensemble method where many decision trees vote to make strong overall predictions.
- Example: Predicting “spam or not spam” by combining votes from many small decision trees.
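The voting idea can be illustrated with a deliberately simplified sketch. Note this is not a real random forest (a forest trains many decision trees on bootstrapped samples with random feature subsets); here, three hand-written weak rules each cast a spam/not-spam vote, and the majority wins, which is the same aggregation step a forest uses at prediction time. The rules themselves are invented examples.

```python
# Three weak "voters", each returning 1 (spam) or 0 (not spam).
def rule_keywords(msg):
    return int("free" in msg.lower() or "winner" in msg.lower())

def rule_caps(msg):
    # Mostly-uppercase messages are suspicious.
    return int(sum(c.isupper() for c in msg) > len(msg) * 0.5)

def rule_links(msg):
    return int("http://" in msg or "https://" in msg)

def predict_spam(msg):
    votes = rule_keywords(msg) + rule_caps(msg) + rule_links(msg)
    return votes >= 2  # majority of 3 voters

print(predict_spam("FREE WINNER!! claim at http://example.com"))  # True
print(predict_spam("Lunch at noon tomorrow?"))                    # False
```

In a real forest each "voter" is a full decision tree learned from data, so the ensemble is far stronger than any single tree.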
Typical Roles
- Data Engineer: Builds data pipelines and manages data storage.
- Data Scientist: Explores data and builds initial models.
- ML Engineer: Transforms models into robust, production-ready applications.
- Project Manager: Ensures the project stays on schedule and meets objectives.
Responsible AI Principles
Fairness & Bias
- Definition: Ensuring AI systems are not unfair or discriminatory.
- Example: A loan approval AI that treats all applicants equitably.
Transparency & Explainability
- Definition: Understanding why the AI made a given decision.
- Example: Explaining that a loan was denied due to a low credit score.
Robustness & Reliability
- Definition: AI performs correctly even with novel or unexpected data.
- Example: If accuracy drops after a company enters a new market, retrain the model to fix it.
Privacy & Security
- Definition: Protecting data and making models safe from attacks.
- Example: An attacker attempts to trick a model with manipulated input; adversarial training enhances its robustness.
Key Mitigation Strategies
- Privacy: Data anonymization and differential privacy.
- Security: Train models to resist adversarial data.
- Robustness: Retrain models when data patterns change.
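Differential privacy, mentioned above, can be sketched for a simple count query: add Laplace noise scaled to sensitivity/epsilon so the published number reveals little about any one individual. The epsilon, sensitivity, and seed values here are illustrative choices, not recommendations.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace noise via the inverse-CDF method."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon=0.5, sensitivity=1.0, seed=0):
    """Return a differentially private version of a count query."""
    rng = random.Random(seed)  # fixed seed here only to make the demo repeatable
    return true_count + laplace_noise(sensitivity / epsilon, rng)

noisy = private_count(128)
print(round(noisy, 2))  # close to 128, but randomized
```

Smaller epsilon means more noise and stronger privacy; the analyst trades accuracy for protection.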
Example Tools and Practices
Data Pipeline Tools
- Apache Airflow: Schedules and manages complex workflows.
- AWS Glue: A serverless data integration service.
- dbt: Transforms data inside your warehouse using SQL.
Design & Governance
- Architecture Design Session (ADS): Maps the data flow from sources to dashboards.
- Compliance: Ensures systems comply with data privacy laws (e.g., GDPR).
Quick-Reference Table: Key Terms
| Concept | Definition | Real-World Example |
|---|---|---|
| Data Pipeline | Automated steps that move and clean data | Collects sales data each night |
| Data Warehouse | Store for clean, structured data | Monthly sales dashboards |
| Data Lake | Store for all raw, flexible data | Tweets, images, logs |
| Data Lakehouse | Flexible storage with structure and reliability | Business reports and ML on the same data |
| Random Forest | Many trees voting for an accurate prediction | Predicting "spam or not" in email |
| Feature Engineering | Creating new, useful variables | "Age" from "date of birth" |
| Model Drift | Model accuracy drops as data changes | Sales model fails after a new promotion |
| Bias | Model consistently unfair to a group | AI rejects qualified female candidates |
| Adversarial Attack | Input designed to fool models | Tricking self-driving cars with stickers |
This document provides foundational knowledge, practical examples, and ethical considerations essential for success in AI and machine learning projects.