Machine Learning & AI Foundations: Definitions, Lifecycle, and Tools
CRISP-ML(Q) Project Lifecycle
- Definition: CRISP-ML(Q) (Cross-Industry Standard Process for Machine Learning with Quality assurance) is a six-phase framework for managing machine learning projects, with a focus on quality at each step.
- Phases and Examples:
Business & Data Understanding
- Definition: Define the business problem and assess available data.
- Example: Goal: Reduce customer churn by 15%. Data: Purchase history, support tickets.
Data Preparation
- Definition: Clean, organize, and transform raw data for modeling.
- Example: Create “age” from “date of birth”; unify country codes like “USA” and “U.S.A.”
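The two transformations above can be sketched in a few lines of Python. This is a minimal, illustrative sketch; the record fields (`dob`, `country`) and the `COUNTRY_CODES` mapping are hypothetical stand-ins for whatever your raw data actually contains.

```python
from datetime import date

# Hypothetical raw customer records; field names are illustrative.
raw = [
    {"dob": date(1990, 6, 15), "country": "U.S.A."},
    {"dob": date(1985, 1, 2), "country": "USA"},
]

# Map common spelling variants to one canonical country code.
COUNTRY_CODES = {"USA": "US", "U.S.A.": "US", "United States": "US"}

def prepare(record, today=date(2024, 1, 1)):
    """Derive an 'age' feature and normalize the country field."""
    # Subtract one year if the birthday has not yet occurred this year.
    age = today.year - record["dob"].year - (
        (today.month, today.day) < (record["dob"].month, record["dob"].day)
    )
    country = COUNTRY_CODES.get(record["country"], record["country"])
    return {"age": age, "country": country}

cleaned = [prepare(r) for r in raw]
print(cleaned)  # both rows now share the country code "US"
```

The same idea scales up in tools like pandas or dbt; the point is that derived features and unified codes are created before modeling, not during it.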
Modeling
- Definition: Select an algorithm and train the machine learning model.
- Example: Train a random forest to predict who might leave the company.
Evaluation
- Definition: Test the accuracy and fairness of the model.
- Example: Target 90% accuracy; check for potential bias.
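Evaluation means more than one overall number. The sketch below, using made-up predictions, computes overall accuracy and then a simple fairness check: accuracy per customer segment, plus the gap between the best- and worst-served groups. The data and segment labels are invented for illustration.

```python
# Toy predictions vs. ground truth for a churn model (illustrative data).
# Each row: (predicted, actual, group) — 'group' is a customer segment.
results = [
    (1, 1, "A"), (0, 0, "A"), (1, 0, "A"), (0, 0, "A"),
    (1, 1, "B"), (0, 1, "B"), (0, 1, "B"), (1, 1, "B"),
]

def accuracy(rows):
    return sum(pred == actual for pred, actual, _ in rows) / len(rows)

overall = accuracy(results)

# Simple fairness check: compare accuracy across segments.
by_group = {}
for g in sorted({g for _, _, g in results}):
    by_group[g] = accuracy([r for r in results if r[2] == g])

# A large gap between groups is a bias warning sign worth investigating.
gap = max(by_group.values()) - min(by_group.values())
print(overall, by_group, gap)
```

Here group B is served noticeably worse than group A, which an overall accuracy figure alone would hide.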
Deployment
- Definition: Put the model into use for real-world tasks.
- Example: Add the churn prediction model to a customer dashboard.
Monitoring & Maintenance
- Definition: Watch for model drift and retrain with new data as needed.
- Example: Retrain the model after a new product launch changes customer behavior.
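A very small drift check can be sketched as follows: compare the mean of a key feature in incoming data against the baseline seen at training time, and flag retraining when the relative shift exceeds a tolerance. The baseline value, tolerance, and feature are all assumed for illustration; production systems typically use statistical tests over many features.

```python
# Minimal drift check (illustrative parameters).
TRAIN_MEAN = 42.0   # e.g., average monthly spend observed during training
TOLERANCE = 0.15    # allow a 15% relative shift before flagging

def needs_retraining(new_values):
    """Flag retraining if the new data's mean drifts too far from baseline."""
    new_mean = sum(new_values) / len(new_values)
    shift = abs(new_mean - TRAIN_MEAN) / TRAIN_MEAN
    return shift > TOLERANCE

print(needs_retraining([41, 43, 40, 44]))  # small shift -> False
print(needs_retraining([60, 65, 58, 70]))  # behavior changed -> True
```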
Data Engineering Essentials
Data Pipeline
- Definition: An automated “assembly line” that moves, cleans, and prepares data.
- Example: A nightly pipeline combines sales data from all stores.
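The nightly-pipeline idea can be sketched as a plain function: extract per-store feeds, then combine them into one summary. The store names, SKUs, and amounts below are invented; a real pipeline would read from files or databases and be scheduled by a tool such as Airflow.

```python
# Hypothetical per-store sales feeds collected overnight.
store_feeds = {
    "store_1": [{"sku": "A", "amount": 10.0}, {"sku": "B", "amount": 5.5}],
    "store_2": [{"sku": "A", "amount": 7.25}],
}

def run_pipeline(feeds):
    """Combine all stores' sales into one total per SKU."""
    combined = {}
    for store, rows in feeds.items():
        for row in rows:
            combined[row["sku"]] = combined.get(row["sku"], 0.0) + row["amount"]
    return combined

print(run_pipeline(store_feeds))  # {'A': 17.25, 'B': 5.5}
```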
Data Storage Types
Data Warehouse
- Definition: Stores clean, structured data optimized for business analytics.
- Example: Quarterly sales reports.
Data Lake
- Definition: Large storage for all raw data, including images and text.
- Example: Customer reviews and social media posts.
Data Lakehouse
- Definition: Combines the flexibility of a data lake with the structure of a data warehouse.
- Example: Run business reports and ML workloads on the same structured and unstructured data.

Key ML Concepts & Roles
How AI Recognizes Objects
- Definition: Learns from thousands of labeled examples to spot patterns.
- Example: Trained on cat images, it identifies features like fur and whiskers to recognize new cats.
Random Forest
- Definition: An ensemble method where many decision trees vote to make strong overall predictions.
- Example: Predicting “spam or not spam” by combining votes from many small decision trees.
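The voting idea can be illustrated with a deliberately simplified sketch. Note this is not a real random forest (a forest trains many decision trees on bootstrapped samples with random feature subsets); here, three hand-written weak rules each cast a spam/not-spam vote, and the majority wins, which is the same aggregation step a forest uses at prediction time. The rules themselves are invented examples.

```python
# Three weak "voters", each returning 1 (spam) or 0 (not spam).
def rule_keywords(msg):
    return int("free" in msg.lower() or "winner" in msg.lower())

def rule_caps(msg):
    # Mostly-uppercase messages are suspicious.
    return int(sum(c.isupper() for c in msg) > len(msg) * 0.5)

def rule_links(msg):
    return int("http://" in msg or "https://" in msg)

def predict_spam(msg):
    votes = rule_keywords(msg) + rule_caps(msg) + rule_links(msg)
    return votes >= 2  # majority of 3 voters

print(predict_spam("FREE WINNER!! claim at http://example.com"))  # True
print(predict_spam("Lunch at noon tomorrow?"))                    # False
```

In a real forest each "voter" is a full decision tree learned from data, so the ensemble is far stronger than any single tree.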
Typical Roles
- Data Engineer: Builds data pipelines and manages data storage.
- Data Scientist: Explores data and builds initial models.
- ML Engineer: Transforms models into robust, production-ready applications.
- Project Manager: Ensures the project stays on schedule and meets objectives.
Responsible AI Principles
Fairness & Bias
- Definition: Ensuring AI systems are not unfair or discriminatory.
- Example: A loan approval AI that treats all applicants equitably.
Transparency & Explainability
- Definition: Understanding why the AI made a given decision.
- Example: Explaining that a loan was denied due to a low credit score.
Robustness & Reliability
- Definition: AI performs correctly even with novel or unexpected data.
- Example: If accuracy drops after a company enters a new market, retrain the model to fix it.
Privacy & Security
- Definition: Protecting data and making models safe from attacks.
- Example: An attacker attempts to trick a model with manipulated input; adversarial training enhances its robustness.
Key Mitigation Strategies
- Privacy: Data anonymization and differential privacy.
- Security: Train models to resist adversarial data.
- Robustness: Retrain models when data patterns change.
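Differential privacy, mentioned above, can be sketched for a simple count query: add Laplace noise scaled to sensitivity/epsilon so the published number reveals little about any one individual. The epsilon, sensitivity, and seed values here are illustrative choices, not recommendations.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace noise via the inverse-CDF method."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon=0.5, sensitivity=1.0, seed=0):
    """Return a differentially private version of a count query."""
    rng = random.Random(seed)  # fixed seed here only to make the demo repeatable
    return true_count + laplace_noise(sensitivity / epsilon, rng)

noisy = private_count(128)
print(round(noisy, 2))  # close to 128, but randomized
```

Smaller epsilon means more noise and stronger privacy; the analyst trades accuracy for protection.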
Example Tools and Practices
Data Pipeline Tools
- Apache Airflow: Schedules and manages complex workflows.
- AWS Glue: A serverless data integration service.
- dbt: Transforms data inside your warehouse using SQL.
Design & Governance
- Architecture Design Session (ADS): Maps the data flow from sources to dashboards.
- Compliance: Ensures systems comply with data privacy laws (e.g., GDPR).
Quick-Reference Table: Key Terms
| Concept | Definition | Real-World Example |
|---|---|---|
| Data Pipeline | Automated steps that move and clean data | Collects sales data each night |
| Data Warehouse | Store for clean, structured data | Monthly sales dashboards |
| Data Lake | Store for all raw, flexible data | Tweets, images, logs |
| Data Lakehouse | Flexible storage with structure and reliability | Business reports and ML on the same data |
| Random Forest | Many trees voting for an accurate prediction | Predicting "spam or not" in email |
| Feature Engineering | Creating new, useful variables | "Age" from "date of birth" |
| Model Drift | Model accuracy drops as data changes | Sales model fails after a new promotion |
| Bias | Model consistently unfair to a group | AI rejects qualified female candidates |
| Adversarial Attack | Input designed to fool models | Tricking self-driving cars with stickers |
This document provides foundational knowledge, practical examples, and ethical considerations essential for success in AI and machine learning projects.