Star vs. Snowflake Schema: SQL Examples & Python Clustering
Snowflake Schema Example
Below are SQL examples demonstrating the structure of a Snowflake schema:
CREATE TABLE sales_fact (
customer_id VARCHAR(10),
car_id VARCHAR(10),
date_id DATE
);CREATE TABLE customer_dim (
customer_id INT PRIMARY KEY,
name VARCHAR(20),
address VARCHAR(25),
phone_no INT,
city VARCHAR(10)
);CREATE TABLE car_dim (
car_id INT PRIMARY KEY,
name VARCHAR(10),
year INT,
model VARCHAR(10)
);CREATE TABLE date_dim (
date_id INT PRIMARY KEY,
day_of_week VARCHAR(10),
month VARCHAR(10),
year INT
);CREATE TABLE city_dim (
state VARCHAR(10),
zipcode INT
);CREATE TABLE model_dim (
model_id INT,
year INT,
items_sold INT
);Star Schema Example
Below are SQL examples demonstrating the structure of a Star schema:
CREATE TABLE sales_fact (
customer_id VARCHAR(10),
car_id INT,
date_id DATE,
amount MONEY
);CREATE TABLE car_dim (
car_id INT PRIMARY KEY,
name VARCHAR(10),
year INT,
model VARCHAR(10)
);CREATE TABLE customer_dim (
customer_id INT PRIMARY KEY,
name VARCHAR(20),
address VARCHAR(25),
phone_no INT
);CREATE TABLE date_dim (
date_id INT PRIMARY KEY,
date DATE,
day_of_week INT,
year INT,
month VARCHAR(10)
);K-Means Clustering with Python
This Python script demonstrates K-Means clustering on the Iris dataset using scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Apply k-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
# Get cluster labels and cluster centers
labels = kmeans.labels_
centers = kmeans.cluster_centers_
# Reduce the dimensionality for visualization (using PCA)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Plot the clusters
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', edgecolor='k', s=80)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Cluster Centers')
plt.title('K-Means Clustering of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()Installation
Install the necessary Python libraries using pip:
pip install numpy matplotlib scikit-learnData Warehouse Table Examples
Dimension Table: Time
An example of a time dimension table:
| time_id | date | day_of_week | month | quarter | year |
|---|---|---|---|---|---|
| 1 | 2023-01-01 | Sunday | January | Q1 | 2023 |
| 2 | 2023-01-02 | Monday | January | Q1 | 2023 |
| … | … | … | … | … | … |
Column Descriptions:
time_id: Primary key for the time dimension.date: The specific date.day_of_week: Day of the week.month: Month of the year.quarter: Quarter of the year.year: Year.
Fact Table: HospitalFacts
An example of a fact table for hospital data:
| patient_id | doctor_id | time_id | admission_count | diagnosis_code | treatment_code | cost |
|---|---|---|---|---|---|---|
| 101 | 201 | 1 | 2 | A | X | 500 |
| 102 | 202 | 2 | 1 | B | Y | 300 |
| … | … | … | … | … | … | … |
Column Descriptions:
patient_id: Foreign key to the patient dimension.doctor_id: Foreign key to the doctor dimension.time_id: Foreign key to the time dimension.admission_count: Number of times a patient is admitted.diagnosis_code: Code representing the diagnosis.treatment_code: Code representing the treatment.cost: Cost associated with the admission, diagnosis, and treatment.
