Practical Machine Learning and Data Analysis with Python
California Housing Dataset: Exploratory Data Analysis
This section demonstrates essential steps for Exploratory Data Analysis (EDA) on the California Housing dataset, focusing on numerical features. We will visualize data distributions using histograms and box plots, and identify outliers using the Interquartile Range (IQR) method.
Numerical Feature Analysis
First, we load the dataset and identify its numerical features. The following Python code snippet performs the necessary imports and data loading:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import fetch_california_housing
# Load the California Housing dataset
california = fetch_california_housing(as_frame=True)
df = california.frame
# Identify numerical features
numerical_features = df.select_dtypes(include=['number']).columns
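All nine columns in the frame (the eight predictors plus the MedHouseVal target) are numeric, so numerical_features covers the entire DataFrame and the 3 x 3 subplot grids below have room for every feature. A quick sanity check of the loaded data, as a minimal sketch:
print(df.shape)                  # (20640, 9): 8 features plus the MedHouseVal target
print(list(numerical_features))  # ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', ...]
print(df.describe().T[['mean', 'std', 'min', 'max']])  # quick summary statistics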
Histograms of Features
Histograms provide a visual representation of the distribution of numerical data. They help in understanding the central tendency, spread, and shape of the data for each feature.
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.histplot(df[feature], kde=True)
    plt.title(f'Histogram of {feature}')
plt.tight_layout()
plt.show()
Box Plots and Outlier Analysis
Box plots are excellent for visualizing the distribution of numerical data and identifying potential outliers. They display the median, quartiles, and extreme values of a dataset.
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(y=df[feature])
    plt.title(f'Boxplot of {feature}')
plt.tight_layout()
plt.show()
IQR Outlier Identification
The Interquartile Range (IQR) method is a common technique for detecting outliers. An outlier is typically defined as a data point that falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. For example, for the values 1, 2, 3, 4, 100, Q1 = 2 and Q3 = 4, so IQR = 2 and the value 100 lies far above Q3 + 1.5 * IQR = 7, flagging it as an outlier.
def detect_outliers_iqr(data):
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    return outliers

print("Outlier Analysis Results:")
for feature in numerical_features:
    outliers = detect_outliers_iqr(df[feature])
    print(f"\nFeature: {feature}")
    print(f"Number of outliers: {len(outliers)}")
    if len(outliers) > 0 and len(outliers) < 20:
        print(f"Outlier values: {outliers.values}")
    elif len(outliers) >= 20:
        print("Too many outliers to display.")
    else:
        print("No outliers detected.")
California Housing Dataset: Correlation and Pair Plots
Understanding the relationships between features is crucial in data analysis. This section focuses on visualizing these relationships within the California Housing dataset using a correlation matrix heatmap and a pair plot.
Visualizing Feature Relationships
The following Python code defines a function to load the dataset, compute the correlation matrix, and generate both a heatmap and a pair plot.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

def analyze_california_housing():
    """
    Computes and visualizes the correlation matrix and pair plot
    for the California Housing dataset.
    """
    try:
        # Load the California Housing dataset
        california = fetch_california_housing(as_frame=True)
        df = california.frame
        # Compute the correlation matrix
        correlation_matrix = df.corr()
        # Visualize the correlation matrix using a heatmap
        plt.figure(figsize=(10, 8))
        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
        plt.title('Correlation Matrix of California Housing Features')
        plt.show()
        # Create a pair plot to visualize pairwise relationships between features
        sns.pairplot(df)
        plt.show()
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    analyze_california_housing()
Correlation Matrix Heatmap
A correlation matrix displays the correlation coefficients between all pairs of features. A heatmap provides a color-coded visualization of this matrix, making it easy to spot strong positive or negative correlations.
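Beyond eyeballing the heatmap, it is often useful to rank the features by their correlation with the target. The sketch below assumes df holds the California Housing frame loaded earlier (MedHouseVal is the median house value target):
# Rank features by absolute correlation with the MedHouseVal target
target_corr = df.corr()['MedHouseVal'].drop('MedHouseVal')
print(target_corr.reindex(target_corr.abs().sort_values(ascending=False).index))
# MedInc (median income) shows the strongest correlation, roughly 0.69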
Pair Plot Visualization
A pair plot (or scatterplot matrix) shows scatter plots for each pair of features and histograms for individual features. This comprehensive visualization helps in understanding distributions and pairwise relationships simultaneously.
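Note that sns.pairplot over all nine columns and roughly 20,000 rows can take a while to render. A common workaround, sketched below with an illustrative column subset, is to plot a random sample of rows:
# Pair plot on a random sample and a subset of columns to keep rendering fast
subset_cols = ['MedInc', 'HouseAge', 'AveRooms', 'MedHouseVal']  # illustrative choice
sns.pairplot(df[subset_cols].sample(n=1000, random_state=42))
plt.show()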
Principal Component Analysis (PCA) for Dimensionality Reduction
Principal Component Analysis (PCA) is a powerful dimensionality reduction technique widely used in machine learning and data analysis. It transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible.
Implementing PCA from Scratch
This section provides a custom implementation of the PCA algorithm in Python, followed by its application to the well-known Iris dataset.
PCA Algorithm Steps
The custom pca function performs the following steps:
- Center the data: Subtract the mean of each feature from the data.
- Compute the covariance matrix: This matrix describes the variance and covariance between all pairs of features.
- Compute eigenvalues and eigenvectors: Eigenvectors represent the principal components, and eigenvalues indicate the amount of variance explained by each component.
- Sort eigenvalues and eigenvectors: Sort them in descending order to identify the components that capture the most variance.
- Select the top n components: Choose the top n_components eigenvectors to form the projection matrix.
- Transform the data: Project the centered data onto the selected principal components.
import numpy as np
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

def pca(X, n_components):
    """
    Performs Principal Component Analysis (PCA).
    Args:
        X (numpy.ndarray): Input data matrix (n_samples, n_features).
        n_components (int): Number of principal components to retain.
    Returns:
        numpy.ndarray: Transformed data matrix (n_samples, n_components).
    """
    # 1. Center the data
    X_meaned = X - np.mean(X, axis=0)
    # 2. Compute the covariance matrix
    cov_mat = np.cov(X_meaned, rowvar=False)
    # 3. Compute eigenvalues and eigenvectors (eigh, since the covariance matrix is symmetric)
    eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)
    # 4. Sort eigenvalues and eigenvectors in descending order
    sorted_index = np.argsort(eigen_values)[::-1]
    sorted_eigenvectors = eigen_vectors[:, sorted_index]
    # 5. Select the top n_components eigenvectors
    principal_components = sorted_eigenvectors[:, :n_components]
    # 6. Transform the data
    X_transformed = np.dot(X_meaned, principal_components)
    return X_transformed
Applying PCA to the Iris Dataset
The Iris dataset is a classic dataset in machine learning, often used for classification and clustering tasks. Here, we apply our custom PCA function to reduce its dimensionality to two components, making it suitable for 2D visualization.
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Apply PCA to reduce dimensionality to 2 components
X_pca = pca(X, 2)
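Before plotting, it is worth checking how much of the total variance the two retained components capture. The custom pca function does not expose the eigenvalues, so the sketch below recomputes the same eigen-decomposition; for the Iris data the first two components typically account for roughly 97-98% of the variance.
# Estimate the variance explained by the first two principal components
X_meaned = X - np.mean(X, axis=0)
eigen_values = np.linalg.eigh(np.cov(X_meaned, rowvar=False))[0]
explained_ratio = np.sort(eigen_values)[::-1] / eigen_values.sum()
print(f"Variance explained by 2 components: {explained_ratio[:2].sum():.2%}")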
Visualizing PCA Results
The transformed data can then be visualized as a scatter plot, where each point represents a sample, colored by its original species label. This helps in understanding how well the principal components separate the different classes.
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset (2 Components)')
plt.colorbar(ticks=np.unique(y), label='Species')
plt.show()
Machine Learning: The Find-S Algorithm
The Find-S algorithm is a fundamental concept learning algorithm in machine learning. Its goal is to find the most specific hypothesis that is consistent with all positive training examples. This implementation demonstrates how to derive such a hypothesis from a given dataset.
Understanding the Find-S Algorithm
This implementation of Find-S starts from the most specific hypothesis (the attribute values of the first positive training example) and generalizes it only as far as the data requires. For each subsequent positive example, any attribute value that agrees with the current hypothesis is retained, and any attribute that disagrees is replaced with '?', meaning any value is acceptable for that attribute. Negative examples are ignored.
Dataset Definition
For this demonstration, we define a simple dataset directly within the code, representing attributes like Sky, Temperature, Humidity, and Wind, with ‘PlayTennis’ as the target variable.
import pandas as pd

def find_s_algorithm():
    # Define the dataset directly
    data = pd.DataFrame({
        'Sky': ['Sunny', 'Sunny', 'Cloudy', 'Rainy', 'Sunny'],
        'Temperature': ['Warm', 'Hot', 'Warm', 'Cold', 'Warm'],
        'Humidity': ['Normal', 'High', 'High', 'Normal', 'Normal'],
        'Wind': ['Strong', 'Weak', 'Strong', 'Strong', 'Weak'],
        'PlayTennis': ['Yes', 'No', 'Yes', 'No', 'Yes']  # Target column
    })
    print("Training data:")
    print(data)
Algorithm Implementation
The core of the Find-S algorithm iterates through the training examples and refines the hypothesis using only the positive ones; negative examples are skipped. The following code continues inside find_s_algorithm:
    attributes = data.columns[:-1]  # All columns except the last one
    class_label = data.columns[-1]  # The last column is the target variable
    hypothesis = None               # Will hold the most specific hypothesis
    for index, row in data.iterrows():
        if row[class_label] == 'Yes':                    # Process only positive examples
            if hypothesis is None:
                hypothesis = list(row[attributes])       # Start from the first positive example
            else:
                for i, value in enumerate(row[attributes]):
                    if hypothesis[i] != value:
                        hypothesis[i] = '?'              # Generalize on mismatch
    return hypothesis
Finding the Most Specific Hypothesis
After processing all positive examples, the algorithm returns the most specific hypothesis that is consistent with the positive training examples. For the dataset above, the positive examples are rows 0, 2, and 4, so the result is ['?', 'Warm', '?', '?']: only a warm temperature is required, and every other attribute may take any value.
# Run the algorithm
hypothesis = find_s_algorithm()
print("\nThe final hypothesis is:", hypothesis)
K-Nearest Neighbors (KNN) Classification
K-Nearest Neighbors (KNN) is a simple, non-parametric supervised machine learning algorithm used for classification and regression tasks. It classifies a data point based on the majority class of its ‘k’ nearest neighbors in the feature space.
Implementing KNN from Scratch
This section provides a custom Python implementation of the KNN classification algorithm and demonstrates its application on a synthetic 1D dataset.
KNN Classification Function
The knn_classify function takes training data, training labels, test data, and the number of neighbors (k) as input. For each test point, it calculates distances to all training points, identifies the k closest neighbors, and predicts the label based on the most frequent class among these neighbors.
import numpy as np
import matplotlib.pyplot as plt

def knn_classify(train_x, train_y, test_x, k):
    """
    Classifies test_x using the k-Nearest Neighbors algorithm.
    Args:
        train_x: Training data features (1D array).
        train_y: Training data labels (1D array).
        test_x: Test data features (1D array).
        k: Number of neighbors to consider.
    Returns:
        Predicted labels for test_x (1D array).
    """
    predictions = []
    for test_point in test_x:
        distances = np.abs(train_x - test_point)     # Calculate absolute distances
        nearest_indices = np.argsort(distances)[:k]  # Find indices of the k nearest neighbors
        nearest_labels = train_y[nearest_indices]
        # Determine the most frequent label among the k neighbors
        unique_labels, counts = np.unique(nearest_labels, return_counts=True)
        predicted_label = unique_labels[np.argmax(counts)]
        predictions.append(predicted_label)
    return np.array(predictions)
Generating Synthetic Data and Classification
We generate a 1D synthetic dataset where the first 50 points are training data with labels based on their value (Class 1 if <= 0.5, Class 2 otherwise), and the remaining 50 points are test data. The KNN algorithm is then applied to classify these test points for various values of k.
# Generate 100 random values in the range [0, 1]
np.random.seed(42)  # For reproducibility
x = np.random.rand(100)
# Label the first 50 points: Class 1 if <= 0.5, Class 2 otherwise
y = np.zeros(100)
y[:50] = np.where(x[:50] <= 0.5, 1, 2)
y[50:] = -1  # Placeholder for the unlabeled test points
# Split data into training and test sets
train_x, train_y = x[:50], y[:50]
test_x = x[50:]
# Classify the remaining points using KNN for different values of k
k_values = [1, 2, 3, 4, 5, 20, 30]
predictions = {}
for k in k_values:
    predictions[k] = knn_classify(train_x, train_y, test_x, k)
    print(f"Predictions for k={k}: {predictions[k]}")
Visualizing KNN Performance
The results are visualized using a scatter plot, showing the training data points and the classified test data points. This helps in understanding how the decision boundary changes with different values of k.
# Plot the training points and the test points classified with the largest k.
# Wrapping this block in a loop over k_values would produce one figure per k.
k_plot = k_values[-1]
plt.figure(figsize=(8, 6))
plt.scatter(train_x, np.zeros_like(train_x), c=train_y, marker='o', label='Training Data')
plt.scatter(test_x, np.zeros_like(test_x), c=predictions[k_plot], marker='x', label=f'Test Data (k={k_plot})')
plt.xlabel('x')
plt.ylabel('Class')
plt.title(f'KNN Classification (k={k_plot})')
plt.legend()
plt.yticks([])  # Remove y-axis ticks
plt.show()
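As an optional cross-check (not part of the listing above), the custom routine can be compared with scikit-learn's KNeighborsClassifier, which expects 2-D feature arrays. A minimal sketch, assuming train_x, train_y, and test_x from the code above:
from sklearn.neighbors import KNeighborsClassifier

# Compare the custom implementation with scikit-learn for one value of k
k = 3
model = KNeighborsClassifier(n_neighbors=k)
model.fit(train_x.reshape(-1, 1), train_y)      # scikit-learn expects 2-D features
sklearn_pred = model.predict(test_x.reshape(-1, 1))
custom_pred = knn_classify(train_x, train_y, test_x, k)
print("Agreement with custom KNN:", np.mean(sklearn_pred == custom_pred))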