Practical Machine Learning and Data Analysis with Python

California Housing Dataset: Exploratory Data Analysis

This section demonstrates essential steps for Exploratory Data Analysis (EDA) on the California Housing dataset, focusing on numerical features. We will visualize data distributions using histograms and box plots, and identify outliers using the Interquartile Range (IQR) method.

Numerical Feature Analysis

First, we load the dataset and identify its numerical features. The following Python code snippet performs the necessary imports and data loading:


import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
california = fetch_california_housing(as_frame=True)
df = california.frame

# Identify numerical features
numerical_features = df.select_dtypes(include=['number']).columns

Histograms of Features

Histograms provide a visual representation of the distribution of numerical data. They help in understanding the central tendency, spread, and shape of the data for each feature.


plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.histplot(df[feature], kde=True)
    plt.title(f'Histogram of {feature}')
plt.tight_layout()
plt.show()

    Box Plots and Outlier Analysis

    Box plots are excellent for visualizing the distribution of numerical data and identifying potential outliers. They display the median, quartiles, and extreme values of a dataset.

    
    plt.figure(figsize=(15, 10))
    for i, feature in enumerate(numerical_features):
        plt.subplot(3, 3, i + 1)
        sns.boxplot(y=df[feature])
        plt.title(f'Boxplot of {feature}')
    plt.tight_layout()
    plt.show()

    IQR Outlier Identification

    The Interquartile Range (IQR) method is a common technique for detecting outliers. An outlier is typically defined as a data point that falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

    
    def detect_outliers_iqr(data):
        q1 = data.quantile(0.25)
        q3 = data.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        outliers = data[(data < lower_bound) | (data > upper_bound)]
        return outliers
    
    print("Outlier Analysis Results:")
    for feature in numerical_features:
        outliers = detect_outliers_iqr(df[feature])
        print(f"\nFeature: {feature}")
        print(f"Number of outliers: {len(outliers)}")
        if len(outliers) > 0 and len(outliers) < 20:
            print(f"Outlier values: {outliers.values}")
        elif len(outliers) >= 20:
            print("Too many outliers to display.")
        else:
            print("No outliers detected.")
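
    Once outliers have been identified, the same bounds can be used to filter (or cap) them before modeling. The snippet below is a minimal sketch, not part of the analysis above, that drops rows whose AveRooms value falls outside its IQR bounds; the choice of feature is purely illustrative.

    # Filter one feature's rows by its IQR bounds (illustrative feature choice).
    q1 = df['AveRooms'].quantile(0.25)
    q3 = df['AveRooms'].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    filtered_df = df[(df['AveRooms'] >= lower_bound) & (df['AveRooms'] <= upper_bound)]
    print(f"Rows before filtering: {len(df)}, after filtering: {len(filtered_df)}")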

    California Housing Dataset: Correlation and Pair Plots

    Understanding the relationships between features is crucial in data analysis. This section focuses on visualizing these relationships within the California Housing dataset using a correlation matrix heatmap and a pair plot.

    Visualizing Feature Relationships

    The following Python code defines a function to load the dataset, compute the correlation matrix, and generate both a heatmap and a pair plot.

    
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.datasets import fetch_california_housing
    
    def analyze_california_housing():
        """
        Computes and visualizes the correlation matrix and pair plot for the California Housing dataset.
        """
        try:
            # Load the California Housing dataset
            california = fetch_california_housing(as_frame=True)
            df = california.frame
    
            # Compute the correlation matrix
            correlation_matrix = df.corr()
    
            # Visualize the correlation matrix using a heatmap
            plt.figure(figsize=(10, 8))
            sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
            plt.title('Correlation Matrix of California Housing Features')
            plt.show()
    
            # Create a pair plot to visualize pairwise relationships between features
            sns.pairplot(df)
            plt.show()
    
        except Exception as e:
            print(f"An error occurred: {e}")
    
    if __name__ == "__main__":
        analyze_california_housing()
    

    Correlation Matrix Heatmap

    A correlation matrix displays the correlation coefficients between all pairs of features. A heatmap provides a color-coded visualization of this matrix, making it easy to spot strong positive or negative correlations.
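
    To read the heatmap numerically, it can help to rank the features by their correlation with the target column. A minimal sketch, assuming df is the California Housing frame loaded as above (whose target column is MedHouseVal):

    # Rank features by their correlation with the median house value target.
    correlation_with_target = df.corr()['MedHouseVal'].drop('MedHouseVal')
    print(correlation_with_target.sort_values(ascending=False))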

    Pair Plot Visualization

    A pair plot (or scatterplot matrix) shows scatter plots for each pair of features and histograms for individual features. This comprehensive visualization helps in understanding distributions and pairwise relationships simultaneously.
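
    Because the full pair plot draws every pairwise scatter for all nine columns across more than 20,000 rows, it can be slow and visually dense. One common workaround, sketched below under the assumption that df is loaded as above, is to plot a random sample of rows for a hand-picked subset of columns (the subset here is illustrative).

    # Pair plot on a sample of rows and a subset of columns to keep it readable.
    subset_cols = ['MedInc', 'HouseAge', 'AveRooms', 'MedHouseVal']
    sns.pairplot(df[subset_cols].sample(n=1000, random_state=0))
    plt.show()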

    Principal Component Analysis (PCA) for Dimensionality Reduction

    Principal Component Analysis (PCA) is a powerful dimensionality reduction technique widely used in machine learning and data analysis. It transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible.

    Implementing PCA from Scratch

    This section provides a custom implementation of the PCA algorithm in Python, followed by its application to the well-known Iris dataset.

    PCA Algorithm Steps

    The custom pca function performs the following steps:

    1. Center the data: Subtract the mean of each feature from the data.
    2. Compute the covariance matrix: This matrix describes the variance and covariance between all pairs of features.
    3. Compute eigenvalues and eigenvectors: Eigenvectors represent the principal components, and eigenvalues indicate the amount of variance explained by each component.
    4. Sort eigenvalues and eigenvectors: Sort them in descending order to identify the components that capture the most variance.
    5. Select top n components: Choose the top n_components eigenvectors to form the projection matrix.
    6. Transform the data: Project the centered data onto the selected principal components.
    
    import numpy as np
    from sklearn.datasets import load_iris
    import matplotlib.pyplot as plt
    
    def pca(X, n_components):
        """
        Performs Principal Component Analysis (PCA).
        Args:
            X (numpy.ndarray): Input data matrix (n_samples, n_features).
            n_components (int): Number of principal components to retain.
        Returns:
            numpy.ndarray: Transformed data matrix (n_samples, n_components).
        """
        # 1. Center the data
        X_meaned = X - np.mean(X, axis=0)
        # 2. Compute the covariance matrix
        cov_mat = np.cov(X_meaned, rowvar=False)
        # 3. Compute eigenvalues and eigenvectors
        eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)
        # 4. Sort eigenvalues and eigenvectors in descending order
        sorted_index = np.argsort(eigen_values)[::-1]
        sorted_eigenvectors = eigen_vectors[:, sorted_index]
        # 5. Select the top n_components eigenvectors
        principal_components = sorted_eigenvectors[:, :n_components]
        # 6. Transform the data
        X_transformed = np.dot(X_meaned, principal_components)
        return X_transformed
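
    Step 3 above notes that the eigenvalues measure how much variance each component captures. The helper below is a minimal sketch, not part of the original listing, that reuses the same centering and eigendecomposition to report each component's share of the total variance, which helps when choosing n_components.

    def explained_variance_ratio(X):
        """Returns the fraction of total variance captured by each principal component."""
        # Same centering and eigendecomposition as in pca() above.
        X_meaned = X - np.mean(X, axis=0)
        cov_mat = np.cov(X_meaned, rowvar=False)
        eigen_values, _ = np.linalg.eigh(cov_mat)
        eigen_values = eigen_values[::-1]  # eigh returns ascending order; reverse to descending
        return eigen_values / eigen_values.sum()

    For the Iris data used below, the first two ratios sum to well over 0.95, which is why a 2D projection is reasonable.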
    

    Applying PCA to the Iris Dataset

    The Iris dataset is a classic dataset in machine learning, often used for classification and clustering tasks. Here, we apply our custom PCA function to reduce its dimensionality to two components, making it suitable for 2D visualization.

    
    # Load the Iris dataset
    iris = load_iris()
    X = iris.data
    y = iris.target
    
    # Apply PCA to reduce dimensionality to 2 components
    X_pca = pca(X, 2)
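
    As a sanity check on the from-scratch version, the same reduction can be done with scikit-learn's PCA class; the sketch below assumes X as loaded above. The two projections should agree up to a per-column sign flip, since the sign of each eigenvector is arbitrary.

    from sklearn.decomposition import PCA

    # Compare the custom projection with scikit-learn's; columns may differ only in sign.
    sk_pca = PCA(n_components=2)
    X_pca_sklearn = sk_pca.fit_transform(X)
    print(np.allclose(np.abs(X_pca), np.abs(X_pca_sklearn)))  # should typically print True
    print("Explained variance ratio:", sk_pca.explained_variance_ratio_)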
    

    Visualizing PCA Results

    The transformed data can then be visualized as a scatter plot, where each point represents a sample, colored by its original species label. This helps in understanding how well the principal components separate the different classes.

    
    plt.figure(figsize=(8, 6))
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.title('PCA of Iris Dataset (2 Components)')
    plt.colorbar(ticks=np.unique(y), label='Species')
    plt.show()
    

    Machine Learning: The Find-S Algorithm

    The Find-S algorithm is a fundamental concept learning algorithm in machine learning. Its goal is to find the most specific hypothesis that is consistent with all positive training examples. This implementation demonstrates how to derive such a hypothesis from a given dataset.

    Understanding the Find-S Algorithm

    This implementation of Find-S starts with an uninitialized hypothesis, adopts the attribute values of the first positive training example (the most specific hypothesis consistent with it), and then generalizes from there: for each subsequent positive example, an attribute value that matches the current hypothesis is retained, while a mismatch is replaced with '?', meaning any value is acceptable for that attribute. Negative examples are ignored.

    Dataset Definition

    For this demonstration, we define a simple dataset directly within the code, representing attributes like Sky, Temperature, Humidity, and Wind, with ‘PlayTennis’ as the target variable.

    
    import pandas as pd
    
    def find_s_algorithm():
        # Define the dataset directly
        data = pd.DataFrame({
            'Sky': ['Sunny', 'Sunny', 'Cloudy', 'Rainy', 'Sunny'],
            'Temperature': ['Warm', 'Hot', 'Warm', 'Cold', 'Warm'],
            'Humidity': ['Normal', 'High', 'High', 'Normal', 'Normal'],
            'Wind': ['Strong', 'Weak', 'Strong', 'Strong', 'Weak'],
            'PlayTennis': ['Yes', 'No', 'Yes', 'No', 'Yes']  # Target column
        })
    
        print("Training data:")
        print(data)
    

    Algorithm Implementation

    The core of the algorithm iterates over the training rows and refines the hypothesis using only the positive examples; negative examples are skipped.

    
        attributes = data.columns[:-1]  # All columns except the last one
        class_label = data.columns[-1]  # The last column is the target variable
    
    hypothesis = [None for _ in attributes]  # No hypothesis yet; the first positive example fills it in

    for index, row in data.iterrows():
        if row[class_label] == 'Yes':  # Process only positive examples
            for i, value in enumerate(row[attributes]):
                if hypothesis[i] is None or hypothesis[i] == value:
                    hypothesis[i] = value  # Adopt or keep the attribute value
                else:
                    hypothesis[i] = '?'  # Generalize on a mismatch; once '?', it stays '?'

    return hypothesis
    

    Finding the Most Specific Hypothesis

    After processing all positive examples, the algorithm returns the most specific hypothesis that is consistent with the training data.

    
    # Run the algorithm
    hypothesis = find_s_algorithm()
    print("\nThe final hypothesis is:", hypothesis)
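
    Tracing the positive rows of the dataset above: the first positive example (Sunny, Warm, Normal, Strong) becomes the initial hypothesis; the second (Cloudy, Warm, High, Strong) generalizes Sky and Humidity, giving ['?', 'Warm', '?', 'Strong']; and the third (Sunny, Warm, Normal, Weak) generalizes Wind. The script should therefore print ['?', 'Warm', '?', '?'] as the final hypothesis.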
    

    K-Nearest Neighbors (KNN) Classification

    K-Nearest Neighbors (KNN) is a simple, non-parametric supervised machine learning algorithm used for classification and regression tasks. It classifies a data point based on the majority class of its ‘k’ nearest neighbors in the feature space.

    Implementing KNN from Scratch

    This section provides a custom Python implementation of the KNN classification algorithm and demonstrates its application on a synthetic 1D dataset.

    KNN Classification Function

    The knn_classify function takes training data, training labels, test data, and the number of neighbors (k) as input. For each test point, it calculates distances to all training points, identifies the k closest neighbors, and predicts the label based on the most frequent class among these neighbors.

    
    import numpy as np
    import matplotlib.pyplot as plt
    
    def knn_classify(train_x, train_y, test_x, k):
        """
        Classifies test_x using k-Nearest Neighbors algorithm.
        Args:
            train_x: Training data features (1D array).
            train_y: Training data labels (1D array).
            test_x: Test data features (1D array).
            k: Number of neighbors to consider.
        Returns:
            Predicted labels for test_x (1D array).
        """
        predictions = []
        for test_point in test_x:
            distances = np.abs(train_x - test_point)  # Calculate absolute distances
            nearest_indices = np.argsort(distances)[:k]  # Find indices of k nearest neighbors
            nearest_labels = train_y[nearest_indices]
            # Determine the most frequent label among the k neighbors
            unique_labels, counts = np.unique(nearest_labels, return_counts=True)
            predicted_label = unique_labels[np.argmax(counts)]
            predictions.append(predicted_label)
        return np.array(predictions)
    

    Generating Synthetic Data and Classification

    We generate a 1D synthetic dataset where the first 50 points are training data with labels based on their value (Class 1 if <= 0.5, Class 2 otherwise), and the remaining 50 points are test data. The KNN algorithm is then applied to classify these test points for various values of k.

    
    # Generate 100 random values in the range [0, 1]
    np.random.seed(42)  # For reproducibility
    x = np.random.rand(100)
    
    # Label the first 50 points
    y = np.zeros(100)
    y[:50] = np.where(x[:50] <= 0.5, 1, 2)  # Class 1 if <= 0.5, Class 2 otherwise
    y[50:] = -1  # Placeholder for the unlabeled test points; predictions are stored separately below
    
    # Split data into training and test sets
    train_x, train_y = x[:50], y[:50]
    test_x = x[50:]
    
    # Classify the remaining points using KNN for different values of k
    k_values = [1, 2, 3, 4, 5, 20, 30]
    predictions = {}
    for k in k_values:
        predictions[k] = knn_classify(train_x, train_y, test_x, k)
        # Store each k's predictions so they can be compared or plotted later.
        print(f"Predictions for k={k}: {predictions[k]}")
    

    Visualizing KNN Performance

    The results are visualized as a one-dimensional scatter plot showing the training points alongside the classified test points for a single value of k. This helps in understanding how the predicted labels change with different values of k; a sketch for plotting every k follows the code.

    
    # Plot one example: the training points and the test points classified with the largest k.
    plt.figure(figsize=(8, 6))
    plt.scatter(train_x, np.zeros_like(train_x), c=train_y, marker='o', label='Training Data')
    plt.scatter(test_x, np.zeros_like(test_x), c=predictions[k_values[-1]], marker='x', label=f'Test Data (k={k_values[-1]})')
    plt.xlabel('x')
    plt.ylabel('Class')
    plt.title(f'KNN Classification (k={k_values[-1]})')
    plt.legend()
    plt.yticks([]) # Remove y-axis ticks
    plt.show()
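
    To see how the labeling changes with k, the same plot can be generated once per k value. A minimal sketch, reusing the predictions dictionary built above:

    # One figure per k, reusing the predictions computed earlier.
    for k in k_values:
        plt.figure(figsize=(8, 3))
        plt.scatter(train_x, np.zeros_like(train_x), c=train_y, marker='o', label='Training Data')
        plt.scatter(test_x, np.zeros_like(test_x), c=predictions[k], marker='x', label=f'Test Data (k={k})')
        plt.title(f'KNN Classification (k={k})')
        plt.legend()
        plt.yticks([])
        plt.show()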