Naive Bayes Algorithm on the Iris Dataset: A Python Implementation
This document provides a Python implementation of the Naive Bayes algorithm applied to the Iris dataset. It includes functions for data loading, preprocessing, cross-validation, and algorithm evaluation.
Data Loading and Preprocessing
The code begins by defining functions to load data from a CSV file and convert string columns to numerical values. These functions are essential for preparing the Iris dataset for use with the Naive Bayes algorithm.
Load a CSV File
The load_csv(filename) function reads a CSV file and returns a list of lists representing the dataset.
Convert String Column to Float
The str_column_to_float(dataset, column) function converts a specified column in the dataset from string values to float values.
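The conversion could be sketched as follows. The in-place update and the stripping of surrounding whitespace are assumptions, and the sample rows are made up for illustration.

```python
def str_column_to_float(dataset, column):
    """Convert one column of string values to floats, modifying rows in place."""
    for row in dataset:
        row[column] = float(row[column].strip())


# Made-up sample rows: two feature columns followed by a class label.
rows = [['5.1', ' 3.5', 'Iris-setosa'], ['4.9', '3.0 ', 'Iris-setosa']]
for i in range(len(rows[0]) - 1):
    str_column_to_float(rows, i)
print(rows)
```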
Convert String Column to Integer
The str_column_to_int(dataset, column) function converts a specified column in the dataset from string values to integer values. This is particularly useful for the class column, which contains categorical data.
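One possible way to write this function is shown below. Sorting the labels so that the mapping is deterministic, and returning the lookup table, are choices made for this sketch and are not stated in the description above.

```python
def str_column_to_int(dataset, column):
    """Replace each distinct string in a column with an integer code, in place.

    Returns the string-to-integer lookup so predictions can be decoded later.
    """
    # Assumption: labels are sorted so the same label always gets the same code.
    unique = sorted(set(row[column] for row in dataset))
    lookup = {value: code for code, value in enumerate(unique)}
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup


# Made-up sample rows with the class label in the last column.
rows = [[5.1, 'Iris-setosa'], [7.0, 'Iris-versicolor'],
        [6.3, 'Iris-virginica'], [4.9, 'Iris-setosa']]
lookup = str_column_to_int(rows, len(rows[0]) - 1)
print(lookup)
```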
Cross-Validation and Evaluation
The code implements k-fold cross-validation to evaluate the performance of the Naive Bayes algorithm. It splits the dataset into folds, trains the algorithm on all folds but one, and tests it on the held-out fold. This process is repeated so that each fold serves as the test set once, and the accuracy scores are averaged to provide an overall performance metric.
Split a Dataset into k Folds
The cross_validation_split(dataset, n_folds) function divides the dataset into n_folds folds (the k in k-fold cross-validation).
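A sketch of the split, assuming rows are drawn at random without replacement and that every fold has the same size, is given below. Under that assumption any leftover rows are dropped when the dataset size is not divisible by n_folds.

```python
from random import randrange, seed


def cross_validation_split(dataset, n_folds):
    """Split a dataset into n_folds equally sized folds of randomly chosen rows."""
    dataset_copy = list(dataset)
    fold_size = len(dataset) // n_folds
    folds = []
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            # Remove the chosen row from the pool so it cannot be picked twice.
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        folds.append(fold)
    return folds


seed(1)
# Ten made-up rows split into 3 folds of 3 rows each; one row is left out.
folds = cross_validation_split([[i] for i in range(10)], 3)
print(folds)
```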
Calculate Accuracy Percentage
The accuracy_metric(actual, predicted) function calculates the accuracy of the predictions made by the algorithm.
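Accuracy here is taken to be the percentage of predictions that match the actual labels, which agrees with the percent sign in the final printout. A minimal sketch:

```python
def accuracy_metric(actual, predicted):
    """Return the percentage of predictions that equal the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0


print(accuracy_metric([0, 1, 2, 1], [0, 1, 1, 1]))  # 3 of 4 correct: 75.0
```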
Evaluate an Algorithm Using a Cross Validation Split
The evaluate_algorithm(dataset, algorithm, n_folds, *args) function evaluates the performance of a given algorithm using cross-validation.
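The evaluation loop might be sketched as follows. The two helpers are repeated so that the example runs on its own, and majority_class is a hypothetical stand-in algorithm used only to exercise the loop. Hiding the label of each test row by replacing it with None is an assumption of this sketch.

```python
from random import randrange, seed


def cross_validation_split(dataset, n_folds):
    """Split a dataset into n_folds equally sized folds of randomly chosen rows."""
    dataset_copy = list(dataset)
    fold_size = len(dataset) // n_folds
    folds = []
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            fold.append(dataset_copy.pop(randrange(len(dataset_copy))))
        folds.append(fold)
    return folds


def accuracy_metric(actual, predicted):
    """Return the percentage of predictions that equal the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0


def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    """Return one accuracy score per fold for algorithm(train, test, *args)."""
    folds = cross_validation_split(dataset, n_folds)
    scores = []
    for i, fold in enumerate(folds):
        # Train on every fold except the held-out one.
        train_set = [row for j, f in enumerate(folds) if j != i for row in f]
        # Hide the class label from the algorithm in the test rows.
        test_set = [list(row[:-1]) + [None] for row in fold]
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        scores.append(accuracy_metric(actual, predicted))
    return scores


def majority_class(train, test):
    """Hypothetical baseline: always predict the most common training label."""
    labels = [row[-1] for row in train]
    most_common = max(set(labels), key=labels.count)
    return [most_common for _ in test]


seed(1)
# Made-up data: eight rows of class 0 and two rows of class 1.
data = [[float(i), 0] for i in range(8)] + [[float(i), 1] for i in range(2)]
scores = evaluate_algorithm(data, majority_class, 5)
print(scores)
```

Because any callable that accepts a training set and a test set can be passed in, the same loop can score naive_bayes or any other classifier.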
Naive Bayes Algorithm Implementation
The core of the code is the implementation of the Naive Bayes algorithm. It involves calculating probabilities based on the distribution of features in the training data and using these probabilities to make predictions on new data.
Separate the Dataset by Class Values
The separate_by_class(dataset) function groups the dataset rows by their class values.
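Assuming the class value is stored in the last column of each row, as in the Iris data prepared above, the grouping could be sketched like this:

```python
from collections import defaultdict


def separate_by_class(dataset):
    """Group rows into a dict keyed by the class value in the last column."""
    separated = defaultdict(list)
    for row in dataset:
        separated[row[-1]].append(row)
    return dict(separated)


# Made-up rows: two features followed by an integer class label.
rows = [[5.1, 3.5, 0], [7.0, 3.2, 1], [4.9, 3.0, 0]]
print(separate_by_class(rows))
```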
Calculate the Mean of a List of Numbers
The mean(numbers) function calculates the average of a list of numbers.
Calculate the Standard Deviation of a List of Numbers
The stdev(numbers) function calculates the standard deviation of a list of numbers.
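The mean and stdev helpers can be sketched together. The description does not say whether the sample or the population standard deviation is intended; this sketch assumes the sample version, which divides by n - 1 and therefore needs at least two values.

```python
from math import sqrt


def mean(numbers):
    """Arithmetic mean of a list of numbers."""
    return sum(numbers) / float(len(numbers))


def stdev(numbers):
    """Sample standard deviation (n - 1 denominator, an assumption here)."""
    avg = mean(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
    return sqrt(variance)


values = [1, 2, 3, 4, 5]
print(mean(values))   # 3.0
print(stdev(values))  # square root of 2.5, about 1.5811
```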
Calculate the Mean, Stdev and Count for Each Column in a Dataset
The summarize_dataset(dataset) function calculates the mean, standard deviation, and count for each column in the dataset.
Split Dataset by Class, Then Calculate Statistics for Each Column
The summarize_by_class(dataset) function groups the rows by class and then calculates the same summary statistics for each class separately.
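The two summary functions, summarize_dataset and summarize_by_class, might be sketched together as follows. Excluding the class column from the statistics and inlining the grouping step are choices of this sketch, and the sample rows are made up. Each class needs at least two rows for the sample standard deviation to be defined.

```python
from math import sqrt


def mean(numbers):
    return sum(numbers) / float(len(numbers))


def stdev(numbers):
    # Sample standard deviation (n - 1 denominator); an assumption of this sketch.
    avg = mean(numbers)
    return sqrt(sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1))


def summarize_dataset(dataset):
    """Return one (mean, stdev, count) tuple per feature column."""
    feature_columns = list(zip(*dataset))[:-1]  # drop the class column
    return [(mean(col), stdev(col), len(col)) for col in feature_columns]


def summarize_by_class(dataset):
    """Map each class value to the column summaries of its rows."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return {label: summarize_dataset(rows) for label, rows in separated.items()}


# Made-up rows: two features followed by an integer class label.
rows = [[1.0, 10.0, 0], [3.0, 30.0, 0], [10.0, 5.0, 1], [14.0, 7.0, 1]]
summaries = summarize_by_class(rows)
print(summaries)
```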
Calculate the Gaussian Probability Distribution Function for x
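The calculate_probability(x, mean, stdev) function calculates the probability density of a given value x under a Gaussian distribution with a specified mean and standard deviation. The result is a density, not a probability, so it can exceed 1 when the standard deviation is small.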
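A direct transcription of the Gaussian density formula is sketched below. The function assumes stdev is greater than zero; a feature that is constant within a class would cause a division by zero.

```python
from math import exp, pi, sqrt


def calculate_probability(x, mean, stdev):
    """Gaussian density at x for the given mean and standard deviation."""
    exponent = exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent


print(calculate_probability(1.0, 1.0, 1.0))  # peak of the standard curve, about 0.3989
print(calculate_probability(2.0, 1.0, 1.0))  # one standard deviation away, about 0.2420
```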
Calculate the Probabilities of Predicting Each Class for a Given Row
The calculate_class_probabilities(summaries, row) function calculates the probability of each class for a given data row.
Predict the Class for a Given Row
The predict(summaries, row) function predicts the class label for a given data row based on the calculated probabilities.
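The scoring and prediction steps could be sketched together. This sketch assumes that summaries maps each class value to a list of (mean, stdev, count) tuples, one per feature, and that the class prior is estimated from the row counts. The scores are the prior multiplied by the feature densities, so they are not normalized to sum to 1; only their relative size matters for the prediction.

```python
from math import exp, pi, sqrt


def calculate_probability(x, mean, stdev):
    """Gaussian density at x for the given mean and standard deviation."""
    exponent = exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent


def calculate_class_probabilities(summaries, row):
    """Return an unnormalized score per class: prior times feature densities."""
    total_rows = sum(stats[0][2] for stats in summaries.values())
    probabilities = {}
    for class_value, stats in summaries.items():
        # Prior P(class), estimated from how many training rows had this class.
        probabilities[class_value] = stats[0][2] / float(total_rows)
        # Naive assumption: features are independent, so densities multiply.
        for i, (mu, sigma, _) in enumerate(stats):
            probabilities[class_value] *= calculate_probability(row[i], mu, sigma)
    return probabilities


def predict(summaries, row):
    """Return the class value with the highest score for this row."""
    probabilities = calculate_class_probabilities(summaries, row)
    return max(probabilities, key=probabilities.get)


# Hand-made summaries for one feature: class 0 centred at 1.0, class 1 at 5.0.
summaries = {0: [(1.0, 1.0, 2)], 1: [(5.0, 1.0, 2)]}
print(calculate_class_probabilities(summaries, [1.0]))
print(predict(summaries, [1.2]), predict(summaries, [4.5]))
```

For data with many features, summing log densities instead of multiplying raw densities is a common way to avoid floating-point underflow; the product form is kept here because it follows the description directly.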
Naive Bayes Algorithm
The naive_bayes(train, test) function implements the Naive Bayes algorithm for classification.
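Putting the pieces together, naive_bayes might summarize the training rows by class and then predict a label for every test row. The sketch below is a compact, self-contained version; the helper bodies are assumptions, condensed for brevity, and the small training set is made up.

```python
from math import exp, pi, sqrt


def mean(numbers):
    return sum(numbers) / float(len(numbers))


def stdev(numbers):
    # Sample standard deviation (n - 1 denominator); an assumption of this sketch.
    avg = mean(numbers)
    return sqrt(sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1))


def summarize_by_class(dataset):
    """Map each class value to (mean, stdev, count) tuples, one per feature."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return {
        label: [(mean(col), stdev(col), len(col)) for col in list(zip(*rows))[:-1]]
        for label, rows in separated.items()
    }


def calculate_probability(x, mean, stdev):
    exponent = exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent


def predict(summaries, row):
    """Return the class with the largest prior-times-densities score."""
    total_rows = sum(stats[0][2] for stats in summaries.values())
    scores = {}
    for class_value, stats in summaries.items():
        scores[class_value] = stats[0][2] / float(total_rows)
        for i, (mu, sigma, _) in enumerate(stats):
            scores[class_value] *= calculate_probability(row[i], mu, sigma)
    return max(scores, key=scores.get)


def naive_bayes(train, test):
    """Fit class summaries on train, then predict a label for each test row."""
    summaries = summarize_by_class(train)
    return [predict(summaries, row) for row in test]


# Made-up training rows: two features followed by an integer class label.
train = [[1.0, 2.0, 0], [1.2, 1.8, 0], [0.8, 2.2, 0],
         [5.0, 8.0, 1], [5.5, 7.5, 1], [4.5, 8.5, 1]]
# Test rows carry None in the label position, which predict never reads.
test = [[1.1, 2.1, None], [5.2, 7.9, None]]
predictions = naive_bayes(train, test)
print(predictions)  # [0, 1]
```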
Testing the Algorithm
The code concludes by testing the Naive Bayes algorithm on the Iris dataset. It loads the dataset, preprocesses it, and evaluates the algorithm using 5-fold cross-validation. The accuracy scores for each fold and the mean accuracy are printed to the console.
Test Naive Bayes on Iris Dataset
from random import seed

# fix the random seed so the fold assignment is reproducible
seed(1)
filename = 'iris.csv'
dataset = load_csv(filename)
# convert feature columns to floats
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
# evaluate algorithm
n_folds = 5
scores = evaluate_algorithm(dataset, naive_bayes, n_folds)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))
