Naive Bayes Algorithm on the Iris Dataset: A Python Implementation

This document provides a Python implementation of the Naive Bayes algorithm applied to the Iris dataset. It includes functions for data loading, preprocessing, cross-validation, and algorithm evaluation.

Data Loading and Preprocessing

The code begins by defining functions to load data from a CSV file and convert string columns to numerical values. These functions are essential for preparing the Iris dataset for use with the Naive Bayes algorithm.

Load a CSV File

The load_csv(filename) function reads a CSV file and returns a list of lists representing the dataset.
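As a rough illustration, load_csv could look like the sketch below. The body is an assumption based on the description above, including the choice to skip empty lines; it uses only the standard csv module.

```python
from csv import reader


def load_csv(filename):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    dataset = []
    with open(filename, "r", newline="") as file:
        for row in reader(file):
            if not row:
                # Assumption: blank lines carry no data and are skipped.
                continue
            dataset.append(row)
    return dataset
```

Every value comes back as a string, which is why the conversion functions described next are needed.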

Convert String Column to Float

The str_column_to_float(dataset, column) function converts a specified column in the dataset from string values to float values.
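A minimal sketch of this conversion, assuming the function modifies the rows in place and that stray whitespace around a value should be ignored:

```python
def str_column_to_float(dataset, column):
    """Convert one column from strings to floats, modifying rows in place."""
    for row in dataset:
        row[column] = float(row[column].strip())


# Illustrative rows (made up for this example), with the class label last.
rows = [["5.1", " 3.5 ", "Iris-setosa"], ["4.9", "3.0", "Iris-setosa"]]
for i in range(len(rows[0]) - 1):
    str_column_to_float(rows, i)
```

The loop skips the last column because it holds the class label rather than a measurement.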

Convert String Column to Integer

The str_column_to_int(dataset, column) function converts a specified column in the dataset from string values to integer values. This is particularly useful for the class column, which contains categorical data.
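One way this mapping might be written is sketched below. Sorting the class names and returning the lookup table are assumptions made here so that the integer codes are reproducible and can be mapped back to names later.

```python
def str_column_to_int(dataset, column):
    """Replace each distinct string in a column with an integer code."""
    # Assumption: sorting gives the same code for the same label on every run.
    class_values = sorted({row[column] for row in dataset})
    lookup = {value: index for index, value in enumerate(class_values)}
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup
```

For example, a column containing only "Iris-setosa" and "Iris-virginica" would be rewritten as 0 and 1 respectively.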

Cross-Validation and Evaluation

The code implements cross-validation to evaluate the performance of the Naive Bayes algorithm. It splits the dataset into folds, trains the algorithm on a subset of the data, and tests it on the remaining fold. This process is repeated for each fold, and the accuracy scores are averaged to provide an overall performance metric.

Split a Dataset into k Folds

The cross_validation_split(dataset, n_folds) function divides the dataset into k folds for cross-validation.
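The split could be sketched as follows, assuming rows are drawn at random without replacement and that every fold has the same size. Under that assumption, any rows left over when the dataset size is not divisible by n_folds are dropped.

```python
from random import randrange, seed


def cross_validation_split(dataset, n_folds):
    """Randomly partition the rows into n_folds equally sized folds."""
    dataset_copy = list(dataset)  # leave the caller's list untouched
    fold_size = len(dataset) // n_folds
    folds = []
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            # Remove a random row so it cannot appear in two folds.
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        folds.append(fold)
    return folds
```

Calling seed with a fixed value before the split makes the fold assignment repeatable between runs.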

Calculate Accuracy Percentage

The accuracy_metric(actual, predicted) function calculates the accuracy of the predictions made by the algorithm.
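A short sketch of the metric, assuming accuracy is reported as the percentage of predictions that match the actual labels:

```python
def accuracy_metric(actual, predicted):
    """Return the percentage of predictions that equal the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0
```

With three correct predictions out of four, the function returns 75.0.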

Evaluate an Algorithm Using a Cross Validation Split

The evaluate_algorithm(dataset, algorithm, n_folds, *args) function evaluates the performance of a given algorithm using cross-validation.
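The evaluation loop might look like the sketch below. It repeats compact versions of the split and accuracy helpers so that it runs on its own, and majority_class is a hypothetical baseline introduced here only to exercise the loop. Hiding the label of each test row with None is also an assumption.

```python
from random import randrange, seed


def cross_validation_split(dataset, n_folds):
    """Randomly partition the rows into n_folds equally sized folds."""
    dataset_copy = list(dataset)
    fold_size = len(dataset) // n_folds
    folds = []
    for _ in range(n_folds):
        fold = [dataset_copy.pop(randrange(len(dataset_copy)))
                for _ in range(fold_size)]
        folds.append(fold)
    return folds


def accuracy_metric(actual, predicted):
    """Return the percentage of predictions that equal the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0


def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    """Score an algorithm once per fold, training on the other folds."""
    folds = cross_validation_split(dataset, n_folds)
    scores = []
    for i, fold in enumerate(folds):
        train_set = [row for j, other in enumerate(folds) if j != i
                     for row in other]
        # Hide the true label so the algorithm cannot read it.
        test_set = [row[:-1] + [None] for row in fold]
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        scores.append(accuracy_metric(actual, predicted))
    return scores


def majority_class(train, test):
    """Hypothetical baseline: always predict the most frequent training label."""
    labels = [row[-1] for row in train]
    most_common = max(set(labels), key=labels.count)
    return [most_common for _ in test]
```

Any function that accepts a training set and a test set and returns a list of predictions can be passed in as the algorithm argument.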

Naive Bayes Algorithm Implementation

The core of the code is the implementation of the Naive Bayes algorithm. It involves calculating probabilities based on the distribution of features in the training data and using these probabilities to make predictions on new data.

Separate the Dataset by Class Values

The separate_by_class(dataset) function groups the dataset rows by their class values.
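A minimal sketch of the grouping step, assuming the class value is stored in the last element of each row and the result is a dictionary keyed by class value:

```python
def separate_by_class(dataset):
    """Group rows into a dict keyed by the class value in the last column."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return separated
```

For a dataset with labels 0 and 1, the result has two keys, each mapping to the list of rows carrying that label.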

Calculate the Mean of a List of Numbers

The mean(numbers) function calculates the average of a list of numbers.

Calculate the Standard Deviation of a List of Numbers

The stdev(numbers) function calculates the standard deviation of a list of numbers.
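Both statistics helpers can be sketched in a few lines. The use of the sample standard deviation (dividing by n - 1 rather than n) is an assumption, since the description does not say which variant is intended.

```python
from math import sqrt


def mean(numbers):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(numbers) / float(len(numbers))


def stdev(numbers):
    """Sample standard deviation; needs at least two numbers."""
    avg = mean(numbers)
    # Assumption: divide by n - 1 (sample variance), not n.
    variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
    return sqrt(variance)
```

For the list 1, 2, 3, 4, 5 the mean is 3.0 and the sample variance is 2.5, so stdev returns the square root of 2.5.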

Calculate the Mean, Stdev and Count for Each Column in a Dataset

The summarize_dataset(dataset) function calculates summary statistics for each column in the dataset.
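A possible sketch of the per-column summary follows. To stay self-contained it uses mean and stdev from the standard statistics module as stand-ins for the helpers described above, and it assumes the last column is the class label and should be left out of the result.

```python
from statistics import mean, stdev


def summarize_dataset(dataset):
    """Return a (mean, stdev, count) tuple for every feature column."""
    # zip(*dataset) transposes the rows into columns.
    summaries = [(mean(column), stdev(column), len(column))
                 for column in zip(*dataset)]
    del summaries[-1]  # assumption: the last column is the class label
    return summaries
```

Each tuple describes one feature, so a dataset with four measurements per row yields a list of four tuples.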

Split Dataset by Class then Calculate Statistics for Each Row

The summarize_by_class(dataset) function splits the dataset by class and calculates summary statistics for each column within each class.
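Combining the grouping and summary steps might look like this sketch. It inlines the grouping logic and again uses the statistics module as a stand-in for the mean and stdev helpers; both choices are assumptions made to keep the example runnable on its own.

```python
from statistics import mean, stdev


def summarize_dataset(dataset):
    """Return a (mean, stdev, count) tuple for every feature column."""
    summaries = [(mean(column), stdev(column), len(column))
                 for column in zip(*dataset)]
    del summaries[-1]  # drop the class-label column
    return summaries


def summarize_by_class(dataset):
    """Map each class value to the feature summaries of its rows."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return {label: summarize_dataset(rows)
            for label, rows in separated.items()}
```

Each class needs at least two rows, because the sample standard deviation is undefined for a single value.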

Calculate the Gaussian Probability Distribution Function for x

The calculate_probability(x, mean, stdev) function evaluates the Gaussian probability density at a given value x for a specified mean and standard deviation. The result is a density rather than a true probability, so it can exceed 1 when the standard deviation is small.
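The Gaussian density has a standard closed form, so the function can be sketched directly from it. The sketch assumes stdev is greater than zero; a feature that is constant within a class would cause a division by zero.

```python
from math import exp, pi, sqrt


def calculate_probability(x, mean, stdev):
    """Gaussian density: exp(-(x - mean)^2 / (2 stdev^2)) / (sqrt(2 pi) stdev)."""
    exponent = exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent
```

The density peaks at x equal to the mean and is symmetric around it, so values the same distance above and below the mean score identically.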

Calculate the Probabilities of Predicting Each Class for a Given Row

The calculate_class_probabilities(summaries, row) function calculates the probability of each class for a given data row.
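A sketch of the per-class calculation is shown below. It assumes summaries maps each class value to a list of (mean, stdev, count) tuples, one per feature, and that the class prior is estimated from the row counts. The returned values are unnormalized scores, which is enough for comparing classes.

```python
from math import exp, pi, sqrt


def calculate_probability(x, mean, stdev):
    """Gaussian density of x for the given mean and standard deviation."""
    exponent = exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent


def calculate_class_probabilities(summaries, row):
    """Score each class as prior times the product of feature densities."""
    # Every feature tuple of a class carries the same row count.
    total_rows = sum(stats[0][2] for stats in summaries.values())
    probabilities = {}
    for label, stats in summaries.items():
        probabilities[label] = stats[0][2] / float(total_rows)  # class prior
        for i, (mu, sigma, _) in enumerate(stats):
            probabilities[label] *= calculate_probability(row[i], mu, sigma)
    return probabilities
```

Multiplying the densities together is where the "naive" independence assumption enters: each feature is treated as independent of the others given the class.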

Predict the Class for a Given Row

The predict(summaries, row) function predicts the class label for a given data row based on the calculated probabilities.
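Prediction then reduces to picking the class with the highest score. The sketch below repeats compact versions of the scoring helpers so it can run by itself; the summaries layout (class value mapped to a list of (mean, stdev, count) tuples) is an assumption.

```python
from math import exp, pi, sqrt


def calculate_probability(x, mean, stdev):
    """Gaussian density of x for the given mean and standard deviation."""
    exponent = exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent


def calculate_class_probabilities(summaries, row):
    """Score each class as prior times the product of feature densities."""
    total_rows = sum(stats[0][2] for stats in summaries.values())
    probabilities = {}
    for label, stats in summaries.items():
        probabilities[label] = stats[0][2] / float(total_rows)
        for i, (mu, sigma, _) in enumerate(stats):
            probabilities[label] *= calculate_probability(row[i], mu, sigma)
    return probabilities


def predict(summaries, row):
    """Return the class value with the highest score for this row."""
    probabilities = calculate_class_probabilities(summaries, row)
    return max(probabilities, key=probabilities.get)
```

With one class centered at 1.0 and another at 5.0, a row whose feature is 1.2 is assigned to the first class and a row at 4.7 to the second.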

Naive Bayes Algorithm

The naive_bayes(train, test) function implements the Naive Bayes algorithm for classification.
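Putting the pieces together, the whole classifier might be sketched as follows. This is a condensed, self-contained version rather than the exact code described above: it folds the helpers into three functions, uses the statistics module for the mean and standard deviation, and assumes the class label sits in the last column.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev


def summarize_by_class(dataset):
    """Map each class value to (mean, stdev, count) tuples per feature."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    summaries = {}
    for label, rows in separated.items():
        columns = list(zip(*rows))[:-1]  # feature columns only
        summaries[label] = [(mean(c), stdev(c), len(c)) for c in columns]
    return summaries


def calculate_probability(x, mu, sigma):
    """Gaussian density of x for the given mean and standard deviation."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)


def predict(summaries, row):
    """Return the class with the highest prior-times-likelihood score."""
    total = sum(stats[0][2] for stats in summaries.values())
    scores = {}
    for label, stats in summaries.items():
        scores[label] = stats[0][2] / float(total)
        for i, (mu, sigma, _) in enumerate(stats):
            scores[label] *= calculate_probability(row[i], mu, sigma)
    return max(scores, key=scores.get)


def naive_bayes(train, test):
    """Fit class summaries on train, then predict a label for each test row."""
    summaries = summarize_by_class(train)
    return [predict(summaries, row) for row in test]
```

Training amounts to computing the per-class summaries once; all remaining work happens at prediction time.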

Testing the Algorithm

The code concludes by testing the Naive Bayes algorithm on the Iris dataset. It loads the dataset, preprocesses it, and evaluates the algorithm using 5-fold cross-validation. The accuracy scores for each fold and the mean accuracy are printed to the console.

Test Naive Bayes on Iris Dataset

from random import seed  # required for seed(); the other functions are those described above

# fix the random seed so the fold assignment is reproducible
seed(1)
filename = 'iris.csv'
dataset = load_csv(filename)
# convert feature columns to floats
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
# evaluate algorithm
n_folds = 5
scores = evaluate_algorithm(dataset, naive_bayes, n_folds)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))