Data Science and Machine Learning Implementation in Python

Statistical Analytics with Pandas

Import library
import pandas as pd

Part 1: Descriptive Statistics

Load dataset
df = pd.read_csv("stat.csv")

print("First 5 rows:")
print(df.head())

Group by the categorical column (income) and summarize the numeric column (age)
print("\nSummary Statistics (Grouped by income):")
print(df.groupby("income")["age"].describe())

Individual Statistics

print("\nMean:")
print(df.groupby("income")["age"].mean())

print("\nMedian:")
print(df.groupby("income")["age"].median())

print("\nMin:")
print(df.groupby("income")["age"].min())

print("\nMax:")
print(df.groupby("income")["age"].max())

print("\nStandard Deviation:")
print(df.groupby("income")["age"].std())

List of values for each category
value_list = df.groupby("income")["age"].apply(list)
print("\nList of values:")
print(value_list)
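
The separate calls above can also be collected into a single table with agg. The following is a minimal sketch, assuming a small made-up frame in place of stat.csv; the values and the name `sample` are illustrative, not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for stat.csv; the values are made up for illustration
sample = pd.DataFrame({
    "income": ["low", "low", "high", "high", "high"],
    "age": [22, 30, 41, 45, 52],
})

# One pass over the groups computes every statistic listed above
summary = sample.groupby("income")["age"].agg(["mean", "median", "min", "max", "std"])
print(summary)

# Same idea as apply(list): collect the raw ages for each income group
ages_by_income = sample.groupby("income")["age"].apply(list)
print(ages_by_income)
```

Each row of `summary` is one income group and each column is one statistic, which is usually easier to compare than five separate printouts.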

Part 2: Iris Dataset Analysis

Load Iris dataset
iris = pd.read_csv("Iris.csv")

Iris-setosa
print("\nIris-setosa")
print(iris[iris["Species"] == "Iris-setosa"].describe())

Iris-versicolor
print("\nIris-versicolor")
print(iris[iris["Species"] == "Iris-versicolor"].describe())

Iris-virginica
print("\nIris-virginica")
print(iris[iris["Species"] == "Iris-virginica"].describe())
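
The three filters above repeat the same step once per species. A sketch of the same summary driven by groupby, assuming a tiny made-up frame in place of Iris.csv; the column name `SepalLengthCm` and the measurements are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for Iris.csv with one numeric column
iris_sample = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Species": [
        "Iris-setosa", "Iris-setosa",
        "Iris-versicolor", "Iris-versicolor",
        "Iris-virginica", "Iris-virginica",
    ],
})

# Loop over the groups instead of writing one filter per species
for species, group in iris_sample.groupby("Species"):
    print(f"\n{species}")
    print(group.describe())

# Or build one table with a row of statistics per species
per_species = iris_sample.groupby("Species")["SepalLengthCm"].describe()
print(per_species)
```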

Text Analytics and Natural Language Processing

1. Import Libraries

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

2. Download NLTK Data

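nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
# Newer NLTK releases look up these resource names instead of the two above
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')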

3. Load Dataset and Extract Text

df = pd.read_csv("ta1.csv")
print("First 5 Rows:")
print(df.head())

text = df['text'].iloc[0]  # first row by position, regardless of index labels
print("\nOriginal Text:")
print(text)

4. Text Preprocessing Steps

  • Tokenization: tokens = word_tokenize(text)
  • Stopwords Removal: stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in tokens if word.lower() not in stop_words]
  • POS Tagging: pos_tags = pos_tag(filtered_words)
  • Stemming: stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in filtered_words]
  • Lemmatization: lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
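
To see what the first two steps do without downloading any NLTK data, here is a dependency-free sketch. It is a stand-in, not NLTK: the regex tokenizer, the tiny `STOP_WORDS` set, and the sample sentence are illustrative assumptions and much cruder than word_tokenize and stopwords.words('english').

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (NLTK's English list is far longer)
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "on"}

def simple_tokenize(text):
    """Split text into word tokens with a regex (rough stand-in for word_tokenize)."""
    return re.findall(r"[A-Za-z0-9']+", text)

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

sample_text = "The cat is on the mat and the dog is in the garden"
tokens = simple_tokenize(sample_text)
filtered = remove_stopwords(tokens)

print(tokens)
print(filtered)
print(Counter(w.lower() for w in filtered).most_common(3))
```

Comparing `tokens` with `filtered` shows why stopword removal comes before stemming and lemmatization: most of the 13 tokens carry no topical meaning.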

5. TF-IDF Vectorization

documents = df['text'].astype(str)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())

Big Data Processing with Scala and Spark

Word Count Implementation

  1. Start Spark Shell: spark-shell (If successful: Spark context available as ‘sc’)
  2. Create Input File: nano input.txt
  3. Add Sample Data:
    ERROR Disk failure
    INFO System started
    WARN Memory low
    ERROR Disk failure
    INFO Login successful
    WARN CPU high
  4. Save and Exit: CTRL + O, ENTER, CTRL + X

Spark Scala Code

// Read input file
val inputFile = sc.textFile("input.txt")

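// Word Count using MapReduce logic
// (if the shell rejects the lines that start with a dot, enter the block via :paste)
val counts = inputFile
  .flatMap(line => line.split(" "))  // split every line into words
  .map(word => (word, 1))            // pair each word with a count of 1
  .reduceByKey(_ + _)                // sum the counts per word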

// Display and save output
counts.collect().foreach(println)
counts.saveAsTextFile("output")
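
To trace what each stage of the chain produces, the same three steps can be mirrored locally in plain Python. This is only an illustration of the logic on the sample lines from step 3, not Spark code; nothing here is distributed.

```python
from collections import Counter
from itertools import chain

# The sample log lines from input.txt
lines = [
    "ERROR Disk failure",
    "INFO System started",
    "WARN Memory low",
    "ERROR Disk failure",
    "INFO Login successful",
    "WARN CPU high",
]

# flatMap: split each line and flatten the pieces into one stream of words
words = chain.from_iterable(line.split(" ") for line in lines)

# map: pair every word with a count of 1
pairs = ((word, 1) for word in words)

# reduceByKey: add up the counts that share the same word
counts = Counter()
for word, n in pairs:
    counts[word] += n

for word, n in counts.items():
    print((word, n))
```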

Check Results (exit the Spark shell with :quit, then run the rest in the terminal)

:quit
cd output
ls
cat part-*

Basic RDD Transformations

val numbers = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
val evenNumbers = numbers.filter(x => x % 2 == 0)
val squares = evenNumbers.map(x => x * x)
squares.collect.foreach(println)
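
RDD transformations such as filter and map are lazy, and nothing is computed until an action such as collect runs. Python's built-in filter and map behave similarly, so the pipeline can be sketched locally as follows (an analogy in plain Python, not Spark code):

```python
numbers = range(1, 11)

# Transformations: both return lazy iterators, nothing is computed yet
even_numbers = filter(lambda x: x % 2 == 0, numbers)
squares = map(lambda x: x * x, even_numbers)

# Action: materializing the iterator plays the role of collect
result = list(squares)
print(result)  # [4, 16, 36, 64, 100]
```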

Data Visualization Techniques

1. Titanic Dataset Analysis

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('titanic')

Visualizing Distributions

  • Bar Plot (Sex vs Age): sns.barplot(x='sex', y='age', data=df)
  • Count Plot (Survival): sns.countplot(x='survived', data=df)
  • Fare Histogram: sns.histplot(df['fare'], bins=30)

Advanced Statistical Plots

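sns.boxplot(x='sex', y='age', data=df, hue='survived')
plt.show()  # finish each figure before starting the next so the plots do not overlap
sns.violinplot(x='sex', y='age', hue='survived', data=df)
plt.show()
sns.stripplot(x='sex', y='age', data=df, hue='survived')
plt.show()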
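
When several of these plots should appear side by side, each one can be drawn on its own axes. A minimal sketch, assuming a small made-up frame with the same column names as the Titanic data (the `passengers` values are illustrative) and a headless backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical passengers, only to make the sketch self-contained
passengers = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male", "male", "female"],
    "age": [22, 38, 26, 35, 27, 54, 2, 14],
    "survived": [0, 1, 1, 0, 1, 0, 1, 0],
})

# One figure, two axes: pass ax= so each plot lands in its own panel
fig, (ax_box, ax_strip) = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x="sex", y="age", hue="survived", data=passengers, ax=ax_box)
sns.stripplot(x="sex", y="age", hue="survived", data=passengers, ax=ax_strip)
ax_box.set_title("Age by sex (box plot)")
ax_strip.set_title("Age by sex (strip plot)")
fig.tight_layout()
```

In an interactive session, follow this with plt.show() to display the figure.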

2. Iris Dataset Visualization

df = sns.load_dataset('iris')
df.hist(bins=20, figsize=(10,8))
sns.boxplot(data=df)
sns.pairplot(df)
sns.jointplot(data=df, x='sepal_length', y='sepal_width')  # jointplot needs an explicit x and y column

Machine Learning Models

Linear Regression for Housing Prices

Preprocessing:
df = pd.read_csv("boston_housing.csv")
df.fillna(df.median(numeric_only=True), inplace=True)
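
The effect of the median fill is easiest to see on a tiny frame. A sketch with made-up values; the frame `housing` and its numbers are assumptions, not rows from boston_housing.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing value per column
housing = pd.DataFrame({
    "RM": [6.0, np.nan, 7.0],
    "LSTAT": [12.0, 9.0, np.nan],
})

# Column-wise medians, computed while ignoring the missing entries
medians = housing.median(numeric_only=True)
print(medians)

# Each NaN is replaced by the median of its own column
filled = housing.fillna(medians)
print(filled)
```

The median is less sensitive to outliers than the mean, which is a common reason to prefer it for imputing skewed columns.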

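Model Training:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df.drop('MEDV', axis=1)  # features: every column except the target
y = df['MEDV']               # target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)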
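
A fitted model is usually checked on the held-out split. The sketch below shows one way to do that, assuming synthetic data in place of boston_housing.csv; `true_coef`, the noise level, and the choice of MSE and R² as metrics are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: a known linear relationship plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([3.0, -2.0, 0.5])
y = X @ true_coef + 4.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Error and goodness of fit on data the model has not seen
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}  R^2: {r2:.4f}")
print("Coefficients:", model.coef_)
```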

Logistic Regression for Classification

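Preprocessing and Scaling:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
df = pd.read_csv("Social_Network_Ads.csv")
df.drop(['User ID', 'Gender'], axis=1, inplace=True)
X, y = df.drop('Purchased', axis=1), df['Purchased']  # assumes the target column is named 'Purchased'
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  # split size is an assumption
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on the training split only
X_test = scaler.transform(X_test)        # reuse the training statistics on the test split

Evaluation:
model = LogisticRegression().fit(X_train, y_train)  # train on the scaled features
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)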
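
The scaling, training, and evaluation steps can also be chained in a pipeline so the scaler is never fitted on test data by accident. A sketch under stated assumptions: the data comes from make_classification rather than Social_Network_Ads.csv, and the name `clf` and the split size are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature binary classification problem
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The pipeline fits the scaler on the training split and reuses it at predict time
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows of the confusion matrix are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print(f"Accuracy: {accuracy:.3f}")
```

Accuracy equals the sum of the diagonal of `cm` divided by the total number of test samples.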

Naive Bayes Classification

Implementation:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

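Performance Metrics:
from sklearn.metrics import precision_score, recall_score

# 'macro' averages the per-class scores with equal weight for every class
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')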
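
To make the macro average concrete, the sketch below computes the per-class precision and recall and compares their mean with the macro score. It assumes scikit-learn's bundled iris data as a stand-in, since the source does not say which dataset the Naive Bayes model uses; the name `nb` and the stratified split are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Three-class dataset shipped with scikit-learn (no download needed)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

nb = GaussianNB().fit(X_train, y_train)
y_pred = nb.predict(X_test)

# average=None returns one score per class
per_class_precision = precision_score(y_test, y_pred, average=None)
per_class_recall = recall_score(y_test, y_pred, average=None)

# average='macro' is the unweighted mean of those per-class scores
macro_precision = precision_score(y_test, y_pred, average='macro')
macro_recall = recall_score(y_test, y_pred, average='macro')

print("Per-class precision:", per_class_precision)
print("Macro precision:", macro_precision)
print("Per-class recall:", per_class_recall)
print("Macro recall:", macro_recall)
```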