Data Science and Machine Learning Implementation in Python
Statistical Analytics with Pandas
# Import library
import pandas as pd
Part 1: Descriptive Statistics
# Load dataset
df = pd.read_csv("stat.csv")
print("First 5 rows:")
print(df.head())
# Group by the categorical column (income) and summarize the numeric column (age)
print("\nSummary Statistics (Grouped by income):")
print(df.groupby("income")["age"].describe())
Individual Statistics
print("\nMean:")
print(df.groupby("income")["age"].mean())
print("\nMedian:")
print(df.groupby("income")["age"].median())
print("\nMin:")
print(df.groupby("income")["age"].min())
print("\nMax:")
print(df.groupby("income")["age"].max())
print("\nStandard Deviation:")
print(df.groupby("income")["age"].std())
# List of values for each category
value_list = df.groupby("income")["age"].apply(list)
print("\nList of values:")
print(value_list)
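The separate mean, median, min, max, and standard deviation calls above can also be collected in a single `agg` call. A minimal sketch, assuming a small hypothetical frame that reuses the `income` and `age` column names from `stat.csv` (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for stat.csv; the values are invented for illustration.
df = pd.DataFrame({
    "income": ["low", "low", "high", "high", "high"],
    "age": [25, 35, 40, 50, 60],
})

# One pass over the groups, one output column per statistic.
summary = df.groupby("income")["age"].agg(["mean", "median", "min", "max", "std"])
print(summary)
```

The result has one row per income group, which is easier to compare than five separate printouts.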
Part 2: Iris Dataset Analysis
# Load Iris dataset
iris = pd.read_csv("Iris.csv")
# Iris-setosa
print("\nIris-setosa")
print(iris[iris["Species"] == "Iris-setosa"].describe())
# Iris-versicolor
print("\nIris-versicolor")
print(iris[iris["Species"] == "Iris-versicolor"].describe())
# Iris-virginica
print("\nIris-virginica")
print(iris[iris["Species"] == "Iris-virginica"].describe())
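The three filters above repeat the same pattern, so a loop over `groupby` can produce the same per-species summaries. This is a sketch that assumes a `Species` column as used above; the tiny frame and the `SepalLengthCm` column name are hypothetical stand-ins for `Iris.csv`:

```python
import pandas as pd

# Hypothetical miniature of Iris.csv; only the Species column name is taken from the notes.
iris = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Species": [
        "Iris-setosa", "Iris-setosa",
        "Iris-versicolor", "Iris-versicolor",
        "Iris-virginica", "Iris-virginica",
    ],
})

# groupby yields (species name, sub-frame) pairs, one per distinct species.
summaries = {}
for species, group in iris.groupby("Species"):
    summaries[species] = group.describe()
    print(f"\n{species}")
    print(summaries[species])
```

The loop also picks up any additional species label in the file without further code changes.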
Text Analytics and Natural Language Processing
1. Import Libraries
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
2. Download NLTK Data
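nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
# NLTK 3.9 and later look up the tokenizer and tagger under these names instead
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')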
3. Load Dataset and Extract Text
df = pd.read_csv("ta1.csv")
print("First 5 Rows:")
print(df.head())
text = df['text'][0]
print("\nOriginal Text:")
print(text)
4. Text Preprocessing Steps
- Tokenization:
tokens = word_tokenize(text)
- Stopwords Removal:
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in tokens if word.lower() not in stop_words]
- POS Tagging:
pos_tags = pos_tag(filtered_words)
- Stemming:
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
- Lemmatization:
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
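`WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, so a verb form such as "running" comes back unchanged. One common remedy is to map the Penn Treebank tags produced by `pos_tag` to WordNet's single-letter codes. The helper name `penn_to_wordnet` is hypothetical, introduced here for illustration, and the tagged pairs are hand-written examples rather than real tagger output:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r', or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, the lemmatizer's default


# Hand-written (word, tag) pairs standing in for the output of pos_tag.
tagged = [("running", "VBG"), ("dogs", "NNS"), ("better", "JJR"), ("quickly", "RB")]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

With the variables from the steps above, the mapping could be applied as `[lemmatizer.lemmatize(w, pos=penn_to_wordnet(t)) for w, t in pos_tags]`.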
5. TF-IDF Vectorization
documents = df['text'].astype(str)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())
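To make the numbers in the matrix less opaque, the weighting can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row) and uses three made-up documents, so it illustrates the formula rather than replacing the library:

```python
import math
import re
from collections import Counter

# Hypothetical documents, not taken from ta1.csv.
docs = ["disk failure detected", "system started", "disk space low"]

# Lowercase and keep tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
vocab = sorted({token for tokens in tokenized for token in tokens})

# Document frequency: in how many documents each term appears.
n_docs = len(docs)
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}


def tfidf_row(tokens):
    """Return the L2-normalized tf-idf weights for one document."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]


matrix = [tfidf_row(tokens) for tokens in tokenized]
print(vocab)
for doc, row in zip(docs, matrix):
    print(doc, [round(value, 3) for value in row])
```

A term that occurs in several documents ("disk") ends up with a lower weight than one unique to a document ("failure"). In scikit-learn itself, `vectorizer.get_feature_names_out()` returns the vocabulary that labels the matrix columns.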
Big Data Processing with Scala and Spark
Word Count Implementation
- Start Spark Shell:
spark-shell
(If successful: Spark context available as 'sc')
- Create Input File:
nano input.txt
- Add Sample Data:
ERROR Disk failure
INFO System started
WARN Memory low
ERROR Disk failure
INFO Login successful
WARN CPU high
- Save and Exit: CTRL + O, ENTER, CTRL + X
Spark Scala Code
// Read input file
val inputFile = sc.textFile("input.txt")
// Word Count using MapReduce logic
val counts = inputFile
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
// Display and save output
counts.collect().foreach(println)
counts.saveAsTextFile("output")
Check Results
:quit
cd output
ls
cat part-*  # Spark writes one part file per partition
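For a quick sanity check of the Spark output, the same flatMap, map, and reduceByKey logic can be mirrored in plain Python on the sample data above. This is only a comparison aid, not Spark code, and Spark itself prints the pairs in no guaranteed order:

```python
from collections import Counter

# The sample lines from input.txt.
lines = [
    "ERROR Disk failure",
    "INFO System started",
    "WARN Memory low",
    "ERROR Disk failure",
    "INFO Login successful",
    "WARN CPU high",
]

# flatMap: split every line into words and flatten into one list.
words = [word for line in lines for word in line.split(" ")]

# map + reduceByKey: Counter pairs each word with a count and sums per key.
counts = Counter(words)

for word, n in sorted(counts.items()):
    print((word, n))
```

Each of the tokens ERROR, INFO, WARN, Disk, and failure should appear with a count of 2, and every other word with a count of 1.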
Basic RDD Transformations
val numbers = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
val evenNumbers = numbers.filter(x => x % 2 == 0)
val squares = evenNumbers.map(x => x * x)
squares.collect.foreach(println)
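RDD transformations such as `filter` and `map` are lazy: nothing is computed until an action like `collect` runs. Python generators behave in a similar way, so the pipeline above can be sketched as follows (an analogy only, with no Spark involved):

```python
numbers = range(1, 11)

# Like filter: describes the even numbers, computes nothing yet.
even_numbers = (x for x in numbers if x % 2 == 0)

# Like map: still lazy, chained onto the previous step.
squares = (x * x for x in even_numbers)

# Like collect: forces evaluation of the whole chain.
result = list(squares)
print(result)  # [4, 16, 36, 64, 100]
```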
Data Visualization Techniques
1. Titanic Dataset Analysis
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset('titanic')
Visualizing Distributions
- Bar Plot (Sex vs Age):
sns.barplot(x='sex', y='age', data=df)
plt.show()
- Count Plot (Survival):
sns.countplot(x='survived', data=df)
plt.show()
- Fare Histogram:
sns.histplot(df['fare'], bins=30)
plt.show()
Advanced Statistical Plots
sns.boxplot(x='sex', y='age', data=df, hue='survived')
plt.show()
sns.violinplot(x='sex', y='age', hue='survived', data=df)
plt.show()
sns.stripplot(x='sex', y='age', data=df, hue='survived')
plt.show()
2. Iris Dataset Visualization
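df = sns.load_dataset('iris')
df.hist(bins=20, figsize=(10,8))
plt.show()
sns.boxplot(data=df)
plt.show()
sns.pairplot(df)
plt.show()
# jointplot needs explicit x and y columns; this pair is one illustrative choice
sns.jointplot(data=df, x='sepal_length', y='sepal_width')
plt.show()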
Machine Learning Models
Linear Regression for Housing Prices
Imports:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Preprocessing:
df = pd.read_csv("boston_housing.csv")
df.fillna(df.median(numeric_only=True), inplace=True)
Model Training:
X = df.drop('MEDV', axis=1)
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)
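The steps above stop at fitting. A natural follow-up is to score the model on the held-out split. The sketch below assumes synthetic data in place of `boston_housing.csv`, so the coefficients and noise level are invented, and uses mean squared error and R² as two common regression metrics:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: three features, a linear target, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on data the model has not seen.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}  R^2: {r2:.4f}")
```

On real housing data the scores will be far less flattering than on this nearly noise-free example.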
Logistic Regression for Classification
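Preprocessing and Scaling:
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("Social_Network_Ads.csv")
df.drop(['User ID', 'Gender'], axis=1, inplace=True)
# X_train, X_test, y_train, y_test come from train_test_split, as in the linear regression example
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # reuse the scaling learned on the training set
Evaluation:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)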
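The fragments above leave out the split, the model fit, and the prediction step. One way the pieces might fit together is sketched below, assuming synthetic two-feature data from `make_classification` in place of `Social_Network_Ads.csv`, since the notes do not name that file's target column:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the ads dataset.
X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows of the confusion matrix are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print(f"Accuracy: {accuracy:.3f}")
```

Accuracy equals the sum of the confusion matrix diagonal divided by the number of test samples, which makes the two outputs easy to cross-check.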
Naive Bayes Classification
Implementation:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Performance Metrics:
from sklearn.metrics import precision_score, recall_score
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
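With `average='macro'`, the score is the unweighted mean of the per-class scores, so every class counts equally regardless of its size. The notes do not say which dataset this section uses. The sketch below assumes the iris data bundled with scikit-learn, purely as a convenient multi-class example:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumed dataset: iris as shipped with scikit-learn (three classes).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# average=None returns one score per class; macro is their plain mean.
per_class_precision = precision_score(y_test, y_pred, average=None)
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")

print("Per-class precision:", per_class_precision)
print(f"Macro precision: {precision:.3f}  Macro recall: {recall:.3f}")
```

Printing the per-class values alongside the macro average shows whether one weak class is pulling the overall figure down.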
