BAN668 Cheat Sheet: Data Manipulation & Analysis
Notes for Every Dataset
- Always check df.info() and the starting descriptive statistics (df.describe()) before working with a dataset.
- After every command that alters the data, view the result (e.g., with df.head()) to confirm the change.
Loading and Describing Data
import pandas as pd
df = pd.read_csv('bank2.csv', sep='\t')
Know how to read different file types and which separator each uses (e.g., sep='\t' for tab-delimited files).
Add headers to a headerless file by passing names=[...] to read_csv().
df.sort_values('column')
= Sort rows by a column
df.describe()
= Descriptive statistics
df['column'].value_counts()
= Frequency counts (qualitative data)
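A minimal first-look sketch (the file name bank2.csv comes from above; the 'job' column is a hypothetical example):
import pandas as pd

df = pd.read_csv('bank2.csv', sep='\t')  # adjust sep to match the file
df.info()                                # column types and non-null counts
print(df.describe())                     # starting descriptive statistics
print(df['job'].value_counts())          # counts for a qualitative column ('job' is hypothetical)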
Visualizing Data
import seaborn as sns
# Convert day variable
df['day'] = pd.to_datetime(df['day'])
sns.lineplot(data=df, x='x', y='y')
= Plot the data; swap in sns.scatterplot(), sns.barplot(), etc. as needed.
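A hedged plotting sketch, assuming the dataset has the 'day' column above plus a numeric 'sales' column (hypothetical):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df['day'] = pd.to_datetime(df['day'])      # make sure dates are parsed first
sns.lineplot(data=df, x='day', y='sales')  # 'sales' is a placeholder column name
plt.show()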
Handling Missing Values
df['x'].isna().sum()
= Count missing values
df = df.dropna()
= Delete rows with missing values
df = df.drop(columns=['x'])
= Delete variables with too many missing values
df['x'] = df['x'].fillna(df['x'].mode()[0])
= Impute missing values with mode
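A short sketch of the whole missing-values workflow, assuming a numeric 'age' column and a sparse 'city' column (both hypothetical):
print(df.isna().sum())                            # missing count per column
df = df.drop(columns=['city'])                    # drop a column with too many gaps
df['age'] = df['age'].fillna(df['age'].mode()[0]) # impute with the mode
df = df.dropna()                                  # drop any remaining incomplete rows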
Concatenating Data
Note: Rename columns so the names in the two datasets are identical before concatenating: df = df.rename(columns={'old_name': 'new_name'})
df_concat = pd.concat([df1, df2])
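A minimal sketch, assuming df2 calls its key column 'client_id' while df1 calls it 'id' (both names hypothetical):
df2 = df2.rename(columns={'client_id': 'id'})         # align column names first
df_concat = pd.concat([df1, df2], ignore_index=True)  # stack rows and reset the index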
Merging Data
df_merged = pd.merge(df1, df2, how='inner')
= Keep rows in common (inner join)
Change how= to 'left' to keep every row from the left dataset (or 'right' for the right one), or to 'outer' to keep all rows from both.
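A short sketch, assuming both datasets share a key column named 'id' (hypothetical):
df_merged = pd.merge(df1, df2, how='inner', on='id')  # keep only ids present in both
df_all = pd.merge(df1, df2, how='outer', on='id')     # keep every id from either side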
Sorting and Slicing Data
df = df.sort_values('x')
= Sort by a specific column (pass ascending=False for descending order)
df['x'] = pd.to_datetime(df['x'])
= Convert a variable to date and time
For slicing, create a subset of the desired size: df.head(n) returns the first n rows, df.tail(n) the last n rows, and df.sample(n) a random sample. For specific columns use:
df_subset = df[['x', 'y']]
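A quick sketch of the slicing calls (the sizes and column names are arbitrary):
df = df.sort_values('x', ascending=False)  # largest values first
top10 = df.head(10)                        # first 10 rows
bottom10 = df.tail(10)                     # last 10 rows
random10 = df.sample(10, random_state=1)   # reproducible random sample
df_subset = df[['x', 'y']]                 # just two columns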
Filtering Data
df[df['column_name'] == value]
= Filter by column value
df[(df['column1'] == value1) & (df['column2'] < value2)]
= Filter by multiple conditions
df[df['column_name'].between(lower_bound, upper_bound)]
= Filtering between a range
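A hedged filtering example, assuming numeric 'balance' and 'age' columns (hypothetical):
high = df[df['balance'] > 1000]                         # single condition
target = df[(df['balance'] > 1000) & (df['age'] < 40)]  # combine with & (and) or | (or)
mid_age = df[df['age'].between(30, 50)]                 # inclusive range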
Aggregating Data
For basic stats, use df['column_name'].function(). For instance:
df['column_name'].sum()
= Aggregate by sum
df.groupby('column_name').agg({'column1': 'sum', 'column2': 'mean'})
= Applying multiple functions
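A grouped-aggregation sketch, assuming 'region', 'sales', and 'price' columns (hypothetical):
total = df['sales'].sum()  # single statistic for one column
summary = df.groupby('region').agg({'sales': 'sum', 'price': 'mean'})
print(summary)             # one row per region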
Cleaning Data
df['column'] = df['column'].str.lower()
= Change text values to lower case; use .str.upper() for upper case and .str.capitalize() for sentence case.
df['column_name'] = df['column_name'].str.replace(r'[^A-Za-z ]', '', regex=True)
= Replace anything that's not a letter or a space with nothing.
df['column_name'] = df['column_name'].str.replace(r'[^0-9]', '', regex=True)
= Removing non-numeric characters.
df['column'] = pd.to_numeric(df['column'])
= Convert to a numeric variable.
df[df.duplicated()]
= Shows duplicated rows
df = df.drop_duplicates()
= Removes duplicated rows.
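A brief cleaning sketch, assuming a 'price' column stored as messy text (hypothetical; note this regex also keeps the decimal point):
df['price'] = df['price'].str.replace(r'[^0-9.]', '', regex=True)  # keep digits and '.'
df['price'] = pd.to_numeric(df['price'])                           # convert to numbers
print(df[df.duplicated()])                                         # inspect duplicates before dropping
df = df.drop_duplicates()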
Mutating Data
import numpy as np
= Import NumPy (used below for np.select() and np.where()).
df['new_column'] = df['column1'] + df['column2']
= Add a new column computed from existing columns (subtraction, multiplication, or division also work).
df['new_column'] = np.select(
    [df['column_name'] > threshold, df['column_name'] <= threshold],
    ['Value1', 'Value2'])
= Add a new column based on conditions (threshold is a placeholder for your cutoff value).
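A worked np.select() example, assuming a numeric 'score' column (hypothetical) and a cutoff of 50:
df['grade'] = np.select(
    [df['score'] >= 50, df['score'] < 50],  # conditions checked in order
    ['Pass', 'Fail'],                       # matching labels
    default='Unknown')                      # used when no condition matches (e.g., NaN)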
Handling Extreme Outliers
Make use of NumPy (np) and Seaborn (sns).
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]
= Compute the interquartile-range bounds and identify the outliers
df = df[(df['column_name'] >= lower_bound) & (df['column_name'] <= upper_bound)]
= Remove the outliers (keep only rows within the bounds)
mean_value = df['column_name'].mean()
df['column_name'] = np.where((df['column_name'] < lower_bound) | (df['column_name'] > upper_bound), mean_value, df['column_name'])
= Replace outliers with the mean; the mode (or median) can be used instead, depending on the task.
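Seaborn is named above but not shown; a boxplot is one quick way to eyeball extreme values (a minimal sketch, reusing 'column_name'):
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df['column_name'])  # points beyond the whiskers are potential outliers
plt.show()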
Supervised Learning: Regression & Classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, f1_score, mean_absolute_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
df = pd.get_dummies(df, columns=['column'], drop_first=True)
= Dummy-encode (nominal variable)
Ordinal-Encode (Ordinal Variable)
df['column'] = df['column'].replace({'low': 1, 'medium': 2, 'high': 3})
= Map every category the variable can take to a number ('low'/'medium'/'high' are placeholders).
y = df['column']
X = df.drop(columns=['column'])
= Set the target variable (y) and drop it from the predictors (X)
X = StandardScaler().fit_transform(X)
= Standardize the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
= Split the data into training and test sets
Linear Regression
model = LinearRegression().fit(X_train, y_train)
y_predictions = model.predict(X_test)
mean_absolute_error(y_test, y_predictions)
= Mean absolute error on the test set
df['prediction'] = model.predict(X)
= Store a prediction for every row (the column name 'prediction' is a placeholder)
Logistic Regression
model = LogisticRegression().fit(X_train, y_train)
y_predictions = model.predict(X_test)
ConfusionMatrixDisplay.from_predictions(y_test, y_predictions)
= Fit the model and plot the confusion matrix
f1_score(y_test, y_predictions)
= F1 score
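DecisionTreeClassifier and KNeighborsClassifier are imported above but never shown; they follow the same fit/predict pattern (a hedged sketch, assuming a binary target so f1_score's default setting works):
for model in [LogisticRegression(), DecisionTreeClassifier(random_state=1),
              KNeighborsClassifier()]:
    model.fit(X_train, y_train)            # same pattern for every classifier
    y_predictions = model.predict(X_test)
    print(type(model).__name__, f1_score(y_test, y_predictions))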