BAN668 Cheat Sheet: Data Manipulation & Analysis
Notes for Every Dataset
- Always check df.info() and the starting descriptive statistics (df.describe()) before working with a dataset.
- After every command that alters the data, view the result (e.g., with df.head()) to confirm the change.
Loading and Describing Data
import pandas as pd
df = pd.read_csv('bank2.csv', sep='\t')
Know how to read different file types and which separator each uses (e.g., sep='\t' for tab-delimited files).
Add headers to a headerless file by passing names=[...] to read_csv().
df.sort_values('column')
= Sort rows by a column
df.describe()
= Descriptive statistics
df['column'].value_counts()
= Frequency counts (qualitative data)
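A minimal first-look sketch (the file name bank2.csv comes from above; the 'job' column is a hypothetical example):
import pandas as pd

df = pd.read_csv('bank2.csv', sep='\t')  # adjust sep to match the file
df.info()                                # column types and non-null counts
print(df.describe())                     # starting descriptive statistics
print(df['job'].value_counts())          # counts for a qualitative column ('job' is hypothetical)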
Visualizing Data
import seaborn as sns
# Convert day variable
df['day'] = pd.to_datetime(df['day'])
sns.lineplot(data=df, x='x', y='y')
= Plot the data; swap in sns.scatterplot(), sns.barplot(), etc. as needed.
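A hedged plotting sketch, assuming the dataset has the 'day' column above plus a numeric 'sales' column (hypothetical):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df['day'] = pd.to_datetime(df['day'])      # make sure dates are parsed first
sns.lineplot(data=df, x='day', y='sales')  # 'sales' is a placeholder column name
plt.show()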
Handling Missing Values
df['x'].isna().sum()
= Count missing values
df = df.dropna()
= Delete rows with missing values
df = df.drop(columns=['x'])
= Delete variables with too many missing values
df['x'] = df['x'].fillna(df['x'].mode()[0])
= Impute missing values with mode
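A short sketch of the whole missing-values workflow, assuming a numeric 'age' column and a sparse 'city' column (both hypothetical):
print(df.isna().sum())                            # missing count per column
df = df.drop(columns=['city'])                    # drop a column with too many gaps
df['age'] = df['age'].fillna(df['age'].mode()[0]) # impute with the mode
df = df.dropna()                                  # drop any remaining incomplete rows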
Concatenating Data
Note: Rename columns so the names in the two datasets are identical before concatenating: df = df.rename(columns={'old_name': 'new_name'})
df_concat = pd.concat([df1, df2])
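A minimal sketch, assuming df2 calls its key column 'client_id' while df1 calls it 'id' (both names hypothetical):
df2 = df2.rename(columns={'client_id': 'id'})         # align column names first
df_concat = pd.concat([df1, df2], ignore_index=True)  # stack rows and reset the index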
Merging Data
df_merged = pd.merge(df1, df2, how='inner')
= Keep rows in common (inner join)
Change how= to 'left' to keep every row from the left dataset (or 'right' for the right one), or to 'outer' to keep all rows from both.
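A short sketch, assuming both datasets share a key column named 'id' (hypothetical):
df_merged = pd.merge(df1, df2, how='inner', on='id')  # keep only ids present in both
df_all = pd.merge(df1, df2, how='outer', on='id')     # keep every id from either side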
Sorting and Slicing Data
df = df.sort_values('x')
= Sort by a specific column (pass ascending=False for descending order)
df['x'] = pd.to_datetime(df['x'])
= Convert a variable to date and time
For slicing, create a subset of the desired size: df.head(n) returns the first n rows, df.tail(n) the last n rows, and df.sample(n) a random sample. For specific columns use:
df_subset = df[['x', 'y']]
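A quick sketch of the slicing calls (the sizes and column names are arbitrary):
df = df.sort_values('x', ascending=False)  # largest values first
top10 = df.head(10)                        # first 10 rows
bottom10 = df.tail(10)                     # last 10 rows
random10 = df.sample(10, random_state=1)   # reproducible random sample
df_subset = df[['x', 'y']]                 # just two columns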
Filtering Data
df[df['column_name'] == value]
= Filter by column value
df[(df['column1'] == value1) & (df['column2'] < value2)]
= Filter by multiple conditions
df[df['column_name'].between(lower_bound, upper_bound)]
= Filtering between a range
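A hedged filtering example, assuming numeric 'balance' and 'age' columns (hypothetical):
high = df[df['balance'] > 1000]                         # single condition
target = df[(df['balance'] > 1000) & (df['age'] < 40)]  # combine with & (and) or | (or)
mid_age = df[df['age'].between(30, 50)]                 # inclusive range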
Aggregating Data
For basic stats, use df['column_name'].function(). For instance:
df['column_name'].sum()
= Aggregate by sum
df.groupby('column_name').agg({'column1': 'sum', 'column2': 'mean'})
= Applying multiple functions
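A grouped-aggregation sketch, assuming 'region', 'sales', and 'price' columns (hypothetical):
total = df['sales'].sum()  # single statistic for one column
summary = df.groupby('region').agg({'sales': 'sum', 'price': 'mean'})
print(summary)             # one row per region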
Cleaning Data
df['column'] = df['column'].str.lower()
= Change text values to lower case; use .str.upper() for upper case and .str.capitalize() for sentence case.
df['column_name'] = df['column_name'].str.replace(r'[^A-Za-z ]', '', regex=True)
= Replace anything that's not a letter or a space with nothing.
df['column_name'] = df['column_name'].str.replace(r'[^0-9]', '', regex=True)
= Removing non-numeric characters.
df['column'] = pd.to_numeric(df['column'])
= Convert to a numeric variable.
df[df.duplicated()]
= Shows duplicated rows
df = df.drop_duplicates()
= Removes duplicated rows.
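A brief cleaning sketch, assuming a 'price' column stored as messy text (hypothetical; note this regex also keeps the decimal point):
df['price'] = df['price'].str.replace(r'[^0-9.]', '', regex=True)  # keep digits and '.'
df['price'] = pd.to_numeric(df['price'])                           # convert to numbers
print(df[df.duplicated()])                                         # inspect duplicates before dropping
df = df.drop_duplicates()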
Mutating Data
import numpy as np
= Import NumPy (used below for np.select() and np.where()).
df['new_column'] = df['column1'] + df['column2']
= Add a new column computed from existing columns (subtraction, multiplication, or division also work).
df['new_column'] = np.select(
    [df['column_name'] > threshold, df['column_name'] <= threshold],
    ['Value1', 'Value2'])
= Add a new column based on conditions (threshold is a placeholder for your cutoff value).
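A worked np.select() example, assuming a numeric 'score' column (hypothetical) and a cutoff of 50:
df['grade'] = np.select(
    [df['score'] >= 50, df['score'] < 50],  # conditions checked in order
    ['Pass', 'Fail'],                       # matching labels
    default='Unknown')                      # used when no condition matches (e.g., NaN)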
Handling Extreme Outliers
Make use of NumPy (np) and Seaborn (sns).
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]
= Compute the interquartile-range bounds and identify the outliers
df = df[(df['column_name'] >= lower_bound) & (df['column_name'] <= upper_bound)]
= Remove the outliers (keep only rows within the bounds)
mean_value = df['column_name'].mean()
df['column_name'] = np.where((df['column_name'] < lower_bound) | (df['column_name'] > upper_bound), mean_value, df['column_name'])
= Replace outliers with the mean; the mode (or median) can be used instead, depending on the task.
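Seaborn is named above but not shown; a boxplot is one quick way to eyeball extreme values (a minimal sketch, reusing 'column_name'):
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df['column_name'])  # points beyond the whiskers are potential outliers
plt.show()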
Supervised Learning: Regression & Classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, f1_score, mean_absolute_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
df = pd.get_dummies(df, columns=['column'], drop_first=True)
= Dummy-encode (nominal variable)
Ordinal-Encode (Ordinal Variable)
df['column'] = df['column'].replace({'low': 1, 'medium': 2, 'high': 3})
= Map every category the variable can take to a number ('low'/'medium'/'high' are placeholders).
y = df['column']
X = df.drop(columns=['column'])
= Set the target variable (y) and drop it from the predictors (X)
X = StandardScaler().fit_transform(X)
= Standardize the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
= Split the data into training and test sets
Linear Regression
model = LinearRegression().fit(X_train, y_train)
y_predictions = model.predict(X_test)
mean_absolute_error(y_test, y_predictions)
= Mean absolute error on the test set
df['prediction'] = model.predict(X)
= Store a prediction for every row (the column name 'prediction' is a placeholder)
Logistic Regression
model = LogisticRegression().fit(X_train, y_train)
y_predictions = model.predict(X_test)
ConfusionMatrixDisplay.from_predictions(y_test, y_predictions)
= Fit the model and plot the confusion matrix
f1_score(y_test, y_predictions)
= F1 score
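DecisionTreeClassifier and KNeighborsClassifier are imported above but never shown; they follow the same fit/predict pattern (a hedged sketch, assuming a binary target so f1_score's default setting works):
for model in [LogisticRegression(), DecisionTreeClassifier(random_state=1),
              KNeighborsClassifier()]:
    model.fit(X_train, y_train)            # same pattern for every classifier
    y_predictions = model.predict(X_test)
    print(type(model).__name__, f1_score(y_test, y_predictions))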