BAN668 Cheat Sheet: Data Manipulation & Analysis
Notes for Every Dataset
- Always look at the data info (df.info()) and the starting descriptive statistics (df.describe()).
- After every command that alters the data, view the changes (e.g., df.head()), as in the sketch below.
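A minimal sketch of that inspect-alter-verify loop, using a small made-up DataFrame:
import pandas as pd

df = pd.DataFrame({'age': [25, 40, None], 'balance': [100, 250, 80]})  # made-up data
df.info()                  # data info: dtypes and non-null counts
print(df.describe())       # starting descriptive statistics
df = df.dropna()           # an altering command
print(df.head())           # view the changes before moving on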
Loading and Describing Data
import pandas as pd
df = pd.read_csv('bank2.csv', sep='\t') = Read different file types; know each file's separator (here '\t' for tab-separated).
Add headers by passing the names parameter to the read function, e.g., pd.read_csv('file.csv', header=None, names=['col1', 'col2']).
df.sort_values('x'), df.describe(), df['x'].value_counts() (for qualitative data) = First descriptive looks at the data.
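A hedged loading-and-describing example; it assumes bank2.csv exists with three tab-separated, headerless columns, and the column names are made up:
import pandas as pd

df = pd.read_csv('bank2.csv', sep='\t', header=None,
                 names=['age', 'job', 'balance'])   # supply headers via names
print(df.describe())                  # quantitative summary
print(df['job'].value_counts())       # qualitative summary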
Visualizing Data
import seaborn as sns
# Convert the day variable to datetime before plotting
df['day'] = pd.to_datetime(df['day'])
sns.lineplot(data=df, x='X', y='Y') = Plot two variables; swap lineplot for barplot, scatterplot, etc., as the task requires.
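A minimal plotting sketch on made-up data; lineplot is one assumption among several seaborn plot types:
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'day': ['2024-01-01', '2024-01-02', '2024-01-03'],
                   'sales': [100, 150, 120]})       # made-up data
df['day'] = pd.to_datetime(df['day'])               # convert day variable
sns.lineplot(data=df, x='day', y='sales')           # or sns.barplot / sns.scatterplot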
Handling Missing Values
df['x'].isna().sum() = Count missing values
df = df.dropna() = Delete rows with missing values
df = df.drop(columns=['x']) = Delete variables with too many missing values
df['x'] = df['x'].fillna(df['x'].mode()[0]) = Impute missing values with the mode
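A short sketch of counting and imputing missing values, on a made-up column 'x':
import pandas as pd

df = pd.DataFrame({'x': ['a', None, 'a', 'b']})      # made-up data with a gap
print(df['x'].isna().sum())                          # 1 missing value
df['x'] = df['x'].fillna(df['x'].mode()[0])          # impute with the mode ('a')
print(df)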
Concatenating Data
Note: Rename columns in the two datasets so the names are identical, e.g., df.rename(columns={'old_name': 'new_name'}).
df_concat = pd.concat([df1, df2])
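A minimal concatenation sketch; the column names here are assumptions:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'amount': [10, 20]})
df2 = pd.DataFrame({'ID': [3], 'amount': [30]})
df2 = df2.rename(columns={'ID': 'id'})                # make names identical first
df_concat = pd.concat([df1, df2], ignore_index=True)  # stack rows; reset the index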
Merging Data
df_merged = pd.merge(df1, df2, on='key_column', how='inner') = Keep rows in common (inner join); on= names the shared key column.
Change how= to 'left' to keep only rows from the left dataset, or 'outer' to keep all rows.
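A minimal merge sketch; the key column 'id' and the data are assumptions:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [80, 90, 70]})
inner = pd.merge(df1, df2, on='id', how='inner')    # ids 2 and 3 only
left = pd.merge(df1, df2, on='id', how='left')      # all rows from df1
outer = pd.merge(df1, df2, on='id', how='outer')    # all rows from both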
Sorting and Slicing Data
df = df.sort_values('x') = Sort by a specific column (set ascending=False to sort descending)
pd.to_datetime = Changing variable to date and time
For slicing, create a subset of the desired size: df.head(n) for the top rows, df.tail(n) for the last rows, and df.sample(n) for a random sample. For specific columns use:
df_subset = df[['x', 'y']]
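A short sorting-and-slicing sketch on made-up data:
import pandas as pd

df = pd.DataFrame({'x': [3, 1, 2], 'y': ['c', 'a', 'b']})
df = df.sort_values('x', ascending=False)    # largest x first
top = df.head(2)                             # first two rows
rand = df.sample(1, random_state=1)          # one random row, reproducible
df_subset = df[['x', 'y']]                   # specific columns only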
Filtering Data
df[df['column_name'] == value] = Filter by column value
df[(df['column1'] == value1) & (df['column2'] < value2)] = Filter by multiple conditions
df[df['column_name'].between(lower_bound, upper_bound)] = Filtering between a range
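A minimal filtering sketch; the columns 'age' and 'balance' are assumptions:
import pandas as pd

df = pd.DataFrame({'age': [25, 40, 60], 'balance': [100, 250, 80]})
adults = df[df['age'] == 40]                               # single condition
both = df[(df['age'] > 30) & (df['balance'] < 200)]        # multiple conditions
in_range = df[df['age'].between(30, 65)]                   # inclusive range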
Aggregating Data
For basic stats, use df['column_name'].function(). For instance:
df['column_name'].sum() = Aggregate by sum
df.groupby('column_name').agg({'column1': 'sum', 'column2': 'mean'}) = Applying multiple functions
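A minimal aggregation sketch; the grouping column 'job' and the data are assumptions:
import pandas as pd

df = pd.DataFrame({'job': ['a', 'a', 'b'], 'sales': [10, 20, 5], 'age': [30, 40, 50]})
print(df['sales'].sum())                                        # aggregate by sum
print(df.groupby('job').agg({'sales': 'sum', 'age': 'mean'}))   # per-group functions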
Cleaning Data
df['column'] = df['column'].str.lower() = Change a text column to lower case; use .str.upper() for upper case and .str.capitalize() for sentence case.
df['column_name'] = df['column_name'].str.replace(r'[^A-Za-z ]', '', regex=True) = Replace anything that’s not a letter with nothing.
df['column_name'] = df['column_name'].str.replace(r'[^0-9]', '', regex=True) = Removing non-numeric characters.
df['column'] = pd.to_numeric(df['column']) = Convert to a numeric variable.
df[df.duplicated()] = Shows duplicated rows
df = df.drop_duplicates() = Removes duplicated rows (no .copy() needed; drop_duplicates already returns a new DataFrame).
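A short text-cleaning sketch on made-up messy data:
import pandas as pd

df = pd.DataFrame({'name': ['Ann!', 'BOB', 'Ann!'], 'price': ['$10', '$20', '$10']})
df['name'] = df['name'].str.replace(r'[^A-Za-z ]', '', regex=True).str.capitalize()
df['price'] = pd.to_numeric(df['price'].str.replace(r'[^0-9]', '', regex=True))
df = df.drop_duplicates()        # the two identical 'Ann' rows collapse to one
print(df)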
Mutating Data
import numpy as np = Import NumPy for conditional logic.
df['new_column'] = df['column1'] + df['column2'] = Add a new column computed from existing columns (subtraction, multiplication, or division work the same way).
df = pd.DataFrame({'column_name': [...]}) = Placeholder frame; fill in your data.
df['new_column'] = np.select([df['column_name'] > threshold, df['column_name'] <= threshold], ['Value1', 'Value2']) = Add a new column based on conditions; threshold stands in for your cutoff value.
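A runnable np.select sketch with a made-up cutoff of 100:
import numpy as np
import pandas as pd

df = pd.DataFrame({'balance': [50, 150, 100]})           # made-up data
df['doubled'] = df['balance'] * 2                        # arithmetic on existing columns
df['level'] = np.select([df['balance'] > 100, df['balance'] <= 100],
                        ['High', 'Low'])                 # label rows by condition
print(df)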
Handling Extreme Outliers
Use NumPy (np) for replacing values and Seaborn (sns), e.g., sns.boxplot(x=df['column_name']), to visualize outliers.
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)] = Show the outliers
df = df[(df['column_name'] >= lower_bound) & (df['column_name'] <= upper_bound)] = Remove outliers by keeping only rows within the bounds
mean_value = df['column_name'].mean()
df['column_name'] = np.where((df['column_name'] < lower_bound) | (df['column_name'] > upper_bound), mean_value, df['column_name']) = Replace outliers with the mean; the median or mode can be used instead, depending on the task.
Supervised Learning: Regression & Classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, f1_score, mean_absolute_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
df = pd.get_dummies(df, columns=['column'], drop_first=True) = Dummy-encode (nominal variable)
Ordinal-Encode (Ordinal Variable)
df['column'] = df['column'].replace({'level_1': 1, 'level_2': 2}) = Map each level of the ordinal variable to a number; add one entry per level the variable has.
y = df['column'] = Set the target variable
X = df.drop(columns=['column']) = Drop the target from the predictor variables (X)
X = StandardScaler().fit_transform(X) = Standardize the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) = Setting test data
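A hedged end-to-end preprocessing sketch; the column names and the small made-up dataset are assumptions:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'age': [25, 40, 60, 35], 'job': ['a', 'b', 'a', 'b'],
                   'target': [0, 1, 1, 0]})                  # made-up data
df = pd.get_dummies(df, columns=['job'], drop_first=True)    # dummy-encode nominal variable
y = df['target']                                             # target variable
X = df.drop(columns=['target'])                              # predictors
X = StandardScaler().fit_transform(X)                        # standardize
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)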
Linear Regression
model = LinearRegression().fit(X_train, y_train)
y_predictions = model.predict(X_test)
mean_absolute_error(y_test, y_predictions)
df['predictions'] = model.predict(X) = Store predictions for the full dataset in a new column
Logistic Regression
model = LogisticRegression().fit(X_train, y_train)
y_predictions = model.predict(X_test)
ConfusionMatrixDisplay.from_predictions(y_test, y_predictions) = Plot the confusion matrix
f1_score(y_test, y_predictions) = F1 score
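A minimal classification sketch tying the pieces together; the single feature and the made-up binary target are assumptions:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, f1_score

df = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'y':  [0, 0, 0, 0, 1, 1, 1, 1]})          # made-up data
X_train, X_test, y_train, y_test = train_test_split(
    df[['x1']], df['y'], random_state=1, stratify=df['y'])   # keep both classes in each split
model = LogisticRegression().fit(X_train, y_train)
y_predictions = model.predict(X_test)
ConfusionMatrixDisplay.from_predictions(y_test, y_predictions)
print(f1_score(y_test, y_predictions))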
