Python Data Science Reference: Pandas, NumPy, and Files
Python String and List Operations
Use s[start(inclusive):stop(exclusive):step] for slicing. In Pandas, string methods are called through the .str accessor, e.g. series.str.upper().
s.upper(), s.lower(), s.title()
s.find(sub)  # index of first match, -1 if not found
s.replace(old, new, count)
s.strip(), s.startswith(prefix), s.endswith(suffix)
s.split(sep, maxsplit), sep.join(parts)
s.count(sub, start, end)
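A quick sketch of slicing and the methods above; the sample string "Python" is a hypothetical example value:

```python
s = "Python"          # hypothetical example string
first_two = s[0:2]    # "Py" -- stop index is exclusive
reversed_s = s[::-1]  # "nohtyP" -- negative step reverses
upper = s.upper()     # "PYTHON"
idx = s.find("P")     # 0 -- index of first match, -1 if absent
words = "a,b,c".split(",")  # ["a", "b", "c"]
joined = "-".join(words)    # "a-b-c" -- join is called on the separator
```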
List Methods and Comprehensions
l.append(x), l.clear(), l.copy(), l.count(x), l.sort()
l.insert(i, x)  # insert x at index i
l.pop(i)        # remove and return item at index i (default: last)
l.remove(x)     # remove first occurrence of x
List Comprehensions:
l = [expression for item in iterable]
l = [expression for item in iterable if condition]
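Both comprehension forms in action; the nums list is a made-up example:

```python
nums = [1, 2, 3, 4, 5]                   # hypothetical data
squares = [n ** 2 for n in nums]         # [1, 4, 9, 16, 25]
evens = [n for n in nums if n % 2 == 0]  # [2, 4] -- condition filters items
```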
Dictionaries and Tuples
Dictionary Operations
d.get(key)  # value for key (None, or a given default, if absent)
d.keys()    # all keys
d.values()  # all values
d.items()   # (key, value) pairs
d.pop(key)  # remove key, return its value
d[k] = v    # add or overwrite
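A short sketch of these dictionary operations on a hypothetical dict:

```python
d = {"a": 1, "b": 2}     # hypothetical dict
val = d.get("a")         # 1
missing = d.get("z", 0)  # 0 -- get() returns the default when key is absent
d["c"] = 3               # add a new key
popped = d.pop("b")      # 2 -- removes "b" from the dict
pairs = list(d.items())  # [("a", 1), ("c", 3)]
```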
Tuple Operations
tup = ()
tup.count(x), tup.index(x)
len(tup), min(tup), max(tup), sum(tup), sorted(tup)
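The tuple operations above, using a made-up tuple; note that sorted() returns a list, not a tuple:

```python
tup = (3, 1, 2, 1)       # hypothetical tuple
c = tup.count(1)         # 2 -- occurrences of 1
i = tup.index(2)         # 2 -- index of first 2
stats = (len(tup), min(tup), max(tup), sum(tup))  # (4, 1, 3, 7)
ordered = sorted(tup)    # [1, 1, 2, 3] -- a list
```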
File Handling and Pathlib
f = open(filename, mode) #r: read, w: write, a: append, x: create, b: binary, t: text
Usage: with open("file.txt", "r") as f:
f.read(), f.readline(), f.readlines(), f.write(text)
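A minimal write-then-read round trip with the context-manager form; the file lives in a temporary directory (an assumption, to keep the example self-contained):

```python
import os
import tempfile

# Hypothetical path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w") as f:          # "w" mode creates/overwrites
    f.write("line1\nline2\n")
with open(path, "r") as f:          # file is closed automatically at block end
    lines = f.readlines()           # each line keeps its trailing "\n"
```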
Pathlib and Metadata
p = Path(file)
DATA.glob("*.ext")  # find all matching files
p.stat()            # metadata
p.name              # file name with extension
p.stem              # name without extension
p.suffix            # extension
p.parent            # parent directory
p.exists(), p.is_file()
p.stat().st_size    # file size in bytes
p.stat().st_mtime   # modification timestamp
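A sketch of glob and the name-part attributes; the directory and file name are hypothetical, created on the fly so the example runs anywhere:

```python
import tempfile
from pathlib import Path

data_dir = Path(tempfile.mkdtemp())      # hypothetical data directory
(data_dir / "report.csv").write_text("x")
matches = list(data_dir.glob("*.csv"))   # all CSV files in the directory
p = matches[0]
parts = (p.name, p.stem, p.suffix)       # ("report.csv", "report", ".csv")
size = p.stat().st_size                  # 1 byte -- we wrote a single char
```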
Data Formats: CSV and JSON
csv.reader(file)      # rows as lists
csv.DictReader(file)  # rows as dicts
csv.writer(file), csv.DictWriter(file, fieldnames)
json.load(file)       # read JSON from a file
json.loads(string)    # read JSON from a string
json.dump(data, f)    # write a Python object to a file
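A small sketch of both formats; io.StringIO stands in for an open file here (an assumption, so no disk access is needed):

```python
import csv
import io
import json

text = "name,age\nAda,36\n"                      # hypothetical CSV content
rows = list(csv.DictReader(io.StringIO(text)))   # [{"name": "Ada", "age": "36"}]
# Note: csv values come back as strings; convert types yourself.

payload = json.loads('{"ok": true}')             # {"ok": True}
dumped = json.dumps(payload)                     # back to a JSON string
```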
NumPy for Numerical Computing
Array Creation and Properties
a = np.array(data)
np.zeros((x, y))  # array of 0s (shape is passed as a tuple)
np.ones((x, y))   # array of 1s
np.arange(start, stop, step)
np.full(shape, value), np.empty(shape)
a.shape  # dimensions
a.ndim   # number of dimensions
a.size   # total number of elements
a.dtype  # data type
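Array creation and the property attributes together; shapes and values are arbitrary examples:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)     # [[0, 1, 2], [3, 4, 5]]
zeros = np.zeros((2, 3))           # shape is a tuple, not separate args
full = np.full((2, 2), 7)          # 2x2 array of 7s
props = (a.shape, a.ndim, a.size)  # ((2, 3), 2, 6)
```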
Array Manipulation
a.reshape(shape)
a.flatten()  # returns 1-D copy
a.ravel()    # returns flattened view (when possible)
np.resize(a, new_shape)
np.split(a, sections)
np.sort(a)   # returns a sorted copy
a.sort()     # sorts in place
np.where(condition)
np.unique(a)
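The manipulation functions on a small made-up array:

```python
import numpy as np

a = np.array([[3, 1], [2, 4]])   # hypothetical 2x2 array
flat = a.flatten()               # copy: [3, 1, 2, 4]
srt = np.sort(flat)              # sorted copy: [1, 2, 3, 4]; flat unchanged
idx = np.where(flat > 2)         # indices where the condition holds: (array([0, 3]),)
uniq = np.unique([1, 2, 2, 3])   # [1, 2, 3] -- sorted unique values
```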
Mathematical and Statistical Functions
np.sqrt(a), np.exp(a), np.log(a), np.abs(a), np.round(a)
a.sum(), a.mean(), np.median(a), a.std(), a.var(), a.min(), a.max(), a.argmin(), a.argmax()  # all accept an axis argument (median is a NumPy function, not an array method)
np.select(conditions, choices, default=0)
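Aggregation with and without an axis, plus np.select; the array and the string labels are illustrative assumptions:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])  # hypothetical 2x2 array
total = a.sum()                 # 10 -- no axis: aggregate everything
col_means = a.mean(axis=0)      # [2., 3.] -- axis=0 aggregates down the rows
med = np.median(a)              # 2.5

# np.select picks from choices where each condition first holds
labels = np.select([a < 2, a < 4], ["low", "mid"], default="high")
# [["low", "mid"], ["mid", "high"]]
```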
Pandas for Data Manipulation
Data Selection
df["col"]#put in name, get seriesdf[["col"]]#put in list, get dataframe (select multiple columns)df[x:y]#put in slice, get df (multiple row selection)filter: df[df["col"] (condition)]
Data Collection and Quality
Data collection methods: surveys, web scraping, APIs, sensors, logs
Data Quality Checklist
- Relevance: Is it useful?
- Completeness: Missing data?
- Bias: Sampling issues?
- Compliance: Ethical/legal?
- Storage: Secure and organized?
- Plausibility: Does the data make sense?
Data Cleaning and Transformation
import pandas as pd
from pathlib import Path
df.dropna(), df.fillna(0)
df.duplicated(), df.drop_duplicates()
pd.get_dummies(df["category"])  # one-hot encoding
df.method(inplace=True)  # modifies df in place; no assignment needed
pd.to_numeric(col), col.astype(type)
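A cleaning sketch on a hypothetical frame with one missing value and one duplicate row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 1.0], "cat": ["a", "b", "a"]})  # hypothetical
filled = df.fillna({"x": 0})        # NaN in "x" becomes 0
deduped = filled.drop_duplicates()  # drops the repeated (1.0, "a") row
dummies = pd.get_dummies(df["cat"])  # one-hot columns "a" and "b"
nums = pd.to_numeric(pd.Series(["1", "2"]))  # strings -> numbers
```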
Outlier Detection (IQR)
IQR = Q3 - Q1, where Q1 = col.quantile(0.25) and Q3 = col.quantile(0.75)
- Lower bound: Q1 - 1.5 * IQR
- Upper bound: Q3 + 1.5 * IQR
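The IQR rule end to end, on a made-up series with one obvious outlier:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])  # hypothetical data; 100 is the outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]  # values outside the fences
```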
Aggregation and Grouping
df.describe(), df.mean(), df.sum(), df.value_counts()
df.groupby("col1")["col2"]
df["newcol"] = (column operations)
df["col"].apply(func)  # func can be len, type, str, int, float, abs, round, sorted, sum, min, max, or a lambda, e.g. lambda x: x.method()
df.agg(), df["col"].agg(), df.groupby("key").agg()  # agg accepts a list (sum, mean, median, min, max, count, size, std, var, prod, first, last) or a dict mapping columns to functions, e.g. {col: func}
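Grouping, dict-style agg, and apply with a lambda on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 10]})  # hypothetical
means = df.groupby("key")["val"].mean()        # a -> 1.5, b -> 10.0
agged = df.groupby("key").agg({"val": "sum"})  # a -> 3, b -> 10
df["double"] = df["val"].apply(lambda x: x * 2)
```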
Data Visualization and Utilities
location = (filename, line_number)
os.listdir(path)
pd.concat([df1, df2])  # axis=0 stacks rows, axis=1 joins columns
df.plot(kind="", x="", y="")
plt.title(title), plt.xlabel(xlabel), plt.ylabel(ylabel)
.count()  # number of non-null values
.sum()    # sum of the numeric values in a column
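Concatenation along both axes; the two frames are illustrative, and ignore_index=True (an added option) renumbers the stacked rows:

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})  # hypothetical frames
df2 = pd.DataFrame({"x": [3]})
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)  # rows: 1, 2, 3
side = pd.concat([df1, df1], axis=1)  # columns placed side by side
```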
