Python Data Science Reference: Pandas, NumPy, and Files
Python String and List Operations
Use s[start(inclusive):stop(exclusive):step] for slicing. In Pandas, string methods are called through the .str accessor, e.g. series.str.upper().
s.upper(), s.lower(), s.title()
s.find(sub)  # index of first match, -1 if not found
s.replace(old, new, count)
s.strip(), s.startswith(prefix), s.endswith(suffix)
s.split(sep, maxsplit), sep.join(parts)
s.count(sub, start, end)
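A quick sketch of slicing and the methods above; the sample string "Python" is a hypothetical example value:

```python
s = "Python"          # hypothetical example string
first_two = s[0:2]    # "Py" -- stop index is exclusive
reversed_s = s[::-1]  # "nohtyP" -- negative step reverses
upper = s.upper()     # "PYTHON"
idx = s.find("P")     # 0 -- index of first match, -1 if absent
words = "a,b,c".split(",")  # ["a", "b", "c"]
joined = "-".join(words)    # "a-b-c" -- join is called on the separator
```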
List Methods and Comprehensions
l.append(x), l.clear(), l.copy(), l.count(x), l.sort()
l.insert(i, x)  # insert x at index i
l.pop(i)        # remove and return item at index i (default: last)
l.remove(x)     # remove first occurrence of x
List Comprehensions:
l = [expression for item in iterable]
l = [expression for item in iterable if condition]
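Both comprehension forms in action; the nums list is a made-up example:

```python
nums = [1, 2, 3, 4, 5]                   # hypothetical data
squares = [n ** 2 for n in nums]         # [1, 4, 9, 16, 25]
evens = [n for n in nums if n % 2 == 0]  # [2, 4] -- condition filters items
```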
Dictionaries and Tuples
Dictionary Operations
d.get(key)  # value for key (None, or a given default, if absent)
d.keys()    # all keys
d.values()  # all values
d.items()   # (key, value) pairs
d.pop(key)  # remove key, return its value
d[k] = v    # add or overwrite
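A short sketch of these dictionary operations on a hypothetical dict:

```python
d = {"a": 1, "b": 2}     # hypothetical dict
val = d.get("a")         # 1
missing = d.get("z", 0)  # 0 -- get() returns the default when key is absent
d["c"] = 3               # add a new key
popped = d.pop("b")      # 2 -- removes "b" from the dict
pairs = list(d.items())  # [("a", 1), ("c", 3)]
```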
Tuple Operations
tup = ()
tup.count(x), tup.index(x)
len(tup), min(tup), max(tup), sum(tup), sorted(tup)
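The tuple operations above, using a made-up tuple; note that sorted() returns a list, not a tuple:

```python
tup = (3, 1, 2, 1)       # hypothetical tuple
c = tup.count(1)         # 2 -- occurrences of 1
i = tup.index(2)         # 2 -- index of first 2
stats = (len(tup), min(tup), max(tup), sum(tup))  # (4, 1, 3, 7)
ordered = sorted(tup)    # [1, 1, 2, 3] -- a list
```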
File Handling and Pathlib
f = open(filename, mode) #r: read, w: write, a: append, x: create, b: binary, t: text
Usage: with open("file.txt", "r") as f:
f.read(), f.readline(), f.readlines(), f.write(text)
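A minimal write-then-read round trip with the context-manager form; the file lives in a temporary directory (an assumption, to keep the example self-contained):

```python
import os
import tempfile

# Hypothetical path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w") as f:          # "w" mode creates/overwrites
    f.write("line1\nline2\n")
with open(path, "r") as f:          # file is closed automatically at block end
    lines = f.readlines()           # each line keeps its trailing "\n"
```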
Pathlib and Metadata
p = Path(file)
DATA.glob("*.ext")  # find all matching files
p.stat()            # metadata
p.name              # file name with extension
p.stem              # name without extension
p.suffix            # extension
p.parent            # parent directory
p.exists(), p.is_file()
p.stat().st_size    # file size in bytes
p.stat().st_mtime   # modification timestamp
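A sketch of glob and the name-part attributes; the directory and file name are hypothetical, created on the fly so the example runs anywhere:

```python
import tempfile
from pathlib import Path

data_dir = Path(tempfile.mkdtemp())      # hypothetical data directory
(data_dir / "report.csv").write_text("x")
matches = list(data_dir.glob("*.csv"))   # all CSV files in the directory
p = matches[0]
parts = (p.name, p.stem, p.suffix)       # ("report.csv", "report", ".csv")
size = p.stat().st_size                  # 1 byte -- we wrote a single char
```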
Data Formats: CSV and JSON
csv.reader(file)      # rows as lists
csv.DictReader(file)  # rows as dicts
csv.writer(file), csv.DictWriter(file, fieldnames)
json.load(file)       # read JSON from a file
json.loads(string)    # read JSON from a string
json.dump(data, f)    # write a Python object to a file
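A small sketch of both formats; io.StringIO stands in for an open file here (an assumption, so no disk access is needed):

```python
import csv
import io
import json

text = "name,age\nAda,36\n"                      # hypothetical CSV content
rows = list(csv.DictReader(io.StringIO(text)))   # [{"name": "Ada", "age": "36"}]
# Note: csv values come back as strings; convert types yourself.

payload = json.loads('{"ok": true}')             # {"ok": True}
dumped = json.dumps(payload)                     # back to a JSON string
```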
NumPy for Numerical Computing
Array Creation and Properties
a = np.array(data)
np.zeros((x, y))  # array of 0s (shape is passed as a tuple)
np.ones((x, y))   # array of 1s
np.arange(start, stop, step)
np.full(shape, value), np.empty(shape)
a.shape  # dimensions
a.ndim   # number of dimensions
a.size   # total number of elements
a.dtype  # data type
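Array creation and the property attributes together; shapes and values are arbitrary examples:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)     # [[0, 1, 2], [3, 4, 5]]
zeros = np.zeros((2, 3))           # shape is a tuple, not separate args
full = np.full((2, 2), 7)          # 2x2 array of 7s
props = (a.shape, a.ndim, a.size)  # ((2, 3), 2, 6)
```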
Array Manipulation
a.reshape(shape)
a.flatten()  # returns 1-D copy
a.ravel()    # returns flattened view (when possible)
np.resize(a, new_shape)
np.split(a, sections)
np.sort(a)   # returns a sorted copy
a.sort()     # sorts in place
np.where(condition)
np.unique(a)
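The manipulation functions on a small made-up array:

```python
import numpy as np

a = np.array([[3, 1], [2, 4]])   # hypothetical 2x2 array
flat = a.flatten()               # copy: [3, 1, 2, 4]
srt = np.sort(flat)              # sorted copy: [1, 2, 3, 4]; flat unchanged
idx = np.where(flat > 2)         # indices where the condition holds: (array([0, 3]),)
uniq = np.unique([1, 2, 2, 3])   # [1, 2, 3] -- sorted unique values
```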
Mathematical and Statistical Functions
np.sqrt(a), np.exp(a), np.log(a), np.abs(a), np.round(a)
a.sum(), a.mean(), np.median(a), a.std(), a.var(), a.min(), a.max(), a.argmin(), a.argmax()  # all accept an axis argument (median is a NumPy function, not an array method)
np.select(conditions, choices, default=0)
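Aggregation with and without an axis, plus np.select; the array and the string labels are illustrative assumptions:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])  # hypothetical 2x2 array
total = a.sum()                 # 10 -- no axis: aggregate everything
col_means = a.mean(axis=0)      # [2., 3.] -- axis=0 aggregates down the rows
med = np.median(a)              # 2.5

# np.select picks from choices where each condition first holds
labels = np.select([a < 2, a < 4], ["low", "mid"], default="high")
# [["low", "mid"], ["mid", "high"]]
```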
Pandas for Data Manipulation
Data Selection
df["col"]#put in name, get seriesdf[["col"]]#put in list, get dataframe (select multiple columns)df[x:y]#put in slice, get df (multiple row selection)filter: df[df["col"] (condition)]
Data Collection and Quality
Data collection methods: surveys, web scraping, APIs, sensors, logs
Data Quality Checklist
- Relevance: Is it useful?
- Completeness: Missing data?
- Bias: Sampling issues?
- Compliance: Ethical/legal?
- Storage: Secure and organized?
- Plausibility: Does the data make sense?
Data Cleaning and Transformation
import pandas as pd
from pathlib import Path
df.dropna(), df.fillna(0)
df.duplicated(), df.drop_duplicates()
pd.get_dummies(df["category"])  # one-hot encoding
df.method(inplace=True)  # modifies df in place; no assignment needed
pd.to_numeric(col), col.astype(type)
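A cleaning sketch on a hypothetical frame with one missing value and one duplicate row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 1.0], "cat": ["a", "b", "a"]})  # hypothetical
filled = df.fillna({"x": 0})        # NaN in "x" becomes 0
deduped = filled.drop_duplicates()  # drops the repeated (1.0, "a") row
dummies = pd.get_dummies(df["cat"])  # one-hot columns "a" and "b"
nums = pd.to_numeric(pd.Series(["1", "2"]))  # strings -> numbers
```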
Outlier Detection (IQR)
IQR = Q3 - Q1, where Q1 = col.quantile(0.25) and Q3 = col.quantile(0.75)
- Lower bound: Q1 - 1.5 * IQR
- Upper bound: Q3 + 1.5 * IQR
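The IQR rule end to end, on a made-up series with one obvious outlier:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])  # hypothetical data; 100 is the outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]  # values outside the fences
```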
Aggregation and Grouping
df.describe(), df.mean(), df.sum(), df.value_counts()
df.groupby("col1")["col2"]
df["newcol"] = (column operations)
df["col"].apply(func)  # func can be len, type, str, int, float, abs, round, sorted, sum, min, max, or a lambda, e.g. lambda x: x.method()
df.agg(), df["col"].agg(), df.groupby("key").agg()  # agg accepts a list (sum, mean, median, min, max, count, size, std, var, prod, first, last) or a dict mapping columns to functions, e.g. {col: func}
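Grouping, dict-style agg, and apply with a lambda on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 10]})  # hypothetical
means = df.groupby("key")["val"].mean()        # a -> 1.5, b -> 10.0
agged = df.groupby("key").agg({"val": "sum"})  # a -> 3, b -> 10
df["double"] = df["val"].apply(lambda x: x * 2)
```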
Data Visualization and Utilities
location = (filename, line_number)
os.listdir(path)
pd.concat([df1, df2])  # axis=0 stacks rows, axis=1 joins columns
df.plot(kind="", x="", y="")
plt.title(title), plt.xlabel(xlabel), plt.ylabel(ylabel)
.count()  # number of non-null values
.sum()    # sum of the numeric values in a column
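Concatenation along both axes; the two frames are illustrative, and ignore_index=True (an added option) renumbers the stacked rows:

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})  # hypothetical frames
df2 = pd.DataFrame({"x": [3]})
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)  # rows: 1, 2, 3
side = pd.concat([df1, df1], axis=1)  # columns placed side by side
```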
