Python Data Science Reference: Pandas, NumPy, and Files

Python String and List Operations

Slice with s[start:stop:step]; start is inclusive and stop is exclusive. On a Pandas Series, call string methods through the .str accessor, e.g. df["col"].str.upper().

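  • s.upper() #all uppercase
  • s.lower() #all lowercase
  • s.title() #first letter of each word capitalized
  • s.find("P") #index of first match, -1 if not found
  • s.replace("old", "new", count) #count is an optional int, not a string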
  • s.strip()
  • s.startswith(prefix)
  • s.endswith(suffix)
  • s.split(sep, maxsplit)
  • s.join(parts) #s is the separator placed between the parts
  • s.count(sub, start, end)
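The slicing rule and a few of the methods above can be sketched as follows. The sample string "Python" and the variable names are assumptions chosen for illustration.

```python
s = "Python"  # sample value, assumed for illustration

# Slicing: start is inclusive, stop is exclusive
first_two = s[0:2]       # "Py"
every_other = s[::2]     # "Pto"
reversed_s = s[::-1]     # "nohtyP"

# String methods return new strings; s itself is never modified
upper = s.upper()                          # "PYTHON"
position = s.find("P")                     # 0
not_found = s.find("z")                    # -1
replaced = "a-b-c".replace("-", "+", 1)    # "a+b-c" (only the first match)
parts = "a,b,c".split(",", 1)              # ["a", "b,c"]
joined = "-".join(["x", "y", "z"])         # "x-y-z"
stripped = "  hi  ".strip()                # "hi"
```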

List Methods and Comprehensions

  • l.append(x)
  • l.clear()
  • l.copy()
  • l.count(x)
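  • l.sort() #sorts in place, returns None
  • l.insert(1, "a") #insert "a" at index 1
  • l.pop(i) #removes and returns the item at index i (last item if i is omitted)
  • l.remove(x) #removes the first occurrence of value x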

List Comprehensions:

  • l = [expression for item in iterable]
  • l = [expression for item in iterable if condition]
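A short sketch of the list methods and both comprehension forms. The values are made up for illustration; note that pop works by index while remove works by value.

```python
l = [3, 1, 2]
l.append(4)          # [3, 1, 2, 4]
l.insert(1, "a")     # [3, "a", 1, 2, 4]
l.remove("a")        # removes by value -> [3, 1, 2, 4]
last = l.pop()       # removes and returns the last item -> 4
first = l.pop(0)     # pop takes an index -> returns 3, leaving [1, 2]
l.sort()             # sorts in place and returns None

# Comprehensions: without and with a condition
squares = [x * x for x in range(5)]             # [0, 1, 4, 9, 16]
evens = [x for x in range(10) if x % 2 == 0]    # [0, 2, 4, 6, 8]
```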

Dictionaries and Tuples

Dictionary Operations

  • d.get(key) #returns value for key, or None if the key is missing
  • d.keys() #all keys
  • d.values() #all values
  • d.items() #pairs
  • d.pop(key) #removes key and returns its value
  • d[k] = v
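A minimal sketch of the dictionary operations, using made-up keys and values.

```python
d = {"a": 1, "b": 2}
d["c"] = 3                   # add a new key (or overwrite an existing one)

missing = d.get("z")         # None: get does not raise KeyError
fallback = d.get("z", 0)     # 0: optional default for missing keys
removed = d.pop("a")         # returns 1 and removes the key "a"

keys = list(d.keys())        # ["b", "c"]
values = list(d.values())    # [2, 3]
pairs = list(d.items())      # [("b", 2), ("c", 3)]
```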

Tuple Operations

  • tup = ()
  • tup.count(x)
  • tup.index(x)
  • len(tup), min(tup), max(tup), sum(tup), sorted(tup)
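The tuple operations can be sketched the same way. The values are illustrative; a tuple is immutable, so built-ins such as sorted return a new list rather than changing it.

```python
tup = (3, 1, 3, 2)

threes = tup.count(3)      # 2
where_one = tup.index(1)   # 1 (index of the first match)
summary = (len(tup), min(tup), max(tup), sum(tup))   # (4, 1, 3, 9)
ordered = sorted(tup)      # [1, 2, 3, 3], a list; tup is unchanged

single = (5,)              # the trailing comma makes a one-item tuple
```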

File Handling and Pathlib

f = open(filename, mode) #r: read, w: write, a: append, x: create, b: binary, t: text

Usage: with open("file.txt", "r") as f:

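  • f.read() #whole file as one string
  • f.readline() #next single line
  • f.readlines() #list of all remaining lines
  • f.write(text) #write a string to the file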
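A sketch of writing, appending, and reading with the with-statement, which closes the file automatically. The temporary folder and the file name notes.txt are assumptions so the example does not touch real files.

```python
import tempfile
from pathlib import Path

# Hypothetical file inside a fresh temporary folder
path = Path(tempfile.mkdtemp()) / "notes.txt"

with open(path, "w") as f:        # "w" creates or overwrites
    f.write("first line\nsecond line\n")

with open(path, "a") as f:        # "a" appends to the end
    f.write("third line\n")

with open(path, "r") as f:
    first = f.readline()          # "first line\n"
    rest = f.readlines()          # the remaining lines as a list

with open(path, "r") as f:
    everything = f.read()         # the whole file as one string
```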

Pathlib and Metadata

p = Path(file)

  • DATA.glob("*.fileextension") # DATA is a Path to a folder; finds all files with that extension
  • p.stat() #metadata (returns a stat result object)
  • p.name #file name with extension
  • p.stem #w/o extension
  • p.suffix #extension
  • p.parent #parent dir
  • p.exists()
  • p.is_file()
  • p.stat().st_size #size of file in bytes
  • p.stat().st_mtime #last-modified timestamp
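The Path attributes above can be sketched as follows. DATA here is an assumed temporary folder, and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed setup: a temporary folder with a few small files
DATA = Path(tempfile.mkdtemp())
(DATA / "a.csv").write_text("x\n1\n")
(DATA / "b.csv").write_text("x\n2\n")
(DATA / "notes.txt").write_text("hello")

csv_files = sorted(DATA.glob("*.csv"))   # glob returns a generator of Paths
p = csv_files[0]

info = p.stat()                # metadata object
size_bytes = info.st_size      # file size in bytes
modified = info.st_mtime       # last-modified time, seconds since the epoch

name, stem, suffix = p.name, p.stem, p.suffix   # "a.csv", "a", ".csv"
```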

Data Formats: CSV and JSON

  • csv.reader(file) #rows as lists
  • csv.DictReader(file) #reads rows as dicts
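  • csv.writer(file) #writes rows from lists
  • csv.DictWriter(file, fieldnames) #writes rows from dicts; fieldnames is required
  • json.load(file) #read JSON from an open file
  • json.loads(string) #parse JSON from a string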
  • json.dump(data, f) #write python to file
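A sketch of the CSV and JSON calls. It uses io.StringIO as a stand-in for an open file, and the sample data is made up. Note that the csv module reads every field as a string.

```python
import csv
import io
import json

text = "name,score\nAda,90\nBob,85\n"   # sample CSV content (assumed)

rows = list(csv.reader(io.StringIO(text)))          # each row is a list
records = list(csv.DictReader(io.StringIO(text)))   # each row is a dict keyed by header

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "score"])
writer.writeheader()
writer.writerows(records)

data = json.loads('{"name": "Ada", "scores": [90, 95]}')   # string -> Python
buffer = io.StringIO()
json.dump(data, buffer)                                    # Python -> file-like object
buffer.seek(0)
round_trip = json.load(buffer)                             # file-like object -> Python
```

With a real file, passing newline="" to open() is the usual way to avoid blank lines in written CSV output.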

NumPy for Numerical Computing

Array Creation and Properties

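  • a = np.array(values) #values: a list or nested list
  • np.zeros((x, y)) #x-by-y array of 0s; the shape is passed as a tuple
  • np.ones((x, y)) #x-by-y array of 1s
  • np.arange(start, stop, step) #stop is exclusive
  • a.shape #dims
  • a.ndim #num of dims
  • a.size #total number of elements
  • a.dtype #datatype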
  • np.full(shape, value)
  • np.empty(shape) #uninitialized; contents are arbitrary until assigned
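A short sketch of array creation and the array attributes, with illustrative values. Note that the shape is passed as a single tuple and that shape, ndim, size, and dtype are attributes of the array.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

zeros = np.zeros((2, 3))        # 2x3 array of 0.0
ones = np.ones((2, 3))          # 2x3 array of 1.0
sevens = np.full((2, 2), 7)     # 2x2 array filled with 7
steps = np.arange(0, 10, 2)     # array([0, 2, 4, 6, 8]); stop is exclusive

shape = a.shape    # (2, 3)
ndim = a.ndim      # 2
size = a.size      # 6
dtype = a.dtype    # an integer dtype; the exact width depends on the platform
```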

Array Manipulation

  • a.reshape(shape)
  • a.flatten() #returns 1d copy
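  • a.ravel() #returns flattened view when possible
  • np.resize(a, newshape)
  • np.split(a, sections)
  • np.sort(a) #returns a sorted copy
  • a.sort() #sorts in place
  • np.where(condition) #indices where condition is true; np.where(condition, x, y) picks x or y elementwise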
  • np.unique(a)
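The manipulation functions can be sketched as below, using a small made-up array. The example highlights the copy versus view distinction between flatten and ravel.

```python
import numpy as np

a = np.array([[3, 1, 2], [6, 5, 4]])

reshaped = a.reshape(3, 2)        # [[3, 1], [2, 6], [5, 4]]

flat_copy = a.flatten()           # independent copy
flat_copy[0] = 99                 # does not change a
flat_view = a.ravel()             # shares memory with a when possible

halves = np.split(np.arange(6), 2)   # [array([0, 1, 2]), array([3, 4, 5])]
sorted_rows = np.sort(a)             # sorted copy along the last axis

idx = np.where(a > 3)                # tuple of row and column index arrays
capped = np.where(a > 3, 3, a)       # replace values above 3 with 3
uniq = np.unique(np.array([2, 1, 2, 3, 1]))   # array([1, 2, 3])
```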

Mathematical and Statistical Functions

  • np.sqrt(a), np.exp(a), np.log(a), np.abs(a), np.round(a)
  • a.sum(), a.mean(), a.std(), a.var(), a.min(), a.max(), a.argmin(), a.argmax() #all accept an axis argument
  • np.median(a) #median is a function, not an array method; also accepts axis
  • np.select(conditions, choices, default=0)
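A sketch of the math and statistics calls on illustrative data. axis=0 aggregates down the rows (one result per column) and axis=1 aggregates across the columns (one result per row).

```python
import numpy as np

a = np.array([[1.0, 4.0], [9.0, 16.0]])

roots = np.sqrt(a)             # [[1, 2], [3, 4]]
col_sums = a.sum(axis=0)       # [10, 20]
row_means = a.mean(axis=1)     # [2.5, 12.5]
median = np.median(a)          # 6.5 (ndarray has no .median() method)
flat_argmax = a.argmax()       # 3, an index into the flattened array

# np.select: the first matching condition wins, otherwise the default is used
scores = np.array([35, 60, 85])
flags = np.select([scores < 50, scores >= 80], [-1, 1], default=0)   # [-1, 0, 1]
```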

Pandas for Data Manipulation

Data Selection

  • df["col"] #put in name, get series
  • df[["col"]] #put in list, get dataframe (select multiple columns)
  • df[x:y] #put in slice, get df (multiple row selection)
  • filter: df[df["col"] > value] #boolean condition on a column; combine conditions with & and |, each wrapped in parentheses
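The four selection forms can be sketched on a small made-up DataFrame. The column names and values are assumptions; the last line also shows a string method applied through the .str accessor.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Bob", "Cy"], "age": [36, 25, 41]})

series = df["age"]                 # one column name -> Series
frame = df[["name", "age"]]        # list of names -> DataFrame
first_two = df[0:2]                # slice -> rows 0 and 1

over_30 = df[df["age"] > 30]                            # Ada and Cy
both = df[(df["age"] > 30) & (df["name"] != "Cy")]      # Ada only
starts_a = df[df["name"].str.startswith("A")]           # Ada only
```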

Data Collection and Quality

Data collection methods: surveys, web scraping, APIs, sensors, logs

Data Quality Checklist

  • Relevance: Is it useful?
  • Completeness: Missing data?
  • Bias: Sampling issues?
  • Compliance: Ethical/legal?
  • Storage: Secure and organized?
  • Plausibility: Does the data make sense?

Data Cleaning and Transformation

import pandas as pd
from pathlib import Path

  • df.dropna()
  • df.fillna(0)
  • df.duplicated()
  • df.drop_duplicates()
  • pd.get_dummies(df["category"]) #onehot
  • df.func(inplace=True) #modifies df directly and returns None, so the call can stand alone without reassignment
  • pd.to_numeric(col)
  • col.astype(type)
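A minimal cleaning pass using the calls above, on made-up data. The column names are assumptions, and errors="coerce" is one option for turning unparseable values into NaN; whether that is appropriate depends on the dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "b", "b", None],
    "value": ["1", "2", "2", "x"],      # numbers stored as text, one bad entry
})

# Convert text to numbers; "x" cannot be parsed and becomes NaN
df["value"] = pd.to_numeric(df["value"], errors="coerce")

dupes = df.duplicated()               # True for rows that repeat an earlier row
deduped = df.drop_duplicates()        # drops the repeated ("b", 2.0) row
complete = deduped.dropna()           # drops rows with any missing value
filled = df.fillna({"value": 0})      # or fill missing values instead

dummies = pd.get_dummies(df["category"])   # one indicator column per category
as_int = complete["value"].astype(int)     # safe once NaN rows are gone
```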

Outlier Detection (IQR)

IQR = Q3 - Q1, where Q1 = col.quantile(0.25) and Q3 = col.quantile(0.75). Values outside the bounds below are treated as outliers.

  • Lower Bound: Q1 - 1.5 * IQR
  • Upper Bound: Q3 + 1.5 * IQR
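The IQR rule can be sketched as follows. The sample values are made up, with one deliberately extreme entry.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])   # illustrative data

q1 = values.quantile(0.25)      # 11.25
q3 = values.quantile(0.75)      # 12.75
iqr = q3 - q1                   # 1.5

lower = q1 - 1.5 * iqr          # 9.0
upper = q3 + 1.5 * iqr          # 15.0

outliers = values[(values < lower) | (values > upper)]    # only 95
cleaned = values[(values >= lower) & (values <= upper)]   # everything else
```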

Aggregation and Grouping

  • df.describe()
  • df.mean()
  • df.sum()
  • df["col"].value_counts() #count of each unique value in a column
  • df.groupby('col1')['col2'] #grouped column; follow with an aggregation such as .mean() or .sum()
  • df['newcol'] = (col operations)
  • df["col"].apply(func) #func can be len, type, str, int, float, abs, round, sorted, sum, min, max, or a lambda, e.g. "lambda x: x.func()"
  • df.agg() #takes one function, a list of functions (sum, mean, median, min, max, count, size, std, var, prod, first, last), or a dict mapping columns to functions, e.g. {col: func}
  • df["col"].agg()
  • df.groupby("key").agg()
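A sketch of grouping, aggregating, and apply on a small made-up table. The column names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["x", "x", "y"],
    "points": [10, 20, 5],
    "name": ["ann", "bo", "cy"],
})

totals = df.groupby("team")["points"].sum()     # x: 30, y: 5

# Dict form: choose functions per column
summary = df.groupby("team").agg({"points": ["sum", "mean"]})

stats = df["points"].agg(["min", "max"])        # list form on one column
counts = df["team"].value_counts()              # x: 2, y: 1

df["name_len"] = df["name"].apply(len)          # apply a function to each value
df["double"] = df["points"] * 2                 # new column from a column operation
```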

Data Visualization and Utilities

  • location = (filename, line_number)
  • os.listdir(path) #names of the entries in a folder
  • pd.concat([df1, df2])
  • axis=0 (rows), axis=1 (cols)
  • df.plot(kind="", x="", y="") #kind can be "line", "bar", "hist", "scatter", "box", among others
  • plt.title(title)
  • plt.xlabel(xlabel)
  • plt.ylabel(ylabel)
  • .count() returns the number of non-null values (use len(df) for the total number of rows); .sum() adds up the numeric values in a column
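The axis argument and the difference between count, len, and sum can be sketched with two small made-up frames, one containing a missing value.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, np.nan]})

stacked = pd.concat([df1, df2], ignore_index=True)   # axis=0: stack rows
side_by_side = pd.concat([df1, df2], axis=1)         # axis=1: add columns

total_rows = len(stacked)           # 4, includes the row with NaN
non_null = stacked["a"].count()     # 3, NaN is not counted
total = stacked["a"].sum()          # 6.0, NaN is skipped
```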