R Programming Essentials: Data Manipulation & Visualization
R Data Handling Essentials
Importing Data
Importing CSV Files
Use read.csv() to import a comma-separated values (CSV) file:
df <- read.csv("filename.csv")
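To try read.csv() without a real file on disk, the sketch below writes a tiny CSV to a temporary file and reads it back. The column names and values are made up for illustration.

```r
# Write a small, hypothetical CSV to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("state,wage", "CA,25", "OR,22"), tmp)

# Read it back into a data frame; the first line becomes the column names.
df <- read.csv(tmp)

names(df)  # column names taken from the header line
nrow(df)   # number of data rows (header excluded)
str(df)    # inspect the column types R inferred
```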
Importing Stata .dta Files
Requires the haven package:
library(haven)
df <- read_dta("stata_file.dta")
Managing Your Environment
Setting and Checking the Working Directory
Set:
setwd("your/path/here")
Get:
getwd()
Data Frame Basics
Accessing a Column
Access a column using the $ operator:
df$column_name
Calculating Column Mean
Calculate the mean of a numeric column:
mean(df$column)
Creating a Frequency Table
Generate a frequency table for a column:
table(df$column)
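The three operations above can be tried together on a small, made-up data frame. One detail worth knowing: mean() returns NA if the column contains a missing value, unless na.rm = TRUE is passed.

```r
# Hypothetical data; one wage is deliberately missing.
df <- data.frame(
  state = c("CA", "OR", "CA", "AZ"),
  wage  = c(25, 22, NA, 18)
)

df$wage                       # extract one column as a vector
mean(df$wage)                 # NA, because one value is missing
mean(df$wage, na.rm = TRUE)   # mean of the three non-missing wages
table(df$state)               # number of rows per state
```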
Subsetting Rows by Condition
Filter rows based on a condition:
Base R:
df[df$state != "OR", ]
subset() function:
subset(df, state != "OR")
Subsetting for Specific Values (e.g., CA or AZ)
Select rows where a column matches specific values:
Base R:
df[df$state == "CA" | df$state == "AZ", ]
subset() function:
subset(df, state == "CA" | state == "AZ")
Subsetting with Functions (e.g., Mean Wage in CA)
Apply a function to a subset of data:
Example: Calculate the mean wage for California:
mean(df$wage[df$state == "CA"])
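A minimal sketch of the same pattern on made-up data: the logical test picks out the California rows, and mean() is applied only to those wages.

```r
# Hypothetical wages for four workers.
df <- data.frame(
  state = c("CA", "OR", "CA", "AZ"),
  wage  = c(30, 22, 20, 18)
)

# Keep only the wages whose row has state == "CA", then average them.
ca_wages <- df$wage[df$state == "CA"]
ca_mean  <- mean(ca_wages)
ca_mean  # 25, the average of 30 and 20
```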
Excluding a Value
Remove rows where a column equals a specific value:
Base R:
df[df$col != "value", ]
subset() function:
subset(df, col != "value")
Including Specific Values
Keep rows where a column matches one of several specific values:
Base R:
df[df$col == "A" | df$col == "B", ]
subset() function:
subset(df, col == "A" | col == "B")
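When the list of values to keep grows, chaining == with | gets long. Base R's %in% operator is a common alternative; the sketch below, on made-up data, checks that both forms select the same rows.

```r
# Hypothetical data frame with a categorical column.
df <- data.frame(col = c("A", "B", "C", "A"), value = 1:4)

keep <- c("A", "B")  # values we want to keep

by_or <- df[df$col == "A" | df$col == "B", ]  # chained comparisons
by_in <- df[df$col %in% keep, ]               # membership test

identical(by_or, by_in)  # TRUE: same rows either way
```

If the column contains NA, the forms differ: bracket indexing with == returns extra all-NA rows, while %in% and subset() drop the rows with missing values.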
Viewing Dataset Portions
First 6 rows:
head(df)
Last 6 rows:
tail(df)
Dimensions (rows, columns):
dim(df)
Summary statistics:
summary(df)
Structure:
str(df)
Vector Operations
Creating and Assigning Vectors
Create a numeric vector:
x <- c(1, 2, 3)
x
→ [1] 1 2 3
Element-wise Vector Addition
Add two vectors element by element:
a <- c(1, 2, 3)
b <- c(4, 5, 6)
a + b
→ [1] 5 7 9
Logical Comparisons and Vectors
Perform logical comparisons on vectors, returning a logical vector:
vec <- c(1, 10, 11, 19)
vec > 4
→ [1] FALSE TRUE TRUE TRUE
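A logical vector like this is useful beyond printing. The sketch below reuses the same vec to select elements and to count or average the TRUE values, since TRUE counts as 1 and FALSE as 0.

```r
vec <- c(1, 10, 11, 19)
is_big <- vec > 4   # FALSE TRUE TRUE TRUE

vec[is_big]    # keep only elements where the test is TRUE: 10 11 19
sum(is_big)    # how many elements pass the test: 3
mean(is_big)   # share of elements that pass: 0.75
```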
R Logical Operators
Operator | Meaning | Example | Result |
---|---|---|---|
== | Equal to | 5 == 5 | TRUE |
!= | Not equal to | 5 != 6 | TRUE |
> | Greater than | 4 > 5 | FALSE |
< | Less than | 3 < 10 | TRUE |
& | AND | TRUE & FALSE | FALSE |
\| | OR | TRUE \| FALSE | TRUE |
Combine logical conditions:
(5 != 6 | 4 > 5)
→TRUE
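The & and | operators also work element by element on vectors, which is what makes them useful for row filters. A small sketch with made-up values:

```r
# Hypothetical parallel vectors, one entry per person.
wage  <- c(25, 22, 30, 18)
state <- c("CA", "OR", "CA", "AZ")

both   <- state == "CA" & wage > 26   # both conditions must hold
either <- state == "AZ" | wage > 24   # at least one condition must hold

both     # FALSE FALSE  TRUE FALSE
either   #  TRUE FALSE  TRUE  TRUE
```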
Tidyverse Data Manipulation
Loading the Tidyverse Package
Load the tidyverse package for data manipulation and visualization:
library(tidyverse)
(Needed for: %>%, mutate(), group_by(), summarize())
Creating New Columns with mutate()
Add or modify columns in a data frame:
df <- df %>% mutate(new_col = ...)
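As a concrete illustration of mutate(), the sketch below computes a new column from two existing ones. It loads dplyr directly (one of the packages that library(tidyverse) attaches); the column names are made up.

```r
library(dplyr)

# Hypothetical data: hours worked and hourly wage.
df <- data.frame(hours = c(40, 35), wage = c(25, 20))

# Add a column computed row by row from existing columns.
df <- df %>% mutate(weekly_pay = hours * wage)

df$weekly_pay  # 1000 700
```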
Sorting Data
Sort data by column(s):
Ascending:
df %>% arrange(col)
Descending:
df %>% arrange(desc(col))
Filtering Rows
Filter rows based on a condition using filter():
df %>% filter(condition)
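filter() and arrange() chain naturally with the pipe. A minimal sketch on made-up data, keeping wages above 20 and sorting from highest to lowest:

```r
library(dplyr)

# Hypothetical data frame.
df <- data.frame(
  state = c("CA", "OR", "CA", "AZ"),
  wage  = c(25, 22, 30, 18)
)

result <- df %>%
  filter(wage > 20) %>%    # drop the row with wage 18
  arrange(desc(wage))      # sort remaining rows, largest wage first

result$wage  # 30 25 22
```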
Grouping and Summarizing Data
Group data by a variable and calculate summary statistics:
df %>% group_by(group_var) %>% summarize(mean_val = mean(num_var))
Grouping by Multiple Variables
Group by multiple columns for more granular summaries:
df %>% group_by(col1, col2) %>% summarize(...)
→ Output: one row per unique (col1, col2) combination
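A short sketch of grouping by two columns, using made-up data. The .groups = "drop" argument is an addition here and assumes dplyr 1.0 or later; it returns an ungrouped result and silences the grouping message.

```r
library(dplyr)

# Hypothetical data: three distinct (state, year) combinations.
df <- data.frame(
  state = c("CA", "CA", "CA", "OR"),
  year  = c(2020, 2020, 2021, 2020),
  wage  = c(20, 30, 40, 22)
)

by_state_year <- df %>%
  group_by(state, year) %>%
  summarize(mean_wage = mean(wage), .groups = "drop")

nrow(by_state_year)      # 3: one row per (state, year) pair
by_state_year$mean_wage  # 25 40 22
```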
ggplot2 Data Visualization
Creating Plots with Titles
Generate a scatter plot with a custom title:
ggplot(df) + geom_point(aes(x = gdp, y = life_expectancy)) + ggtitle("My Plot Title")
Setting X and Y Axes in Time Series Plots
Define aesthetics for time series plots:
aes(x = year, y = averageSAT)
Common ggplot2 Plot Types
Function | Plot Type |
---|---|
geom_point() | Scatter plot |
geom_line() | Line plot |
geom_histogram() | Histogram |
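Putting the pieces together, here is a minimal time series sketch that combines geom_line() and geom_point() with a title. The data frame name and the scores are made up purely for illustration.

```r
library(ggplot2)

# Hypothetical yearly averages (made-up numbers, not real SAT data).
sat <- data.frame(
  year       = 2015:2019,
  averageSAT = c(1000, 1010, 1005, 1020, 1015)
)

# Aesthetics set in ggplot() are shared by every layer added afterwards.
p <- ggplot(sat, aes(x = year, y = averageSAT)) +
  geom_line() +
  geom_point() +
  ggtitle("Average SAT score by year (made-up data)")

print(p)  # draws the plot on the active graphics device
```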
Essential R Utilities & Statistics
Checking Unique Values
Find all unique values in a column:
unique(df$column)
Installing R Packages
Install new packages from CRAN:
install.packages("packagename")
Linear Models in Base R
Fit a linear regression model:
lm(y ~ x, data = df)
Understanding the Intercept in Regression
The intercept is the predicted y-value when x = 0.
It’s where the regression line crosses the y-axis.
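To see the intercept concretely, the sketch below fits lm() to made-up points that lie exactly on the line y = 3 + 2x, so the fitted coefficients are known in advance.

```r
# Hypothetical data lying exactly on y = 3 + 2x.
dat <- data.frame(x = c(0, 1, 2, 3, 4))
dat$y <- 3 + 2 * dat$x

fit <- lm(y ~ x, data = dat)

coef(fit)                                   # intercept 3, slope 2
predict(fit, newdata = data.frame(x = 0))   # prediction at x = 0 equals the intercept
```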
Slope Interpretation Example
If Z = 2X, the slope of Y ~ Z is ½ × the slope of Y ~ X.
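The halving can be checked numerically. This sketch simulates noisy data (the numbers and seed are arbitrary), regresses Y on X and on Z = 2X, and compares the two slopes.

```r
set.seed(1)  # arbitrary seed so the simulated noise is reproducible

X <- 1:20
Y <- 5 + 3 * X + rnorm(20)  # made-up linear relationship plus noise
Z <- 2 * X                  # rescaled predictor

slope_x <- coef(lm(Y ~ X))[["X"]]
slope_z <- coef(lm(Y ~ Z))[["Z"]]

# Doubling the predictor's units halves the slope.
all.equal(slope_z, slope_x / 2)  # TRUE
```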
Interpreting a Flat Regression Line
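A regression line that rises at less than 45° has a slope below 1 in absolute value, provided both axes are drawn on the same scale. A completely flat (horizontal) line has a slope of 0.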
Understanding the Interquartile Range (IQR)
IQR = Q3 – Q1
A long box in a box plot (measured along the data axis) indicates a large IQR.
A short box indicates a small IQR.
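The definition IQR = Q3 – Q1 can be verified against R's built-in IQR() function. The vector below is made up; the quartiles use R's default quantile method.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)  # hypothetical sample

q <- quantile(x, c(0.25, 0.75))  # Q1 = 3 and Q3 = 7 for this vector
manual_iqr <- unname(q[2] - q[1])

manual_iqr  # 4
IQR(x)      # 4, same value from the built-in function

boxplot(x)  # the box spans Q1 to Q3, so its length is the IQR
```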
Solving for Loops with Variable Updates
Find the starting value:
sum <- 10
Check the loop iteration:
for (i in 1:5)
Understand the update rule:
sum <- sum + i
Add loop values: 1 + 2 + 3 + 4 + 5 = 15
Final result: 10 + 15 = 25
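The steps above can be run as a complete loop. This sketch names the accumulator total rather than sum, an assumption made here to avoid confusion with R's built-in sum() function.

```r
total <- 10            # starting value

for (i in 1:5) {       # i takes the values 1, 2, 3, 4, 5 in turn
  total <- total + i   # update rule: add the current i
}

total                  # 25
10 + sum(1:5)          # same answer without a loop
```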
Converting Stata Dates to R Calendar Dates
Stata’s date origin is January 1, 1960.
Conversion example:
as.Date(20819, origin = "1960-01-01")
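Running the conversion shows which calendar date the number corresponds to, and subtracting the origin recovers the original day count. The variable names are illustrative.

```r
stata_days <- 20819  # days since Stata's origin, 1960-01-01

d <- as.Date(stata_days, origin = "1960-01-01")
format(d)  # "2016-12-31"

# Round trip: days between the converted date and the origin.
as.numeric(d - as.Date("1960-01-01"))  # 20819
```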