R Programming Essentials: Data Manipulation & Visualization

R Data Handling Essentials

Importing Data

Importing CSV Files

  • Use read.csv() to import a Comma Separated Values file:

    df <- read.csv("filename.csv")
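
A minimal round-trip sketch, assuming no real file is at hand: it writes a small made-up data frame to a temporary CSV file and reads it back. The column names and values are hypothetical.

```r
# Hypothetical data, written to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(state = c("CA", "OR"), wage = c(25, 22)),
          tmp, row.names = FALSE)

# read.csv() returns a data frame; the header row becomes the column names
df <- read.csv(tmp)
print(df)
```

With a real file, replace tmp with the file's path, given relative to the working directory or as a full path.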

Importing Stata .dta Files

  • Requires the haven package:

    library(haven)
    df <- read_dta("stata_file.dta")

Managing Your Environment

Setting and Checking the Working Directory

  • Set: setwd("your/path/here")

  • Get: getwd()

Data Frame Basics

Accessing a Column

  • Access a column using the $ operator: df$column_name

Calculating Column Mean

  • Calculate the mean of a numeric column: mean(df$column)

Creating a Frequency Table

  • Generate a frequency table for a column: table(df$column)
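
The three operations above can be tried together on a small data frame. This is a sketch only; the state and wage columns are made up for illustration.

```r
# Hypothetical data frame used only for illustration
df <- data.frame(state = c("CA", "AZ", "CA", "OR"),
                 wage  = c(20, 18, 30, 22))

df$wage            # extract the wage column as a numeric vector
mean(df$wage)      # 22.5
table(df$state)    # counts per state: AZ 1, CA 2, OR 1

# mean() returns NA if the column contains missing values;
# na.rm = TRUE drops them before averaging
mean(c(df$wage, NA), na.rm = TRUE)   # 22.5
```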

Subsetting Rows by Condition

  • Filter rows based on a condition:

    • Base R: df[df$state != "OR", ]

    • subset() function: subset(df, state != "OR")

Subsetting for Specific Values (e.g., CA or AZ)

  • Select rows where a column matches specific values:

    • Base R: df[df$state == "CA" | df$state == "AZ", ]

    • subset() function: subset(df, state == "CA" | state == "AZ")

Subsetting with Functions (e.g., Mean Wage in CA)

  • Apply a function to a subset of data:

    • Example: Calculate the mean wage for California: mean(df$wage[df$state == "CA"])
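
A short sketch of these subsetting patterns on made-up data (the values are hypothetical). Note the trailing comma inside the square brackets: the condition selects rows, and the empty slot after the comma keeps every column.

```r
# Hypothetical data frame used only for illustration
df <- data.frame(state = c("CA", "AZ", "CA", "OR"),
                 wage  = c(20, 18, 30, 22))

not_or  <- df[df$state != "OR", ]                     # drops the one OR row
ca_az   <- subset(df, state == "CA" | state == "AZ")  # same rows, via subset()
ca_mean <- mean(df$wage[df$state == "CA"])            # (20 + 30) / 2 = 25

not_or
ca_mean
```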

Excluding a Value

  • Remove rows where a column equals a specific value:

    • Base R: df[df$col != "value", ]

    • subset() function: subset(df, col != "value")

Including Specific Values

  • Keep rows where a column matches one of several specific values:

    • Base R: df[df$col == "A" | df$col == "B", ]

    • subset() function: subset(df, col == "A" | col == "B")
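
When the list of values to keep grows, chaining == with | gets long. Base R's %in% operator, which these notes do not otherwise cover, is a common shorthand. The sketch below uses a hypothetical data frame to show that it selects the same rows.

```r
# Hypothetical data frame used only for illustration
df <- data.frame(col = c("A", "B", "C", "A"), n = 1:4)

keep <- df[df$col %in% c("A", "B"), ]      # same rows as col == "A" | col == "B"
drop <- df[!(df$col %in% c("A", "B")), ]   # every row that is neither A nor B

keep
drop
```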

Viewing Dataset Portions

  • First 6 rows: head(df)

  • Last 6 rows: tail(df)

  • Dimensions (rows, columns): dim(df)

  • Summary statistics: summary(df)

  • Structure: str(df)
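
These inspection functions can be tried on mtcars, a data set that ships with R; it stands in here for your own df.

```r
head(mtcars, 3)    # first 3 rows; without the second argument, 6 rows
tail(mtcars, 3)    # last 3 rows
dim(mtcars)        # 32 11  (rows, then columns)
summary(mtcars$mpg)  # min, quartiles, mean, and max of one column
str(mtcars)        # column names, types, and first few values
```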

Vector Operations

Creating and Assigning Vectors

  • Create a numeric vector and print it:

    x <- c(1, 2, 3)
    x

    [1] 1 2 3

Element-wise Vector Addition

  • Add two vectors element by element:

    a <- c(1, 2, 3)
    b <- c(4, 5, 6)
    a + b

    [1] 5 7 9

Logical Comparisons and Vectors

  • Perform logical comparisons on vectors, returning a logical vector:

    vec <- c(1, 10, 11, 19)
    vec > 4

    [1] FALSE  TRUE  TRUE  TRUE
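
A logical vector like this is what drives row filtering in data frames. As a small sketch, it can also index the original vector or be counted directly, because TRUE is treated as 1 and FALSE as 0 in arithmetic.

```r
vec <- c(1, 10, 11, 19)
big <- vec > 4          # FALSE TRUE TRUE TRUE

vec[big]                # 10 11 19, only the elements where big is TRUE
sum(big)                # 3, the number of elements above 4
mean(big)               # 0.75, the share of elements above 4
```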

R Logical Operators

Operator   Meaning        Example         Result
==         Equal to       5 == 5          TRUE
!=         Not equal to   5 != 6          TRUE
>          Greater than   4 > 5           FALSE
<          Less than      3 < 10          TRUE
&          AND            TRUE & FALSE    FALSE
|          OR             TRUE | FALSE    TRUE

  • Combine logical conditions: (5 != 6 | 4 > 5) evaluates to TRUE

Tidyverse Data Manipulation

Loading the Tidyverse Package

  • Load the tidyverse package for data manipulation and visualization:

    library(tidyverse)

    (Needed for: %>%, mutate(), group_by(), summarize())

Creating New Columns with mutate()

  • Add or modify columns in a data frame:

    df <- df %>% mutate(new_col = ...)

Sorting Data

  • Sort data by column(s):

    • Ascending: df %>% arrange(col)

    • Descending: df %>% arrange(desc(col))

Filtering Rows

  • Filter rows based on a condition using filter():

    df %>% filter(condition)
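
The verbs so far chain together with %>%. The sketch below loads only dplyr (one of the packages that library(tidyverse) attaches) and uses made-up data; the annual column and the 2000-hours figure are hypothetical.

```r
library(dplyr)  # part of the tidyverse; provides %>%, mutate(), filter(), arrange()

# Hypothetical data frame used only for illustration
df <- data.frame(state = c("CA", "AZ", "CA", "OR"),
                 wage  = c(20, 18, 30, 22))

result <- df %>%
  mutate(annual = wage * 2000) %>%   # new column; assumes 2000 hours per year
  filter(state != "OR") %>%          # keep rows where the condition is TRUE
  arrange(desc(wage))                # highest wage first

result
```

Each step receives the data frame produced by the step before it, so the order of the verbs matters.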

Grouping and Summarizing Data

  • Group data by a variable and calculate summary statistics:

    df %>% group_by(group_var) %>% summarize(mean_val = mean(num_var))

Grouping by Multiple Variables

  • Group by multiple columns for more granular summaries:

    df %>% group_by(col1, col2) %>% summarize(...)

    → Output = 1 row per unique (col1, col2) combination
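
A sketch of both grouping patterns on made-up data (all values hypothetical). The .groups = "drop" argument is an addition not in the notes above; it returns an ungrouped result and silences dplyr's regrouping message.

```r
library(dplyr)

# Hypothetical data frame used only for illustration
df <- data.frame(state = c("CA", "CA", "AZ", "AZ", "AZ"),
                 year  = c(2020, 2021, 2020, 2020, 2021),
                 wage  = c(20, 30, 18, 22, 25))

# One row per state
by_state <- df %>%
  group_by(state) %>%
  summarize(mean_wage = mean(wage))

# One row per unique (state, year) combination
by_state_year <- df %>%
  group_by(state, year) %>%
  summarize(mean_wage = mean(wage), .groups = "drop")

by_state
by_state_year
```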

ggplot2 Data Visualization

Creating Plots with Titles

  • Generate a scatter plot with a custom title:

    ggplot(df) +
      geom_point(aes(x = gdp, y = life_expectancy)) +
      ggtitle("My Plot Title")

Setting X and Y Axes in Time Series Plots

  • Define aesthetics for time series plots: aes(x = year, y = averageSAT)
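
A minimal time series sketch using these aesthetics. The sat data frame and its values are invented for illustration, and the axis labels are an optional addition via labs().

```r
library(ggplot2)

# Made-up values for illustration only
sat <- data.frame(year = 2015:2019,
                  averageSAT = c(1000, 1010, 1005, 1020, 1015))

p <- ggplot(sat, aes(x = year, y = averageSAT)) +
  geom_line() +                      # connect the yearly values
  geom_point() +                     # mark each observation
  ggtitle("Average SAT Score by Year") +
  labs(x = "Year", y = "Average SAT score")

print(p)
```

Placing aes() inside ggplot() rather than inside a single geom lets geom_line() and geom_point() share the same x and y mapping.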

Common ggplot2 Plot Types

Function            Plot Type
geom_point()        Scatter plot
geom_line()         Line plot
geom_histogram()    Histogram

Essential R Utilities & Statistics

Checking Unique Values

  • Find all unique values in a column: unique(df$column)

Installing R Packages

  • Install new packages from CRAN: install.packages("packagename")

Linear Models in Base R

  • Fit a linear regression model: lm(y ~ x, data = df)

Understanding the Intercept in Regression

  • The intercept is the predicted y-value when x = 0.

  • It’s where the regression line crosses the y-axis.
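
To see the intercept in practice, the sketch below fits lm() to made-up points that lie exactly on the line y = 3 + 2x, so the fitted coefficients are known in advance.

```r
# Made-up data lying exactly on y = 3 + 2x
x  <- c(0, 1, 2, 3, 4)
df <- data.frame(x = x, y = 3 + 2 * x)

fit <- lm(y ~ x, data = df)
coef(fit)                                   # (Intercept) 3, x 2

# Predicting at x = 0 returns the intercept
predict(fit, newdata = data.frame(x = 0))
```

With real, noisy data the coefficients are estimates, and summary(fit) reports their standard errors.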

Slope Interpretation Example

  • If Z = 2X, the slope of Y ~ Z is ½ × the slope of Y ~ X.
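
The reason is algebraic: if Y = a + bX and Z = 2X, then X = Z/2 and Y = a + (b/2)Z. The sketch below checks this numerically on simulated data (the coefficients 5 and 3 and the seed are arbitrary choices).

```r
set.seed(1)                      # arbitrary seed, for reproducibility
x <- 1:20
y <- 5 + 3 * x + rnorm(20)       # simulated outcome with noise
z <- 2 * x                       # rescaled predictor

slope_x <- coef(lm(y ~ x))[["x"]]
slope_z <- coef(lm(y ~ z))[["z"]]

all.equal(slope_z, slope_x / 2)  # TRUE
```

The intercept is unchanged by the rescaling, since x = 0 and z = 0 describe the same point.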

Interpreting a Flat Regression Line

  • A regression line shallower than 45° (drawn on axes with equal scales) has a slope smaller than 1 in absolute value; a perfectly flat line has a slope of 0.

Understanding the Interquartile Range (IQR)

  • IQR = Q3 – Q1

  • A long box in a box plot (measured along the data axis) indicates a large IQR.

  • A short box indicates a small IQR.
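
A small worked example on made-up numbers, computing the IQR both from the quartiles and with the built-in IQR() function. The quartile values shown are those of R's default quantile method.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

q <- quantile(x, c(0.25, 0.75))   # Q1 = 3, Q3 = 7
unname(q[2] - q[1])               # 4
IQR(x)                            # 4, the same calculation in one call
```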

Solving for Loops with Variable Updates

  1. Find the starting value: sum <- 10

  2. Check the loop iteration: for (i in 1:5)

  3. Understand the update rule: sum <- sum + i

  4. Add loop values: 1 + 2 + 3 + 4 + 5 = 15

  5. Final result: 10 + 15 = 25
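
The steps above can be checked by running the loop. In this sketch the variable is named total rather than sum, purely to avoid masking R's built-in sum() function.

```r
total <- 10            # starting value

for (i in 1:5) {       # i takes the values 1, 2, 3, 4, 5 in turn
  total <- total + i   # update rule: add the current i
}

total                  # 25
```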

Converting Stata Dates to R Calendar Dates

  • Stata’s date origin is January 1, 1960.

  • Conversion example: as.Date(20819, origin = "1960-01-01")
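
Running the conversion shows what the number means: Stata stores a date as the count of days since the origin, so 20819 is 20,819 days after January 1, 1960.

```r
d <- as.Date(20819, origin = "1960-01-01")
d                                        # "2016-12-31"

# Going the other way: days elapsed since the Stata origin
as.numeric(d - as.Date("1960-01-01"))    # 20819
```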