R Crash Course

Introduction

Today, we’ll start learning about using R and R Markdown. This will be useful for completing your problem sets.

R is a popular open-source programming language and software environment for statistical computing, data analysis, and graphics.

R Markdown provides a way of creating an easy-to-read report that can incorporate formatted text, tables, images, and links to other documents. Basically, we write all of our code in a .qmd file (the Quarto flavor of R Markdown) and knit it to create a final document like this one!

Before we get started

Create a project in a new directory (e.g. your desktop). This will be the folder for the class. All of your work over the semester can be stored here.
Save this markdown file to this project folder.
Also inside this project folder, create another subfolder called “data”
Download gapminder.RData and save it to this data subfolder

Packages

R utilizes a lot of user-written programs called packages.

The code below will use the tidyverse package for data wrangling and visualization. Therefore, we start with installing and loading the package. To do this, you will need the install.packages() and library() commands.

You can execute code directly in your document. That’s a nice feature of R Markdown (no more copy-pasting!). R code is contained within code chunks like this:

# note that the package name has to go inside ""
install.packages("tidyverse")

R then runs the code inside these code chunks when knitting your document.

To insert a code chunk, you can use the keyboard shortcut Ctrl + Alt + I (Windows) or Cmd + Option + I (Mac). It’s good practice to name your code chunks so that you can easily browse them.

Note that packages need only to be installed once. Can you see why the above code chunk has the option eval=FALSE?

After installing the package, you still need to load it:

# note that we no longer need "" here
library(tidyverse)

Note that packages need to be (re)loaded every time you start your R session.
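
The install-once, load-every-session pattern can be sketched as a small helper that installs a package only when it is missing. This is an illustration, not something the course requires: the name ensure_installed() is hypothetical, while requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can now be found
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so nothing gets installed here
ensure_installed("stats")
```

Loading with library() is still a separate step that you repeat in every new session.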

When you loaded tidyverse, there were a bunch of messages. These are probably not great to show in your final document. You can turn them off by including message=FALSE and warning=FALSE at the top of your code chunk (or by clicking the gear icon).
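
As a sketch, assuming you are working in a .qmd file (or with a recent version of knitr), the same two options can also be written as #| comment lines at the top of the chunk. The label load-packages is just an illustrative chunk name.

```r
#| label: load-packages
#| message: false
#| warning: false
library(tidyverse)
```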

Objects in R

R works with “objects”. These can be datasets, numbers, strings, vectors (e.g. lists), functions, etc.

The <- operator allows us to assign things to R objects.

When you create an object, you will notice that this object now appears in the upper right hand pane.

One nice thing about R versus Stata is that you can work with many datasets (and other types of objects such as lists and scalars) at the same time. Try to run the following code manually:

a_dataframe <- data.frame(
  x = sample(10, 100, rep = TRUE),
  y = sample(10, 100, rep = TRUE)
)

a_list <- c("bob", "mary", "simon", "george")

a_scalar <- 4

You will be assigning a lot of stuff in R. The keyboard shortcut for <- is Alt + - (Windows) or Option + - (Mac).
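
Strictly speaking, c() builds a vector (values of one type), while list() builds a true list (values of mixed types), and a single number is just a vector of length 1. The sketch below uses made-up object names to show how class() and length() report these differences.

```r
# A character vector: several values of the same type
names_vec <- c("bob", "mary", "simon", "george")

# A true list: elements may have different types
mixed_list <- list(name = "bob", age = 42, likes_r = TRUE)

# A single number is simply a numeric vector of length 1
a_number <- 4

class(names_vec)   # "character"
class(mixed_list)  # "list"
class(a_number)    # "numeric"
length(a_number)   # 1
```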

Loading Data

The data we are using today originally comes from the gapminder project.

To load the data, use the load() command.

load("data/gapminder.RData")

We can also import data from other formats (e.g. Excel, Stata, SPSS). Let’s try importing a .csv file.

First, download the polityIV dataset and save it to the same data subfolder.

Then, we can save it to an R object using the read_csv() command.

polity4 <- read_csv("data/polity4.csv")

Note that we now have two datasets loaded at the same time.
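
One difference between the two commands is worth seeing in a small example: load() restores objects under the names they were saved with, whereas read_csv() returns the data so that you choose the name yourself. The sketch below uses a made-up dataframe and temporary files, so it does not touch your project folder.

```r
library(readr)

# Made-up example data, written to a temporary folder
toy <- data.frame(country = c("A", "B"), score = c(1.5, 2.5))
csv_path   <- file.path(tempdir(), "toy.csv")
rdata_path <- file.path(tempdir(), "toy.RData")

write_csv(toy, csv_path)      # .csv stores only the table
save(toy, file = rdata_path)  # .RData stores the object together with its name

# read_csv() returns the data, so we pick the object name ourselves
toy_from_csv <- read_csv(csv_path, show_col_types = FALSE)

# load() restores objects under their saved names and returns those names
rm(toy)
loaded_names <- load(rdata_path)
loaded_names  # "toy"
```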

Exploring Data

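Usually, when we first load a dataset, we want to explore it. You can browse the dataset directly by typing view(dta) into the R console (or by clicking on the data object in the upper right-hand pane). Here, dta is the name of the gapminder dataframe used in the rest of this document.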

Other commands to explore your data include glimpse(), summary(), dim(), nrow(), ncol() and skim() (which is available after installing and loading the skimr package).

Use names() or colnames() to see which variables are included in the dataset.

And you can also use distinct() or table() to see the values of each variable.

Another useful command is class(), which tells you the type of variable (or object) you are dealing with.
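
As a quick illustration of the exploration commands above, here is a sketch on a small made-up tibble (the values are invented and only stand in for a real dataset):

```r
library(dplyr)

# Made-up data standing in for a real dataset
toy <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(1962L, 2002L, 1962L, 2002L),
  lifeExp = c(50.1, 61.3, 68.4, 77.9)
)

glimpse(toy)            # column names, types, and first values
dim(toy)                # 4 rows, 3 columns
names(toy)              # "country" "year" "lifeExp"
distinct(toy, country)  # one row per unique country
table(toy$year)         # how often each year appears
class(toy$year)         # "integer"
```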

Here’s a sandbox to try things out:

Cleaning and Manipulating Data (“Wrangling”)

Often, you’ll have to manipulate your datasets before running any analyses. Here you have an overview of the most commonly used tidyverse (technically, dplyr) functions that you might find useful:

mutate() adds new variables
rename() renames existing variables
select() picks variables based on their names
filter() picks observations based on their values
arrange() reorders observations
distinct() shows unique observations
summarise() reduces multiple values to a single summary
group_by() performs any operation “by group”. Often used in combination with summarise().
ungroup() removes any grouping structure

In addition, we’ll be working with something called a pipe operator |>. This basically tells R “take an object, and do X to it”.

You will be using pipes a lot. The keyboard shortcut is Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac).

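In older code, you will also see the pipe written as %>%. That version comes from the magrittr package, which the tidyverse packages make available; the native |> pipe was added in R 4.1.0.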
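
To see what the pipe does, compare a nested call with its piped equivalent. This is a minimal sketch on a made-up tibble and assumes R 4.1.0 or later, where |> is available.

```r
library(dplyr)

toy <- tibble(x = c(3, 1, 2))

# Nested call: read from the inside out
nested <- head(arrange(toy, x), 2)

# Piped call: read from top to bottom
piped <- toy |>
  arrange(x) |>
  head(2)

identical(nested, piped)  # TRUE
```

Both versions sort the rows by x and keep the first two; the pipe simply passes the object on its left as the first argument of the function on its right.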

Keeping rows and columns

The gapminder data is relatively clean, but we can still practice some data manipulation functions on it. For example, we could use select() to keep only a subset of variables (columns in the dataframe), and filter() to keep only a subset of observations (rows in the dataframe). This is pretty helpful when you are working with large datasets and only want to keep a subset of the information.

dta_reduced <- dta |> 
    select(c(country,year)) |> 
    filter(year %in% c(1962,2002))

The above code chunk creates a new dataframe called dta_reduced, which includes only the variables country and year, and only the years 1962 and 2002.

The c() function combines values into a vector. Each value should be separated by a comma.

BTW, we can equivalently select years using the | operator, which you may remember from Stata: filter(year == 1962 | year == 2002).
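
The equivalence of the two filters can be checked directly. The sketch below uses a made-up tibble with three years per country:

```r
library(dplyr)

toy <- tibble(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(1962, 1982, 2002), times = 2)
)

with_in <- toy |> filter(year %in% c(1962, 2002))
with_or <- toy |> filter(year == 1962 | year == 2002)

identical(with_in, with_or)  # TRUE
nrow(with_in)                # 4
```

The %in% version tends to stay readable when the list of values grows longer.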

Question: why is it generally a good idea to create a new dataframe here, rather than just over-writing the “raw” data?

Re-ordering Rows

If you look at the first couple of rows of dta_reduced, you will see that it’s arranged alphabetically by country, and then by year in ascending order. But what if we wanted to list all of the 1962 data first, and then all of the 2002 data? We can use the arrange() function to re-order the rows in our dataset.

dta_reduced <- dta_reduced |> 
    arrange(year)

head(dta_reduced)

# A tibble: 6 × 2
  country      year
  <fct>       <int>
1 Afghanistan  1962
2 Albania      1962
3 Algeria      1962
4 Angola       1962
5 Argentina    1962
6 Australia    1962
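
arrange() also accepts several sorting variables, and desc() reverses the order of one of them. A minimal sketch on made-up data:

```r
library(dplyr)

toy <- tibble(
  country = c("B", "A", "B", "A"),
  year    = c(2002, 2002, 1962, 1962)
)

# Newest year first; within each year, countries in alphabetical order
sorted <- toy |>
  arrange(desc(year), country)

sorted$year     # 2002 2002 1962 1962
sorted$country  # "A" "B" "A" "B"
```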

Creating new variables

Another thing we can do is to create new variables (columns) using the mutate() command. For example, let’s combine information on pop and gdpPercap to create a new variable measuring total_GDP.

dta_augmented <- dta |> 
    mutate(total_GDP = pop * gdpPercap)
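
A single mutate() call can create several variables at once, and later ones may refer to earlier ones. The sketch below uses made-up numbers; log_total_GDP is an illustrative extra column that is not part of the gapminder data.

```r
library(dplyr)

toy <- tibble(
  country   = c("A", "B"),
  pop       = c(1000, 2000),
  gdpPercap = c(5, 10)
)

toy_augmented <- toy |>
  mutate(
    total_GDP     = pop * gdpPercap,  # computed row by row
    log_total_GDP = log(total_GDP)    # uses the column created just above
  )

toy_augmented$total_GDP  # 5000 20000
```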

Group-level operations

Finally, the group_by() and summarize() commands are often used together to create group-level variables:

dta_grouped <- dta |> 
    group_by(continent,year) |> 
    summarize(region_avg_GDPpc = mean(gdpPercap))

Question: Can you explain what the above code chunk accomplishes?

Suppose instead that we wanted to add our new group-level data back to our original dataframe:

dta_augmented_grouped <- dta |> 
    group_by(continent,year) |> 
    mutate(region_avg_GDPpc = mean(gdpPercap))
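
The difference between the two approaches shows up in the number of rows. In this sketch on made-up data, summarise() returns one row per group, while mutate() keeps every original row. The .groups = "drop" argument and the final ungroup() are additions of this sketch that remove the grouping afterwards.

```r
library(dplyr)

toy <- tibble(
  continent = c("X", "X", "Y"),
  year      = c(1962, 1962, 1962),
  gdpPercap = c(100, 300, 500)
)

# summarise(): one row per group
by_group <- toy |>
  group_by(continent, year) |>
  summarise(region_avg_GDPpc = mean(gdpPercap), .groups = "drop")

# mutate(): keeps every original row and repeats the group value
with_avg <- toy |>
  group_by(continent, year) |>
  mutate(region_avg_GDPpc = mean(gdpPercap)) |>
  ungroup()

nrow(by_group)             # 2
nrow(with_avg)             # 3
with_avg$region_avg_GDPpc  # 200 200 500
```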

Data Visualization

After cleaning your dataset, but before doing any analyses on it, it’s a good idea to visualize your data.

For that, we’ll use ggplot(), which comes from the ggplot2 package, part of the tidyverse.

The following code will show you how ggplot builds a plot, step-by-step:

## to simplify, let's just keep years 1962 and 2002 for plotting
plot_data <- dta |> 
  filter(year %in% c(1962,2002))

## an empty plot
ggplot()

## telling ggplot what data you want to plot
ggplot(data=plot_data)

## telling ggplot which variables go on which axes
ggplot(data=plot_data, 
       mapping=aes(x=gdpPercap, 
                   y=lifeExp))

## telling ggplot what type of graphics ("geom"s) you want to plot
ggplot(data=plot_data, 
       mapping=aes(x=gdpPercap, 
                   y=lifeExp)) + 
  geom_point()

## adding stuff like colors, transparency and marker size
ggplot(data=plot_data, 
       mapping=aes(x=gdpPercap, 
                   y=lifeExp)) + 
  geom_point(color = "cornflowerblue",
             alpha = 0.5,
             size = 2)

## add a best-fit line
ggplot(data=plot_data, 
       mapping=aes(x=gdpPercap, 
                   y=lifeExp)) + 
  geom_point(color = "cornflowerblue",
             alpha = 0.5,
             size = 2) + 
  geom_smooth(method = "loess",
              se=FALSE,
              linewidth=1.5)

## adding labels and captions, and logging the x-axis
ggplot(data=plot_data, 
       mapping=aes(x=gdpPercap, 
                   y=lifeExp)) + 
  geom_point(color = "cornflowerblue",
             alpha = 0.5,
             size = 2) + 
  geom_smooth(method = "loess",
              se=FALSE,
              linewidth=1.5) + 
  scale_x_log10(breaks = c(100, 1000, 10000, 100000),
                labels = scales::dollar) + 
  labs(x = "GDP per capita", 
       y = "Life expectancy",
       title = "Wealth = Health?",
       subtitle = "gdpPercap/lifeExp",
       caption = "Gapminder dataset")

## grouping using colors
ggplot(data=plot_data, 
       mapping=aes(x=gdpPercap, 
                   y=lifeExp,
                   color=as.factor(year))) + 
  geom_point(alpha = 0.5,
             size = 2) + 
  geom_smooth(method = "loess",
              se=FALSE,
              linewidth=1.5) +
  scale_x_log10(breaks = c(100, 1000, 10000, 100000),
                labels = scales::dollar) + 
  labs(x = "GDP per capita", 
       y = "Life expectancy",
       title = "Wealth = Health?",
       subtitle = "gdpPercap/lifeExp",
       caption = "Gapminder dataset",
       color = "Year")

## grouping using facets
ggplot(data=plot_data, 
       mapping=aes(x=gdpPercap, 
                   y=lifeExp,
                   color=as.factor(year))) + 
  geom_point(alpha = 0.5,
             size = 2) + 
  geom_smooth(method = "loess",
              se=FALSE,
              linewidth=1.5) +
  scale_x_log10(breaks = c(100, 1000, 10000, 100000),
                labels = scales::dollar) + 
  labs(x = "GDP per capita", 
       y = "Life expectancy",
       title = "Wealth = Health?",
       subtitle = "gdpPercap/lifeExp",
       caption = "Gapminder dataset",
       color = "Year") +
  facet_wrap(~as.factor(year))

## Adding a theme
ggplot(data=plot_data, 
       mapping=aes(x=gdpPercap, 
                   y=lifeExp,
                   color=as.factor(year))) + 
  geom_point(alpha = 0.5,
             size = 2) + 
  geom_smooth(method = "loess",
              se=FALSE,
              linewidth=1.5) +
  scale_x_log10(breaks = c(100, 1000, 10000, 100000),
                labels = scales::dollar) + 
  labs(x = "GDP per capita", 
       y = "Life expectancy",
       title = "Wealth = Health?",
       subtitle = "gdpPercap/lifeExp",
       caption = "Gapminder dataset",
       color = "Year") + 
  theme_minimal()

You can control the size and alignment of your graphs in R Markdown by specifying options directly in the code chunk. You can also add a figure caption.
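
As a sketch of such chunk options, assuming a .qmd file (the option names below follow Quarto's spelling; classic .Rmd chunks use fig.width, fig.height, fig.align and fig.cap in the chunk header instead). The label, caption text, and data points are made up for illustration.

```r
#| label: fig-wealth-health
#| fig-cap: "Life expectancy against GDP per capita"
#| fig-width: 6
#| fig-height: 4
#| fig-align: center
library(ggplot2)

# Made-up points, only so that the chunk produces a figure
toy <- data.frame(gdpPercap = c(500, 5000, 50000), lifeExp = c(45, 65, 80))

p <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
p
```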

In case you want to play around with colors, here’s a cheatsheet.

Pivoting Data

Outside of this class (where I’ve already “cleaned” the data for you), working with “real” data often involves reshaping from wide to long format, and vice versa:

Question: what format is plot_data in?

For practice, let’s pivot plot_data:

# from long to wide
dta_pivot <- plot_data |> 
  select(year, lifeExp, country) |> 
  pivot_wider(names_from=year, 
              values_from = lifeExp,
              names_prefix = "Year")



# pivoting back
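dta_pivot <- dta_pivot |> 
  pivot_longer(
    cols = starts_with("Year"),
    names_to = "year",
    names_prefix = "Year",
    # column names are text, so convert the years back to integers
    names_transform = list(year = as.integer),
    values_to = "lifeExp"
  )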
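
To make the two shapes concrete, here is the same round trip on a small made-up dataset. One detail to watch when pivoting back: column names are text, so the years come back as character unless they are converted, which is what names_transform does in this sketch.

```r
library(tidyr)

# Made-up long data: one row per country-year
long <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1962L, 2002L, 1962L, 2002L),
  lifeExp = c(50, 60, 70, 80)
)

# Long to wide: one row per country, one column per year
wide <- long |>
  pivot_wider(names_from = year, values_from = lifeExp, names_prefix = "Year")
names(wide)  # "country" "Year1962" "Year2002"

# Wide to long again, converting the year labels back to integers
long_again <- wide |>
  pivot_longer(
    cols = starts_with("Year"),
    names_to = "year",
    names_prefix = "Year",
    names_transform = list(year = as.integer),
    values_to = "lifeExp"
  )
class(long_again$year)  # "integer"
```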

Joining Datasets (In-Class Exercise)

Finally, you may want to join (or “merge” in Stata-talk) different datasets.

Have a look at the different join options here.

Try joining the polity4 data to dta yourself.

Hint: you will need to pivot one of the two datasets.

# in class exercise

You may have noticed that your join didn’t work perfectly because of some inconsistencies in how country names are spelled in the two different datasets.
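
One way to spot such mismatches, sketched here on made-up tables rather than the class data, is to follow a left_join() with an anti_join(), which lists the rows that found no partner. The table names, the score column, and its values are invented for illustration.

```r
library(dplyr)

# Made-up data with one country spelled differently in each table
left_tbl  <- tibble(country = c("Germany", "United States"), lifeExp = c(80, 78))
right_tbl <- tibble(country = c("Germany", "USA"), score = c(10, 10))

# left_join() keeps every row of left_tbl; unmatched rows get NA
joined <- left_join(left_tbl, right_tbl, by = "country")
joined$score  # 10 NA

# anti_join() lists the rows of left_tbl that found no match
unmatched <- anti_join(left_tbl, right_tbl, by = "country")
unmatched$country  # "United States"
```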

To get around this issue, researchers have come up with unique country codes.

One commonly used set of codes comes from the Correlates of War (COW) project.

There’s even a package for it. See here