Potential Outcomes and Average Treatment Effects
Let’s start off with a question that’s (hopefully) relevant to you:
What is a university education worth in terms of future earnings on the job market?
Or, to put it more bluntly, does going to university cause (future) you to earn more money?
Potential Outcomes
The potential outcomes framework provides a great way for thinking through causal claims.
Under this framework, there are two “potential states of the world”, which we can think of as parallel universes:
- one universe where you go to university
- a parallel universe where literally everything else is the same, except that you do not go to university
Suppose we are able to measure your income at age 30 in both universes. We therefore have two potential outcomes:
- \(Y_i(1)\) = your income in the universe where you go to uni
- \(Y_i(0)\) = your income in the universe where you do not
The causal effect of university on your income is therefore \(\tau_i \equiv Y_i(1) - Y_i(0)\). This is called the individual treatment effect (ITE).
We can also calculate an average treatment effect (ATE) which is simply the average or expectation of the ITEs:
\[ ATE \equiv \mathbb{E}[Y_i(1) - Y_i(0)] \]
Let’s run a simulation where we can calculate the ITEs and ATE.
In this simulation, there are 1000 people.
Each person has some innate “ability”, which we can operationalize as a random draw from a uniform distribution between 0 (= lowest ability person) and 100 (= highest ability person).
Each person exists in two states of the world, corresponding to our two parallel universes. In the universe where they do not go to university, income is given by the following data generating process:
\[ Y_i(0) = \text{Ability}_i^2 + \epsilon_i \] where \(\epsilon_i\) is just some random “noise” drawn from a normal distribution with mean = 0 and standard deviation = 500:
\[ \epsilon_i \sim \mathcal{N}(0, 500) \]
In the universe where the person does go to uni, income is 5000 EUR higher:
\[ Y_i(1) = Y_i(0) + 5000 \]
Note that this treatment effect is identical for every single person.
Let’s make up the data:
# load packages
library(tidyverse)

# parameters
set.seed(1)
N <- 1000
ability <- runif(N, min=0, max=100)
e <- rnorm(N, mean=0, sd=500)
effect <- 5000

# creating the data
dta <- tibble(ability, e) |>
  mutate(y0 = round(ability^2 + e, 0),
         y1 = y0 + effect,
         ite = y1 - y0)

# printing the first few rows
head(dta[,c(3,4,5)])
# A tibble: 6 × 3
y0 y1 ite
<dbl> <dbl> <dbl>
1 744 5744 5000
2 1236 6236 5000
3 2690 7690 5000
4 8254 13254 5000
5 903 5903 5000
6 8868 13868 5000
The Fundamental Problem of Causal Inference
Of course, if we could somehow observe both potential outcomes, then causal inference would be easy!
But in real life, each person either goes to university or does not. We cannot observe them in both states of the world.
Thus, one potential outcome is always missing, and we cannot calculate the ITE:
| \(Y_i(0)\) | \(Y_i(1)\) | \(ITE_i\) |
|---|---|---|
| 744 | NA | NA |
| NA | 6236 | NA |
| NA | 7690 | NA |
| 8254 | NA | NA |
| NA | 5903 | NA |
| 8868 | NA | NA |
So how do we solve this problem?
As it turns out, although we cannot obtain the ITE, we can still estimate the ATE.
The answer begins with a simple equality:
\[ ATE \equiv \mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)] \]
This looks more complicated than it is, so let’s break it down with a simple example:
| \(Y_i(0)\) | \(Y_i(1)\) | \(ITE_i\) |
|---|---|---|
| 1 | 4 | 3 |
| 1 | 4 | 3 |
| 2 | 5 | 3 |
| 4 | 7 | 3 |
| \(\mathbb{E}[Y_i(0)]\) = 2 | \(\mathbb{E}[Y_i(1)]\) = 5 | \(\mathbb{E}[Y_i(1) - Y_i(0)]\) = 3 |
In this example, we can either:
- go row by row, filling in the third column by calculating \(Y_i(1) - Y_i(0)\),
- then average down the third column
or
- average down the \(Y_i(0)\) column to get \(\mathbb{E}[Y_i(0)]\),
- then average down the \(Y_i(1)\) column to get \(\mathbb{E}[Y_i(1)]\),
- calculate \(\mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]\) from the last row alone.
We get the same answer either way.
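Just to confirm, here is the same toy example in R (the vectors y0_toy and y1_toy are made-up names holding the numbers from the table above):

# toy example: two equivalent ways to compute the ATE
y0_toy <- c(1, 1, 2, 4)
y1_toy <- c(4, 4, 5, 7)

# strategy 1: average the row-by-row ITEs
mean(y1_toy - y0_toy)
[1] 3

# strategy 2: difference of the column averages
mean(y1_toy) - mean(y0_toy)
[1] 3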
The first strategy is impossible in the real world (due to the fundamental problem of causal inference). So we are left with the second strategy.
More specifically, our task is to estimate \(\mathbb{E}[Y_i(1)]\) and \(\mathbb{E}[Y_i(0)]\).
The key question is: are these estimates biased or unbiased?
Selection Bias
Let’s assume we are working with observational data.
In observational data, we as researchers do not get to decide who goes to uni and who does not. Rather, our subjects decide for themselves.
This self-selection into treatment creates problems for causal inference. More specifically, a “naive” comparison of the incomes of university graduates vs. people without university degrees is likely to be biased.
To see this in action, let’s suppose that \(Ability\) not only determines income, but also determines one’s chances of attending university:
\[ Pr(\text{Uni}) = Ability / 100 \]
Here, we have modeled the probability of attending uni as a coin toss, where \(Ability\) determines the “fairness” of the coin.
Thus, someone with \(Ability_i\) = 20 has a 20% chance of going to uni, while someone else with \(Ability_i\) = 80 has an 80% chance.
Let’s add this to our data:
# recreating the process by which people self-select into university
dta <- dta |>
  mutate(uni = rbinom(n = N, size = 1, prob = ability/100),
         uni = factor(uni))
# plotting the relationship between ability and university attendance
ggplot(dta, aes(x=ability, fill=uni)) +
  geom_histogram(color="white",
                 boundary=0,
                 binwidth=5) +
  theme_minimal() +
  scale_fill_discrete(labels=c("No", "Yes")) +
  labs(x = "Ability",
       y = "Count",
       fill = "Attend University?")
Now let’s run some simulations. Each time, we will calculate the difference in mean incomes between people who did and did not attend uni. And then we will average across all of our simulations.
set.seed(2)
n_sims <- 1000

# creating empty vectors to store the simulation results
v_y0 <- c()
v_y1 <- c()
v_diff <- c()

# looping
for (i in 1:n_sims) {

  # self-selection into treatment and revealing outcomes
  dta <- dta |>
    mutate(uni = rbinom(n = N, size = 1, prob = ability/100),
           y = ifelse(uni==1, y1, y0))

  # storing the average untreated outcomes
  v_y0[i] <- mean(dta$y[dta$uni == 0])

  # storing the average treated outcomes
  v_y1[i] <- mean(dta$y[dta$uni == 1])

  # storing the difference in means
  v_diff[i] <- mean(dta$y[dta$uni == 1]) - mean(dta$y[dta$uni == 0])
}
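To get the numbers reported below, we simply average the stored vectors (a quick summary using the vectors created in the loop above):

# averaging the stored results across all 1000 simulations
mean(v_y0)
mean(v_y1)
mean(v_diff)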
The averages of the untreated and treated outcomes across our 1000 simulations are:
- \(\widehat{Y(0)}\) = 1636.2
- \(\widehat{Y(1)}\) = 9993.45
And the average difference in means is 8357.25.
Because we made up the data generating process, we know that the “true” quantities of interest are:
- \(\mathbb{E}[Y_i(0)]\) = 3315.52
- \(\mathbb{E}[Y_i(1)]\) = 8315.52
- ATE = 5000.
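Since the simulated data contain the full schedule of potential outcomes, these “true” quantities are easy to compute directly:

# "true" quantities, computed from the full schedule of potential outcomes
mean(dta$y0)   # E[Y_i(0)]
mean(dta$y1)   # E[Y_i(1)]
mean(dta$ite)  # ATE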
So why is our estimated difference-in-means so far off from the “true” ATE?
What is selection bias?
It’s the difference in baseline (potential untreated) outcomes between people who did and did not attend university.
In this case, people with higher ability were more likely to self-select into university, and that same ability also leads them to earn higher incomes.
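More formally, the naive difference in means decomposes into the treatment effect on the treated plus a selection-bias term (writing \(D_i = 1\) for attending university):
\[ \underbrace{\mathbb{E}[Y_i \mid D_i=1] - \mathbb{E}[Y_i \mid D_i=0]}_{\text{difference in means}} = \underbrace{\mathbb{E}[Y_i(1) - Y_i(0) \mid D_i=1]}_{\text{effect on the treated}} + \underbrace{\mathbb{E}[Y_i(0) \mid D_i=1] - \mathbb{E}[Y_i(0) \mid D_i=0]}_{\text{selection bias}} \]
Because the treatment effect in our simulation is a constant 5000, the first term equals the ATE, so the naive difference in means should equal 5000 plus the selection bias.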
Since we made up the data, we have the full schedule of potential outcomes, and we can calculate the average amount of selection bias across our 1000 simulations:
set.seed(2)

# creating empty vector to store the simulation results
v_bias <- c()

# looping
for (i in 1:n_sims) {

  # self-selection into treatment
  dta <- dta |>
    mutate(uni = rbinom(n = N, size = 1, prob = ability/100))

  # storing the selection bias
  v_bias[i] <- mean(dta$y0[dta$uni == 1]) - mean(dta$y0[dta$uni == 0])
}
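Averaging the stored bias across simulations gives the number reported below:

# average selection bias across all 1000 simulations
mean(v_bias)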
The average bias is 3357.25.
And 3357.25 + 5000 = 8357.25, exactly the naive difference in means we estimated above.
Yay, our math works!
Why not just control for ability?
In this example, \(Ability\) acts as a confounder because it determines both our treatment variable \(Uni\) and our outcome \(Y\).
In your other classes, you probably learned that we can eliminate this type of bias by controlling for confounders in a regression:
\[ Y_i = \beta_0 + \beta_1 * Uni_i + \beta_2 * Ability_i + \epsilon_i \]
Let’s try it:
set.seed(2)

# creating empty vector to store the regression coefficient
v_coef <- c()

# looping
for (i in 1:n_sims) {

  # self-selection into treatment and revealing outcomes
  dta <- dta |>
    mutate(uni = rbinom(n = N, size = 1, prob = ability/100),
           y = ifelse(uni==1, y1, y0))

  # running the regression
  model <- lm(y ~ uni + ability, data=dta)

  # storing the coefficient
  v_coef[i] <- model$coefficients[2]
}
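Averaging the stored coefficients gives the number reported below:

# average regression coefficient across all 1000 simulations
mean(v_coef)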
The average of our 1000 regression coefficients is 4996.25. So by controlling for the confounder, we can “recover” an unbiased estimate of the ATE.
So should we always just control our way out of selection bias? It all depends on whether you think you have (accurately) measured all of the possible confounders.[^1]
But maybe there are some variables which you did not – or cannot – measure?
In this example, I made up a measure of ability. But can we really measure ability (without error) in real life?
Randomization to the Rescue
Rather than relying on (complex) statistical models, randomization provides a simple way to estimate causal effects that are free from selection bias, even without using any controls.
Recall that our task is to arrive at unbiased estimates of \(\mathbb{E}[Y_i(1)]\) and \(\mathbb{E}[Y_i(0)]\).
Last week, we drew a random sample from our bowl of red and white marbles. The proportion of red marbles in our shovel constituted our “best guess” about the proportion of red marbles in the bowl.
Similarly, we will now draw a random sample from the entire set of potential untreated outcomes (\(Y_i(0)\)). The average outcome in this sample provides an unbiased estimate of the average of all potential untreated outcomes, including those we cannot observe.
\[ \mathbb{E}[\overline{Y(0)}] = \mathbb{E}[Y_i(0)] \]
The same logic applies when we draw a random sample from the entire set of potential treated outcomes (\(Y_i(1)\)).
In practice, we create these random samples by randomly assigning individuals to either a treatment or control group.
The outcomes of the people in the treatment group constitute a random sample of all potential treated outcomes (\(Y_i(1)\)).
Ditto for the outcomes of the people randomly assigned to the control group.
Let’s look at this in the simulation. We will start with a single “experiment”:
# creating a treatment variable
library(randomizr)

# randomly assigning to treatment
# and revealing the observed outcome
dta <- dta |>
  mutate(uni = complete_ra(N, prob = 0.5),
         y = ifelse(uni==1, y1, y0))

# average of the control group outcomes
mean(dta$y[dta$uni == 0])
[1] 3363.304

# average of the treatment group outcomes
mean(dta$y[dta$uni == 1])
[1] 8267.736

# estimated ATE from our single experiment
mean(dta$y[dta$uni == 1]) - mean(dta$y[dta$uni == 0])
[1] 4904.432
Recall that the “true” quantities of interest are:
- \(\mathbb{E}[Y_i(1)]\) = 8315.52
- \(\mathbb{E}[Y_i(0)]\) = 3315.52
- ATE = 5000.
From our single experiment, we estimate the following:
- \(\widehat{Y(1)}\) = 8267.736
- \(\widehat{Y(0)}\) = 3363.304
- \(\widehat{ATE}\) = 4904.432
Our estimates are obviously different from the “truth.”
But the question is not whether any single experiment is exactly right. Rather, if our estimate is unbiased, then we should be right “on average.”
Let’s check:
# number of simulations
n_sims <- 1000

# creating empty vectors to store the simulation results
v_y0 <- c()
v_y1 <- c()
v_ate <- c()

# looping
for (i in 1:n_sims) {

  # re-randomizing treatment assignment
  dta <- dta |>
    mutate(uni = complete_ra(N, prob = 0.5),
           y = ifelse(uni==1, y1, y0))

  # storing the experimental results
  v_y0[i] <- mean(dta$y[dta$uni == 0])
  v_y1[i] <- mean(dta$y[dta$uni == 1])
  v_ate[i] <- mean(dta$y[dta$uni == 1]) - mean(dta$y[dta$uni == 0])
}
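As before, we summarize by averaging the stored estimates:

# averaging the experimental estimates across all 1000 experiments
mean(v_y0)
mean(v_y1)
mean(v_ate)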
Across 1000 experiments, we find:
- the average of \(\widehat{Y(1)}\) = 8309.59985
- the average of \(\widehat{Y(0)}\) = 3321.44015
- the average of \(\widehat{ATE}\) = 4988.1597
That looks pretty good!
To recap: we just did something pretty cool.
Even though we cannot observe any single ITE, randomization still allows us to form an unbiased estimate of the ATE.
And that’s because random assignment to treatment and control groups creates random samples from the full set of (observed and unobserved) potential outcomes.
So randomization solves the fundamental problem of causal inference!
Randomization Creates “Balanced” Groups (in expectation)
Experiments bypass the omitted variable bias problem by ensuring that all confounders – both observed/measured and unobserved/unmeasured – are balanced between treatment and control groups in expectation.
In plain language, this means that the treatment and control groups are, on average, identical to each other, except that one group receives the treatment.
Since the treatment is the only systematic difference between the two groups, any difference in average outcomes must have been caused by the treatment itself, and not by any confounding factor.
Here’s what our new DAG looks like under randomization:
Notice that, under random assignment, \(Ability\) no longer determines whether someone goes to Uni.
More formally,
\[ \{Y_i(1), Y_i(0), \text{Ability}_i\} \perp\!\!\!\perp D_i \]
where \(D_i\) denotes treatment assignment (here, attending university) and \(\perp\!\!\!\perp\) means “independent of”.
Independence means that knowing whether you are in the treatment or the control group tells you nothing about the potential baseline outcome, the potential “response” to treatment, or other baseline characteristics such as ability.
Let’s see this via a simulation:
# creating empty vectors to store the simulation results
v_ability <- c()
v_ability_biased <- c()

# looping
for (i in 1:n_sims) {

  # re-randomizing treatment assignment
  dta <- dta |>
    mutate(uni = complete_ra(N, prob = 0.5),
           y = ifelse(uni==1, y1, y0))

  # storing the difference in ability between treatment and control
  v_ability[i] <- mean(dta$ability[dta$uni == 1]) - mean(dta$ability[dta$uni == 0])

  # under self-selection
  dta <- dta |>
    mutate(uni = rbinom(n = N, size = 1, prob = ability/100),
           y = ifelse(uni==1, y1, y0))

  # storing the difference in ability between treatment and control
  v_ability_biased[i] <- mean(dta$ability[dta$uni == 1]) - mean(dta$ability[dta$uni == 0])
}
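Again, a quick summary of the stored differences:

# average ability imbalance under randomization vs. self-selection
mean(v_ability)
mean(v_ability_biased)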
We observe that, under randomization, the average difference in \(Ability\) across our 1000 simulations is -0.06, which is pretty close to zero.
By contrast, had we let people self-select into university based on \(Ability\), the mean difference across 1000 simulations would be 33.22.
You don’t need controls in an experiment!
A key implication of the above is the following: in a randomized experiment, you do not need to control for confounders to obtain an unbiased estimate of the ATE.
Just to confirm, let’s estimate two models, with and without controls:
set.seed(3)

# randomly assign treatments and reveal outcomes
dta <- dta |>
  mutate(uni = complete_ra(N, prob = 0.5),
         y = ifelse(uni==1, y1, y0))

# estimate models
mod1 <- lm(y ~ uni, data=dta)
mod2 <- lm(y ~ uni + ability, data=dta)

# showing models side by side
library(modelsummary)
models <- list(mod1, mod2)
modelsummary(models,
             coef_map = c('uni' = 'Attend University',
                          'ability' = 'Ability',
                          '(Intercept)' = 'Constant'),
             gof_map = c("nobs", "r.squared"))
| | (1) | (2) |
|---|---|---|
| Attend University | 4896.064 | 4863.061 |
| | (193.014) | (56.334) |
| Ability | | 101.172 |
| | | (0.977) |
| Constant | 3367.488 | −1671.491 |
| | (136.482) | (62.893) |
| Num.Obs. | 1000 | 1000 |
| R2 | 0.392 | 0.948 |
As you can see, the coefficients barely change. And that’s because randomization has broken the correlation (or “backdoor path”) between \(Ability\) and \(Uni\).
However, you may also have noticed that the standard error (in parentheses) got smaller in Model 2 compared to Model 1.
We’ll talk about why in the next two sessions.
Footnotes
[^1]: It also depends on whether your regression model has the correct functional form, e.g., whether there are any squared terms or other non-linearities.