Potential Outcomes and Average Treatment Effects
Let’s start off with a question that’s (hopefully) relevant to you:
What is a university education worth in terms of future earnings on the job market?
Or, to put it more bluntly, does going to university cause (future) you to earn more money?
Potential Outcomes
The potential outcomes framework provides a great way for thinking through causal claims.
Under this framework, there are two “potential states of the world”, which we can think of as parallel universes:
- one universe where you go to university
- a parallel universe where literally everything else is the same, except that you do not go to university
Suppose we are able to measure your income at age 30 in both universes. We therefore have two potential outcomes:
- \(Y_i(1)\) = your income in the universe where you go to uni
- \(Y_i(0)\) = your income in the universe where you do not
The causal effect of university on your income is therefore \(\tau_i \equiv Y_i(1) - Y_i(0)\). This is called the individual treatment effect (ITE).
We can also calculate an average treatment effect (ATE) which is simply the average or expectation of the ITEs:
\[ ATE \equiv \mathbb{E}[Y_i(1) - Y_i(0)] \]
Let’s run a simulation where we can calculate the ITEs and ATE.
In this simulation, there are 1000 people.
Each person has some innate “ability”, which we can operationalize as a random draw from a uniform distribution between 0 (= lowest ability person) and 100 (= highest ability person).
Each person exists in two states of the world, corresponding to our two parallel universes. In the universe where they do not go to university, income is given by the following data generating process:
\[ Y_i(0) = \text{Ability}_i^2 + \epsilon_i \] where \(\epsilon_i\) is just some random “noise” drawn from a normal distribution with mean = 0 and standard deviation = 500:
\[ \epsilon_i \sim \mathcal{N}(0, 500) \]
In the universe where the person does go to uni, income is 5000 EUR higher:
\[ Y_i(1) = Y_i(0) + 5000 \]
Note that this treatment effect is identical for every single person.
Let’s make up the data:
# load packages
library(tidyverse)

# parameters
set.seed(1)
N <- 1000
ability <- runif(N, min=0, max=100)
e <- rnorm(N, mean=0, sd=500)
effect <- 5000

# creating the data
dta <- tibble(ability, e) |>
  mutate(y0 = round(ability^2 + e, 0),
         y1 = y0 + effect,
         ite = y1 - y0)

# printing the first few rows
head(dta[,c(3,4,5)])
# A tibble: 6 × 3
y0 y1 ite
<dbl> <dbl> <dbl>
1 744 5744 5000
2 1236 6236 5000
3 2690 7690 5000
4 8254 13254 5000
5 903 5903 5000
6 8868 13868 5000
The Fundamental Problem of Causal Inference
Of course, if we could somehow observe both potential outcomes, then causal inference would be easy!
But in real life, each person either goes to university or does not. We cannot observe them in both states of the world.
Thus, one potential outcome is always missing, and we cannot calculate the ITE:
| \(Y_i(0)\) | \(Y_i(1)\) | \(ITE_i\) |
|---|---|---|
| 744 | NA | NA |
| NA | 6236 | NA |
| NA | 7690 | NA |
| 8254 | NA | NA |
| NA | 5903 | NA |
| 8868 | NA | NA |
So how do we solve this problem?
As it turns out, although we cannot obtain the ITE, we can still estimate the ATE.
The answer begins with a simple equality:
\[ ATE \equiv \mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)] \]
This looks more complicated than it is, so let’s break it down with a simple example:
| \(Y_i(0)\) | \(Y_i(1)\) | \(ITE_i\) |
|---|---|---|
| 1 | 4 | 3 |
| 1 | 4 | 3 |
| 2 | 5 | 3 |
| 4 | 7 | 3 |
| \(\mathbb{E}[Y_i(0)]\) = 2 | \(\mathbb{E}[Y_i(1)]\) = 5 | \(\mathbb{E}[Y_i(1) - Y_i(0)]\) = 3 |
In this example, we can either:
- go row by row, filling in the third column by calculating \(Y_i(1) - Y_i(0)\),
- then average down the third column
or
- average down the \(Y_i(0)\) column to get \(\mathbb{E}[Y_i(0)]\),
- then average down the \(Y_i(1)\) column to get \(\mathbb{E}[Y_i(1)]\),
- calculate \(\mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]\) from the last row alone.
We get the same answer either way.
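Just to confirm, here is the same toy example in R (the vectors y0_toy and y1_toy are made-up names holding the numbers from the table above):

# toy example: two equivalent ways to compute the ATE
y0_toy <- c(1, 1, 2, 4)
y1_toy <- c(4, 4, 5, 7)

# strategy 1: average the row-by-row ITEs
mean(y1_toy - y0_toy)
[1] 3

# strategy 2: difference of the column averages
mean(y1_toy) - mean(y0_toy)
[1] 3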
The first strategy is impossible in the real world (due to the fundamental problem of causal inference). So we are left with the second strategy.
More specifically, our task is to estimate \(\mathbb{E}[Y_i(1)]\) and \(\mathbb{E}[Y_i(0)]\).
The key question is: are these estimates biased or unbiased?
Selection Bias
Let’s assume we are working with observational data.
In observational data, we as researchers do not get to decide who goes to uni and who does not. Rather, our subjects decide for themselves.
This self-selection into treatment creates problems for causal inference. More specifically, a “naive” comparison of the incomes of university graduates vs. people without university degrees is likely to be biased.
To see this in action, let’s suppose that \(Ability\) not only determines income, but also determines one’s chances of attending university:
\[ Pr(\text{Uni}) = Ability / 100 \]
Here, we have modeled the probability of attending uni as a coin toss, where \(Ability\) determines the “fairness” of the coin.
Thus, someone with \(Ability_i\) = 20 has a 20% chance of going to uni, while someone else with \(Ability_i\) = 80 has an 80% chance.
Let’s add this to our data:
# recreating the process by which people self-select into university
dta <- dta |>
  mutate(uni = rbinom(n = N, size = 1, prob = ability/100),
         uni = factor(uni))
# plotting the relationship between ability and university attendance
ggplot(dta, aes(x=ability, fill=uni)) +
  geom_histogram(color="white",
                 boundary=0,
                 binwidth=5) +
  theme_minimal() +
  scale_fill_discrete(labels=c("No", "Yes")) +
  labs(x = "Ability",
       y = "Count",
       fill = "Attend University?")
Now let’s run some simulations. Each time, we will calculate the difference in mean incomes between people who did and did not attend uni. And then we will average across all of our simulations.
set.seed(2)
n_sims <- 1000

# creating empty vectors to store the simulation results
v_y0 <- c()
v_y1 <- c()
v_diff <- c()

# looping
for (i in 1:n_sims) {

  # self-selection into treatment and revealing outcomes
  dta <- dta |>
    mutate(uni = rbinom(n = N, size = 1, prob = ability/100),
           y = ifelse(uni==1, y1, y0))

  # storing the average untreated outcomes
  v_y0[i] <- mean(dta$y[dta$uni == 0])

  # storing the average treated outcomes
  v_y1[i] <- mean(dta$y[dta$uni == 1])

  # storing the difference in means
  v_diff[i] <- mean(dta$y[dta$uni == 1]) - mean(dta$y[dta$uni == 0])
}
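To get the numbers reported below, we simply average the stored vectors (a quick summary using the vectors created in the loop above):

# averaging the stored results across all 1000 simulations
mean(v_y0)
mean(v_y1)
mean(v_diff)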
The averages of the untreated and treated outcomes across our 1000 simulations are:
- \(\widehat{Y(0)}\) = 1636.2
- \(\widehat{Y(1)}\) = 9993.45
And the average difference in means is 8357.25.
Because we made up the data generating process, we know that the “true” quantities of interest are:
- \(\mathbb{E}[Y_i(0)]\) = 3315.52
- \(\mathbb{E}[Y_i(1)]\) = 8315.52
- ATE = 5000.
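Since the simulated data contain the full schedule of potential outcomes, these “true” quantities are easy to compute directly:

# "true" quantities, computed from the full schedule of potential outcomes
mean(dta$y0)   # E[Y_i(0)]
mean(dta$y1)   # E[Y_i(1)]
mean(dta$ite)  # ATE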
So why is our estimated difference-in-means so far off from the “true” ATE?
What is selection bias?
It’s the difference in baseline (potential untreated) outcomes between people who did and did not attend university.
In this case, people with higher ability were more likely to self-select into university, and that same ability also leads them to earn higher incomes.
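More formally, the naive difference in means decomposes into the treatment effect on the treated plus a selection-bias term (writing \(D_i = 1\) for attending university):
\[ \underbrace{\mathbb{E}[Y_i \mid D_i=1] - \mathbb{E}[Y_i \mid D_i=0]}_{\text{difference in means}} = \underbrace{\mathbb{E}[Y_i(1) - Y_i(0) \mid D_i=1]}_{\text{effect on the treated}} + \underbrace{\mathbb{E}[Y_i(0) \mid D_i=1] - \mathbb{E}[Y_i(0) \mid D_i=0]}_{\text{selection bias}} \]
Because the treatment effect in our simulation is a constant 5000, the first term equals the ATE, so the naive difference in means should equal 5000 plus the selection bias.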
Since we made up the data, we have the full schedule of potential outcomes, and we can calculate the average amount of selection bias across our 1000 simulations:
set.seed(2)

# creating empty vector to store the simulation results
v_bias <- c()

# looping
for (i in 1:n_sims) {

  # self-selection into treatment
  dta <- dta |>
    mutate(uni = rbinom(n = N, size = 1, prob = ability/100))

  # storing the selection bias
  v_bias[i] <- mean(dta$y0[dta$uni == 1]) - mean(dta$y0[dta$uni == 0])
}
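Averaging the stored bias across simulations gives the number reported below:

# average selection bias across all 1000 simulations
mean(v_bias)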
The average bias is 3357.25.
And 3357.25 + 5000 = 8357.25, exactly the naive difference in means we estimated above.
Yay, our math works!
Why not just control for ability?
In this example, \(Ability\) acts as a confounder because it determines both our treatment variable \(Uni\) and our outcome \(Y\).
In your other classes, you probably learned that we can eliminate this type of bias by controlling for confounders in a regression:
\[ Y_i = \beta_0 + \beta_1 * Uni_i + \beta_2 * Ability_i + \epsilon_i \]
Let’s try it:
set.seed(2)

# creating empty vector to store the regression coefficient
v_coef <- c()

# looping
for (i in 1:n_sims) {

  # self-selection into treatment and revealing outcomes
  dta <- dta |>
    mutate(uni = rbinom(n = N, size = 1, prob = ability/100),
           y = ifelse(uni==1, y1, y0))

  # running the regression
  model <- lm(y ~ uni + ability, data=dta)

  # storing the coefficient
  v_coef[i] <- model$coefficients[2]
}
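Averaging the stored coefficients gives the number reported below:

# average regression coefficient across all 1000 simulations
mean(v_coef)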
The average of our 1000 regression coefficients is 4996.25. So by controlling for the confounder, we can “recover” an unbiased estimate of the ATE.
So should we always just control our way out of selection bias? It all depends on whether you think you have (accurately) measured all of the possible confounders.[^1]
But maybe there are some variables which you did not – or cannot – measure?
In this example, I made up a measure of ability. But can we really measure ability (without error) in real life?
Randomization to the Rescue
Rather than relying on (complex) statistical models, randomization provides a simple way to estimate causal effects that are free from selection bias, even without using any controls.
Recall that our task is to arrive at unbiased estimates of \(\mathbb{E}[Y_i(1)]\) and \(\mathbb{E}[Y_i(0)]\).
Last week, we drew a random sample from our bowl of red and white marbles. The proportion of red marbles in our shovel constituted our “best guess” about the proportion of red marbles in the bowl.
Similarly, we will now draw a random sample from the entire set of potential untreated outcomes (\(Y_i(0)\)). The average outcome in this sample provides an unbiased estimate of the average of all potential untreated outcomes, including those we cannot observe.
\[ \mathbb{E}[\overline{Y(0)}] = \mathbb{E}[Y_i(0)] \]
The same logic applies when we draw a random sample from the entire set of potential treated outcomes (\(Y_i(1)\)).
In practice, we create these random samples by randomly assigning individuals to either a treatment or control group.
The outcomes of the people in the treatment group constitute a random sample of all potential treated outcomes (\(Y_i(1)\)).
Ditto for the outcomes of the people randomly assigned to the control group.
Let’s look at this in the simulation. We will start with a single “experiment”:
# creating a treatment variable
library(randomizr)

# randomly assigning to treatment
# and revealing the observed outcome
dta <- dta |>
  mutate(uni = complete_ra(N, prob = 0.5),
         y = ifelse(uni==1, y1, y0))

# average of the control group outcomes
mean(dta$y[dta$uni == 0])
[1] 3363.304

# average of the treatment group outcomes
mean(dta$y[dta$uni == 1])
[1] 8267.736

# estimated ATE from our single experiment
mean(dta$y[dta$uni == 1]) - mean(dta$y[dta$uni == 0])
[1] 4904.432
Recall that the “true” quantities of interest are:
- \(\mathbb{E}[Y_i(1)]\) = 8315.52
- \(\mathbb{E}[Y_i(0)]\) = 3315.52
- ATE = 5000.
From our single experiment, we estimate the following:
- \(\widehat{Y(1)}\) = 8267.736
- \(\widehat{Y(0)}\) = 3363.304
- \(\widehat{ATE}\) = 4904.432
Our estimates are obviously different from the “truth.”
But the question is not whether any single experiment is exactly right. Rather, if our estimate is unbiased, then we should be right “on average.”
Let’s check:
# number of simulations
n_sims <- 1000

# creating empty vectors to store the simulation results
v_y0 <- c()
v_y1 <- c()
v_ate <- c()

# looping
for (i in 1:n_sims) {

  # re-randomizing treatment assignment
  dta <- dta |>
    mutate(uni = complete_ra(N, prob = 0.5),
           y = ifelse(uni==1, y1, y0))

  # storing the experimental results
  v_y0[i] <- mean(dta$y[dta$uni == 0])
  v_y1[i] <- mean(dta$y[dta$uni == 1])
  v_ate[i] <- mean(dta$y[dta$uni == 1]) - mean(dta$y[dta$uni == 0])
}
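As before, we summarize by averaging the stored estimates:

# averaging the experimental estimates across all 1000 experiments
mean(v_y0)
mean(v_y1)
mean(v_ate)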
Across 1000 experiments, we find:
- the average of \(\widehat{Y(1)}\) = 8309.59985
- the average of \(\widehat{Y(0)}\) = 3321.44015
- the average of \(\widehat{ATE}\) = 4988.1597
That looks pretty good!
To recap: we just did something pretty cool.
Even though we cannot observe any single ITE, randomization still allows us to form an unbiased estimate of the ATE.
And that’s because random assignment to treatment and control groups creates random samples from the full set of (observed and unobserved) potential outcomes.
So randomization solves the fundamental problem of causal inference!
Randomization Creates “Balanced” Groups (in expectation)
Experiments bypass the omitted variable bias problem by ensuring that all confounders – both observed/measured and unobserved/unmeasured – are balanced between treatment and control groups in expectation.
In plain language, this means that the treatment and control groups are, on average, identical to each other, except that one group receives the treatment.
Since the treatment is the only systematic difference between the two groups, any difference in average outcomes must have been caused by the treatment itself, and not by any confounding factor.
Here’s what our new DAG looks like under randomization:
Notice that, under random assignment, \(Ability\) no longer determines whether someone goes to Uni.
More formally,
\[ \{Y_i(1), Y_i(0), \text{Ability}_i\} \perp\!\!\!\perp D_i \]
where \(D_i\) denotes treatment assignment (here, attending university) and \(\perp\!\!\!\perp\) means “independent of”.
Independence means that knowing whether you are in the treatment or the control group tells you nothing about the potential baseline outcome, the potential “response” to treatment, or other baseline characteristics such as ability.
Let’s see this via a simulation:
# creating empty vectors to store the simulation results
v_ability <- c()
v_ability_biased <- c()

# looping
for (i in 1:n_sims) {

  # re-randomizing treatment assignment
  dta <- dta |>
    mutate(uni = complete_ra(N, prob = 0.5),
           y = ifelse(uni==1, y1, y0))

  # storing the difference in ability between treatment and control
  v_ability[i] <- mean(dta$ability[dta$uni == 1]) - mean(dta$ability[dta$uni == 0])

  # under self-selection
  dta <- dta |>
    mutate(uni = rbinom(n = N, size = 1, prob = ability/100),
           y = ifelse(uni==1, y1, y0))

  # storing the difference in ability between treatment and control
  v_ability_biased[i] <- mean(dta$ability[dta$uni == 1]) - mean(dta$ability[dta$uni == 0])
}
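Again, a quick summary of the stored differences:

# average ability imbalance under randomization vs. self-selection
mean(v_ability)
mean(v_ability_biased)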
We observe that, under randomization, the average difference in \(Ability\) across our 1000 simulations is -0.06, which is pretty close to zero.
By contrast, had we let people self-select into university based on \(Ability\), the mean difference across 1000 simulations would be 33.22.
You don’t need controls in an experiment!
A key implication of the above is the following: in a randomized experiment, you do not need to control for confounders to obtain an unbiased estimate of the ATE.
Just to confirm, let’s estimate two models, with and without controls:
set.seed(3)

# randomly assign treatments and reveal outcomes
dta <- dta |>
  mutate(uni = complete_ra(N, prob = 0.5),
         y = ifelse(uni==1, y1, y0))

# estimate models
mod1 <- lm(y ~ uni, data=dta)
mod2 <- lm(y ~ uni + ability, data=dta)

# showing models side by side
library(modelsummary)
models <- list(mod1, mod2)
modelsummary(models,
             coef_map = c('uni' = 'Attend University',
                          'ability' = 'Ability',
                          '(Intercept)' = 'Constant'),
             gof_map = c("nobs", "r.squared"))
| | (1) | (2) |
|---|---|---|
| Attend University | 4896.064 | 4863.061 |
| | (193.014) | (56.334) |
| Ability | | 101.172 |
| | | (0.977) |
| Constant | 3367.488 | −1671.491 |
| | (136.482) | (62.893) |
| Num.Obs. | 1000 | 1000 |
| R2 | 0.392 | 0.948 |
As you can see, the coefficients barely change. And that’s because randomization has broken the correlation (or “backdoor path”) between \(Ability\) and \(Uni\).
However, you may also have noticed that the standard error (in parentheses) got smaller in Model 2 compared to Model 1.
We’ll talk about why in the next two sessions.
Footnotes
[^1]: It also depends on whether your regression model has the correct functional form, e.g., whether there are any squared terms or other non-linearities.