Two-Way Fixed Effects and Difference-in-Differences

How would you vote if your vote weren’t secret?

So this actually happened in Spain¹: in 2019, voting the European Parliamentary Elections in the municipality of Ciutadella de Menorca was conducted in public (they ran out of private voting booths).

Spain has a stigmatized right-wing political party, the Partido Popolar (PP).

Did the public voting procedure reduce PP support?

To answer this question, we can potentially make two comparisons:

Compare PP vote share in the municipality of Ciutadella de Menorca to PP vote share in other municipalities (also in the same region) where private voting booths were in use.
Compare PP vote share in the municipality of Ciutadella de Menorca in 2019 to PP vote share in the same voting stations in 2014 (when private voting stations were used).

Discuss:

Which comparison do you prefer?

What might you still be worried about with your preferred comparison?

Differences-in-Differences (DiD)

As it turns out, we should do both comparisons simultaenously.

Comparing Ciutadella to non-Ciutadella in 2019 is perhaps the most straightforward, but introduces a lot of potential confounds — for example, for reasons completely unrelated to the availability of private voting booths, maybe the PP just less stigmatized in Ciutadella vs other municipalities (even in the same region).

It’s not guaranteed that we can control for all possible confounders, which leads to bias.

Comparing Ciutadella in 2019 vs 2014 also makes sense. However, maybe the change in PP vote share is due to the absence of private voting booths, but maybe it’s also due to other inter-temporal factors – for example, maybe the PP had a corruption scandal in 2018 that lowered its popularity across the board?

However, suppose we compare non-Ciutadella municipalities between 2014 and 2019. This allows us to estimate the effect of the corruption scandal (since it presumably affected all regions).

We then subtract off this temporal change from the over-time difference in Ciutadella to isolate the effect of public voting.

That’s difference-in-differences (DiD)!

Data Generating Process

To solidify our understanding, let’s simulate some data:

Suppose the PP vote share in 2014 from voting stations located outside Ciutadella is given:

\[ Y_{rest, 2014} = \mathcal{N}(15,2) \] Further, suppose that the effect of the corruption scandal is to lower PP vote share by 3 percentage points on average:

\[ \Delta_{corrupt} \sim \mathcal{N}(-3, 0.5) \] We can therefore write the 2019 outcomes in the non-Ciutadella regions of Spain as:

\[ Y_{rest, 2019} = Y_{rest, 2014} - \Delta_{corrupt} \]

Now suppose that the PP vote share in 2014 from voting stations located inside Ciutadella is given:

\[ Y_{ciu, 2014} = \mathcal{N}(12,2) \]

Next, imagine the effect of public voting lowers PP vote share by 2 percentage points on average:

\[ \Delta_{public} \sim \mathcal{N}(-2, 0.2) \]

So we can write the outcome for Ciutadella in 2019 as:

\[ Y_{ciu, 2019} = Y_{ciu, 2014} - \Delta_{public} - \Delta_{corrupt} \]

Let’s see what this data look like:

library(tidyverse)

set.seed(1)

# parameters
N_ciu <- 25
N_rest <- 500

# creating some "wide" data
delta_c1 <- rnorm(N_rest, -3, 0.5)
y_rest_2014 <- rnorm(N_rest, 15, 2)
y_rest_2019 <- y_rest_2014 + delta_c1

delta_c2 <- rnorm(N_ciu, -3, 0.5)
delta_p <- rnorm(N_ciu, -2, 0.2)
y_ciu_2014 <- rnorm(N_ciu, 12, 2)
y_ciu_2019 <- y_ciu_2014 + delta_c2 + delta_p


# reshaping to "long" data
rest <- tibble(y_rest_2014, y_rest_2019) |> 
  mutate(id = row_number(),
         region = "rest") |> 
  pivot_longer(
    cols = starts_with("y_rest_"),
    names_prefix = "y_rest_",
    names_to = "year",
    values_to = "pp_vote"
  )

ciu <- tibble(y_ciu_2014, y_ciu_2019) |> 
  mutate(id = row_number() + N_rest,
         region = "ciutadella") |> 
  pivot_longer(
    cols = starts_with("y_ciu_"),
    names_prefix = "y_ciu_",
    names_to = "year",
    values_to = "pp_vote"
  )

dta <- rbind(rest, ciu) |> 
  mutate(year=as.numeric(year))

Now let’s visualize the means by year and group:

Recall that we want to estimate the effect of public voting in Ciutadella in 2019. This is our observed treated outcome (\(Y_i(1)\)).

We also wanted to use the blue to simulate what would have happened in Ciutadella, had private voting booths been available. For that, we use what actually did happen in the rest of Spain.

In other words, the blue line helps us to estimate the counterfactual untreated potential outcome (\(Y_i(0)\)).:

The vertical distance between the red dot and the grey dot is our causal effect of interest.

Technically, it’s an average treatment effect on the treated (ATT).

Quiz:

Using this graph, can you point out why the method is called “differece in differences”?

Key assumption: Parallel Trends

Notice that our DiD effect depends on the slope of the grey line (which is, by construction, identical to the slope of the blue line).

This illustrates the key assumption underlying DiD: namely that time trends for the untreated units provide a good counterfactual for what would have happened in the treated units in the absence of treatment.

In other words, the observed untreated unit trend and the unobserved counterfactual trend are parallel. This is called the parallel trends assumption.

Here’s what a violation might look like:

Here, we see that the solid grey line is not parallel with the blue line.

As a result, our assumed counterfactual is “off” from the “real” counterfactual, and the DiD estimate is biased (in this case, it’s too negative).

DiD Estimation

We can estimate a simple 2x2 DiD using the following interaction model:

\[ PP\;vote_{it} = \beta_0 + \beta_1 * Ciutadella_i + \beta_2 * Year2019_t + \beta_3 * Ciutadella_i * Year2019_t + e_{it} \]

Quiz:

Can you interpret the coefficients in this model?

Importantly, \(\beta_3\) tells us how much bigger the drop in PP voteshare between 2014 and 2019 is in Ciutadella, compared to the drop in the rest of Spain. Difference-in-differences!

Let’s estimate this model. Note that we need to cluster standard errors within polling stations using estimatr::lm_robust. That’s because PP voteshare is likely to be very correlated over time within the same polling station.²

library(broom)
library(estimatr)

# change the region and year variables to dummies
dta <- dta |> 
  mutate(ciutadella = ifelse(region == "ciutadella", 1, 0),
         year = factor(year),
         treated = ifelse(ciutadella==1 & year == 2019, 1, 0))

# estimate the model
did_mod <- lm_robust(pp_vote ~ ciutadella*year, 
                     data=dta,
                     se_type="stata",
                     clusters = id)

kable(tidy(did_mod)[, 1:5], digits=2)

term	estimate	std.error	statistic
(Intercept)	14.91	0.09	157.43
ciutadella	-2.96	0.49	-6.07
year2019	-2.99	0.02	-131.90
ciutadella:year2019	-2.07	0.12	-17.76

TWFE Estimation

Note that in the simple 2x2 case, the DiD model is actually equivalent to the following two-way fixed effects (TWFE) model:

\[ PP\;vote_{it} = \alpha_{id} + \alpha_{year} + \delta_{twfe} * Treated_{it} + e_{it} \]

\[ Treated_{it} = \begin{cases} 1 & \text{if a polling station is in Ciutadella AND the year is 2019} \\ 0 & \text{otherwise} \end{cases} \]

Note that the TWFE model includes a set of fixed effects for polling station (\(\alpha_{id}\)), as well as fixed effects for time period (\(\alpha_{year}\)).³

Thinking through what’s going on:

\(\alpha_{id}\) removes the influence of baseline differences between units that stay constant over time. Think of this as controlling for all time-invariant differences across polling stations.
\(\alpha_{year}\) removes the influence of temporal disturbances that affect all units equally within the same time period. Think of this as controlling for temporal shocks that affect all polling stations equally.
Finally, PTA implies that there are no time-varying confounders missing from the model

Thus, \(\delta\) gives us our estimate of the causal effect of the policy change on the treated (it’s an ATT).

Here’s another way of looking at it:

\[ \begin{cases} Y_{it}(0) & = \alpha_i + \alpha_t + \epsilon \\ Y_{it}(1) & = Y_{it}(0) + \delta \end{cases} \]

We can estimate the model using fixest::feols():

library(fixest)

twfe_mod <- feols(pp_vote ~ treated | id + year, data=dta)

kable(tidy(twfe_mod), digits=2)

term	estimate	std.error	statistic	p.value
treated	-2.07	0.12	-17.77	0

Note that feols() automatically clusters your standard errors within polling stations.

Here, the TWFE model gives the same answer as the DiD model, but we will see below that it’s also more flexible.

Between vs. Within Effects

How do fixed effects work again?

Suppose we are interested in the relationship between hours of exercise per week and the number of colds a person gets per year.

We collect data the following data on two people:

Note that we can actually “decompose” this data into two sources of variation: between and within:

person	year	exercise	colds	mean_colds	mean_exercise	within_colds	within_exercise
Andy	2022	1	4	2.33	2.33	1.67	-1.33
Andy	2023	4	1	2.33	2.33	-1.33	1.67
Andy	2024	2	2	2.33	2.33	-0.33	-0.33
Bob	2022	5	7	5.67	6.33	1.33	-1.33
Bob	2023	8	4	5.67	6.33	-1.67	1.67
Bob	2024	6	6	5.67	6.33	0.33	-0.33

Here’s how that looks graphically:

We can see that even though the cloud is upward sloping, that’s mostly due to the between-variation.

Bob exercises more than Andy, and Bob also seems to get sick more easily.

But what’s the relationship between exercise and colds, once we strip out the average between-person differences?

So here’s what the within-variation looks like:

Here, you see the sign even flipped!

Key idea: Fixed effects achieves the exact same result as de-meaning.

Everything that stays constant within units (but which can vary a lot across units) – so stuff like “mean_exercise” – is “absorbed” by the fixed effects.⁴

So the only variation which is left to exploit is the within-variation.

Now Back to Voting: we are only looking at within-variation in PP voteshares within polling stations.

Everything which is constant within a polling station is captured by the unit fixed effects.

This also goes for variables that we didn’t / couldn’t measure (such as how much the PP is stigmatized by neighborhood residents).

And everything which is constant within time periods (like our hypothetical corruption scandal) is captured by the time fixed effects.

And that’s the power of fixed effects…we can control for stuff that we don’t even have to measure!

Checking Prior Trends

Importantly, parallel trends is an assumption. We can never know if it holds or not (since we cannot observe counterfactual untreated outcomes).

However, provided we have enough pre-treatment data, we can examine prior trends.

If prior trends appear to be parallel, then this gives us assurance that what is happening in our untreated units provides a good counterfactual for what would have happened in our treated units.

To see how to analyze prior trends, let’s make up some additional data for years 2009 and 2004.

For each of these years, we will draw outcomes randomly from the following distributions:

\[ Y_{rest, year} = \mathcal{N}(15,2) \]

\[ Y_{ciu, year} = \mathcal{N}(12,2) \]

Let’s make up that data now:

set.seed(2)

# parameters
N_ciu <- 25
N_rest <- 500

# creating some more "wide" data
delta_c1 <- rnorm(N_rest, -3, 0.5)
y_rest_2004 <- rnorm(N_rest, 15, 2)
y_rest_2009 <- rnorm(N_rest, 15, 2)
y_rest_2014 <- rnorm(N_rest, 15, 2)
y_rest_2019 <- y_rest_2014 + delta_c1
y_rest_2024 <- rnorm(N_rest, 15, 2)      # we'll use this later


delta_c2 <- rnorm(N_ciu, -3, 0.5)
delta_p <- rnorm(N_ciu, -2, 0.2)
y_ciu_2004 <- rnorm(N_ciu, 12, 2)
y_ciu_2009 <- rnorm(N_ciu, 12, 2)
y_ciu_2014 <- rnorm(N_ciu, 12, 2)
y_ciu_2019 <- y_ciu_2014 + delta_c2 + delta_p
y_ciu_2024 <- rnorm(N_ciu, 12, 2)      # we'll use this later


# reshaping to "long" data
rest <- tibble(y_rest_2004, y_rest_2009, y_rest_2014, y_rest_2019, y_rest_2024) |> 
  mutate(id = row_number(),
         region = "rest") |> 
  pivot_longer(
    cols = starts_with("y_rest_"),
    names_prefix = "y_rest_",
    names_to = "year",
    values_to = "pp_vote"
  )

ciu <- tibble(y_ciu_2004, y_ciu_2009, y_ciu_2014, y_ciu_2019, y_ciu_2024) |> 
  mutate(id = row_number() + N_rest,
         region = "ciutadella") |> 
  pivot_longer(
    cols = starts_with("y_ciu_"),
    names_prefix = "y_ciu_",
    names_to = "year",
    values_to = "pp_vote"
  )

dta <- rbind(rest, ciu) |> 
  mutate(year=as.numeric(year),
         ciutadella = ifelse(region == "ciutadella", 1, 0),
         treated = ifelse(ciutadella==1 & year == 2019, 1, 0))

Here’s what the data look like:

# collapsing
toplot <- dta |> 
  group_by(region, year) |> 
  summarise(pp = mean(pp_vote),
            sd_pp = sd(pp_vote), # standard deviation
            n = n(),  # Number of observations
            se = sd_pp / sqrt(n),  # Standard error
            lower95 = pp - 1.96 * se,  # Lower bound of 95% CI
            upper95 = pp + 1.96 * se  # Upper bound of 95% CI
            ) |> 
  ungroup()


# visualization
ggplot(data=filter(toplot, year != 2024), 
       aes(x=year, y = pp, color=region)) +
  geom_point(size=3) +
  geom_ribbon(aes(ymin = lower95, 
                  ymax = upper95,
                  fill = region), 
              alpha = 0.2,
              color = NA) +
  geom_line(lty="dashed") +
  theme_minimal() +
  labs(x = "Year",
       y = "Average PP vote share",
       color = "Region") +
  guides(fill = "none") + 
  scale_color_discrete(labels=c("Ciutadella", "Rest of Spain"),
                       guide = guide_legend(reverse = TRUE))

The first check on prior trends we can do is simply run the following regression on the pre-treatment data:

\[ PP \; vote_{it} = \beta_0 + \beta_1 * Ciutadella_i + \beta_2 * year_t + \beta_3 * Ciutadella_i * year_t + e_{it} \]

This looks like the same regression as before, but now we are treating \(year\) as a continuous variable.

Quiz:

Can you interpret the coefficients in this model?

So \(\beta_2\) tells us the slope of the time trend in the rest of Spain from 2004-2014, and \(\beta_3\) tells us whether the slope in Ciutadella over the same period is significantly different.

Let’s run the regression:

m1 <- lm_robust(pp_vote ~ ciutadella*year, 
                data=filter(dta, year < 2019),
                se_type="stata",
                clusters = id)

kable(tidy(m1)[, 1:5], digits=2)

term	estimate	std.error	statistic	p.value
(Intercept)	34.97	25.19	1.39	0.17
ciutadella	104.03	127.95	0.81	0.42
year	-0.01	0.01	-0.79	0.43
ciutadella:year	-0.05	0.06	-0.84	0.40

Placebo Tests

We could also do a placebo test: are the size of the 2004 and 2009 gaps between the red and blue lines significantly different than the size of the 2014 gap?

If the gap sizes are changing, this would also be evidence that the prior trends are not parallel.

To conduct this test, we basically run the same model as above, only now we treat \(year\) as a factor variable. Conventionally, we also use the last period prior to treatment (so: 2014) as the baseline.

\[ PP \; vote_{it} = \beta_0 + \beta_1 * Ciutadella_i + \beta_2 * Y2004_t + \beta_3 * Y2009_t + \] \[ \beta_4 * Ciutadella_i * Y2004_t + \beta_5 * Ciutadella_i * Y2004_t + e_{it} \]

Quiz:

Can you interpret the coefficients in this model?

# year as a factor variable
# reference category is 2014
dta <- dta |> 
  mutate(year = factor(year),
         year = relevel(year, 3))

# model
m2 <- lm_robust(pp_vote ~ ciutadella*year, 
                data=filter(dta, year %in% c(2004, 2009, 2014)),
                se_type="stata",
                clusters = id)

kable(tidy(m2)[, 1:5], digits=2)

term	estimate	std.error	statistic	p.value
(Intercept)	15.03	0.09	164.25	0.00
ciutadella	-3.59	0.37	-9.77	0.00
year2004	0.10	0.13	0.79	0.43
year2009	0.03	0.13	0.26	0.80
ciutadella:year2004	0.53	0.64	0.84	0.40
ciutadella:year2009	0.57	0.51	1.12	0.26

Both sets of analyses indicate that prior trends are indeed parallel.

Important:

Remember that parallel trends is an assumption about how an unobserved counterfactual will behave.

It cannot be “proven” to hold by data.

In other words, even if prior trends hold, this does not guarantee that the parallel trends will hold.

However, if prior trends fail, this is evidence that parallel trends is probably not going to hold either.

Dynamic TWFEs

Often, we also have data on multiple post-treatment periods. Thus, we can examine how treatment effects evolve – e.g. whether their are “carry-over” and / or “cumulative” effects.

To do so, we just estimate the same model as above, except now we use all of the (pre-treatment and post-treatment) data.

# model
m3 <- lm_robust(pp_vote ~ ciutadella*year, 
                data=dta,
                se_type="stata",
                clusters = id)

kable(tidy(m3)[, 1:5], digits=2)

term	estimate	std.error	statistic	p.value
(Intercept)	15.03	0.09	164.23	0.00
ciutadella	-3.59	0.37	-9.77	0.00
year2004	0.10	0.13	0.79	0.43
year2009	0.03	0.13	0.26	0.80
year2019	-2.97	0.02	-128.41	0.00
year2024	0.14	0.13	1.06	0.29
ciutadella:year2004	0.53	0.64	0.84	0.40
ciutadella:year2009	0.57	0.51	1.12	0.26
ciutadella:year2019	-1.77	0.10	-17.72	0.00
ciutadella:year2024	0.66	0.53	1.24	0.21

And here’s the same model using fixest:feols():

# model
m4 <- feols(pp_vote ~ ciutadella*year | id + year, 
                data=dta)

kable(tidy(m4), digits=2)

term	estimate	std.error	statistic	p.value
ciutadella:year2004	0.53	0.64	0.84	0.40
ciutadella:year2009	0.57	0.51	1.12	0.26
ciutadella:year2019	-1.77	0.10	-17.72	0.00
ciutadella:year2024	0.66	0.53	1.24	0.21

Of course, it’s better if we plot the coefficients, rather than reporting this ugly table.

# plot data
toplot <- tidy(m4) |> 
  mutate(election = c(2004,2009,2019,2024)) |> 
  select(election, estimate, std.error) |> 
  mutate(cil = estimate - 1.96*std.error,
         cih = estimate + 1.96*std.error)

# creating the reference category
ref_data <- c(2014, 0, NA, NA, NA)

toplot <- toplot |>
  rbind(ref_data)

# plotting
ggplot(toplot, aes(x = election, y=estimate)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = cil, ymax = cih), 
                width = 0) + 
  geom_hline(yintercept = 0, color="red") +
  geom_vline(xintercept = 2017, linetype="dashed") + 
  labs(x = "Election",
       y = "Difference in PP Vote Share: \n Exposed vs Unexposed Islands") +
  theme_minimal(base_size = 14) +
  scale_x_continuous(breaks=c(2004,2009,2014,2019,2024))

Notice that this graph combines a (pre-treatment) placebo test with a test of post-treatment differences.

We can also see what might be called “long term” effects (or lack thereof).

While not important here, long-term effects may manifest in cases where the effects of a treatment only kick-in after some time.

More “advanced” cases

Remember when we said that TWFE models were more general?

That comes in once we move beyond the 2x2 case.

Here’s two examples of what that might look like:

The first case is called “staggered adoption.” Here, different units get treated at different times. There are also “never-treated” and “always-treated” units.

Importantly, once a unit gets treated, it stays treated.

That’s on contrast to the second case, where units can also “flip” in and out of treatment:

Both cases can be easily estimated with TWFE models. Just include a “currently in treatment” indicator that takes the value of 1 for each of the dark blue squares.

However, recent research has revealed that the TWFE estimator can be biased in some cases.

The basic problem is that the TWFE is a combination of several different DiDs:

switch into treatment vs. never treated
early adopter vs. late adopter
switch into treatment vs. always treated
late adopter vs. early adopter

The first two comparisons are fine, but the last two are “forbidden”.

Moreover, in the case of treatment reversal, we might think that the effect of going “into” treatment is not simply the mirror image of the effect of a “treatment reversal”. But that’s not captured in the standard TWFE estimator.

The basic solution is to break the panel up into “small” DiDs, estimate each “small” DiD separately, and then put these estimates back together, dropping the “forbidden” comparisons.

These techniques are beyond the scope of this course. But I just want to flag the general issue for you here.

A final word

DiD / TWFEs are powerful tools, but by themselves, they do not guarantee that the effects we estimate are causal.

Claims to causality must rest on the assumption of parallel trends.

How plausible is this assumption?

Think about the data generating process: does the allocation to treatment to some units and not others resemble an experiment?

If so, you are probably good!

Footnotes

This example is inspired by Vicente Valentim’s paper “Political Stigma and Preference Falsification”, the Journal of Politics (2024)↩︎
In general, you should always cluster your standard errors within units when estimating TWFE models.↩︎
In other words, we are including a separate dummy variable (leaving one out) for each id, and a separate dummy variable (leaving one out) for each year.↩︎
Technically, these things are collinear with the unit dummies.↩︎