Two-Way Fixed Effects and Difference-in-Differences

How would you vote if your vote weren’t secret?

So this actually happened in Spain1: in 2019, voting in the European Parliament elections in the municipality of Ciutadella de Menorca was conducted in public (the polling stations ran out of private voting booths).

Spain has a stigmatized right-wing political party, the Partido Popular (PP).

Did the public voting procedure reduce PP support?

To answer this question, we can potentially make two comparisons:

  1. Compare PP vote share in the municipality of Ciutadella de Menorca to PP vote share in other municipalities (also in the same region) where private voting booths were in use.

  2. Compare PP vote share in the municipality of Ciutadella de Menorca in 2019 to PP vote share at the same voting stations in 2014 (when private voting booths were available).

Discuss:

Which comparison do you prefer?

What might you still be worried about with your preferred comparison?


Difference-in-Differences (DiD)

As it turns out, we should do both comparisons simultaneously.

Comparing Ciutadella to non-Ciutadella in 2019 is perhaps the most straightforward, but it introduces a lot of potential confounds. For example, for reasons completely unrelated to the availability of private voting booths, maybe the PP is simply less stigmatized in Ciutadella than in other municipalities (even in the same region).

It’s not guaranteed that we can control for all possible confounders, and any confounder we miss will bias our estimate.

Comparing Ciutadella in 2019 vs 2014 also makes sense. However, maybe the change in PP vote share is due to the absence of private voting booths, but maybe it’s also due to other inter-temporal factors – for example, maybe the PP had a corruption scandal in 2018 that lowered its popularity across the board?

Now suppose we also compare non-Ciutadella municipalities between 2014 and 2019. This allows us to estimate the effect of the corruption scandal (since it presumably affected all municipalities).

We then subtract off this temporal change from the over-time difference in Ciutadella to isolate the effect of public voting.

That’s difference-in-differences (DiD)!
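In the notation we will use below (with \(Y_{group, year}\) denoting the average PP vote share in a group-year cell), the DiD estimate is:

\[ \delta_{DiD} = (Y_{ciu, 2019} - Y_{ciu, 2014}) - (Y_{rest, 2019} - Y_{rest, 2014}) \]

The second difference subtracts the common temporal change (like the scandal) out of the first, leaving the effect of public voting.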


Data Generating Process

To solidify our understanding, let’s simulate some data:

Suppose the PP vote share in 2014 at voting stations located outside Ciutadella is given by:

\[ Y_{rest, 2014} \sim \mathcal{N}(15, 2) \]

Further, suppose that the effect of the corruption scandal is to lower PP vote share by 3 percentage points on average:

\[ \Delta_{corrupt} \sim \mathcal{N}(-3, 0.5) \]

We can therefore write the 2019 outcomes in the non-Ciutadella regions of Spain as:

\[ Y_{rest, 2019} = Y_{rest, 2014} + \Delta_{corrupt} \]

Since \(\Delta_{corrupt}\) has a negative mean, adding it lowers the 2019 vote share (this matches the simulation code below, which adds the draws).

Now suppose that the PP vote share in 2014 at voting stations located inside Ciutadella is given by:

\[ Y_{ciu, 2014} \sim \mathcal{N}(12, 2) \]

Next, imagine the effect of public voting lowers PP vote share by 2 percentage points on average:

\[ \Delta_{public} \sim \mathcal{N}(-2, 0.2) \]

So we can write the outcome for Ciutadella in 2019 as:

\[ Y_{ciu, 2019} = Y_{ciu, 2014} + \Delta_{public} + \Delta_{corrupt} \]

Let’s see what these data look like:

library(tidyverse)

set.seed(1)

# parameters
N_ciu <- 25
N_rest <- 500

# creating some "wide" data

# corruption shock and outcomes for the rest of Spain
delta_c1 <- rnorm(N_rest, -3, 0.5)
y_rest_2014 <- rnorm(N_rest, 15, 2)
y_rest_2019 <- y_rest_2014 + delta_c1

# corruption + public-voting shocks and outcomes for Ciutadella
delta_c2 <- rnorm(N_ciu, -3, 0.5)
delta_p <- rnorm(N_ciu, -2, 0.2)
y_ciu_2014 <- rnorm(N_ciu, 12, 2)
y_ciu_2019 <- y_ciu_2014 + delta_c2 + delta_p


# reshaping to "long" data
rest <- tibble(y_rest_2014, y_rest_2019) |> 
  mutate(id = row_number(),
         region = "rest") |> 
  pivot_longer(
    cols = starts_with("y_rest_"),
    names_prefix = "y_rest_",
    names_to = "year",
    values_to = "pp_vote"
  )

ciu <- tibble(y_ciu_2014, y_ciu_2019) |> 
  mutate(id = row_number() + N_rest,
         region = "ciutadella") |> 
  pivot_longer(
    cols = starts_with("y_ciu_"),
    names_prefix = "y_ciu_",
    names_to = "year",
    values_to = "pp_vote"
  )

dta <- bind_rows(rest, ciu) |> 
  mutate(year = as.numeric(year))

Now let’s visualize the means by year and group:
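Here is a minimal sketch of such a plot (the original figure also marked the observed and counterfactual 2019 points for Ciutadella):

# group-year means, plotted as two lines
dta |> 
  group_by(region, year) |> 
  summarize(pp_vote = mean(pp_vote), .groups = "drop") |> 
  ggplot(aes(x = year, y = pp_vote, color = region)) +
  geom_line() +
  geom_point(size = 3) +
  scale_color_manual(values = c("ciutadella" = "red", "rest" = "blue")) +
  scale_x_continuous(breaks = c(2014, 2019)) +
  labs(x = "Year", y = "Mean PP Vote Share") +
  theme_minimal()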

Recall that we want to estimate the effect of public voting in Ciutadella in 2019. This is our observed treated outcome (\(Y_i(1)\)).

We also want to use the blue line to approximate what would have happened in Ciutadella had private voting booths been available. For that, we use what actually did happen in the rest of Spain.

In other words, the blue line helps us to estimate the counterfactual untreated potential outcome (\(Y_i(0)\)).

The vertical distance between the red dot and the grey dot is our causal effect of interest.

Technically, it’s an average treatment effect on the treated (ATT).

Quiz:

Using this graph, can you point out why the method is called “difference in differences”?
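Before running any regressions, we can compute the 2x2 DiD by hand from the four group-year means (a quick sanity check; given the simulation, it should land near -2):

# four group-year means
cell_means <- dta |> 
  group_by(region, year) |> 
  summarize(pp_vote = mean(pp_vote), .groups = "drop")

# (change in Ciutadella) minus (change in the rest of Spain)
with(cell_means,
     (pp_vote[region == "ciutadella" & year == 2019] -
        pp_vote[region == "ciutadella" & year == 2014]) -
       (pp_vote[region == "rest" & year == 2019] -
          pp_vote[region == "rest" & year == 2014]))

This hand-computed number should match the interaction coefficient in the regression below.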


DiD Estimation

We can estimate a simple 2x2 DiD using the following interaction model:

\[ PP\;vote_{it} = \beta_0 + \beta_1 * Ciutadella_i + \beta_2 * Year2019_t + \beta_3 * Ciutadella_i * Year2019_t + e_{it} \]

Quiz:

Can you interpret the coefficients in this model?

Importantly, \(\beta_3\) tells us how much bigger the drop in PP vote share between 2014 and 2019 is in Ciutadella, compared to the drop in the rest of Spain. Difference-in-differences!

Let’s estimate this model. Note that we need to cluster standard errors within polling stations using estimatr::lm_robust(). That’s because PP vote share is likely to be highly correlated over time within the same polling station.2

library(broom)
library(estimatr)
library(knitr)   # for kable()

# change the region and year variables to dummies
dta <- dta |> 
  mutate(ciutadella = ifelse(region == "ciutadella", 1, 0),
         year = factor(year),
         treated = ifelse(ciutadella==1 & year == 2019, 1, 0))

# estimate the model
did_mod <- lm_robust(pp_vote ~ ciutadella*year, 
                     data=dta,
                     se_type="stata",
                     clusters = id)

kable(tidy(did_mod)[, 1:5], digits=2)
term                  estimate  std.error  statistic  p.value
(Intercept)              14.91       0.09     157.43        0
ciutadella               -2.96       0.49      -6.07        0
year2019                 -2.99       0.02    -131.90        0
ciutadella:year2019      -2.07       0.12     -17.76        0

TWFE Estimation

Note that in the simple 2x2 case, the DiD model is actually equivalent to the following two-way fixed effects (TWFE) model:

\[ PP\;vote_{it} = \alpha_{id} + \alpha_{year} + \delta_{twfe} * Treated_{it} + e_{it} \]

\[ Treated_{it} = \begin{cases} 1 & \text{if a polling station is in Ciutadella AND the year is 2019} \\ 0 & \text{otherwise} \end{cases} \]

Note that the TWFE model includes a set of fixed effects for polling station (\(\alpha_{id}\)), as well as fixed effects for time period (\(\alpha_{year}\)).3

Thinking through what’s going on:

  • \(\alpha_{id}\) removes the influence of baseline differences between units that stay constant over time. Think of this as controlling for all time-invariant differences across polling stations.
  • \(\alpha_{year}\) removes the influence of temporal disturbances that affect all units equally within the same time period. Think of this as controlling for temporal shocks that affect all polling stations equally.
  • Finally, the parallel trends assumption (PTA) implies that there are no time-varying confounders missing from the model.

Thus, \(\delta_{twfe}\) gives us our estimate of the causal effect of the policy change on the treated (it’s an ATT).

Here’s another way of looking at it:

\[ \begin{cases} Y_{it}(0) & = \alpha_i + \alpha_t + \epsilon_{it} \\ Y_{it}(1) & = Y_{it}(0) + \delta \end{cases} \]

We can estimate the model using fixest::feols():

library(fixest)

twfe_mod <- feols(pp_vote ~ treated | id + year, data=dta)

kable(tidy(twfe_mod), digits=2)
term     estimate  std.error  statistic  p.value
treated     -2.07       0.12     -17.77        0

Note that feols() automatically clusters standard errors by the first fixed effect, which here is the polling station id.

Here, the TWFE model gives the same answer as the DiD model, but we will see below that it’s also more flexible.


Between vs. Within Effects

How do fixed effects work again?

Suppose we are interested in the relationship between hours of exercise per week and the number of colds a person gets per year.

We collect the following data on two people:

Note that we can actually “decompose” this data into two sources of variation: between and within:

person  year  exercise  colds  mean_colds  mean_exercise  within_colds  within_exercise
Andy    2022         1      4        2.33           2.33          1.67            -1.33
Andy    2023         4      1        2.33           2.33         -1.33             1.67
Andy    2024         2      2        2.33           2.33         -0.33            -0.33
Bob     2022         5      7        5.67           6.33          1.33            -1.33
Bob     2023         8      4        5.67           6.33         -1.67             1.67
Bob     2024         6      6        5.67           6.33          0.33            -0.33
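Here’s a sketch of how that decomposition can be computed in R (re-entering the table above by hand):

# the exercise/colds data from the table above
colds_df <- tribble(
  ~person, ~year, ~exercise, ~colds,
  "Andy",  2022,  1, 4,
  "Andy",  2023,  4, 1,
  "Andy",  2024,  2, 2,
  "Bob",   2022,  5, 7,
  "Bob",   2023,  8, 4,
  "Bob",   2024,  6, 6
)

# between = person means; within = deviations from those means
colds_within <- colds_df |> 
  group_by(person) |> 
  mutate(mean_colds = mean(colds),
         mean_exercise = mean(exercise),
         within_colds = colds - mean_colds,
         within_exercise = exercise - mean_exercise) |> 
  ungroup()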

Here’s how that looks graphically:

We can see that even though the overall point cloud is upward sloping, that’s mostly due to the between-variation.

Bob exercises more than Andy, and Bob also seems to get sick more easily.

But what’s the relationship between exercise and colds, once we strip out the average between-person differences?

So here’s what the within-variation looks like:

Here, you can see that the sign has even flipped!


Key idea: Fixed effects achieve exactly the same result as de-meaning.

Everything that stays constant within units (but which can vary a lot across units) – so stuff like “mean_exercise” – is “absorbed” by the fixed effects.4

So the only variation which is left to exploit is the within-variation.
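We can verify this with the colds data from above: the fixed-effects slope on exercise is identical to the slope from regressing the de-meaned outcome on the de-meaned predictor.

# FE regression (the person dummies are absorbed by feols)
fe_fit <- feols(colds ~ exercise | person, data = colds_within)

# OLS on the de-meaned variables
dm_fit <- lm(within_colds ~ within_exercise, data = colds_within)

coef(fe_fit)  # slope on exercise
coef(dm_fit)  # identical slope on within_exercise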

Now back to voting: we are only exploiting within-variation in PP vote share within polling stations.

Everything which is constant within a polling station is captured by the unit fixed effects.

This also goes for variables that we didn’t / couldn’t measure (such as how much the PP is stigmatized by neighborhood residents).

And everything which is constant within time periods (like our hypothetical corruption scandal) is captured by the time fixed effects.

And that’s the power of fixed effects…we can control for stuff that we don’t even have to measure!


Placebo Tests

We could also do a placebo test: are the gaps between the red and blue lines in 2004 and 2009 significantly different from the gap in 2014?

If the gap sizes are changing, this would also be evidence that the prior trends are not parallel.

To conduct this test, we basically run the same model as above, only now we treat \(year\) as a factor variable with multiple pre-treatment levels. Conventionally, we use the last period prior to treatment (so: 2014) as the baseline category.

\[ PP \; vote_{it} = \beta_0 + \beta_1 * Ciutadella_i + \beta_2 * Y2004_t + \beta_3 * Y2009_t + \beta_4 * Ciutadella_i * Y2004_t + \beta_5 * Ciutadella_i * Y2009_t + e_{it} \]

Quiz:

Can you interpret the coefficients in this model?
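One housekeeping note before we estimate: the simulation code above only generated the 2014 and 2019 elections, while the models below also use 2004, 2009, and (later) 2024. The original code for those years isn’t shown; here is one hypothetical way to extend the simulation, with parallel pre-trends and a 2024 in which the scandal dip persists but the public-voting effect does not:

# hypothetical extension of the simulation (a sketch, not the original DGP)
dta <- dta |> 
  mutate(year = as.numeric(as.character(year)))  # factor back to numeric

extra_years <- dta |> 
  filter(year == 2014) |> 
  select(id, region, ciutadella) |> 
  crossing(year = c(2004, 2009, 2024)) |> 
  mutate(pp_vote = rnorm(n(), ifelse(ciutadella == 1, 12, 15), 2) +
           ifelse(year == 2024, -3, 0),  # scandal dip persists in 2024
         treated = 0)

dta <- bind_rows(dta, extra_years)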

# year as a factor variable
# reference category is 2014
dta <- dta |> 
  mutate(year = factor(year),
         year = relevel(year, ref = "2014"))

# model
m2 <- lm_robust(pp_vote ~ ciutadella*year, 
                data=filter(dta, year %in% c(2004, 2009, 2014)),
                se_type="stata",
                clusters = id)

kable(tidy(m2)[, 1:5], digits=2)
term                  estimate  std.error  statistic  p.value
(Intercept)              15.03       0.09     164.25     0.00
ciutadella               -3.59       0.37      -9.77     0.00
year2004                  0.10       0.13       0.79     0.43
year2009                  0.03       0.13       0.26     0.80
ciutadella:year2004       0.53       0.64       0.84     0.40
ciutadella:year2009       0.57       0.51       1.12     0.26

Both placebo comparisons (2004 and 2009) indicate that prior trends are indeed parallel.

Important:

Remember that parallel trends is an assumption about how an unobserved counterfactual will behave.

It cannot be “proven” to hold by data.

In other words, even if prior trends are parallel, this does not guarantee that parallel trends will hold after treatment.

However, if prior trends fail, this is evidence that parallel trends is probably not going to hold either.


Dynamic TWFEs

Often, we also have data on multiple post-treatment periods. Thus, we can examine how treatment effects evolve – e.g. whether there are “carry-over” and/or “cumulative” effects.

To do so, we just estimate the same model as above, except now we use all of the (pre-treatment and post-treatment) data.

# model
m3 <- lm_robust(pp_vote ~ ciutadella*year, 
                data=dta,
                se_type="stata",
                clusters = id)

kable(tidy(m3)[, 1:5], digits=2)
term                  estimate  std.error  statistic  p.value
(Intercept)              15.03       0.09     164.23     0.00
ciutadella               -3.59       0.37      -9.77     0.00
year2004                  0.10       0.13       0.79     0.43
year2009                  0.03       0.13       0.26     0.80
year2019                 -2.97       0.02    -128.41     0.00
year2024                  0.14       0.13       1.06     0.29
ciutadella:year2004       0.53       0.64       0.84     0.40
ciutadella:year2009       0.57       0.51       1.12     0.26
ciutadella:year2019      -1.77       0.10     -17.72     0.00
ciutadella:year2024       0.66       0.53       1.24     0.21

And here’s the same model using fixest::feols():

# model
m4 <- feols(pp_vote ~ ciutadella*year | id + year, 
            data = dta)

kable(tidy(m4), digits=2)
term                  estimate  std.error  statistic  p.value
ciutadella:year2004       0.53       0.64       0.84     0.40
ciutadella:year2009       0.57       0.51       1.12     0.26
ciutadella:year2019      -1.77       0.10     -17.72     0.00
ciutadella:year2024       0.66       0.53       1.24     0.21

Of course, it’s better if we plot the coefficients, rather than reporting this ugly table.

# plot data
toplot <- tidy(m4) |> 
  mutate(election = c(2004,2009,2019,2024)) |> 
  select(election, estimate, std.error) |> 
  mutate(cil = estimate - 1.96*std.error,
         cih = estimate + 1.96*std.error)

# adding the reference category (2014, normalized to zero)
toplot <- toplot |>
  add_row(election = 2014, estimate = 0)

# plotting
ggplot(toplot, aes(x = election, y=estimate)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = cil, ymax = cih), 
                width = 0) + 
  geom_hline(yintercept = 0, color="red") +
  geom_vline(xintercept = 2017, linetype="dashed") + 
  labs(x = "Election",
       y = "Difference in PP Vote Share: \n Exposed vs Unexposed Islands") +
  theme_minimal(base_size = 14) +
  scale_x_continuous(breaks=c(2004,2009,2014,2019,2024))

Notice that this graph combines a (pre-treatment) placebo test with a test of post-treatment differences.

We can also see what might be called “long-term” effects (or lack thereof).

While not important here, long-term effects may manifest in cases where the effects of a treatment only kick in after some time.


More “advanced” cases

Remember when we said that TWFE models were more general?

That comes in once we move beyond the 2x2 case.

Here are two examples of what that might look like:

Adapted from Yiqing Xu’s lecture slides

The first case is called “staggered adoption.” Here, different units get treated at different times. There are also “never-treated” and “always-treated” units.

Importantly, once a unit gets treated, it stays treated.

That’s in contrast to the second case, where units can also “flip” in and out of treatment:

Adapted from Yiqing Xu’s lecture slides

Both cases can be easily estimated with TWFE models. Just include a “currently in treatment” indicator that takes the value of 1 for each of the dark blue squares.
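For instance, here is a minimal sketch of the staggered-adoption case (the panel and all variable names are hypothetical, invented for illustration):

# hypothetical staggered-adoption panel: units adopt in different years
# and stay treated once they adopt
set.seed(2)
panel_df <- expand_grid(unit = 1:50, year = 2010:2020) |> 
  mutate(adopt_year = 2012 + (unit %% 5) * 2,           # staggered adoption years
         treated_now = as.numeric(year >= adopt_year),  # "currently in treatment"
         y = 0.1 * unit + 0.05 * year + 2 * treated_now + rnorm(n()))

# the same TWFE recipe as before
staggered_mod <- feols(y ~ treated_now | unit + year, data = panel_df)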

However, recent research has revealed that the TWFE estimator can be biased in some cases.

The basic problem is that the TWFE is a combination of several different DiDs:

  • switch into treatment vs. never treated
  • early adopter vs. late adopter
  • switch into treatment vs. always treated
  • late adopter vs. early adopter

The first two comparisons are fine, but the last two are “forbidden”.

Moreover, in the case of treatment reversal, we might think that the effect of going “into” treatment is not simply the mirror image of the effect of a “treatment reversal”. But that’s not captured in the standard TWFE estimator.

The basic solution is to break the panel up into “small” DiDs, estimate each “small” DiD separately, and then put these estimates back together, dropping the “forbidden” comparisons.

These techniques are beyond the scope of this course. But I just want to flag the general issue for you here.


A final word

DiD / TWFEs are powerful tools, but by themselves, they do not guarantee that the effects we estimate are causal.

Claims to causality must rest on the assumption of parallel trends.

How plausible is this assumption?

Think about the data generating process: does the allocation of treatment to some units and not others resemble an experiment?

If so, you are probably good!

Footnotes

  1. This example is inspired by Vicente Valentim’s paper “Political Stigma and Preference Falsification,” Journal of Politics (2024).

  2. In general, you should always cluster your standard errors within units when estimating TWFE models.

  3. In other words, we are including a separate dummy variable (leaving one out) for each id, and a separate dummy variable (leaving one out) for each year.

  4. Technically, these things are collinear with the unit dummies.