Probability and Statistics Review
Probability – or the “doctrine of chances” – describes how often different kinds of events will happen:
- What are the chances of a fair coin coming up heads 10 times in a row?
- If I roll two six sided dice, how likely is it that I’ll roll two sixes?
- How likely is it that five cards drawn from a perfectly shuffled deck will all be hearts?
Notice that, in each case, we start with a known model of the world.
For instance, even though we don’t know the exact outcome of a single coin toss, we know that we are equally likely to get heads or tails.
Or, even though we don’t know the exact cards that I will get in poker, we know (i) exactly what cards are in the deck, and (ii) that the cards were randomly shuffled.
We can write down the model as:
\[ Pr(\text{heads}) = 0.5 \]
or
\[ Pr(5 \; \text{hearts}) = \frac{13}{52} \times \frac{12}{51} \times \frac{11}{50} \times \frac{10}{49} \times \frac{9}{48} \]
This known model of the world is called a data generating process.
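As a quick sanity check, we can evaluate the five-hearts product directly in R (base R arithmetic, nothing package-specific):
# probability that 5 cards drawn without replacement are all hearts
prod(13:9 / 52:48)
[1] 0.0004951981
So we should expect an all-hearts hand about 5 times in every 10,000 draws from a perfectly shuffled deck.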
Probability Distributions
Suppose you roll two dice – say, one red and one white – and add up the two numbers.
How many possible outcomes are there? And how likely is each outcome?
One way to see this is a table of all 36 (equally likely) combinations, where rows show the red die, columns show the white die, and each cell shows the resulting sum:

Red \ White | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 |
2 | 3 | 4 | 5 | 6 | 7 | 8 |
3 | 4 | 5 | 6 | 7 | 8 | 9 |
4 | 5 | 6 | 7 | 8 | 9 | 10 |
5 | 6 | 7 | 8 | 9 | 10 | 11 |
6 | 7 | 8 | 9 | 10 | 11 | 12 |

Here’s another way of representing the same information: a bar chart in which the x-axis displays each of the possible sums, and the y-axis displays the probability that this sum will occur.
This is called a probability distribution.
Note that the total probabilities (i.e. the sum of the bar heights) must add up to 1.
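If you want to reproduce such a chart yourself, here is one way to do it with ggplot2 – a sketch, not necessarily how the original figure was made:
# load the tidyverse package
library(tidyverse)

# enumerate all 36 equally likely outcomes of the two dice
sums <- expand_grid(die1 = 1:6, die2 = 1:6) |>
  mutate(total = die1 + die2) |>
  count(total) |>
  mutate(prob = n / sum(n))

# bar chart of the probability distribution
ggplot(data = sums, mapping = aes(x = total, y = prob)) +
  geom_col(fill = "cornflowerblue") +
  scale_x_continuous(breaks = 2:12) +
  theme_minimal() +
  labs(x = "Sum of Two Dice", y = "Probability")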
Expected Values
Rolling dice or tossing coins are random data generating processes.
And although we won’t know the exact outcome of a random process (before carrying it out), we can know its expected value, which we denote with the \(\mathbb{E}\) operator. For a discrete random variable, the expected value is the probability-weighted average of all possible outcomes:
\[ \mathbb{E}[X] = \sum_{x} x \cdot Pr(X = x) \]
That’s very abstract, so let’s do a concrete example. Suppose we are tossing a coin over and over again, and we record the proportion of heads that we have obtained so far:
# load the tidyverse package
library(tidyverse)

# simulating tossing a coin 1000 times
set.seed(2)
n_tosses <- 1000
coins_faces <- c("heads", "tails")
coin_tosses <- sample(coins_faces, n_tosses, replace = TRUE)

# saving the outcomes as a data frame
dta <- tibble(coin_tosses)

# create additional variables as columns in the data frame:
# N counts the number of coin tosses so far
# Y converts heads into 1 (and tails into 0)
# C computes the number of heads so far
# P is the proportion of heads so far
dta <- dta |>
  mutate(N = 1:n_tosses,
         Y = ifelse(coin_tosses == "heads", 1, 0),
         C = cumsum(Y),
         P = C / N)

# plotting
ggplot(data = dta, mapping = aes(x = N, y = P)) +
  geom_point(size = 0.5) +
  geom_hline(yintercept = 0.5,
             color = "red",
             lty = "dashed") +
  theme_minimal() +
  labs(x = "Number of Coin Tosses",
       y = "Proportion of Heads")
We observe that the proportion of heads initially jumps all over the place, but then converges to P = 0.5 as we toss the coin over and over again ad infinitum.
This is what we mean when we say that \(\mathbb{E}[Y_i]\) = 0.5.
Conditional Expectations
Let’s go back to our table:
Suppose I told you that you rolled a 6 on the red die. What would you now expect the sum of the two dice to be?
This question is asking about a conditional expectation:
\[ \mathbb{E}[\text{Die1}_i + \text{Die2}_i \; | \; \text{Die1}_i = 6] \]
We can also look at the difference between two conditional expectations:
\[ \mathbb{E}[\text{Die1}_i + \text{Die2}_i \; | \; \text{Die1}_i = 6] - \mathbb{E}[\text{Die1}_i + \text{Die2}_i \; | \; \text{Die1}_i = 1] \]
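We can compute both of these directly. The two dice are independent, so whatever the first die shows, the second die still has expected value \(\frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5\). Therefore:
\[ \mathbb{E}[\text{Die1}_i + \text{Die2}_i \; | \; \text{Die1}_i = 6] = 6 + 3.5 = 9.5 \]
\[ \mathbb{E}[\text{Die1}_i + \text{Die2}_i \; | \; \text{Die1}_i = 1] = 1 + 3.5 = 4.5 \]
so the difference between the two conditional expectations is \(9.5 - 4.5 = 5\).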
Conditional expectations are important once we start talking about experiments.
In experiments, we want to compare the expected value of the outcome for subjects in the treatment group (\(D=1\)) vs. subjects in the control group (\(D=0\)).
More precisely, if we want to learn about treatment effects, we will want to estimate:
\[ \mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0] \]
That looks complicated, but it’s basically just the same thing we just did.
More precisely, it’s the difference-in-means between the two groups.
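In R, this difference-in-means is a one-liner. Here is a minimal sketch, assuming a hypothetical data frame df with an outcome column Y and a 0/1 treatment indicator D (these names are made up for illustration):
# difference-in-means between treatment (D = 1) and control (D = 0)
# df, Y, and D are hypothetical names
mean(df$Y[df$D == 1]) - mean(df$Y[df$D == 0])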
The notation will come up again and again in your readings, so get comfortable with it!
Inferential Statistics
In statistics we do not know the “true model” of the world.
Instead, all we have are a sample of data, and we are trying to use this information to make an inference about the true data generating process.
For example:
- If I flip a coin 10 times and get 10 heads, is the coin fair?
- If five cards off the top of the deck are all hearts, how likely is it that the deck was shuffled?
In each case, I am trying to guess what model of the world (e.g. “fair” vs. “unfair”, or “shuffled” vs. “unshuffled”) is most likely to have produced the data that I have.
Hopefully, you can see why inferential statistics was once called “inverse probability”…
Now let’s start building the bridge from probability to statistics.
We’ll use a simple example from the ebook moderndive.
Imagine we have a large bowl of red and white marbles.
We want to find out: what proportion of the marbles are red?
Or, to put this in more formal math terms:
\[ Y_i = \begin{cases} 1 & \text{if marble is red} \\ 0 & \text{if marble is white} \end{cases} \] and we want to find out \(\mathbb{E}[Y_i]\). Note that because \(Y_i\) only takes the values 0 and 1, \(\mathbb{E}[Y_i]\) is exactly \(Pr(\text{red})\): the proportion of red marbles in the bowl.
Sampling
From before, we know that if we repeatedly draw marbles from the bowl (and put the marble back), and keep a running tally of the proportion of red marbles, we will converge on the truth.
But this is rarely how sampling works in practice. Think of an opinion poll: pollsters who want to estimate the share of SPD supporters in the population cannot keep calling people ad infinitum (and if they tried, they might call the same person twice).
Rather, they draw a sample of fixed size (without replacement), calculate the proportion of SPD supporters in that sample, and use that information to estimate the population parameter that they are interested in.
We will do the same thing here.
We have a shovel with which to scoop out 50 marbles from the bowl.
Suppose we take one shovelful of marbles from the bowl, and observe that 17 of them are red.
Our natural guess for the proportion of red marbles in the bowl is then \(17/50 = 0.34\).
Unbiasedness
As it turns out, our guess actually has a really nice property: it is right “on average”.
To see what that means, suppose we use the shovel over and over again (which we of course don’t get to do in real life).
To get a feel for what happens, let’s program a simulation:
# loading packages
library(tidyverse)

# the moderndive package comes with a preloaded bowl of marbles
# this object is called "bowl"
library(moderndive)

marbles <- tibble(bowl) |>
  mutate(y = ifelse(color == "red", 1, 0))

# parameters of simulation
shovel_size <- 50

# a single draw
# and calculate the mean
sample(marbles$y, shovel_size, replace = FALSE) |>
  mean()
[1] 0.36
You’ll see that if you run the sample() function over and over again, you’ll get a different answer every time.
Suppose we draw lots of samples from the bowl and write down the proportion of red marbles from each draw.
Some of these draws will give us “too many” red marbles, and some of them will give us “too few.”
But just as in our coin-tossing example above, when we average across our large number of samples, we should approach the “truth”.
# setting seed for reproducibility
set.seed(1)

# define the number of draws
n_draws <- 1000

# an empty vector to store the results from each draw
dta <- c()

# using a "for" loop to do something lots of times
for (i in 1:n_draws) {
  # draw the marbles
  draw <- sample(marbles$y, shovel_size, replace = FALSE)

  # storing the proportion red in our vector
  dta[i] <- mean(draw)
}

# calculating the average of all of our draws
avg_shovel <- mean(dta)

# making a histogram of the distribution of our draws
# first, we need to save our vector of results to a data frame
toplot <- as_tibble(dta) |>
  rename(prop_red = value)

hist_50 <- ggplot(data = toplot, mapping = aes(x = prop_red)) +
  geom_histogram(color = "white",
                 binwidth = 0.02,
                 fill = "cornflowerblue") +
  geom_vline(xintercept = avg_shovel,
             lty = "dashed",
             color = "red") +
  theme_minimal() +
  labs(x = "Proportion Red Marbles",
       y = "Number of Shovels",
       subtitle = "Shovel Size 50") +
  coord_cartesian(xlim = c(0.15, 0.65),
                  ylim = c(0, 175))

hist_50
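As an aside, the same repeated-sampling simulation can be written more compactly with base R’s replicate() function; this is just an equivalent alternative to the "for" loop above:
# repeat the draw-and-average step n_draws times, collecting the results
dta <- replicate(n_draws,
                 mean(sample(marbles$y, shovel_size, replace = FALSE)))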
We observe that across all of our 1000 draws, the “average” shovel contained 37.92% red marbles.
We calculated this above using the code mean(dta).
But we also see from the histogram that some shovels had less than 30% red marbles, and some had over 50%.
The nice thing about simulations is that we also know the “true” proportion of red marbles in the bowl. It’s 37.5%. We calculate this using the code mean(marbles$y).
So we see that our “average shovel” is not far off.
In other words, even though each of our individual estimates could be far from the mark, the average of our estimates approaches the “right” answer.
This property is called unbiasedness.
More formally, if we denote the sample average from a single draw as \(\bar{Y}\), unbiasedness means that
\[ \mathbb{E}[\bar{Y}] = \mathbb{E}[Y_i] \]
Or, in English: the average of repeated samples from the bowl is equal to the average of all of the marbles in the bowl.
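For the curious, there is a one-line argument for why the sample mean is unbiased under random sampling: expectations are linear, and each randomly drawn marble has the same expected value as a single marble from the whole bowl. Writing \(n\) for the sample size:
\[ \mathbb{E}[\bar{Y}] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} Y_i\right] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[Y_i] = \mathbb{E}[Y_i] \]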
An Example with Bias
Above, we got an unbiased estimate because we drew a random sample every time.
But there are lots of estimation procedures that will produce biased estimates instead.
Suppose we did the following:
- draw a sample of 50 marbles and count the proportion of red ones
- if the proportion is \(\geq 0.4\), record the data
- if the proportion is \(< 0.4\), throw away the sample and redraw it
Let’s see this in action:
# an empty vector to store the results from each draw
dta_biased <- c()

# using a "for" loop to do something lots of times
for (i in 1:n_draws) {
  # draw the marbles
  draw <- sample(marbles$y, shovel_size, replace = FALSE)

  # checking the proportion red
  check <- mean(draw)

  # using a "while" loop to keep doing something until a condition is met
  while (check < 0.4) {
    # draw the marbles again
    draw <- sample(marbles$y, shovel_size, replace = FALSE)

    # checking the proportion red
    check <- mean(draw)
  }

  # storing the proportion red from the final kept draw in our vector
  dta_biased[i] <- mean(draw)
}

# calculating the average of all of our draws
avg_biased_shovel <- mean(dta_biased)

# making a histogram of the distribution of our draws
toplot_biased <- as_tibble(dta_biased) |>
  rename(prop_red = value)

hist_biased <- ggplot(data = toplot_biased, mapping = aes(x = prop_red)) +
  geom_histogram(color = "white",
                 binwidth = 0.02,
                 fill = "cornflowerblue") +
  geom_vline(xintercept = avg_biased_shovel,
             lty = "dashed",
             color = "red") +
  geom_vline(xintercept = mean(marbles$y),
             lty = "solid",
             color = "green") +
  theme_minimal() +
  labs(x = "Proportion Red Marbles from Biased Procedure",
       y = "Number of Shovels") +
  coord_cartesian(xlim = c(0.15, 0.65),
                  ylim = c(0, 175))

hist_biased
Here, the average across all of our samples is 43.84%.
Thus, even though we drew lots of samples, the average of our estimates does not converge to the truth.
\[ \mathbb{E}[\bar{Y}] \neq \mathbb{E}[Y_i] \]
We will come back to the problem of bias throughout the course.
Uncertainty
Let’s look again at our unbiased distribution of estimates (the shovel-size-50 histogram above).
Even though the histogram “averages out” to the truth, there’s a lot of variance in our estimates.
You may recall that variance provides a way to describe how “noisy” a list of numbers is:
\[ \text{Var}(X) = \frac{1}{N}\sum_{i=1}^{N}{(X_i - \bar{X})^2} \]
For example, consider these two lists of numbers, each with a mean of 3:

X | Y |
---|---|
1 | 2 |
1 | 2 |
3 | 3 |
3 | 3 |
5 | 4 |
5 | 4 |
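Plugging each list into the variance formula shows that X is much “noisier” than Y, even though both have the same mean:
\[ \text{Var}(X) = \frac{(-2)^2 + (-2)^2 + 0^2 + 0^2 + 2^2 + 2^2}{6} = \frac{16}{6} \approx 2.67 \]
\[ \text{Var}(Y) = \frac{(-1)^2 + (-1)^2 + 0^2 + 0^2 + 1^2 + 1^2}{6} = \frac{4}{6} \approx 0.67 \]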
In addition to the variance, we often use its square root: the standard deviation.
Here’s the standard deviation of our estimated proportion of red marbles from each of our 1000 unbiased samples:
round(sd(toplot$prop_red),4)
[1] 0.0695
You may also recall that, for (approximately) normally distributed data, 95% of the values are supposed to fall within \(\pm 1.96\) standard deviations of the mean.
In other words, out of our 1000 estimates, about 950 of them should fall between 37.92% \(\pm 1.96 \times\) 6.95%, i.e. in the interval [24.3%, 51.53%].
# calculating the actual number of shovels which fall within the 95% range
hi <- mean(toplot$prop_red) + 1.96 * sd(toplot$prop_red)
lo <- mean(toplot$prop_red) - 1.96 * sd(toplot$prop_red)

toplot <- toplot |>
  mutate(in_ci = ifelse(prop_red > lo & prop_red < hi, 1, 0))

# counting up the shovels that fall within this range:
sum(toplot$in_ci)
[1] 940
In fact, 940 draws actually fall within this range.
Precision
Why do we care about this?
Because in real life, we cannot keep repeating studies ad infinitum and then average up our results.
Rather, we usually run a study (“use the shovel”) once.
Thus, we only have one estimate of the “truth”.
And even though this estimate may be unbiased (that is, our estimation procedure yields the correct answer “on average”), we still would prefer our estimate to be closer to the truth than farther away.
Put differently, we want our histograms of estimates to be “skinnier” rather than “fatter”.
Skinnier histograms are composed of more precise estimates.
A Larger Shovel
As it turns out, there are lots of different ways to achieve this goal.
We’ll talk about a few throughout this course, but one intuitive method is to increase the size (N) of our study.
Think of this as equivalent to using a larger shovel.
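A standard result from sampling theory (stated here without derivation) explains why this works: if we take \(N\) draws to estimate a proportion \(p\), the standard deviation of the sample proportion is approximately
\[ \sqrt{\frac{p(1-p)}{N}} \]
With \(p = 0.375\), this gives roughly 0.068 for \(N = 50\) (close to the 0.0695 we simulated above) and roughly 0.048 for \(N = 100\): doubling the shovel size shrinks the spread by a factor of \(\sqrt{2}\).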
# an empty vector to store the results from each draw
dta <- c()

# parameters of simulation
shovel_size <- 100

# using a "for" loop to do something lots of times
for (i in 1:n_draws) {
  # draw the marbles
  draw <- sample(marbles$y, shovel_size, replace = FALSE)

  # storing the proportion red in our vector
  dta[i] <- mean(draw)
}

# calculating the average of all of our draws
avg_shovel_large <- mean(dta)

# making a histogram of the distribution of our draws
# first, we need to save our vector of results to a data frame
toplot_large <- as_tibble(dta) |>
  rename(prop_red = value)

hist_100 <- ggplot(data = toplot_large, mapping = aes(x = prop_red)) +
  geom_histogram(color = "white",
                 binwidth = 0.02,
                 fill = "cornflowerblue") +
  geom_vline(xintercept = avg_shovel_large,
             lty = "dashed",
             color = "red") +
  theme_minimal() +
  labs(x = "Proportion Red Marbles",
       y = "Number of Shovels",
       subtitle = "Shovel Size 100") +
  coord_cartesian(xlim = c(0.15, 0.65),
                  ylim = c(0, 175))

# placing the two histograms on top of each other
library(patchwork)
hist_50 / hist_100 +
  plot_layout(axes = "collect")

Comparing the two histograms, the distribution of estimates from the larger shovel is visibly narrower: larger samples yield more precise estimates.
Summary of Key Ideas
Statistics is about using a limited set of data to make inferences about a larger population.
We want our inferences to be unbiased: that is, if we ran our study a large number of times, we want our estimate to be right “on average”.
But, in reality, we can only run our study once. Our one estimate therefore forms our “best guess” about the true data generating process in the population.
Provided we designed and implemented our study correctly (and we’ll talk about what “correctly” means throughout the whole course), our estimate will be unbiased, meaning that it recovers the “truth” on average.
However, even though our estimate may be unbiased, it is still subject to uncertainty.
Of course, we want that uncertainty to be as small as possible – that is, we would prefer our estimates to be more precise.
In all cases though, it is important to understand and communicate not only the estimate itself, but also the uncertainty surrounding it.
Note that precision and bias should be thought of as separate concepts. Here’s a diagram illustrating this distinction: