BDA3 Chapter 5 Exercise 3

Here’s my solution to exercise 3, chapter 5, of Gelman’s Bayesian Data Analysis (BDA), 3rd edition. There are solutions to some of the exercises on the book’s webpage.

\(\DeclareMathOperator{\dbinomial}{Binomial} \DeclareMathOperator{\dbern}{Bernoulli} \DeclareMathOperator{\dpois}{Poisson} \DeclareMathOperator{\dnorm}{Normal} \DeclareMathOperator{\dt}{t} \DeclareMathOperator{\dcauchy}{Cauchy} \DeclareMathOperator{\dexponential}{Exp} \DeclareMathOperator{\duniform}{Uniform} \DeclareMathOperator{\dgamma}{Gamma} \DeclareMathOperator{\dinvgamma}{InvGamma} \DeclareMathOperator{\invlogit}{InvLogit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\ddirichlet}{Dirichlet} \DeclareMathOperator{\dbeta}{Beta}\)

We’ll reproduce some of the calculations with different priors for the eight schools example. Here is the eight schools dataset, where y is the estimated treatment effect in each school and std is its standard error.

library(tidyverse)  # read_csv(), dplyr verbs, purrr::map2(), ggplot2
library(scales)     # percent()

df <- read_csv('data/eight_schools.csv') %>% 
  mutate(school = factor(school))
school   y std
A       25  15
B        8  10
C       -3  16
D        7  11
E       -1  19
F        1  11
G       18  10
H       12  18

Uniform priors

We’ll use Stan to calculate the correct posterior for us. Note that Stan will assume a uniform prior (on the domain of the parameter) unless otherwise specified.

model <- rstan::stan_model('src/ex_05_03.stan')
S4 class stanmodel 'ex_05_03' coded as follows:
data {
  int<lower = 0> J; // number of schools 
  vector[J] y; // estimated treatment effects
  vector<lower = 0>[J] sigma; // standard errors
}

parameters {
  real mu; // pop mean
  real<lower = 0> tau; // pop std deviation
  vector[J] eta; // school-level errors
}

transformed parameters {
  vector[J] theta = mu + tau * eta; // school effects
}

model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
} 

We fit the model with the sampling function.

fit <- model %>% 
  rstan::sampling(
    data = list(
      J = nrow(df),
      y = df$y,
      sigma = df$std
    ),
    warmup = 1000,
    iter = 5000,
    chains = 4
  )
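
Before working with the draws, it’s worth a quick convergence check. This isn’t part of the original output; it’s just a minimal sanity check using rstan’s built-in summaries and diagnostics.

print(fit, pars = c('mu', 'tau'))  # R-hat and effective sample sizes
rstan::check_hmc_diagnostics(fit)  # divergences, tree depth, E-BFMI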

The tidybayes package is super useful for custom calculations from the posterior draws. We’ll also add in the original school labels.

draws <- fit %>% 
  tidybayes::spread_draws(mu, tau, eta[school_idx]) %>% 
  mutate(
    theta = mu + tau * eta,
    school = levels(df$school)[school_idx]
  ) 

We have 4 chains, each with 4000 (post-warmup) iterations, with a draw for each school parameter. Each draw is one sample from the posterior.

.chain .iteration .draw       mu      tau school_idx        eta     theta school
     1          1     1 7.807863 3.836428          1 -0.7897487  4.778049 A
     1          1     1 7.807863 3.836428          2 -1.2770111  2.908702 B
     1          1     1 7.807863 3.836428          3 -0.9996711  3.972697 C
     1          1     1 7.807863 3.836428          4 -0.5159778  5.828352 D
     1          1     1 7.807863 3.836428          5  1.0371925 11.786978 E
     1          1     1 7.807863 3.836428          6 -0.8692724  4.472962 F

Tidybayes also gives us convenient ggplot geoms for plotting the posterior distributions.
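
For example, the school-level posteriors could be drawn along these lines (a sketch; tidybayes re-exports stat_halfeye from ggdist in current versions, and the figure in the original post may have been produced differently).

draws %>% 
  ggplot(aes(x = theta, y = school)) +
  tidybayes::stat_halfeye()

We can also summarise the pairwise differences between schools with compare_levels and mean_qi.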

comparisons <- draws %>% 
  group_by(school) %>% 
  tidybayes::compare_levels(theta, by = school) %>% 
  tidybayes::mean_qi()

We can also see how the estimated treatment effect varies as a function of the population variation. The curves are noisier than in the book because we are using our posterior draws to approximate the shape, and there are relatively few draws for larger values of \(\tau\).
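
One way to approximate those curves from the draws (a sketch; the book computes the conditional means analytically, and the binning choice here is arbitrary) is to bin the draws by \(\tau\) and average \(\theta\) within each bin for each school.

draws %>% 
  ungroup() %>% 
  mutate(tau_bin = cut_number(tau, 30)) %>% 
  group_by(school, tau_bin) %>% 
  summarise(tau = median(tau), theta = mean(theta), .groups = 'drop') %>% 
  ggplot(aes(x = tau, y = theta, colour = school)) +
  geom_line()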

Here’s a simple histogram of the posterior draws for school A.
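
Something like this would draw it (a sketch; the binning in the original figure may differ).

draws %>% 
  filter(school == 'A') %>% 
  ggplot(aes(x = theta)) +
  geom_histogram(bins = 50)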

To estimate the posterior for the maximum effect, we can simply calculate the maximum effect across all schools for each posterior draw.

max_theta <- draws %>% 
  group_by(.chain, .iteration, .draw) %>% 
  slice(which.max(theta)) %>% 
  ungroup()

The probability that the maximum effect is larger than 28.4 can then be approximated by the fraction of draws larger than 28.4.

p_max_theta <- max_theta %>% 
  mutate(larger = theta > 28.4) %>% 
  summarise(p_larger = sum(larger) / n()) %>% 
  pull() %>% 
  percent()

p_max_theta
[1] "6.94%"

To estimate the probability that the effect in school A is larger than the effect in school C, we first have to spread the data so that there is one draw per row.

a_better_c <- draws %>% 
  ungroup()  %>% 
  select(.chain, .iteration, school, theta) %>% 
  spread(school, theta) %>% 
  mutate(a_minus_c = A - C) 

The probability is then just the fraction of draws where A - C > 0.

prob_a_better_c <- a_better_c %>% 
  summarise(mean(a_minus_c > 0)) %>% 
  pull() %>% 
  percent()

prob_a_better_c
[1] "65.4%"

Infinite population variance

With \(\tau = \infty\), we would expect there to be no shrinkage. From equation 5.17 (page 116), the posteriors of the school effects with \(\tau \to \infty\) are

\[ \begin{align} \theta_j \mid \mu, \tau = \infty, y \sim \dnorm\left( \bar y_{\cdot j}, \sigma_j^2 \right) \end{align} \]

since \(\frac{1}{\tau^2} \to 0\) as \(\tau \to \infty\).

iters <- 16000  # match the number of post-warmup draws from the Stan fit

# Simulate directly from the unpooled posterior: theta_j ~ Normal(y_j, sigma_j^2)
draws_infty <- df %>% 
  transmute(
    school,
    draws = map2(
      y, std, 
      function(mu, sigma) {
        tibble(
          iteration = 1:iters,
          theta = rnorm(iters, mu, sigma)
        )
      }
    )
  ) %>% 
  unnest(draws) %>% 
  arrange(iteration)
school iteration      theta
A              1 13.4118572
B              1 25.9980867
C              1  2.3087061
D              1 13.3842454
E              1  8.0263868
F              1 -0.7990375

We calculate the maximum effect just as before. The histogram shows that large treatment effects are more probable under the unpooled model than under the hierarchical model.

max_theta_infty <- draws_infty %>% 
  group_by(iteration) %>% 
  slice(which.max(theta))
p_max_theta_infty <- max_theta_infty %>% 
  ungroup() %>% 
  mutate(larger = theta > 28.4) %>% 
  summarise(p_larger = sum(larger) / n()) %>% 
  pull() %>% 
  percent()

p_max_theta_infty
[1] "64.8%"

There is now a 64.8% probability that the maximum effect exceeds 28.4 under the unpooled model, much larger than the 6.94% under the hierarchical model.
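
The histogram mentioned above could be drawn along these lines (a sketch comparing the unpooled draws with the hierarchical max_theta draws computed earlier).

bind_rows(
  hierarchical = select(ungroup(max_theta), theta),
  unpooled = select(ungroup(max_theta_infty), theta),
  .id = 'model'
) %>% 
  ggplot(aes(x = theta, fill = model)) +
  geom_histogram(position = 'identity', alpha = 0.5, bins = 50)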

For the pairwise differences, both the point estimates and the credible intervals are more extreme.

comparisons_infty <- draws_infty %>% 
  group_by(school) %>% 
  compare_levels(theta, by = school, draw_indices = c('iteration')) %>% 
  select(-starts_with('iter')) %>% 
  mean_qi()
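
One way to put the two sets of intervals side by side (a sketch using tidybayes’ point-interval geom; the original post’s figure may have been drawn differently):

bind_rows(
  hierarchical = comparisons,
  unpooled = comparisons_infty,
  .id = 'model'
) %>% 
  ggplot(aes(x = theta, xmin = .lower, xmax = .upper, y = school, colour = model)) +
  tidybayes::geom_pointinterval(position = position_dodge(width = 0.4))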

Zero population variance

With \(\tau = 0\), we would expect the estimates of the school effects to all be equal to the population effect. Letting \(\tau \to 0\) in equation 5.17 (page 116), we see that \(\theta_j \mid \mu, \tau, y\) converges to a point mass at \(\mu\). This follows from the fact that

\[ \frac{\frac{1}{\tau^2}}{c + \frac{1}{\tau^2}} \to 1 \]

for any fixed \(c\) as \(\tau \to 0\). Thus,

\[ \begin{align} \hat \theta_j &= \frac{\frac{\bar y_{\cdot j}}{\sigma_j^2}}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}} + \frac{\frac{1}{\tau^2}}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}}\mu \to 0 + \mu \\ V_j &\to 0 . \end{align} \]

It follows that, after integrating out \(\mu\), \(p(\theta_j \mid \tau, y) \to p(\mu \mid \tau, y)\) as \(\tau \to 0\). From equation 5.20 (page 117), the distribution of \(\mu \mid \tau, y\) is \(\dnorm(\hat\mu, V_\mu)\) with

\[ \begin{align} \hat \mu &= \frac{\sum_1^J \frac{1}{\sigma_j^2} \bar y_{\cdot j}}{\sum_1^J \frac{1}{\sigma_j^2}} = \bar y_{\cdot \cdot} \\ V_\mu^{-1} &= \sum_1^J \frac{1}{\sigma_j^2} . \end{align} \]
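
As a quick numerical check of these formulas, the complete-pooling estimate and its standard deviation can be computed directly from the data:

# Precision-weighted mean of the observed effects and its variance,
# i.e. the posterior for mu (and hence each theta_j) when tau = 0.
precision <- 1 / df$std^2
mu_hat <- sum(precision * df$y) / sum(precision)
v_mu <- 1 / sum(precision)
c(mu_hat = mu_hat, sd_mu = sqrt(v_mu))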