BDA3 Chapter 3 Exercise 8

Here’s my solution to exercise 8, chapter 3, of Gelman’s Bayesian Data Analysis (BDA), 3rd edition. There are solutions to some of the exercises on the book’s webpage.

\(\DeclareMathOperator{\dbinomial}{Binomial} \DeclareMathOperator{\dbern}{Bernoulli} \DeclareMathOperator{\dpois}{Poisson} \DeclareMathOperator{\dnorm}{Normal} \DeclareMathOperator{\dt}{t} \DeclareMathOperator{\dcauchy}{Cauchy} \DeclareMathOperator{\dexponential}{Exp} \DeclareMathOperator{\duniform}{Uniform} \DeclareMathOperator{\dgamma}{Gamma} \DeclareMathOperator{\dinvgamma}{InvGamma} \DeclareMathOperator{\invlogit}{InvLogit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\ddirichlet}{Dirichlet} \DeclareMathOperator{\dbeta}{Beta}\)

You can download the full dataset shown in table 3.3. Let’s load it into a dataframe and select just the residential data, as suggested.

df0 <- read_csv('data/chapter_03_exercise_08.csv') %>% 
  mutate(
    type = as_factor(
      type, 
      levels = c('residential', 'fairly_busy', 'busy'), 
      ordered = TRUE
    ),
    bikes = as.integer(bikes),
    other = as.integer(other)
  )

df <- df0 %>% 
  filter(type == 'residential') %>% 
  mutate(
    total = bikes + other,
    bike_fraction = bikes / total,
    other_fraction = other / total
  )

Here are the first few rows with each value of bike_route.

type bike_route bikes other total bike_fraction other_fraction
residential TRUE 16 58 74 0.2162162 0.7837838
residential TRUE 9 90 99 0.0909091 0.9090909
residential TRUE 10 48 58 0.1724138 0.8275862
residential FALSE 12 113 125 0.0960000 0.9040000
residential FALSE 1 18 19 0.0526316 0.9473684
residential FALSE 2 14 16 0.1250000 0.8750000

We’ll use an uninformative gamma prior with a Poisson likelihood for the counts. The posterior can then be calculated as follows.

draws <- 10000

shape_prior <- 2
rate_prior <- 0

posterior <- function(data, draws = 10000) {
  
  bikes <- data %>% pull(bikes)
  other <- data %>% pull(other)
  n <- data %>% pull(n)
  
  tibble(draw = 1:draws) %>%
    mutate(
      theta_bike = rgamma(draws, bikes, n),
      theta_other = rgamma(draws, other, n),
      mu = rpois(draws, theta_bike),
      p = theta_bike / (theta_bike + theta_other)
    )
  
}

posterior_draws <- df %>% 
  group_by(bike_route) %>% 
  summarise(
    bikes = sum(bikes) + shape_prior,
    other = sum(other) + shape_prior,
    n = n() + rate_prior
  ) %>% 
  nest(-bike_route) %>% 
  mutate(draws = map(data, posterior, draws)) %>% 
  unnest(draws)

Plotting posterior predictive draws of \(\theta_y\) and \(\theta_z\), we can see that there seems to be quite a difference.

posterior_draws %>% 
  ggplot() +
  aes(mu, fill = bike_route) +
  geom_bar(position = 'identity', alpha = 0.75) +
  labs(
    x = 'Bike count',
    y = 'Count',
    fill = 'Has bike route?',
    title = 'Posterior expectation of bike count'
  )

To quantify this difference, we’ll have to match up our posterior draws for \(\theta_y\) and \(\theta_z\).

difference <- posterior_draws %>% 
  select(draw, bike_route, mu) %>% 
  spread(bike_route, mu) %>% 
  mutate(difference = `TRUE` - `FALSE`) 

The difference \(\mu_y - \mu_z\) has the following 95% credible interval:

difference %>% 
  pull(difference) %>% 
  quantile(probs = c(0.05, 0.5, 0.95))
 5% 50% 95% 
  6  15  24