BDA3 Chapter 2 Exercise 13
Here’s my solution to exercise 13, chapter 2, of Gelman’s Bayesian Data Analysis (BDA), 3rd edition. There are solutions to some of the exercises on the book’s webpage.
We are given data on airline deaths and asked to fit various models to that data.
We are given the data shown below. The data didn’t seem to be available anywhere so I created the csv file myself.
Let’s get acquainted with the data by plotting it as a timeseries.
We model the number of fatal accidents as poisson \(y \mid \theta \sim \dpois(\theta)\), where we put a \(\theta \sim \dgamma(\alpha, \beta)\) prior on the parameter. I don’t really have any strong prior knowledge about the number of annual fatal flight accidents. I’ll use the gamma approximation to Jeffrey’s prior from the previous exercise, even though it places probability on very extreme values. We’ll stick with this prior throughout.
shape <- 0.5 rate <- .Machine$double.xmin
The posterior is \(\dgamma(0.5 + n\bar y, n) = \dgamma(0.5 + 238, 10)\).
To obtain a 95% posterior predictive interval, we draw \(\theta\) from its posterior, then draw \(y\) from the corresponding Poisson distribution. With these draws, we can obtain the necessary quantiles.
n_draws <- 50000 theta_posterior_a <- rgamma(n_draws, shape + sum_fatal_accidents, rate + n_observations ) y_pp_a <- rpois(n_draws, theta_posterior_a) mu_a <- mean(y_pp_a) ci_a <- quantile(y_pp_a, c(0.05, 0.95)) ci_a
5% 95% 16 33
In part a, we ignored how many flights there are. We can incorporate this information into our model by using
passenger_miles as a measure of exposure. The parameter \(\theta\) is now the rate of fatal accidents per year per 100 million passenger miles. Note that this rate is over an order of magnitude smaller than the death rate in the table because the number of fatal accidents is an order of magnitude smaller than the number of passenger deaths. The posterior is \(\theta \mid y \sim \dgamma(0.5 + 238, 57158.69)\).
The 95% posterior predictive interval seems to be shifted upwards compared to the interval in part a.
theta_posterior_b <- rgamma(n_draws, shape + sum_fatal_accidents, rate + sum_passenger_miles ) y_pp_b <- rpois(n_draws, theta_posterior_b * 8000) mu_b <- mean(y_pp_b) ci_b <- quantile(y_pp_b, c(0.05, 0.95)) ci_b
5% 95% 24 44
Here we use the same model as in part a but for the number of passenger deaths instead of fatal accidents.
Only 1 of the 10 observations in the dataset lie within the 95% posterior predictive interval.
theta_posterior_c <- rgamma(n_draws, shape + sum_passenger_deaths, rate + n_observations ) y_pp_c <- rpois(n_draws, theta_posterior_c) mu_c <- mean(y_pp_c) ci_c <- quantile(y_pp_c, c(0.05, 0.95)) ci_c
5% 95% 647 737
Now we use the same model as in part b but for passenger deaths instead of fatal accidents. The posterior is \(\dgamma(0.5 + 238, 57158.69)\).
None of the observed values falls into the 95% posterior predictive interval.
theta_posterior_d <- rgamma(n_draws, shape + sum_passenger_deaths, rate + sum_passenger_miles ) y_pp_d <- rpois(n_draws, theta_posterior_d * 8000) mu_d <- mean(y_pp_d) ci_d <- quantile(y_pp_d, c(0.05, 0.95)) ci_d
5% 95% 914 1023
There are a number of issues to consider that are not mentioned in the question or suggested by the data. The number of fatal accidents depends on the number of miles flown by airplanes: if there are more flights, there will likely be more accidents. However, the number of flights isn’t directly accounted for in the number of passenger miles since the number of passengers per flight can vary from year to year. In any case, the number of passenger deaths per year is not independent because passengers on the same flight will have more similar survival chances.