In a previous post, we described how a model of customer lifetime value (CLV) works, implemented it in Stan, and fit the model to simulated data. In this post, we’ll extend the model to use hierarchical priors in two different ways: centred and non-centred parameterisations. I’m not aware of any other HMC-based implementations of this hierarchical CLV model, so we’ll run some basic tests to check it’s doing the right thing. More specifically, we’ll fit it to a dataset drawn from the prior predictive distribution. The resulting fits pass the main diagnostic tests and the 90% posterior intervals capture about 91% of the true parameter values.

A word of warning: I came across a number of examples where the model showed severe E-BFMI problems. This seems to happen when the parameters \(\mu\) (inverse expected lifetime) and \(\lambda\) (expected purchase rate) are fairly similar, but I haven’t pinned down a solid reason for these energy problems yet. I suspect this has something to do with the difficulty in distinguishing a short lifetime from a low purchase rate. This is a topic we’ll leave for a future post.

We’ll work with raw Stan, which can be a bit fiddly at times. To keep this post readable, I’ll define some custom functions to simplify the process; check out the full code to see the details of these functions.

\(\DeclareMathOperator{\dbinomial}{Binomial}\DeclareMathOperator{\dbernoulli}{Bernoulli}\DeclareMathOperator{\dpoisson}{Poisson}\DeclareMathOperator{\dnormal}{Normal}\DeclareMathOperator{\dt}{t}\DeclareMathOperator{\dcauchy}{Cauchy}\DeclareMathOperator{\dexponential}{Exp}\DeclareMathOperator{\duniform}{Uniform}\DeclareMathOperator{\dgamma}{Gamma}\DeclareMathOperator{\dinvgamma}{InvGamma}\DeclareMathOperator{\invlogit}{InvLogit}\DeclareMathOperator{\logit}{Logit}\DeclareMathOperator{\ddirichlet}{Dirichlet}\DeclareMathOperator{\dbeta}{Beta}\)

Let’s recap the story from last time. We have a 2-year-old company that has grown linearly over that time to a total of 1000 customers.

```
set.seed(65130) # https://www.random.org/integers/?num=2&min=1&max=100000&col=5&base=10&format=html&rnd=new
customers <- tibble(id = 1:1000) %>%
  mutate(
    end = 2 * 365,
    start = runif(n(), 0, end - 1),
    T = end - start
  )
```

Within a customer’s lifetime \(\tau\), they will purchase at Poisson rate \(\lambda\). We can simulate the time \(t\) of the last observed purchase and the number of purchases \(k\) with `sample_conditional`.

```
sample_conditional <- function(T, tau, lambda) {
  # start with 0 purchases
  t <- 0
  k <- 0
  # simulate time till next purchase
  wait <- rexp(1, lambda)
  # keep purchasing till end of life/observation time
  while (t + wait <= pmin(T, tau)) {
    t <- t + wait
    k <- k + 1
    wait <- rexp(1, lambda)
  }
  # return tabular data
  tibble(
    t = t,
    k = k
  )
}

s <- sample_conditional(300, 200, 1)
```

t | k |
---|---|
198.2929 | 169 |

In the above example, even though the observation time is \(T = 300\), the time \(t\) till last purchase will always be below the lifetime \(\tau = 200\). With a purchase rate of 1 per unit time, we expect around \(k = 200\) purchases.

We’ll use the same likelihood as before, which says that the probability of customer \(i\)’s data given their parameters is

\[ \begin{align} \mathbb P(k, t, T \mid \mu_i, \lambda_i) &= \frac{\lambda_i^k}{\lambda_i + \mu_i} \left( \mu_i e^{-t(\lambda_i + \mu_i)} + \lambda_i e^{-T(\lambda_i + \mu_i)} \right) \\ &\propto p \dpoisson(k \mid t\lambda_i)S(t \mid \mu_i) \\ &\hphantom{\propto} + (1 - p) \dpoisson(k \mid t\lambda_i)\dpoisson(0 \mid (T-t)\lambda_i)S(T \mid \mu_i) , \\ p &:= \frac{\mu_i}{\lambda_i + \mu_i} , \end{align} \]

where \(S\) is the exponential survival function.
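
As a quick sanity check of the proportionality claim, here is a Python sketch (the post’s code is R; all names here are illustrative) confirming that the mixture form differs from the closed-form expression only by the factor \(t^k / k!\), which does not involve the parameters:

```python
import math

def closed_form(k, t, T, mu, lam):
    # lambda^k / (lambda + mu) * (mu * e^{-t(lambda+mu)} + lambda * e^{-T(lambda+mu)})
    return lam**k / (lam + mu) * (
        mu * math.exp(-t * (lam + mu)) + lam * math.exp(-T * (lam + mu))
    )

def pois(k, rate_t):
    # Poisson pmf with expected count rate_t
    return rate_t**k * math.exp(-rate_t) / math.factorial(k)

def mixture_form(k, t, T, mu, lam):
    p = mu / (lam + mu)                   # probability of censoring at t
    surv = lambda x, m: math.exp(-m * x)  # exponential survival function
    return (p * pois(k, t * lam) * surv(t, mu)
            + (1 - p) * pois(k, t * lam) * pois(0, (T - t) * lam) * surv(T, mu))

# The ratio should be t^k / k! for any (mu, lambda) pair.
k, t, T = 5, 120.0, 300.0
ratios = [mixture_form(k, t, T, mu, lam) / closed_form(k, t, T, mu, lam)
          for mu, lam in [(0.01, 0.1), (0.05, 0.02), (0.2, 0.5)]]
```

Since the ratio is constant across parameter values, the two forms induce the same posterior.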

To turn this into a Bayesian model, we’ll need priors for the parameters. Last time, we put simple gamma priors on the parameters \(\mu_i\) and \(\lambda_i\); for example, \(\lambda_i \sim \dgamma(2, 28)\) (and similarly for \(\mu_i\)). This time we’re going hierarchical, and there are various ways to do it. Let’s look at two of them.

One method arises directly from the difficulty of specifying the gamma-prior parameters: simply turn those parameters into random variables to be estimated along with \(\mu_i\) and \(\lambda_i\). For example, we say \(\lambda_i \sim \dgamma(\alpha, \beta)\), where \(\alpha \sim \dgamma(\alpha_\alpha, \beta_\alpha)\) and \(\beta \sim \dgamma(\alpha_\beta, \beta_\beta)\) (and similarly for \(\mu_i\)). We eventually want to incorporate covariates, which is difficult with this parameterisation, so let’s move on to a different idea.

Another solution is to use log-normal priors. This means setting

\[ \begin{align} \lambda_i &:= \exp(\alpha_i) \\ \alpha_i &\sim \dnormal(\beta, \sigma) \\ \beta &\sim \dnormal(m, s_m) \\ \sigma &\sim \dnormal_+(0, s) , \end{align} \]

where \(m\), \(s_m\), and \(s\) are constants specified by the user (and similarly for \(\mu_i\)). This implies

- there is an overall mean value \(e^\beta\),
- the customer-level effects \(\alpha_i\) are deviations from the overall mean, and
- the extent of these deviations is controlled by the magnitude of \(\sigma\).

With \(\sigma \approx 0\), there can be very little deviation from the mean, so most customers would be practically identical. On the other hand, large values of \(\sigma\) allow customers to be (almost) completely unrelated to each other. In this way, \(\sigma\) regularises the model.
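
To see the regularising role of \(\sigma\) concretely, here is a small Python simulation (illustrative numbers, not the post’s code) comparing the spread of customer-level purchase rates \(\lambda_i = \exp(\alpha_i)\) under a small and a large \(\sigma\):

```python
import numpy as np

rng = np.random.default_rng(42)
beta = np.log(1 / 14)  # overall mean rate: roughly one purchase per fortnight

# sigma near 0: customers nearly identical; large sigma: customers nearly unrelated
lam_tight = np.exp(rng.normal(beta, 0.05, size=10_000))
lam_loose = np.exp(rng.normal(beta, 1.50, size=10_000))

# ratio of the 95th to the 5th percentile as a crude measure of spread
spread = lambda x: np.quantile(x, 0.95) / np.quantile(x, 0.05)
```

With \(\sigma = 0.05\) the fastest customers purchase barely 20% faster than the slowest; with \(\sigma = 1.5\) they differ by two orders of magnitude.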

The above parameterisation is called “centred”, which basically means the prior for \(\alpha_i\) is expressed in terms of other parameters (\(\beta\), \(\sigma\)). This can be rewritten as a “non-centred” parameterisation as

\[ \begin{align} \lambda_i &:= \exp(\beta + \sigma \alpha_i) \\ \alpha_i &\sim \dnormal(0, 1) \\ \beta &\sim \dnormal(m, s_m) \\ \sigma &\sim \dnormal_+(0, s) . \end{align} \]

Notice the priors now contain no references to any other parameters. This is equivalent to the centred parameterisation because \(\beta + \sigma \alpha_i \sim \dnormal(\beta, \sigma)\). The non-centred parameterisation is interesting because it is known to increase the sampling efficiency of HMC-based samplers (such as Stan’s) in some cases.
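
We can check this equivalence numerically with a quick Python sketch (illustrative, not part of the post’s code): shifted and scaled standard-normal draws should match direct \(\dnormal(\beta, \sigma)\) draws in distribution.

```python
import numpy as np

rng = np.random.default_rng(7)
beta, sigma, n = 3.4, 0.8, 200_000

centred = rng.normal(beta, sigma, n)             # direct draws from Normal(beta, sigma)
noncentred = beta + sigma * rng.normal(0, 1, n)  # standard-normal draws, shifted and scaled

# the two samples should agree closely in mean and standard deviation
```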

Here is a centred Stan implementation of our log-normal hierarchical model.

```
centred <- here::here('models/rf_centred.stan') %>%
  stan_model()
```

Note that we have introduced the `prior_only` flag. When `prior_only` is set, Stan ignores the likelihood and simply draws from the priors. This allows us to make prior-predictive simulations. We’ll generate a dataset using the prior-predictive distribution, then fit our model to that dataset. The least we can expect from a model is that it fits well to data drawn from its own prior distribution.

To simulate datasets we’ll use hyperpriors that roughly correspond to the priors from the previous post. In particular, the expected lifetime is around 31 days, and the expected purchase rate around once per fortnight.

```
data_hyperpriors <- list(
  log_life_mean_mu = log(31),
  log_life_mean_sigma = 0.7,
  log_life_scale_sigma = 0.8,
  log_lambda_mean_mu = log(1 / 14),
  log_lambda_mean_sigma = 0.3,
  log_lambda_scale_sigma = 0.5
)

data_prior <- customers %>%
  mutate(t = 0, k = 0) %>%
  tidybayes::compose_data(data_hyperpriors, prior_only = 1)

data_prior %>% str()
```

```
List of 14
$ id : int [1:1000(1d)] 1 2 3 4 5 6 7 8 9 10 ...
$ end : num [1:1000(1d)] 730 730 730 730 730 730 730 730 730 730 ...
$ start : num [1:1000(1d)] 328 707 408 342 666 ...
$ T : num [1:1000(1d)] 401.6 22.6 322.2 388.5 64.5 ...
$ t : num [1:1000(1d)] 0 0 0 0 0 0 0 0 0 0 ...
$ k : num [1:1000(1d)] 0 0 0 0 0 0 0 0 0 0 ...
$ n : int 1000
$ log_life_mean_mu : num 3.43
$ log_life_mean_sigma : num 0.7
$ log_life_scale_sigma : num 0.8
$ log_lambda_mean_mu : num -2.64
$ log_lambda_mean_sigma : num 0.3
$ log_lambda_scale_sigma: num 0.5
$ prior_only : num 1
```

Let’s simulate 8 possible datasets from our priors. Notice how the centres and spreads of the datasets can vary.

```
centred_prior <- centred %>%
  fit( # a wrapper around rstan::sampling to allow caching
    file = here::here('models/rf_centred_prior.rds'), # cache
    data = data_prior,
    pars = c('customer'), # ignore this parameter
    include = FALSE,
    chains = 8,
    cores = 4,
    warmup = 1000, # not sure why this needs to be so high
    iter = 1001, # one more than warmup because we just want one dataset per chain
    seed = 3901 # for reproducibility
  )

centred_prior_draws <- centred_prior %>%
  get_draws( # rstan::extract but also with energy
    pars = c(
      'lp__', 'energy__',
      'theta',
      'log_centres',
      'scales'
    )
  ) %>%
  name_parameters() # add customer id, and idx = 1 (mu) or 2 (lambda)
```

Here are the exact hyperparameters used.

```
hyper <- centred_prior_draws %>%
  filter(str_detect(parameter, "^log_|scales")) %>%
  select(chain, parameter, value) %>%
  spread(parameter, value)
```

chain | log_centres[1] | log_centres[2] | scales[1] | scales[2] |
---|---|---|---|---|
1 | 3.643546 | -2.057017 | 1.0866138 | 1.0874820 |
2 | 4.586349 | -2.231530 | 1.2064439 | 0.2725037 |
3 | 2.673924 | -2.475513 | 0.8847336 | 0.9429385 |
4 | 3.490525 | -2.557564 | 0.7708666 | 1.0124164 |
5 | 3.422691 | -2.877842 | 1.3232360 | 0.2920695 |
6 | 4.205884 | -3.196397 | 1.9217956 | 0.5307348 |
7 | 3.381881 | -2.995128 | 1.0299251 | 0.7266123 |
8 | 3.046531 | -2.328561 | 1.3993460 | 0.4751674 |

We’ll add the prior predictive parameter draws from chain 1 to our customers dataset.

```
set.seed(33194)

df <- centred_prior_draws %>%
  filter(chain == 1) %>%
  filter(name == 'theta') %>%
  transmute(
    id = id %>% as.integer(),
    parameter = if_else(idx == '1', 'mu', 'lambda'),
    value
  ) %>%
  spread(parameter, value) %>%
  mutate(tau = rexp(n(), mu)) %>%
  inner_join(customers, by = 'id') %>%
  group_by(id) %>%
  group_map(~sample_conditional(.$T, .$tau, .$lambda) %>% bind_cols(.x))

data_df <- data_hyperpriors %>%
  tidybayes::compose_data(df, prior_only = 0)
```

id | t | k | lambda | mu | tau | end | start | T |
---|---|---|---|---|---|---|---|---|
1 | 44.453109 | 44 | 0.8508277 | 0.0383954 | 44.669086 | 730 | 328.4312 | 401.56876 |
2 | 21.237790 | 5 | 0.2202757 | 0.0091300 | 263.541504 | 730 | 707.3906 | 22.60937 |
3 | 0.000000 | 0 | 0.0285432 | 0.0583523 | 8.405014 | 730 | 407.8144 | 322.18558 |
4 | 61.946272 | 7 | 0.0970620 | 0.0424386 | 71.617849 | 730 | 341.5465 | 388.45349 |
5 | 8.273831 | 3 | 0.1732173 | 0.0608967 | 8.747799 | 730 | 665.5210 | 64.47898 |
6 | 107.182661 | 19 | 0.1224131 | 0.0159025 | 113.792604 | 730 | 388.0271 | 341.97287 |

Now we can fit the model to the prior-predictive data `df`.

```
centred_fit <- centred %>%
  fit( # like rstan::sampling but with file-caching as in brms
    file = here::here('models/rf_centred_fit.rds'), # cache
    data = data_df,
    chains = 4,
    cores = 4,
    warmup = 2000,
    iter = 3000,
    control = list(max_treedepth = 12),
    seed = 24207,
    pars = c('customer'),
    include = FALSE
  )

centred_fit %>%
  check_hmc_diagnostics()
```

```
Divergences:
0 of 4000 iterations ended with a divergence.
Tree depth:
0 of 4000 iterations saturated the maximum tree depth of 12.
Energy:
E-BFMI indicated no pathological behavior.
```

The HMC diagnostics pass. However, in some runs not shown here there were severe problems with the E-BFMI diagnostic (values around 0.01), and I’ve yet to figure out exactly which situations cause these energy problems. Let’s check out the pairwise posterior densities of energy against the hyperparameters.

The scale parameter for the expected lifetime (`scales[1]`) is correlated with `energy__`, which is associated with the energy problems described above. I’m not sure how much of a problem this poses, so let’s check out some more diagnostics.

```
neff <- centred_fit %>%
  neff_ratio() %>%
  tibble(
    ratio = .,
    parameter = names(.)
  ) %>%
  filter(ratio < 0.5) %>%
  arrange(ratio) %>%
  head(20)
```

ratio | parameter |
---|---|
0.0937772 | scales[1] |
0.1195873 | lp__ |
0.2592819 | log_centres[1] |
0.2748369 | scales[2] |
0.3228056 | log_centres[2] |
0.4340080 | theta[574,2] |
0.4381685 | theta[169,1] |
0.4800795 | theta[38,1] |

Both the `lp__` and `scales[1]` parameters have low effective sample sizes. The rhat values seem fine, though.

```
centred_rhat <- centred_fit %>%
  rhat() %>%
  tibble(
    rhat = .,
    parameter = names(.)
  ) %>%
  summarise(
    min_rhat = min(rhat, na.rm = TRUE),
    max_rhat = max(rhat, na.rm = TRUE)
  )
```

min_rhat | max_rhat |
---|---|
0.9990247 | 1.005071 |

Let’s now compare the 90% posterior intervals with the true values. Ideally close to 90% of the 90% posterior intervals capture their true value.
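
As a reminder of why 90% is the target, here is a hedged Python toy example with a conjugate normal model (not the CLV model; all numbers are illustrative): when truths are drawn from the prior and data from the prior predictive, exact 90% posterior intervals cover the truth 90% of the time on average.

```python
import numpy as np

rng = np.random.default_rng(11)
n_sims, n_obs, prior_sd, noise_sd = 5_000, 20, 1.0, 2.0

theta = rng.normal(0, prior_sd, n_sims)                    # truths drawn from the prior
y = rng.normal(theta[:, None], noise_sd, (n_sims, n_obs))  # prior-predictive datasets

# exact conjugate posterior (prior mean 0, known noise variance)
post_var = 1 / (1 / prior_sd**2 + n_obs / noise_sd**2)
post_mean = post_var * (y.sum(axis=1) / noise_sd**2)

z = 1.6448536269514722  # 95th percentile of the standard normal
lo = post_mean - z * np.sqrt(post_var)
hi = post_mean + z * np.sqrt(post_var)
coverage = np.mean((lo <= theta) & (theta <= hi))  # should be close to 0.90
```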

```
centred_cis <- centred_draws %>%
  group_by(parameter) %>%
  summarise(
    lo = quantile(value, 0.05),
    point = quantile(value, 0.50),
    hi = quantile(value, 0.95)
  ) %>%
  filter(!str_detect(parameter, '__')) # exclude diagnostic parameters
```

The table below shows that we recovered three of the four hyperparameters. The `scales[2]` parameter was estimated slightly too high.

```
calibration_hyper <- hyper %>%
  filter(chain == 1) %>%
  gather(parameter, value, -chain) %>%
  inner_join(centred_cis, by = 'parameter') %>%
  mutate(hit = lo <= value & value <= hi)
```

parameter | value | lo | point | hi | hit |
---|---|---|---|---|---|
log_centres[1] | 3.643546 | 3.6280113 | 3.733964 | 3.831287 | TRUE |
log_centres[2] | -2.057017 | -2.1786705 | -2.090098 | -2.008663 | TRUE |
scales[1] | 1.086614 | 0.9988373 | 1.119258 | 1.242780 | TRUE |
scales[2] | 1.087482 | 1.0938685 | 1.160415 | 1.232225 | FALSE |

We get fairly close to 90% of the customer-level parameters.

```
true_values <- df %>%
  select(id, mu, lambda) %>%
  gather(parameter, value, -id) %>%
  mutate(
    idx = if_else(parameter == 'mu', 1, 2),
    parameter = str_glue("theta[{id},{idx}]")
  )

centred_calibration <- centred_cis %>%
  inner_join(true_values, by = 'parameter') %>%
  ungroup() %>%
  summarise(mean(lo <= value & value <= hi)) %>%
  pull() %>%
  percent()

centred_calibration
```

`[1] "91.1%"`

This is slightly higher than the ideal value of 90%.

Here is a non-centred Stan implementation of our log-normal hierarchical model. The important difference is in the expression for \(\theta\) and in the prior for \(\text{customer}\).

```
noncentred <- here::here('models/rf_noncentred.stan') %>%
  stan_model()
```

Since the non-centred and centred models are equivalent, we can also consider `df` to be a draw from the non-centred prior-predictive distribution.

```
noncentred_fit <- noncentred %>%
  fit( # like rstan::sampling but with file-caching as in brms
    file = here::here('models/rf_noncentred_fit.rds'), # cache
    data = data_df,
    chains = 4,
    cores = 4,
    warmup = 2000,
    iter = 3000,
    control = list(max_treedepth = 12),
    seed = 1259,
    pars = c('customer'),
    include = FALSE
  )

noncentred_fit %>%
  check_hmc_diagnostics()
```

```
Divergences:
0 of 4000 iterations ended with a divergence.
Tree depth:
0 of 4000 iterations saturated the maximum tree depth of 12.
Energy:
E-BFMI indicated no pathological behavior.
```

Again, the HMC diagnostics indicate no problems. Let’s check the pairwise densities anyway.

The correlation between `scales[1]` and `energy__` is smaller with the non-centred parameterisation. This is reflected in the higher effective sample size for `scales[1]` below. Unfortunately, the effective sample sizes for the purchase-rate hyperparameters have gone down.

```
neff <- noncentred_fit %>%
  neff_ratio() %>%
  tibble(
    ratio = .,
    parameter = names(.)
  ) %>%
  filter(ratio < 0.5) %>%
  arrange(ratio) %>%
  head(20)
```

ratio | parameter |
---|---|
0.0631821 | log_centres[2] |
0.0681729 | scales[2] |
0.1579867 | lp__ |
0.1776186 | scales[1] |
0.2217369 | log_centres[1] |
0.3910369 | theta[830,1] |
0.4767405 | theta[639,1] |
0.4792693 | theta[250,2] |
0.4847856 | theta[41,1] |
0.4978518 | theta[231,1] |

Again, the rhat values seem fine.

```
noncentred_rhat <- noncentred_fit %>%
  rhat() %>%
  tibble(
    rhat = .,
    parameter = names(.)
  ) %>%
  summarise(
    min_rhat = min(rhat, na.rm = TRUE),
    max_rhat = max(rhat, na.rm = TRUE)
  )
```

min_rhat | max_rhat |
---|---|
0.9990501 | 1.013372 |

Let’s check how many of the 90% posterior intervals contain the true value.

```
noncentred_cis <- noncentred_draws %>%
  group_by(parameter) %>%
  summarise(
    lo = quantile(value, 0.05),
    point = quantile(value, 0.50),
    hi = quantile(value, 0.95)
  ) %>%
  filter(!str_detect(parameter, '__'))
```

The hyperparameter estimates are much the same as with the centred parameterisation.

```
noncentred_calibration_hyper <- hyper %>%
  filter(chain == 1) %>%
  gather(parameter, value, -chain) %>%
  inner_join(noncentred_cis, by = 'parameter') %>%
  mutate(hit = lo <= value & value <= hi)
```

parameter | value | lo | point | hi | hit |
---|---|---|---|---|---|
log_centres[1] | 3.643546 | 3.6325159 | 3.735363 | 3.838160 | TRUE |
log_centres[2] | -2.057017 | -2.1770703 | -2.092537 | -2.014462 | TRUE |
scales[1] | 1.086614 | 0.9970698 | 1.109297 | 1.235166 | TRUE |
scales[2] | 1.087482 | 1.0963503 | 1.159626 | 1.234190 | FALSE |

About 91% of customer-level posterior intervals contain the true value.

```
noncentred_calibration <- noncentred_cis %>%
  inner_join(true_values, by = 'parameter') %>%
  summarise(mean(lo <= value & value <= hi)) %>%
  pull() %>%
  percent()

noncentred_calibration
```

`[1] "91.0%"`

Both centred and non-centred models performed reasonably well on the dataset considered. The non-centred model showed slightly less correlation between `scales` and `energy__`, suggesting it might be the better one for tackling the low E-BFMI problems. Since we only checked the fit on one prior-predictive draw, it would be a good idea to check the fit on more draws. Some casual attempts of mine (not shown here) suggest there are situations that cause severe E-BFMI problems; identifying these situations would be an interesting next step. It would also be great to see how the model performs on some of the benchmark datasets mentioned in the BTYDPlus package.

Here’s my solution to exercise 9, chapter 1, of Gelman’s *Bayesian Data Analysis* (BDA), 3rd edition. There are solutions to some of the exercises on the book’s webpage.

Suppose there are 3 doctors, who open their practice at 09:00 and stop accepting patients at 16:00. If patients arrive at exponentially distributed intervals with mean 10 minutes, and appointment durations are uniformly distributed between 5 and 20 minutes, we want to know:

- how many patients arrive per day?
- how many patients have to wait for their appointment?
- how long do patients have to wait?
- when does the last patient leave the practice?

We do this by simulation. The `arrivals` function will simulate the arrival times of the patients, in minutes after 09:00. In principle, we should keep drawing from the exponential distribution until the sum of all draws exceeds 420, the number of minutes the practice accepts patients. However, I couldn’t find an efficient way to do this in R. Instead, we’ll draw enough variables that it is highly unlikely we have too few, then just filter out what we don’t need.

To calculate a suitably large number, note that the number of patients in one day is \(\dpois(\frac{1}{10} \cdot (16 - 9) \cdot 60)\)-distributed. The 99.99999th percentile of this distribution is `qpois(0.9999999, (16 - 9) * 6) =` 80. We’ll err on the safe side and use \(n = 100\).
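
The same safety-margin calculation can be sketched in Python (illustrative only; the post uses R’s `qpois`) by walking the Poisson CDF up to the target percentile:

```python
import math

rate = (16 - 9) * 60 / 10  # 42 expected arrivals per day

# smallest k such that the Poisson CDF at k reaches the target percentile
target = 0.9999999
cdf, k, pmf = 0.0, 0, math.exp(-rate)  # pmf starts at P(0)
while cdf + pmf < target:
    cdf += pmf
    k += 1
    pmf *= rate / k  # recurrence: P(k) = P(k-1) * rate / k
q = k  # n = 100 draws is comfortably above this quantile
```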

```
arrivals <- function(λ, t, n=100) {
  rexp(n, λ) %>%
    tibble(
      delay = .,
      time = cumsum(delay)
    ) %>%
    filter(time <= t) %>%
    pull(time)
}

λ <- 1 / 10
t <- (16 - 9) * 60
arrivals(λ, t)
```

```
[1] 4.854265 4.963889 6.651014 9.518990 17.415709 20.110178
[7] 28.852188 34.862538 35.970215 44.205468 48.342152 52.934693
[13] 83.072579 86.746318 117.586517 122.811176 133.662687 142.016603
[19] 170.935913 190.999325 202.511439 204.915770 205.191951 208.422873
[25] 219.437526 225.162971 233.122550 235.351649 253.558658 254.097711
[31] 255.639118 256.270049 277.905899 291.055504 291.737173 294.688419
[37] 300.949679 302.681417 329.751112 335.998940 355.506712 361.162687
[43] 381.436543 388.767558 393.072689 393.088807 395.570811 401.669343
[49] 401.911643 408.650710 418.291196
```

Given the patients that arrive in a day, we now need a function to simulate the appointments. Let’s assume the patients are seen in the order they arrive. As we cycle through the patients, we’ll keep track of

- `n_waited`, the number of patients who have had to wait for their appointment so far;
- `time_waiting`, the sum of all waiting times of the patients so far; and
- `doctors`, the next time at which each doctor is free to see another patient.

The `doctors` variable starts at `c(0, 0, 0)` because they are immediately available to see patients. The doctor with the smallest availability time is the next to see a patient. The start of the appointment is either the doctor’s availability time or the arrival time of the patient, whichever is greater. The end of the appointment is \(\duniform(5, 20)\) minutes after the start of the appointment. The doctor’s availability time is then set to the end of the appointment. Once all patients have been given an appointment, the closing time is the maximum of the doctors’ next availability times or the official closing time `(16 - 9) * 60`, whichever is greater.

```
process <- function(arrivals, t=0) {
  n_waited <- 0 # number of patients who have had to wait so far
  time_waiting <- 0 # total waiting time so far
  doctors <- c(0, 0, 0) # next time at which each doctor is free to see another patient
  for (i in seq_along(arrivals)) {
    wait <- pmax(min(doctors) - arrivals[i], 0) # waiting time of patient i
    time_waiting <- time_waiting + wait
    n_waited <- n_waited + (wait > 0)
    appointment_start <- max(min(doctors), arrivals[i])
    appointment_end <- appointment_start + runif(1, 5, 20)
    doctors[which.min(doctors)] <- appointment_end
  }
  list(
    n_patients = length(arrivals),
    n_waited = n_waited,
    time_waiting = time_waiting,
    time_waiting_per_patient = time_waiting / length(arrivals),
    time_waiting_per_waiting_patient = time_waiting / n_waited,
    closing_time = pmax(max(doctors), t)
  )
}

arrivals(λ, t) %>%
  process(t)
```

```
$n_patients
[1] 39
$n_waited
[1] 0
$time_waiting
[1] 0
$time_waiting_per_patient
[1] 0
$time_waiting_per_waiting_patient
[1] NaN
$closing_time
[1] 426.9273
```

To simulate the above many times, we’ll use the `replicate` function. For convenience, we’ll turn the result into a `tibble`.

```
simulate <- function(iters, λ, t, n=100) {
  iters %>%
    replicate(process(arrivals(λ, t, n), t)) %>%
    t() %>%
    as_tibble() %>%
    mutate_all(unlist)
}

sims <- simulate(1000, λ, t)
```

n_patients | n_waited | time_waiting | time_waiting_per_patient | time_waiting_per_waiting_patient | closing_time |
---|---|---|---|---|---|
45 | 8 | 10.493347 | 0.2331855 | 1.311668 | 421.2221 |
44 | 6 | 13.226377 | 0.3005995 | 2.204396 | 435.4524 |
29 | 1 | 1.174172 | 0.0404887 | 1.174172 | 435.9763 |
31 | 1 | 9.413756 | 0.3036696 | 9.413756 | 440.1210 |
40 | 5 | 16.692607 | 0.4173152 | 3.338521 | 431.8950 |
37 | 2 | 11.329963 | 0.3062152 | 5.664981 | 420.0000 |

Finally, we can calculate the 50% intervals by applying the `quantile` function to each summary statistic.

```
sims_summary <- sims %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  summarise(
    q25 = quantile(value, 0.25, na.rm=TRUE),
    q50 = quantile(value, 0.5, na.rm=TRUE),
    q75 = quantile(value, 0.75, na.rm=TRUE),
    simulations = n()
  )
```

variable | q25 | q50 | q75 | simulations |
---|---|---|---|---|
closing_time | 420.000000 | 424.9488049 | 430.7023807 | 1000 |
n_patients | 38.000000 | 42.0000000 | 46.2500000 | 1000 |
n_waited | 3.000000 | 5.0000000 | 9.0000000 | 1000 |
time_waiting | 8.266744 | 19.3983242 | 38.9367142 | 1000 |
time_waiting_per_patient | 0.212705 | 0.4685262 | 0.8622254 | 1000 |
time_waiting_per_waiting_patient | 2.660497 | 3.8266619 | 5.1309744 | 1000 |

Suppose you have a bunch of customers who make repeat purchases, some more frequently, some less. There are a few things you might like to know about these customers, such as:

- which customers are still active (i.e. not yet churned) and likely to continue purchasing from you? and
- how many purchases can you expect from each customer?

Modelling this directly is more difficult than it might seem at first. A customer that regularly makes purchases every day might be considered at risk of churning if they haven’t purchased anything in the past week, whereas a customer that regularly purchases once per month would not be considered at risk of churning. That is, churn and purchase frequency are closely related. The difficulty is that we never observe the moment of churn for any customer and have to model it probabilistically.
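
A quick numeric illustration of this point (the rates here are purely illustrative): for a customer who is still active, the probability of a whole week with no purchases depends entirely on their purchase rate \(\lambda\), since inter-purchase times are exponential.

```python
import math

week = 7.0

# P(no purchase for a week | still active) = exp(-lambda * week)
p_silent_daily = math.exp(-1.0 * week)         # buys ~daily: lambda = 1 per day
p_silent_monthly = math.exp(-(1 / 30) * week)  # buys ~monthly: lambda = 1/30 per day
```

A week of silence is damning evidence of churn for the daily buyer (probability well under 1%) but entirely unremarkable for the monthly buyer (probability near 80%).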

There are a number of established models for estimating this, the most well-known perhaps being the SMC model (a.k.a. the Pareto/NBD model). There are already some implementations using maximum likelihood or Gibbs sampling. In this post, we’ll explain how the model works, make some prior-predictive simulations, and fit a version implemented in Stan.

\(\DeclareMathOperator{\dbinomial}{Binomial} \DeclareMathOperator{\dbern}{Bernoulli} \DeclareMathOperator{\dpois}{Poisson} \DeclareMathOperator{\dnorm}{Normal} \DeclareMathOperator{\dt}{t} \DeclareMathOperator{\dcauchy}{Cauchy} \DeclareMathOperator{\dexp}{Exp} \DeclareMathOperator{\duniform}{Uniform} \DeclareMathOperator{\dgamma}{Gamma} \DeclareMathOperator{\dinvgamma}{InvGamma} \DeclareMathOperator{\invlogit}{InvLogit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\ddirichlet}{Dirichlet} \DeclareMathOperator{\dbeta}{Beta}\)

Let’s describe the model first by simulation. Suppose we have a company that is 2 years old with a total of 1000 customers, \(C\), who have made at least one purchase from us. We’ll assume a linear rate of customer acquisition, so that the first purchase date is simply a uniform random variable over the 2 years of the company’s existence. These assumptions just keep the example concrete; they are not important for understanding the model.

```
customers <- tibble(id = 1:1000) %>%
  mutate(
    end = 2 * 365,
    start = runif(n(), 0, end - 1),
    T = end - start
  )
```

The \(T\)-variable is the total observation time, counted from the date of first joining to the present day.

First the likelihood. Each customer \(c \in C\) is assumed to have a certain lifetime, \(\tau_c\), starting on their join-date. During their lifetime, they will purchase at a constant rate, \(\lambda_c\), so that they will make \(k \sim \dpois(t\lambda_c)\) purchases over a time-interval \(t\). Once their lifetime is over, they will stop purchasing. We only observe the customer for \(T_c\) units of time, and this observation time can be either larger or smaller than the lifetime, \(\tau_c\). Since we don’t observe \(\tau_c\) itself, we will assume it follows an exponential distribution, i.e. \(\tau_c \sim \dexp(\mu_c)\).

The following function generates possible observations given \(\mu\) and \(\lambda\).

```
sample_conditional <- function(mu, lambda, T) {
  # lifetime
  tau <- rexp(1, mu)
  # start with 0 purchases
  t <- 0
  k <- 0
  # simulate time till next purchase
  wait <- rexp(1, lambda)
  # keep purchasing till end of life/observation time
  while (t + wait <= pmin(T, tau)) {
    t <- t + wait
    k <- k + 1
    wait <- rexp(1, lambda)
  }
  # return tabular data
  tibble(
    mu = mu,
    lambda = lambda,
    T = T,
    tau = tau,
    k = k,
    t = t
  )
}

s <- sample_conditional(0.01, 1, 30)
```

mu | lambda | T | tau | k | t |
---|---|---|---|---|---|
0.01 | 1 | 30 | 49.63373 | 39 | 29.21926 |

Given \(\mu\) and \(\lambda\), the CLV is calculated as follows. The remaining lifetime is the lifetime minus the age of the customer. So if the customer is estimated to have a lifetime of 1 year and has been a customer for 3 months already, then the remaining lifetime will be 9 months.

```
lifetime <- function(n, mu, age=0) {
  rexp(n, mu) %>%
    `-`(age) %>%
    pmax(0) # remaining lifetime always >= 0
}
```

The number of purchases in a given timeframe (within the customer’s lifetime) is simply a Poisson random variable.

```
purchases <- function(n, lambda, time) {
  rpois(n, lambda * time)
}
```

To simulate the CLV, we just simulate a possible remaining lifetime, then simulate the number of purchases in that timeframe. Repeating this many times gives us the distribution of the total number of purchases the customer is expected to make.

```
clv <- function(n, mu, lambda, age=0) {
  lifetime(n, mu, age) %>%
    purchases(n, lambda, .)
}
```
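
Here is a hedged Python translation of the `lifetime`/`purchases`/`clv` pipeline (names mirror the R code, but this is a sketch with illustrative parameter values), checked against the closed form \(\mathbb E[\text{purchases}] = \lambda e^{-\mu \cdot \text{age}} / \mu\) that follows from the exponential lifetime:

```python
import numpy as np

rng = np.random.default_rng(123)

def clv(n, mu, lam, age=0.0):
    # remaining lifetime: a fresh exponential draw minus the customer's age, floored at 0
    remaining = np.maximum(rng.exponential(1 / mu, n) - age, 0)
    # number of purchases within the remaining lifetime
    return rng.poisson(lam * remaining)

n, mu, lam, age = 1_000_000, 0.02, 0.1, 10.0
mc = clv(n, mu, lam, age).mean()          # Monte Carlo estimate
exact = lam * np.exp(-mu * age) / mu      # lambda * E[remaining lifetime]
```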

The probability of still being active can be estimated by the fraction of `lifetime` draws that are above 0 (the churn probability is the complementary fraction). For example, for a customer with an expected lifetime of 10 (i.e. \(\mu = 0.1\)) and a current age of 10, the probability of still being active is

`mean(lifetime(100000, 0.1, 10) > 0)`

`[1] 0.36881`

which is roughly \(\exp(-0.1 \cdot 10)\), the survival function of the exponential distribution.
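
We can verify this survival-function claim with a small Python simulation (a sketch; the post’s code is R):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, age, n = 0.1, 10.0, 1_000_000

# fraction of exponential lifetimes exceeding the customer's age
p_active_mc = np.mean(rng.exponential(1 / mu, n) > age)
p_active_exact = np.exp(-mu * age)  # exponential survival function, ~0.368
```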

Now the priors. Typically, \(\mu\) and \(\lambda\) are given gamma priors, which we’ll use too. However, the expected mean lifetime \(\mathbb E (\tau) = \frac{1}{\mu}\) is easier to reason about than \(\mu\), so we’ll put an inverse gamma distribution on \(\frac{1}{\mu}\). The reciprocal of an inverse gamma distribution has a gamma distribution, so \(\mu\) will still end up with a gamma distribution.

The mean expected lifetime in our simulated example will be ~2 months, with a standard deviation of 30 days. The mean purchase rate will be once a fortnight, with a standard deviation of 0.05.

```
set.seed(2017896)

etau_mean <- 60
etau_variance <- 30^2
etau_beta <- etau_mean^3 / etau_variance + etau_mean
etau_alpha <- etau_mean^2 / etau_variance + 2

lambda_mean <- 1 / 14
lambda_variance <- 0.05^2
lambda_beta <- lambda_mean / lambda_variance
lambda_alpha <- lambda_mean * lambda_beta

df <- customers %>%
  mutate(
    etau = rinvgamma(n(), etau_alpha, etau_beta),
    mu = 1 / etau,
    lambda = rgamma(n(), lambda_alpha, lambda_beta)
  ) %>%
  group_by(id) %>%
  group_map(~sample_conditional(.$mu, .$lambda, .$T))
```
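
The moment-matching above can be sanity-checked deterministically: for \(\dinvgamma(\alpha, \beta)\), the mean is \(\beta / (\alpha - 1)\) and the variance is \(\beta^2 / ((\alpha - 1)^2 (\alpha - 2))\). A Python sketch of the check:

```python
# target moments for the expected lifetime 1/mu
etau_mean, etau_variance = 60, 30**2

# the same moment-matching formulas used in the R code above
etau_beta = etau_mean**3 / etau_variance + etau_mean  # = 300
etau_alpha = etau_mean**2 / etau_variance + 2         # = 6

# plug back into the inverse-gamma moment formulas
implied_mean = etau_beta / (etau_alpha - 1)
implied_variance = etau_beta**2 / ((etau_alpha - 1)**2 * (etau_alpha - 2))
```

Both implied moments recover the targets exactly (60 and 900).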

id | mu | lambda | T | tau | k | t |
---|---|---|---|---|---|---|
1 | 0.0241091 | 0.2108978 | 295.3119 | 32.2814622 | 6 | 29.46052 |
2 | 0.0122084 | 0.0135551 | 673.2100 | 11.5250690 | 0 | 0.00000 |
3 | 0.0032994 | 0.0789800 | 357.1805 | 4.7921238 | 0 | 0.00000 |
4 | 0.0227431 | 0.0980176 | 270.0511 | 141.4766791 | 10 | 125.60765 |
5 | 0.0270742 | 0.0429184 | 608.9049 | 5.7293256 | 0 | 0.00000 |
6 | 0.0208168 | 0.0661296 | 666.1305 | 0.9481004 | 0 | 0.00000 |

The lifetimes are mostly under 3 months, but also allow some more extreme values up to around a year.

The purchase rates are mostly around once a fortnight, but there are also rates as high as 4 purchases per week and as low as one per quarter.

The likelihood is somewhat complicated, so we’ll derive a more concise expression for it. Knowing the lifetime simplifies the probabilities, so we’ll marginalise the likelihood over \(\tau\).

\[ \begin{align} \mathbb P (k, t \mid \mu, \lambda) &= \int_{\tau = t}^\infty \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) \, d\tau \\ &= \int_{\tau = t}^T \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) \, d\tau + \int_{\tau = T}^\infty \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) \, d\tau \\ &= \int_{\tau = t}^T \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (\tau-t)\lambda) \cdot \dexp(\tau \mid \mu) \, d\tau \\ &\hphantom{=} + \int_{\tau = T}^\infty \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (T-t)\lambda) \cdot \dexp(\tau \mid \mu) \, d\tau \end{align} \]

The right-hand side is straightforward. The Poisson probabilities can be pulled out of the integral since they are independent of \(\tau\), turning the remaining integral into the survival function of the exponential distribution.

\[ \begin{align} \text{RHS} &= \int_{\tau = T}^\infty \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (T - t)\lambda) \cdot\dexp(\tau \mid \mu) d\tau \\ &= \frac{(t\lambda)^k e^{-t\lambda}}{k!} e^{-(T-t)\lambda}\int_T^\infty \dexp(\tau \mid \mu) d\tau \\ &= \frac{(t\lambda)^k e^{-T\lambda}}{k!} e^{-T\mu} \\ &= \frac{(t\lambda)^k e^{-T(\lambda + \mu)}}{k!} \end{align} \]

The left-hand side is a little more involved.

\[ \begin{align} \text{LHS} &= \int_{\tau = t}^T \dpoisson(k \mid t\lambda) \cdot \dpoisson(0 \mid (\tau-t)\lambda) \cdot \dexponential(\tau \mid \mu) d\tau \\ &= \frac{(t\lambda)^k e^{-t\lambda} }{k!} \int_t^T e^{-(\tau - t)\lambda} \mu e^{-\tau\mu} d\tau \\ &= \frac{(t\lambda)^k e^{-t\lambda} }{k!} e^{t\lambda} \mu \int_t^T e^{-\tau(\lambda + \mu)} d\tau \\ &= \frac{(t\lambda)^k }{k!} \mu \left. \frac{ e^{-\tau(\lambda + \mu)}}{-(\lambda + \mu)} \right|_t^T \\ &= \frac{(t\lambda)^k }{k!} \mu \frac{ e^{-t(\lambda + \mu)} - e^{-T(\lambda + \mu)}}{\lambda + \mu} \end{align} \]

Adding both expressions gives our final expression for the likelihood

\[ \begin{align} \mathbb P (k, t \mid \mu, \lambda) &= \frac{(t\lambda)^k e^{-T(\lambda + \mu)}}{k!} + \frac{(t\lambda)^k }{k!} \mu \frac{ e^{-t(\lambda + \mu)} - e^{-T(\lambda + \mu)}}{\lambda + \mu} \\ &\propto \lambda^k e^{-T(\lambda + \mu)} + \lambda^k \mu \frac{ e^{-t(\lambda + \mu)} - e^{-T(\lambda + \mu)}}{\lambda + \mu} \\ &= \frac{\lambda^k}{\lambda + \mu} \left( \mu e^{-t(\lambda + \mu)} - \mu e^{-T(\lambda + \mu)} + \mu e^{-T(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right) \\ &= \frac{\lambda^k}{\lambda + \mu} \left( \mu e^{-t(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right) , \end{align} \]

where we dropped any factors independent of the parameters, \(\lambda, \mu\). This expression agrees with equation 2 in ML07.

Another way to view this likelihood is as a mixture of censored observations, but where the mixture probability \(p(\mu, \lambda) := \frac{\mu}{\lambda + \mu}\) depends on the parameters. We can write this alternative interpretation as

\[ \begin{align} \mathbb P(k, t \mid \mu, \lambda) &\propto p \dpoisson(k \mid t\lambda)S(t \mid \mu) \\ &\hphantom{\propto}+ (1 - p) \dpoisson(k \mid t\lambda)\dpoisson(0 \mid (T-t)\lambda)S(T \mid \mu) , \end{align} \]

where \(S\) is the survival function of the exponential distribution. In other words, either we censor at \(t\) with probability \(p\), or we censor at \(T\) with probability \((1 - p)\). Note that

- either decreasing the expected lifetime (i.e. increasing \(\mu\)) or decreasing the purchase rate increases \(p\);
- if \(t \approx T\), then the censored distributions are approximately equal. The smaller \(\lambda\) is, the closer the approximation has to be for this to hold.
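To make the first point concrete, here’s a small R sketch evaluating \(p(\mu, \lambda)\) (the `censor_prob` helper is mine, purely for illustration):

```
# probability of censoring at t rather than at T
censor_prob <- function(mu, lambda) {
  mu / (lambda + mu)
}

# increasing mu (a shorter expected lifetime) increases p
censor_prob(0.2, 0.1) > censor_prob(0.1, 0.1)

# decreasing lambda (a lower purchase rate) also increases p
censor_prob(0.1, 0.05) > censor_prob(0.1, 0.1)
```

Both comparisons evaluate to `TRUE`.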

To implement this in Stan, we’ll need the log-likelihood, which is given by

\[ \log\mathbb P (k, t \mid \mu, \lambda) = k \log\lambda - \log(\lambda + \mu) + \log\left(\mu e^{-t(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right) . \]

Let’s plot the likelihood to see how it changes as we vary \(k\), \(t\), and \(T\). We’ll use the following functions to do this.

```
# calculate the likelihood
likelihood <- function(mu, lambda, k, t, T) {
  log_likelihood <- k * log(lambda) - log(lambda + mu) +
    log(mu * exp(-t * (lambda + mu)) + lambda * exp(-T * (lambda + mu)))
  return(exp(log_likelihood))
}
# the grid to calculate values for
grid <- crossing(
  mu = seq(0.00001, 1, 0.01),
  lambda = seq(0.00001, 1, 0.01)
)
# plot it all
plot_likelihood <- function(grid, k, t, T) {
  grid %>%
    mutate(k = k, t = t, T = T) %>%
    mutate(likelihood = likelihood(mu, lambda, k, t, T)) %>%
    ggplot() +
    aes(mu, lambda, fill = likelihood) +
    geom_raster() +
    geom_contour(aes(z = likelihood), alpha = 0.7) +
    labs(
      x = 'μ',
      y = 'λ',
      title = str_glue("Likelihood for k = {k}, t = {t}, T = {T}"),
      subtitle = 'restricted to the unit interval',
      fill = 'Likelihood'
    )
}
```

If \(k = t = 0 \approx T\), then we have almost no information to inform our estimates (we would rely strongly on our priors in this case). We see that both large and small lifetimes are equally possible, and the parameter estimates are approximately independent of one another.

```
grid %>%
  plot_likelihood(k = 0, t = 0, T = 0.1)
```

Adding some observation time changes it up a little. We can increase the purchase rate without changing the likelihood if we also decrease the lifetime (= increase \(\mu\)). This trade-off is almost linear. There are almost always many customers that haven’t made a second purchase yet, so this case is likely important to deal with well.

```
grid %>%
  plot_likelihood(k = 0, t = 0, T = 12)
```

If, on the other hand, we do observe some purchases in this period, the likelihood quickly concentrates around the average purchase rate. Likewise, the likelihood for the expected lifetime concentrates on larger values.

```
grid %>%
  plot_likelihood(k = 3, t = 12, T = 12)
```

Once a substantial length of time elapses without any more purchases, we see the maximum likelihood estimate for \(\mu\) move away from small values. This makes sense since we would otherwise have observed more recent purchases. The estimate for \(\mu\) doesn’t increase too much though, since we know the lifetime is at least 12.

```
grid %>%
  plot_likelihood(k = 3, t = 12, T = 100000)
```

Let’s take a look at our Stan implementation. Note that Stan uses the log-likelihood, and we can increment it by incrementing the `target` variable. We have also used the `log_sum_exp` function for numeric stability, where \(\text{log_sum_exp}(x, y) := \log(e^x + e^y)\).
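To see why the stable form matters, here’s a minimal R sketch of the same trick (`log_sum_exp` is a Stan built-in; this R version is mine, for illustration only). Factoring out the larger argument keeps the exponentials from overflowing:

```
# naive version: exp(1000) overflows to Inf in double precision
naive_lse <- function(x, y) {
  log(exp(x) + exp(y))
}

# stable version: factor out the larger argument first
log_sum_exp <- function(x, y) {
  m <- max(x, y)
  m + log(exp(x - m) + exp(y - m))
}

naive_lse(1000, 1000)    # Inf
log_sum_exp(1000, 1000)  # 1000 + log(2)
```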

```
pnb <- here('models/pnbd.stan') %>%
  stan_model()
```

```
S4 class stanmodel 'pnbd' coded as follows:
data {
  int<lower = 1> n;       // number of customers
  vector<lower = 0>[n] t; // time to most recent purchase
  vector<lower = 0>[n] T; // total observation time
  vector<lower = 0>[n] k; // number of purchases observed
  // user-specified parameters
  real<lower = 0> etau_alpha;
  real<lower = 0> etau_beta;
  real<lower = 0> lambda_alpha;
  real<lower = 0> lambda_beta;
}
parameters {
  vector<lower = 0>[n] lambda; // purchase rate
  vector<lower = 0>[n] etau;   // expected mean lifetime
}
transformed parameters {
  vector<lower = 0>[n] mu = 1.0 ./ etau;
}
model {
  // priors
  etau ~ inv_gamma(etau_alpha, etau_beta);
  lambda ~ gamma(lambda_alpha, lambda_beta);
  // likelihood
  target += k .* log(lambda) - log(lambda + mu);
  for (i in 1:n) {
    target += log_sum_exp(
      log(lambda[i]) - (lambda[i] + mu[i]) * T[i],
      log(mu[i]) - (lambda[i] + mu[i]) * t[i]
    );
  }
}
```

Let’s fit the model to our simulated data, using the correct priors.

```
pnb_fit <- rstan::sampling(
  pnb,
  data = compose_data(
    df,
    etau_alpha = etau_alpha,
    etau_beta = etau_beta,
    lambda_alpha = lambda_alpha,
    lambda_beta = lambda_beta
  ),
  control = list(max_treedepth = 15),
  chains = 4,
  cores = 4,
  warmup = 1000,
  iter = 3000
)
```

Using the default `max_treedepth` of 10 shows problems with the energy diagnostic, with the `etau` parameters seemingly the most problematic. However, increasing it to 15 resolved these issues.

```
pnb_fit %>%
  check_hmc_diagnostics()
```

```
Divergences:
0 of 8000 iterations ended with a divergence.
Tree depth:
0 of 8000 iterations saturated the maximum tree depth of 15.
Energy:
E-BFMI indicated no pathological behavior.
```

There are also no problems with the effective sample sizes, although the `etau` parameters typically have the lowest.

```
pnb_neff <- pnb_fit %>%
  neff_ratio() %>%
  tibble(
    ratio = .,
    parameter = names(.)
  ) %>%
  arrange(ratio) %>%
  head(5)
```

ratio | parameter |
---|---|
0.3877002 | lp__ |
0.4838375 | etau[716] |
0.5157888 | etau[442] |
0.5722499 | etau[367] |
0.5803245 | etau[443] |

The rhat statistic also looks good.

```
pnb_rhat <- pnb_fit %>%
  rhat() %>%
  tibble(
    rhat = .,
    parameter = names(.)
  ) %>%
  summarise(min(rhat), max(rhat))
```

min(rhat) | max(rhat) |
---|---|
0.9995222 | 1.001541 |

Around 50% of our 50% posterior intervals contain the true value, which is a good sign.

```
calibration <- pnb_fit %>%
  spread_draws(mu[id], lambda[id]) %>%
  mean_qi(.width = 0.5) %>%
  inner_join(df, by = 'id') %>%
  summarise(
    mu = mean(mu.lower <= mu.y & mu.y <= mu.upper),
    lambda = mean(lambda.lower <= lambda.y & lambda.y <= lambda.upper)
  ) %>%
  gather(parameter, fraction)
```

parameter | fraction |
---|---|
mu | 0.495 |
lambda | 0.498 |

We described the data generating process behind the Pareto-NBD model, implemented a model in Stan using our derivation of the likelihood, and fit the model to simulated data. The diagnostics didn’t indicate any convergence problems, and around 50% of the 50% posterior intervals contained the true parameter values. However, we used our knowledge of the prior distribution to fit the model. It would be better to use a hierarchical prior to relax this requirement.

As a next step, it would be interesting to extend the model to

- estimate spend per purchase;
- use hierarchical priors on \(\mu\) and \(\lambda\);
- allow correlation between \(\mu\) and \(\lambda\); and
- allow covariates, such as cohorts.

Here’s my solution to exercise 3, chapter 1, of Gelman’s *Bayesian Data Analysis* (BDA), 3rd edition. There are solutions to some of the exercises on the book’s webpage.

Suppose a particular gene for eye colour has two alleles: a dominant X and a recessive x allele. Having \(xx\) gives you blue eyes, otherwise you have brown eyes. Suppose also that the proportion of blue-eyed people is \(p^2\), and the proportion of heterozygotes is \(2p(1 - p)\). There are 3 questions to answer:

- What is the probability of a brown-eyed child of brown-eyed parents being a heterozygote?
- If such a heterozygote, Judy, has \(n\) brown-eyed children with a random heterozygote, what’s the probability that Judy is a heterozygote?
- Under the conditions of part 2, what is the probability that Judy’s first grandchild has blue eyes?

Let’s first set up some data with which we can verify the results via simulation.

We’ll simulate a large population of individuals where the probability of the recessive allele is 0.2.

```
set.seed(11146)
N <- 5000000
p <- 0.2
alleles <- c('x', 'X')
weights <- c(p, 1 - p)
df <- tibble(id = 1:N %>% as.character()) %>%
  mutate(
    allele1 = sample(alleles, N, prob = weights, replace = TRUE),
    allele2 = sample(alleles, N, prob = weights, replace = TRUE),
    genotype = if_else(allele1 == allele2, 'homozygote', 'heterozygote'),
    eye_colour = if_else(allele1 == 'x' & allele2 == 'x', 'blue', 'brown')
  )
```

id | allele1 | allele2 | genotype | eye_colour |
---|---|---|---|---|
1 | X | X | homozygote | brown |
2 | x | X | heterozygote | brown |
3 | X | x | heterozygote | brown |
4 | X | X | homozygote | brown |
5 | x | X | heterozygote | brown |
6 | X | X | homozygote | brown |

This has the correct distribution of alleles, since \(p^2 \approx\) 0.04 and \((1-p)^2\approx\) 0.64.

```
allele_distribution <- df %>%
  group_by(allele1, allele2) %>%
  tally() %>%
  mutate(frac = n / sum(n))
```

allele1 | allele2 | n | frac |
---|---|---|---|
x | x | 200006 | 0.2000018 |
x | X | 800015 | 0.7999982 |
X | x | 800802 | 0.2002016 |
X | X | 3199177 | 0.7997984 |

This also has the correct distribution of eye colours.

```
eye_colour_distribution <- df %>%
  group_by(eye_colour) %>%
  tally() %>%
  mutate(frac = n / sum(n))
```

eye_colour | n | frac |
---|---|---|
blue | 200006 | 0.0400012 |
brown | 4799994 | 0.9599988 |

The genotype distribution is also correct, since \(2p(1-p) \approx\) 0.32.

```
genotype_distribution <- df %>%
  group_by(genotype) %>%
  tally() %>%
  mutate(frac = n / sum(n))
```

genotype | n | frac |
---|---|---|
heterozygote | 1600817 | 0.3201634 |
homozygote | 3399183 | 0.6798366 |

Let’s also define a couple of functions to simulate reproduction within our population. The `pair` function matches up random individuals from the first table with random individuals from the second.

```
pair <- function(df1, df2) {
  inner_join(
    df1 %>%
      select(-matches('\\.(x|y)$')) %>%
      select(matches('^(id|allele|genotype|eye)')) %>%
      ungroup() %>%
      sample_frac(size = 1) %>%
      mutate(row = row_number()),
    df2 %>%
      select(-matches('\\.(x|y)$')) %>%
      select(matches('^(id|allele|genotype|eye)')) %>%
      ungroup() %>%
      sample_frac(size = 1) %>%
      mutate(row = row_number()),
    by = 'row'
  ) %>%
    select(-row) %>%
    return()
}
```

The `reproduce` function then randomly generates a child from the paired individuals.

```
reproduce <- function(pairs, n = 1) {
  pairs %>%
    crossing(child = 1:n) %>%
    mutate(
      # the variables x and y indicate the allele taken from parent x and y, respectively
      x = rbinom(n(), 1, 0.5) + 1,
      y = rbinom(n(), 1, 0.5) + 1,
      allele1 = if_else(x == 1, allele1.x, allele2.x),
      allele2 = if_else(y == 1, allele1.y, allele2.y),
      genotype = if_else(allele1 == allele2, 'homozygote', 'heterozygote'),
      eye_colour = if_else(allele1 == 'x' & allele2 == 'x', 'blue', 'brown'),
      id = paste(id.x, id.y, child, sep = '-')
    ) %>%
    return()
}
```

The `kids` table then represents the next generation from random mating within the entire population.

```
kids <- df %>%
  pair(df) %>%
  reproduce()
```

allele1.x | allele2.x | allele1.y | allele2.y | allele1 | allele2 | x | y |
---|---|---|---|---|---|---|---|
X | X | X | x | X | X | 2 | 1 |
X | x | X | X | x | X | 2 | 1 |
X | X | X | x | X | x | 1 | 2 |
X | X | X | X | X | X | 1 | 1 |
X | x | X | x | X | X | 1 | 1 |
x | X | X | X | x | X | 1 | 1 |

The parent attributes are contained in the `kids` table, with the `.x` suffix for one parent and `.y` for the other.

We’ll use \(A\) to stand for the allele combination, e.g. \(XX\), or \(Xx = xX\), and \(E\) for eye colour. The subscripts \(i = 1, 2\) will be used for each of the two parents, and the absence of subscripts will indicate the variable for the child. We need to calculate the probability that the child is a heterozygote given that they are brown-eyed with brown-eyed parents:

\[ \mathbb P (A = Xx \mid E, E_1, E_2 = B). \]

It will be easier to calculate this if we can rewrite it as a probability conditional only on \(A_\bullet\)-variables. First note that

\[ \begin{align} \mathbb P (A, A_1, A_2) &= \mathbb P (A \mid A_1, A_2) \mathbb P(A_1, A_2) \\ &= \mathbb P (A \mid A_1, A_2) \mathbb P(A_1) \mathbb P (A_2) \end{align} \]

using the chain rule and the assumption of random mating. Therefore,

\[ \begin{align} & P (A = Xx \mid E_\bullet = B) \\ &= \frac{\mathbb P (E_\bullet = B \mid A = Xx) \cdot \mathbb P (A = Xx)}{\mathbb P (E_\bullet = B)} \\ &= \frac{ \sum_{a_1, a_2} \mathbb P (E_\bullet = B \mid A = Xx, A_1 = a_1, A_2 = a_2) \cdot \mathbb P (A = Xx \mid A_1 = a_1, A_2 = a_2) \cdot \mathbb P (A_1 = a_1) \cdot \mathbb P (A_2 = a_2) }{ \sum_{a, a_1, a_2} \mathbb P (E_\bullet = B \mid A = a, A_1 = a_1, A_2 = a_2) \cdot \mathbb P (A = a \mid A_1 = a_1, A_2 = a_2) \cdot \mathbb P (A_1 = a_1) \cdot \mathbb P (A_2 = a_2) }, \end{align} \]

where the numerator is marginalised over possible values of \(A_1\) and \(A_2\), and the denominator additionally over \(A\).

The factors involving \(E_\bullet\) are either 1 or 0, depending only on whether the given combination of alleles can give rise to brown eyes or not, respectively. Moreover, \(\mathbb P (A_i = XX) = (1 - p)^2\) and \(\mathbb P (A_i = Xx) = 2p(1 - p)\), where the case \(A_i = xx\) is impossible conditional on everybody having brown eyes. The only non-trivial calculations now involve \(\mathbb P (A = a \mid A_1 = a_1, A_2 = a_2)\):

\[ \begin{align} \mathbb P (A = Xx \mid A_1 = Xx, A_2 = Xx) &= \frac{1}{2} \\ \mathbb P (A = Xx \mid A_1 = Xx, A_2 = XX) &= \frac{1}{2} \\ \mathbb P (A = Xx \mid A_1 = XX, A_2 = XX) &= 0 \\ \mathbb P (A = XX \mid A_1 = Xx, A_2 = Xx) &= \frac{1}{4} \\ \mathbb P (A = XX \mid A_1 = Xx, A_2 = XX) &= \frac{1}{2} \\ \mathbb P (A = XX \mid A_1 = XX, A_2 = XX) &= 1, \end{align} \]

as can be verified by inspection.
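They can also be checked by enumerating the four equally likely allele draws. Here’s a quick sketch (the helpers are mine, separate from the simulation above):

```
# label a genotype with the dominant allele first, so that xX and Xx coincide
canon <- function(a1, a2) {
  if (a1 == 'x' && a2 == 'X') paste0(a2, a1) else paste0(a1, a2)
}

# enumerate the four equally likely children of two parents
child_genotypes <- function(parent1, parent2) {
  combos <- expand.grid(a1 = parent1, a2 = parent2, stringsAsFactors = FALSE)
  mapply(canon, combos$a1, combos$a2)
}

kids <- child_genotypes(c('X', 'x'), c('X', 'x'))  # both parents Xx
mean(kids == 'Xx')  # 1/2
mean(kids == 'XX')  # 1/4
```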

Now let’s plug these values into the formula for the desired probability:

\[ \begin{align} & P (A = Xx \mid E_\bullet = B) \\ &= \frac{ \frac{1}{2} \cdot (2p(1 - p))^2 + \frac{1}{2} \cdot 2 \cdot 2p(1 - p)(1 - p)^2 + 0 \cdot (1 - p)^4 }{ (\frac{1}{2} + \frac{1}{4}) \cdot 4p^2(1 - p)^2 + (\frac{1}{2} + \frac{1}{2}) \cdot 4p(1 - p)^3 + (0 + 1) \cdot (1 - p)^4 } \\ &= \frac{(1 - p)^2}{(1 - p)^2} \frac{ 2p^2 + 2p(1 - p) }{ 3p^2 + 4p(1 - p) + (1 - p)^2 } \\ &= \frac{ 2p^2 + 2p - 2p^2 }{ 3p^2 + 4p - 4p^2 + 1 + p^2 - 2p } \\ &= \frac{ 2p }{ 1 + 2p }, \end{align} \]

as required. This is approximately \(2p\) for small \(p\), and approaches \(\frac{2}{3}\) as \(p\) approaches 1.
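The same answer falls out of computing the numerator and denominator exactly, without simulation. A sketch in R (the function name is mine):

```
# exact P(child is Xx | child and both parents have brown eyes)
p_heterozygote_given_brown <- function(p) {
  pXX <- (1 - p)^2       # P(parent is XX); xx parents are ruled out by brown eyes
  pXx <- 2 * p * (1 - p) # P(parent is Xx)
  # P(child Xx | parents), weighted by the parent probabilities
  num <- 0.5 * pXx^2 + 0.5 * 2 * pXx * pXX
  # as above, but allowing any brown-eyed child (Xx or XX)
  den <- (0.5 + 0.25) * pXx^2 + (0.5 + 0.5) * 2 * pXx * pXX + pXX^2
  num / den
}

p_heterozygote_given_brown(0.2)  # 2 * 0.2 / 1.4 = 0.2857...
```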

To condition on brown-eyed children from brown-eyed parents, we can just filter the `kids` table. Such a child is called `judy` in this exercise.

```
judy <- kids %>%
  filter(
    eye_colour.x == 'brown',
    eye_colour.y == 'brown',
    eye_colour == 'brown'
  )
judy_genotypes <- judy %>%
  group_by(genotype) %>%
  tally() %>%
  mutate(frac = n / sum(n))
```

genotype | n | frac |
---|---|---|
heterozygote | 1280529 | 0.2858268 |
homozygote | 3199559 | 0.7141732 |

This is very close to the theoretical value of \(\frac{2p}{1 + 2p}\approx\) 0.286.

Denote by \(E_{C_\bullet} = B\) the condition that all of Judy’s children have brown eyes, and by \(A^p = a\) the condition that Judy’s partner has allele combination \(a\). Then

\[ \begin{align} & \mathbb P (A = Xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \frac{ \mathbb P (A = Xx \mid E_\bullet = B, A^p = Xx) \cdot \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A^p = Xx = A) }{ \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A^p = Xx) } \\ &= \frac{ \mathbb P (A = Xx \mid E_\bullet = B) \cdot \mathbb P (E_{C_\bullet} = B \mid A^p = Xx = A) }{ \sum_a \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A^p = Xx, A = a) \cdot \mathbb P (A = a \mid E_\bullet = B, A^p = Xx) } \\ &= \frac{ \mathbb P (A = Xx \mid E_\bullet = B) \cdot \mathbb P (E_{C_\bullet} = B \mid A^p = Xx = A) }{ \sum_a \mathbb P (E_{C_\bullet} = B \mid A^p = Xx, A = a) \cdot \mathbb P (A = a \mid E_\bullet = B) } \\ &= \frac{ \frac{2p}{1 + 2p} \cdot (\frac{3}{4})^n }{ \mathbb P (E_{C_\bullet} = B \mid A^p = Xx = A) \cdot \mathbb P (A = Xx \mid E_\bullet = B) + \mathbb P (E_{C_\bullet} = B \mid A^p = Xx, A = XX) \cdot \mathbb P (A = XX \mid E_\bullet = B) } \\ &= \frac{ \frac{2p}{1 + 2p} \cdot (\frac{3}{4})^n }{ \frac{2p}{1 + 2p} \cdot (\frac{3}{4})^n + \frac{1}{1 + 2p} } \\ &= \frac{2p \cdot (\frac{3}{4})^n}{2p \cdot (\frac{3}{4})^n + 1} \\ &= \frac{2p \cdot 3^n}{2p \cdot 3^n + 4^n} , \end{align} \]

where we have used conditional independence several times for the probability of the child’s alleles given the parents’ alleles. As \(n \rightarrow \infty\), this probability shrinks to 0.
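Evaluating this directly shows how quickly the probability decays with \(n\) (the function name is mine):

```
# P(Judy is a heterozygote | n brown-eyed children with a heterozygote partner)
p_judy_heterozygote <- function(p, n) {
  2 * p * 3^n / (2 * p * 3^n + 4^n)
}

p_judy_heterozygote(0.2, 1)   # 6p / (6p + 4) = 0.2308...
p_judy_heterozygote(0.2, 20)  # already close to 0
```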

To simulate part 2, we need to pair `judy` with heterozygotes from the general population, then filter for those children with brown eyes.

```
judy_kids <- df %>%
  filter(genotype == 'heterozygote') %>%
  pair(judy) %>%
  reproduce() %>%
  ungroup() %>%
  filter(eye_colour == 'brown')
```

Amongst `judy_kids`, Judy’s attributes have the `.y` suffix. Given the above conditions, the probabilities of her possible genotypes are then:

```
judy_genotypes_posterior <- judy_kids %>%
  group_by(genotype.y) %>%
  tally() %>%
  mutate(frac = n / sum(n))
```

genotype.y | n | frac |
---|---|---|
heterozygote | 343962 | 0.2313911 |
homozygote | 1142534 | 0.7686089 |

This is close to the theoretical value of \(\frac{6p}{6p + 4}\approx\) 23.1%.

Let’s introduce some notation. Let \(A_g\) be the alleles of Judy’s first grandchild, i.e. of the child of Judy’s child \(c\), where \(c\) has alleles \(A_c\) and a partner with alleles \(A_c^p\). We wish to calculate \(\mathbb P (A_g = xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx)\).

First note that

\[ \begin{align} & \mathbb P (A_c = Xx \mid E_\bullet = E_{C_\bullet} = B, A^p = Xx, A = a) \\ &= \frac{ \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A_c = Xx = A^p, A = a) \cdot \mathbb P (A_c = Xx \mid E_\bullet = B, A^p = Xx, A = a) }{ \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A^p = Xx, A = a) } \\ &= \frac{ \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A_c = Xx = A^p, A = a) \cdot \frac{1}{2} }{ \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A^p = Xx, A = a) } \\ &= \begin{cases} \frac{\left(\frac{3}{4}\right)^{n-1} \cdot \frac{1}{2}}{\left(\frac{3}{4}\right)^n} &\text{if } A = Xx \\ \frac{1 \cdot \frac{1}{2}}{1} &\text{otherwise} \end{cases} \\ &= \begin{cases} \frac{2}{3} &\text{if } A = Xx \\ \frac{1}{2} &\text{otherwise} \end{cases} . \end{align} \]

Thus,

\[ \begin{align} & \mathbb P (A_c = Xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \sum_a \mathbb P (A_c = Xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx, A = a) \cdot \mathbb P (A = a \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \frac{2}{3} \cdot \frac{2p \cdot (\frac{3}{4})^n}{2p \cdot (\frac{3}{4})^n + 1} + \frac{1}{2} \cdot \frac{1}{2p \cdot (\frac{3}{4})^n + 1} \\ &= \frac{p\left( \frac{3}{4} \right)^{n-1} + 0.5}{2p \cdot \left(\frac{3}{4}\right)^n + 1} , \end{align} \]

which converges to \(\frac{1}{2}\) as \(n \rightarrow \infty\).

The probability that Judy’s first grandchild is a homozygote can then be calculated by marginalising over the allele combinations of the child and their partner:

\[ \begin{align} & \mathbb P (A_g = xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \sum_{a_c, a_c^p} \mathbb P (A_g = xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx, A_c = a_c, A_c^p = a_c^p) \cdot \mathbb P (A_c = a_c, A_c^p = a_c^p \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \sum_{a_c, a_c^p} \mathbb P (A_g = xx \mid A_c = a_c, A_c^p = a_c^p) \cdot \mathbb P (A_c^p = a_c^p ) \cdot \mathbb P (A_c = a_c \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \sum_{a_c^p} \mathbb P (A_g = xx \mid A_c = Xx, A_c^p = a_c^p) \cdot \mathbb P (A_c^p = a_c^p ) \cdot \mathbb P (A_c = Xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \frac{p\left( \frac{3}{4} \right)^{n-1} + 0.5}{2p \cdot \left(\frac{3}{4}\right)^n + 1} \cdot \sum_{a_c^p} \mathbb P (A_g = xx \mid A_c = Xx, A_c^p = a_c^p) \cdot \mathbb P (A_c^p = a_c^p ) \\ &= \frac{p\left( \frac{3}{4} \right)^{n-1} + 0.5}{2p \cdot \left(\frac{3}{4}\right)^n + 1} \cdot \left( \mathbb P (A_g = xx \mid A_c = Xx, A_c^p = Xx) \cdot \mathbb P (A_c^p = Xx ) + \mathbb P (A_g = xx \mid A_c = Xx, A_c^p = xx) \cdot \mathbb P (A_c^p = xx ) \right) \\ &= \frac{p\left( \frac{3}{4} \right)^{n-1} + 0.5}{2p \cdot \left(\frac{3}{4}\right)^n + 1} \cdot \left( \frac{1}{4} \cdot 2p(1 - p) + \frac{1}{2} \cdot p^2 \right) \\ &= \frac{p\left( \frac{3}{4} \right)^{n-1} + 0.5}{2p \cdot \left(\frac{3}{4}\right)^n + 1} \cdot \frac{p}{2} , \end{align} \]

since

- the grandchild can only be blue-eyed if the (brown-eyed) child has at least one x-allele, i.e. the child is \(Xx\);
- \(A\) and \(A^p\) are independent by the random mating assumption; and
- \(A_c\) and \(A_c^p\) are independent by the random mating assumption.

As \(n \rightarrow \infty\), this probability converges to \(\frac{p}{4}\).
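Evaluating the closed form (the function name is mine) confirms both the \(n = 1\) value and the limit:

```
# P(Judy's first grandchild has blue eyes), as a function of p and n
p_grandchild_blue <- function(p, n) {
  (p * (3 / 4)^(n - 1) + 0.5) / (2 * p * (3 / 4)^n + 1) * p / 2
}

p_grandchild_blue(0.2, 1)    # 0.0538...
p_grandchild_blue(0.2, 100)  # essentially p / 4 = 0.05
```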

To simulate Judy’s grandkids, we pair up `judy_kids` with members of the general population.

```
judy_grandkids <- judy_kids %>%
  pair(df) %>%
  reproduce()
judy_grandkids %>%
  summarise(mean(eye_colour == 'blue')) %>%
  pull() %>%
  signif(3)
```

`[1] 0.054`

The above fraction of grandkids with blue eyes is consistent with the theoretical value of \(\frac{4p + 2}{6p + 4}\cdot\frac{p}{2} \approx\) 0.0538.

Here are my solutions to question 3.4.1 of Causal Inference in Statistics: a Primer (CISP). \(\DeclareMathOperator{\do}{do}\)

If we can only measure one additional variable to estimate the causal effect of \(X\) on \(Y\) in figure 3.8, then we should measure \(W\). From question 3.3.1 we see that no single variable satisfies the backdoor criteria. Moreover, visual inspection of the graph verifies that \(W\) satisfies the frontdoor criteria:

- it intercepts all (the only) directed paths from \(X\) to \(Y\);
- there is no unblocked path from \(X\) to \(W\); and
- all backdoor paths from \(W\) to \(Y\) are blocked by \(X\).

To illustrate this, let’s simulate the causal effect in 3 separate ways:

- by intervention,
- via the backdoor, and
- via the frontdoor.

Here are the data. Note that we have created functions for \(W\) and \(Y\) for use later.

```
N <- 100000
W <- function(x) {
  N <- length(x)
  rbinom(N, 1, inv_logit(-x))
}
Y <- function(d, w, z) {
  N <- length(d)
  rbinom(N, 1, inv_logit(-d - w + 3 * z))
}
df <- tibble(id = 1:N) %>%
  mutate(
    b = rnorm(N, 0, 1),
    a = b + rnorm(N, 0, 0.1),
    c = rnorm(N, 0, 1),
    d = rbinom(N, 1, inv_logit(-1 + c)),
    z = rbinom(N, 1, inv_logit(-2 + 2 * b + c)),
    x = rbinom(N, 1, inv_logit(a + z)),
    w = W(x),
    y = Y(d, w, z)
  )
```

id | b | a | c | d | z | x | w | y |
---|---|---|---|---|---|---|---|---|
1 | 0.3641297 | 0.3917626 | 1.0369530 | 1 | 1 | 0 | 0 | 1 |
2 | 0.0287563 | 0.0397299 | 0.5736271 | 0 | 0 | 1 | 1 | 0 |
3 | -0.7727052 | -0.5993870 | -0.5179657 | 0 | 1 | 0 | 1 | 1 |
4 | 0.4107888 | 0.5737898 | 1.2586840 | 0 | 1 | 1 | 0 | 1 |
5 | 2.3512417 | 2.1631719 | 0.6746523 | 1 | 1 | 1 | 0 | 1 |

In order to simulate an intervention, we assign values to \(X\) randomly, then assign new values for all its descendants. After intervention, the causal effect of \(X\) on \(Y\) is simply \(\mathbb P(Y \mid X)\).

```
intervention <- df %>%
  # intervene on x
  mutate(
    x = rbinom(n(), 1, 0.5),
    w = W(x),
    y = Y(d, w, z)
  ) %>%
  # model P(y | do(x))
  glm(
    formula = y ~ x,
    family = binomial(),
    data = .
  ) %>%
  # predict
  augment(
    newdata = tibble(x = 0:1),
    type.predict = 'response'
  )
```

x | .fitted | .se.fit |
---|---|---|
0 | 0.4637566 | 0.0022239 |
1 | 0.5072714 | 0.0022422 |

We can compare this causal effect to the simple statistical effect to see the difference.

```
noncausal <- df %>%
  # model P(y | x)
  glm(
    formula = y ~ x,
    family = binomial(),
    data = .
  ) %>%
  # predict
  augment(
    newdata = tibble(x = 0:1),
    type.predict = 'response'
  )
```

x | .fitted | .se.fit |
---|---|---|
0 | 0.3797282 | 0.0022595 |
1 | 0.5759927 | 0.0021293 |

Since \(\{D, Z\}\) satisfies the backdoor criteria, we can use it to apply the backdoor adjustment. First we’ll need \(\mathbb P(D, Z)\).

```
# P(d, z)
p_d_z <- df %>%
  group_by(d, z) %>%
  count() %>%
  ungroup() %>%
  mutate(p_d_z = n / sum(n))
```

Now we model \(\mathbb P(Y \mid X, D, Z)\), multiply it by \(\mathbb P(D, Z)\), then take the sum for each value of \(X\).

```
backdoor <- formula(y ~ 1 + x + z + d) %>%
  # model P(y | x, d, z)
  glm(
    family = binomial(),
    data = df
  ) %>%
  # predict
  augment(
    type.predict = 'response',
    newdata = crossing(
      d = c(0, 1),
      x = c(0, 1),
      z = c(0, 1)
    )
  ) %>%
  # get P(d, z)
  mutate(p_y_given_d_x_z = .fitted) %>%
  inner_join(p_d_z, by = c('d', 'z')) %>%
  # backdoor adjustment over d, z
  group_by(x) %>%
  summarise(p_y_given_do_x = sum(p_y_given_d_x_z * p_d_z))
```

x | p_y_given_do_x |
---|---|
0 | 0.4681530 |
1 | 0.5033398 |

Note that the backdoor adjusted estimates are similar to the estimates from intervention.

To apply the frontdoor adjustment with \(W\), we’ll need \(\mathbb P(W \mid X)\), \(\mathbb P(X^\prime)\), and \(\mathbb P(Y \mid X^\prime, W)\).

```
p_w_given_x <- df %>%
  group_by(x, w) %>%
  count() %>%
  group_by(x) %>%
  mutate(p_w_given_x = n / sum(n)) %>%
  ungroup()
p_xprime <- df %>%
  group_by(xprime = x) %>%
  count() %>%
  ungroup() %>%
  mutate(p_xprime = n / sum(n))
p_y_given_xprime_w <- formula(y ~ 1 + x + w) %>%
  glm(
    family = binomial(),
    data = df
  ) %>%
  augment(
    newdata = crossing(x = 0:1, w = 0:1),
    type.predict = 'response'
  ) %>%
  transmute(
    xprime = x,
    w,
    p_y_given_xprime_w = .fitted
  )
```

Now we apply the frontdoor adjustment:

\[ \mathbb P (Y \mid \do(X)) = \sum_{x^\prime, w} \mathbb P(x^\prime) \cdot \mathbb P(w \mid x) \cdot \mathbb P (y \mid x^\prime, w) . \]

```
frontdoor <- p_w_given_x %>%
  inner_join(p_y_given_xprime_w, by = 'w') %>%
  inner_join(p_xprime, by = 'xprime') %>%
  group_by(x) %>%
  summarise(sum(p_w_given_x * p_y_given_xprime_w * p_xprime))
```

x | sum(p_w_given_x * p_y_given_xprime_w * p_xprime) |
---|---|
0 | 0.4623710 |
1 | 0.5041105 |

Our frontdoor estimates of \(\mathbb P(Y \mid \do(X))\) are very similar to the intervention and backdoor estimates.

Here are my solutions to question 3.3.3 of Causal Inference in Statistics: a Primer (CISP). \(\DeclareMathOperator{\do}{do}\)

The drug you have been assigned determines which ward you go to. Whether you get a lollipop is determined by which ward you go to and whether you show signs of depression. Depression is a symptom of certain risk factors. These risk factors, together with the drug you have been assigned, determine your capacity for recovery.

Since `lollipop` is a collider in this diagram, there are no backdoor paths from `drug` to `recovery`. In other words, it is not necessary to condition on any variables to estimate the causal effect of `drug` on `recovery`. In this case, \(\mathbb P (Y \mid \do(X)) = \mathbb P(Y \mid X)\).

If the nurse were to give out the lollipops in the day after the study, there would be no difference in the causal diagram.

Here are my solutions to question 3.3.2 of Causal Inference in Statistics: a Primer (CISP).

The following DAG is a possible causal graph representing the situation. We wish to find the causal effect of the plan on weight gain. The weight gain \(W_g\) is defined as a linear function of the initial and final weights. From the graph we see that the plan chosen by the students is a function of their initial weight.

Since initial weight \(W_I\) is a confounder of plan and weight gain, the second statistician is correct to condition on initial weight.

The causal diagram here is essentially the same as in Simpson’s paradox. The debate boils down to the direction of the arrow between initial weight and plan.

Here are my solutions to question 3.3.1 of Causal Inference in Statistics: a Primer (CISP).

For the causal effect of \(X\) on \(Y\), every backdoor path must pass via \(Z\). Since \(Z\) is adjacent to \(X\), we must condition on \(Z\). Since \(Z\) is a collider on the path \(B \rightarrow Z \leftarrow C\), we must also condition on either \(A\), \(B\), \(C\), or \(D\). Thus, the sets of variables that satisfy the backdoor criteria are arbitrary unions of the following minimal sets:

- \(\{ Z, A \}\),
- \(\{ Z, B \}\),
- \(\{ Z, C \}\), and
- \(\{ Z, D \}\).

All backdoor paths from \(D\) to \(Y\) must pass through both \(C\) and \(Z\). We can block all backdoor paths by conditioning on \(C\). If we don’t condition on \(C\), then we must condition on \(Z\). Since \(Z\) is a collider, conditioning on it requires us to also condition on one of \(B\), \(A\), \(X\), or \(W\) (the nodes on the only backdoor path). The minimal sets satisfying the backdoor criteria are:

- \(\{ C \}\),
- \(\{ Z, B \}\),
- \(\{ Z, A \}\),
- \(\{ Z, X \}\), and
- \(\{ Z, W \}\).

Note that \(\{C, Z\}\) also satisfies the backdoor criteria but is not a union of any minimal sets.

All backdoor paths from \(\{D, W\}\) to \(Y\) must pass through \(Z\) and through either \(C\) or \(X\). The node \(Z\) is sufficient to block all backdoor paths after intervening on \(D\) and \(W\). If we don’t condition on \(Z\), then we must condition on both \(X\) and \(C\). The minimal sets satisfying the backdoor criteria are:

- \(\{ C, X \}\), and
- \(\{ Z \}\) .

Here are my solutions to question 3.2.1 of Causal Inference in Statistics: a Primer (CISP).

Here are the parameters we’ll use. Note that they are taken from the Simpson’s reversal example of question 1.5.2.

```
r <- 0.28 # fraction with syndrome
q0 <- 0.07 # P(X = 1 | Z = 0)
q1 <- 0.85 # P(X = 1 | Z = 1)
p00 <- 0.84 # P(Y = 1 | X = 0, Z = 0)
p10 <- 0.88 # P(Y = 1 | X = 1, Z = 0)
p01 <- 0.53 # P(Y = 1 | X = 0, Z = 1)
p11 <- 0.58 # P(Y = 1 | X = 1, Z = 1)
```

We can simulate the intervention by generating values for \(X\) independently of \(Z\).

```
N <- 10000 # number of individuals
set.seed(53201)
part_a <- tibble(z = rbinom(N, 1, r)) %>%
  mutate(
    x = rbinom(n(), 1, 0.5), # no Z-dependence
    p_y_given_x_z = case_when(
      x == 0 & z == 0 ~ p00,
      x == 0 & z == 1 ~ p01,
      x == 1 & z == 0 ~ p10,
      x == 1 & z == 1 ~ p11
    ),
    y = rbinom(n(), 1, p_y_given_x_z)
  ) %>%
  group_by(x, y) %>%
  summarise(n = n()) %>%
  group_by(x) %>%
  mutate(p_y_given_do_x = n / sum(n))
```

x | y | n | p_y_given_do_x |
---|---|---|---|
0 | 0 | 1238 | 0.2470565 |
0 | 1 | 3773 | 0.7529435 |
1 | 0 | 964 | 0.1932251 |
1 | 1 | 4025 | 0.8067749 |

To simulate observational data, we need to include the dependence of \(X\) on \(Z\).

```
N <- 100000 # number of individuals
set.seed(95400)
p_x_y_z <- tibble(
  id = 1:N,
  z = rbinom(N, 1, r),
  x = rbinom(N, 1, if_else(z == 0, q0, q1)),
  p_y_given_x_z = case_when(
    x == 0 & z == 0 ~ p00,
    x == 0 & z == 1 ~ p01,
    x == 1 & z == 0 ~ p10,
    x == 1 & z == 1 ~ p11
  ),
  y = rbinom(N, 1, p_y_given_x_z)
) %>%
  group_by(x, y, z) %>%
  count() %>%
  ungroup() %>%
  mutate(p = n / sum(n))
```

x | y | z | n | p |
---|---|---|---|---|
0 | 0 | 0 | 10723 | 0.10723 |
0 | 0 | 1 | 2020 | 0.02020 |
0 | 1 | 0 | 56015 | 0.56015 |
0 | 1 | 1 | 2205 | 0.02205 |
1 | 0 | 0 | 624 | 0.00624 |
1 | 0 | 1 | 10067 | 0.10067 |
1 | 1 | 0 | 4470 | 0.04470 |
1 | 1 | 1 | 13876 | 0.13876 |

In order to apply the causal effect rule, we’ll need \(\mathbb P(x \mid z)\).

```
p_x_given_z <- p_x_y_z %>%
  group_by(x, z) %>%
  summarise(n = sum(n)) %>%
  group_by(z) %>%
  mutate(p = n / sum(n)) %>%
  ungroup()
```

x | z | n | p |
---|---|---|---|
0 | 0 | 66738 | 0.9290845 |
0 | 1 | 4225 | 0.1499929 |
1 | 0 | 5094 | 0.0709155 |
1 | 1 | 23943 | 0.8500071 |

We can then join the conditional probabilities onto the joint distribution table, divide, and sum over the values of \(Z\).
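
This works because of the causal effect rule: dividing the joint distribution by \(\mathbb P(x \mid z)\) and summing over \(z\) yields the adjustment formula,

\[ \sum_z \frac{\mathbb P(x, y, z)}{\mathbb P(x \mid z)} = \sum_z \mathbb P(y \mid x, z) \mathbb P(z) = \mathbb P(y \mid do(x)) , \]

using the decomposition \(\mathbb P(x, y, z) = \mathbb P(z) \cdot \mathbb P(x \mid z) \cdot \mathbb P(y \mid x, z)\).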

```
p_y_given_do_x <- p_x_y_z %>%
  inner_join(
    p_x_given_z,
    by = c('x', 'z'),
    suffix = c('_num', '_denom')
  ) %>%
  mutate(p = p_num / p_denom) %>%
  group_by(x, y) %>%
  summarise(p = sum(p))
```

x | y | p |
---|---|---|
0 | 0 | 0.2500877 |
0 | 1 | 0.7499123 |
1 | 0 | 0.2064264 |
1 | 1 | 0.7935736 |

The causal effect estimates are close to those obtained from the simulated intervention.

We can calculate ACE simply by taking the difference of the causal effect estimates.
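
By the adjustment formula, the ACE is just the \(\mathbb P(z)\)-weighted average of the sub-population effects:

\[ \text{ACE} = \mathbb P(Y = 1 \mid do(X = 1)) - \mathbb P(Y = 1 \mid do(X = 0)) = \sum_z (p_{1, z} - p_{0, z}) \mathbb P(z) \approx 0.04 \times 0.72 + 0.05 \times 0.28 \approx 0.043 . \]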

```
ace <- p_y_given_do_x %>%
  spread(x, p) %>%
  filter(y == 1) %>%
  mutate(ace = `1` - `0`) %>%
  pull(ace)

ace
```

`[1] 0.04366134`

This is very different from the observational risk difference, \(\mathbb P(Y = 1 \mid X = 1) - \mathbb P(Y = 1 \mid X = 0)\).

```
p_y_given_x <- p_x_y_z %>%
  group_by(x, y) %>%
  summarise(n = sum(n)) %>%
  group_by(x) %>%
  mutate(p = n / sum(n)) %>%
  select(-n)

risk_difference <- p_y_given_x %>%
  spread(x, p) %>%
  filter(y == 1) %>%
  mutate(rd = `1` - `0`) %>%
  pull(rd)

risk_difference
```

`[1] -0.188613`

Making \(X\) independent of \(Z\) would remove the discrepancy between the ACE and the risk difference, since it would turn the adjustment formula into the formula for \(\mathbb P(y \mid x)\). In other words, setting \(q_0 = q_1 = \mathbb P(X = 1)\) would do the trick.

Note that the disaggregated causal effects

- \(p_{1, 0} - p_{0, 0}\) is 0.04; and
- \(p_{1, 1} - p_{0, 1}\) is 0.05,

are both consistent with our calculation of the overall causal effect, ACE = 4.37%. The generated data illustrate Simpson’s reversal because the risk difference, -18.9%, has the opposite sign.

Here are my solutions to question 1.5.2 of Causal Inference in Statistics: a Primer (CISP).

I’ll use different indexing to make the notation clearer. In particular, the indices will match the values of the conditioning variables.

The full joint probability is

\[ \mathbb P(x, y, z) = \mathbb P (z) \cdot \mathbb P (x \mid z) \cdot \mathbb P (y \mid x, z) \]

using the decomposition formula. Each factor is given by

\[ \begin{align} \mathbb P (z) &= z r + (1 - z) (1 - r) \\ \mathbb P (x \mid z) &= xq_z + (1 - x)(1 - q_z) \\ \mathbb P (y \mid x, z) &= yp_{x, z} + (1 - y)(1 - p_{x, z}) \end{align} \]

where each of the variables \(x\), \(y\), and \(z\) takes values in \(\{0, 1\}\).
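
For example, the probability that an individual has the syndrome (\(Z = 1\)), takes the drug (\(X = 1\)), and has outcome \(Y = 1\) is

\[ \mathbb P(1, 1, 1) = r \cdot q_1 \cdot p_{1, 1} . \]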

The marginal distributions are given by

\[ \begin{align} \mathbb P(x, z) &= \mathbb P(x \mid z) \cdot \mathbb P (z) \\ \mathbb P(y, z) &= \mathbb P(0, y, z) + \mathbb P(1, y, z) \\ \mathbb P(x, y) &= \mathbb P(x, y, 0) + \mathbb P(x, y, 1) \\ &= \sum_z \mathbb P(z) \cdot \mathbb P(x \mid z) \cdot \left( yp_{x, z} + (1 - y)(1 - p_{x, z}) \right) . \end{align} \]

Furthermore,

\[ \begin{align} \mathbb P (x) &= \sum_z \mathbb P(x \mid z) \mathbb P (z) \\ &= \sum_z (xq_z + (1 - x)(1 - q_z))(zr + (1 - z)(1 - r)) \end{align} \]

so that

\[ \begin{align} \mathbb P(X = 0) &= (1 - q_0)(1 - r) + (1 - q_1)r \\ \mathbb P(X = 1) &= q_0(1 - r) + q_1r \end{align} \]

The increase in probability from taking the drug in each sub-population is:

- \(\mathbb P(y = 1 \mid x = 1, z = 0) - \mathbb P(y = 1 \mid x = 0, z = 0) = p_{1, 0} - p_{0, 0}\); and
- \(\mathbb P(y = 1 \mid x = 1, z = 1) - \mathbb P(y = 1 \mid x = 0, z = 1) = p_{1, 1} - p_{0, 1}\).

In the whole population, the increase is \(\mathbb P(Y = 1 \mid X = 1) - \mathbb P(Y = 1 \mid X = 0)\), calculated via

\[ \begin{align} & \sum_{z = 0}^1 \mathbb P(Y = 1, Z = z \mid X = 1) - \mathbb P(Y = 1, Z = z \mid X = 0) \\ &= \sum_{z = 0}^1 \frac{\mathbb P(X = 1, Y = 1, Z = z)}{\mathbb P(X = 1)} - \frac{\mathbb P(X = 0, Y = 1, Z = z)}{\mathbb P(X = 0)} \\ &= \frac{(1 - r)q_0p_{1, 0} + rq_1p_{1, 1}}{q_0(1 - r) + q_1r} - \frac{(1 - r)(1 - q_0)p_{0, 0} + r(1 - q_1)p_{0, 1}}{(1 - q_0)(1 - r) + (1 - q_1)r} \end{align} \]

There’s no need to be smart about this. Let’s just simulate lots of values and find some combination with a Simpson’s reversal. We’ll generate a dataset with a positive probability difference in each sub-population, then filter out anything that also has a non-negative population difference.

```
set.seed(8168)
N <- 10000

part_c <- tibble(
  id = 1:N %>% as.integer(),
  r = rbeta(N, 2, 2),   # P(Z = 1)
  q0 = rbeta(N, 2, 2),  # P(X = 1 | Z = 0)
  q1 = rbeta(N, 2, 2),  # P(X = 1 | Z = 1)
  p00 = rbeta(N, 2, 2),                  # P(Y = 1 | X = 0, Z = 0)
  p10 = rbeta(N, 2, 2) * (p00 - 1) + 1,  # P(Y = 1 | X = 1, Z = 0)
  p01 = rbeta(N, 2, 2),                  # P(Y = 1 | X = 0, Z = 1)
  p11 = rbeta(N, 2, 2) * (p01 - 1) + 1,  # P(Y = 1 | X = 1, Z = 1)
  diff_pop = (p10 * q0 * (1 - r) + p11 * q1 * r) /
    (q0 * (1 - r) + q1 * r) -
    (p00 * (1 - q0) * (1 - r) + p01 * (1 - q1) * r) /
    ((1 - q0) * (1 - r) + (1 - q1) * r),
  diff_z0 = p10 - p00,
  diff_z1 = p11 - p01
)
```

As a check, there should be no rows with a non-positive sub-population difference, since by construction \(p_{1, z} > p_{0, z}\).

```
check <- part_c %>%
  filter(diff_z0 <= 0 | diff_z1 <= 0) %>%
  nrow()

# throw an error if there are any such rows
stopifnot(check == 0)
check
```

`[1] 0`

Now we simply throw away the rows with a non-negative population difference (in fact, we keep only those below -0.05). Here is one combination of parameters exhibiting Simpson’s reversal.

```
simpsons_reversal <- part_c %>%
  filter(diff_pop < -0.05) %>%
  head(1) %>%
  gather(term, value)
```

term | value |
---|---|
id | 109.0000000 |
r | 0.2837123 |
q0 | 0.0664811 |
q1 | 0.8468126 |
p00 | 0.8441892 |
p10 | 0.8827558 |
p01 | 0.5273831 |
p11 | 0.5816885 |
diff_pop | -0.1933634 |
diff_z0 | 0.0385666 |
diff_z1 | 0.0543054 |
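
As an independent arithmetic check of the population-difference formula, here is a minimal sketch in plain Python (rather than the R used elsewhere in this post), plugging in the parameter values from the table above:

```python
# Parameter values copied from the table above (question 1.5.2 search)
r = 0.2837123                    # P(Z = 1)
q0, q1 = 0.0664811, 0.8468126    # P(X = 1 | Z = 0), P(X = 1 | Z = 1)
p00, p10 = 0.8441892, 0.8827558  # P(Y = 1 | X = 0, Z = 0), P(Y = 1 | X = 1, Z = 0)
p01, p11 = 0.5273831, 0.5816885  # P(Y = 1 | X = 0, Z = 1), P(Y = 1 | X = 1, Z = 1)

# Population-level difference P(Y = 1 | X = 1) - P(Y = 1 | X = 0)
diff_pop = (
    ((1 - r) * q0 * p10 + r * q1 * p11) / ((1 - r) * q0 + r * q1)
    - ((1 - r) * (1 - q0) * p00 + r * (1 - q1) * p01)
    / ((1 - r) * (1 - q0) + r * (1 - q1))
)

# Sub-population differences
diff_z0 = p10 - p00
diff_z1 = p11 - p01

# Simpson's reversal: positive within each stratum, negative overall
print(diff_pop, diff_z0, diff_z1)
```

The three printed values should match the `diff_pop`, `diff_z0`, and `diff_z1` rows of the table.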

As a final check, let’s generate a dataset for this set of parameters.

```
df <- simpsons_reversal %>%
  spread(term, value) %>%
  crossing(unit = 1:N) %>%
  mutate(
    z = rbinom(N, 1, r),
    x = rbinom(N, 1, if_else(z == 0, q0, q1)),
    p_y_given_x_z = case_when(
      x == 0 & z == 0 ~ p00,
      x == 0 & z == 1 ~ p01,
      x == 1 & z == 0 ~ p10,
      x == 1 & z == 1 ~ p11
    ),
    y = rbinom(N, 1, p_y_given_x_z)
  ) %>%
  select(unit, x, y, z)
```

The empirical joint probability distribution is as follows.

```
p_x_y_z <- df %>%
  group_by(x, y, z) %>%
  count() %>%
  ungroup() %>%
  mutate(p = n / sum(n))
```

x | y | z | n | p |
---|---|---|---|---|
0 | 0 | 0 | 1068 | 0.1068 |
0 | 0 | 1 | 197 | 0.0197 |
0 | 1 | 0 | 5609 | 0.5609 |
0 | 1 | 1 | 224 | 0.0224 |
1 | 0 | 0 | 52 | 0.0052 |
1 | 0 | 1 | 1016 | 0.1016 |
1 | 1 | 0 | 400 | 0.0400 |
1 | 1 | 1 | 1434 | 0.1434 |

The population-level probability difference is given by:

```
diff_pop <- p_x_y_z %>%
  group_by(x) %>%
  summarise(p = sum(n * y) / sum(n)) %>%
  spread(x, p) %>%
  mutate(diff = `1` - `0`)
```

0 | 1 | diff |
---|---|---|
0.8217808 | 0.6319779 | -0.1898028 |

which is close to the theoretical value.

Similarly, the sub-population differences are

```
diff_z <- p_x_y_z %>%
  group_by(x, z) %>%
  summarise(p = sum(n * y) / sum(n)) %>%
  spread(x, p) %>%
  mutate(diff = `1` - `0`)
```

z | 0 | 1 | diff |
---|---|---|---|
0 | 0.8400479 | 0.8849558 | 0.0449078 |
1 | 0.5320665 | 0.5853061 | 0.0532396 |

which are also close to the theoretical values we calculated. More importantly, they have the opposite sign to the population difference, confirming that we have a case of Simpson’s reversal.
