Here are my solutions to the medium exercises in chapter 3 of McElreath’s Statistical Rethinking, 2nd edition.

\(\DeclareMathOperator{\dbinomial}{Binomial} \DeclareMathOperator{\dbernoulli}{Bernoulli} \DeclareMathOperator{\dpoisson}{Poisson} \DeclareMathOperator{\dnormal}{Normal} \DeclareMathOperator{\dt}{t} \DeclareMathOperator{\dcauchy}{Cauchy} \DeclareMathOperator{\dexponential}{Exp} \DeclareMathOperator{\duniform}{Uniform} \DeclareMathOperator{\dgamma}{Gamma} \DeclareMathOperator{\dinvgamma}{InvGamma} \DeclareMathOperator{\invlogit}{InvLogit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\ddirichlet}{Dirichlet} \DeclareMathOperator{\dbeta}{Beta}\)

Assuming Earth has 70% water cover, and we observe water 8 times out of 15 globe tosses, let’s calculate some posterior quantities with two choices of prior: uniform and step.

```
p_true <- 0.7
W <- 8
N <- 15
granularity <- 1000 # points on the grid
```

We calculate the grid approximation of the posterior as shown in the book.

```
m1_grid <- tibble(p = seq(0, 1, length.out = granularity)) %>%
  mutate(prior = 1)
m1_posterior <- m1_grid %>%
  mutate(
    likelihood = dbinom(W, N, p),
    posterior = prior * likelihood
  )
```

We can get draws from our posterior by sampling the water cover values many times with replacement, each value being drawn in proportion to the posterior probability. We can then just summarise these draws to get the desired interval.

```
m2_samples <- m1_posterior %>%
  sample_n(10000, replace = TRUE, weight = posterior)
m2_hpdi <- HPDI(m2_samples$p, prob = 0.9)
m2_hpdi
```

```
|0.9 0.9|
0.3223223 0.7097097
```

The histogram looks as follows. This is much the same as the previous graph, but calculated from the samples.
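A histogram like this can be produced along the following lines. This is a base-R sketch in which `p_draws` merely stands in for `m2_samples$p` (made-up Beta draws rather than the actual grid samples):

```
# base-R sketch; p_draws stands in for m2_samples$p
set.seed(3)
p_draws <- rbeta(10000, 9, 8) # stand-in posterior draws of water cover
h <- hist(p_draws, breaks = 50, plot = FALSE)
plot(h, main = "Posterior draws of p", xlab = "p")
```

With the real draws, `hist(m2_samples$p, breaks = 50)` gives the same picture.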

To get the posterior predictive sample, we take our posterior draws of \(p\), then use them to draw a random number of observed water tosses out of 15. The fraction of posterior predictive samples with a given value is then the posterior predictive probability of that value.

```
m3_prob <- m2_samples %>%
  mutate(W = rbinom(n(), 15, p)) %>%
  group_by(W) %>%
  tally() %>%
  mutate(probability = n / sum(n))
```

We can also calculate the posterior predictive probabilities for a different number of tosses, here 9.

```
m4_prob <- m2_samples %>%
  mutate(W = rbinom(n(), 9, p)) %>%
  group_by(W) %>%
  tally() %>%
  mutate(probability = n / sum(n))
```

Now we repeat the same steps but with the step prior instead of the uniform prior. We’ll just repeat it without comment.

```
m5_grid <- m1_grid %>%
  mutate(prior = if_else(p < 0.5, 0, 1))
m5_posterior <- m5_grid %>%
  mutate(
    likelihood = dbinom(W, N, p),
    posterior = prior * likelihood
  )
```

```
m5_samples <- m5_posterior %>%
  sample_n(10000, replace = TRUE, weight = posterior)
m5_hpdi <- HPDI(m5_samples$p, prob = 0.9)
m5_hpdi
```

```
|0.9 0.9|
0.5005005 0.7107107
```

```
m5_prob <- m5_samples %>%
  mutate(W = rbinom(n(), 15, p)) %>%
  group_by(W) %>%
  tally() %>%
  mutate(probability = n / sum(n))
```

```
m5_prob9 <- m5_samples %>% # new name so the 15-toss result above isn't overwritten
  mutate(W = rbinom(n(), 9, p)) %>%
  group_by(W) %>%
  tally() %>%
  mutate(probability = n / sum(n))
```

Let’s compare the proportion of samples within 0.05 of the true value for each prior.

```
p_close_uniform <- m2_samples %>%
  group_by(close = p %>% between(p_true - 0.05, p_true + 0.05)) %>%
  tally() %>%
  mutate(probability = n / sum(n)) %>%
  filter(close) %>%
  pull(probability)
p_close_step <- m5_samples %>%
  group_by(close = p %>% between(p_true - 0.05, p_true + 0.05)) %>%
  tally() %>%
  mutate(probability = n / sum(n)) %>%
  filter(close) %>%
  pull(probability)
```

The probability of being close to the true value under the uniform and step priors is 0.1316 and 0.2157, respectively. The step prior thus has more mass around the true value.

Bayesian models are generative, meaning we can simulate new datasets according to our prior probabilities. We’ll simulate 10 datasets for each value of N of interest. We simulate a dataset by randomly choosing a `p_true` from our uniform prior, then randomly choosing a `W` from the corresponding binomial distribution.

```
m6_prior_predictive <- crossing(
  N = 200 * (1:16),
  iter = 1:10
) %>%
  mutate(
    p_true = runif(n(), min = 0, max = 1),
    W = rbinom(n(), N, p_true)
  )
```

For each of these simulated datasets, we grid approximate the posterior, take posterior samples, then calculate the HPDI.

```
m6_grid <- tibble(p = seq(0, 1, length.out = granularity)) %>%
  mutate(prior = 1)
m6_posteriors <- m6_prior_predictive %>%
  crossing(m6_grid) %>%
  group_by(N, p_true, iter) %>%
  mutate(
    likelihood = dbinom(W, N, p),
    posterior = prior * likelihood
  )
m6_samples <- m6_posteriors %>%
  sample_n(1000, replace = TRUE, weight = posterior)
m6_hpdi <- m6_samples %>%
  summarise(lo = HPDI(p, 0.99)[1], hi = HPDI(p, 0.99)[2]) %>%
  mutate(width = abs(hi - lo))
```

Now for each value of N, we check how many of the intervals have the desired width.

```
m6_n <- m6_hpdi %>%
  group_by(N) %>%
  summarise(fraction = mean(width < 0.05))
```

Thus we expect a sample size around 2600-3000 to give us a sufficiently precise posterior estimation.

Here are my solutions to the hard exercises in chapter 3 of McElreath’s Statistical Rethinking, 2nd edition.

Let’s first put the data into a tibble for easier manipulation later.

```
data(homeworkch3)
df <- tibble(birth1 = birth1, birth2 = birth2) %>%
  mutate(birth = row_number())
```

birth1 | birth2 | birth |
---|---|---|
1 | 0 | 1 |
0 | 1 | 2 |
0 | 0 | 3 |
0 | 1 | 4 |
1 | 0 | 5 |
1 | 1 | 6 |

Let’s check we have the correct total count and the correct number of boys.

```
h1_counts <- df %>%
  gather(order, gender, -birth) %>%
  summarise(boys = sum(gender), births = n())
```

Now we can grid approximate the posterior as before.

```
granularity <- 1000
h1_grid <- tibble(p = seq(0, 1, length.out = granularity)) %>%
  mutate(prior = 1)
h1_posterior <- h1_grid %>%
  mutate(
    likelihood = dbinom(h1_counts$boys, h1_counts$births, p),
    posterior = prior * likelihood,
    posterior = posterior / sum(posterior)
  )
```

The maximum a posteriori (MAP) value is the value of p that maximises the posterior.

```
h1_map <- h1_posterior %>%
  slice(which.max(posterior)) %>%
  pull(p)
h1_map
```

`[1] 0.5545546`

We draw samples with weight equal to the posterior. We then apply the `HPDI` function to these samples, each time with a different width.

```
h2_samples <- h1_posterior %>%
  sample_n(10000, replace = TRUE, weight = posterior) %>%
  pull(p)
h2_hpdi <- h2_samples %>%
  crossing(prob = c(0.5, 0.89, 0.97)) %>%
  group_by(prob) %>%
  group_map(HPDI)
h2_hpdi
```

```
[[1]]
|0.5 0.5|
0.4574575 0.5735736
[[2]]
|0.89 0.89|
0.4534535 0.6606607
[[3]]
|0.97 0.97|
0.4294294 0.6616617
```

The posterior predictive samples are possible observations according to our posterior.

`h3_posterior_predictive <- rbinom(10000, 200, h2_samples)`

The number of observed births is very close to the MAP of the posterior predictive distribution, suggesting we have a decent fit.
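To make that comparison concrete, here is a hedged base-R sketch: `p_draws` stands in for the posterior samples `h2_samples` (made-up Beta draws rather than the actual grid posterior), and 111 is the observed number of boys out of the 200 births.

```
set.seed(1)
p_draws <- rbeta(10000, 112, 90) # stand-in for h2_samples
ppd <- rbinom(10000, 200, p_draws) # posterior predictive draws
ppd_mode <- as.integer(names(which.max(table(ppd)))) # most frequent value
c(observed = 111, predictive_mode = ppd_mode)
```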

Our data are from birth pairs, and so far we haven’t made any distinction between the first and second births. To test this assumption, we can perform a posterior predictive check as in 3H3, but this time for first births only.

`h4_posterior_predictive <- rbinom(10000, 100, h2_samples)`

The fit doesn’t look quite as good for first births as it did for all births together. It also doesn’t look bad since there is still a fair bit of probability mass around the observed number of first birth boys.

As the final posterior predictive check, let’s check the number of boys born after a girl.

```
h5_counts <- df %>%
  filter(birth1 == 0) %>%
  summarise(boys = sum(birth2), births = n())
h5_posterior_predictive <- rbinom(10000, h5_counts$births, h2_samples)
```

The fit here looks bad, since the observed number of boys is higher than the bulk of the model’s expectations.

Here are my solutions to the hard exercises in chapter 2 of McElreath’s Statistical Rethinking, 1st edition. When writing this up, I came across a very relevant article. We’ll solve these problems in two ways: using the counting method and using Bayes rule.

Let’s generate a dataset with all the features necessary to solve all the questions: twins at first birth, twins at second birth, and testing positive for species A.

```
N <- 100000
dfa <- tibble(
  species = 'A',
  t1 = rbinom(N, 1, 0.1),
  t2 = rbinom(N, 1, 0.1),
  pa = rbinom(N, 1, 0.8)
)
dfb <- tibble(
  species = 'B',
  t1 = rbinom(N, 1, 0.2),
  t2 = rbinom(N, 1, 0.2),
  pa = rbinom(N, 1, 1 - 0.65)
)
df <- dfa %>% bind_rows(dfb)
```

All of the problems can now be solved by simply filtering out any events not consistent with our observations, then summarising the remaining events.

```
h1 <- df %>%
  filter(t1 == 1) %>%
  summarise(mean(t2 == 1)) %>%
  pull()
h2 <- df %>%
  filter(t1 == 1) %>%
  summarise(mean(species == 'A')) %>%
  pull()
h3 <- df %>%
  filter(t1 == 1, t2 == 0) %>%
  summarise(mean(species == 'A')) %>%
  pull()
h4a <- df %>%
  filter(pa == 1) %>%
  summarise(mean(species == 'A')) %>%
  pull()
h4b <- df %>%
  filter(pa == 1, t1 == 1, t2 == 0) %>%
  summarise(mean(species == 'A')) %>%
  pull()
```

exercise | bayes | counting |
---|---|---|
h1 | 0.1666667 | 0.1669936 |
h2 | 0.3333333 | 0.3360484 |
h3 | 0.3529412 | 0.3635856 |
h4a | 0.6956522 | 0.6963991 |
h4b | 0.5443787 | 0.5656231 |

For H1 we expect the probability to be between 0.1 and 0.2, since those are the two possible birth rates. Also, since we observed a twin birth already, it makes sense that it is closer to 0.2 since species B is more likely to birth twins. In other words, in H2 we expect the species to be less likely to be species A. Birthing a singleton infant is fairly common, so we wouldn’t expect this observation to change our inference very much in H3.

Let’s also work out the solutions analytically using Bayes rule. Let’s start with H2 since it’s useful for calculating H1.

\[ \begin{align} \mathbb P(A \mid T_1) &= \frac{\mathbb P(T_1 \mid A) \mathbb P(A)}{\mathbb P(T_1)} \\ &= \frac{\mathbb P(T_1 \mid A) \mathbb P(A)}{\mathbb P(T_1 \mid A) \mathbb P(A) + \mathbb P(T_1 \mid B) \mathbb P(B)} \\ &= \frac{0.1 \cdot 0.5}{0.1 \cdot 0.5 + 0.2 \cdot 0.5} \\ &= \frac{0.05}{0.05 + 0.1} \\ &= \frac{1}{3} \end{align} \]

Now we can use our solution to H2 and plug it into the appropriate place in the formula for H1. Note that \(\mathbb P(T_2 \mid A)\) is the same as \(\mathbb P(T_1 \mid A)\) by the assumptions of the problem. Similarly, once we know the species, whether the first birth was twins is irrelevant to the probability of twins in the second birth, i.e. \(\mathbb P(T_2 \mid T_1, A) = \mathbb P(T_2 \mid A)\).

\[ \begin{align} \mathbb P(T_2 \mid T_1) &= \mathbb P(T_2 \mid T_1, A) \mathbb P(A \mid T_1) + \mathbb P(T_2 \mid T_1, B) \mathbb P(B \mid T_1) \\ &= \mathbb P(T_2 \mid A) \mathbb P(A \mid T_1) + \mathbb P(T_2 \mid B) \mathbb P(B \mid T_1) \\ &= \frac{1}{10} \cdot \frac{1}{3} + \frac{2}{10} \cdot \frac{2}{3} \\ &= \frac{5}{30} \\ &= \frac{1}{6} \end{align} \]

For H3, let’s use the notation \(-T_i\) to mean singleton infants (i.e. not twins).

\[ \begin{align} \mathbb P(A \mid T_1, - T_2) &= \frac{\mathbb P(- T_2 \mid T_1, A) \mathbb P(A \mid T_1)}{\mathbb P(- T_2 \mid T_1)} \\ &= \frac{\mathbb P(- T_2 \mid A) \mathbb P(A \mid T_1)}{\mathbb P(- T_2 \mid T_1)} \\ &= \frac{(1 - 0.1) \cdot \frac{1}{3}}{1 - 0.15} \\ &=\frac{0.3}{0.85} \\ &= \frac{6}{17} \end{align} \]

This is about 0.353.

Now for H4a.

\[ \begin{align} \mathbb P(A \mid P_A) &= \frac{\mathbb P(P_A \mid A) \mathbb P(A)}{\mathbb P(P_A)} \\ &= \frac{\mathbb P(P_A \mid A) \mathbb P(A)}{\mathbb P(P_A \mid A) \mathbb P(A) + \mathbb P(P_A \mid B) \mathbb P(B)} \\ &= \frac{0.8 \cdot 0.5 }{0.8 \cdot 0.5 + 0.35 \cdot 0.5} \\ &= \frac{0.4 }{0.4 + 0.175} \\ &= \frac{0.4 }{0.575} \end{align} \]

This is about 0.696.

Finally H4b.

\[ \begin{align} \mathbb P(A \mid P_A, T_1, -T_2) &= \frac{\mathbb P(P_A \mid A, T_1, -T_2) \mathbb P(A \mid T_1, -T_2)}{\mathbb P(P_A \mid T_1, -T_2)} \\ &= \frac{\mathbb P(P_A \mid A) \mathbb P(A \mid T_1, -T_2)}{\mathbb P(P_A \mid A) \mathbb P(A \mid T_1, -T_2) + \mathbb P(P_A \mid B) \mathbb P(B \mid T_1, -T_2)} \\ &= \frac{\frac{4}{5} \cdot \frac{6}{17} }{\frac{4}{5}\cdot \frac{6}{17} + \frac{7}{20} \cdot \frac{11}{17}} \\ &= \frac{\frac{24}{85} }{\frac{24}{85} + \frac{77}{340}} \\ &= \frac{\frac{24}{85} }{\frac{92 + 77}{340}} \\ &= \frac{24}{85} \cdot \frac{340}{169} \\ &= \frac{92}{169} \end{align} \]

This is about 0.544.
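As a quick sanity check, the analytic answers above can be computed numerically; they match the `bayes` column of the earlier table.

```
h1 <- 1 / 6
h2 <- 1 / 3
h3 <- 6 / 17
h4a <- 0.4 / 0.575
h4b <- 92 / 169
round(c(h1 = h1, h2 = h2, h3 = h3, h4a = h4a, h4b = h4b), 7)
```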

Here are my solutions to the medium exercises in chapter 2 of McElreath’s Statistical Rethinking, 1st edition. My intention is to move over to the 2nd edition when it comes out next month.

Start by creating a grid and the function `posterior`, which we’ll use for several calculations. This is analogous to the code provided in the chapter.

```
p_true <- 0.7 # assumed ground truth
granularity <- 1000 # number of points on grid
grid1 <- tibble(p = seq(0, 1, length.out = granularity)) %>%
  mutate(prior = 1)
posterior <- function(data, grid) {
  grid %>%
    mutate(
      likelihood = dbinom(sum(data == 'W'), length(data), p),
      unstd_posterior = prior * likelihood,
      posterior = unstd_posterior / sum(unstd_posterior)
    )
}
```

The exercise asks us to approximate the posterior for each of the following three datasets. To do this, we just apply our `posterior` function to each of them.

```
data <- list(
  '1' = c('W', 'W', 'L'),
  '2' = c('W', 'W', 'W', 'L'),
  '3' = c('L', 'W', 'W', 'L', 'W', 'W', 'W')
)
m1 <- data %>%
  map_dfr(posterior, grid1, .id = 'dataset')
```

The posterior becomes gradually more concentrated around the ground truth.

For the second question, we simply do the same but with a different prior. More specifically, for any p below 0.5 we set the prior to zero, then map our `posterior` function over each of the datasets with this new grid.

```
grid2 <- grid1 %>%
  mutate(prior = if_else(p < 0.5, 0, prior))
m2 <- data %>%
  map_dfr(posterior, grid2, .id = 'dataset')
```

Again we see the posterior concentrate more around the ground truth. Moreover, the distribution is more peaked (at ~0.003) than with the uniform prior (which peaks at ~0.0025). The first dataset already gets pretty close to this peak, i.e. this more informative prior gets us better inferences sooner.

For the final question on globe tossing, we can just use the counting method rather than grid approximation. We enumerate all possible events in proportion to how likely they are to occur: 10 L for Mars, 3 L and 7 W for Earth. Then we filter out any events inconsistent with our observation of land, and summarise the remaining possibilities.

```
m3 <- tibble(mars = rep('L', 10)) %>%
  mutate(earth = if_else(row_number() <= 3, 'L', 'W')) %>%
  gather(planet, observation) %>% # all possible events
  filter(observation == 'L') %>% # only those events consistent with observation
  summarise(mean(planet == 'earth')) %>% # fraction of possible events that are earth
  pull()
m3
```

`[1] 0.2307692`

We get around 23%.

We make a list of all sides, filter out any inconsistent with our observation of a black side, then summarise the remaining card possibilities.

```
m4_events <- tibble(card = c("BB", "BW", "WW")) %>% # all the cards
  separate(card, into = c('side1', 'side2'), sep = 1, remove = F) %>%
  gather(side, colour, -card) # all the sides
m4_possibilities <- m4_events %>%
  filter(colour == 'B') # just the possible events where there is a black side
m4 <- m4_possibilities %>%
  summarise(mean(card == 'BB')) %>%
  pull() # which fraction of possible events is a double black?
m4
```

`[1] 0.6666667`

The next exercise is the same as the previous but with more cards. Note that this is equivalent to using the three cards as before but with a larger prior probability on the BB card.

```
m5_events <- tibble(card = c("BB", "BW", "WW", "BB")) %>%
  separate(card, into = c('side1', 'side2'), sep = 1, remove = F) %>%
  gather(side, colour, -card)
m5_possibilities <- m5_events %>%
  filter(colour == 'B')
m5 <- m5_possibilities %>%
  summarise(mean(card == 'BB')) %>%
  pull()
m5
```

`[1] 0.8`

Putting the prior on the cards is equivalent to having the cards in proportion to their prior. The rest of the calculation is the same.

```
m6_events <- c("BB", "BW", "WW") %>% # cards
  rep(c(1, 2, 3)) %>% # prior: repeat each card the given number of times
  tibble(card = .) %>%
  separate(card, into = c('side1', 'side2'), sep = 1, remove = F) %>%
  gather(side, colour, -card)
m6_possibilities <- m6_events %>% # sides
  filter(colour == 'B')
m6 <- m6_possibilities %>% # sides consistent with observation
  summarise(mean(card == 'BB')) %>% # proportion of possible events that are BB
  pull()
m6
```

`[1] 0.5`

This last card-drawing exercise is slightly more involved since we can observe either side of the one card and either side of the other. Thus, we first generate the list of all possible pairs of cards, expand this into a list of all possible sides that could be observed for each card, filter out any event not consistent with our observations, then summarise whatever is left.

```
m7_card_pairs <- tibble(card = c("BB", "BW", "WW")) %>% # all the cards
  crossing(., other_card = .$card) %>%
  filter(card != other_card) # all card pairs (can't draw the same card twice)
m7_events <- m7_card_pairs %>%
  separate(card, into = c('side1', 'side2'), sep = 1, remove = F) %>%
  separate(other_card, into = c('other_side1', 'other_side2'), sep = 1, remove = F) %>%
  gather(side, colour, side1, side2) %>% # all the sides for the card of interest
  gather(other_side, other_colour, other_side1, other_side2) # all sides of the other card
m7_possibilities <- m7_events %>%
  filter(
    colour == 'B', # we observe that the card of interest has a black side
    other_colour == 'W' # we observe that the other card has a white side
  )
m7 <- m7_possibilities %>%
  summarise(mean(card == 'BB')) %>% # which fraction of possible events is a double black?
  pull()
m7
```

`[1] 0.75`

Fitting a full Bayesian model can be slow, especially with a large dataset. For example, it’d be great to analyse the climate crisis questions in the European Social Survey (ESS), which typically has around 45,000 respondents from around Europe on a range of socio-political questions. There are two main ways of parallelising your Bayesian model in Stan: between-chain parallelisation and within-chain parallelisation. The first of these is very easy to implement (`chains = 4`, `cores = 4`) - it simply runs the algorithm once on each core and pools the posterior samples at the end. The second method is more complicated as it requires a non-trivial modification to the Stan model, but can bring with it large speedups if you have the cores available. In this post we’ll get a >5x speedup of ordinal regression using within-chain parallelisation.

I’ll assume you are somewhat familiar with McElreath’s introduction with cmdstan, with Ignacio’s introduction with rstan, and/or with the Stan user guide. We’ll implement a mapped version of ordinal regression with one (factor) covariate using similar ideas. The main difference is that we’ll have a shard set up for each distinct level of the factor, and each shard will receive a different number of datapoints. This is my first attempt at making sense of this, so use at your own risk.

**Important note**: there is a bug in the `ordered_logistic_lpmf` function in Stan 2.19.2, the version I currently have installed. Until the fixed version in Stan 2.20, I went for the easy fix.

Suppose you have a large dataset and/or a log-likelihood function that is expensive to evaluate. Then you can break your dataset down into chunks (called `shards`), calculate the log-likelihood on each shard in parallel, then sum up the log-likelihoods of the shards at the end.
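The idea can be illustrated outside of Stan with a toy base-R example (Bernoulli data with a known parameter, purely for illustration): the shard log-likelihoods sum to the full-data log-likelihood.

```
# map-reduce idea behind sharding, in base R
y <- rbinom(1000, 1, 0.3) # toy data
shards <- split(y, rep(1:4, length.out = length(y))) # 4 shards
ll_shard <- vapply(shards, function(s) sum(dbinom(s, 1, 0.3, log = TRUE)), numeric(1))
all.equal(sum(ll_shard), sum(dbinom(y, 1, 0.3, log = TRUE))) # identical up to rounding
```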

There seem to be two types of within-chain parallelisation: threading and the Message Passing Interface (MPI). MPI requires some extra setup and is typically used if you want to implement within-chain parallelisation across multiple computers. We’ll stick with the simpler threading method.

A `thread` is (confusingly) sometimes called a `core`. The number of threads you have determines how many shards you can calculate at the same time. You can see how many threads you have available with `nproc --all`.

`nproc --all`

`4`

So I can run 4 threads at the same time. For a more detailed breakdown use `lscpu`, where the number of threads is given by `CPU(s)` and is equal to `Thread(s) per core` * `Core(s) per socket` * `Socket(s)`. For me this is 4 = 1 * 4 * 1.

`lscpu`

```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i5-7600K CPU @ 3.80GHz
Stepping: 9
CPU MHz: 3993.031
CPU max MHz: 4200,0000
CPU min MHz: 800,0000
BogoMIPS: 7584.00
Virtualisation: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
```

Before compiling a model with threading, we have to tell Stan to compile with threading support. For me, this worked by adding `-DSTAN_THREADS -pthread` to my Makevars file. Check out the recommendations in the docs for more information on this.

```
Sys.getenv("HOME") %>%
  file.path(".R", "Makevars") %>%
  print() %>%
  read_file() %>%
  writeLines()
```

```
[1] "/home/brian/.R/Makevars"
CXX14FLAGS = -O3 -march=native -mtune=native
CXX14FLAGS += -fPIC
CXX14FLAGS += -DSTAN_THREADS
CXX14FLAGS += -pthread
```

Before fitting a model with threading, we’ll have to tell Stan how many threads are available via the environment variable `STAN_NUM_THREADS`. We’ll set it now just to be sure.

`Sys.setenv(STAN_NUM_THREADS = 4)`

Now we’re all set up for threading.

Let’s generate observations from the prior predictive distribution. Skip this section if you’re just interested in the parallelisation. We’ll use a similar model to the one described in Michael Betancourt’s case study. The main difference is that we’ll use contrast factors for our latent effect.

```
m_sim <- "models/ordinal_regression_sim_betancourt.stan" %>%
  here() %>%
  stan_model()
```

We’ll generate 20,000 observations, where the only covariate is called `factr`, of which we have around 50 unique values. Notice that we will end up with a different number of observations for each level of `factr`.

```
set.seed(12096)
N <- 20000 # number of observations
K <- 5 # number of ordinal outcomes
L <- 50 # number of unique levels in our factor
# the covariates
df_sim <- 1:L %>%
  sample(size = N, replace = TRUE) %>%
  tibble(factr = .) %>%
  mutate(factr = factr %>% as_factor() %>% fct_reorder(factr)) %>%
  arrange(factr)
# in list-format for stan
data_sim <- list(
  N = N,
  K = K,
  L = L,
  factr = model.matrix(~ 1 + factr, df_sim)[, 2:L], # contrast encoding
  # hyperparameters
  factr_mu = 0,
  factr_sd = 1,
  alpha = c(2, 4, 8, 4, 2)
)
```

Now we simply draw once from the prior predictive distribution, then extract the parameters and outcome.

```
# draw from the prior predictive distribution
fit_sim <- m_sim %>%
  sampling(
    algorithm = 'Fixed_param',
    data = data_sim,
    iter = 1,
    chains = 1,
    seed = 43484
  )
# extract the parameters and observations
cutpoints <- fit_sim %>%
  spread_draws(c[i]) %>%
  pull(c)
effects <- fit_sim %>%
  spread_draws(beta[i]) %>%
  pull(beta)
y <- fit_sim %>%
  spread_draws(y[i]) %>%
  pull(y)
# put covariates and outcome in the one dataset
df <- df_sim %>%
  mutate(
    y = y,
    factr = factr %>% as.integer()
  ) %>%
  arrange(factr)
# as a list for stan
data <- data_sim %>%
  list_modify(y = df$y)
```

Let’s check that Betancourt’s model `m` passes some standard diagnostic tests on our data, and time how long it takes to fit.

```
m <- "models/ordinal_regression_betancourt.stan" %>%
  here() %>%
  stan_model()
```

```
start <- Sys.time()
fit <- m %>%
  sampling(
    data = data,
    chains = 1,
    warmup = 500,
    iter = 2000,
    seed = 14031
  )
end <- Sys.time()
duration <- end - start
```

The fitting took 5.7 minutes.

The HMC diagnostics look good.

`rstan::check_hmc_diagnostics(fit)`

```
Divergences:
0 of 1500 iterations ended with a divergence.
Tree depth:
0 of 1500 iterations saturated the maximum tree depth of 10.
Energy:
E-BFMI indicated no pathological behavior.
```

The rhat is smaller than 1.05 and the bulk/tail ESS are over 100, which is good.

variable | rhat_min | ess_bulk_min | ess_tail_min | rhat_max | ess_bulk_max | ess_tail_max |
---|---|---|---|---|---|---|
beta | 1.00 | 165.98 | 306.02 | 1.01 | 254.76 | 654.57 |
c | 1.01 | 102.99 | 179.38 | 1.01 | 107.56 | 214.77 |

The cutpoints have been estimated slightly too low, but within reasonable bounds.

Around 93.9% of the 90% intervals for β contained the true values. This is not bad considering that one error carries the weight of over 2 percentage points.
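For reference, a coverage number like this can be computed along the following lines. This is a base-R sketch with made-up draws: `draws` stands in for the posterior draws of β (one column per level) and `effects` for the true simulated values.

```
set.seed(7)
effects <- rnorm(50) # stand-in true level effects
draws <- sapply(effects, function(b) rnorm(1500, b, 0.1)) # stand-in posterior draws
lo <- apply(draws, 2, quantile, probs = 0.05)
hi <- apply(draws, 2, quantile, probs = 0.95)
mean(lo <= effects & effects <= hi) # fraction of 90% intervals covering the truth
```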

Now for the mapped version.

```
m_mapped <- "models/ordinal_regression_mapped.stan" %>%
  here() %>%
  stan_model()
```

We’ll set up a shard for every level of our factor. The function `lp` for calculating the log-posterior on one shard looks like this. The first entry in our integer array `xi` is the number of observations `M` for this level/shard. The data we need is then contained in the next `M` entries of `xi`. The cutpoints are the only global parameters we’ll use here. The estimated effect for this level is given by `beta`, the only local parameter for this shard. The log-likelihood `ll` is then calculated as usual.

```
functions {
vector lp(vector global, vector local, real[] xr, int[] xi) {
int M = xi[1];
int y[M] = xi[2:M+1];
vector[4] c = global[1:4];
real beta = local[1];
real ll = ordered_logistic_lpmf(y | rep_vector(beta, M), c);
return [ll]';
}
}
```

The shards are set up in the transformed data section. Since we have a shard for every level, we simply index the shards using the levels. This makes it very easy to keep track of which shard gets the next datapoint. The first entry of each shard is reserved for the number of datapoints used in that shard. To keep track of where to put the next datapoint within a shard, we set up the array `j`. This starts at 2 because position 1 is reserved for the number of datapoints in the shard. Every time we add a datapoint to a shard, we increment that shard’s entry in `j` so that the next datapoint lands in the correct place.

```
transformed data {
  int<lower = 0, upper = N> counts[L] = count(factr, L);
  int<lower = 1> M = max(counts) + 1;
  int xi[L, max(counts) + 1];
  real xr[L, max(counts) + 1];
  int<lower = 1> j[L] = rep_array(2, L);
  xi[, 1] = counts;
  for (i in 1:N) {
    int shard = factr[i];
    xi[shard, j[shard]] = y[i];
    j[shard] += 1;
  }
}
```

I really like this way of creating shards because it doesn’t become such a mess of indices.
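To convince myself the packing logic does what I think it does, here is a base-R emulation of the transformed data block on a toy dataset (all values made up):

```
factr <- c(1, 2, 1, 3, 2, 1) # toy factor levels
y <- c(5L, 4L, 3L, 2L, 1L, 5L) # toy outcomes
L <- 3
counts <- tabulate(factr, nbins = L)
xi <- matrix(0L, nrow = L, ncol = max(counts) + 1)
xi[, 1] <- counts # first entry: number of datapoints in the shard
j <- rep(2L, L) # next free slot per shard
for (i in seq_along(y)) {
  s <- factr[i]
  xi[s, j[s]] <- y[i]
  j[s] <- j[s] + 1L
}
xi # row 1 is c(3, 5, 3, 5): the count 3, then the three outcomes for level 1
```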

Now let’s time it.

```
start_mapped <- Sys.time()
fit_mapped <- m_mapped %>%
  sampling(
    data = data %>% list_modify(factr = df$factr),
    chains = 1,
    warmup = 500,
    iter = 2000,
    seed = 98176
  )
end_mapped <- Sys.time()
duration_mapped <- end_mapped - start_mapped
```

The fitting took about 65 seconds. This is a 5.3-fold speedup!

The HMC diagnostics look good.

`rstan::check_hmc_diagnostics(fit_mapped)`

```
Divergences:
0 of 1500 iterations ended with a divergence.
Tree depth:
0 of 1500 iterations saturated the maximum tree depth of 10.
Energy:
E-BFMI indicated no pathological behavior.
```

The rhat and ESS values are still good.

variable | rhat_min | ess_bulk_min | ess_tail_min | rhat_max | ess_bulk_max | ess_tail_max |
---|---|---|---|---|---|---|
beta | 0.999 | 439.148 | 587.322 | 1.005 | 634.109 | 979.157 |
c | 1.001 | 286.037 | 422.349 | 1.003 | 293.793 | 481.067 |

The posteriors of the cutpoints are much the same as before.

The level-effects are as well-calibrated as before, with around 93.9% of the 90% intervals for β containing the true values.

We can measure the similarity of the estimates in two ways:

- the absolute difference in the point estimates; and
- the ratio between the length of the overlap of the two 90% intervals and the length of the shortest of the two 90% intervals.

In each case the estimates look roughly the same, especially with respect to the ratio metric. The differences are a bit larger than I would have expected, but I’m not so sure on what scale a ‘good’ difference would be.

metric | value |
---|---|
change_max | 0.0153 |
change_mean | 0.0102 |
change_median | 0.0107 |
change_min | 0.0032 |
ratio_max | 1.0000 |
ratio_mean | 0.9897 |
ratio_median | 0.9960 |
ratio_min | 0.9619 |
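Both metrics are easy to compute for a single pair of intervals; here is a base-R sketch with made-up point estimates and 90% intervals:

```
est_a <- 0.52; est_b <- 0.53 # made-up point estimates
int_a <- c(0.40, 0.65); int_b <- c(0.42, 0.66) # made-up 90% intervals
change <- abs(est_a - est_b) # absolute difference in point estimates
overlap <- max(0, min(int_a[2], int_b[2]) - max(int_a[1], int_b[1]))
ratio <- overlap / min(diff(int_a), diff(int_b)) # overlap / shortest interval
c(change = change, ratio = ratio)
```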

I’m fairly happy with the speedup seen here. Actually, I’m mostly happy I got it working at all. It’s entirely possible that creating 50 shards with only 4 threads to run them on isn’t the most efficient way to use threading, but I’ll keep doing it like this until there’s a more convenient way to do it. Higher up in my priorities right now are:

- adding more covariates, especially factors; and
- putting a hierarchical prior on the factor; e.g. for use in MRP.

The bulk ESS values are a bit on the low side, so there could be a better way to parameterise the model.

In a previous post, we described how a model of customer lifetime value (CLV) works, implemented it in Stan, and fit the model to simulated data. In this post, we’ll extend the model to use hierarchical priors in two different ways: centred and non-centred parameterisations. I’m not aware of any other HMC-based implementations of this hierarchical CLV model, so we’ll run some basic tests to check it’s doing the right thing. More specifically, we’ll fit it to a dataset drawn from the prior predictive distribution. The resulting fits pass the main diagnostic tests and the 90% posterior intervals capture about 91% of the true parameter values.

A word of warning: I came across a number of examples where the model showed severe E-BFMI problems. This seems to happen when the parameters \(\mu\) (inverse expected lifetime) and \(\lambda\) (expected purchase rate) are fairly similar, but I haven’t pinned down a solid reason for these energy problems yet. I suspect this has something to do with the difficulty in distinguishing a short lifetime from a low purchase rate. This is a topic we’ll leave for a future post.

We’ll work with raw stan, which can be a bit fiddly sometimes. To keep this post within the limits of readability, I’ll define some custom functions to simplify the process. Check out the full code to see the details of these functions.

\(\DeclareMathOperator{\dbinomial}{Binomial}\DeclareMathOperator{\dbernoulli}{Bernoulli}\DeclareMathOperator{\dpoisson}{Poisson}\DeclareMathOperator{\dnormal}{Normal}\DeclareMathOperator{\dt}{t}\DeclareMathOperator{\dcauchy}{Cauchy}\DeclareMathOperator{\dexponential}{Exp}\DeclareMathOperator{\duniform}{Uniform}\DeclareMathOperator{\dgamma}{Gamma}\DeclareMathOperator{\dinvgamma}{InvGamma}\DeclareMathOperator{\invlogit}{InvLogit}\DeclareMathOperator{\logit}{Logit}\DeclareMathOperator{\ddirichlet}{Dirichlet}\DeclareMathOperator{\dbeta}{Beta}\)

Let’s recap on the story from last time. We have a 2-year old company that has grown linearly over that time to gain a total of 1000 customers.

```
set.seed(65130) # https://www.random.org/integers/?num=2&min=1&max=100000&col=5&base=10&format=html&rnd=new
customers <- tibble(id = 1:1000) %>%
  mutate(
    end = 2 * 365,
    start = runif(n(), 0, end - 1),
    T = end - start
  )
```

Within a customer’s lifetime \(\tau\), they will purchase at Poisson rate \(\lambda\). We can simulate the time \(t\) till the last observed purchase and the number of purchases \(k\) with `sample_conditional`.
```
sample_conditional <- function(T, tau, lambda) {
  # start with 0 purchases
  t <- 0
  k <- 0
  # simulate time till next purchase
  wait <- rexp(1, lambda)
  # keep purchasing till end of life/observation time
  while (t + wait <= pmin(T, tau)) {
    t <- t + wait
    k <- k + 1
    wait <- rexp(1, lambda)
  }
  # return tabular data
  tibble(
    t = t,
    k = k
  )
}
s <- sample_conditional(300, 200, 1)
```

t | k |
---|---|
198.2929 | 169 |

In the above example, even though the observation time is \(T = 300\), the time \(t\) till last purchase will always be below the lifetime \(\tau = 200\). With a purchase rate of 1 per unit time, we expect around \(k = 200\) purchases.

We’ll use the same likelihood as before, which says that the probability of customer \(i\)’s data given their parameters is

\[ \begin{align} \mathbb P(k, t, T \mid \mu_i, \lambda_i) &= \frac{\lambda_i^k}{\lambda_i + \mu_i} \left( \mu_i e^{-t(\lambda_i + \mu_i)} + \lambda_i e^{-T(\lambda_i + \mu_i)} \right) \\ &\propto p \dpoisson(k \mid t\lambda_i)S(t \mid \mu_i) \\ &\hphantom{\propto} + (1 - p) \dpoisson(k \mid t\lambda_i)\dpoisson(0 \mid (T-t)\lambda_i)S(T \mid \mu_i) , \\ p &:= \frac{\mu_i}{\lambda_i + \mu_i} , \end{align} \]

where \(S\) is the exponential survival function.

To turn this into a Bayesian model, we’ll need priors for the parameters. Last time, we put simple gamma priors on the parameters \(\mu_i\) and \(\lambda_i\); for example, we could choose \(\lambda_i \sim \dgamma(2, 28)\) if we were to use the simple model from the previous post (and similarly for \(\mu_i\)). This time we’re going hierarchical, and there are various ways to do that. Let’s look at two of them.

One method arises directly from the difficulty of specifying the gamma-prior parameters: simply turn those parameters into random variables to be estimated simultaneously along with \(\mu_i\) and \(\lambda_i\). For example, we say \(\lambda_i \sim \dgamma(\alpha, \beta)\), where \(\alpha \sim \dgamma(\alpha_\alpha, \beta_\alpha)\) and \(\beta \sim \dgamma(\alpha_\beta, \beta_\beta)\) (and similarly for \(\mu_i\)). We eventually want to incorporate covariates, which is difficult with this parameterisation, so let’s move on to a different idea.

Another solution is to use log-normal priors. This means setting

\[ \begin{align} \lambda_i &:= \exp(\alpha_i) \\ \alpha_i &\sim \dnormal(\beta, \sigma) \\ \beta &\sim \dnormal(m, s_m) \\ \sigma &\sim \dnormal_+(0, s) , \end{align} \]

where \(m\), \(s_m\), and \(s\) are constants specified by the user (and similarly for \(\mu_i\)). This implies

- there is an overall mean value \(e^\beta\),
- the customer-level effects \(\alpha_i\) are deviations from the overall mean, and
- the extent of these deviations is controlled by the magnitude of \(\sigma\).

With \(\sigma \approx 0\), there can be very little deviation from the mean, so most customers would be the same. On the other hand, large values of \(\sigma\) allow for customers to be (almost) completely unrelated to each other. This means that \(\sigma\) is helping us to regularise the model.
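As a purely illustrative sketch of this regularising role (the numbers here are made up, not taken from the model above), here is what the implied purchase rates \(\lambda_i = \exp(\alpha_i)\) look like under a tiny and a large \(\sigma\), with a hypothetical population mean of one purchase per fortnight:

```r
set.seed(123)
beta <- log(1 / 14) # hypothetical population-level log purchase rate

# customer-level rates under a tiny and a large sigma
lambda_small_sigma <- exp(rnorm(1000, beta, 0.05)) # customers nearly identical
lambda_large_sigma <- exp(rnorm(1000, beta, 2))    # customers vary over orders of magnitude

round(range(lambda_small_sigma), 4)
round(range(lambda_large_sigma), 4)
```

With \(\sigma = 0.05\) the rates barely move from the population mean, while \(\sigma = 2\) allows essentially unrelated customers.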

The above parameterisation is called “centred”, which basically means the prior for \(\alpha_i\) is expressed in terms of other parameters (\(\beta\), \(\sigma\)). This can be rewritten as a “non-centred” parameterisation as

\[ \begin{align} \lambda_i &:= \exp(\beta + \sigma \alpha_i) \\ \alpha_i &\sim \dnormal(0, 1) \\ \beta &\sim \dnormal(m, s_m) \\ \sigma &\sim \dnormal_+(0, s) . \end{align} \]

Notice the priors now contain no references to any other parameters. This is equivalent to the centred parameterisation because \(\beta + \sigma \alpha_i \sim \dnormal(\beta, \sigma)\). The non-centred parameterisation is interesting because it is known to increase the sampling efficiency of HMC-based samplers (such as Stan’s) in some cases.
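A quick simulation makes the equivalence plain (using arbitrary illustrative values for \(\beta\) and \(\sigma\)): draws of \(\beta + \sigma \alpha_i\) with \(\alpha_i \sim \dnormal(0, 1)\) should be indistinguishable from draws of \(\dnormal(\beta, \sigma)\).

```r
set.seed(4321)
n <- 1e5
beta <- 3    # arbitrary illustrative values
sigma <- 0.5

draws_centred <- rnorm(n, beta, sigma)      # alpha_i ~ Normal(beta, sigma)
draws_noncentred <- beta + sigma * rnorm(n) # beta + sigma * Normal(0, 1)

# matching quantiles indicate the same distribution
quantile(draws_centred, c(0.1, 0.5, 0.9))
quantile(draws_noncentred, c(0.1, 0.5, 0.9))
```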

Here is a centred Stan implementation of our log-normal hierarchical model.

```
centred <- here::here('models/rf_centred.stan') %>%
  stan_model()
```

Note that we have introduced the `prior_only` flag. When we specify `prior_only`, Stan will not consider the likelihood and will instead just draw from the priors. This allows us to make prior-predictive simulations. We’ll generate a dataset using the prior-predictive distribution, then fit our model to that dataset. The least we can expect from a model is that it fits well to data drawn from its own prior distribution.

To simulate datasets we’ll use hyperpriors that roughly correspond to the priors from the previous post. In particular, the expected lifetime is around 31 days, and the expected purchase rate around once per fortnight.

```
data_hyperpriors <- list(
  log_life_mean_mu = log(31),
  log_life_mean_sigma = 0.7,
  log_life_scale_sigma = 0.8,
  log_lambda_mean_mu = log(1 / 14),
  log_lambda_mean_sigma = 0.3,
  log_lambda_scale_sigma = 0.5
)
data_prior <- customers %>%
  mutate(t = 0, k = 0) %>%
  tidybayes::compose_data(data_hyperpriors, prior_only = 1)
data_prior %>% str()
```

```
List of 14
$ id : int [1:1000(1d)] 1 2 3 4 5 6 7 8 9 10 ...
$ end : num [1:1000(1d)] 730 730 730 730 730 730 730 730 730 730 ...
$ start : num [1:1000(1d)] 328 707 408 342 666 ...
$ T : num [1:1000(1d)] 401.6 22.6 322.2 388.5 64.5 ...
$ t : num [1:1000(1d)] 0 0 0 0 0 0 0 0 0 0 ...
$ k : num [1:1000(1d)] 0 0 0 0 0 0 0 0 0 0 ...
$ n : int 1000
$ log_life_mean_mu : num 3.43
$ log_life_mean_sigma : num 0.7
$ log_life_scale_sigma : num 0.8
$ log_lambda_mean_mu : num -2.64
$ log_lambda_mean_sigma : num 0.3
$ log_lambda_scale_sigma: num 0.5
$ prior_only : num 1
```

Let’s simulate 8 possible datasets from our priors. Notice how the centres and spreads of the datasets can vary.

```
centred_prior <- centred %>%
  fit( # a wrapper around rstan::sampling to allow caching
    file = here::here('models/rf_centred_prior.rds'), # cache
    data = data_prior,
    pars = c('customer'), # ignore this parameter
    include = FALSE,
    chains = 8,
    cores = 4,
    warmup = 1000, # not sure why this needs to be so high
    iter = 1001, # one more than warmup because we just want one dataset per chain
    seed = 3901 # for reproducibility
  )
centred_prior_draws <- centred_prior %>%
  get_draws( # rstan::extract but also with energy
    pars = c(
      'lp__', 'energy__',
      'theta',
      'log_centres',
      'scales'
    )
  ) %>%
  name_parameters() # add customer id, and idx = 1 (mu) or 2 (lambda)
```

Here are the exact hyperparameters used.

```
hyper <- centred_prior_draws %>%
  filter(str_detect(parameter, "^log_|scales")) %>%
  select(chain, parameter, value) %>%
  spread(parameter, value)
```

chain | log_centres[1] | log_centres[2] | scales[1] | scales[2] |
---|---|---|---|---|
1 | 3.643546 | -2.057017 | 1.0866138 | 1.0874820 |
2 | 4.586349 | -2.231530 | 1.2064439 | 0.2725037 |
3 | 2.673924 | -2.475513 | 0.8847336 | 0.9429385 |
4 | 3.490525 | -2.557564 | 0.7708666 | 1.0124164 |
5 | 3.422691 | -2.877842 | 1.3232360 | 0.2920695 |
6 | 4.205884 | -3.196397 | 1.9217956 | 0.5307348 |
7 | 3.381881 | -2.995128 | 1.0299251 | 0.7266123 |
8 | 3.046531 | -2.328561 | 1.3993460 | 0.4751674 |

We’ll add the prior predictive parameter draws from chain 1 to our customers dataset.

```
set.seed(33194)
df <- centred_prior_draws %>%
  filter(chain == 1) %>%
  filter(name == 'theta') %>%
  transmute(
    id = id %>% as.integer(),
    parameter = if_else(idx == '1', 'mu', 'lambda'),
    value
  ) %>%
  spread(parameter, value) %>%
  mutate(tau = rexp(n(), mu)) %>%
  inner_join(customers, by = 'id') %>%
  group_by(id) %>%
  group_map(~sample_conditional(.$T, .$tau, .$lambda) %>% bind_cols(.x))
data_df <- data_hyperpriors %>%
  tidybayes::compose_data(df, prior_only = 0)
```

id | t | k | lambda | mu | tau | end | start | T |
---|---|---|---|---|---|---|---|---|
1 | 44.453109 | 44 | 0.8508277 | 0.0383954 | 44.669086 | 730 | 328.4312 | 401.56876 |
2 | 21.237790 | 5 | 0.2202757 | 0.0091300 | 263.541504 | 730 | 707.3906 | 22.60937 |
3 | 0.000000 | 0 | 0.0285432 | 0.0583523 | 8.405014 | 730 | 407.8144 | 322.18558 |
4 | 61.946272 | 7 | 0.0970620 | 0.0424386 | 71.617849 | 730 | 341.5465 | 388.45349 |
5 | 8.273831 | 3 | 0.1732173 | 0.0608967 | 8.747799 | 730 | 665.5210 | 64.47898 |
6 | 107.182661 | 19 | 0.1224131 | 0.0159025 | 113.792604 | 730 | 388.0271 | 341.97287 |

Now we can fit the model to the prior-predictive data `df`.

```
centred_fit <- centred %>%
  fit( # like rstan::sampling but with file-caching as in brms
    file = here::here('models/rf_centred_fit.rds'), # cache
    data = data_df,
    chains = 4,
    cores = 4,
    warmup = 2000,
    iter = 3000,
    control = list(max_treedepth = 12),
    seed = 24207,
    pars = c('customer'),
    include = FALSE
  )
centred_fit %>%
  check_hmc_diagnostics()
```

```
Divergences:
0 of 4000 iterations ended with a divergence.
Tree depth:
0 of 4000 iterations saturated the maximum tree depth of 12.
Energy:
E-BFMI indicated no pathological behavior.
```

The HMC diagnostics pass. However, in some of the runs not shown here, there were pretty severe problems with the E-BFMI diagnostic (~0.01) and I’ve yet to figure out exactly which kinds of situations cause these energy problems. Let’s check out the pairwise posterior densities of energy with the hyperparameters.

The scale parameter for the expected lifetime (`scales[1]`) is correlated with energy, which is associated with the energy problems described above. I’m not sure how much of a problem this poses, so let’s check out some more diagnostics.

```
neff <- centred_fit %>%
  neff_ratio() %>%
  tibble(
    ratio = .,
    parameter = names(.)
  ) %>%
  filter(ratio < 0.5) %>%
  arrange(ratio) %>%
  head(20)
```

ratio | parameter |
---|---|
0.0937772 | scales[1] |
0.1195873 | lp__ |
0.2592819 | log_centres[1] |
0.2748369 | scales[2] |
0.3228056 | log_centres[2] |
0.4340080 | theta[574,2] |
0.4381685 | theta[169,1] |
0.4800795 | theta[38,1] |

Both the `lp__` and `scales[1]` parameters have low effective sample sizes. The rhat values seem fine, though.

```
centred_rhat <- centred_fit %>%
  rhat() %>%
  tibble(
    rhat = .,
    parameter = names(.)
  ) %>%
  summarise(
    min_rhat = min(rhat, na.rm = TRUE),
    max_rhat = max(rhat, na.rm = TRUE)
  )
```

min_rhat | max_rhat |
---|---|
0.9990247 | 1.005071 |

Let’s now compare the 90% posterior intervals with the true values. Ideally close to 90% of the 90% posterior intervals capture their true value.

```
centred_cis <- centred_draws %>% # posterior draws from centred_fit, extracted as before
  group_by(parameter) %>%
  summarise(
    lo = quantile(value, 0.05),
    point = quantile(value, 0.50),
    hi = quantile(value, 0.95)
  ) %>%
  filter(!str_detect(parameter, '__')) # exclude diagnostic parameters
```

The table below shows we managed to recover three of the four hyperparameters. The `scales[2]` parameter was estimated slightly too high.

```
calibration_hyper <- hyper %>%
  filter(chain == 1) %>%
  gather(parameter, value, -chain) %>%
  inner_join(centred_cis, by = 'parameter') %>%
  mutate(hit = lo <= value & value <= hi)
```

parameter | value | lo | point | hi | hit |
---|---|---|---|---|---|
log_centres[1] | 3.643546 | 3.6280113 | 3.733964 | 3.831287 | TRUE |
log_centres[2] | -2.057017 | -2.1786705 | -2.090098 | -2.008663 | TRUE |
scales[1] | 1.086614 | 0.9988373 | 1.119258 | 1.242780 | TRUE |
scales[2] | 1.087482 | 1.0938685 | 1.160415 | 1.232225 | FALSE |

We get fairly close to 90% of the customer-level parameters.

```
true_values <- df %>%
  select(id, mu, lambda) %>%
  gather(parameter, value, -id) %>%
  mutate(
    idx = if_else(parameter == 'mu', 1, 2),
    parameter = str_glue("theta[{id},{idx}]")
  )
centred_calibration <- centred_cis %>%
  inner_join(true_values, by = 'parameter') %>%
  ungroup() %>%
  summarise(mean(lo <= value & value <= hi)) %>%
  pull() %>%
  percent()
centred_calibration
```

`[1] "91.1%"`

This is slightly higher than the ideal value of 90%.

Here is a non-centred Stan implementation of our log-normal hierarchical model. The important differences are in the expression for \(\theta\) and in the prior for \(\text{customer}\).

```
noncentred <- here::here('models/rf_noncentred.stan') %>%
  stan_model()
```

Since the non-centred and centred models are equivalent, we can also consider `df` as a draw from the non-centred prior predictive distribution.

```
noncentred_fit <- noncentred %>%
  fit( # like rstan::sampling but with file-caching as in brms
    file = here::here('models/rf_noncentred_fit.rds'), # cache
    data = data_df,
    chains = 4,
    cores = 4,
    warmup = 2000,
    iter = 3000,
    control = list(max_treedepth = 12),
    seed = 1259,
    pars = c('customer'),
    include = FALSE
  )
noncentred_fit %>%
  check_hmc_diagnostics()
```

```
Divergences:
0 of 4000 iterations ended with a divergence.
Tree depth:
0 of 4000 iterations saturated the maximum tree depth of 12.
Energy:
E-BFMI indicated no pathological behavior.
```

Again, the HMC diagnostics indicate no problems. Let’s check the pairwise densities anyway.

The correlation between `scales[1]` and `energy__` is smaller with the non-centred parameterisation. This is reflected in the higher effective sample size for `scales[1]` below. Unfortunately, the effective sample sizes for the purchase-rate hyperparameters have gone down.

```
neff <- noncentred_fit %>%
  neff_ratio() %>%
  tibble(
    ratio = .,
    parameter = names(.)
  ) %>%
  filter(ratio < 0.5) %>%
  arrange(ratio) %>%
  head(20)
```

ratio | parameter |
---|---|
0.0631821 | log_centres[2] |
0.0681729 | scales[2] |
0.1579867 | lp__ |
0.1776186 | scales[1] |
0.2217369 | log_centres[1] |
0.3910369 | theta[830,1] |
0.4767405 | theta[639,1] |
0.4792693 | theta[250,2] |
0.4847856 | theta[41,1] |
0.4978518 | theta[231,1] |

Again, the rhat values seem fine.

```
noncentred_rhat <- noncentred_fit %>%
  rhat() %>%
  tibble(
    rhat = .,
    parameter = names(.)
  ) %>%
  summarise(
    min_rhat = min(rhat, na.rm = TRUE),
    max_rhat = max(rhat, na.rm = TRUE)
  )
```

min_rhat | max_rhat |
---|---|
0.9990501 | 1.013372 |

Let’s check how many of the 90% posterior intervals contain the true value.

```
noncentred_cis <- noncentred_draws %>% # posterior draws from noncentred_fit, extracted as before
  group_by(parameter) %>%
  summarise(
    lo = quantile(value, 0.05),
    point = quantile(value, 0.50),
    hi = quantile(value, 0.95)
  ) %>%
  filter(!str_detect(parameter, '__'))
```

The hyperparameter estimates are much the same as with the centred parameterisation.

```
noncentred_calibration_hyper <- hyper %>%
  filter(chain == 1) %>%
  gather(parameter, value, -chain) %>%
  inner_join(noncentred_cis, by = 'parameter') %>%
  mutate(hit = lo <= value & value <= hi)
```

parameter | value | lo | point | hi | hit |
---|---|---|---|---|---|
log_centres[1] | 3.643546 | 3.6325159 | 3.735363 | 3.838160 | TRUE |
log_centres[2] | -2.057017 | -2.1770703 | -2.092537 | -2.014462 | TRUE |
scales[1] | 1.086614 | 0.9970698 | 1.109297 | 1.235166 | TRUE |
scales[2] | 1.087482 | 1.0963503 | 1.159626 | 1.234190 | FALSE |

About 91% of customer-level posterior intervals contain the true value.

```
noncentred_calibration <- noncentred_cis %>%
  inner_join(true_values, by = 'parameter') %>%
  summarise(mean(lo <= value & value <= hi)) %>%
  pull() %>%
  percent()
noncentred_calibration
```

`[1] "91.0%"`

Both centred and non-centred models performed reasonably well on the dataset considered. The non-centred model showed slightly less correlation between `scales` and `energy__`, suggesting it might be the better candidate for tackling the low E-BFMI problems. Since we only checked the fit on one prior-predictive draw, it would be a good idea to check the fit on more draws. Some casual attempts of mine (not shown here) suggest there are situations that cause severe E-BFMI problems; identifying these would be an interesting next step. It would also be great to see how the model performs on some of the benchmark datasets mentioned in the BTYDPlus package.

Here’s my solution to exercise 9, chapter 1, of Gelman’s *Bayesian Data Analysis* (BDA), 3rd edition. There are solutions to some of the exercises on the book’s webpage.

Suppose there are 3 doctors, who open their practice at 09:00 and stop accepting patients at 16:00. If patients arrive at exponentially distributed intervals with mean 10 minutes, and appointment durations are uniformly distributed between 5 and 20 minutes, we want to know:

- how many patients arrive per day?
- how many patients have to wait for their appointment?
- how long do patients have to wait?
- when does the last patient leave the practice?

We do this by simulation. The `arrivals` function will simulate the arrival times of the patients, in minutes after 09:00. In principle, we should simulate draws from the exponential distribution until the sum of all draws is above 420, the number of minutes the practice accepts patients. However, I couldn’t find an efficient way to run this in R. Instead, we’ll draw so many variables that it is highly unlikely we have too few, then just filter out what we don’t need.

To calculate a suitably large number, note that the number of patients in one day is \(\dpois(\frac{1}{10} \cdot (16 - 9) \cdot 60)\)-distributed. The 99.99999th percentile of this distribution is `qpois(0.9999999, (16 - 9) * 60 / 10) =` 80. We’ll err on the safe side and use \(n = 100\).
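As an extra sanity check on that bound using base R only: with a Poisson mean of 42 arrivals per day, the chance of seeing more than 100 patients is negligible.

```r
# daily arrival count is Poisson with mean (16 - 9) * 60 / 10 = 42
p_overflow <- 1 - ppois(100, (16 - 9) * 60 / 10)
p_overflow # vanishingly small
```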

```
arrivals <- function(λ, t, n=100) {
  rexp(n, λ) %>%
    tibble(
      delay = .,
      time = cumsum(delay)
    ) %>%
    filter(time <= t) %>%
    pull(time)
}
λ <- 1 / 10
t <- (16 - 9) * 60
arrivals(λ, t)
```

```
[1] 4.854265 4.963889 6.651014 9.518990 17.415709 20.110178
[7] 28.852188 34.862538 35.970215 44.205468 48.342152 52.934693
[13] 83.072579 86.746318 117.586517 122.811176 133.662687 142.016603
[19] 170.935913 190.999325 202.511439 204.915770 205.191951 208.422873
[25] 219.437526 225.162971 233.122550 235.351649 253.558658 254.097711
[31] 255.639118 256.270049 277.905899 291.055504 291.737173 294.688419
[37] 300.949679 302.681417 329.751112 335.998940 355.506712 361.162687
[43] 381.436543 388.767558 393.072689 393.088807 395.570811 401.669343
[49] 401.911643 408.650710 418.291196
```

Given the patients that arrive in a day, we now need a function to simulate the appointments. Let’s assume the patients are seen in the order they arrive. As we cycle through the patients, we’ll keep track of

- `n_waited`, the number of patients who have had to wait for their appointment so far;
- `time_waiting`, the sum of all waiting times of the patients so far; and
- `doctors`, the next time at which each doctor is free to see another patient.

The `doctors` variable starts at `c(0, 0, 0)` because all three doctors are immediately available to see patients. The doctor with the smallest availability time is the next doctor to see a patient. The start of the appointment is either the doctor’s availability time or the arrival time of the patient, whichever is greater. The end of the appointment is \(\duniform(5, 20)\) minutes after the start of the appointment, and the doctor’s availability time is then set to the end of the appointment. Once all patients have been given an appointment, the closing time is the maximum of the doctors’ next availability times or the official closing time `(16 - 9) * 60`, whichever is greater.

```
process <- function(arrivals, t=0) {
  n_waited <- 0 # number of patients who have had to wait so far
  time_waiting <- 0 # total waiting time so far
  doctors <- c(0, 0, 0) # next time at which each doctor is free to see another patient
  for (i in seq_along(arrivals)) {
    wait <- pmax(min(doctors) - arrivals[i], 0) # waiting time of patient i
    time_waiting <- time_waiting + wait
    n_waited <- n_waited + (wait > 0)
    appointment_start <- max(c(min(doctors), arrivals[i]))
    appointment_end <- appointment_start + runif(1, 5, 20)
    doctors[which.min(doctors)] <- appointment_end
  }
  list(
    n_patients = length(arrivals),
    n_waited = n_waited,
    time_waiting = time_waiting,
    time_waiting_per_patient = time_waiting / length(arrivals),
    time_waiting_per_waiting_patient = time_waiting / n_waited,
    closing_time = pmax(max(doctors), t)
  )
}
arrivals(λ, t) %>%
  process(t)
```

```
$n_patients
[1] 39
$n_waited
[1] 0
$time_waiting
[1] 0
$time_waiting_per_patient
[1] 0
$time_waiting_per_waiting_patient
[1] NaN
$closing_time
[1] 426.9273
```

To simulate the above many times, we’ll use the `replicate` function. For convenience, we’ll turn the result into a `tibble`.

```
simulate <- function(iters, λ, t, n=100) {
  iters %>%
    replicate(process(arrivals(λ, t, n), t)) %>%
    t() %>%
    as_tibble() %>%
    mutate_all(unlist)
}
sims <- simulate(1000, λ, t)
```

n_patients | n_waited | time_waiting | time_waiting_per_patient | time_waiting_per_waiting_patient | closing_time |
---|---|---|---|---|---|
45 | 8 | 10.493347 | 0.2331855 | 1.311668 | 421.2221 |
44 | 6 | 13.226377 | 0.3005995 | 2.204396 | 435.4524 |
29 | 1 | 1.174172 | 0.0404887 | 1.174172 | 435.9763 |
31 | 1 | 9.413756 | 0.3036696 | 9.413756 | 440.1210 |
40 | 5 | 16.692607 | 0.4173152 | 3.338521 | 431.8950 |
37 | 2 | 11.329963 | 0.3062152 | 5.664981 | 420.0000 |

Finally, we can calculate the 50% intervals by applying the `quantile` function to each summary.

```
sims_summary <- sims %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  summarise(
    q25 = quantile(value, 0.25, na.rm=TRUE),
    q50 = quantile(value, 0.5, na.rm=TRUE),
    q75 = quantile(value, 0.75, na.rm=TRUE),
    simulations = n()
  )
```

variable | q25 | q50 | q75 | simulations |
---|---|---|---|---|
closing_time | 420.000000 | 424.9488049 | 430.7023807 | 1000 |
n_patients | 38.000000 | 42.0000000 | 46.2500000 | 1000 |
n_waited | 3.000000 | 5.0000000 | 9.0000000 | 1000 |
time_waiting | 8.266744 | 19.3983242 | 38.9367142 | 1000 |
time_waiting_per_patient | 0.212705 | 0.4685262 | 0.8622254 | 1000 |
time_waiting_per_waiting_patient | 2.660497 | 3.8266619 | 5.1309744 | 1000 |

Suppose you have a bunch of customers who make repeat purchases - some more frequently, some less. There are a few things you might like to know about these customers, such as

- which customers are still active (i.e. not yet churned) and likely to continue purchasing from you; and
- how many purchases can you expect from each customer?

Modelling this directly is more difficult than it might seem at first. A customer that regularly makes purchases every day might be considered at risk of churning if they haven’t purchased anything in the past week, whereas a customer that regularly purchases once per month would not be considered at risk of churning. That is, churn and frequency of purchasing are closely related. The difficulty is that we don’t observe the moment of churn of any customer and have to model it probabilistically.

There are a number of established models for estimating this, perhaps the most well-known being the SMC model (a.k.a. the Pareto/NBD model). There are already some implementations using maximum likelihood or Gibbs sampling. In this post, we’ll explain how the model works, make some prior predictive simulations, and fit a version implemented in Stan.

\(\DeclareMathOperator{\dbinomial}{Binomial} \DeclareMathOperator{\dbern}{Bernoulli} \DeclareMathOperator{\dpois}{Poisson} \DeclareMathOperator{\dnorm}{Normal} \DeclareMathOperator{\dt}{t} \DeclareMathOperator{\dcauchy}{Cauchy} \DeclareMathOperator{\dexp}{Exp} \DeclareMathOperator{\duniform}{Uniform} \DeclareMathOperator{\dgamma}{Gamma} \DeclareMathOperator{\dinvgamma}{InvGamma} \DeclareMathOperator{\invlogit}{InvLogit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\ddirichlet}{Dirichlet} \DeclareMathOperator{\dbeta}{Beta}\)

Let’s describe the model first by simulation. Suppose we have a company that is 2 years old, with a total of 1000 customers, \(C\), who have made at least one purchase from us. We’ll assume a linear rate of customer acquisition, so that the first purchase date is simply a uniform random variable over the 2 years of the company’s existence. These assumptions are just to keep the example concrete, and are not so important for understanding the model.

```
customers <- tibble(id = 1:1000) %>%
  mutate(
    end = 2 * 365,
    start = runif(n(), 0, end - 1),
    T = end - start
  )
```

The \(T\)-variable is the total observation time, counted from the date of first joining to the present day.

First the likelihood. Each customer \(c \in C\) is assumed to have a certain lifetime, \(\tau_c\), starting on their join-date. During their lifetime, they will purchase at a constant rate, \(\lambda_c\), so that they will make \(k \sim \dpois(t\lambda_c)\) purchases over a time-interval \(t\). Once their lifetime is over, they will stop purchasing. We only observe the customer for \(T_c\) units of time, and this observation time can be either larger or smaller than the lifetime, \(\tau_c\). Since we don’t observe \(\tau_c\) itself, we will assume it follows an exponential distribution, i.e. \(\tau_c \sim \dexp(\mu_c)\).

The following function generates possible observations given \(\mu\) and \(\lambda\).

```
sample_conditional <- function(mu, lambda, T) {
  # lifetime
  tau <- rexp(1, mu)
  # start with 0 purchases
  t <- 0
  k <- 0
  # simulate time till next purchase
  wait <- rexp(1, lambda)
  # keep purchasing till end of life/observation time
  while (t + wait <= pmin(T, tau)) {
    t <- t + wait
    k <- k + 1
    wait <- rexp(1, lambda)
  }
  # return tabular data
  tibble(
    mu = mu,
    lambda = lambda,
    T = T,
    tau = tau,
    k = k,
    t = t
  )
}
s <- sample_conditional(0.01, 1, 30)
s <- sample_conditional(0.01, 1, 30)
```

mu | lambda | T | tau | k | t |
---|---|---|---|---|---|
0.01 | 1 | 30 | 49.63373 | 39 | 29.21926 |

Given \(\mu\) and \(\lambda\), the CLV is calculated as follows. The remaining lifetime is the lifetime minus the age of the customer. So if the customer is estimated to have a lifetime of 1 year and has been a customer for 3 months already, then the remaining lifetime will be 9 months.

```
lifetime <- function(n, mu, age=0) {
  rexp(n, mu) %>%
    `-`(age) %>%
    pmax(0) # remaining lifetime always >= 0
}
```

The number of purchases in a given timeframe (within the customer’s lifetime) is simply a poisson random variable.

```
purchases <- function(n, lambda, time) {
  rpois(n, lambda * time)
}
```

To simulate the CLV, we just simulate a possible remaining lifetime, then simulate the number of purchases in that timeframe. Repeating many times gives us the distribution of the total number of purchases the customer is expected to make.

```
clv <- function(n, mu, lambda, age=0) {
  lifetime(n, mu, age) %>%
    purchases(n, lambda, .)
}
```

The probability of still being active can be estimated by the fraction of `lifetime` draws that are above 0. For example, for a customer with an expected lifetime of 10 (i.e. \(\mu = 0.1\)) and a current age of 10, the probability of still being active is

`mean(lifetime(100000, 0.1, 10) > 0)`

`[1] 0.36881`

which is roughly \(\exp(-0.1 \cdot 10) \approx 0.368\), the survival function of the exponential distribution evaluated at the customer’s age.

Now the priors. Typically, \(\mu\) and \(\lambda\) are given gamma priors, which we’ll use too. However, the expected mean lifetime \(\mathbb E (\tau) = \frac{1}{\mu}\) is easier to reason about than \(\mu\), so we’ll put an inverse gamma distribution on \(\frac{1}{\mu}\). The reciprocal of an inverse gamma distribution has a gamma distribution, so \(\mu\) will still end up with a gamma distribution.

The mean expected lifetime in our simulated example will be around 2 months (60 days), with a standard deviation of 30 days. The mean purchase rate will be once a fortnight, with a standard deviation of around 0.05.

```
set.seed(2017896)
etau_mean <- 60
etau_variance <- 30^2
etau_beta <- etau_mean^3 / etau_variance + etau_mean
etau_alpha <- etau_mean^2 / etau_variance + 2
lambda_mean <- 1 / 14
lambda_variance <- 0.05^2
lambda_beta <- lambda_mean / lambda_variance
lambda_alpha <- lambda_mean * lambda_beta
df <- customers %>%
  mutate(
    etau = rinvgamma(n(), etau_alpha, etau_beta), # rinvgamma, e.g. from the MCMCpack package
    mu = 1 / etau,
    lambda = rgamma(n(), lambda_alpha, lambda_beta)
  ) %>%
  group_by(id) %>%
  group_map(~sample_conditional(.$mu, .$lambda, .$T))
```

id | mu | lambda | T | tau | k | t |
---|---|---|---|---|---|---|
1 | 0.0241091 | 0.2108978 | 295.3119 | 32.2814622 | 6 | 29.46052 |
2 | 0.0122084 | 0.0135551 | 673.2100 | 11.5250690 | 0 | 0.00000 |
3 | 0.0032994 | 0.0789800 | 357.1805 | 4.7921238 | 0 | 0.00000 |
4 | 0.0227431 | 0.0980176 | 270.0511 | 141.4766791 | 10 | 125.60765 |
5 | 0.0270742 | 0.0429184 | 608.9049 | 5.7293256 | 0 | 0.00000 |
6 | 0.0208168 | 0.0661296 | 666.1305 | 0.9481004 | 0 | 0.00000 |
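We can double-check the moment-matching constants against the closed-form inverse-gamma moments, \(\mathbb E = \beta/(\alpha-1)\) and \(\text{sd} = \beta/\left((\alpha-1)\sqrt{\alpha-2}\right)\), recomputing the constants here so the snippet stands alone:

```r
etau_mean <- 60
etau_variance <- 30^2
etau_beta <- etau_mean^3 / etau_variance + etau_mean # 300
etau_alpha <- etau_mean^2 / etau_variance + 2        # 6

etau_beta / (etau_alpha - 1)                          # implied mean: 60
etau_beta / ((etau_alpha - 1) * sqrt(etau_alpha - 2)) # implied sd: 30
```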

The lifetimes are mostly under 3 months, but also allow some more extreme values up to around a year.

The purchase rates are mostly around once a fortnight, but there are also rates as high as 4 purchases per week and as low as one per quarter.

The likelihood is somewhat complicated, so we’ll derive a more concise expression for it. Knowing the lifetime simplifies the probabilities, so we’ll marginalise the likelihood over \(\tau\).

\[ \begin{align} \mathbb P (k, t \mid \mu, \lambda) &= \int_{\tau = t}^\infty \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) \, d\tau \\ &= \int_{\tau = t}^T \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) \, d\tau + \int_{\tau = T}^\infty \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) \, d\tau \\ &= \int_{\tau = t}^T \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (\tau-t)\lambda) \cdot \dexp(\tau \mid \mu) \, d\tau \\ &\hphantom{=} + \int_{\tau = T}^\infty \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (T-t)\lambda) \cdot \dexp(\tau \mid \mu) \, d\tau \end{align} \]

The right-hand side is straightforward. The Poisson probabilities can be pulled out of the integral since they are independent of \(\tau\), turning the remaining integral into the survival function of the exponential distribution.

\[ \begin{align} \text{RHS} &= \int_{\tau = T}^\infty \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (T - t)\lambda) \cdot\dexp(\tau \mid \mu) d\tau \\ &= \frac{(t\lambda)^k e^{-t\lambda}}{k!} e^{-(T-t)\lambda}\int_T^\infty \dexp(\tau \mid \mu) d\tau \\ &= \frac{(t\lambda)^k e^{-T\lambda}}{k!} e^{-T\mu} \\ &= \frac{(t\lambda)^k e^{-T(\lambda + \mu)}}{k!} \end{align} \]

The left-hand side is a little more involved.

\[ \begin{align} \text{LHS} &= \int_{\tau = t}^T \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (\tau-t)\lambda) \cdot \dexp(\tau \mid \mu) d\tau \\ &= \frac{(t\lambda)^k e^{-t\lambda} }{k!} \int_t^T e^{-(\tau - t)\lambda} \mu e^{-\tau\mu} d\tau \\ &= \frac{(t\lambda)^k e^{-t\lambda} }{k!} e^{t\lambda} \mu \int_t^T e^{-\tau(\lambda + \mu)} d\tau \\ &= \frac{(t\lambda)^k }{k!} \mu \left. \frac{ e^{-\tau(\lambda + \mu)}}{-(\lambda + \mu)} \right|_t^T \\ &= \frac{(t\lambda)^k }{k!} \mu \frac{ e^{-t(\lambda + \mu)} - e^{-T(\lambda + \mu)}}{\lambda + \mu} \end{align} \]

Adding both expressions gives our final expression for the likelihood

\[ \begin{align} \mathbb P (k, t \mid \mu, \lambda) &= \frac{(t\lambda)^k e^{-T(\lambda + \mu)}}{k!} + \frac{(t\lambda)^k }{k!} \mu \frac{ e^{-t(\lambda + \mu)} - e^{-T(\lambda + \mu)}}{\lambda + \mu} \\ &\propto \lambda^k e^{-T(\lambda + \mu)} + \lambda^k \mu \frac{ e^{-t(\lambda + \mu)} - e^{-T(\lambda + \mu)}}{\lambda + \mu} \\ &= \frac{\lambda^k}{\lambda + \mu} \left( \mu e^{-t(\lambda + \mu)} - \mu e^{-T(\lambda + \mu)} + \mu e^{-T(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right) \\ &= \frac{\lambda^k}{\lambda + \mu} \left( \mu e^{-t(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right) , \end{align} \]

where we dropped any factors independent of the parameters, \(\lambda, \mu\). This expression agrees with equation 2 in ML07.
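We can also check the algebra numerically with base R’s `integrate`, using arbitrary illustrative parameter values (and writing `T_obs` for the observation time \(T\)):

```r
mu <- 0.02; lambda <- 0.1; k <- 3; t <- 20; T_obs <- 100

# LHS: lifetime tau ends somewhere in (t, T_obs)
lhs <- integrate(
  function(tau) dpois(k, t * lambda) * dpois(0, (tau - t) * lambda) * dexp(tau, mu),
  lower = t, upper = T_obs
)$value
# RHS: customer still alive at T_obs
rhs <- dpois(k, t * lambda) * dpois(0, (T_obs - t) * lambda) * exp(-T_obs * mu)

# the derived closed form, keeping the 1/k! factor that is later dropped
closed_form <- (t * lambda)^k / factorial(k) *
  (exp(-T_obs * (lambda + mu)) +
     mu * (exp(-t * (lambda + mu)) - exp(-T_obs * (lambda + mu))) / (lambda + mu))

all.equal(lhs + rhs, closed_form)
```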

Another way to view this likelihood is as a mixture of censored observations, but where the mixture probability \(p(\mu, \lambda) := \frac{\mu}{\lambda + \mu}\) depends on the parameters. We can write this alternative interpretation as

\[ \begin{align} \mathbb P(k, t \mid \mu, \lambda) &\propto p \dpois(k \mid t\lambda)S(t \mid \mu) \\ &\hphantom{\propto}+ (1 - p) \dpois(k \mid t\lambda)\dpois(0 \mid (T-t)\lambda)S(T \mid \mu) , \end{align} \]

where \(S\) is the survival function of the exponential distribution. In other words, either we censor at \(t\) with probability \(p\), or we censor at \(T\) with probability \((1 - p)\). Note that

- either decreasing the expected lifetime (i.e. increasing \(\mu\)) or decreasing the purchase rate increases \(p\);
- if \(t \approx T\), then the two censored distributions are approximately equal; the larger \(\lambda\) is, the closer \(t\) must be to \(T\) for this to hold.
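To sanity-check the algebra, we can verify numerically (a quick sketch with arbitrarily chosen parameter values) that the mixture form agrees with the direct expression up to the constant factor \(t^k / k!\) that we dropped:

```r
# direct (unnormalised) likelihood from the derivation above
direct <- function(mu, lambda, k, t, T) {
  lambda^k / (lambda + mu) *
    (mu * exp(-t * (lambda + mu)) + lambda * exp(-T * (lambda + mu)))
}

# mixture form: censor at t with probability p, at T with probability 1 - p
mixture <- function(mu, lambda, k, t, T) {
  p <- mu / (lambda + mu)
  p * dpois(k, t * lambda) * exp(-t * mu) +
    (1 - p) * dpois(k, t * lambda) * dpois(0, (T - t) * lambda) * exp(-T * mu)
}

# the ratio is the dropped constant t^k / k!, independent of mu and lambda
mixture(0.3, 0.5, 3, 2, 5) / direct(0.3, 0.5, 3, 2, 5)  # 2^3 / 3! ≈ 1.333
```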

To implement this in stan, we’ll need the log-likelihood, which is given by

\[ \log\mathbb P (k, t \mid \mu, \lambda) = k \log\lambda - \log(\lambda + \mu) + \log\left(\mu e^{-t(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right) . \]

Let’s plot the likelihood to see how it changes as we vary \(k\), \(t\), and \(T\). We’ll use the following functions to do this.

```
# calculate the likelihood
likelihood <- function(mu, lambda, k, t, T) {
  log_likelihood <- k * log(lambda) - log(lambda + mu) +
    log(mu * exp(-t * (lambda + mu)) + lambda * exp(-T * (lambda + mu)))
  return(exp(log_likelihood))
}

# the grid to calculate values for
grid <- crossing(
  mu = seq(0.00001, 1, 0.01),
  lambda = seq(0.00001, 1, 0.01)
)

# plot it all
plot_likelihood <- function(grid, k, t, T) {
  grid %>%
    mutate(k = k, t = t, T = T) %>%
    mutate(likelihood = likelihood(mu, lambda, k, t, T)) %>%
    ggplot() +
    aes(mu, lambda, fill = likelihood) +
    geom_raster() +
    geom_contour(aes(z = likelihood), alpha = 0.7) +
    labs(
      x = 'μ',
      y = 'λ',
      title = str_glue("Likelihood for k = {k}, t = {t}, T = {T}"),
      subtitle = 'restricted to the unit interval',
      fill = 'Likelihood'
    )
}
```

If \(k = t = 0 \approx T\), then we have almost no information to inform our estimates (we would rely strongly on our priors in this case). We see that both large and small lifetimes are equally possible, and the parameter estimates are approximately independent of one another.

```
grid %>%
  plot_likelihood(k = 0, t = 0, T = 0.1)
```

Adding some observation time changes it up a little. We can increase the purchase rate without changing the likelihood if we also decrease the lifetime (= increase \(\mu\)). This trade-off is almost linear. There are almost always many customers that haven’t made a second purchase yet, so this case is likely important to deal with well.

```
grid %>%
  plot_likelihood(k = 0, t = 0, T = 12)
```

If, on the other hand, we do observe some purchases in this period, the likelihood quickly concentrates around the average purchase rate. Likewise, the expected lifetime concentrates around the larger values.

```
grid %>%
  plot_likelihood(k = 3, t = 12, T = 12)
```

Once a substantial length of time elapses without any more purchases, we see the maximum likelihood estimate for \(\mu\) move away from small values. This makes sense, since we would otherwise have observed more recent purchases. The estimate for \(\mu\) doesn’t increase too much, though, since we know the lifetime is at least 12.

```
grid %>%
  plot_likelihood(k = 3, t = 12, T = 100000)
```

Let’s take a look at our Stan implementation. Note that Stan works with the log-likelihood, which we can increment via the `target` variable. We have also used the `log_sum_exp` function for numerical stability, where \(\text{log_sum_exp}(x, y) := \log(e^x + e^y)\).
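To see why this matters, here is a direct R translation of the `log_sum_exp` trick (a sketch, not Stan’s internal implementation): factoring out the maximum keeps the exponentials from overflowing.

```r
log_sum_exp <- function(x, y) {
  m <- max(x, y)
  m + log(exp(x - m) + exp(y - m))
}

log(exp(1000) + exp(1000))  # naive version overflows to Inf
log_sum_exp(1000, 1000)     # stable: 1000 + log(2)
```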

```
pnb <- here('models/pnbd.stan') %>%
  stan_model()
```

```
S4 class stanmodel 'pnbd' coded as follows:
data {
  int<lower = 1> n;            // number of customers
  vector<lower = 0>[n] t;      // time to most recent purchase
  vector<lower = 0>[n] T;      // total observation time
  vector<lower = 0>[n] k;      // number of purchases observed
  // user-specified parameters
  real<lower = 0> etau_alpha;
  real<lower = 0> etau_beta;
  real<lower = 0> lambda_alpha;
  real<lower = 0> lambda_beta;
}
parameters {
  vector<lower = 0>[n] lambda; // purchase rate
  vector<lower = 0>[n] etau;   // expected mean lifetime
}
transformed parameters {
  vector<lower = 0>[n] mu = 1.0 ./ etau;
}
model {
  // priors
  etau ~ inv_gamma(etau_alpha, etau_beta);
  lambda ~ gamma(lambda_alpha, lambda_beta);
  // likelihood
  target += k .* log(lambda) - log(lambda + mu);
  for (i in 1:n) {
    target += log_sum_exp(
      log(lambda[i]) - (lambda[i] + mu[i]) .* T[i],
      log(mu[i]) - (lambda[i] + mu[i]) .* t[i]
    );
  }
}
```

Let’s fit the model to our simulated data, using the correct priors.

```
pnb_fit <- rstan::sampling(
  pnb,
  data = compose_data(
    df,
    etau_alpha = etau_alpha,
    etau_beta = etau_beta,
    lambda_alpha = lambda_alpha,
    lambda_beta = lambda_beta
  ),
  control = list(max_treedepth = 15),
  chains = 4,
  cores = 4,
  warmup = 1000,
  iter = 3000
)
)
```

Using the default `max_treedepth` of 10 showed problems with the energy diagnostic, with the `etau` parameters seemingly the most problematic. However, increasing it to 15 resolved these issues.

```
pnb_fit %>%
  check_hmc_diagnostics()
```

```
Divergences:
0 of 8000 iterations ended with a divergence.
Tree depth:
0 of 8000 iterations saturated the maximum tree depth of 15.
Energy:
E-BFMI indicated no pathological behavior.
```

There are also no problems with the effective sample sizes, although the `etau` parameters typically have the lowest.

```
pnb_neff <- pnb_fit %>%
  neff_ratio() %>%
  tibble(
    ratio = .,
    parameter = names(.)
  ) %>%
  arrange(ratio) %>%
  head(5)
```

ratio | parameter |
---|---|
0.3877002 | lp__ |
0.4838375 | etau[716] |
0.5157888 | etau[442] |
0.5722499 | etau[367] |
0.5803245 | etau[443] |

The rhat statistic also looks good.

```
pnb_rhat <- pnb_fit %>%
  rhat() %>%
  tibble(
    rhat = .,
    parameter = names(.)
  ) %>%
  summarise(min(rhat), max(rhat))
```

min(rhat) | max(rhat) |
---|---|
0.9995222 | 1.001541 |

Around 50% of our 50% posterior intervals contain the true value, which is a good sign.

```
calibration <- pnb_fit %>%
  spread_draws(mu[id], lambda[id]) %>%
  mean_qi(.width = 0.5) %>%
  inner_join(df, by = 'id') %>%
  summarise(
    mu = mean(mu.lower <= mu.y & mu.y <= mu.upper),
    lambda = mean(lambda.lower <= lambda.y & lambda.y <= lambda.upper)
  ) %>%
  gather(parameter, fraction)
```

parameter | fraction |
---|---|
mu | 0.495 |
lambda | 0.498 |

We described the data generating process behind the Pareto-NBD model, implemented a model in Stan using our derivation of the likelihood, and fit the model to simulated data. The diagnostics didn’t indicate any convergence problems, and around 50% of the 50% posterior intervals contained the true parameter values. However, we used our knowledge of the prior distribution to fit the model. It would be better to use a hierarchical prior to relax this requirement.

As a next step, it would be interesting to extend the model to

- estimate spend per purchase;
- use hierarchical priors on \(\mu\) and \(\lambda\);
- allow correlation between \(\mu\) and \(\lambda\); and
- allow covariates, such as cohorts.

Here’s my solution to exercise 3, chapter 1, of Gelman’s *Bayesian Data Analysis* (BDA), 3rd edition. There are solutions to some of the exercises on the book’s webpage.

Suppose a particular gene for eye colour has two alleles: a dominant X and a recessive x allele. Having \(xx\) gives you blue eyes, otherwise you have brown eyes. Suppose also that the proportion of blue-eyed people is \(p^2\), and the proportion of heterozygotes is \(2p(1 - p)\). There are 3 questions to answer:
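These proportions are the standard Hardy–Weinberg genotype frequencies; for the \(p = 0.2\) we’ll use in the simulation below, they evaluate to:

```r
p <- 0.2
genotypes <- c(
  blue_xx = p^2,
  heterozygote_Xx = 2 * p * (1 - p),
  homozygote_XX = (1 - p)^2
)

genotypes       # 0.04, 0.32, 0.64
sum(genotypes)  # the three genotypes partition the population: 1
```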

- What is the probability of a brown-eyed child of brown-eyed parents being a heterozygote?
- If such a heterozygote, Judy, has \(n\) brown-eyed children with a random heterozygote, what’s the probability that Judy is a heterozygote?
- Under the conditions of part 2, what is the probability that Judy’s first grandchild has blue eyes?

Let’s first set up some data with which we can verify the results via simulation.

We’ll simulate a large population of individuals where the probability of the recessive allele is 0.2.

```
set.seed(11146)

N <- 5000000
p <- 0.2
alleles <- c('x', 'X')
weights <- c(p, 1 - p)

df <- tibble(id = 1:N %>% as.character()) %>%
  mutate(
    allele1 = sample(alleles, N, prob = weights, replace = TRUE),
    allele2 = sample(alleles, N, prob = weights, replace = TRUE),
    genotype = if_else(allele1 == allele2, 'homozygote', 'heterozygote'),
    eye_colour = if_else(allele1 == 'x' & allele2 == 'x', 'blue', 'brown')
  )
```

id | allele1 | allele2 | genotype | eye_colour |
---|---|---|---|---|
1 | X | X | homozygote | brown |
2 | x | X | heterozygote | brown |
3 | X | x | heterozygote | brown |
4 | X | X | homozygote | brown |
5 | x | X | heterozygote | brown |
6 | X | X | homozygote | brown |

The alleles have the correct distribution: since `frac` is computed within `allele1` groups, the second allele is `x` with probability \(p = 0.2\) and `X` with probability \(1 - p = 0.8\), independently of the first allele.

```
allele_distribution <- df %>%
  group_by(allele1, allele2) %>%
  tally() %>%
  mutate(frac = n / sum(n))
```

allele1 | allele2 | n | frac |
---|---|---|---|
x | x | 200006 | 0.2000018 |
x | X | 800015 | 0.7999982 |
X | x | 800802 | 0.2002016 |
X | X | 3199177 | 0.7997984 |

This also has the correct distribution of eye colours.

```
eye_colour_distribution <- df %>%
  group_by(eye_colour) %>%
  tally() %>%
  mutate(frac = n / sum(n))
```

eye_colour | n | frac |
---|---|---|
blue | 200006 | 0.0400012 |
brown | 4799994 | 0.9599988 |

The genotype distribution is also correct, since \(2p(1-p) \approx\) 0.32.

```
genotype_distribution <- df %>%
  group_by(genotype) %>%
  tally() %>%
  mutate(frac = n / sum(n))
```

genotype | n | frac |
---|---|---|
heterozygote | 1600817 | 0.3201634 |
homozygote | 3399183 | 0.6798366 |

Let’s also define a couple of functions to simulate reproduction within our population. The `pair` function matches up random individuals from the first table with random individuals from the second.

```
pair <- function(df1, df2) {
  inner_join(
    df1 %>%
      select(-matches('\\.(x|y)$')) %>%
      select(matches('^(id|allele|genotype|eye)')) %>%
      ungroup() %>%
      sample_frac(size = 1) %>%
      mutate(row = row_number()),
    df2 %>%
      select(-matches('\\.(x|y)$')) %>%
      select(matches('^(id|allele|genotype|eye)')) %>%
      ungroup() %>%
      sample_frac(size = 1) %>%
      mutate(row = row_number()),
    by = 'row'
  ) %>%
    select(-row) %>%
    return()
}
```

The `reproduce` function then randomly generates a child from the paired individuals.

```
reproduce <- function(pairs, n = 1) {
  pairs %>%
    crossing(child = 1:n) %>%
    mutate(
      # x and y indicate the allele taken from parent x and y, respectively
      x = rbinom(n(), 1, 0.5) + 1,
      y = rbinom(n(), 1, 0.5) + 1,
      allele1 = if_else(x == 1, allele1.x, allele2.x),
      allele2 = if_else(y == 1, allele1.y, allele2.y),
      genotype = if_else(allele1 == allele2, 'homozygote', 'heterozygote'),
      eye_colour = if_else(allele1 == 'x' & allele2 == 'x', 'blue', 'brown'),
      id = paste(id.x, id.y, child, sep = '-')
    ) %>%
    return()
}
```

The `kids` table then represents the next generation from random mating within the entire population.

```
kids <- df %>%
  pair(df) %>%
  reproduce()
```

allele1.x | allele2.x | allele1.y | allele2.y | allele1 | allele2 | x | y |
---|---|---|---|---|---|---|---|
X | X | X | x | X | X | 2 | 1 |
X | x | X | X | x | X | 2 | 1 |
X | X | X | x | X | x | 1 | 2 |
X | X | X | X | X | X | 1 | 1 |
X | x | X | x | X | X | 1 | 1 |
x | X | X | X | x | X | 1 | 1 |

The parent attributes are contained in the `kids` table, with the `.x` suffix for one parent and `.y` for the other.

We’ll use \(A\) to stand for the allele combination, e.g. \(XX\), or \(Xx = xX\), and \(E\) for eye colour. The subscripts \(i = 1, 2\) will be used for each of the two parents, and the absence of subscripts will indicate the variable for the child. We need to calculate the probability that the child is a heterozygote given that they are brown-eyed with brown-eyed parents:

\[ \mathbb P (A = Xx \mid E, E_1, E_2 = B). \]

It will be easier to calculate this if we can rewrite it as a probability conditional only on \(A_\bullet\)-variables. First note that

\[ \begin{align} \mathbb P (A, A_1, A_2) &= \mathbb P (A \mid A_1, A_2) \mathbb P(A_1, A_2) \\ &= \mathbb P (A \mid A_1, A_2) \mathbb P(A_1) \mathbb P (A_2) \end{align} \]

using the chain rule and the assumption of random mating. Therefore,

\[ \begin{align} & \mathbb P (A = Xx \mid E_\bullet = B) \\ &= \frac{\mathbb P (E_\bullet = B \mid A = Xx) \cdot \mathbb P (A = Xx)}{\mathbb P (E_\bullet = B)} \\ &= \frac{ \sum_{a_1, a_2} \mathbb P (E_\bullet = B \mid A = Xx, A_1 = a_1, A_2 = a_2) \cdot \mathbb P (A = Xx \mid A_1 = a_1, A_2 = a_2) \cdot \mathbb P (A_1 = a_1) \cdot \mathbb P (A_2 = a_2) }{ \sum_{a, a_1, a_2} \mathbb P (E_\bullet = B \mid A = a, A_1 = a_1, A_2 = a_2) \cdot \mathbb P (A = a \mid A_1 = a_1, A_2 = a_2) \cdot \mathbb P (A_1 = a_1) \cdot \mathbb P (A_2 = a_2) }, \end{align} \]

where the numerator is marginalised over possible values of \(A_1\) and \(A_2\), and the denominator additionally over \(A\).

The factors involving \(E_\bullet\) are either 1 or 0, depending only on whether the given combination of alleles can give rise to brown eyes or not, respectively. Moreover, \(\mathbb P (A_i = XX) = (1 - p)^2\) and \(\mathbb P (A_i = Xx) = 2p(1 - p)\), where the case \(A_i = xx\) is impossible conditional on everybody having brown eyes. The only non-trivial calculations now involve \(\mathbb P (A = a \mid A_1 = a_1, A_2 = a_2)\):

\[ \begin{align} \mathbb P (A = Xx \mid A_1 = Xx, A_2 = Xx) &= \frac{1}{2} \\ \mathbb P (A = Xx \mid A_1 = Xx, A_2 = XX) &= \frac{1}{2} \\ \mathbb P (A = Xx \mid A_1 = XX, A_2 = XX) &= 0 \\ \mathbb P (A = XX \mid A_1 = Xx, A_2 = Xx) &= \frac{1}{4} \\ \mathbb P (A = XX \mid A_1 = Xx, A_2 = XX) &= \frac{1}{2} \\ \mathbb P (A = XX \mid A_1 = XX, A_2 = XX) &= 1, \end{align} \]

as can be verified by inspection.
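The inspection can itself be automated; in a small sketch (the helper `p_child` is ours, not from the book), we enumerate the four equally likely allele draws from the two parents:

```r
# probability that a child of the given parents has the target genotype
p_child <- function(parent1, parent2, target = c('Xx', 'xX')) {
  draws <- outer(parent1, parent2, paste0)  # four equally likely children
  mean(draws %in% target)
}

p_child(c('X', 'x'), c('X', 'x'))  # Xx x Xx -> 1/2
p_child(c('X', 'x'), c('X', 'X'))  # Xx x XX -> 1/2
p_child(c('X', 'X'), c('X', 'X'))  # XX x XX -> 0
p_child(c('X', 'x'), c('X', 'x'), target = 'XX')  # Xx x Xx -> 1/4
```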

Now let’s plug these values into the formula for the desired probability:

\[ \begin{align} & \mathbb P (A = Xx \mid E_\bullet = B) \\ &= \frac{ \frac{1}{2} \cdot (2p(1 - p))^2 + \frac{1}{2} \cdot 2 \cdot 2p(1 - p)(1 - p)^2 + 0 \cdot (1 - p)^4 }{ (\frac{1}{2} + \frac{1}{4}) \cdot 4p^2(1 - p)^2 + (\frac{1}{2} + \frac{1}{2}) \cdot 4p(1 - p)^3 + (0 + 1) \cdot (1 - p)^4 } \\ &= \frac{(1 - p)^2}{(1 - p)^2} \frac{ 2p^2 + 2p(1 - p) }{ 3p^2 + 4p(1 - p) + (1 - p)^2 } \\ &= \frac{ 2p^2 + 2p - 2p^2 }{ 3p^2 + 4p - 4p^2 + 1 + p^2 - 2p } \\ &= \frac{ 2p }{ 1 + 2p }, \end{align} \]

as required. This is approximately \(2p\) for small \(p\), and approximately \(\frac{1}{2}\) for large \(p\).
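For instance, with the \(p = 0.2\) used in our simulation, this evaluates to:

```r
p <- 0.2
2 * p / (1 + 2 * p)  # 0.4 / 1.4 ≈ 0.2857
```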

To condition on brown-eyed children from brown-eyed parents, we can just filter the `kids` table. Such a child is called `judy` in this exercise.

```
judy <- kids %>%
  filter(
    eye_colour.x == 'brown',
    eye_colour.y == 'brown',
    eye_colour == 'brown'
  )

judy_genotypes <- judy %>%
  group_by(genotype) %>%
  tally() %>%
  mutate(frac = n / sum(n))
```

genotype | n | frac |
---|---|---|
heterozygote | 1280529 | 0.2858268 |
homozygote | 3199559 | 0.7141732 |

This is very close to the theoretical value of \(\frac{2p}{1 + 2p}\approx\) 0.286.

Denote by \(E_{C_\bullet} = B\) the condition that all of Judy’s children have brown eyes, and by \(A^p = a\) the condition that Judy’s partner has allele combination \(a\). Then

\[ \begin{align} & \mathbb P (A = Xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \frac{ \mathbb P (A = Xx \mid E_\bullet = B, A^p = Xx) \cdot \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A^p = Xx = A) }{ \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A^p = Xx) } \\ &= \frac{ \mathbb P (A = Xx \mid E_\bullet = B) \cdot \mathbb P (E_{C_\bullet} = B \mid A^p = Xx = A) }{ \sum_a \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A^p = Xx, A = a) \cdot \mathbb P (A = a \mid E_\bullet = B, A^p = Xx) } \\ &= \frac{ \mathbb P (A = Xx \mid E_\bullet = B) \cdot \mathbb P (E_{C_\bullet} = B \mid A^p = Xx = A) }{ \sum_a \mathbb P (E_{C_\bullet} = B \mid A^p = Xx, A = a) \cdot \mathbb P (A = a \mid E_\bullet = B) } \\ &= \frac{ \frac{2p}{1 + 2p} \cdot (\frac{3}{4})^n }{ \mathbb P (E_{C_\bullet} = B \mid A^p = Xx = A) \cdot \mathbb P (A = Xx \mid E_\bullet = B) + \mathbb P (E_{C_\bullet} = B \mid A^p = Xx, A = XX) \cdot \mathbb P (A = XX \mid E_\bullet = B) } \\ &= \frac{ \frac{2p}{1 + 2p} \cdot (\frac{3}{4})^n }{ \frac{2p}{1 + 2p} \cdot (\frac{3}{4})^n + \frac{1}{1 + 2p} } \\ &= \frac{2p \cdot (\frac{3}{4})^n}{2p \cdot (\frac{3}{4})^n + 1} \\ &= \frac{2p \cdot 3^n}{2p \cdot 3^n + 4^n} , \end{align} \]

where we have used conditional independence several times for the probability of the child’s alleles given the parents’ alleles. As \(n \rightarrow \infty\), this probability shrinks to 0.
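As a quick check of this formula (under the simulation’s \(p = 0.2\)), we can evaluate it for a few values of \(n\):

```r
p <- 0.2
p_judy_het <- function(n) 2 * p * 3^n / (2 * p * 3^n + 4^n)

p_judy_het(1)   # ≈ 0.2308, i.e. 6p / (6p + 4)
p_judy_het(5)   # ≈ 0.0867
p_judy_het(20)  # ≈ 0.0013, shrinking towards 0
```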

To simulate part 2, we need to pair `judy` with heterozygotes from the general population, then filter for those children with brown eyes.

```
judy_kids <- df %>%
  filter(genotype == 'heterozygote') %>%
  pair(judy) %>%
  reproduce() %>%
  ungroup() %>%
  filter(eye_colour == 'brown')
```

Amongst `judy_kids`, Judy’s attributes have the `.y` suffix. Given the above conditions, the probabilities of her possible genotypes are then:

```
judy_genotypes_posterior <- judy_kids %>%
  group_by(genotype.y) %>%
  tally() %>%
  mutate(frac = n / sum(n))
```

genotype.y | n | frac |
---|---|---|
heterozygote | 343962 | 0.2313911 |
homozygote | 1142534 | 0.7686089 |

This is close to the theoretical value of \(\frac{6p}{6p + 4}\approx\) 23.1%.

Let’s introduce some notation. Let \(A_g\) be the alleles of Judy’s first grandchild, i.e. of the child of Judy’s child \(c\), who has alleles \(A_c\) and whose partner has alleles \(A_c^p\). We wish to calculate \(\mathbb P (A_g = xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx)\).

First note that

\[ \begin{align} & \mathbb P (A_c = Xx \mid E_\bullet = E_{C_\bullet} = B, A^p = Xx, A = a) \\ &= \frac{ \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A_c = Xx = A^p, A = a) \cdot \mathbb P (A_c = Xx \mid E_\bullet = B, A^p = Xx, A = a) }{ \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A^p = Xx, A = a) } \\ &= \frac{ \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A_c = Xx = A^p, A = a) \cdot 0.5 }{ \mathbb P (E_{C_\bullet} = B \mid E_\bullet = B, A^p = Xx, A = a) } \\ &= \begin{cases} \frac{\left(\frac{3}{4}\right)^{n-1} \cdot \frac{1}{2}}{\left(\frac{3}{4}\right)^n} &\text{if } A = Xx \\ 1 \cdot \frac{1}{2} / 1 &\text{otherwise} \end{cases} \\ &= \begin{cases} \frac{2}{3} &\text{if } A = Xx \\ \frac{1}{2} &\text{otherwise} \end{cases} . \end{align} \]

Thus,

\[ \begin{align} & \mathbb P (A_c = Xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \sum_a \mathbb P (A_c = Xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx, A = a) \cdot \mathbb P (A = a \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \frac{2}{3} \cdot \frac{2p \cdot (\frac{3}{4})^n}{2p \cdot (\frac{3}{4})^n + 1} + \frac{1}{2} \cdot \frac{1}{2p \cdot (\frac{3}{4})^n + 1} \\ &= \frac{p\left( \frac{3}{4} \right)^{n-1} + 0.5}{2p \cdot \left(\frac{3}{4}\right)^n + 1} , \end{align} \]

which converges to \(\frac{1}{2}\) as \(n \rightarrow \infty\).

The probability that Judy’s first grandchild has blue eyes (i.e. that \(A_g = xx\)) can then be calculated by marginalising over the allele combinations of the child and their partner:

\[ \begin{align} & \mathbb P (A_g = xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \sum_{a_c, a_c^p} \mathbb P (A_g = xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx, A_c = a_c, A_c^p = a_c^p) \cdot \mathbb P (A_c = a_c, A_c^p = a_c^p \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \sum_{a_c, a_c^p} \mathbb P (A_g = xx \mid A_c = a_c, A_c^p = a_c^p) \cdot \mathbb P (A_c^p = a_c^p ) \cdot \mathbb P (A_c = a_c \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \sum_{a_c^p} \mathbb P (A_g = xx \mid A_c = Xx, A_c^p = a_c^p) \cdot \mathbb P (A_c^p = a_c^p ) \cdot \mathbb P (A_c = Xx \mid E_\bullet = B = E_{C_\bullet}, A^p = Xx) \\ &= \frac{p\left( \frac{3}{4} \right)^{n-1} + 0.5}{2p \cdot \left(\frac{3}{4}\right)^n + 1} \cdot \sum_{a_c^p} \mathbb P (A_g = xx \mid A_c = Xx, A_c^p = a_c^p) \cdot \mathbb P (A_c^p = a_c^p ) \\ &= \frac{p\left( \frac{3}{4} \right)^{n-1} + 0.5}{2p \cdot \left(\frac{3}{4}\right)^n + 1} \cdot \left( \mathbb P (A_g = xx \mid A_c = Xx, A_c^p = Xx) \cdot \mathbb P (A_c^p = Xx ) + \mathbb P (A_g = xx \mid A_c = Xx, A_c^p = xx) \cdot \mathbb P (A_c^p = xx ) \right) \\ &= \frac{p\left( \frac{3}{4} \right)^{n-1} + 0.5}{2p \cdot \left(\frac{3}{4}\right)^n + 1} \cdot \left( \frac{1}{4} \cdot 2p(1 - p) + \frac{1}{2} \cdot p^2 \right) \\ &= \frac{p\left( \frac{3}{4} \right)^{n-1} + 0.5}{2p \cdot \left(\frac{3}{4}\right)^n + 1} \cdot \frac{p}{2} , \end{align} \]

since

- the grandchild can only be blue-eyed if the (brown-eyed) child has at least one x-allele, i.e. the child is \(Xx\);
- \(A\) and \(A^p\) are independent by the random mating assumption; and
- \(A_c\) and \(A_c^p\) are independent by the random mating assumption.

As \(n \rightarrow \infty\), this probability converges to \(\frac{p}{4}\).
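Evaluating this expression for the simulation’s \(p = 0.2\) and \(n = 1\) child gives:

```r
p <- 0.2
p_grandchild_blue <- function(n) {
  (p * (3 / 4)^(n - 1) + 0.5) / (2 * p * (3 / 4)^n + 1) * p / 2
}

p_grandchild_blue(1)    # ≈ 0.0538
p_grandchild_blue(100)  # approaching the limit p / 4 = 0.05
```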

To simulate Judy’s grandkids, we pair up `judy_kids` with members of the general population.

```
judy_grandkids <- judy_kids %>%
  pair(df) %>%
  reproduce()

judy_grandkids %>%
  summarise(mean(eye_colour == 'blue')) %>%
  pull() %>%
  signif(3)
```

`[1] 0.054`

The above fraction of grandkids with blue eyes is consistent with the theoretical value of \(\frac{4p + 2}{6p + 4}\frac{p}{2} \approx\) 0.0538.

Here are my solutions to question 3.4.1 of *Causal Inference in Statistics: A Primer* (CISP). \(\DeclareMathOperator{\do}{do}\)

If we can only measure one additional variable to estimate the causal effect of \(X\) on \(Y\) in figure 3.8, then we should measure \(W\). From question 3.3.1 we see that no single variable satisfies the backdoor criterion. Moreover, visual inspection of the graph verifies that \(W\) satisfies the frontdoor criterion:

- it intercepts all (the only) directed paths from \(X\) to \(Y\);
- there is no unblocked path from \(X\) to \(W\); and
- all backdoor paths from \(W\) to \(Y\) are blocked by \(X\).

To illustrate this, let’s simulate the causal effect in 3 separate ways:

- by intervention,
- via the backdoor, and
- via the frontdoor.

Here are the data. Note that we have created functions for \(W\) and \(Y\) for use later.

```
N <- 100000

W <- function(x) {
  N <- length(x)
  rbinom(N, 1, inv_logit(-x))
}

Y <- function(d, w, z) {
  N <- length(d)
  rbinom(N, 1, inv_logit(-d - w + 3 * z))
}

df <- tibble(id = 1:N) %>%
  mutate(
    b = rnorm(N, 0, 1),
    a = b + rnorm(N, 0, 0.1),
    c = rnorm(N, 0, 1),
    d = rbinom(N, 1, inv_logit(-1 + c)),
    z = rbinom(N, 1, inv_logit(-2 + 2 * b + c)),
    x = rbinom(N, 1, inv_logit(a + z)),
    w = W(x),
    y = Y(d, w, z)
  )
```

id | b | a | c | d | z | x | w | y |
---|---|---|---|---|---|---|---|---|
1 | 0.3641297 | 0.3917626 | 1.0369530 | 1 | 1 | 0 | 0 | 1 |
2 | 0.0287563 | 0.0397299 | 0.5736271 | 0 | 0 | 1 | 1 | 0 |
3 | -0.7727052 | -0.5993870 | -0.5179657 | 0 | 1 | 0 | 1 | 1 |
4 | 0.4107888 | 0.5737898 | 1.2586840 | 0 | 1 | 1 | 0 | 1 |
5 | 2.3512417 | 2.1631719 | 0.6746523 | 1 | 1 | 1 | 0 | 1 |

In order to simulate an intervention, we assign values to \(X\) randomly, then assign new values to all its descendants. After intervention, the causal effect of \(X\) on \(Y\) is simply \(\mathbb P(Y \mid X)\).

```
intervention <- df %>%
  # intervene on x
  mutate(
    x = rbinom(n(), 1, 0.5),
    w = W(x),
    y = Y(d, w, z)
  ) %>%
  # model P(y | do(x))
  glm(
    formula = y ~ x,
    family = binomial(),
    data = .
  ) %>%
  # predict
  augment(
    newdata = tibble(x = 0:1),
    type.predict = 'response'
  )
```

x | .fitted | .se.fit |
---|---|---|
0 | 0.4637566 | 0.0022239 |
1 | 0.5072714 | 0.0022422 |

We can compare this causal effect to the simple statistical effect to see the difference.

```
noncausal <- df %>%
  # model P(y | x)
  glm(
    formula = y ~ x,
    family = binomial(),
    data = .
  ) %>%
  # predict
  augment(
    newdata = tibble(x = 0:1),
    type.predict = 'response'
  )
```

x | .fitted | .se.fit |
---|---|---|
0 | 0.3797282 | 0.0022595 |
1 | 0.5759927 | 0.0021293 |

Since \(\{D, Z\}\) satisfies the backdoor criterion, we can use it to apply the backdoor adjustment. First we’ll need \(\mathbb P(D, Z)\).

```
# P(d, z)
p_d_z <- df %>%
  group_by(d, z) %>%
  count() %>%
  ungroup() %>%
  mutate(p_d_z = n / sum(n))
```

Now we model \(\mathbb P(Y \mid X, D, Z)\), multiply it by \(\mathbb P(D, Z)\), then take the sum for each value of \(X\).

```
backdoor <- formula(y ~ 1 + x + z + d) %>%
  # model P(y | x, d, z)
  glm(
    family = binomial(),
    data = df
  ) %>%
  # predict
  augment(
    type.predict = 'response',
    newdata = crossing(
      d = c(0, 1),
      x = c(0, 1),
      z = c(0, 1)
    )
  ) %>%
  # get P(d, z)
  mutate(p_y_given_d_x_z = .fitted) %>%
  inner_join(p_d_z, by = c('d', 'z')) %>%
  # backdoor adjustment over d, z
  group_by(x) %>%
  summarise(p_y_given_do_x = sum(p_y_given_d_x_z * p_d_z))
```

x | p_y_given_do_x |
---|---|
0 | 0.4681530 |
1 | 0.5033398 |

Note that the backdoor adjusted estimates are similar to the estimates from intervention.

To apply the frontdoor adjustment with \(W\), we’ll need \(\mathbb P(W \mid X)\), \(\mathbb P(X^\prime)\), and \(\mathbb P(Y \mid X, W)\).

```
p_w_given_x <- df %>%
  group_by(x, w) %>%
  count() %>%
  group_by(x) %>%
  mutate(p_w_given_x = n / sum(n)) %>%
  ungroup()

p_xprime <- df %>%
  group_by(xprime = x) %>%
  count() %>%
  ungroup() %>%
  mutate(p_xprime = n / sum(n))

p_y_given_xprime_w <- formula(y ~ 1 + x + w) %>%
  glm(
    family = binomial(),
    data = df
  ) %>%
  augment(
    newdata = crossing(x = 0:1, w = 0:1),
    type.predict = 'response'
  ) %>%
  transmute(
    xprime = x,
    w,
    p_y_given_xprime_w = .fitted
  )
```

Now we apply the frontdoor adjustment:

\[ \mathbb P (Y \mid \do(X)) = \sum_{x^\prime, w} \mathbb P(x^\prime) \cdot \mathbb P(w \mid x) \cdot \mathbb P (y \mid x^\prime, w) . \]

```
frontdoor <- p_w_given_x %>%
  inner_join(p_y_given_xprime_w, by = 'w') %>%
  inner_join(p_xprime, by = 'xprime') %>%
  group_by(x) %>%
  summarise(sum(p_w_given_x * p_y_given_xprime_w * p_xprime))
```

x | sum(p_w_given_x * p_y_given_xprime_w * p_xprime) |
---|---|
0 | 0.4623710 |
1 | 0.5041105 |

Our frontdoor estimates of \(\mathbb P(Y \mid \do(X))\) are very similar to the intervention and backdoor estimates.
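To make the comparison explicit, we could collect the three estimates side by side (a sketch assuming the `intervention`, `backdoor`, and `frontdoor` objects computed above):

```r
comparison <- tibble(
  x = 0:1,
  intervention = intervention$.fitted,   # from the simulated intervention
  backdoor = backdoor$p_y_given_do_x,    # backdoor adjustment over d, z
  frontdoor = frontdoor[[2]]             # frontdoor adjustment via w
)
comparison
```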
