Departments of Medicine, Epidemiology and Biostatistics, McGill University
2024-11-26
Conflicts of interest
No conflicts of interest
To the best of my knowledge I’m equally disliked, or at best deemed irrelevant, by all drug and device companies
AI generated image
Objectives
Review of probability concepts
Review of hierarchical modeling concepts
Realization of their practical applications
Hierarchical models (models with memory)
Often data is not iid (e.g. repeat or clustered measurements)
- nested data (patients within hospitals, within regions, with countries)
- panel or time-series cross-sectional data
Hierarchical (mixed of multilevel) models account for the dependency structure with potential pooling of information across clusters
Benefits of the multilevel approach
(1) Provide robust and conservative estimates across levels
(2) Estimate variation on all levels of the model
(3) Predict values of new groups not originally present in the data
Simple averaging can be dangerous leading to false confidence
Multilevel models allow us to preserve all uncertainties & should be default models
Increasing use of hierarchical models
Hierarchical model
Hierarchical model: terms
Hierarchical model: terms
The standard Bayesian random effects meta-analysis model follows the same hierarchical approach where level 1 is \[ y_j = \theta_j + \epsilon \:where\>error\> term\> \epsilon_j \, \sim \, normal(0,s_j)\] models the uncertainty of the effect from each individual study, \(\theta_j\). We assume that the estimate \(y_j\) from the j-th study is unbiased and normally distributed with standard error \(s_j\). Then level 2 is
\[ \theta_j = \mu + \mu_j \: where\>error\> term\> \mu_j \, \sim \, normal(0,\tau)\] where \(\mu\) is the average treatment effect in the (hypothetical) superpopulation of similar studies, \(\theta_j\) is the effect in the j-th individual study and the variance \(\tau^2\) is referred to as the heterogeneity
Hierarchical model: comparisons
Stein’s paradox (1997)
Which average (model) to use?
Want to predict each player’s season-long average
3 possible predictive models
(a) complete pooling, which assumes each player is the same, \(\boldsymbol y\), corresponds to zero population variance
(b) no pooling, which assumes the players are unrelated, \(y_i\), corresponds to infinite population variance
(c) partial pooling (hierarchical), where each player is assumed to have a different chance of success, but the data for all of the observed players informs the estimates for each individual player
IOW, how much pooling depends on the variance in the population, more variance -> less pooling Stein’s paradox is that these new Stein/James (partial) estimators, \((z_i = \boldsymbol y + c(y_i - \boldsymbol y))\), predict better than individual averages, \(y_i\), or the global average, \(\boldsymbol y\) where c is the shrinkage factor
Stein’s paradox (1977)
If more than 3 means, estimating each by its own mean is not the most efficient
Essence of the Stein method is that of “borrowing” of information or “shrinkage” will produce “better” estimates, meaning lowered mean squared errors
Baseball data for early season average (hits (successes) / 45 chances (AB)) for 18 players
library(rstanarm)data(bball1970)bball <- bball1970# graph on next slidebball_graph <- bball %>%mutate(final_avg = (Hits + RemainingHits) / (AB + RemainingAB), early_avg = Hits / AB,stein_est = .265+ .212*(early_avg-.265)) %>% dplyr::select(Player, final_avg, early_avg, stein_est) %>%pivot_longer(!Player, names_to ="average", values_to ="count") %>%ggplot(aes(x=Player, y=count, fill=factor(average)))+ylab("batting average") +geom_col(position='dodge') +geom_hline(yintercept = .265, linewidth=2, color ="orange") +# early average# geom_hline(yintercept = .269, linewidth=2, color = "black") + final averagecoord_flip()
Key points
1. Stein estimators (blue bar) closer to global early mean than individual early averages (red bar) (“shrinkage”)
2. Shrinkage is greatest for points farthest from the global average
3. Stein estimators (blue bar) closer to individual final avg (green bar) 14 vs 4 times than for the individual early avg (red bar) (i.e beats separate model)
4. Stein estimators (blue bar) closer to individual final avg (green bar) 13 vs 5 times than for the global early avg (orange vertical line) (i.e beats pooled model)
Bayesian approach
Stein estimator is not Bayesian as it is not dependent on priors, but it is a precursor to Bayesian multilevel models (shrinkage, borrowing information)
Full Bayesian analysis (with weakly informative priors) is possible and follows using rstanarm package
SEED <-101fit_partialpool <-stan_glmer(cbind(Hits, AB - Hits) ~ (1| Player), data = bball, #<<family =binomial("logit"), seed = SEED, refresh=0)# shift each player's estimate by intercept (and then drop intercept)shift_draws <-function(draws) {sweep(draws[, -1], MARGIN =1, STATS = draws[, 1], FUN ="+")}summary_stats <-function(posterior) {x <-invlogit(posterior) # log-odds -> probabilitiest(apply(x, 2, quantile, probs =c(0.05, 0.5, 0.95))) }alphas <-shift_draws(as.matrix(fit_partialpool));partialpool <-summary_stats(alphas)partialpool <- partialpool[-nrow(partialpool),];rownames(partialpool) <-as.character(bball$Player)batting_avg <-function(x) print(format(round(x, digits =3), nsmall =3), quote =FALSE)
James–Stein estimators also have direct implications on the machine learning era of statistical innovation - Efron 2023
Radon example
919 obs. of 4 variables from houses in 85 counties in Minnesota - log_radon - Radon measurement from the house (log scale)
- floor - 0 = basement, 1 = first floor
- county - County name (factor)
- log_uranium - Uranium level in the county (log scale)
Gives completely independent estimates for each group without any Bayesian shrinkage, useful only if suspect that the groups (counties) are very different and nothing to be gained from borrowing information across groups.
Radon models - pooling model
Each county has its own intercept (alpha[j]) with common floor slope (beta) The error term is normally distributed with standard deviation sigma.
# Example Stan code:data { int<lower=1> N; // observations int<lower=1> J; // counties array[N] int<lower=1, upper=J> county; vector[N] x; // floor vector[N] y; // radon}parameters { vector[J] alpha; real beta; real<lower=0> sigma;}model {// Model specifications alpha ~normal(0, 10); beta ~normal(0, 10); sigma ~normal(0, 10);// Likelihood y ~normal(alpha[county] + beta * x, sigma); }generated quantities { array[N] real log_lik; // Log-likelihood for each observation array[N] real y_rep; // Posterior predictive checksfor (i in1:N) { log_lik[i] =normal_lpdf(y[i] | alpha[county[i]] + beta * x[i], sigma); y_rep[i] =normal_rng(alpha[county[i]] + beta * x[i], sigma); // Generate predictive samples }}
Radon models - complete vs. partial pooling
Easier to use front end packages like rstanarm and brms than stan directly. These packages are not approprite for no pooling because they’re designed to do Bayesian shrinkage.
library(rstanarm)library(loo)mn_radon <-read_csv("data/mn_radon.csv", show_col_types =FALSE)# ====== Complete Pooling ====== model_cp_rstanarm <-stan_glm(log_radon ~ floor, data = mn_radon, family =gaussian(), refresh=0)# ====== Partial Pooling ====== model_pp_rstanarm <-stan_glmer(log_radon ~ floor + (1|county_id), data = mn_radon, family =gaussian(), refresh=0)save(model_pp_rstanarm, file="output/model_pp_rstanarm.RData")rstan_comp <-loo_compare(loo(model_cp_rstanarm), loo(model_pp_rstanarm))save(rstan_comp, file="output/rstan_comp.RData")
library(brms)library(loo)# ====== Complete Pooling ====== model_cp_brms <-brm(log_radon ~ floor, data = mn_radon, family =gaussian(), chains =2, iter =2000, refresh=0)# ====== Partial Pooling ====== model_pp_brms <-brm(log_radon ~ floor + (1|county_id), data = mn_radon, family =gaussian(), chains =2, iter =2000, refresh=0)brms_comp <-loo_compare(loo(model_cp_brms), loo(model_pp_brms))save(brms_comp, file="output/brms_comp.RData")
Radon models - complete vs. partial pooling plots
1. More shrinkage of point estimate toward global mean with i) “trials” farther away from global mean ii) smaller sample size
2. More shrinkage of credible intervals with smaller studies
Radon models - complete vs. partial pooling
Pooled model, log_radon ~ floor, ignores any grouping structure (e.g., counties) in the data and assumes that all observations are independent and identically distributed & has a straightforward interpretation floor coefficient = average effect across all observations, without accounting for any variation between different counties.
Treats floor effect as uniform across all counties, assumes variance homogeneity in dependent variable.
Provides overall average effect but can lead to biased or misleading conclusions if significant group-level variability exists, underestimates the standard errors of the estimated effects due to ignoring the intra-group correlation.
Hierarchical,AKA mixed-effects, model, log_radon ~ floor + (1|county_id),intercept varies by county around the overall average radon level
If 2 models are very different, -> high between-group variation, & more complex hierarchical model is likely better.
\(\Delta\) in expected log pointwise predictive density (epld) between two statistical models measures model performance, specifically how well a model predicts new data
Metric is commonly used when comparing models using tools like the loo package in R, which implements leave-one-out cross-validation (LOO) for Bayesian models.
elpd_diff = -53.2 (52.8) suggests that the first model (hierarchical) likely to be a better predictor of new data than the comparison completely pooled model with se = 10.6 and therefore z \(\approx\) 5
Interpretation: There is strong evidence to suggest a hierarchical model is preferred over the completely pooled model.
Radon models
What about other possible confounders e.g. log_uranium
# The subset of the radon data we need is included in rstanarmlibrary("rstanarm"); data(radon)# Now adjust the pooled model for the county-level uranium predictormodelA <-stan_lmer( log_radon ~ floor + log_uranium + floor:log_uranium + (1+ floor | county),data = radon, cores =4, iter =2000, chains =4)saveRDS(modelA, file ="output/modelA.rds")# Recall the hierarchical model without the county-level uranium predictormodelB <-update(modelA, formula = log_radon ~ floor + (1+ floor | county))saveRDS(modelB, file ="output/modelB.rds")
# loading outcome to save repeating MCMC calculationsmodelA <-readRDS("output/modelA.rds");modelB <-readRDS("output/modelB.rds")# Compare models needs to separate code block from preceeding (why?)looA <-loo(modelA); looB <-loo(modelB)print(loo_compare(looA, looB))
elpd_diff se_diff
modelA 0.0 0.0
modelB -9.8 5.1
elpd_diff = -9.8 suggests that the including the variable log_uranium in the hierarchical likely leads to better predictor compared to the baseline hierarchical model without it, but se = 5.1 suggesting some uncertainty (z = - 1.92)
Interpretation: Although there is some uncertainty does suggest hierarchical model including county uranium levels (model A) is preferred.
The effect of floor is therefore \(exp^{-0.66}\) = 0.51 decrease in radon measurements from basement to first floor
Multinational RCT
PLATO NEJM 2009 multinational RCT of antiplatelets, ticagrelor vs. clopidogrel > 5,000 citations
Results Primary end point — composite of death from vascular causes, myocardial infarction, or stroke — occurred in 9.8% of ticagrelor subjects vs. 11.7% of clopidogrel subjects (HR, 0.84; 95% CI 0.77 to 0.92; P<0.001) Conculsion Ticagrelor significantly reduced the primary outcome Data - 862 centers, 43 countries, 6 geographic regions (iid?)
NEJM -> time to event analysis. With only aggregate data, limited to cumulative RR (epiR::epi.2by2)
nt <-sum(dat$tic_N); nc <-sum(dat$clo_N); et <-sum(dat$tic_O); ec <-sum(dat$clo_O)total <-matrix(c(et,nt-et,ec,nc-ec), nrow =2, byrow =TRUE, dimnames =list(c("Ticagrelor", "Clopidogrel"), c("Died", "Alive")))epi.2by2(dat =as.table(total), method ="cohort.count", conf.level =0.95, units =100, outcome ="as.columns")
Outcome + Outcome - Total Inc risk *
Exposed + 864 8469 9333 9.26 (8.68 to 9.86)
Exposed - 1014 8276 9290 10.91 (10.29 to 11.57)
Total 1878 16745 18623 10.08 (9.66 to 10.53)
Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio 0.85 (0.78, 0.92)
Inc odds ratio 0.83 (0.76, 0.92)
Attrib risk in the exposed * -1.66 (-2.52, -0.79)
Attrib fraction in the exposed (%) -17.90 (-28.50, -8.18)
Attrib risk in the population * -0.83 (-1.60, -0.06)
Attrib fraction in the population (%) -8.24 (-12.61, -4.03)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 14.106 Pr>chi2 = <0.001
Fisher exact test that OR = 1: Pr>chi2 = <0.001
Wald confidence limits
CI: confidence interval
* Outcomes per 100 population units
Results concordant with reported (HR, 0.84; 95% CI 0.77 to 0.92; P<0.001)
Data clustering
Original treated observations as iid, ignores possible clustering within specific geographic regions -> error underestimation -> over confidence in results precision
Reasonable to ignore these differences? Possible solution - hierarchical model
US data
# to get individual non-pooled results for each regiondat2 <- dat[dat$Region=="US",]mat <-matrix(c(dat2$tic_O, dat2$tic_N - dat2$tic_O, dat2$clo_O, dat2$clo_N - dat2$clo_O), nrow =2, byrow =TRUE,dimnames =list(c("Ticagrelor", "Clopidogrel"), c("Died", "Alive")))tt <-epi.2by2(dat =as.table(mat), method ="cohort.count", conf.level =0.95, units =100, outcome ="as.columns")tt$massoc.summary[1,1:4]
var est lower upper
1 Inc risk ratio 1.251958 0.9241111 1.696115
Eastern Europe data
# to get individual non-pooled results for each regiondat2 <- dat[dat$Region=="Eastern Europe",]mat <-matrix(c(dat2$tic_O, dat2$tic_N - dat2$tic_O, dat2$clo_O, dat2$clo_N - dat2$clo_O), nrow =2, byrow =TRUE,dimnames =list(c("Ticagrelor", "Clopidogrel"), c("Died", "Alive")))tt <-epi.2by2(dat =as.table(mat), method ="cohort.count", conf.level =0.95, units =100, outcome ="as.columns")tt$massoc.summary[1,1:4]
var est lower upper
1 Inc risk ratio 0.7598766 0.6583074 0.8771167
Different models
Separate RCTs, or in this case from separate regions, can be modeled as follows \(y_j\) = log(RR) \[y_j \sim N(\theta_j, \sigma^2_j) \\ \theta_j \sim N(\mu, \tau)\] where \(y_j = log(RR_j)\) and \(\sigma^2_j\) is assumed to be known with certainty (large sample sizes). Meta-analyses are typically concerned with the overall mean, \(\mu\).
Recall, in general, three ways to estimate \(\theta_j\);
1. No-pooling: there is a separate model for each study and \(\theta_j=y_j\). i.e. hierarchical model in which the between study variation \(\tau = \infty\).
2. Complete-pooling: patients in each study are random samples from a common distribution so \(\theta_j = \mu\). i.e. hierarchical model in with \(\tau = 0\).
3. Partial-pooling: a compromise between 1. and 2. In this case \(\tau\) is estimated from the data and \(\theta_j\) is closer to \(\mu\) when \(\tau\) is small relative to \(\sigma^2_j\) and closer to \(y_j\) when the reverse is true.
Meta-analysis
Random effects meta-analysis is a form of hierarchical modeling (regression analysis) where both within and between group variations are considered
A relevant statistic is the relative risk ratio, or \(p_{1j} / p_{0j}\), where \(p_{1j} = d_{1j}/n_{1j}\)and \(p_{0j} = d_{0j}/n_{0j}\)
Log relative risk ratio, \(y_j = log(p_{1j}) - log(p_{0j})\) preferred, since \(\approx \:N(\mu, \sigma)\)
Variance of each \(y_j\) calculated by treating \(p_{1j}\) and \(p_{0j}\) as sample proportions and using the delta method, so \(y_j\) variance is
RR 95%-CI %W(common) %W(random)
US 1.25 [0.92; 1.70] 8.1 12.8
Eastern Europe 0.76 [0.66; 0.88] 36.1 23.8
Western Europe 0.85 [0.72; 1.00] 27.6 22.2
Asia 0.78 [0.60; 1.01] 11.1 15.3
Latin America 0.87 [0.67; 1.12] 11.1 15.3
Other 1.09 [0.77; 1.55] 6.0 10.6
Number of studies: k = 6
Number of observations: o = 18623 (o.e = 9333, o.c = 9290)
Number of events: e = 1878
RR 95%-CI z p-value
Common effect model 0.85 [0.78; 0.92] -3.75 0.0002
Random effects model 0.88 [0.77; 1.02] -1.73 0.0842
Prediction interval [0.59; 1.33]
Quantifying heterogeneity:
tau^2 = 0.0162 [0.0000; 0.2158]; tau = 0.1272 [0.0000; 0.4645]
I^2 = 54.2% [0.0%; 81.6%]; H = 1.48 [1.00; 2.33]
Test of heterogeneity:
Q d.f. p-value
10.91 5 0.0531
Details on meta-analytical method:
- Inverse variance method
- Restricted maximum-likelihood estimator for tau^2
- Q-Profile method for confidence interval of tau^2 and tau
- Prediction interval based on t-distribution (df = 4)
Meta-analysis - Forest plot
possible plotting functions metafor::forest or meta::forest
meta::forest(plot_Plato)
Do we understand these summary statistics and their difference?
Summary statistics
Fixed effect meta-analysis assumes a common treatment effect in each study and variation in observed study estimates is due only to chance, within study variation. The summary result provides the best estimate of an assumed common treatment effect
Random effects meta-analysis assumes the true treatment effect differs from study to study and provides an estimate of the average treatment effect considers both within and between study variation). The summary result gives the average from the distribution of treatment effects across studies
Prediction interval aids the interpretation of random effects meta-analysis by providing a predicted range for the true treatment effect in an individual (next) study
Bayesian meta-analysis
Maximum likelihood approach treats parameters as fixed
Ignores uncertainty in \(\tau\) and confidence intervals for \(\mu\) are too narrow
Bayesian model using brms wrapper package addresses this concern
Region condition_mean .lower .upper
1 Asia 0.8296125 0.6616653 1.0057436
2 Eastern.Europe 0.7917701 0.6834757 0.9006256
3 Latin.America 0.8690452 0.7093868 1.0744040
4 Other 0.9444320 0.7624756 1.3088410
5 US 1.0297661 0.8056876 1.4245776
6 Western.Europe 0.8560940 0.7337112 0.9877441
Bayesian meta-analysis
The individual effects are shrunken towards the global mean (dotted line) and the variance is also improved by this borrowing of information and is better shown in the figure below.
Easy in a Bayesian framework since study effects are assumed to be exchangeable
Simulate the posterior of \(\tilde{\theta_j}\) by drawing \(\tilde{\theta}_j \sim N(\mu, \tau)\) using \(\mu\) and \(\tau\) from the posterior
US Eastern Europe Western Europe Asia Latin America
y 1.2519580 0.75987655 0.84751053 0.7827260 0.86654589
sigma 0.1549170 0.07320748 0.08359779 0.1320247 0.13178799
mode 1.0003205 0.80668934 0.86023523 0.8502698 0.86891932
median 1.0344567 0.79999914 0.85972569 0.8398095 0.87445741
mean 1.0480330 0.79747707 0.85937876 0.8346058 0.87624220
sd 0.1530569 0.06993446 0.07175383 0.1015066 0.09956671
95% lower 0.7853382 0.65875763 0.71652872 0.6235269 0.67742674
95% upper 1.3458635 0.93052361 1.00151553 1.0321938 1.07999519
Other
y 1.0885769
sigma 0.1789770
mode 0.8952275
median 0.9490039
mean 0.9666059
sd 0.1346969
95% lower 0.7308986
95% upper 1.2510222
tau mu theta
mode 0.1151484 0.87873048 0.8669743
median 0.1411086 0.88993912 0.8843447
mean 0.1610604 0.89668778 0.8966878
sd 0.1128046 0.09628676 0.2190497
95% lower 0.0000000 0.71619581 0.4560304
95% upper 0.3708797 1.10094337 1.3694340
Bayesian meta-analysis - Predictions
Predictions can also be displayed graphically
forestplot(ma01, xlog=TRUE) #adds vertical line at x=1
Compare to MLE - random effects 0.88 (0.77 - 1.02), prediction interval 0.88 (0.59 - 1.33)
Bayesian meta-analysis - Predictions
Predictions can also be displayed graphically using ggridges package
Bayesian meta-analysis
In summary, the standard Bayesian random effects meta-analysis model follows a hierarchical approach \[ \beta_j = \mu + \mu_j \: where\>error\> term\> \mu_j \, \sim \, normal(0,\tau)\] where \(\mu\) is the average treatment effect in the (hypothetical) superpopulation of similar studies, \(\beta_j\) is the effect in the j-th individual study and the variance \(\tau^2\) is referred to as the heterogeneity and \[ b_j = \beta_j + \epsilon \:where\>error\> term\> \epsilon_j \, \sim \, normal(0,s_j)\] models the uncertainty of the estimates from each individual study. We assume that the estimate \(b_j\) from the j-th study is unbiased and normally distributed with standard error \(s_j\).
Bayesian meta-analysis
While the observed treatment effect of a single unbiased clinical trial estimates the underlying “true” treatment effect, it is incorrect to view this as an immutable property of the treatment
Many reasons exist to expect additional variation among the underlying effects from differences between study populations, application of the treatment and measurement protocols among other factors if the study was repeated.
Leads to a paradox whereby the precision for a single trial (with less data) is better than for a meta-analysis of several trials (with more data) due to the consideration of the between study heterogeneity, \(\tau\), in the meta-analysis
Bayesian meta-analysis of a single trial!
To address this paradox Gelman has proposed using the distribution of treatment effects and heterogeneity among the 1,636 meta-analyses from the Cochrane Database of Systematic Reviews (CDSR) as prior information.
Using simulated data and cross validation, they show that their Bayesian “meta-analyses of single studies” perform much better than naively considering the single trial, which equates to setting \(\tau \> = 0\).
The CDSR prior on the heterogeneity results in better quantification of the uncertainty, reducing the mean squared error both for estimating the study-level and population-level effects by about 60% on average, the equivalent to more than doubling the sample size!
Hierarchical modeling can aid in causal inference
Causal inference can be formulated statistically as a missing-data problem & hierarchical modeling aids in causal inference by
1. Accounting for data collection: In any data analysis, it is appropriate to account for any individual or group characteristics that are predictive of treatment assignment and inclusion in the dataset
2. Adjusting for unmeasured covariates: In structured data, multilevel modeling can yield more efficient estimates than classical no-pooling estimates
3. Modeling variation in treatment effects: Can model not only the expected treatment effect as a function of pre-treatment covariates 𝑥, but can also model the unexplained variance in the treatment effect