Departments of Medicine, Epidemiology and Biostatistics, McGill University
2025-01-07
No conflicts of interest
To the best of my knowledge I’m equally disliked, or at best deemed irrelevant, by all drug and device companies
AI generated image
Review of some probability concepts
Review of statistical inference concepts
Realization of their practical applications
2012 Harvard study suggests the answer is YES
For producers of clinical research, it helps with
collection and analysis of the data
recognition & acknowledgement of uncertainty
interpretation & presentation of the results
For consumers of clinical research, it helps with
their appreciation & understanding of new knowledge
formation of reasonable conclusions
good decision making
Hypothetico-deductive model of the scientific method
Where this talk concentrates
Plenty of other places to go wrong besides “stats”
Statistical inference: learning from a sample about the underlying population (without it we’re left simply with our data)
Sample data are “noisy”, so we need to account for uncertainty
Inferences require reference to common statistical distributions (or else simulations)
Statistical models alone are insufficient for causality, remember
\[association \ne causation\]
Understanding probability requires understanding what probability distributions could have generated the observed data
Common probability distributions
Normal distributions are common due to the central limit theorem (CLT): many random variables that are the sum of independent processes, such as measurement errors, are often close to normal.
Mathematically the normal distribution is expressed as \[f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}\]
where \({\mu}\) is the mean or expectation of the distribution (and also its median and mode)
\({\sigma^{2}}\) is the variance and \({\sigma }\) is the standard deviation
Student’s t distribution has “fatter” tails, whose heaviness is a function of the degrees of freedom (df)
Graphically, probability density functions (pdf)
Cumulative distribution function (CDF): the integral of the PDF (i.e. the area under the curve); it represents the probability that a random variable will take a value less than or equal to a specific point, or between two points.
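For example, a minimal check in R, where dnorm gives the standard normal PDF and pnorm its CDF:
dnorm(0) # PDF height at x = 0, ~0.399
pnorm(1.96) # CDF: P(X <= 1.96), ~0.975
pnorm(1.96) - pnorm(-1.96) # probability between two points, ~0.95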
The uniform distribution assumes that all continuous outcomes between boundaries have an equal probability of occurring.
Of interest in Bayesian statistics as a non-informative prior distribution, which allows the observed data to completely dominate the final (posterior) distribution.
Mathematically this is expressed as
\[f(x) = \frac{1}{b-a} \quad \text{for } a \le x \le b\] where \(f(x)\) is the probability density function (PDF)
Graphically
mean(x) = \(\frac{a+b}{2}\)
var(x) = \(\frac{(b-a)^2}{12}\)
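A quick simulation check of these moments in R (an illustrative sketch with a = 0, b = 1):
x <- runif(1e6, min = 0, max = 1) # 1 million draws from U(0, 1)
mean(x) # ~ (a + b)/2 = 0.5
var(x) # ~ (b - a)^2/12 = 0.083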
Discrete data -> binomial distribution:
the # of successes in a fixed number of independent Bernoulli trials (0,1), each with the same P(success), like coin flips.
Probability mass function (pmf): \[P(X = k) = {n \choose k} p^k (1 - p)^{n-k}\]
• P(X=k) = P(k successes in n trials)
• \({n \choose k}\) # ways choose k from n
• p = P (success in each trial)
• 1-p = P (failure in each trial)
Models any binary independent outcomes with many applications in medicine
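As a sanity check in R, the pmf formula matches the built-in dbinom (illustrative values n = 10, p = 0.5, k = 4):
n <- 10; p <- 0.5; k <- 4
choose(n, k) * p^k * (1 - p)^(n - k) # by the formula, 0.205
dbinom(k, size = n, prob = p) # built-in pmf, 0.205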
Graphically
Discrete probability distribution for the number of random, independent, and rare events in a time interval with constant rate \(\lambda\). Probability mass function (pmf): \[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
• P(X=k) P(k events in a given time period)
• \(\lambda\) is the expectation of the # events over the same time period
• variance = mean = \(\lambda\)
Graphically
Poisson P(k events with \(\lambda = np\)) \(\approx\) Binomial P(k successes in n trials)
as n -> \(\infty\) & p (the probability of success in any trial) -> 0
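A small R illustration of this approximation (illustrative values n = 1000, p = 0.002, so \(\lambda = np = 2\)):
n <- 1000; p <- 0.002 # large n, small p
round(dbinom(0:4, n, p), 3) # 0.135 0.271 0.271 0.181 0.090
round(dpois(0:4, n * p), 3) # 0.135 0.271 0.271 0.180 0.090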
Statistical inference depends on probability, the branch of mathematics describing how likely events are to occur
Contrasting views of probability
Null hypothesis significance testing (frequentist) inference: views probability as a long run frequency.
Answers questions like “What should I decide given my data, controlling the long run proportion of mistakes I make at a tolerable level.”
Bayesian inference: views probability as a calculus of beliefs.
Answers questions like “Given my prior beliefs and the objective information from the data, what should I now believe?”
Frequentist | Bayesian |
---|---|
Probability is “long-run frequency” | Probability is “degree of certainty” |
\(Pr(X \mid \theta)\) is a sampling distribution (function of variable \(X\) with fixed \(\theta\)) | \(Pr(X \mid \theta)\) is a likelihood (function of fixed \(X\) with variable \(\theta\)) |
No prior | Prior |
P-values (NHST) | Full posterior probability model available for summary/decisions |
Confidence intervals | Credible intervals |
Violates the “likelihood principle”1: sampling intention matters; corrections for multiple testing; adjustment for planned/post hoc testing | Respects the “likelihood principle”: sampling intention is irrelevant; no corrections for multiple testing; no adjustment for planned/post hoc testing |
Objective? | Subjective? |
A case study
A 2024 publication, “One-Year Outcomes of Transseptal Mitral Valve in Valve in Intermediate Surgical Risk Patients” (Circulation: Cardiovascular Interventions), based on 50 symptomatic MVIV patients with a failing mitral bioprosthesis, concluded that
“Mitral valve-in-valve with a balloon-expandable valve via transseptal approach in intermediate-risk patients was associated with improved symptoms and quality of life, adequate transcatheter valve performance, & no mortality or stroke at 1-year follow-up.”
According to the authors, the expected 1-year STS mortality was 4%
I wondered: what is the probability of seeing 0 deaths if MVIV had the same mortality as a redo operation?
I also wondered what would be my colleagues’ probability estimates.
Q1. What is P(observing 0 deaths | MVIV mortality = expected 4% redo mortality)?
Q2. What is P(observing 0 deaths | MVIV mortality = 40% higher than the expected redo mortality)?
Q3. Given only this study, what is P(MVIV is as safe or safer than a redo)?
15 of 36 (42%) MUHC cardiology staff, 4 of 22 (18%) fellows, and 4 GIM staff replied
Q1. P(0 deaths | true death rate 4%)?
Q2. P(0 deaths | true death rate 5.6%)? (i.e. 40% increase)?
Lots of variability, n’est-ce pas?
Discrete data -> Poisson (counts) or a binomial (independent Bernoulli trials) distribution
Calculations can be done by hand via the above equations or with any software that provides distribution functions.
Q1. P(0 deaths | true death rate 4%)?
With one line of code in R
db <- 100 * dbinom(0:6, size = 50, prob = 0.04)
Assuming a binomial distribution with an event (death) probability of 4%, the probabilities for 0, 1, 2, 3, 4, 5, 6 events are 13%, 27.1%, 27.6%, 18.4%, 9%, 3.5%, 1.1%, respectively.
Assuming a Poisson distribution with an event (death) rate of 2 (the expected # deaths among 50 patients in 1 year with 4% annual mortality)
Again from a single line of R code
dp <- 100 * dpois(0:6, 2)
P(0, 1, 2, 3, 4, 5, 6 events) is 13.5%, 27.1%, 27.1%, 18%, 9%, 3.6%, 1.2%, respectively.
Therefore the answer to Q1 is 10 - 14.9%
Visualizations emphasizing the similarity between the Poisson and binomial distributions.
Q2. P(0 deaths | true death rate 5.6%)? (i.e. 40% increase)?
Again with a single line of R code
dp1 <- 100 * dpois(0:6, 2.8)
Assuming a Poisson(2.8) distribution (rate = expected # deaths among 50 individuals in 1 year with 5.6% expected mortality = 50 × 0.056), the probabilities for 0, 1, 2, 3, 4, 5, 6 events are 6.1%, 17%, 23.8%, 22.2%, 15.6%, 8.7%, 4.1%, respectively.
Assuming a binomial distribution, the probabilities for 0, 1, 2, 3, 4, 5, 6 events are 5.6%, 16.6%, 24.2%, 22.9%, 16%, 8.7%, 3.9%, respectively.
Graphically
If the expected death rate is higher, the curves shift right and P(observing 0 deaths) falls, so
The answer to Q2 is 5 - 9.9%
Q3. P(MVIV as safe or safer than redo)?
IOW P(MVIV mortality \(\le\) 4%)
Again lots of variability, n’est-ce pas?
Q3. P(MVIV as safe or safer than redo)
Ideally we would want an RCT and hope that subjects in each arm are exchangeable. Here there is no RCT, but assuming the STS model is accurate, we can do a simulation with the 50 observed MVIV results and 50 counterfactual simulated subjects receiving a redo operation with 4% mortality, then perform a logistic regression (e.g. success | trials(total) ~ Tx, family = binomial). A minimal sketch with brms follows.
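A minimal sketch of how such a model might be fit with brms (dat1 and its exact contents are assumptions inferred from the output below: 50/50 survivors observed with MVIV, 48/50 survivors simulated for the redo counterfactual, and an identity link so the Tx coefficient is a risk difference):
library(brms)
dat1 <- data.frame(Tx = c("MVIV", "redo"),
                   success = c(50, 48), # survivors: 0 deaths (MVIV), 2 simulated deaths (redo)
                   total = c(50, 50)) # patients per arm
fit <- brm(success | trials(total) ~ Tx,
           family = binomial(link = "identity"),
           data = dat1, chains = 4, iter = 10000, warmup = 5000)
summary(fit) # produces output like that shown below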
Family: binomial
Links: mu = identity
Formula: success | trials(total) ~ Tx
Data: dat1 (Number of observations: 2)
Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
total post-warmup draws = 20000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.98 0.02 0.93 1.00 1.00 1229 2480
Txredo -0.04 0.04 -0.13 0.03 1.01 703 521
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
As expected, the mean mortality difference (MVIV - redo) is -4%, but this Bayesian model also gives us a measure of the associated uncertainty: 95% CrI -13% to +3%
The probability density for this mortality difference is plotted here
This shows a 12% probability (blue AUC) that the MVIV patients in this study would have had a better outcome with a surgical redo, provided their counterfactual mortality has been well predicted by the STS model. IOW, P(MVIV is as safe or safer than a redo) = 88.3% (grey AUC).
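One way this posterior probability might be extracted from the sketched fit above (b_Txredo is the posterior mortality difference, MVIV - redo, so MVIV is as safe or safer whenever it is \(\le\) 0):
draws <- as_draws_df(fit) # posterior draws (as_draws_df is re-exported by brms from the posterior package)
mean(draws$b_Txredo <= 0) # P(MVIV as safe or safer than redo), ~0.88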
Therefore the answer to Q3 is 76-95%
Q1. Only 26% (6/23) correctly estimated the 10-14.9% probability of observing 0 deaths if the true rate was 4%; 7 of 23 (30%) estimated this probability as < 5%.
Q2. 74% (17/23) of respondents correctly reasoned that if the true underlying mortality increased, observing 0 deaths becomes less likely, but 26% (6/23) did not.
Q3. Only 9% (2/23) of responses correctly estimated P(MVIV as safe or safer than a redo) = 76-95%; 15 of 23 (65%) estimated this probability at under 50%.
Q1-3. No one gave all 3 correct answers
Limitations
i) small sample size of respondents
ii) unknown whether respondents have better, worse, or the same quantitative skills as the non-respondents.
Notwithstanding these limitations, the results suggest that additional quantitative probability training may be helpful.
Definition
The p-value is the probability, under the null hypothesis and an assumed data-generating model, that an appropriate test statistic would be as or more extreme than what was observed.
\(P(\text{data} \mid H_o, \text{model})\)
Consider an encouraging experiment with a true effect 2 standard errors from zero, i.e. the observed effect \(\sim N(2,1)\) on the standard error scale.
Imagine a simulation where the same experiment is repeated 10 times.
What do you think is the range of the 10 p values?
1. 0.02 to 0.05
2. 0.002 to 0.05
3. 0.0002 to 0.05
4. 0.0002 to 0.5
Again one line of R code
set.seed(123) # for reproducibility
p_value <- 2*(1 - pnorm(abs(rnorm(10, 2, 1)))) # two-sided p-values from 10 random draws of N(2,1)
print(p_value) # print the 10 p-values
0.1500, 0.0767, 0.0003, 0.0384, 0.033,
0.0002, 0.0138, 0.4623, 0.1891, 0.1201
How can this be?
Even though we KNOW the effect = 2 standard errors, about half the time p > 0.05, meaning we should accept H\(_o\) of no effect!
ChatGPT4: “If we assume minor variations around the 2 standard errors effect size, p-values might typically range from just below 0.0456 to slightly above it. For example, if some experiments measure the effect size at slightly less than 2 or slightly more than 2 standard errors, the p-values could range approximately between 0.04 and 0.05.”
THIS IS WRONG.
ChatGPT4: “Your simulation highlighted the variability that can occur in experimental results due to random sampling, which indeed can lead to a much broader range of p-values. It demonstrates that even if the true effect size is relatively large, the exact p-values obtained in any given experiment can vary widely.”
Me: “If you ask the same question tomorrow or in the future, will you update your response accordingly?”
ChatGPT4: “My response will certainly take into account this discussion and the insights from your simulation.”
IT ABSOLUTELY DIDN’T AS WHEN ASKED THE SAME QUESTION 1 WEEK LATER IT GAVE THE SAME INCORRECT RESPONSE.
The larger points are:
The p-value is a statement relative to the null hypothesis of no effect; it doesn’t have much of an interpretation relative to a real, nonzero effect.
The p-value is a weird nonlinear transformation of the z-score with nonintuitive behavior & is super-noisy
Even ChatGPT4 doesn’t realize how noisy p-values are!
You can learn a lot from simulations.
Andrew Gelman’s Blog
Bayes’ Theorem -> probability statements about hypotheses, model parameters or anything else that has associated uncertainty
Advantages
treats unknown parameters as random variables -> direct and meaningful answers (estimates)
allows integration of all available information -> mirrors sequential human learning with constant updating (see the sketch after this list)
allows consideration of complex questions / models where all sources of uncertainty can be simultaneously and coherently considered
Disadvantages
subjectivity (?) & the problem of induction (Hume / Popper: the difficulty of generalizing to the future)
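As a concrete illustration of this sequential updating, a minimal conjugate Beta-binomial sketch in R (the numbers are purely illustrative, with a flat Beta(1,1) prior):
a <- 1; b <- 1 # flat Beta(1, 1) prior on an event probability
a <- a + 3; b <- b + 17 # update with a first study: 3 events in 20 patients
a <- a + 5; b <- b + 35 # update again with a second study: 5 events in 40 patients
a / (a + b) # posterior mean, ~0.145
qbeta(c(0.025, 0.975), a, b) # 95% credible interval, ~0.07 to 0.24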
A 2023 RCT in the NEJM concluded “In patients with refractory out-of-hospital cardiac arrest, extracorporeal CPR and conventional CPR had similar effects on survival with a favorable neurologic outcome”
Are the results truly similar?
What do MUHC cardiologists think?
MUHC cardiologists (51% response rate, 18 of 35) gave their probabilities that eCPR was superior to cCPR
Only 2 of the cardiologists gave >50% probability of eCPR being superior
The respondents were next provided with two previous trials of eCPR with these results
name | n_ecpr | survival_ecpr | n_ctl | survival_ctl |
---|---|---|---|---|
ARREST | 15 | 6 | 15 | 1 |
PRAGUE | 124 | 38 | 132 | 24 |
Both trials showed improved 30-day survival with eCPR.
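A hedged sketch of how these two prior trials alone might be combined to estimate this probability (assuming independent flat Beta(1,1) priors and simple pooling across the trials):
set.seed(123)
p_ecpr <- rbeta(1e5, 1 + 44, 1 + 139 - 44) # pooled eCPR survival, 44/139
p_ctl <- rbeta(1e5, 1 + 25, 1 + 147 - 25) # pooled control survival, 25/147
mean(p_ecpr > p_ctl) # posterior P(eCPR superior); very close to 1 with these data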
The respondents were then asked to update their previous probabilities that eCPR is better than cCPR
A slight overall shift: the number giving >50% rose from 2 to 5 respondents, but collectively 13/18 still gave < 50%
How did MUHC cardiologists update their beliefs with additional evidence?
Observations: i) the original authors claimed “similarity” but their own data actually implied a greater than 50% probability of eCPR superiority, rising to 76-95% with prior data; ii) 16/18 MUHC cardiologists increased their probability estimate with the new data; but iii) their original low estimates had a persistent anchoring effect on the revised estimates
Gut1 (heuristic) probability estimates are widely variable & often very far from quantified probabilistic estimates
Statistical literacy depends on appreciating and understanding the underlying probability distributions.
Statistical literacy improves diagnostic and therapeutic decision-making.
Undergraduate and graduate medical education should consider improving their quantitative training.
My PhD supervisor, (emeritus) Professor Lawrence Joseph, arguably Canada’s first Bayesian biostatistician
Fonds de Recherche du Québec (Santé) whose salary support (1999 - 2023) allowed me to pursue these statistical musings.
Brophy JM. Key Issues in the Statistical Interpretation of Randomized Clinical Trials. Canadian Journal of Cardiology. 2021;37(9):1312-1321
Brophy JM. Bayesian Analyses of Cardiovascular Trials-Bringing Added Value to the Table. Can J Cardiol. 2021;37(9):1415-1427
Heuts S, et al. Bayesian Analytical Methods in Cardiovascular Clinical Trials (a hands-on tutorial). Can J Cardiol, online
Statistical code is available online for the above references
Slides available here
Barney (2011-) friend, running partner, and favorite muse
Stats and the physician