Statistics and the cardiologist

Jay Brophy MD PhD

Departments of Medicine, Epidemiology and Biostatistics, McGill University

2024-10-23

Conflicts of interest


No conflicts of interest
To the best of my knowledge I’m equally disliked, or at best deemed irrelevant, by all drug and device companies


AI generated image

Objectives


Review of probability concepts

Review of statistical inference concepts

Realization of their practical applications

Are statistics (math) important for MDs?

2012 Harvard study suggests the answer is YES

What are the benefits of “good” stats knowledge?


Informs proper research methods for the

  • collection and analysis of the data

  • acknowledgement of uncertainty

  • presentation and interpretation of the results

Facilitates

  • the creation and understanding of knowledge

  • reasonable conclusions

  • good decision making

Metascience - study of science itself

Hypothetico-deductive model of the scientific method

Metascience

Where this talk concentrates

Metascience

Plenty of other places to go wrong besides “stats”

Statistical inference


Statistical inference is the process of using data analysis to infer (learn) from a sample about the properties of the underlying population (without it we’re left simply with our data)

Most often involves “noisy” sample data, so variability and uncertainty must be accounted for

Inferences require reference to common statistical distributions (or else simulations)

The area under the curve (AUC) of the probability density function from the assumed statistical model corresponds to probabilities for that random variable

Statistical models alone are insufficient for causality, remember
\[association \ne causation\]

Probability

Probability is the branch of mathematics describing how likely events are to occur -> the foundation of statistical inference

Contrasting views of probability

Frequentist style inference: probability is long-run frequency; parameters are considered fixed but unknown quantities, so we cannot make probability statements about them. Answers questions like “What should I decide given my data, controlling the long-run proportion of mistakes I make at a tolerable level?”

Bayesian style inference: probability is the calculus of beliefs; parameters are considered random variables with probability distributions that follow the rules of probability. Answers questions like “Given my prior subjective beliefs and the objective information from the data, what should I believe now?”

Basic Probability Distributions

Understanding probability requires understanding some possible models (& their assumptions)

IOW, what underlying probability distributions could have generated the observed data.

Here are common probability distributions according to continuous or discrete data.

Normal and t distributions

Normal distributions are common due to the central limit theorem (CLT): many random variables that are sums of independent processes, such as measurement errors, are approximately normal.

Mathematically the normal distribution is expressed as \[f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}\]

where \({\mu}\) is the mean or expectation of the distribution (and also its median and mode)

\({\sigma^{2}}\) is the variance and \({\sigma }\) is the standard deviation

Graphically this may be plotted as shown

Student’s t is a continuous distribution with “fatter” tails (depending on the degrees of freedom)
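As a minimal R sketch (illustrative only), the two densities can be compared directly with the built-in dnorm() and dt() functions:

```r
# Compare the standard normal density with a t density (3 degrees of freedom)
x <- seq(-4, 4, length.out = 200)
plot(x, dnorm(x), type = "l", ylab = "density")  # normal
lines(x, dt(x, df = 3), lty = 2)                 # t: fatter tails
dt(3, df = 3) > dnorm(3)                         # TRUE - more mass in the tails
```

As the degrees of freedom increase, the t distribution converges to the normal.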

Uniform distribution

The uniform distribution assumes that all continuous outcomes between boundaries have an equal probability of occurring.

Of interest in Bayesian statistics as a prior non-informative distribution which allows the observed data to completely dominate the final (posterior) distribution.

Graphically this may be plotted as shown

Binomial Distribution

For discrete data, continuous models are inappropriate (e.g. negative counts are impossible). A binomial distribution describes the number of successes in a fixed number of independent Bernoulli trials, where each trial has the same probability of success.

Mathematically, \[P(X = k) = {n \choose k} \, p^k \, (1 - p)^{n-k}\]
P(X=k) is the probability of having k successes in n trials
\({n \choose k}\) is the binomial coefficient, the number of ways to choose k successes from n trials
p is the probability of success in each trial
1-p is the probability of failure in each trial
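The formula can be checked numerically; a minimal sketch evaluating it by hand and with R’s built-in dbinom():

```r
# P(X = 2) for n = 10 Bernoulli trials with success probability p = 0.5
n <- 10; k <- 2; p <- 0.5
by_hand  <- choose(n, k) * p^k * (1 - p)^(n - k)  # direct from the formula
built_in <- dbinom(k, size = n, prob = p)         # built-in equivalent
c(by_hand, built_in)                              # both ~0.0439
```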

Binomial distributions can model coin flips, where there are two possible outcomes (heads or tails) and each trial is independent of the others.

But there are many other, more important, applications for binary outcomes in various fields, including medicine

Poisson Distribution

Another discrete probability distribution - the number of events occurring in a fixed time interval at a constant rate.
Mathematically, \[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
P(X=k) is the probability of observing k events in a given period of time
\(\lambda\) is the expected number of events over the same time period
the variance equals the mean: \(Var(Y) = E(Y) = \lambda\)


If n -> \(\infty\) & p (any trial success) -> 0 then \[binomial \:P(\textit{k} \:successes \:in \:\textit{n} \:trials) \approx poisson \:P(k \:with \: \lambda = \textit{np})\]

Better appreciated by plotting the equation
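The limiting relationship is also easy to verify numerically; an illustrative sketch with n = 1000 and p = 0.002 (so np = 2):

```r
# As n grows and p shrinks with np fixed, Binomial(n, p) approaches Poisson(np)
n <- 1000; p <- 0.002
round(dbinom(0:4, n, p), 4)           # binomial probabilities for 0..4 events
round(dpois(0:4, lambda = n * p), 4)  # Poisson(2): nearly identical
```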

How well do we understand probability?

A case study

Based on 50 MVIV patients, a 2024 publication, “One-Year Outcomes of Transseptal Mitral Valve in Valve in Intermediate Surgical Risk Patients” in Circulation: Cardiovascular Interventions, concluded that in symptomatic patients with a failing mitral bioprosthesis

“Mitral valve-in-valve with a balloon-expandable valve via transseptal approach in intermediate-risk patients was associated with improved symptoms and quality of life, adequate transcatheter valve performance, & no mortality or stroke at 1-year follow-up.”

I wondered what is the probability of seeing 0 deaths if truly no difference in mortality between MVIV and a redo with an estimated STS mortality = 4%?

I also wondered what would be my colleagues’ probability estimates.

Short questionnaire

Q1. What is P(observing 0 deaths | MVIV mortality = expected redo mortality)?

  1. <1%
  2. 1 - 4.9%
  3. 5 -9.9%
  4. 10 - 14.9%
  5. > 15%

Q2. What is P(observing 0 deaths | MVIV mortality = 40% higher than expected redo mortality)?

  1. <1%
  2. 1 - 4.9%
  3. 5 -9.9%
  4. 10 - 14.9%
  5. > 15%

Q3. Given only this study, what is P(MVIV is as safe or safer than a redo)?

  1. < 25 %
  2. 25 - 50%
  3. 51 - 75%
  4. 76 - 95%
  5. > 95%

Quiz Results

15 (42%, n= 36) MUHC cardiology staff and 4 (18%, n = 22) fellows replied
Q1. P(0 deaths | true death rate 4%)?
Q2. P(0 deaths | true death rate 5.6%)? (i.e. 40% increase)?

Lots of variability, n’est-ce pas?

Quiz Results

Q3. P(MVIV as safe or safer than redo)?

Lots of variability, n’est-ce pas?

Quiz Q1 Discussion

Use either a Poisson distribution (counts) or a binomial distribution (independent Bernoulli trials), where each of the 50 subjects is considered as alive or dead with a death probability of 4%. These calculations can be done with the above equations or, more easily, with any software that includes the Poisson and binomial distributions.

Q1. P(0 deaths | true death rate 4%)?
With one line of code in R
db <- 100*dbinom(c(0:6),50, .04)

Assuming a binomial distribution with an event (death) probability of 4%, the probability for 0, 1, 2, 3, 4, 5, 6 events is 13%, 27.1%, 27.6%, 18.4%, 9%, 3.5%, 1.1%, respectively.

Assuming a Poisson distribution with an event (death) rate of 2 (# deaths in 50 individuals in 1 year with 4% expected mortality), the probability for 0, 1, 2, 3, 4, 5, 6 events is 13.5%, 27.1%, 27.1%, 18%, 9%, 3.6%, 1.2%, respectively.
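The Poisson version is equally short; a one-line sketch using the rate of 2 expected deaths among 50 patients:

```r
# Poisson companion to the binomial calculation: rate = 50 * 0.04 = 2
dp <- 100 * dpois(0:6, lambda = 50 * 0.04)
round(dp, 1)   # 13.5 27.1 27.1 18.0 9.0 3.6 1.2
```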

Therefore the answer to Q1 is 10 - 14.9%

Quiz Q1 Discussion

This data can also be visualized emphasizing the similarity between the Poisson and Binomial distributions.

Therefore the answer to Q1 is 10 - 14.9%

Quiz Q2 Discussion


Q2. P(0 deaths | true death rate 5.6%)? (i.e. 40% increase)?

Again with a single line of R code
dp1 <- 100*dpois(c(0:6),2.8)

Assuming a Binomial(50, 5.6%) distribution, the probability for 0, 1, 2, 3, 4, 5, 6 events is 5.6%, 16.6%, 24.2%, 22.9%, 16%, 8.7%, 3.9%, respectively.

Assuming a Poisson(2.8) distribution (rate = # deaths in 50 individuals in 1 year with 5.6% expected mortality = 50 × 0.056), the probability for 0, 1, 2, 3, 4, 5, 6 events is 6.1%, 17%, 23.8%, 22.2%, 15.6%, 8.7%, 4.1%, respectively.
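For completeness, a one-line sketch of the binomial companion to the Poisson calculation above:

```r
# Binomial probabilities for 0..6 deaths among 50 patients at 5.6% mortality
db1 <- 100 * dbinom(0:6, size = 50, prob = 0.056)
round(db1, 1)   # 5.6 16.6 24.2 22.9 16.0 8.7 3.9
```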

Visually

Clearly, if the expected death rate is higher the curves shift right (toward more deaths), meaning the probability of observing 0 deaths falls with increasing mortality rates, so

The answer to Q2 is 5 - 9.9%

Quiz Q3 Answer & Discussion

Q3. P(MVIV as safe or safer than redo)

Ideally we would observe the outcome under both interventions in the same patient, but that is impossible. So we do RCTs and hope that subjects in both treatment arms are exchangeable. Here we lack this ideal, but assuming the STS model is accurate, the MVIV patients would have had a 4% mortality if, instead of MVIV, they had undergone a surgical redo. We can therefore simulate a dataset with the 50 observed MVIV results and 50 counterfactual results receiving a redo operation with a 4% mortality rate.
Again, binomial distributions with binary outcome (dead or alive) but extended to include the explanatory variable treatment (logistic regression).

 Family: binomial 
  Links: mu = identity 
Formula: success | trials(total) ~ Tx 
   Data: dat1 (Number of observations: 2) 
  Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
         total post-warmup draws = 20000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.98      0.02     0.93     1.00 1.00     1229     2480
Txredo       -0.04      0.04    -0.13     0.03 1.01      703      521

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

As expected, mean mortality difference (MVIV - redo) is - 4% but the model also now gives us a measure of the associated uncertainty with 95% CrI -13% to 3%
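As a cross-check, a minimal conjugate sketch (not the regression model above): with uniform Beta(1, 1) priors and the assumed counts of 0/50 MVIV deaths versus 2/50 counterfactual redo deaths (50 × 4%), Beta posteriors give a similar probability:

```r
# Conjugate Beta posteriors (illustrative sketch, assumed counts):
# deaths + 1 and survivors + 1 as shape parameters under a uniform prior
set.seed(123)
p_mviv <- rbeta(20000, 0 + 1, 50 - 0 + 1)  # posterior MVIV mortality
p_redo <- rbeta(20000, 2 + 1, 50 - 2 + 1)  # posterior redo mortality
mean(p_mviv <= p_redo)                     # P(MVIV as safe or safer), roughly 0.85-0.9
```

The numbers differ slightly from the regression output because the model and priors differ, but the qualitative conclusion is the same.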

Quiz Q3 Discussion

The probability density for this mortality difference is plotted here

This shows an 11.7% probability (blue AUC) that MVIV patients in this study would have had a worse outcome with a surgical redo, provided their counterfactual mortality has been well predicted by the STS model. IOW, P(MVIV is as safe or safer than a redo) = 88.3% (grey AUC).

Therefore the answer to Q3 is 76-95%

Final thoughts on Quiz #1

For Q1, a minority (3/19, 16%) correctly estimated that if the true expected rate was 4% there was still a 10-14.9% probability of observing 0 deaths; 7 of 19 (37%) estimated this probability as < 5%.

For Q2, most respondents (13/19, 68%) correctly reasoned that if the true underlying mortality increased, observing 0 deaths would be less likely, but a third (6/19) didn’t.

For Q3, many assumptions are required, but the question asked respondents to ignore the potential biases and consider only this study. Only 2 of 19 (10.5%) correctly estimated the probability of MVIV being as safe or safer than a redo as lying in the 76-95% interval; 12 of 19 (63%) estimated this probability at under 50%.

Limitations
i) small sample size of respondents
ii) unknown whether respondents have better, worse, or the same quantitative skills as the non-respondents.

Notwithstanding these limitations, the results suggest additional quantitative probability training may be helpful.

(In)famous P value

Definition
The 𝑝-value is the probability, under the null hypothesis and an assumed data-generating model, that an appropriate test statistic would be as or more extreme than what was observed.
\(P(data \mid H_0, \ model)\)
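For example, the two-sided p-value for an observed z-statistic of 2.0 under a standard normal reference distribution is one line of R:

```r
# Two-sided p-value for z = 2 under the standard normal
2 * (1 - pnorm(2))   # ~0.0455
```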

Quiz 2

Consider an experiment which is quite encouraging with a true effect size that is 2 standard errors from zero.

Imagine a simulation where the same experiment can be repeated 10 times.
What do you think is the range of the 10 p values?

1. 0.02 to 0.05
2. 0.002 to 0.05
3. 0.0002 to 0.05
4. 0.0002 to 0.5

Again, essentially one line of R code (with a seed for reproducibility)
set.seed(123)
p_value <- 2*(1 - pnorm(abs(rnorm(10, 2, 1)), 0, 1))

Print the p values associated with the 10 random draws from a N(2,1) distribution
0.1500, 0.0767, 0.0003, 0.0384, 0.033,
0.0002, 0.0138, 0.4623, 0.1891, 0.1201

Quiz 2 answers

Consider an experiment which is quite encouraging with a true effect size that is 2 standard errors from zero. What do you think is the range of the 10 p values?

Quiz 2 What is ChatGPT’s answer?

“For a z-score of 2.0, the p-value can be calculated using the standard normal distribution. A z-score of 2.0 corresponds to a one-tailed p-value of about 0.0228 or a two-tailed p-value of about 0.0456.” THIS IS WRONG.

“Most p-values are still below 0.05, but some might be higher, depending on the fluctuation of the observed effect sizes. You might see p-values ranging from around 0.01 to 0.10.”

Quiz 2 answers

How can this be?

Even though we KNOW here that the true effect is 2 standard errors from zero, about half the time p > 0.05, meaning we would (by convention) fail to reject the null hypothesis of no difference!

Quiz 2 What does ChatGPT say now?

“Your simulation highlighted the variability that can occur in experimental results due to random sampling, which indeed can lead to a much broader range of p-values. It demonstrates that even if the true effect size is relatively large, the exact p-values obtained in any given experiment can vary widely, influenced by the specific sample drawn in each case.”

“If you ask the same question tomorrow or in the future, my response will certainly take into account this discussion and the insights from your simulation.”

Quiz 2 answers

The larger points are:

  1. The p-value is a statement relative to the null hypothesis of no effect. It doesn’t have much of an interpretation relative to a real, nonzero effect.

  2. The p-value is super-noisy. It’s a weird nonlinear transformation of the z-score with nonintuitive behavior.

  3. Even ChatGPT gets this wrong!

  4. You can learn a lot from simulations.


    Andrew Gelman’s Blog

Bayesian Inference

Bayes’ Theorem -> probability statements about hypotheses, model parameters or anything else that has associated uncertainty

Advantages
treats unknown parameters as random variables -> direct and meaningful answers (estimates)

allows integration of all available information -> mirrors sequential human learning with constant updating

allows consideration of complex questions / models where all sources of uncertainty can be simultaneously and coherently considered

Disadvantages
subjectivity (?); the problem of induction (Hume / Popper - the difficulty of generalizing to the future)

Frequentist vs Bayesian (summary)

Frequentist
  • Probability is “long-run frequency”
  • \(Pr(X \mid \theta)\) is a sampling distribution (a function of \(X\) with \(\theta\) fixed)
  • No prior
  • P-values (NHST)
  • Confidence intervals
  • Violates the “likelihood principle”: sampling intention matters; corrections for multiple testing; adjustments for planned/post hoc testing
  • Objective?

Bayesian
  • Probability is “degree of certainty”
  • \(Pr(X \mid \theta)\) is a likelihood (a function of \(\theta\) with \(X\) fixed)
  • Prior
  • Full probability model available for summaries/decisions
  • Credible intervals
  • Respects the “likelihood principle”: sampling intention is irrelevant; no corrections for multiple testing; no adjustments for planned/post hoc testing
  • Subjective?

Probabilities with 2 arm studies

A 2023 paper in the NEJM concluded “In patients with refractory out-of-hospital cardiac arrest, extracorporeal CPR and conventional CPR had similar effects on survival with a favorable neurologic outcome.”

Are the results truly similar?
What do MUHC cardiologists think?

Quiz 3 Q1

I surveyed MUHC cardiologists (51% response rate, 18 of 35) and asked their probabilities that eCPR was superior to cCPR

Only 2 of the cardiologists gave a >50% probability of eCPR being superior

Quiz 3 Q2

The respondents were next provided with 2 previous trials of eCPR with these results

name     n_ecpr  survival_ecpr  n_ctl  survival_ctl
ARREST       15              6     15             1
PRAGUE      124             38    132            24

Both trials showed improved 30 day survival with eCPR.
The respondents were then asked to update their previous probabilities that eCPR is better than cCPR

Quiz 3 Q2

A positive overall shift: respondents giving a >50% probability rose from 2 to 5, but collectively 13/18 still estimated < 50%

Quiz 3 Q2

How did individual cardiologists update their beliefs with additional evidence?

Observations:
i) the original authors claimed “similarity”, but there was actually a greater than 50% probability of eCPR superiority based on their data alone, and 76-95% with the prior data included
ii) 16/18 increased their probability estimate with the new data, but
iii) the original low estimate had a persistent anchoring effect on the revised estimates

Conclusions

Gut probability estimates are widely variable
Gut probability estimates are often very far from probabilistic quantified effects
Statistical literacy depends on understanding probability distributions.
Statistical literacy will improve diagnostic and therapeutic decision-making.
Undergraduate and graduate medical education should consider improving their quantitative training.




Acknowledgements



My former PhD supervisor, Professor (emeritus) Lawrence Joseph, arguably Canada’s first Bayesian biostatistician


Fonds de Recherche du Québec (Santé) whose salary support (1999 - 2023) allowed me to continue these statistical musings.

References

  1. Brophy JM. Key Issues in the Statistical Interpretation of Randomized Clinical Trials. Canadian Journal of Cardiology. 2021;37(9):1312-1321.

  2. Brophy JM. Bayesian Analyses of Cardiovascular Trials-Bringing Added Value to the Table. Can J Cardiol. 2021;37(9):1415-1427.

  3. Heuts S et al. Bayesian Analytical Methods in Cardiovascular Clinical Trials (a hands-on tutorial). Can J Cardiol. (in press)

Statistical code is available online for the references

Slides available here

Thank you


Barney (2011-) friend, running partner, and favorite muse