The Notion-2 trial (Jorgensen et al. 2024) was recently fast-tracked for publication in the European Heart Journal. This study randomized low-risk patients (≤75 years of age and median Society of Thoracic Surgeons (STS) risk score of 1.1%) with severe aortic stenosis (AS) to Transcatheter Aortic Valve Implantation (TAVI) or to conventional aortic valve surgery. The study population included both tricuspid and bicuspid AS.
The primary endpoint was a composite of all-cause mortality, stroke, or rehospitalization (related to the procedure, valve, or heart failure) at 12 months.
A total of 370 patients were enrolled, and the 1-year incidence of the primary endpoint was 10.2% in the TAVI group and 7.1% in the surgery group [absolute risk difference 3.1%; 95% confidence interval (CI), −2.7% to 8.8%; hazard ratio (HR) 1.4; 95% CI, 0.7–2.9; P = .3]. The authors concluded:
> “Among low-risk patients aged ≤75 years with severe symptomatic AS, the rate of the composite of death, stroke, or rehospitalization at 1 year was similar between TAVI and surgery.”
How was the study designed?
The study was designed assuming a sample of 372 patients would provide the trial 90% power to show the non-inferiority of TAVI to surgery with regard to the primary endpoint at 1 year, assuming a Kaplan–Meier estimate of the primary endpoint of 10% in the TAVI group and 15% in the surgery group. The authors stated “To test for non-inferiority, we determined whether the upper boundary of the 95% confidence interval (CI) for the difference in the rate of the primary endpoint between the TAVI and surgery group was less than the pre-specified non-inferiority margin of 5% points.”
In other words, the null hypothesis (\(H_0\)) is \(\theta_{TAVI} - \theta_{surgery} > 5\%\), where \(\theta\) is the proportion of outcomes in the respective treatment arms. One hopes to reject this null hypothesis and accept the alternative hypothesis (\(H_A\)) that the difference is < 5%, and thereby claim non-inferiority. Although the authors did claim non-inferiority, their data do not support this conclusion: the null hypothesis cannot be rejected, because the upper limit of the 95% confidence interval for the difference in outcomes between TAVI and surgery is 8.8%, exceeding the prespecified non-inferiority margin of 5 percentage points.
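This is easy to verify from the published event counts (19/187 with TAVI, 13/183 with surgery). The following base R sketch reconstructs the Wald risk difference and its 95% CI and compares the upper bound with the 5 percentage-point margin:

```r
# Reconstruct the non-inferiority test from the published event counts
# (19/187 events with TAVI, 13/183 with surgery)
p_tavi <- 19 / 187
p_surg <- 13 / 183

rd <- p_tavi - p_surg  # absolute risk difference

# Wald standard error and 95% CI for the risk difference
se <- sqrt(p_tavi * (1 - p_tavi) / 187 + p_surg * (1 - p_surg) / 183)
ci <- rd + c(-1, 1) * qnorm(0.975) * se

round(100 * c(rd = rd, lower = ci[1], upper = ci[2]), 1)
#    rd lower upper
#   3.1  -2.7   8.8

# The upper bound (8.8%) exceeds the 5% margin, so H0 cannot be rejected
ci[2] < 0.05
# [1] FALSE
```

These values match the published risk difference of 3.1% (95% CI −2.7% to 8.8%), confirming that the prespecified non-inferiority criterion was not met.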
How then did the authors reach their conclusion?
The authors apparently ignored their non-inferiority design and analysed their study with conventional null hypothesis significance testing (NHST) and a null hypothesis \(\theta_{TAVI} - \theta_{surgery} = 0\), which they were unable to reject since the p value (0.3) exceeded the conventional \(\alpha\) level (0.05).
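This conventional analysis can be reproduced from the 2×2 counts with an uncorrected chi-squared test in base R:

```r
# Primary-endpoint events by arm: TAVI 19/187, surgery 13/183
tab <- matrix(c(19, 187 - 19,
                13, 183 - 13),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("TAVI", "Surgery"), c("Event", "No event")))

# Uncorrected chi-squared test of H0: no difference between arms
chisq.test(tab, correct = FALSE)
# X-squared ≈ 1.09, df = 1, p-value ≈ 0.30
```

The non-significant p value of about 0.3 matches the published result, but, as discussed next, it says nothing about how much evidence there is *for* the null hypothesis.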
The limitations of NHST and its tendency to cause cognitive errors have been described countless times in the medical literature (Wasserstein and Lazar 2016; Wasserstein, Schirm, and Lazar 2019). This particular cognitive error has even been described with the pithy aphorism "absence of evidence is not evidence of absence" (Altman and Bland 1995).
What is the strength of this conclusion?
Within the NHST paradigm, it is impossible to quantify the evidence in favor of the null hypothesis. A non-significant finding can occur due to low power or a truly absent effect, and the reporting of a p value simply can't disentangle these two possibilities. An alternative to NHST is null hypothesis Bayesian testing (NHBT), which allows the strength of evidence for (or against) \(H_0\) and \(H_A\) to be directly compared. This is most commonly achieved with Bayes factors, which quantify the relative probability of the data under \(H_0\) and \(H_A\).
Bayes factors (BFs)
The relationship between BFs and the relative support for (or against) \(H_0\) and \(H_A\) is reflected in the following graphic.
Code
```r
# Load necessary libraries
library(ggplot2)
library(grid)

g <- rasterGrob(c("lightgreen", "yellow", "orange", "red"),
                width = unit(1, "npc"), height = unit(1, "npc"),
                interpolate = TRUE)

# Create a continuous range for the y-axis
y_values <- seq(0.01, 1000, length.out = 10000)

# Create data frame for plotting
data <- data.frame(BayesFactor = y_values)

# Define a continuous color gradient from a deeper yellow to green to blue
color_gradient <- c("#FFD700", "#00B300", "#0000FF")

# Plot the data
ggplot(data, aes(y = BayesFactor, fill = BayesFactor)) +
  geom_tile(aes(x = 0.5), width = 0.2) +  # Greatly reduce the x-direction plot area
  scale_y_continuous(trans = "log10",
                     breaks = c(0.01, 1/3, 1, 3, 10, 30, 100, 1000),
                     labels = c("0", "1/3", "1", "3", "10", "30", "100", "∞"),
                     expand = c(0, 0)) +
  # scale_fill_gradientn(colors = color_gradient, name = NULL) +
  geom_segment(aes(x = 0.4, xend = 0.6, y = 1/3, yend = 1/3), linetype = "dashed", color = "black") +
  geom_segment(aes(x = 0.4, xend = 0.6, y = 3, yend = 3), linetype = "dashed", color = "black") +
  geom_segment(aes(x = 0.4, xend = 0.6, y = 10, yend = 10), linetype = "dashed", color = "black") +
  geom_segment(aes(x = 0.4, xend = 0.6, y = 30, yend = 30), linetype = "dashed", color = "black") +
  annotate("text", x = 0.5, y = sqrt(0.01 * 1/3), label = "Evidence against\ntreatment effect", size = 4.5, hjust = 0.5, vjust = 0.5) +
  annotate("text", x = 0.5, y = 2, label = "Not enough data to\nknow if the drug works", size = 4.5, hjust = 0.5, vjust = 0.5) +
  annotate("text", x = 0.5, y = 20, label = "Evidence for\ntreatment effect", size = 4.5, hjust = 0.5, vjust = 0.5) +
  annotate("text", x = 0.65, y = sqrt(30 * 500), label = "{very strong pro-alternative}", hjust = 0, size = 4.5) +
  annotate("text", x = 0.65, y = sqrt(10 * 30), label = "{strong pro-alternative}", hjust = 0, size = 4.5) +
  annotate("text", x = 0.65, y = sqrt(3 * 10), label = "{moderate pro-alternative}", hjust = 0, size = 4.5) +
  annotate("text", x = 0.65, y = sqrt(1/3 * 3), label = "{ambiguous}", hjust = 0, size = 4.5) +
  annotate("text", x = 0.65, y = sqrt(0.01 * 1/3), label = "{pro-null}", hjust = 0, size = 4.5) +
  coord_cartesian(xlim = c(0.4, 1), ylim = c(0.01, 1000), clip = "off") +  # Limit x-axis and ensure no clipping of text
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.text.x = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        legend.position = "none",
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +  # Reduced overall plot area
  labs(y = "Bayes Factor") +
  annotation_custom(g, xmin = 0.4, xmax = 0.6, ymin = -2, ymax = Inf) +
  ggtitle("Bayes factors and evidential strength")

ggsave("output/BF.pdf", dpi = 600, device = "pdf")
ggsave("output/BF.png", dpi = 600, device = "png")
```
What is the Bayes Factor for the Notion-2 survival data?
```r
library(baymedr)  # provides coxph_data_sim() and coxph_bf() (Linde, van Ravenzwaaij, and Tendeiro 2022)

# simulate Notion-2 dataset
sim_data <- coxph_data_sim(           # See ?coxph_data_sim for details
  n_data = 100,                       # Number of data sets to be simulated
  ns_c = 183,                         # Sample size (control condition)
  ns_e = 187,                         # Sample size (experimental condition)
  ne_c = 13,                          # Number of events (control condition)
  ne_e = 19,                          # Number of events (experimental condition)
  cox_hr = c(1.4, 0.7, 2.9),          # HR, lower bound CI, upper bound CI
  cox_hr_ci_level = 0.95,             # Confidence level CI
  maxit = 300,                        # Max number of PSO iterations (for psoptim())
  maxit.stagnate = ceiling(300 / 5),  # Max number of PSO iterations without
                                      # reduction in loss (for psoptim())
  cores = 5                           # Number of cores to be used
)
save(sim_data, file = "output/sim_data.RData")
```
Code
load("~/Desktop/current/notion2/output/sim_data.RData")sim_bf<-coxph_bf(# See ?coxph_bf for details data =sim_data, # Object containing the data null_value =0, # H0 value alternative ="two.sided", # H1 type (one- or two-sided) direction =NULL, # H1 direction (low or high) prior_mean =0, # Beta prior mean prior_sd =1# Beta prior SD)sim_bf
The median BF of 0.511 falls in the "ambiguous" range, revealing that there is little evidence to support the null hypothesis of no difference between the two treatments.
This result can be checked (approximately) by ignoring any time dependency and simply considering the outcome data in the form of a 2×2 contingency table. Here is the Notion-2 data in tabular form for the primary outcome.
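One way to sketch such a check in base R is with a Savage–Dickey density ratio on the log odds ratio, using a normal(0, 1) prior to mirror the `prior_mean = 0`, `prior_sd = 1` setting of the Cox-model BF above. The normal approximation to the log OR likelihood is an assumption of this sketch, not the exact method used for the folded calculation:

```r
# 2x2 primary-outcome table: TAVI 19 events / 168 non-events,
#                            surgery 13 events / 170 non-events
n11 <- 19; n12 <- 168
n21 <- 13; n22 <- 170

# Normal approximation to the log odds ratio and its standard error
log_or <- log((n11 * n22) / (n12 * n21))
se     <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)

# Prior on the log OR: normal(0, 1), mirroring the Cox-model analysis
prior_sd <- 1

# Conjugate normal-normal update for the posterior of the log OR
post_prec <- 1 / se^2 + 1 / prior_sd^2
post_mean <- (log_or / se^2) / post_prec
post_sd   <- sqrt(1 / post_prec)

# Savage-Dickey: BF01 = posterior density at 0 / prior density at 0
bf01 <- dnorm(0, post_mean, post_sd) / dnorm(0, 0, prior_sd)
bf10 <- 1 / bf01
round(bf10, 2)
# [1] 0.57
```

This approximate BF10 of about 0.57 sits in the same "ambiguous" range as the median BF of 0.511 from the Cox-model simulation.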
The result is, as expected, consistent with the earlier one, again underlining the lack of any strong evidence to support the null hypothesis.
Of course, a close and proper interpretation of a standard statistical analysis provides the same inferences, as shown below by the binomial risk ratios and their 95% confidence intervals, as published in the original manuscript.
```
             Outcome +   Outcome -   Total   Inc risk *
Exposed +           19         168     187   10.16 (6.23 to 15.41)
Exposed -           13         170     183    7.10 (3.84 to 11.84)
Total               32         338     370    8.65 (5.99 to 11.99)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 1.43 (0.73, 2.81)
Inc odds ratio                                 1.48 (0.71, 3.09)
Attrib risk in the exposed *                   3.06 (-2.65, 8.77)
Attrib fraction in the exposed (%)            30.08 (-37.37, 64.42)
Attrib risk in the population *                1.54 (-3.15, 6.24)
Attrib fraction in the population (%)         17.86 (-22.71, 45.02)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 1.094 Pr>chi2 = 0.296
Fisher exact test that OR = 1: Pr>chi2 = 0.356
Wald confidence limits
CI: confidence interval
* Outcomes per 100 population units
```
Full Bayesian analysis
The BF approach has the advantage of not requiring a prior belief about the relative risk of the two interventions. However, this comes at the expense of not being able to calculate the posterior probability distribution for the risk difference or ratio.
As an initial approach, one can assume a non-informative Beta(1,1) prior distribution so that the posterior distribution is completely determined by the observed Notion-2 data.
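Because the Beta prior is conjugate to the binomial likelihood, this particular model can also be checked in closed form before fitting the full model: each arm's posterior is simply Beta(1 + events, 1 + non-events), and the risk difference can be simulated directly. A minimal base R sketch:

```r
set.seed(123)
n_draws <- 1e5

# Conjugate posteriors under a Beta(1,1) prior:
# TAVI:    19 events / 187 patients -> Beta(1 + 19, 1 + 168)
# Surgery: 13 events / 183 patients -> Beta(1 + 13, 1 + 170)
p_tavi <- rbeta(n_draws, 1 + 19, 1 + 168)
p_surg <- rbeta(n_draws, 1 + 13, 1 + 170)

rd <- p_tavi - p_surg  # posterior draws of the risk difference

round(mean(rd), 3)      # posterior mean risk difference, ~0.03
round(mean(rd > 0), 2)  # P(TAVI riskier than surgery), ~0.85
```

These conjugate results should agree closely with the MCMC fit below, providing a useful sanity check on the sampler.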
Code
```r
pacman::p_load(brms, tidyverse, tidybayes, ggdist)

data_bin <- data.frame(N = c(183, 187),
                       y = c(13, 19),
                       grp2 = as.factor(c("Surgery", "TAVR")))

f <- bf(y | trials(N) ~ 0 + grp2)
# get_prior(formula = f, data = data_bin, family = binomial(link = "identity"))

m <- brm(
  formula = f,
  data = data_bin,
  family = binomial(link = "identity"),
  prior = c(prior(beta(1, 1), class = b, lb = 0, ub = 1)),  # gets rid of a bunch of unhelpful warnings
  chains = 4,
  warmup = 1000,
  iter = 2000,
  seed = 123,
  refresh = 0
)
save(m, file = "output/m_brms")
```
This produces the following posterior summaries for the event risk in each arm
```
 Family: binomial
  Links: mu = identity
Formula: y | trials(N) ~ 0 + grp2
   Data: data_bin (Number of observations: 2)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
grp2Surgery     0.08      0.02     0.04     0.12 1.00     3036     2467
grp2TAVR        0.11      0.02     0.07     0.15 1.00     3176     2066

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```
These results can also be shown graphically
Code
```r
library(tidyverse)
library(RColorBrewer)

draws <- brms::as_draws_df(m)
draws <- draws %>%
  # rename and drop the unneeded columns
  transmute(p0 = b_grp2Surgery, p1 = b_grp2TAVR) %>%
  # compute the RR and risk difference
  mutate(rr = p1 / p0, diff = p1 - p0)

# function that approximates the density at the provided values
approxdens <- function(x) {
  dens <- density(x)
  f <- with(dens, approxfun(x, y))
  f(x)
}

probs <- c(0.145, 1)  # sum(draws$diff > 0) / 4000

draws1 <- draws %>%
  mutate(dy = approxdens(diff),                   # calculate density
         p = percent_rank(diff),                  # percentile rank
         pcat = as.factor(cut(p, breaks = probs,  # percentile category based on probs
                              include.lowest = TRUE)))

ggplot(draws1, aes(diff, dy)) +
  geom_ribbon(aes(ymin = 0, ymax = dy, fill = pcat), alpha = .2) +
  geom_line() +
  scale_fill_brewer(guide = "none", palette = "Set2") +
  labs(x = "Risk difference (TAVR - surgery)", y = NULL) +
  theme_classic() +
  labs(title = "Notion-2 trial results with vague non-informative prior",
       subtitle = "Shaded area = probability increased TAVR risk (85.5%)")
```
This Bayesian analysis has provided a quantitative answer to the question of the risk differences between TAVR and surgery and highlights the uncertainty that was missing from the original published conclusion. The BF approach to hypothesis testing provides an improvement over traditional NHST by quantifying the strength of the evidence, which in this case offers only very weak support for the null hypothesis of no difference.
Of course, rather than fixating on the p value, an examination of the 95% confidence interval (0.73, 2.81) shows that while this contains the null effect of 1, it is also compatible with a possible 27% reduction or a 181% increase in risk with TAVR compared to SAVR. As these are meaningful differences, it becomes obvious that a claim of similarity in outcomes between the two procedures is not supported by the data, from both frequentist and Bayesian viewpoints.
References
Altman, D. G., and J. M. Bland. 1995. “Absence of Evidence Is Not Evidence of Absence.” Journal Article. BMJ 311 (7003): 485. https://doi.org/10.1136/bmj.311.7003.485.
Jorgensen, T. H., H. G. H. Thyregod, M. Savontaus, Y. Willemen, O. Bleie, M. Tang, M. Niemela, et al. 2024. “Transcatheter Aortic Valve Implantation in Low-Risk Tricuspid or Bicuspid Aortic Stenosis: The NOTION-2 Trial.” Journal Article. Eur Heart J. https://doi.org/10.1093/eurheartj/ehae331.
Linde, Maximilian, Jorge N. Tendeiro, and Don van Ravenzwaaij. 2022. “Bayes Factors for Two-Group Comparisons in Cox Regression.” Journal Article. medRxiv, 2022.11.02.22281762. https://doi.org/10.1101/2022.11.02.22281762.
Linde, Maximilian, Don van Ravenzwaaij, and Jorge N. Tendeiro. 2022. Baymedr: Computation of Bayes Factors for Common Biomedical Designs. https://github.com/maxlinde/baymedr.
Wasserstein, RL., and NA. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” Journal Article. The American Statistician 70 (2): 129–33.
Wasserstein, RL., AL. Schirm, and NA. Lazar. 2019. “Moving to a World Beyond ‘p < 0.05’.” Journal Article. The American Statistician 73: 1–19.
Citation
BibTeX citation:
@online{brophy2024,
author = {Brophy, Jay},
title = {TAVR Vs. Surgery - {NHST} Gets It Wrong (Again)},
date = {2024-08-05},
url = {https://brophyj.github.io/posts/2024-02-19-my-blog-post/},
langid = {en}
}