The Notion-2 trial (Jorgensen et al. 2024) was recently fast-tracked for publication in the European Heart Journal. This study randomized low-risk patients (≤75 years of age and median Society of Thoracic Surgeons (STS) risk score of 1.1%) with severe aortic stenosis (AS) to Transcatheter Aortic Valve Implantation (TAVI) or to conventional aortic valve surgery. The study population included both tricuspid and bicuspid AS.
The primary endpoint was a composite of all-cause mortality, stroke, or rehospitalization (related to the procedure, valve, or heart failure) at 12 months.
A total of 370 patients were enrolled, and the 1-year incidence of the primary endpoint was 10.2% in the TAVI group and 7.1% in the surgery group [absolute risk difference 3.1%; 95% confidence interval (CI), −2.7% to 8.8%; hazard ratio (HR) 1.4; 95% CI, 0.7–2.9; P = .3]. The authors concluded:
> “Among low-risk patients aged ≤75 years with severe symptomatic AS, the rate of the composite of death, stroke, or rehospitalization at 1 year was similar between TAVI and surgery.”
How was the study designed?
The study was designed assuming a sample of 372 patients would provide the trial 90% power to show the non-inferiority of TAVI to surgery with regard to the primary endpoint at 1 year, assuming a Kaplan–Meier estimate of the primary endpoint of 10% in the TAVI group and 15% in the surgery group. The authors stated “To test for non-inferiority, we determined whether the upper boundary of the 95% confidence interval (CI) for the difference in the rate of the primary endpoint between the TAVI and surgery group was less than the pre-specified non-inferiority margin of 5% points.”
In other words, the null hypothesis (\(H_0\)) is \(\theta_{TAVI} - \theta_{surgery} > 5\%\), where \(\theta\) is the proportion of outcomes in the respective treatment arms. One hopes to reject this null hypothesis and accept the alternative hypothesis (\(H_A\)) that the difference is < 5%, and thereby claim non-inferiority. Although the authors did claim non-inferiority, their data do not support this conclusion: the null hypothesis cannot be rejected, because the upper limit of the 95% confidence interval for the difference in outcomes between TAVI and surgery is 8.8%, exceeding the prespecified non-inferiority margin of 5 percentage points.
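This is easy to verify from the published event counts (19/187 with TAVI, 13/183 with surgery). The following base R sketch reconstructs the Wald risk difference and its 95% CI and compares the upper bound with the 5 percentage-point margin:

```r
# Reconstruct the non-inferiority test from the published event counts
# (19/187 events with TAVI, 13/183 with surgery)
p_tavi <- 19 / 187
p_surg <- 13 / 183

rd <- p_tavi - p_surg  # absolute risk difference

# Wald standard error and 95% CI for the risk difference
se <- sqrt(p_tavi * (1 - p_tavi) / 187 + p_surg * (1 - p_surg) / 183)
ci <- rd + c(-1, 1) * qnorm(0.975) * se

round(100 * c(rd = rd, lower = ci[1], upper = ci[2]), 1)
#    rd lower upper
#   3.1  -2.7   8.8

# The upper bound (8.8%) exceeds the 5% margin, so H0 cannot be rejected
ci[2] < 0.05
# [1] FALSE
```

These values match the published risk difference of 3.1% (95% CI −2.7% to 8.8%), confirming that the prespecified non-inferiority criterion was not met.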
How then did the authors reach their conclusion?
The authors apparently ignored their non-inferiority design and analysed their study with conventional null hypothesis significance testing (NHST) and a null hypothesis \(\theta_{TAVI} - \theta_{surgery} = 0\), which they were unable to reject since the p value (0.3) exceeded the conventional \(\alpha\) level (0.05).
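This conventional analysis can be reproduced from the 2×2 counts with an uncorrected chi-squared test in base R:

```r
# Primary-endpoint events by arm: TAVI 19/187, surgery 13/183
tab <- matrix(c(19, 187 - 19,
                13, 183 - 13),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("TAVI", "Surgery"), c("Event", "No event")))

# Uncorrected chi-squared test of H0: no difference between arms
chisq.test(tab, correct = FALSE)
# X-squared ≈ 1.09, df = 1, p-value ≈ 0.30
```

The non-significant p value of about 0.3 matches the published result, but, as discussed next, it says nothing about how much evidence there is *for* the null hypothesis.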
The limitations of NHST and its tendency to cause cognitive errors have been described countless times in the medical literature (Wasserstein and Lazar 2016; Wasserstein, Schirm, and Lazar 2019). This particular cognitive error has even been described with the pithy aphorism "absence of evidence is not evidence of absence" (Altman and Bland 1995).
What is the strength of this conclusion?
Within the NHST paradigm, it is impossible to quantify the evidence in favor of the null hypothesis. A non-significant finding can occur due to low power or a truly absent effect, and the reporting of a p value simply can't disentangle these two possibilities. An alternative to NHST is null hypothesis Bayesian testing (NHBT), which allows the strength of evidence for (or against) \(H_0\) and \(H_A\) to be directly compared. This is most commonly achieved with Bayes factors, which quantify the relative probability of the data under \(H_0\) and \(H_A\).
Bayes factors (BFs)
The relationship between BFs and the relative support for (or against) \(H_0\) and \(H_A\) is reflected in the following graphic.
Code
```r
# Load necessary libraries
library(ggplot2)
library(grid)

g <- rasterGrob(c("lightgreen", "yellow", "orange", "red"),
                width = unit(1, "npc"), height = unit(1, "npc"),
                interpolate = TRUE)

# Create a continuous range for the y-axis
y_values <- seq(0.01, 1000, length.out = 10000)

# Create data frame for plotting
data <- data.frame(BayesFactor = y_values)

# Define a continuous color gradient from a deeper yellow to green to blue
color_gradient <- c("#FFD700", "#00B300", "#0000FF")

# Plot the data
ggplot(data, aes(y = BayesFactor, fill = BayesFactor)) +
  geom_tile(aes(x = 0.5), width = 0.2) +  # Greatly reduce the x-direction plot area
  scale_y_continuous(trans = "log10",
                     breaks = c(0.01, 1/3, 1, 3, 10, 30, 100, 1000),
                     labels = c("0", "1/3", "1", "3", "10", "30", "100", "∞"),
                     expand = c(0, 0)) +
  # scale_fill_gradientn(colors = color_gradient, name = NULL) +
  geom_segment(aes(x = 0.4, xend = 0.6, y = 1/3, yend = 1/3), linetype = "dashed", color = "black") +
  geom_segment(aes(x = 0.4, xend = 0.6, y = 3, yend = 3), linetype = "dashed", color = "black") +
  geom_segment(aes(x = 0.4, xend = 0.6, y = 10, yend = 10), linetype = "dashed", color = "black") +
  geom_segment(aes(x = 0.4, xend = 0.6, y = 30, yend = 30), linetype = "dashed", color = "black") +
  annotate("text", x = 0.5, y = sqrt(0.01 * 1/3), label = "Evidence against\ntreatment effect", size = 4.5, hjust = 0.5, vjust = 0.5) +
  annotate("text", x = 0.5, y = 2, label = "Not enough data to\nknow if the drug works", size = 4.5, hjust = 0.5, vjust = 0.5) +
  annotate("text", x = 0.5, y = 20, label = "Evidence for\ntreatment effect", size = 4.5, hjust = 0.5, vjust = 0.5) +
  annotate("text", x = 0.65, y = sqrt(30 * 500), label = "{very strong pro-alternative}", hjust = 0, size = 4.5) +
  annotate("text", x = 0.65, y = sqrt(10 * 30), label = "{strong pro-alternative}", hjust = 0, size = 4.5) +
  annotate("text", x = 0.65, y = sqrt(3 * 10), label = "{moderate pro-alternative}", hjust = 0, size = 4.5) +
  annotate("text", x = 0.65, y = sqrt(1/3 * 3), label = "{ambiguous}", hjust = 0, size = 4.5) +
  annotate("text", x = 0.65, y = sqrt(0.01 * 1/3), label = "{pro-null}", hjust = 0, size = 4.5) +
  coord_cartesian(xlim = c(0.4, 1), ylim = c(0.01, 1000), clip = "off") +  # Limit x-axis and ensure no clipping of text
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.text.x = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        legend.position = "none",
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +  # Reduced overall plot area
  labs(y = "Bayes Factor") +
  annotation_custom(g, xmin = 0.4, xmax = 0.6, ymin = -2, ymax = Inf) +
  ggtitle("Bayes factors and evidential strength")

ggsave("output/BF.pdf", dpi = 600, device = "pdf")
ggsave("output/BF.png", dpi = 600, device = "png")
```
What is the Bayes Factor for the Notion-2 survival data?
```r
library(baymedr)  # provides coxph_data_sim() and coxph_bf() (Linde, van Ravenzwaaij, and Tendeiro 2022)

# simulate Notion-2 dataset
sim_data <- coxph_data_sim(           # See ?coxph_data_sim for details
  n_data = 100,                       # Number of data sets to be simulated
  ns_c = 183,                         # Sample size (control condition)
  ns_e = 187,                         # Sample size (experimental condition)
  ne_c = 13,                          # Number of events (control condition)
  ne_e = 19,                          # Number of events (experimental condition)
  cox_hr = c(1.4, 0.7, 2.9),          # HR, lower bound CI, upper bound CI
  cox_hr_ci_level = 0.95,             # Confidence level CI
  maxit = 300,                        # Max number of PSO iterations (for psoptim())
  maxit.stagnate = ceiling(300 / 5),  # Max number of PSO iterations without
                                      # reduction in loss (for psoptim())
  cores = 5                           # Number of cores to be used
)
save(sim_data, file = "output/sim_data.RData")
```
Code
load("~/Desktop/current/notion2/output/sim_data.RData")sim_bf<-coxph_bf(# See ?coxph_bf for details data =sim_data, # Object containing the data null_value =0, # H0 value alternative ="two.sided", # H1 type (one- or two-sided) direction =NULL, # H1 direction (low or high) prior_mean =0, # Beta prior mean prior_sd =1# Beta prior SD)sim_bf
The median BF of 0.511 falls in the "ambiguous" range, revealing that there is little evidence to support the null hypothesis of no difference between the two treatments.
This result can be checked (approximately) by ignoring any time dependency and simply considering the outcome data in the form of a 2×2 contingency table. Here is the Notion-2 data in tabular form for the primary outcome.
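One way to sketch such a check in base R is with a Savage–Dickey density ratio on the log odds ratio, using a normal(0, 1) prior to mirror the `prior_mean = 0`, `prior_sd = 1` setting of the Cox-model BF above. The normal approximation to the log OR likelihood is an assumption of this sketch, not the exact method used for the folded calculation:

```r
# 2x2 primary-outcome table: TAVI 19 events / 168 non-events,
#                            surgery 13 events / 170 non-events
n11 <- 19; n12 <- 168
n21 <- 13; n22 <- 170

# Normal approximation to the log odds ratio and its standard error
log_or <- log((n11 * n22) / (n12 * n21))
se     <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)

# Prior on the log OR: normal(0, 1), mirroring the Cox-model analysis
prior_sd <- 1

# Conjugate normal-normal update for the posterior of the log OR
post_prec <- 1 / se^2 + 1 / prior_sd^2
post_mean <- (log_or / se^2) / post_prec
post_sd   <- sqrt(1 / post_prec)

# Savage-Dickey: BF01 = posterior density at 0 / prior density at 0
bf01 <- dnorm(0, post_mean, post_sd) / dnorm(0, 0, prior_sd)
bf10 <- 1 / bf01
round(bf10, 2)
# [1] 0.57
```

This approximate BF10 of about 0.57 sits in the same "ambiguous" range as the median BF of 0.511 from the Cox-model simulation.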
The result is, as expected, consistent with the earlier one, again underlining the lack of any strong evidence to support the null hypothesis.
Of course, a close and proper interpretation of a standard statistical analysis provides the same inferences, as shown below by the binomial risk ratios and their 95% confidence intervals, as published in the original manuscript.
```
             Outcome +   Outcome -   Total   Inc risk *
Exposed +           19         168     187   10.16 (6.23 to 15.41)
Exposed -           13         170     183    7.10 (3.84 to 11.84)
Total               32         338     370    8.65 (5.99 to 11.99)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 1.43 (0.73, 2.81)
Inc odds ratio                                 1.48 (0.71, 3.09)
Attrib risk in the exposed *                   3.06 (-2.65, 8.77)
Attrib fraction in the exposed (%)            30.08 (-37.37, 64.42)
Attrib risk in the population *                1.54 (-3.15, 6.24)
Attrib fraction in the population (%)         17.86 (-22.71, 45.02)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 1.094 Pr>chi2 = 0.296
Fisher exact test that OR = 1: Pr>chi2 = 0.356
Wald confidence limits
CI: confidence interval
* Outcomes per 100 population units
```
Full Bayesian analysis
The BF approach has the advantage of not requiring a prior belief about the relative risk of the two interventions. However, this comes at the expense of not being able to calculate the posterior probability distribution for the risk difference or ratio.
As an initial approach, one can assume a non-informative Beta(1,1) prior distribution so that the posterior distribution is completely determined by the observed Notion-2 data.
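Because the Beta prior is conjugate to the binomial likelihood, this particular model can also be checked in closed form before fitting the full model: each arm's posterior is simply Beta(1 + events, 1 + non-events), and the risk difference can be simulated directly. A minimal base R sketch:

```r
set.seed(123)
n_draws <- 1e5

# Conjugate posteriors under a Beta(1,1) prior:
# TAVI:    19 events / 187 patients -> Beta(1 + 19, 1 + 168)
# Surgery: 13 events / 183 patients -> Beta(1 + 13, 1 + 170)
p_tavi <- rbeta(n_draws, 1 + 19, 1 + 168)
p_surg <- rbeta(n_draws, 1 + 13, 1 + 170)

rd <- p_tavi - p_surg  # posterior draws of the risk difference

round(mean(rd), 3)      # posterior mean risk difference, ~0.03
round(mean(rd > 0), 2)  # P(TAVI riskier than surgery), ~0.85
```

These conjugate results should agree closely with the MCMC fit below, providing a useful sanity check on the sampler.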
Code
```r
pacman::p_load(brms, tidyverse, tidybayes, ggdist)

data_bin <- data.frame(N = c(183, 187),
                       y = c(13, 19),
                       grp2 = as.factor(c("Surgery", "TAVR")))

f <- bf(y | trials(N) ~ 0 + grp2)
# get_prior(formula = f, data = data_bin, family = binomial(link = "identity"))

m <- brm(
  formula = f,
  data = data_bin,
  family = binomial(link = "identity"),
  prior = c(prior(beta(1, 1), class = b, lb = 0, ub = 1)),  # gets rid of a bunch of unhelpful warnings
  chains = 4,
  warmup = 1000,
  iter = 2000,
  seed = 123,
  refresh = 0
)
save(m, file = "output/m_brms")
```
This produces the following posterior summaries for the event risk in each arm
```
 Family: binomial
  Links: mu = identity
Formula: y | trials(N) ~ 0 + grp2
   Data: data_bin (Number of observations: 2)
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
grp2Surgery     0.08      0.02     0.04     0.12 1.00     3036     2467
grp2TAVR        0.11      0.02     0.07     0.15 1.00     3176     2066

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```
These results can also be shown graphically
Code
```r
library(tidyverse)
library(RColorBrewer)

draws <- brms::as_draws_df(m)
draws <- draws %>%
  # rename and drop the unneeded columns
  transmute(p0 = b_grp2Surgery, p1 = b_grp2TAVR) %>%
  # compute the RR and risk difference
  mutate(rr = p1 / p0, diff = p1 - p0)

# function that approximates the density at the provided values
approxdens <- function(x) {
  dens <- density(x)
  f <- with(dens, approxfun(x, y))
  f(x)
}

probs <- c(0.145, 1)  # sum(draws$diff > 0) / 4000

draws1 <- draws %>%
  mutate(dy = approxdens(diff),                   # calculate density
         p = percent_rank(diff),                  # percentile rank
         pcat = as.factor(cut(p, breaks = probs,  # percentile category based on probs
                              include.lowest = TRUE)))

ggplot(draws1, aes(diff, dy)) +
  geom_ribbon(aes(ymin = 0, ymax = dy, fill = pcat), alpha = .2) +
  geom_line() +
  scale_fill_brewer(guide = "none", palette = "Set2") +
  labs(x = "Risk difference (TAVR - surgery)", y = NULL) +
  theme_classic() +
  labs(title = "Notion-2 trial results with vague non-informative prior",
       subtitle = "Shaded area = probability increased TAVR risk (85.5%)")
```
This Bayesian analysis has provided a quantitative answer to the question of the risk differences between TAVR and surgery and highlights the uncertainty that was missing from the original published conclusion. The BF approach to hypothesis testing provides an improvement over traditional NHST by quantifying the strength of the evidence, which in this case offers only very weak support for the null hypothesis of no difference.
Of course, rather than fixating on the p value, an examination of the 95% confidence interval (0.73, 2.81) shows that while this contains the null effect of 1, it is also compatible with a possible 27% reduction or a 181% increase in risk with TAVR compared to SAVR. As these are meaningful differences, it becomes obvious that a claim of similarity in outcomes between the two procedures is not supported by the data, from both frequentist and Bayesian viewpoints.
References
Altman, D. G., and J. M. Bland. 1995. “Absence of Evidence Is Not Evidence of Absence.” Journal Article. BMJ 311 (7003): 485. https://doi.org/10.1136/bmj.311.7003.485.
Jorgensen, T. H., H. G. H. Thyregod, M. Savontaus, Y. Willemen, O. Bleie, M. Tang, M. Niemela, et al. 2024. “Transcatheter Aortic Valve Implantation in Low-Risk Tricuspid or Bicuspid Aortic Stenosis: The NOTION-2 Trial.” Journal Article. Eur Heart J. https://doi.org/10.1093/eurheartj/ehae331.
Linde, Maximilian, Jorge N. Tendeiro, and Don van Ravenzwaaij. 2022. “Bayes Factors for Two-Group Comparisons in Cox Regression.” Journal Article. medRxiv, 2022.11.02.22281762. https://doi.org/10.1101/2022.11.02.22281762.
Linde, Maximilian, Don van Ravenzwaaij, and Jorge N. Tendeiro. 2022. Baymedr: Computation of Bayes Factors for Common Biomedical Designs. https://github.com/maxlinde/baymedr.
Wasserstein, RL., and NA. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” Journal Article. The American Statistician 70 (2): 129–33.
Wasserstein, RL., AL. Schirm, and NA. Lazar. 2019. “Moving to a World Beyond ‘p < 0.05’.” Journal Article. The American Statistician 73: 1–19.
Citation
BibTeX citation:
@online{brophy2024,
author = {Brophy, Jay},
title = {TAVR Vs. Surgery - {NHST} Gets It Wrong (Again)},
date = {2024-08-05},
url = {https://brophyj.github.io/posts/2024-02-19-my-blog-post/},
langid = {en}
}