Recent study
Smoking is a well-established risk factor for endothelial damage and increased thrombogenesis, yet Presch et al. (2025) paradoxically report that smoking is associated with a lower risk of repeat revascularization. It is hard to conceive why the endothelium would react differently to smoking once it has experienced a previous revascularization procedure. More likely, the apparent protective effect is due to collider risk stratification bias, which has successfully explained the obesity, aspirin, and previous smoking paradoxes, among many others (Etminan et al. 2021; Banack and Kaufman 2014; Choi et al. 2014). This bias arises when study selection (in this case, undergoing revascularization) is a collider, affected by both smoking and other (unmeasured) risk factors, so that conditioning on it creates a spurious association through an induced backdoor pathway. Study selection here was determined by the performance of revascularization, which is not randomly assigned. Smokers are more likely to undergo initial revascularization, and therefore non-smokers who receive revascularization tend to have substantially higher levels of other (unmeasured) risk factors that are strong predictors of future revascularization. In a simulation in which smoking truly increases risk (OR ≈ 1.35), the estimated smoking odds ratio falls to 0.87 with 95% CI (0.82 to 0.92) when the general population is conditioned on the performance of an initial revascularization (the collider) (Brophy 2025). This reversal is a classic example of collider bias, where conditioning on a collider variable, influenced by both the exposure and an unmeasured confounder, distorts the true effect.
Background on collider risk stratification
Consider a simplified data generating mechanism (DGM) for the development of coronary artery disease requiring revascularization in the general population. Let's assume that the causal factors are smoking and all other risk factors, the latter combined into a single variable. The causal diagram is therefore as follows:
Here for the general population we have assumed that smoking and the other risk factors are independent (no connecting arrow). But look what happens if we select (stratify or condition) on receiving a revascularization procedure.
Code
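library(ggdag)
# revasc_dag is assumed to have been built earlier (e.g. with dagify()),
# with the revascularization node named "m"
img <- ggdag_dseparated(revasc_dag,
  controlling_for = "m",
  text = FALSE, use_labels = "label"
)
img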
Code
ggsave("images/collider/preview-image.jpg")
This has induced a relationship between smoking and the other risk factors. If a patient is a non-smoker, then they must have other risk factors, since they required a revascularization procedure. This means that, in general, among the sample of those who have received an initial revascularization procedure, the non-smokers will be sicker than the smokers. Equivalently, the smokers will appear to be protected by their smoking even though the underlying data generating mechanism (with an increased risk due to smoking) has not been altered.
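The induced dependence described above can be checked numerically. The following is a minimal sketch, not the simulation used later in this post: it assumes a hypothetical probabilistic selection rule, and the variable names and coefficients (`-3`, `1.5`, `2`) are illustrative choices only.

```r
# Illustrative sketch: selecting on a collider induces dependence between
# two causes that are independent in the full population.
# All names and coefficients below are hypothetical.
set.seed(2025)
n_sketch <- 50000
smoker <- rbinom(n_sketch, 1, 0.3)   # exposure
other_risk <- rnorm(n_sketch)        # independent of smoking by construction

# Hypothetical selection rule: both causes raise the chance of revascularization
p_revasc <- plogis(-3 + 1.5 * smoker + 2 * other_risk)
revasc <- rbinom(n_sketch, 1, p_revasc)

# Correlation between the two causes, before and after selection
cor_full <- cor(smoker, other_risk)
cor_selected <- cor(smoker[revasc == 1], other_risk[revasc == 1])

# Mean of the other risk factors by smoking status among those selected
mean_by_smoking <- tapply(other_risk[revasc == 1], smoker[revasc == 1], mean)

print(c(full = cor_full, selected = cor_selected))
print(mean_by_smoking)
```

Under these assumptions the correlation should be close to zero in the full population and negative among those selected, with selected non-smokers carrying a higher average burden of other risk factors.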
Collider-stratification bias underlies many familiar forms of bias, including selection bias, bias from missing data, and publication bias.
Simulated data
To demonstrate this spurious association concretely, we will perform a numerical simulation. Assume 100,000 individuals with a 30% smoking prevalence and a normal distribution of the other (unmeasured) risk factors (U). Receipt of an initial percutaneous coronary intervention (PCI) depends on both smoking and U: for simplicity it is assigned by a threshold rule on U, with smokers qualifying at a lower value of U than non-smokers. The outcome, a repeat PCI, is generated from a logistic model in which both smoking and U increase the risk.
Code
# Limit printed output to 2 significant digits
options(digits = 2)
# ---------------------------
# Chunk 1: Generate Full Population Data
# ---------------------------
# Set a seed for reproducibility
set.seed(123)
# Population parameters for outcome data generating model (DGM):
n <- 100000 # Total individuals
p_smoking <- 0.3 # Smoking prevalence
beta_smoking <- 0.3 # True effect of smoking (log-odds); OR ≈ exp(0.3) ≈ 1.35
beta_U <- 1 # Effect of U on outcome
intercept_outcome <- -2 # Outcome intercept for repeat PCI
# Simulate predictors:
smoking <- rbinom(n, 1, p_smoking)
U <- rnorm(n) # Unobserved risk factor
# Generate outcome (repeat PCI) for full population using the same DGM:
logit_outcome <- intercept_outcome + beta_smoking * smoking + beta_U * U
p_outcome <- exp(logit_outcome) / (1 + exp(logit_outcome))
repeat_PCI <- rbinom(n, 1, p_outcome)
# ---------------------------
# Chunk 2: Generate PCI via a Deterministic (Threshold) Rule
# ---------------------------
# Define PCI as follows:
# For smokers (smoking == 1): PCI = 1 if U > 0.5.
# For non-smokers (smoking == 0): PCI = 1 if U > 1.1.
PCI <- ifelse(smoking == 1, as.numeric(U > 0.5), as.numeric(U > 1.1))
# Combine into one data frame:
df <- data.frame(smoking, U, PCI, repeat_PCI)
# ---------------------------
# Chunk 3: Full Population Analysis
# ---------------------------
# Fit a logistic regression for repeat PCI on smoking (ignoring U) in the full population:
model_full <- glm(repeat_PCI ~ smoking, data = df, family = binomial)
temp <- summary(model_full)
summary(model_full)
Call:
glm(formula = repeat_PCI ~ smoking, family = binomial, data = df)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.6896     0.0104  -162.2   <2e-16 ***
smoking       0.2717     0.0179    15.2   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 90396 on 99999 degrees of freedom
Residual deviance: 90171 on 99998 degrees of freedom
AIC: 90175
Number of Fisher Scoring iterations: 4
This confirms that the DGM is performing as expected, with an increased risk from smoking: OR = 1.31 with 95% CI (1.27 to 1.36). The estimate sits slightly below the conditional value of 1.35 because U is omitted from this model. Note that in this illustrative case a deterministic model for PCI has been used, with non-smokers requiring higher levels of the other (unmeasured) risk factors. In the real world, a probabilistic model would be more appropriate.
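The odds ratio and interval quoted above can be recovered from the printed coefficient and standard error. The sketch below assumes a Wald interval on the log-odds scale; `wald_or` is a hypothetical helper name, not a function from any package.

```r
# Convert a log-odds estimate and its standard error into an odds ratio
# with a Wald confidence interval. `wald_or` is a hypothetical helper.
wald_or <- function(estimate, std_error, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  exp(c(OR = estimate,
        lower = estimate - z * std_error,
        upper = estimate + z * std_error))
}

# Smoking coefficient and standard error copied from the full-population output
or_full <- wald_or(0.2717, 0.0179)
print(round(or_full, 2))
```

Rounded to two decimals this gives 1.31 (1.27 to 1.36). The same helper can be applied to any other logistic regression coefficient.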
Now let's see the effect of smoking in the study population, i.e. only those who underwent an initial revascularization procedure. Here we assume the same DGM for a future repeat PCI as was applied to the general population, i.e. a smoking OR ≈ 1.35.
Code
# ---------------------------
# Chunk 4: Analysis in PCI-Selected Sample
# ---------------------------
# Restrict the data to individuals with PCI == 1:
data_PCI <- subset(df, PCI == 1)
# Fit the logistic regression model for repeat PCI on smoking in the PCI-selected sample:
model_PCI <- glm(repeat_PCI ~ smoking, data = data_PCI, family = binomial)
temp1 <- summary(model_PCI)
summary(model_PCI)
Call:
glm(formula = repeat_PCI ~ smoking, family = binomial, data = data_PCI)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.3680     0.0206  -17.84  < 2e-16 ***
smoking      -0.1367     0.0298   -4.58  4.6e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 25340 on 18905 degrees of freedom
Residual deviance: 25319 on 18904 degrees of freedom
AIC: 25323
Number of Fisher Scoring iterations: 4
Using the exact same data generating mechanism, with an increased risk of repeat revascularization from smoking (OR ≈ 1.35), the estimated smoking odds ratio in the study sample is reduced to 0.87 with 95% CI (0.82 to 0.92) due to collider risk stratification bias.
This simulation illustrates collider-stratification bias: even though the true effect of smoking in the full population is harmful (OR ≈ 1.35) and remains harmful for a repeat procedure, when you condition on PCI, smoking looks protective simply because the non-smokers are otherwise at increased risk, owing to the spurious induced association between the other (unmeasured) risk factors and smoking.
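One way to check that the reversal comes from the unmeasured U, rather than from the outcome model itself, is to pretend U had been measured and adjust for it within the PCI-selected sample. The sketch below regenerates data with the same DGM as above (using `plogis()` for the inverse logit) and is only an illustration: in a real study U is unobserved, so this adjustment is not available.

```r
# Sketch: adjusting for U in the PCI-selected sample blocks the path that
# selection opened. Only possible here because U is simulated.
set.seed(123)
n <- 100000
smoking <- rbinom(n, 1, 0.3)
U <- rnorm(n)
repeat_PCI <- rbinom(n, 1, plogis(-2 + 0.3 * smoking + 1 * U))
PCI <- ifelse(smoking == 1, as.numeric(U > 0.5), as.numeric(U > 1.1))
selected <- data.frame(smoking, U, repeat_PCI)[PCI == 1, ]

# Unadjusted model: U ignored, as in the study sample analysis
fit_unadjusted <- glm(repeat_PCI ~ smoking, data = selected, family = binomial)
# Adjusted model: includes U, which a real study could not observe
fit_adjusted <- glm(repeat_PCI ~ smoking + U, data = selected, family = binomial)

log_or <- c(unadjusted = unname(coef(fit_unadjusted)["smoking"]),
            adjusted   = unname(coef(fit_adjusted)["smoking"]))
print(round(exp(log_or), 2))
```

Because selection depends only on smoking and U, the adjusted smoking coefficient should land near the true log-odds of 0.3, while the unadjusted one stays below zero.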
Visual interpretation
Code
library(ggplot2)   # plotting
library(dplyr)     # filter()
library(patchwork) # for side-by-side plots
# Assume df and data_PCI already exist, with columns:
#   U: unobserved risk factor
#   smoking: 0/1
#   PCI: 0/1
#   repeat_PCI: outcome
# Create a "Group" variable to label each dataset
df$Group <- "Full Population"
data_PCI$Group <- "PCI Selected"
# Plot 1: Full population
p1 <- ggplot(filter(df, Group == "Full Population"),
             aes(x = U, fill = factor(smoking))) +
  geom_density(alpha = 0.5) +
  coord_cartesian(xlim = c(-3, 3)) + # adjust as needed
  labs(
    title = "Distribution of U in the Full Population",
    x = "U (Unobserved Risk Factor)",
    y = "Density",
    fill = "Smoking\n(0=No,1=Yes)"
  ) +
  theme_minimal()
# Plot 2: PCI-selected, x-axis starting at 0
p2 <- ggplot(filter(data_PCI, Group == "PCI Selected"),
             aes(x = U, fill = factor(smoking))) +
  geom_density(alpha = 0.5) +
  coord_cartesian(xlim = c(0, NA)) + # start x-axis at 0
  labs(
    title = "Distribution of U in PCI-Selected Sample",
    x = "U (Unobserved Risk Factor)",
    y = "Density",
    fill = "Smoking\n(0=No,1=Yes)"
  ) +
  theme_minimal()
# Display plots side by side
p1 + p2
The full population (left plot) shows that the unobserved risk factor U has the same distribution among non-smokers (red) and smokers (blue-green) in the entire dataset. In the study sample, i.e. the PCI-selected sample (right plot), the selection mechanism means that non-smokers clearly have higher U values than smokers. Hence the false impression that smoking is protective.
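The shift visible in the right-hand panel can also be quantified analytically. Since U is standard normal and the simulation code selects on U > 0.5 for smokers and U > 1.1 for non-smokers, the mean of U in each selected group is the mean of a truncated normal (the inverse Mills ratio). The sketch below is a rough check; `truncated_mean` is a hypothetical helper name.

```r
# Mean of a standard normal variable truncated below at `cutoff`
# (the inverse Mills ratio). `truncated_mean` is a hypothetical helper.
truncated_mean <- function(cutoff) {
  dnorm(cutoff) / pnorm(cutoff, lower.tail = FALSE)
}

# Thresholds taken from the deterministic PCI rule in the simulation code
mean_U_smokers <- truncated_mean(0.5)      # smokers: PCI if U > 0.5
mean_U_nonsmokers <- truncated_mean(1.1)   # non-smokers: PCI if U > 1.1

print(round(c(smokers = mean_U_smokers, non_smokers = mean_U_nonsmokers), 2))
```

This gives a mean U of about 1.14 for selected smokers and 1.61 for selected non-smokers. As a loose heuristic, with a coefficient of 1 on U in the outcome model that gap of roughly 0.47 on the log-odds scale outweighs the 0.3 contributed by smoking, which is consistent with the sign reversal seen in the PCI-selected analysis.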
References
Citation
@online{brophy2025,
author = {Brophy, Jay},
title = {Collider Risk Stratification Bias},
date = {2025-04-04},
url = {https://brophyj.com/posts/2024-12-30-my-blog-post/},
langid = {en}
}