Collider risk stratification bias

Recent study

Smoking is a well-established risk factor for endothelial damage and increased thrombogenesis, yet Presch(Presch et al. 2025) reports paradoxically that smoking is associated with a lower risk of repeat revascularization. It is hard to conceive why the endothelium would react differently if it had experienced a previous revascularization procedure. More likely the observed protective role is due to collider risk stratification bias which has successfully explained the obesity, aspirin, and previous smoking paradoxes among many others[Etminan et al. (2021)](Banack and Kaufman 2014)(Choi et al. 2014). This bias arises when study selection (in this case undergoing revascularization) is a collider, affected by both smoking and other (unmeasured) risk factors, creating a spurious association via an induced back door pathway. Study selection here was determined by the performance of revascularization which is not randomly assigned. Smokers are more likely to undergo initial revascularization and therefore non-smokers who receive revascularization tend to have substantially higher levels of other (unmeasured) risk factors, strong predictors for future revascularization. A simulation study assigning a smoking HR = 1.36 is reduced to 0.87 with 95% CI (0.82 to 0.92) when the general population is conditioned on the performance of an initial revascularization (the collider)(Brophy 2025). This reversal is a classic example of collider bias, where conditioning on a collider variable, influenced by both the exposure and an unmeasured confounder distorts the true effect.

Background on collider risk stratification

Consider a simplified data generating mechanism (DGM) for the development of coronary artery disease, reequiring revascularizaion, in the general population. Let’s assume that causal factors are smoking and all other risk factors are combined into one other variable. The causal diagram is therefore as follows

Code

library(ggplot2)
library(ggdag)

revasc_dag <- collider_triangle(
  x = "Smoking",
  y = "Other (unmeasured) risk factors",
  m = "Revascularization"
)

ggdag(revasc_dag, text = FALSE, use_labels = "label")

Here for the general population we have assumed that smoking and the other risk factors are independent (no connecting arrow). But look what happens if we select (stratify or condition) on receiving a revascularization procedure.

Code

img <- ggdag_dseparated(revasc_dag,
                 controlling_for = "m",
                 text = FALSE, use_labels = "label"
)
img

Code

ggsave("images/collider/preview-image.jpg")

This has induced a relationship between smoking and the other risk factors. If the patient is a non-smoker then they must have other risk factors since they required a revascularization procedure. This means that in general that amount the sampe population of those have received an initial revascularization procedure, the non-smokers ill be sicker than the smokers. Alternatively the smokers will look to be protected by their smoking even though the underlying data generating mechaism (with an increased risk due to smoking) has not been altered.
Collider-stratification bias is responsible for many cases of bias, including selection bias, missing data, and publication bias.

Simulated data

To concretely demonstrate this spurious association we will perform a numerical simulation. Assume 10,000 individuals with a 30% smoking prevalence and a normal disribution of the other (unmeasured) risk factors, (U). The probability of receiving a percutaneous coronary intervention (PCI) is modeled using a logistic model where both smoking and U increase the chance of PCI.

Code

# Set a seed for reproducibility
options(digits = 2)
# ---------------------------
# Chunk 1: Generate Full Population Data
# ---------------------------
set.seed(123)

# Population parameters for outcome data generating model (DGM):
n <- 100000               # Total individuals
p_smoking <- 0.3          # Smoking prevalence
beta_smoking <- 0.3       # True effect of smoking (log-odds); OR ≈ exp(0.3) ≈ 1.35
beta_U <- 1               # Effect of U on outcome
intercept_outcome <- -2   # Outcome intercept for repeat PCI

# Simulate predictors:
smoking <- rbinom(n, 1, p_smoking)
U <- rnorm(n)             # Unobserved risk factor

# Generate outcome (repeat PCI) for full population using the same DGM:
logit_outcome <- intercept_outcome + beta_smoking * smoking + beta_U * U
p_outcome <- exp(logit_outcome) / (1 + exp(logit_outcome))
repeat_PCI <- rbinom(n, 1, p_outcome)

# ---------------------------
# Chunk 2: Generate PCI via a Deterministic (Threshold) Rule
# ---------------------------
# Define PCI as follows:
#   For smokers (smoking == 1): PCI = 1 if U > 0.5.
#   For non-smokers (smoking == 0): PCI = 1 if U > 1.5.
PCI <- ifelse(smoking == 1, as.numeric(U > 0.5), as.numeric(U > 1.1))

# Combine into one data frame:
df <- data.frame(smoking, U, PCI, repeat_PCI)

# ---------------------------
# Chunk 3: Full Population Analysis
# ---------------------------
# Fit a logistic regression for repeat PCI on smoking (ignoring U) in the full population:
model_full <- glm(repeat_PCI ~ smoking, data = df, family = binomial)
temp <- summary(model_full)
summary(model_full)


Call:
glm(formula = repeat_PCI ~ smoking, family = binomial, data = df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.6896     0.0104  -162.2   <2e-16 ***
smoking       0.2717     0.0179    15.2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 90396  on 99999  degrees of freedom
Residual deviance: 90171  on 99998  degrees of freedom
AIC: 90175

Number of Fisher Scoring iterations: 4

Code

# exp(coef(model_full))  # Display odds ratios

smoke_est <- exp(temp$coefficients[2,1])
smoke_CI <-exp(temp$coefficients[2,1] + c(-1,1)*1.96*temp$coefficients[2,2])

This confirms the DGM is performing as expected with an increased smoking risk = 1.31 with 95% CI (1.27 to 1.36) Note in this illustrative case, a deterministic model for PCI with non-smokers having higher (non-measured) risk factors has been used. In the real world this world, a probabilistic mnodel would be employed

Now let’s see the effect of smoking on the study population, i.e. only those that underwent an initial revascularization procedure. Here we assume the same DGM for a future repeat PCI as initially applied to the general population i.e. smoking risk = 1.3

Code

# ---------------------------
# Chunk 4: Analysis in PCI-Selected Sample
# ---------------------------
# Restrict the data to individuals with PCI == 1:
data_PCI <- subset(df, PCI == 1)

# Fit the logistic regression model for repeat PCI on smoking in the PCI-selected sample:
model_PCI <- glm(repeat_PCI ~ smoking, data = data_PCI, family = binomial)
temp1 <- summary(model_PCI)
summary(model_PCI)


Call:
glm(formula = repeat_PCI ~ smoking, family = binomial, data = data_PCI)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.3680     0.0206  -17.84  < 2e-16 ***
smoking      -0.1367     0.0298   -4.58  4.6e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 25340  on 18905  degrees of freedom
Residual deviance: 25319  on 18904  degrees of freedom
AIC: 25323

Number of Fisher Scoring iterations: 4

Code

# exp(coef(model_PCI))  # Display odds ratios in the selected sample


smoke_est_sample <- exp(temp1$coefficients[2,1])
smoke_CI_sample <- exp(temp1$coefficients[2,1] + c(-1,1)*1.96*temp1$coefficients[2,2])

Using the exact same data generating mechanism, with increased risk with smoking (RR =1.3) for a repeat revascularization procedure, the smoking parameter in the study sample has been reduced to 0.87 with 95% CI (0.82 to 0.92) due to collider risk stratification bias

This simulation illustrates collider-stratification bias: even though the true effect of smoking in the full population is harmful (OR ≈ 1.35) and remains harmful for a repeat procedure, when you condition on PCI smoking looks protective simply because the non-smokers are otherwise at increased risk due to the spurious induced associated between the other (unmeasured) risk factors and smoking.

Visual interpretation

Code

library(ggplot2)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Code

library(patchwork)  # for side-by-side plots

# Assume df and data_PCI already exist, with columns:
#   U: unobserved risk factor
#   smoking: 0/1
#   PCI: 0/1
#   repeat_PCI: outcome

# Create a "Group" variable to label each dataset
df$Group <- "Full Population"
data_PCI$Group <- "PCI Selected"

# Plot 1: Full population
p1 <- ggplot(filter(df, Group == "Full Population"),
             aes(x = U, fill = factor(smoking))) +
  geom_density(alpha = 0.5) +
  coord_cartesian(xlim = c(-3, 3)) +  # adjust as needed
  labs(
    title = "Distribution of U in the Full Population",
    x = "U (Unobserved Risk Factor)",
    y = "Density",
    fill = "Smoking\n(0=No,1=Yes)"
  ) +
  theme_minimal()

# Plot 2: PCI-selected, x-axis starting at 0
p2 <- ggplot(filter(data_PCI, Group == "PCI Selected"),
             aes(x = U, fill = factor(smoking))) +
  geom_density(alpha = 0.5) +
  coord_cartesian(xlim = c(0, NA)) +  # start x-axis at 0
  labs(
    title = "Distribution of U in PCI-Selected Sample",
    x = "U (Unobserved Risk Factor)",
    y = "Density",
    fill = "Smoking\n(0=No,1=Yes)"
  ) +
  theme_minimal()

# Display plots side by side
p1 + p2

Full Population (left plot) shows how the unobserved risk factor U is distributed equally among both smokers (red) and non-smokers (blue) in the entire dataset. in th estudy sample, PCI-selected sample (right plot) because of the selection mechanism you can more clearly see that non-smokers in the PCI sample have U-values > the values in smokers. Hence the false impression that smoking is protective.

References

Banack, H. R., and J. S. Kaufman. 2014. “The Obesity Paradox: Understanding the Effect of Obesity on Mortality Among Individuals with Cardiovascular Disease.” Journal Article. Prev Med 62: 96–102. https://doi.org/10.1016/j.ypmed.2014.02.003.

Brophy, J. M. 2025. “Simulation Example.” Web Page. https://github.com/brophyj/collider-bias.git.

Choi, H. K., U. S. Nguyen, J. Niu, G. Danaei, and Y. Zhang. 2014. “Selection Bias in Rheumatic Disease Research.” Journal Article. Nat Rev Rheumatol 10 (7): 403–12. https://doi.org/10.1038/nrrheum.2014.36.

Etminan, M., J. M. Brophy, G. Collins, M. Nazemipour, and M. A. Mansournia. 2021. “To Adjust or Not to Adjust: The Role of Different Covariates in Cardiovascular Observational Studies.” Journal Article. Am Heart J 237: 62–67. https://doi.org/10.1016/j.ahj.2021.03.008.

Presch, A., J. J. Coughlan, S. Bar, S. Brugaletta, M. Maeng, S. Kufner, L. Ortega-Paz, et al. 2025. “Smoking Status at Baseline and 10-Year Outcomes After Drug-Eluting Stent Implantation: Insights from the DECADE Cooperation.” Journal Article. JACC Cardiovasc Interv. https://doi.org/10.1016/j.jcin.2024.12.028.

Citation

BibTeX citation:

@online{brophy2025,
  author = {Brophy, Jay},
  title = {Collider Risk Stratification Bias},
  date = {2025-04-04},
  url = {https://brophyj.com/posts/2024-12-30-my-blog-post/},
  langid = {en}
}

For attribution, please cite this work as:

Brophy, Jay. 2025. “Collider Risk Stratification Bias.” April 4, 2025. https://brophyj.com/posts/2024-12-30-my-blog-post/.