Mediation analysis with nuisance variables.

Mediation Analysis with Nuisance Variables

1 Introduction

Mediation analysis is a simple idea that is easy to perform but can be a minefield to interpret correctly. The principle is straightforward: one explanatory variable may correlate with the dependent variable by influencing another explanatory variable.

For example, suppose you find that bullet wounds correlate with death, and that blood loss also correlates with death. You could use a mediation analysis to test whether bullet wounds lead to death by increasing blood loss. You would find that they do (sorry about the gruesome example).

One absolute no-no in mediation analysis is when the direction of causation could be the other way around – that is, if the dependent variable causes the explanatory variables. Mediation analysis breaks down completely in such cases, so don’t do it. (Cough like this paper did – https://pubmed.ncbi.nlm.nih.gov/29567761/ – cough.)

For example, if you modelled bullet wounds as the dependent variable and tried to explain them using death and blood loss as predictors (with blood loss as a mediator), you would find that death causes bullet wounds directly and indirectly through blood loss. Obviously, that doesn’t make sense. This is different from correlation analysis, which does not care about the direction of causality.

In this example, we are testing whether the age of a patient correlates with inflammation through the variable bacterial count. The idea is that being older might not only directly cause more inflammation, but also make someone more susceptible to bacterial growth, which could lead to more bacteria in the blood and, in turn, more inflammation.

To complicate things further (as data often does), what if you wanted to adjust for some nuisance variables first, before looking at your variables of interest? Okay – here we go.

2 Load Required Packages

# Function for CRAN packages
load_cran_pkg <- function(pkgs) {
  invisible(lapply(pkgs, function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    library(pkg, character.only = TRUE)
  }))
}

# Function for GitHub packages
load_github_pkg <- function(pkgs) {
  if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
  
  invisible(lapply(pkgs, function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      remotes::install_github(pkg)
    }
    # Extract package name from "user/repo"
    pkg_name <- sub(".*/", "", pkg)
    library(pkg_name, character.only = TRUE)
  }))
}

# ---- Usage ----
cran_packages <- c("ggplot2", "viridisLite", "sjPlot", "MASS", "mediation", "gt", "knitr","grid")
github_packages <- c("sinhrks/ggfortify")

load_cran_pkg(cran_packages)
load_github_pkg(github_packages)

3 Citations

References
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York
Garnier, Simon, Ross, Noam, Rudis, Robert, Camargo, Pedro A, Sciaini, Marco, Scherer, Cédric (2023). viridis(Lite) - Colorblind-Friendly Color Maps for R., viridisLite package version 0.4.2
Lüdecke D (2024). sjPlot: Data Visualization for Statistics in Social Science, R package version 2.8.17
Venables WN, Ripley BD (2002). Modern Applied Statistics with S, Fourth edition. Springer, New York
Tingley D, Yamamoto T, Hirose K, Keele L, Imai K (2014). mediation: R Package for Causal Mediation Analysis. Journal of Statistical Software, 59(5), 1-38
Imai K, Keele L, Yamamoto T (2010). Identification, Inference, and Sensitivity Analysis for Causal Mediation Effects. Statistical Science, 25(1), 51-71
Imai K, Keele L, Tingley D (2010). A General Approach to Causal Mediation Analysis. Psychological Methods, 15(4), 309-334
Imai K, Keele L, Tingley D, Yamamoto T (2011). Unpacking the Black Box of Causality: Learning about Causal Mechanisms from Experimental and Observational Studies. American Political Science Review, 105(4), 765-789
Imai K, Yamamoto T (2013). Identification and Sensitivity Analysis for Multiple Causal Mechanisms: Revisiting Evidence from Framing Experiments. Political Analysis, 21(2), 141-171
Imai K, Keele L, Tingley D, Yamamoto T (2010). Causal Mediation Analysis Using R. In Vinod HD, Advances in Social Science Research Using R. Springer-Verlag, New York
Iannone R, Cheng J, Schloerke B, Hughes E, Lauer A, Seo J, Brevoort K, Roy O (2025). gt: Easily Create Presentation-Ready Display Tables, R package version 1.0.0
Xie Y (2025). knitr: A General-Purpose Package for Dynamic Report Generation in R, R package version 1.50
Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida
Xie Y (2014). knitr: A Comprehensive Tool for Reproducible Research in R. In Stodden V, Leisch F, Peng RD, Implementing Reproducible Computational Research. Chapman and Hall/CRC
R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria
Tang Y, Horikoshi M, Li W (2016). ggfortify: Unified Interface to Visualize Statistical Result of Popular R Packages. The R Journal, 8(2), 474-485
Horikoshi M, Tang Y (2018). ggfortify: Data Visualization Tools for Statistical Analysis Results

4 Load and Prepare Data

First load the data and packages. Always check the data to ensure it makes sense – for example, are the numerical variables actually stored as numeric?

# Example data
dat<- read.table(url("https://jackauty.com/wp-content/uploads/2020/05/Stepwise-regression-data.txt"),header=T)

#Example data check
tab_df(head(dat))
Inflammation Age Sex Blood.pressure Infection Bacterial.count
4.40 7.16 M 151 No 75.00
1218.07 8.88 F 145 Yes 5718.30
5.32 3.23 F 104 No 27.00
3072.58 7.91 F 156 Yes 4439.05
108.86 7.98 M 143 No 224.00
563.63 2.93 F 159 No 480.00

5 Distribution of the dependent and explanatory variables

Next, check if the dependent variable is normally distributed. It does not have to be perfectly normal, but analyses will generally run more smoothly if it is.

In this example, the distribution of the dependent variable was improved by a square-root transformation, so a new transformed variable was created.

Age looks fine without transformation. Bacterial count, however, benefits from a stronger shrinking transformation. In this case, a log transformation was applied.

hist(dat$Inflammation, breaks=6, cex.main = 0.6)

hist(sqrt(dat$Inflammation), breaks=6, cex.main = 0.6)

#Square root looks better, so let's create a variable to use in our model
dat$Inflammation.transformed<-sqrt(dat$Inflammation)

# Age looks good!
hist(dat$Age, cex.main = 0.6)

# Bacterial count needs a strong shrinking transformation
hist(dat$Bacterial.count, cex.main = 0.6)

# Transforming the Bacterial Count Data works
dat$Bact_count_trans <- log10(dat$Bacterial.count+1)

hist(dat$Bact_count_trans, breaks = 2, cex.main = 0.6)

6 Adjust the data for nuisance variables

Next, we build a model including the nuisance variables (Sex and Blood Pressure) and extract the standardised residuals into a new column in the data frame. Standardised residuals represent the variation left over after accounting for the nuisance variables. This residual variation can then be explained by the variables of interest. Always check model assumptions and transform variables if necessary.

#Set up adjusted model
adjusted_model<-lm(Inflammation.transformed ~ Sex*Blood.pressure, data = dat)
summary(adjusted_model)
## 
## Call:
## lm(formula = Inflammation.transformed ~ Sex * Blood.pressure, 
##     data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -31.1436  -8.2012  -0.9615   9.9437  29.0713 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           3.9407    15.8002   0.249   0.8036  
## SexM                 38.1009    22.2552   1.712   0.0901 .
## Blood.pressure        0.2117     0.1178   1.797   0.0755 .
## SexM:Blood.pressure  -0.3093     0.1685  -1.836   0.0695 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.52 on 96 degrees of freedom
## Multiple R-squared:  0.04677,    Adjusted R-squared:  0.01698 
## F-statistic:  1.57 on 3 and 96 DF,  p-value: 0.2017
 #Diagnostic plots 
autoplot(adjusted_model,which = c(1,2,4))

#Extract the standardized residuals and turn them into a column in your data
dat$residuals<-stdres(adjusted_model)

7 Standardise explanatory variables

In mediation analysis, you will often report that one variable has a direct effect on the dependent variable, and that some proportion of the effect is mediated through another variable. For this proportion to be meaningful, the explanatory variables must be on the same scale.

We standardise them into z-scores (units in standard deviations from the mean) so they can be compared directly.

dat$Age_z <- as.numeric(scale(dat$Age))
dat$Bact_count_Z <- as.numeric(scale(dat$Bact_count_trans))

8 Build base models and check assumptions

We now build our base models. First, we model the primary explanatory variable (Age) against the mediator (Bacterial count) as the dependent variable. Next, we model the primary explanatory variable and the mediator together as predictors of the residuals (inflammation) from the nuisance-adjusted model.

Because of the earlier transformations and scaling, both models meet the assumptions well.

#Build base models 
fit_Mediator_model<-lm(Bact_count_Z ~ Age_z, data=dat)

fit_full_model<-lm(residuals ~ Age_z + Bact_count_Z, data=dat)
autoplot(fit_Mediator_model,which = c(1,2,4))

autoplot(fit_full_model,which = c(1,2,4))

9 Mediation analysis

Now we run the mediation analysis. In this example, we find a significant effect of Age on inflammation, and an indirect effect of Age on inflammation through its effect on bacterial count.

We repeat the analysis using a bootstrap method, which is more robust and less sensitive to non-normality in the residuals, and obtain the same result.

The results suggest that more than 50% of the effect of Age on inflammation is statistically mediated by bacterial count. This implies that being older increases the risk of bacterial infection, and that this accounts for a substantial proportion of the higher inflammation observed in older individuals. (Note: this dataset is simulated.)

#Run mediation analysis 

mediation<-mediate(fit_Mediator_model, fit_full_model, treat="Age_z", mediator = "Bact_count_Z",  boot = TRUE, sims = 1000)
summary_med <- summary(mediation)

# Build data frame
df_med <- data.frame(
  Effect = c("ACME (indirect effect)",
             "ADE (direct effect)",
             "Total Effect",
             "Proportion Mediated"),
  Estimate = c(summary_med$d0, summary_med$z0, summary_med$tau.coef, summary_med$n0),
  CI_lower = c(summary_med$d0.ci[1], summary_med$z0.ci[1],
               summary_med$tau.ci[1], summary_med$n0.ci[1]),
  CI_upper = c(summary_med$d0.ci[2], summary_med$z0.ci[2],
               summary_med$tau.ci[2], summary_med$n0.ci[2]),
  p_value  = c(summary_med$d0.p, summary_med$z0.p,
               summary_med$tau.p, summary_med$n0.p)
)

# Format p-values
df_med$p_value <- ifelse(df_med$p_value < 0.0001,
                         "<0.0001",
                         sprintf("%.4f", df_med$p_value))

# Combine CIs into one column
df_med$CI <- paste0("[", round(df_med$CI_lower, 3), ", ", round(df_med$CI_upper, 3), "]")

# Keep only desired columns
df_med <- df_med[, c("Effect", "Estimate", "CI", "p_value")]

# Show table
sjPlot::tab_df(df_med,
               title = "Mediation Analysis Results",
               col.header = c("Effect", "Estimate", "95% CI", "p-value"))
Mediation Analysis Results
Effect Estimate 95% CI p-value
ACME (indirect effect) 0.15 [0.019, 0.285] 0.0280
ADE (direct effect) 0.20 [0.081, 0.305] <0.0001
Total Effect 0.35 [0.165, 0.53] <0.0001
Proportion Mediated 0.43 [0.09, 0.699] 0.0280

9.1 Plotting mediation analysis

Visualising mediation results can be challenging. Two useful approaches are:

Forest plots, which provide a quick overview of effect sizes and confidence intervals, making it easy to identify which effects are statistically significant.

Path diagrams, which illustrate the relationships between variables, showing both the size and significance of direct and indirect effects within the mediation model.

9.1.1 Forest plot

sm <- summary(mediation)

df_plot <- data.frame(
  Effect   = c("ACME (indirect)", "ADE (direct)", "Total Effect", "Prop. Mediated"),
  Estimate = c(sm$d0, sm$z0, sm$tau.coef, sm$n0),
  CI_lower = c(sm$d0.ci[1], sm$z0.ci[1], sm$tau.ci[1], sm$n0.ci[1]),
  CI_upper = c(sm$d0.ci[2], sm$z0.ci[2], sm$tau.ci[2], sm$n0.ci[2])
)

ggplot(df_plot, aes(x = Effect, y = Estimate)) +
  geom_hline(yintercept = 0, linetype = "dotted") +     # 0 reference line
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.1) +
  coord_flip() +
  scale_y_continuous(limits = c(-0.1, NA)) +           # force min = -0.1
  theme_classic() +
  labs(y = "Estimate", x = "", title = "Mediation Analysis Effects")

9.1.2 Path diagram

p_stars <- function(p) ifelse(p < 0.001, "***",
                       ifelse(p < 0.01, "**",
                       ifelse(p < 0.05, "*", "")))

plot_mediation_triangle <- function(
  med_fit, out_fit,
  treat, mediator,
  x_label = treat, m_label = mediator, y_label = "Outcome",

  # colours & style
  box_fills = c(X = "#E6F0FA", M = "#FDE6E6", Y = "#F1E6FA"),
  edge_col  = "#222222",
  text_col  = "#222222",
  line_width = 1,
  arrow_head_size = 0.25,    # size of arrowhead only
  line_inset = 1.8,          # pull back arrow line from box centre

  # nudges
  a_nudge = c(-1.6, 0.8),
  b_nudge = c( 1.8, 0.8),
  c_nudge = c( 0.0,-0.7),

  digits_b = 3, digits_t = 2
) {
  sm_m <- summary(med_fit)$coefficients
  sm_y <- summary(out_fit)$coefficients

  a_est <- sm_m[treat, "Estimate"]; a_t <- sm_m[treat, "t value"]; a_df <- df.residual(med_fit); a_p <- 2*pt(-abs(a_t), a_df)
  b_est <- sm_y[mediator, "Estimate"]; b_t <- sm_y[mediator, "t value"]; b_df <- df.residual(out_fit); b_p <- 2*pt(-abs(b_t), b_df)
  c_est <- sm_y[treat,  "Estimate"]; c_t <- sm_y[treat,  "t value"]; c_df <- df.residual(out_fit); c_p <- 2*pt(-abs(c_t), c_df)

  a_lab <- sprintf("(a)\nb = %.*f, t(%d) = %.*f%s", 
                   digits_b, a_est, a_df, digits_t, a_t, p_stars(a_p))
  b_lab <- sprintf("(b)\nb = %.*f, t(%d) = %.*f%s", 
                   digits_b, b_est, b_df, digits_t, b_t, p_stars(b_p))
  c_lab <- sprintf("(c')\nb = %.*f, t(%d) = %.*f%s", 
                   digits_b, c_est, c_df, digits_t, c_t, p_stars(c_p))

  nodes <- data.frame(
    name = c(x_label, m_label, y_label),
    id   = c("X","M","Y"),
    x    = c(0, 5, 10),
    y    = c(0, 4, 0)
  )

  # shorten segments more so arrowheads are clear of boxes
  a_x  <- c(nodes$x[1] + line_inset*0.35, nodes$x[2] - line_inset*0.35)
  a_y  <- c(nodes$y[1] + line_inset*0.28, nodes$y[2] - line_inset*0.28)
  b_x  <- c(nodes$x[2] + line_inset*0.35, nodes$x[3] - line_inset*0.35)
  b_y  <- c(nodes$y[2] - line_inset*0.28, nodes$y[3] + line_inset*0.28)
  c_x  <- c(nodes$x[1] + line_inset, nodes$x[3] - line_inset)
  c_y  <- c(0, 0)

  # label midpoints + nudges
  mid_a <- c(mean(a_x) + a_nudge[1], mean(a_y) + a_nudge[2])
  mid_b <- c(mean(b_x) + b_nudge[1], mean(b_y) + b_nudge[2])
  mid_c <- c(mean(c_x) + c_nudge[1], mean(c_y) + c_nudge[2])

  arrow_style <- arrow(type = "closed", 
                       length = unit(arrow_head_size, "cm"))

  ggplot() +
    # a path
    geom_segment(aes(x = a_x[1], y = a_y[1], 
                     xend = a_x[2], yend = a_y[2]),
                 arrow = arrow_style, 
                 linewidth = line_width, colour = edge_col) +
    annotate("text", x = mid_a[1], y = mid_a[2], label = a_lab,
             size = 3.8, hjust = 0.5, colour = text_col) +

    # b path
    geom_segment(aes(x = b_x[1], y = b_y[1], 
                     xend = b_x[2], yend = b_y[2]),
                 arrow = arrow_style, 
                 linewidth = line_width, colour = edge_col) +
    annotate("text", x = mid_b[1], y = mid_b[2], label = b_lab,
             size = 3.8, hjust = 0.5, colour = text_col) +

    # c′ path
    geom_segment(aes(x = c_x[1], y = c_y[1], 
                     xend = c_x[2], yend = c_y[2]),
                 arrow = arrow_style, 
                 linewidth = line_width, colour = edge_col) +
    annotate("text", x = mid_c[1], y = mid_c[2], label = c_lab,
             size = 3.8, hjust = 0.5, colour = text_col) +

    # nodes
    geom_label(
      data = nodes,
      aes(x = x, y = y, label = name, fill = id),
      size = 5,
      colour = "black",
      label.size = 0.6,
      label.r = unit(0.55, "lines"),           # corner roundness
      label.padding = unit(0.5, "lines")       # inner text padding
    ) +
    scale_fill_manual(values = setNames(box_fills, c("X","M","Y")), guide = "none") +
    coord_fixed() + theme_void() +
    xlim(-1, 11) + ylim(-1.6, 5.4)
}
plot_mediation_triangle(
  med_fit = fit_Mediator_model,
  out_fit = fit_full_model,
  treat   = "Age_z",
  mediator= "Bact_count_Z",
  x_label = "Age",
  m_label = "Bacterial Count",
  y_label = "Inflammation"
)

Leave a Reply

Your email address will not be published. Required fields are marked *

R Code

Previous article

Stepwise Multiple Regression
R Code

Next article

How to make a simple bar graph in R