Mediation analysis with nuisance variables.
Mediation Analysis with Nuisance Variables
Jack Auty
2025-08-06
1 Introduction
Mediation analysis is a simple idea that is easy to perform but can be a minefield to interpret correctly. The principle is straightforward: one explanatory variable may correlate with the dependent variable by influencing another explanatory variable.
For example, suppose you find that bullet wounds correlate with death, and that blood loss also correlates with death. You could use a mediation analysis to test whether bullet wounds lead to death by increasing blood loss. You would find that they do (sorry about the gruesome example).
One absolute no-no in mediation analysis is when the direction of causation could be the other way around – that is, if the dependent variable causes the explanatory variables. Mediation analysis breaks down completely in such cases, so don’t do it. (Cough like this paper did – https://pubmed.ncbi.nlm.nih.gov/29567761/ – cough.)
For example, if you modelled bullet wounds as the dependent variable and tried to explain them using death and blood loss as predictors (with blood loss as a mediator), you would find that death causes bullet wounds directly and indirectly through blood loss. Obviously, that doesn’t make sense. This is different from correlation analysis, which does not care about the direction of causality.
In this example, we are testing whether the age of a patient correlates with inflammation through the variable bacterial count. The idea is that being older might not only directly cause more inflammation, but also make someone more susceptible to bacterial growth, which could lead to more bacteria in the blood and, in turn, more inflammation.
To complicate things further (as data often does), what if you wanted to adjust for some nuisance variables first, before looking at your variables of interest? Okay – here we go.
2 Load Required Packages
# Function for CRAN packages
load_cran_pkg <- function(pkgs) {
invisible(lapply(pkgs, function(pkg) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg)
}
library(pkg, character.only = TRUE)
}))
}
# Function for GitHub packages
load_github_pkg <- function(pkgs) {
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
invisible(lapply(pkgs, function(pkg) {
if (!requireNamespace(pkg, quietly = TRUE)) {
remotes::install_github(pkg)
}
# Extract package name from "user/repo"
pkg_name <- sub(".*/", "", pkg)
library(pkg_name, character.only = TRUE)
}))
}
# ---- Usage ----
cran_packages <- c("ggplot2", "viridisLite", "sjPlot", "MASS", "mediation", "gt", "knitr","grid")
github_packages <- c("sinhrks/ggfortify")
load_cran_pkg(cran_packages)
load_github_pkg(github_packages)
3 Citations
| References |
|---|
| Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York |
| Garnier, Simon, Ross, Noam, Rudis, Robert, Camargo, Pedro A, Sciaini, Marco, Scherer, Cédric (2023). viridis(Lite) - Colorblind-Friendly Color Maps for R., viridisLite package version 0.4.2 |
| Lüdecke D (2024). sjPlot: Data Visualization for Statistics in Social Science, R package version 2.8.17 |
| Venables WN, Ripley BD (2002). Modern Applied Statistics with S, Fourth edition. Springer, New York |
| Tingley D, Yamamoto T, Hirose K, Keele L, Imai K (2014). mediation: R Package for Causal Mediation Analysis. Journal of Statistical Software, 59(5), 1-38 |
| Imai K, Keele L, Yamamoto T (2010). Identification, Inference, and Sensitivity Analysis for Causal Mediation Effects. Statistical Science, 25(1), 51-71 |
| Imai K, Keele L, Tingley D (2010). A General Approach to Causal Mediation Analysis. Psychological Methods, 15(4), 309-334 |
| Imai K, Keele L, Tingley D, Yamamoto T (2011). Unpacking the Black Box of Causality: Learning about Causal Mechanisms from Experimental and Observational Studies. American Political Science Review, 105(4), 765-789 |
| Imai K, Yamamoto T (2013). Identification and Sensitivity Analysis for Multiple Causal Mechanisms: Revisiting Evidence from Framing Experiments. Political Analysis, 21(2), 141-171 |
| Imai K, Keele L, Tingley D, Yamamoto T (2010). Causal Mediation Analysis Using R. In Vinod HD, Advances in Social Science Research Using R. Springer-Verlag, New York |
| Iannone R, Cheng J, Schloerke B, Hughes E, Lauer A, Seo J, Brevoort K, Roy O (2025). gt: Easily Create Presentation-Ready Display Tables, R package version 1.0.0 |
| Xie Y (2025). knitr: A General-Purpose Package for Dynamic Report Generation in R, R package version 1.50 |
| Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida |
| Xie Y (2014). knitr: A Comprehensive Tool for Reproducible Research in R. In Stodden V, Leisch F, Peng RD, Implementing Reproducible Computational Research. Chapman and Hall/CRC |
| R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria |
| Tang Y, Horikoshi M, Li W (2016). ggfortify: Unified Interface to Visualize Statistical Result of Popular R Packages. The R Journal, 8(2), 474-485 |
| Horikoshi M, Tang Y (2018). ggfortify: Data Visualization Tools for Statistical Analysis Results |
4 Load and Prepare Data
First load the data and packages. Always check the data to ensure it makes sense – for example, are the numerical variables actually stored as numeric?
# Example data
dat<- read.table(url("https://jackauty.com/wp-content/uploads/2020/05/Stepwise-regression-data.txt"),header=T)
#Example data check
tab_df(head(dat))
| Inflammation | Age | Sex | Blood.pressure | Infection | Bacterial.count |
|---|---|---|---|---|---|
| 4.40 | 7.16 | M | 151 | No | 75.00 |
| 1218.07 | 8.88 | F | 145 | Yes | 5718.30 |
| 5.32 | 3.23 | F | 104 | No | 27.00 |
| 3072.58 | 7.91 | F | 156 | Yes | 4439.05 |
| 108.86 | 7.98 | M | 143 | No | 224.00 |
| 563.63 | 2.93 | F | 159 | No | 480.00 |
5 Distribution of the dependent and explanatory variables
Next, check if the dependent variable is normally distributed. It does not have to be perfectly normal, but analyses will generally run more smoothly if it is.
In this example, the distribution of the dependent variable was improved by a square-root transformation, so a new transformed variable was created.
Age looks fine without transformation. Bacterial count, however, benefits from a stronger shrinking transformation. In this case, a log transformation was applied.
hist(dat$Inflammation, breaks=6, cex.main = 0.6)
hist(sqrt(dat$Inflammation), breaks=6, cex.main = 0.6)
#Square root looks better, so let's create a variable to use in our model
dat$Inflammation.transformed<-sqrt(dat$Inflammation)
# Age looks good!
hist(dat$Age, cex.main = 0.6)
# Bacterial count needs a strong shrinking transformation
hist(dat$Bacterial.count, cex.main = 0.6)
# Transforming the Bacterial Count Data works
dat$Bact_count_trans <- log10(dat$Bacterial.count+1)
hist(dat$Bact_count_trans, breaks = 2, cex.main = 0.6)
6 Adjust the data for nuisance variables
Next, we build a model including the nuisance variables (Sex and Blood Pressure) and extract the standardised residuals into a new column in the data frame. Standardised residuals represent the variation left over after accounting for the nuisance variables. This residual variation can then be explained by the variables of interest. Always check model assumptions and transform variables if necessary.
#Set up adjusted model
adjusted_model<-lm(Inflammation.transformed ~ Sex*Blood.pressure, data = dat)
summary(adjusted_model)
##
## Call:
## lm(formula = Inflammation.transformed ~ Sex * Blood.pressure,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.1436 -8.2012 -0.9615 9.9437 29.0713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.9407 15.8002 0.249 0.8036
## SexM 38.1009 22.2552 1.712 0.0901 .
## Blood.pressure 0.2117 0.1178 1.797 0.0755 .
## SexM:Blood.pressure -0.3093 0.1685 -1.836 0.0695 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.52 on 96 degrees of freedom
## Multiple R-squared: 0.04677, Adjusted R-squared: 0.01698
## F-statistic: 1.57 on 3 and 96 DF, p-value: 0.2017
#Diagnostic plots
autoplot(adjusted_model,which = c(1,2,4))
#Extract the standardized residuals and turn them into a column in your data
dat$residuals<-stdres(adjusted_model)
7 Standardise explanatory variables
In mediation analysis, you will often report that one variable has a direct effect on the dependent variable, and that some proportion of the effect is mediated through another variable. For this proportion to be meaningful, the explanatory variables must be on the same scale.
We standardise them into z-scores (units in standard deviations from the mean) so they can be compared directly.
dat$Age_z <- as.numeric(scale(dat$Age))
dat$Bact_count_Z <- as.numeric(scale(dat$Bact_count_trans))
8 Build base models and check assumptions
We now build our base models. First, we model the primary explanatory variable (Age) against the mediator (Bacterial count) as the dependent variable. Next, we model the primary explanatory variable and the mediator together as predictors of the residuals (inflammation) from the nuisance-adjusted model.
Because of the earlier transformations and scaling, both models meet the assumptions well.
#Build base models
fit_Mediator_model<-lm(Bact_count_Z ~ Age_z, data=dat)
fit_full_model<-lm(residuals ~ Age_z + Bact_count_Z, data=dat)
autoplot(fit_Mediator_model,which = c(1,2,4))
autoplot(fit_full_model,which = c(1,2,4))
9 Mediation analysis
Now we run the mediation analysis. In this example, we find a significant effect of Age on inflammation, and an indirect effect of Age on inflammation through its effect on bacterial count.
We repeat the analysis using a bootstrap method, which is more robust and less sensitive to non-normality in the residuals, and obtain the same result.
The results suggest that more than 50% of the effect of Age on inflammation is statistically mediated by bacterial count. This implies that being older increases the risk of bacterial infection, and that this accounts for a substantial proportion of the higher inflammation observed in older individuals. (Note: this dataset is simulated.)
#Run mediation analysis
mediation<-mediate(fit_Mediator_model, fit_full_model, treat="Age_z", mediator = "Bact_count_Z", boot = TRUE, sims = 1000)
summary_med <- summary(mediation)
# Build data frame
df_med <- data.frame(
Effect = c("ACME (indirect effect)",
"ADE (direct effect)",
"Total Effect",
"Proportion Mediated"),
Estimate = c(summary_med$d0, summary_med$z0, summary_med$tau.coef, summary_med$n0),
CI_lower = c(summary_med$d0.ci[1], summary_med$z0.ci[1],
summary_med$tau.ci[1], summary_med$n0.ci[1]),
CI_upper = c(summary_med$d0.ci[2], summary_med$z0.ci[2],
summary_med$tau.ci[2], summary_med$n0.ci[2]),
p_value = c(summary_med$d0.p, summary_med$z0.p,
summary_med$tau.p, summary_med$n0.p)
)
# Format p-values
df_med$p_value <- ifelse(df_med$p_value < 0.0001,
"<0.0001",
sprintf("%.4f", df_med$p_value))
# Combine CIs into one column
df_med$CI <- paste0("[", round(df_med$CI_lower, 3), ", ", round(df_med$CI_upper, 3), "]")
# Keep only desired columns
df_med <- df_med[, c("Effect", "Estimate", "CI", "p_value")]
# Show table
sjPlot::tab_df(df_med,
title = "Mediation Analysis Results",
col.header = c("Effect", "Estimate", "95% CI", "p-value"))
| Effect | Estimate | 95% CI | p-value |
|---|---|---|---|
| ACME (indirect effect) | 0.15 | [0.019, 0.285] | 0.0280 |
| ADE (direct effect) | 0.20 | [0.081, 0.305] | <0.0001 |
| Total Effect | 0.35 | [0.165, 0.53] | <0.0001 |
| Proportion Mediated | 0.43 | [0.09, 0.699] | 0.0280 |
9.1 Plotting mediation analysis
Visualising mediation results can be challenging. Two useful approaches are:
Forest plots, which provide a quick overview of effect sizes and confidence intervals, making it easy to identify which effects are statistically significant.
Path diagrams, which illustrate the relationships between variables, showing both the size and significance of direct and indirect effects within the mediation model.
9.1.1 Forest plot
sm <- summary(mediation)
df_plot <- data.frame(
Effect = c("ACME (indirect)", "ADE (direct)", "Total Effect", "Prop. Mediated"),
Estimate = c(sm$d0, sm$z0, sm$tau.coef, sm$n0),
CI_lower = c(sm$d0.ci[1], sm$z0.ci[1], sm$tau.ci[1], sm$n0.ci[1]),
CI_upper = c(sm$d0.ci[2], sm$z0.ci[2], sm$tau.ci[2], sm$n0.ci[2])
)
ggplot(df_plot, aes(x = Effect, y = Estimate)) +
geom_hline(yintercept = 0, linetype = "dotted") + # 0 reference line
geom_point(size = 3) +
geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.1) +
coord_flip() +
scale_y_continuous(limits = c(-0.1, NA)) + # force min = -0.1
theme_classic() +
labs(y = "Estimate", x = "", title = "Mediation Analysis Effects")
9.1.2 Path diagram
p_stars <- function(p) ifelse(p < 0.001, "***",
ifelse(p < 0.01, "**",
ifelse(p < 0.05, "*", "")))
plot_mediation_triangle <- function(
med_fit, out_fit,
treat, mediator,
x_label = treat, m_label = mediator, y_label = "Outcome",
# colours & style
box_fills = c(X = "#E6F0FA", M = "#FDE6E6", Y = "#F1E6FA"),
edge_col = "#222222",
text_col = "#222222",
line_width = 1,
arrow_head_size = 0.25, # size of arrowhead only
line_inset = 1.8, # pull back arrow line from box centre
# nudges
a_nudge = c(-1.6, 0.8),
b_nudge = c( 1.8, 0.8),
c_nudge = c( 0.0,-0.7),
digits_b = 3, digits_t = 2
) {
sm_m <- summary(med_fit)$coefficients
sm_y <- summary(out_fit)$coefficients
a_est <- sm_m[treat, "Estimate"]; a_t <- sm_m[treat, "t value"]; a_df <- df.residual(med_fit); a_p <- 2*pt(-abs(a_t), a_df)
b_est <- sm_y[mediator, "Estimate"]; b_t <- sm_y[mediator, "t value"]; b_df <- df.residual(out_fit); b_p <- 2*pt(-abs(b_t), b_df)
c_est <- sm_y[treat, "Estimate"]; c_t <- sm_y[treat, "t value"]; c_df <- df.residual(out_fit); c_p <- 2*pt(-abs(c_t), c_df)
a_lab <- sprintf("(a)\nb = %.*f, t(%d) = %.*f%s",
digits_b, a_est, a_df, digits_t, a_t, p_stars(a_p))
b_lab <- sprintf("(b)\nb = %.*f, t(%d) = %.*f%s",
digits_b, b_est, b_df, digits_t, b_t, p_stars(b_p))
c_lab <- sprintf("(c')\nb = %.*f, t(%d) = %.*f%s",
digits_b, c_est, c_df, digits_t, c_t, p_stars(c_p))
nodes <- data.frame(
name = c(x_label, m_label, y_label),
id = c("X","M","Y"),
x = c(0, 5, 10),
y = c(0, 4, 0)
)
# shorten segments more so arrowheads are clear of boxes
a_x <- c(nodes$x[1] + line_inset*0.35, nodes$x[2] - line_inset*0.35)
a_y <- c(nodes$y[1] + line_inset*0.28, nodes$y[2] - line_inset*0.28)
b_x <- c(nodes$x[2] + line_inset*0.35, nodes$x[3] - line_inset*0.35)
b_y <- c(nodes$y[2] - line_inset*0.28, nodes$y[3] + line_inset*0.28)
c_x <- c(nodes$x[1] + line_inset, nodes$x[3] - line_inset)
c_y <- c(0, 0)
# label midpoints + nudges
mid_a <- c(mean(a_x) + a_nudge[1], mean(a_y) + a_nudge[2])
mid_b <- c(mean(b_x) + b_nudge[1], mean(b_y) + b_nudge[2])
mid_c <- c(mean(c_x) + c_nudge[1], mean(c_y) + c_nudge[2])
arrow_style <- arrow(type = "closed",
length = unit(arrow_head_size, "cm"))
ggplot() +
# a path
geom_segment(aes(x = a_x[1], y = a_y[1],
xend = a_x[2], yend = a_y[2]),
arrow = arrow_style,
linewidth = line_width, colour = edge_col) +
annotate("text", x = mid_a[1], y = mid_a[2], label = a_lab,
size = 3.8, hjust = 0.5, colour = text_col) +
# b path
geom_segment(aes(x = b_x[1], y = b_y[1],
xend = b_x[2], yend = b_y[2]),
arrow = arrow_style,
linewidth = line_width, colour = edge_col) +
annotate("text", x = mid_b[1], y = mid_b[2], label = b_lab,
size = 3.8, hjust = 0.5, colour = text_col) +
# c′ path
geom_segment(aes(x = c_x[1], y = c_y[1],
xend = c_x[2], yend = c_y[2]),
arrow = arrow_style,
linewidth = line_width, colour = edge_col) +
annotate("text", x = mid_c[1], y = mid_c[2], label = c_lab,
size = 3.8, hjust = 0.5, colour = text_col) +
# nodes
geom_label(
data = nodes,
aes(x = x, y = y, label = name, fill = id),
size = 5,
colour = "black",
label.size = 0.6,
label.r = unit(0.55, "lines"), # corner roundness
label.padding = unit(0.5, "lines") # inner text padding
) +
scale_fill_manual(values = setNames(box_fills, c("X","M","Y")), guide = "none") +
coord_fixed() + theme_void() +
xlim(-1, 11) + ylim(-1.6, 5.4)
}
plot_mediation_triangle(
med_fit = fit_Mediator_model,
out_fit = fit_full_model,
treat = "Age_z",
mediator= "Bact_count_Z",
x_label = "Age",
m_label = "Bacterial Count",
y_label = "Inflammation"
)
