Author Archives: jackrrivers

How to make a Rain Cloud Plot (aka a Rotated Violin plot)

These are great plots. They look good and do not hide any information. They provide quick insights into the data, including its spread, any differences present, and the rough distributions. It is important to choose a graph that matches the statistical test conducted. For instance, if you show mean +/- SEM or SD, you should have performed a parametric test, like an ANOVA. On the other hand, if you display medians and quartiles (box plot), you should have used a non-parametric test, like a Mann-Whitney test. However, the Rain Cloud plot displays all the necessary information, making it suitable for any type of statistical test. In general, these plots are great. However, their main limitation is that they really only work with a sufficiently large sample size (N). If you have an N of less than 8 in each group, it is advisable to skip the distribution part of the graph. Below is the code.

#Load the packages
install.packages("PupillometryR")
install.packages("ggplot2")
library("PupillometryR")
library("ggplot2")


#Load the data
example<- read.table(url("https://jackauty.com/wp-content/uploads/2021/02/barchart3.txt"), header=T) 


#Editing the variables to work with the graphs
##Making Treatment a factor 
example$Treatment<-as.factor(example$Treatment) 

# Generating a color palette named "colours" with four colors from the "viridis" palette
colours <- viridis(4, begin = 0.25, end = 0.75, alpha = 0.5)

#Reording the factors to make sense (from bottom to top)
example$Treatment<-factor(example$Treatment, levels = c("Disease_Drug","Disease","Drug","Control")) 

# Creating a ggplot object called "rain_cloud_plot" using the data frame "example"

rain_cloud_plot <- ggplot(example, aes(Treatment, Dependent_Variable, fill = Treatment, col = Treatment)) +

# Adding a flat violin plot with some horizontal position adjustment and transparency
  geom_flat_violin(position = position_nudge(x = 0.1, y = 0), alpha = 0.8, width = 0.5) +

# Adding points to the plot with jittering and dodging for better visualization, along with some transparency
  geom_point(position = position_jitterdodge(dodge.width = 0.15, jitter.width = 0.15), size = 2, alpha = 0.6) +

# Adding a box plot with black color, no outliers, some transparency, and horizontal position adjustment
  geom_boxplot(color = "black", width = 0.1, outlier.shape = NA, alpha = 0.5, position = position_nudge(x = -0.2, y = 0)) +

# Flipping the x and y axes (transposing) to create a horizontal plot
  coord_flip() +

# Setting the x and y axis labels and adjusting the y-axis scale to a square root transformation
  labs(x = "", y = "Relative levels") +
  scale_y_continuous(trans = "sqrt") +

# Applying a classic theme to the plot
  theme_classic() +

# Removing the legend from the plot
  theme(legend.position = "none") +

# Adjusting the font size for the text elements in the plot
  theme(text = element_text(size = 14)) +

# Setting the color palette for the fill aesthetic
  scale_fill_manual(values = colours) +

# Setting the color palette for the color aesthetic
  scale_colour_manual(values = colours)+
# Relabel x axis
  scale_x_discrete(labels=c("Disease + Drug", "Disease","Drug", "Control"))

# Displaying the plot
rain_cloud_plot
Just note this has a transformed axis. This is one of the nice things about a rain cloud plot is you can demonstrate the efficacy of a transformation to create normality. (Obviously this just has an of 3, so that doesn’t really make sense, but the code is there if you need it).

Outreach and public engagement

I am passionate about sharing the wonders of science with others and thrive on the challenge of getting people excited about it. As a researcher, I believe it is my duty to not only lead scientific discovery but also to share that knowledge with the public. Throughout my career, I have been actively involved in outreach efforts, including designing and delivering presentations and demonstrations to schoolchildren through programs like the University of Manchester’s Pivots of Change and the Pint of Science international outreach program. I’ve even taken my passion for dementia research to local pubs! I’ve also created two websites, one for more scientific audiences and one for the general public, and have worked with my university’s social media team to create videos and other online content. My outreach videos alone have been viewed over 150,000 times! Come check out some of my other outreach contributions and let’s explore the exciting world of science together.

deanoaks-outreach-1

School visit with Dr. Michael Daniels. Toxic compounds (hacky-sacks) and the Brain (rope) can be seen in the background.

Some more examples:

  • Interviewed by Smithsonian magazine on the effects of plastics on wildlife 2021.
  • Interviewed by ABC New South Wales radio on the effects of plastics on wildlife 2021.
  • Ocean project podcast. The negative effects of plastics on humans and wildlife Listen here 2021.
  • Presenter at the House of Lords event to raise awareness about the need for funding for dementia research. June 2019. Details of the event can be found here.

  • Speaker for the University of Manchester Stories event, January 2019. This event was to over 400 PhD students.

DSC_6491

The organisers invited excellent speakers to tell their stories to inspire the next generation of thinkers. Speakers included Dame Prof. Nancy Rothwell and Chancellor Lemn Sissay. The title of my talk was: Usain Bolt shaves his legs.

  • Dementia research podcast – career development as an ECR in dementia -> listen here
  • Presenter for the British Science Association outreach event- Science at the Movies. Topic: Do we use 10% of a brain and can we drug ourselves smart? February 2018.
  • Presenter for the Manchester branch of Pint of Science, an international science outreach event. Topic: Why do we get old? Ageing, Alzheimer’s and Animal Models. January 2018.
  • Presenter for the Manchester branch of Pint of Science, an international science outreach event. Topic: The coast of England, radioactive brains, bleach and other Alzheimer’s disease related things. May 2015.
  • Speaker and experiment demonstrator for the Pivots of Change outreach programme. Class and assembly presentations at St. Luke RC Primary School and St. Ambrose Barlow RC High School. Topic: The history and future of science. February 2015.
  • University of Otago Christchurch Seminar Series. Topic: Cannabinoids and brain damage. June 2012.

A complete guide to ANOVAs in R

This again is more for my use than anyone else’s. The focus of this post is doing ANOVAs (or linear models, after all everything is a linear model), in a nice readable and publishable way with transformations, post-hocs and graphs. Mostly for appearance reasons, this processes uses lots of packages. Packages should normally be avoided, but one thing they’re good at is making things pretty. So that’s what they’re used for here.

Install packages

Note for this I’ve put this if(!require) function in. This means it won’t install the package if it doesn’t need to. This method just saves time and allows you to run all your code without the delays of reinstalling packages.

if(!require(MASS)) install.packages("MASS")
if(!require(sjPlot)) install.packages("sjPlot")
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(emmeans)) install.packages("emmeans")
if(!require(viridis)) install.packages("viridis")
if (!require("remotes")) install.packages("remotes")
remotes::install_github('sinhrks/ggfortify') 

library(ggplot2); library(sjPlot); library(ggfortify); library(MASS); library(emmeans); 

Load the data

Here we are just loading the example data set and having a look. The data has two explanatory variables, Sex and Drug, with one dependent variable. So we could imagine this study is testing the response to a drug or placebo in male and females just to see if sex effects how individuals respond to a drug. This is a classic two-way ANOVA design. Two factors -> Sex and Drug.

ANOVA_example<-read.table(url("https://jackauty.com/wp-content/uploads/2022/11/ANOVA_example2.txt"), header =T)
View(ANOVA_example)

Quick graph

I would always recommend doing a quick and ugly graph just to orientate yourself with the data.

stripchart(Dependent~Sex*Drug, data=ANOVA_example, pch=16, vertical =T)
I told you it was ugly! But we can see that females don’t respond to the drug, while males start high and then drop down in response to the drug.

Run the base ANOVA

Here we are running the base ANOVA (linear model) and then checking the assumptions. It’s good to check the assumptions before seeing if you have any significance. That way you aren’t tempted to go “hey it’s significant, let’s not do any transformations”.

What we can see in the diagnostic plots is that a transformation is needed.

The top left plot is residuals vs fitted, this tests for homoscedasticity, what we want is an even scattering of data, what we see is the data gets more variable as the fitted values increase. i.e. the variation increases as the mean increases.

Next in the top right we have a Q-Q plot. This plots the residuals vs a theoretical normal distribution (dotted line). The dots should be on the line. We can see the residuals are not normally distributed as they depart from the dotted line at the extremes.

Lastly, we have Cooks distance. This is a measure of data with extreme influence and effect on the model (aka outliers). Now the rule of thumb is you don’t want a Cook’s distance of more than 4/n. This study has n=10, so more than 0.4 is bad. However, there’s nothing you can do in experimental biology with outliers. The only thing would be to go back and look at your lab book. You cannot just remove data you don’t like, but, if you go back and your lab book says “I think I gave the wrong drug to this individual”, or something, THEN you have grounds to remove it.

Note: Using ggfortify’s autoplot function makes them look nice and automatically lays them out in a grid (no need for par(mfrow=c(2,2)), like in base r).

#Running base ANOVA
aov_base<-lm(Dependent~Sex*Drug, data=ANOVA_example)

#Diagnostic plots
autoplot(aov_base,which = c(1,2,4))


BoxCox transformation

So you can just have a guess at a transformation needed. The wedge shaped residual vs fitted plot above tells us we need a shrinking transformation like a LOG10 or square root. However, if we can take the researcher out of the process this is better. What if the LOG10 and square root have similar diagnostic plots, but the LOG10 gives you significant results and the square root transformation gives you p=0.051. You might be tempted to choose the LOG10 transformation because it gives you significance.

What we can do is a BoxCox transformation. This is an automated process that transforms your data by a range of powers and evaluates the normality of the residuals. Then it chooses the power (Selected.lamda) that produces the most normally distributed residuals. Typically, when you solve normality, you also solve homoscedasticity.

 #BoxCox transformation 
boxcox<-boxcox(aov_base,lambda = seq(-5, 5, 1/1000), plotit = TRUE )

Selected.lambda <-boxcox$x[boxcox$y==max(boxcox$y)]
Selected.lambda

#Run the transformed model
aov_base_trans<-lm(Dependent^Selected.lambda ~Sex*Drug, data=ANOVA_example)

Assumption checking the transformed model

And we can see running the diagnostic plots, the BoxCox transformation has totally worked. Those look perfect.

 #Diagnostic plots autoplot
autoplot(aov_base_trans, which = c(1,2,4))

Looking at your final model

Using the tab_model function from sjPlot makes your summary tables pretty. This is looking at your analysis like a linear model, so it gives you the coefficients. This is nice, because it allows you to see the magnitude and direction of the effects (ANOVA summaries do not). So we can see everything is significant. But the large positive coefficient of the interaction term tells us that males respond to the drug more than females. Note estimate direction is a little difficult because of the transformation. Because it’s a negative lambda we actually have to flip the estimates direction, so the positive coefficient for the interaction term tells us that Males have a large decrease in response to the drug.

 #Summary of the linear model gives you the coefficients 
tab_model(aov_base_trans)

However, sometimes you might have many levels within a factor. For example you could have low dose, medium dose and high dose of a drug and you don’t want to see the effects of each level, you want to know does the drug overall have an effect (explain variation in the data). For this an ANOVA table is more useful. From this we can see there is a significant main effect of Sex, a significant main effect of Drug and a significant interaction between the two. Again using tab_df from sjPlot makes the table look pretty.

 #ANOVA of the linear model
tab_df(anova(aov_base_trans),show.rownames =T)

Post hoc analysis

OK this is a Post-hoc analysis with a priori selected comparisons. We don’t want to do all the comparisons, because some are useless. Why compare Females on Drug with Males on Placebo? That doesn’t make sense. So in this, we select our comparisons and adjust for only those comparisons.

#Posthoc/Planned contrasts 
post <- lsmeans(aov_base_trans, pairwise ~ Sex*Drug, adjust="none")

posthoc<-summary(post)

tab_df(posthoc$contrasts[,c(1,4)],show.rownames =T)

This code gives us a list of potential comparisons. We then select the one’s we want. In this case 1, 2, 5 and 6. We can see that comparisons of Male Drug vs Female Placebo and Male Placebo vs Female Drug, really do not make any biological sense to compare. So we don’t want to do these needless comparisons that cost us statistical power.

#Build a dataframe of all the comparisons (contrasts) and the unadjusted p values
p.value<-data.frame(cbind(posthoc$contrasts["contrast"],posthoc$contrasts["p.value"]))

#Select the comparison you are interested in
comparisons<-c(1,2,5,6)
ptable<-p.value[comparisons,]

#Create a column of adjusted p values using the Holm-Sidak method
ptable$padj<-p.adjust(ptable$p.value, method="holm")

#View the table in a pretty way
tab_df(ptable,digits = 4)

Graphing your results

Now the first thing we need to do is manipulate our data a little. R will automatically work alphabetically and so Drug will be first (left side) on the graph, then Placebo. By convention, Placebo and Controls are always first (left). We also need to make a single column that explains what groups all the data belong to. So here’s the code for this data manipulation.

#Editing the variables to work with the graphs
##Making Drug a factor 
ANOVA_example$Drug<-as.factor(ANOVA_example$Drug) 
##Reorder the factor so Placebo is first
ANOVA_example$Drug<-factor(ANOVA_example$Drug, levels = c("Placebo", "Drug")) 

##Making a grouping column that describes the data
ANOVA_example$Group<-as.factor(paste(ANOVA_example$Sex,ANOVA_example$Drug, sep= " ")) 

##Reording the factors to make sense
ANOVA_example$Group<-factor(ANOVA_example$Group, levels = levels(ANOVA_example$Group)[c(2,1,4,3)]) 

ANOVA_example$Drug<-as.factor(ANOVA_example$Drug)
ANOVA_example$Drug<-factor(ANOVA_example$Drug, levels = levels(ANOVA_example$Drug)[c(2,1)])

Making the graph

So here is all the finicky details for making the graph. I like showing the points. I think it’s important to have the raw data there. Also, it’s important to use the scale_fill_viridis_d function as this makes it colour blind friendly.

bar_plot <- ggplot(ANOVA_example, aes(Sex, Dependent, fill = Drug))+  
#Make bars using the mean   
geom_bar(position = 'dodge', stat = 'summary', fun = 'mean', alpha=0.6, colour=1) +   

#Make error bars using SEM   
geom_errorbar(stat = 'summary',fun.data= "mean_se", position=position_dodge(.9), width = 0.075, colour=1, size=1) +   

#Add data points   
geom_point(aes(group = Group), shape = 21, position = position_dodge(0.9), alpha= 0.25, stroke = NA, size=3, fill =1)+      

#Remove background colour and gridlines   
theme_classic()+      

#Ensure axes join at 0,0 and set the y axis limits   
scale_y_continuous(expand = c(0, 0), limits = c(0,1.2*max(ANOVA_example$Dependent)))+        

#Make axis lines thicker   
theme(axis.line = element_line(size=1)) +      

#Label Axis   
labs(x = "Sex", y="Dependent variable (unit)")+       

#Make axis lines thicker   
theme(axis.line = element_line(size=1)) +       

#Adjust axis font sizes and font colour   
theme(axis.title = element_text(size = 15), axis.text.x = element_text(size=12, color="black"),          axis.text.y = element_text(size=12, color="black")) +       

#Make ticks black   
theme(axis.ticks=element_line(color="black", lineend = "square")) +      

#Fill viridis   
scale_fill_viridis_d(begin=0.25, end=0.75, alpha=0.5)  

#View plot 
bar_plot   

Saving the graph

Now I like to save as a PDF. This is a vector image which can be opened in illustrator. High res images can be taken of the PDF by zooming in a taking a snapshot (it sounds crazy but it works). Also, the pdf function works with everything, R basic, Plotly, ggplot2 -> everything. So it’s easy.

#Save as PDF 
pdf("example.pdf", height=4, width=4.5)
bar_plot
dev.off()

Reference list

The last thing I like to do is make a nice reference list. This can be finicky with GitHub packages. But it’s easy with CRAN packages.

#Making a reference list 
pkg <- c("MASS","sjPlot","ggplot2","emmeans","ggfortify")

#Create a loop to build a dataframe of the references
citation_list<-factor()

for(i in pkg){ citation_list<- c(citation_list,paste(capture.output(print(citation(i), style ="text",bibtext=F)), collapse="\n")) }

#Display the dataframe nicely with sjPlot's tab_df
tab_df( data.frame(citation_list), col.header="References", sep=" ")

The full code all in one

Just if you want an easy to copy and paste full code set. Here you go.


if(!require(MASS)) install.packages("MASS")
if(!require(sjPlot)) install.packages("sjPlot")
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(emmeans)) install.packages("emmeans")
if(!require(ggplot2)) install.packages("ggplot2")
if (!require("remotes")) install.packages("remotes")
remotes::install_github('sinhrks/ggfortify') 


library(ggplot2); library(sjPlot); library(ggfortify); library(MASS); library(emmeans); 

ANOVA_example<-read.table(url("https://jackauty.com/wp-content/uploads/2022/11/ANOVA_example2.txt"), header =T)

View(ANOVA_example)

stripchart(Dependent~Sex*Drug, data=ANOVA_example, pch=16, vertical =T)

#Running base ANOVA
aov_base<-lm(Dependent~Sex*Drug, data=ANOVA_example)

#Diagnostic plots 
autoplot(aov_base,which = c(1,2,4))

#BoxCox transformation 
boxcox<-boxcox(aov_base,lambda = seq(-5, 5, 1/1000), plotit = TRUE ) 

Selected.lambda <-boxcox$x[boxcox$y==max(boxcox$y)] 
Selected.lambda 


#Run the transformed model 
aov_base_trans<-lm(Dependent^Selected.lambda ~Sex*Drug, data=ANOVA_example) 


#Diagnostic plots autoplot
autoplot(aov_base_trans, which = c(1,2,4))

#Summary of the linear model gives you the coefficients 
tab_model(aov_base_trans) 

#ANOVA of the linear model
tab_df(anova(aov_base_trans),show.rownames =T) 

#Posthoc/Planned contrasts 
post <- lsmeans(aov_base_trans, pairwise ~ Sex*Drug, adjust="none") 
posthoc<-summary(post) 

tab_df(posthoc$contrasts[,c(1,4)],show.rownames =T)

#Build a dataframe of all the comparisons (contrasts) and the unadjusted p values
p.value<-data.frame(cbind(posthoc$contrasts["contrast"],posthoc$contrasts["p.value"]))

#Select the comparison you are interested in
comparisons<-c(1,2,5,6)
ptable<-p.value[comparisons,]

#Create a column of adjusted p values using the Holm-Sidak method
ptable$padj<-p.adjust(ptable$p.value, method="holm") 

#View the table in a pretty way
tab_df(ptable,digits = 4)  

##Making a grouping column that describes the data
ANOVA_example$Group<-as.factor(paste(ANOVA_example$Sex,ANOVA_example$Drug, sep= " ")) 

##Reording the factors to make sense
ANOVA_example$Group<-factor(ANOVA_example$Group, levels = levels(ANOVA_example$Group)[c(2,1,4,3)])
ANOVA_example$Drug<-as.factor(ANOVA_example$Drug)
ANOVA_example$Drug<-factor(ANOVA_example$Drug, levels = levels(ANOVA_example$Drug)[c(2,1)])


bar_plot <- ggplot(ANOVA_example, aes(Sex, Dependent, fill = Drug))+  
  #Make bars using the mean   
  geom_bar(position = 'dodge', stat = 'summary', fun = 'mean', alpha=0.6, colour=1) +   
  
  #Make error bars using SEM   
  geom_errorbar(stat = 'summary',fun.data= "mean_se", position=position_dodge(.9), width = 0.075, colour=1, size=1) +   
  
  #Add data points   
  geom_point(aes(group = Group), shape = 21, position = position_dodge(0.9), alpha= 0.25, stroke = NA, size=3, fill =1)+      
  
  #Remove background colour and gridlines   
  theme_classic()+      
  
  #Ensure axes join at 0,0 and set the y axis limits   
  scale_y_continuous(expand = c(0, 0), limits = c(0,1.2*max(ANOVA_example$Dependent)))+        
  
  #Make axis lines thicker   theme(axis.line = element_line(size=1)) +      
  
  #Label Axis   labs(x = "Sex", y="Dependent variable (unit)")+       
  
  #Make axis lines thicker   
  theme(axis.line = element_line(size=1)) +       
  
  #Adjust axis font sizes and font colour   
  theme(axis.title = element_text(size = 15), axis.text.x = element_text(size=12, color="black"),          axis.text.y = element_text(size=12, color="black")) +       
  
  #Make ticks black   
  theme(axis.ticks=element_line(color="black", lineend = "square")) +      
  
  #Fill viridis   
  scale_fill_viridis_d(begin=0.25, end=0.75, alpha=0.5)  

#View plot 
bar_plot   

#Save as PDF 
pdf("example22.pdf", height=4, width=4.5) 
bar_plot
dev.off() 


#Making a reference list 
pkg <- c("MASS","sjPlot","ggplot2","emmeans","ggfortify") 

#Create a loop to build a dataframe of the references
citation_list<-factor() 

for(i in pkg){   citation_list<- c(citation_list,paste(capture.output(print(citation(i), style ="text",bibtext=F)), collapse="\n")) }

#Display the dataframe nicely with sjPlot's tab_df
tab_df(   data.frame(citation_list), col.header="References", sep=" ") 



Comments

Clear form->
https://neuroform.nfshost.com/poll/clearpoll

<p></p>
<p>Quick vote..... choose wisely</p>
<p><iframe src="https://neuroform.nfshost.com/poll/210527" style="height: 200px; width: 100%; border: none; overflow: hidden;"></iframe></p>

<p>Comment below</p>
<p><iframe src="https://neuroform.nfshost.com/post/210527" style="height: 500px; width: 100%; border: none; overflow: hidden;"> &gt;</iframe></p>
<p></p>

How to make a simple bar graph in R

OK, I’ll be honest, this post is mostly for me. I always forget how to do all of this and end up spending hours on google. So now I can do what all coders do, copy – paste – adjust.

So here we go -> how to do a simple bar chart with SEMs, that looks publication quality.

I feel like the blog is a little like this meme…. sorry.
meme

Load the data and look at it to understand it and make sure it makes sense. This data is very simple

#Obviously you'll need ggplot2
library(ggplot2)

#Load and look at the data
example<- read.table(url("https://jackauty.com/wp-content/uploads/2021/02/barchart3.txt"), header=T)
example
Barchart data

##Reording the factors to make sense
example$Treatment<-as.factor(example$Treatment)
example$Treatment<-factor(example$Treatment, levels = levels(example$Treatment)[c(1,4,2,3)])

#Create the plot and do all the annoying adjustments (draw the rest of the owl lol)

bar_plot <- ggplot(example, aes(Treatment, Dependent_Variable, fill = Treatment))+  
  #Make bars using the mean   
  geom_bar(position = 'dodge', stat = 'summary', fun = 'mean', alpha=0.6, colour=1) +   
  
  #Make error bars using SEM   
  geom_errorbar(stat = 'summary',fun.data= "mean_se", position=position_dodge(.9), width = 0.075, colour=1, size=1) +   
  
  #Add data points   
  geom_point( shape = 21, position = position_dodge(0.9), alpha= 0.25, stroke = NA, size=3, fill =1)+      
  
  #Remove background colour and gridlines   
  theme_classic()+      
  
  #Ensure axes join at 0,0 and set the y axis limits   
  scale_y_continuous(expand = c(0, 0), limits = c(0,1.2*max(example$Dependent_Variable)))+ 
  
 
  # Relabel x axis
  scale_x_discrete(labels=c("Control","Drug", "Disease", "Disease + Drug"))+
  
  #Make axis lines thicker   
  theme(axis.line = element_line(linewidth=1)) +      
  
  #Label Axis   
  labs(x = "", y="Dependent variable (unit)")+       
  
  #Make axis lines thicker   
  theme(axis.line = element_line(size=1)) +       
  
  #Adjust axis font sizes and font colour   
  theme(axis.title = element_text(size = 15), axis.text.x = element_text(size=12, color="black"), 
  axis.text.y = element_text(size=12, color="black")) +       
  
  #Make ticks black   
  theme(axis.ticks=element_line(color="black", lineend = "square")) +      
  
  #Fill viridis   
  scale_fill_viridis_d(begin=0.25, end=0.75, alpha=0.5)  + 
  
  #Remove legend
  theme(legend.position = "none")


#View plot 
bar_plot   
#View plot
bar_plot
 

Output it as a PDF

#PDF Plot
ggsave("SuperCoolBarchart.pdf", device="pdf")

AND HERE IT IS…… MY GOSH IT’S…. boring but publishable.

Bonus stripchart fun!

This is great for a very quick plot that looks OK

#Create plot
par(mar = c(9, 5, 2, 2),bty = 'n')
stripchart(example$Dependent_Variable~example$Treatment, vertical = T, pch=16, 
 method="jitter", las=2, col="#4299D0",ylim=c(0, 1.2*max(example$Dependent_Variable)),
 xaxs = "i",yaxs = "i", ylab=expression(paste("IL-1", beta, " (pg/ml)")))
abline(0,0)

#Also in here is how to put beta (β) into your graph

Bonus boxplot fun!

This is great for a very quick plot that looks OK

#Export as pdf
pdf("boxplot.pdf")

#Increase margins
par(mar = c(9, 5, 2, 2),bty = 'n')

#Set up text labels
labs<-as.character(unique(example$Treatment))

#Make plot
boxplot(example$Dependent_Variable~example$Treatment,xlab="" 
 , las=2, col="#4299D066",ylim=c(0, 1.2*max(example$Dependent_Variable)), axes=F,
 , ylab=expression(paste("IL-1", beta, " (pg/ml)")))

#Add axis markers, labels and lines
axis(side = 1,at=c(1:length(labs)), labels = c(rep("", length(labs))))
text(cex=1, x=c(1:length(labs)), y=-0.75, labs, xpd=TRUE, srt=45, adj=1)
axis(side = 2)
box(bty="l", lwd=2)

#Turn off PDF device
dev.off()

#Also in here is how to put beta (β) into your graph
#Also in here is how to export as a PDF and how to rotate your labels 45 degrees

Mediation analysis with nuisance variables.

Mediation analysis is a simple idea that is easy to do, but an absolute minefield to interpret correctly. The principal is easy – perhaps one explanatory variable correlates with the dependent variable by mediating (effecting) an other explanatory variable. So say you find out that bullet wounds correlates with death and blood loss correlates with death, you could do a mediation analysis to see if bullet wounds correlates with death by affecting blood loss – and you would find that it does (sorry about the gruesome example). One absolute NO NO for mediation analyses is if the direction of causation could be the other way. i.e. the dependent variable causes the explanatory variables. All mediation analyses breaks down at this point, so don’t do it (cough – like this paper did – https://pubmed.ncbi.nlm.nih.gov/29567761/ – cough). For example if you modelled bullet wounds as the dependent variables and explained it with death and blood loss as a mediation factor, you would find that death causes bullet wounds directly and indirectly through blood loss. This obviously doesn’t make sense (this is different to just correlation analyses that doesn’t care about the direction of causality).

In this example we are seeing whether or not the age of a patient correlates with inflammation through the variable bacterial count. The idea being that being old not only directly causes inflammation, but perhaps makes you more susceptible to bacteria growth, leading to larger numbers of bacteria in your blood and this leads to more inflammation. Now to make this even more complicated (just because that is how data is) what if you wanted to get rid of some nuisance variables first, before looking at your explanatory variables of interest? Wow, OK, here we go.

 

First Load the data and packages. Make sure the data makes sense (you should always do this),  e.g. are the numerical variables numerical?

#Load package
library(MASS)
library("mediation")
#Load data
dat<- read.table(url("https://jackauty.com/wp-content/uploads/2020/05/Stepwise-regression-data.txt"),header=T)

#Check data, make sure it makes sense
str(dat)

'data.frame': 100 obs. of 8 variables:
 $ Inflammation : num 4.4 1218.07 5.32 3072.58 108.86 ...
 $ Age : num 7.16 8.88 3.23 7.91 7.98 2.93 9.12 9.41 7.71 2.91 ...
 $ Sex : Factor w/ 2 levels "F","M": 2 1 1 1 2 1 1 2 1 1 ...
 $ Blood.pressure : num 151 145 104 156 143 159 114 140 149 114 ...
 $ Infection : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 1 1 2 1 ...
 $ Bacterial.count : num 75 5718 27 4439 224 ...

Next, check if the dependent variable is normally distributed. It doesn’t have to be, but things will go smoother if it is. In this example the dependent variable distribution was improved by a square-root transformation. So I created a new transformed variable.

#Check dependent variable, ideally it's normally distributed. Transform if need
hist(dat$Inflammation, breaks=6)

scewed data

hist(sqrt(dat$Inflammation), breaks=6)
Better data

#Square root looks better, so let's create a variable to use in our model
dat$Inflammation.transformed<-sqrt(dat$Inflammation)

Next let’s build a model with the nuisance variables (Sex and Blood Pressure) and extract the standardized residuals into a new column of our dataframe. Residuals (Standardized) are the left over variation after the accounting for the nuisance variables. This residual variation will then hopefully be mostly explained by our variables of interest. Check the assumptions and transform if needed.

#Set up adjusted model
adjusted.model<-lm(Inflammation.transformed ~ Sex*Blood.pressure, data = dat)
summary(adjusted.model)

#Check Assumptions
par(mfrow=c(2,2))
plot(adjusted.model)

Adjusted model

#Extract the standardized residuals and turn them into a column in your data
dat$residuals<-stdres(adjusted.model)

#Check your data
dat[1:20,]

Next let’s build our base models with our two explanatory variables. First with our primary explanatory variable (Age), then with our primary explanatory variable and our mediator variable (Bacterial count). Then check your assumptions. Here it looks like we need to transform our explanatory variable of bacterial count. Often things like bacterial count need a log transformation and it worked!

#Build base models

fit.direct.model<-lm(residuals~Age, data=dat)
fit.mediator.variable.model<-lm(residuals~Age+Bacterial.count, data=dat)
summary(fit.direct.model)
summary(fit.mediator.variable.model)

#Assumption check
par(mfrow=c(2,2)) 
plot(fit.direct.model)
plot(fit.mediator.variable.model)

Fitted base model
#Top left you can clearly see the relationship isn't linear.

#Transform predicting variable
dat$log.bacterial.count<-log(dat$Bacterial.count)

fit.direct.model<-lm(residuals~Age, data=dat)
fit.mediator.variable.model<-lm(residuals~Age + log.bacterial.count, data=dat)

summary(fit.direct.model)
summary(fit.mediator.variable.model)

#Assumption check
par(mfrow=c(2,2)) 

plot(fit.direct.model)
plot(fit.mediator.variable.model)

Fitted base model transformed
#NNNIIIICCCEEEEE

Now we need to centre the variables. This is important as coefficients are only comparable when the explanatory variables are centred. In this example you might get a small coefficient for bacterial counts, because bacterial counts are in the millions and you might get a large coefficient for Age because that maxes out at 100 (ish). But if you centre both variables (turning them into Z scores), then they become more comparable as they’re on the same scale.

#Centering variables
dat$log.bact.centred<-(dat$log.bacterial.count-mean(dat$log.bacterial.count))/sd(dat$log.bacterial.count)
dat$Age.centred<-(dat$Age-mean(dat$Age))/sd(dat$Age)

Next we get to run our mediation analysis and boomtown we get a significant effect of Age on inflammation and a mediation effect of Age on inflammation through the the effect of age on bacterial count. We ran it again with a boot-strap method that is more robust and less susceptible to a lack of normality in the residuals and we got the same result. Our results show that more than 50% of the effect of Age on inflammation can be explained by bacterial count. This suggests that being elderly increases the risk of a bacterial infection and this explains a good proportion of why the elderly have more inflammation (note this data is made up).

#Running models with centred variables
fit.direct.model.centred<-lm(residuals~Age.centred, data=dat)
fit.mediator.variable.model.centred<-lm(residuals~Age.centred + log.bact.centred, data=dat)



#Run mediation analysis
mediation<-mediate(fit.direct.model.centred, fit.mediator.variable.model.centred, treat="Age.centred", mediator = "log.bact.centred")

summary(mediation)

Model summary

#Bootsrap method for extra robustness
boot.mediation<-mediate(fit.direct.model.centred, fit.mediator.variable.model.centred, treat="Age.centred", mediator = "log.bact.centred", boot=T, sims=5000)

summary(boot.mediation)
boot model summary

Stepwise Multiple Regression

Often you have a truck load of potential explanatory variables, that all might interact with each other, giving a multitude of potential ways the explanatory variables could relate to the dependent variable. You could painstakingly create every possible model or you could do a step-wise regression. Step-wise regression automatically adds and subtracts variables from your model and checks whether it improves the model using AIC as a selection criteria (AIC is a relative score of goodness).

Below is how I would do it. In this example we are trying to explain inflammation with a number of variables such as Age, Sex, Bacterial Count etc.

First Load the data and packages. Make sure the data makes sense (you should always do this),  e.g. are the numerical variables numerical?

#Load package
library(MASS)

#Load data
dat<- read.table(url("https://jackauty.com/wp-content/uploads/2020/05/Stepwise-regression-data.txt"),header=T)

#Check data, make sure it makes sense
str(dat)

'data.frame': 100 obs. of 8 variables:
 $ Inflammation : num 4.4 1218.07 5.32 3072.58 108.86 ...
 $ Age : num 7.16 8.88 3.23 7.91 7.98 2.93 9.12 9.41 7.71 2.91 ...
 $ Sex : Factor w/ 2 levels "F","M": 2 1 1 1 2 1 1 2 1 1 ...
 $ Blood.pressure : num 151 145 104 156 143 159 114 140 149 114 ...
 $ Infection : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 1 1 2 1 ...
 $ Bacterial.count : num 75 5718 27 4439 224 ...
 $ Inflammation.transformed : num 2.1 34.9 2.31 55.43 10.43 ...
 $ inflammation.boxcox.transformed: num 2.43 71.02 2.73 123.74 16.68 ...

Next, check if the dependent variable is normally distributed. It doesn’t have to be, but things will go smoother if it is. In this example the dependent variable distribution was improved by a square-root transformation. So I created a new transformed variable.

#Check dependent variable, ideally it's normally distributed. Transform if need
hist(dat$Inflammation, breaks=6)

scewed data

hist(sqrt(dat$Inflammation), breaks=6)
Better data

#Square root looks better, so let's create a variable to use in our model
dat$Inflammation.transformed<-sqrt(dat$Inflammation)

Next let’s build the full model. The full model is a model with all the explanatory variables interacting with each other! and then run a step-wise regression in both directions (stepping forward and backward)

#Set up full model
full.model<-lm(Inflammation.transformed ~ Age*Sex*Blood.pressure*Infection*Bacterial.count, data = dat)
summary(full.model)

fullmodel
#Full model is crazy (note red notes generated in paint :))

#Perform stepwise regression in both directions
step.model<-stepAIC(full.model, direction="both", trace = F)

summary(step.model)

Slimmed model

Now given the marginal significance of the three way interaction term, and the difficulty in interpreting three-way interaction terms, I’d be tempted to make a model without that term and check if there is a substantial difference in the model (using AIC or adjusted Rsquared). If the difference is minor, I’d go with the more parsimonious model.

One last step! Check that your residuals are heteroskedastic, normally distributed and there are no major outliers.

#Make it so you can see four graphs in one plot area
par(mfrow=c(2,2)) 

#Plot your assumption graphs
plot(step.model)

Assumptions




Top and bottom left – nice and flat. Heteroskedastic and no relationship between fitted value and the variability. Nice

Bottom right – no outliers (you’ll see dots outside of the dotted line, if their were)

Top right – ideally all the dots are on the line (normally distributed). So let’s try a better transformation of our dependent variable using a BoxCox transformation!

Box-Cox transformation!

The Box-Cox transformation is where the Box-Cox algorithm works out the best transformation to achieve normality.

First run the stepwise model as above but with the un-transformed variable. Then run the final model through the Box-Cox algorithm setting the potential lamba (power) to between -5 and 5. Lambda is the power that you are going to raise your dependent variable to.

Box-Cox algorithm will essentially run the model and score the normality over and over again. Each time transforming the dependent variable to every power between -5 and 5 going in steps on 0.001. The optimal power is where the normality reaches its maximum. So you then create the variable “Selected.Power” where the normality score was at it’s max. You then raise your dependent variable to that power and run your models again!

#BoxCox transformation
full.model.untransformed<-lm(Inflammation ~ Age*Sex*Blood.pressure*Infection*Bacterial.count, data = dat)
step.model.untransformed<-stepAIC(full.model, direction="both", trace = F)

boxcox<-boxcox(step.model.untransformed,lambda = seq(-5, 5, 1/1000),plotit = TRUE )
Selected.Power<-boxcox$x[boxcox$y==max(boxcox$y)]
Selected.Power
> 0.536

dat$inflammation.boxcox.transformed<-dat$Inflammation^Selected.Power

The Box-Cox algorithm chose 0.536, which is really close to a square-root transformation. But this minor difference does effect the distribution of the residuals.

 

#Running again with BoxCox transformed variable
#Set up full model
full.model<-lm(inflammation.boxcox.transformed ~ Age*Sex*Blood.pressure*Infection*Bacterial.count, data = dat)

#Perform stepwise regression in both directions
step.model<-stepAIC(full.model, direction="both", trace = F)

summary(step.model)


#Checking assumptions 
par(mfrow=c(2,2)) 
plot(step.model)
BoxCox graph

Top right – Yes that is pretty good. Normality achieved!

Top and bottom left – Ok, not great. Things get a bit more variable as the predicted values go up. Guess you can’t win them all.

Bottom right – no outliers beyond the dotted line (1). Though one is close.

The inflammatory nature of microplastics

Plastics are inexpensive, durable, lightweight, versatile material composed of long hydrocarbon chains.­­ These properties have led to widespread use of plastics particularly in single-use packaging. Unfortunately, this has caused a global plastic pollution problem with between 5 and 13 million tonnes of plastic entering our oceans annually. Microplastics are particles of submillimetre plastic. Microplastics have particular biological importance as they can enter and sequester in organs and can be taken up by individual cells, where they are having, as yet, unknown health consequences. Primary microplastics are manufactured in the micron-scale such as microfiber clothing, while secondary microplastics are macroplastics that are broken down into micron-scale plastics through exposure to U.V. light, erosion and digestive fragmentation. My research has discovered that microplastic can activate the inflammatory receptors resulting in inflammation. This depends on both the composition of the plastic and the size of the microplastic. However, the health consequences of this are not known. Our research is investigating the health consequences of microplastic exposure in both humans and wildlife. Using cell culture we can investigate the consequences of plastic exposure to human cells and we can also look at pathology that occurs in wildlife that are exposed to high plastic due to their eating habits. These methods help us understand the microplastics problem on a global scale from wildlife to humans.

Email me if you are interested in doing a PhD with me and the wonderful Dr. Jenn Lavers (IMAS, UTAS) and the brilliant Dr. Alex Bond (Natural History Museum, UK), on this topic.

[email protected]

Plastics bird data

Figure. A) In cell culture models microplastics were inflammatory and this response differed between polymer composition. B and C) In seabirds, plastic ingestion was associated with significantly increased cholesterol (B) and uric acid (C) in the plasma. D) Seabirds fatally ingest macroplastics, but the sublethal effects of microplastics are unknown (Scale bars 10µm) (Lavers et al. 2014).