R Code



How to make a Rain Cloud Plot (aka a Rotated Violin plot)

These are great plots. They look good and do not hide any information. They give quick insights into the data: its spread, any differences between groups, and the rough distributions. It is important to choose a graph that matches the statistical test conducted. For instance, if you show mean +/- SEM or SD, you should have performed a parametric test, like an ANOVA. On the other hand, if you display medians and quartiles (a box plot), you should have used a non-parametric test, like a Mann-Whitney test. The Rain Cloud plot displays all of this information, making it suitable for any type of statistical test. Their main limitation is that they only really work with a sufficiently large sample size (N). If you have an N of less than 8 in each group, it is advisable to skip the distribution part of the graph (a small-N variant is sketched at the end of this section). Below is the code.

#Load the packages
install.packages("PupillometryR")
install.packages("ggplot2")
install.packages("viridis")   #needed for the viridis() colour palette below
library("PupillometryR")
library("ggplot2")
library("viridis")


#Load the data
example<- read.table(url("https://jackauty.com/wp-content/uploads/2021/02/barchart3.txt"), header=T) 


#Editing the variables to work with the graphs
##Making Treatment a factor 
example$Treatment<-as.factor(example$Treatment) 

# Generating a color palette named "colours" with four colors from the "viridis" palette
colours <- viridis(4, begin = 0.25, end = 0.75, alpha = 0.5)

#Reordering the factors to make sense (from bottom to top)
example$Treatment<-factor(example$Treatment, levels = c("Disease_Drug","Disease","Drug","Control")) 

# Creating a ggplot object called "rain_cloud_plot" using the data frame "example"

rain_cloud_plot <- ggplot(example, aes(Treatment, Dependent_Variable, fill = Treatment, col = Treatment)) +

# Adding a flat violin plot with some horizontal position adjustment and transparency
  geom_flat_violin(position = position_nudge(x = 0.1, y = 0), alpha = 0.8, width = 0.5) +

# Adding points to the plot with jittering and dodging for better visualization, along with some transparency
  geom_point(position = position_jitterdodge(dodge.width = 0.15, jitter.width = 0.15), size = 2, alpha = 0.6) +

# Adding a box plot with black color, no outliers, some transparency, and horizontal position adjustment
  geom_boxplot(color = "black", width = 0.1, outlier.shape = NA, alpha = 0.5, position = position_nudge(x = -0.2, y = 0)) +

# Flipping the x and y axes (transposing) to create a horizontal plot
  coord_flip() +

# Setting the x and y axis labels and adjusting the y-axis scale to a square root transformation
  labs(x = "", y = "Relative levels") +
  scale_y_continuous(trans = "sqrt") +

# Applying a classic theme to the plot
  theme_classic() +

# Removing the legend from the plot
  theme(legend.position = "none") +

# Adjusting the font size for the text elements in the plot
  theme(text = element_text(size = 14)) +

# Setting the color palette for the fill aesthetic
  scale_fill_manual(values = colours) +

# Setting the color palette for the color aesthetic
  scale_colour_manual(values = colours)+
# Relabel x axis
  scale_x_discrete(labels=c("Disease + Drug", "Disease","Drug", "Control"))

# Displaying the plot
rain_cloud_plot
Just note this plot has a transformed axis. That is one of the nice things about a rain cloud plot: you can demonstrate the efficacy of a transformation to create normality. (Obviously this example only has an N of 3, so that doesn’t really make sense, but the code is there if you need it.)
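As mentioned at the top, with fewer than about 8 points per group the distribution part isn’t really informative. A minimal variant (my sketch, using the same example data and colours from above) simply drops the geom_flat_violin() layer and keeps the points and box plot:

#Small-N variant: the same plot without the violin "cloud"
small_n_plot <- ggplot(example, aes(Treatment, Dependent_Variable, fill = Treatment, col = Treatment)) +
  geom_point(position = position_jitterdodge(dodge.width = 0.15, jitter.width = 0.15), size = 2, alpha = 0.6) +
  geom_boxplot(color = "black", width = 0.1, outlier.shape = NA, alpha = 0.5) +
  coord_flip() +
  labs(x = "", y = "Relative levels") +
  theme_classic() +
  theme(legend.position = "none") +
  scale_fill_manual(values = colours) +
  scale_colour_manual(values = colours)

small_n_plot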

A complete guide to ANOVAs in R

This again is more for my use than anyone else’s. The focus of this post is doing ANOVAs (or linear models, after all everything is a linear model) in a nice readable and publishable way, with transformations, post-hocs and graphs. Mostly for appearance reasons, this process uses lots of packages. Packages should normally be avoided, but one thing they’re good at is making things pretty. So that’s what they’re used for here.

Install packages

Note that for this I’ve put the if(!require()) function in. This means a package is only installed if it isn’t already available. This just saves time and allows you to run all your code without the delay of reinstalling packages.

if(!require(MASS)) install.packages("MASS")
if(!require(sjPlot)) install.packages("sjPlot")
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(emmeans)) install.packages("emmeans")
if(!require(viridis)) install.packages("viridis")
if (!require("remotes")) install.packages("remotes")
remotes::install_github('sinhrks/ggfortify') 

library(ggplot2); library(sjPlot); library(ggfortify); library(MASS); library(emmeans); 

Load the data

Here we are just loading the example data set and having a look. The data has two explanatory variables, Sex and Drug, with one dependent variable. So we could imagine this study is testing the response to a drug or placebo in males and females, just to see if sex affects how individuals respond to the drug. This is a classic two-way ANOVA design. Two factors -> Sex and Drug.

ANOVA_example<-read.table(url("https://jackauty.com/wp-content/uploads/2022/11/ANOVA_example2.txt"), header =T)
View(ANOVA_example)
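One quick extra check (my addition): a cross-tab of the two factors confirms the two-way design and the group sizes, using nothing but base R.

#Check the design: how many observations per Sex x Drug cell?
table(ANOVA_example$Sex, ANOVA_example$Drug)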

Quick graph

I would always recommend doing a quick and ugly graph just to orientate yourself with the data.

stripchart(Dependent~Sex*Drug, data=ANOVA_example, pch=16, vertical =T)
I told you it was ugly! But we can see that females don’t respond to the drug, while males start high and then drop down in response to the drug.

Run the base ANOVA

Here we are running the base ANOVA (linear model) and then checking the assumptions. It’s good to check the assumptions before seeing if you have any significance. That way you aren’t tempted to go “hey it’s significant, let’s not do any transformations”.

What we can see in the diagnostic plots is that a transformation is needed.

The top left plot is residuals vs fitted. This tests for homoscedasticity: what we want is an even scattering of data, but what we see is that the data get more variable as the fitted values increase, i.e. the variation increases as the mean increases.

Next in the top right we have a Q-Q plot. This plots the residuals vs a theoretical normal distribution (dotted line). The dots should be on the line. We can see the residuals are not normally distributed as they depart from the dotted line at the extremes.

Lastly, we have Cook's distance. This is a measure of data with extreme influence and effect on the model (aka outliers). Now the rule of thumb is you don't want a Cook's distance of more than 4/n. This study has n=10, so more than 0.4 is bad. However, there's not much you can do about outliers in experimental biology. The only thing would be to go back and look at your lab book. You cannot just remove data you don't like, but if you go back and your lab book says "I think I gave the wrong drug to this individual", or something, THEN you have grounds to remove it.

Note: Using ggfortify’s autoplot function makes them look nice and automatically lays them out in a grid (no need for par(mfrow=c(2,2)) like in base R).

#Running base ANOVA
aov_base<-lm(Dependent~Sex*Drug, data=ANOVA_example)

#Diagnostic plots
autoplot(aov_base,which = c(1,2,4))
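If you want numbers to go with the Cook's distance panel, here is a small companion check (my addition) using the 4/n rule of thumb discussed above; base R's cooks.distance() works directly on the lm object.

#Flag points that cross the 4/n rule of thumb
cd <- cooks.distance(aov_base)
n <- 10   #the post uses n = 10 for this data set; adjust to however you count n
which(cd > 4/n)   #row numbers worth checking against your lab book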


BoxCox transformation

So you could just have a guess at the transformation needed. The wedge-shaped residuals vs fitted plot above tells us we need a shrinking transformation like a LOG10 or square root. However, it is better if we can take the researcher out of the process. What if the LOG10 and square root have similar diagnostic plots, but the LOG10 gives you significant results and the square root transformation gives you p=0.051? You might be tempted to choose the LOG10 transformation because it gives you significance.

What we can do is a BoxCox transformation. This is an automated process that transforms your data by a range of powers and evaluates the normality of the residuals. It then chooses the power (Selected.lambda) that produces the most normally distributed residuals. Typically, when you solve normality, you also solve homoscedasticity.

 #BoxCox transformation 
boxcox<-boxcox(aov_base,lambda = seq(-5, 5, 1/1000), plotit = TRUE )

Selected.lambda <-boxcox$x[boxcox$y==max(boxcox$y)]
Selected.lambda

#Run the transformed model
aov_base_trans<-lm(Dependent^Selected.lambda ~Sex*Drug, data=ANOVA_example)

Assumption checking the transformed model

And we can see running the diagnostic plots, the BoxCox transformation has totally worked. Those look perfect.

 #Diagnostic plots autoplot
autoplot(aov_base_trans, which = c(1,2,4))

Looking at your final model

Using the tab_model function from sjPlot makes your summary tables pretty. This looks at your analysis as a linear model, so it gives you the coefficients. This is nice, because it allows you to see the magnitude and direction of the effects (ANOVA summaries do not). So we can see everything is significant, and the large positive coefficient of the interaction term tells us that males respond to the drug more than females do. Note that the direction of the estimates is a little tricky to read because of the transformation: the lambda is negative, so we actually have to flip the direction of each estimate, and the positive coefficient for the interaction term therefore tells us that males show a large decrease in response to the drug.

 #Summary of the linear model gives you the coefficients 
tab_model(aov_base_trans)
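If you want group means back on the original measurement scale (handy when the flipped directions above get confusing), a rough sketch (my addition) is to take the estimated marginal means on the transformed scale and undo the power transformation by hand; it assumes aov_base_trans and Selected.lambda from above, and emmeans doesn't know about the manual transformation, so the back-transform is done manually.

#Estimated marginal means on the transformed scale, then back-transformed by hand
em <- emmeans(aov_base_trans, ~ Sex*Drug)
em_table <- summary(em)
em_table$emmean_original_scale <- em_table$emmean^(1/Selected.lambda)
em_table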

However, sometimes you might have many levels within a factor. For example, you could have a low dose, medium dose and high dose of a drug, and you don’t want to see the effect of each level; you want to know whether the drug overall has an effect (explains variation in the data). For this an ANOVA table is more useful. From this we can see there is a significant main effect of Sex, a significant main effect of Drug and a significant interaction between the two. Again, using tab_df from sjPlot makes the table look pretty.

 #ANOVA of the linear model
tab_df(anova(aov_base_trans),show.rownames =T)

Post hoc analysis

OK this is a Post-hoc analysis with a priori selected comparisons. We don’t want to do all the comparisons, because some are useless. Why compare Females on Drug with Males on Placebo? That doesn’t make sense. So in this, we select our comparisons and adjust for only those comparisons.

#Posthoc/Planned contrasts 
post <- lsmeans(aov_base_trans, pairwise ~ Sex*Drug, adjust="none")

posthoc<-summary(post)

tab_df(posthoc$contrasts[,c(1,4)],show.rownames =T)

This code gives us a list of potential comparisons. We then select the ones we want, in this case 1, 2, 5 and 6. We can see that comparing Male Drug vs Female Placebo, or Male Placebo vs Female Drug, really does not make any biological sense. So we don’t want to do these needless comparisons that cost us statistical power.

#Build a dataframe of all the comparisons (contrasts) and the unadjusted p values
p.value<-data.frame(cbind(posthoc$contrasts["contrast"],posthoc$contrasts["p.value"]))

#Select the comparison you are interested in
comparisons<-c(1,2,5,6)
ptable<-p.value[comparisons,]

#Create a column of adjusted p values using the Holm method
ptable$padj<-p.adjust(ptable$p.value, method="holm")

#View the table in a pretty way
tab_df(ptable,digits = 4)

Graphing your results

Now the first thing we need to do is manipulate our data a little. R will automatically order factors alphabetically, so Drug will be first (left side) on the graph, then Placebo. By convention, Placebo and Controls are always first (left). We also need to make a single column that says which group each data point belongs to. So here’s the code for this data manipulation.

#Editing the variables to work with the graphs
##Making Drug a factor 
ANOVA_example$Drug<-as.factor(ANOVA_example$Drug) 
##Reorder the factor so Placebo is first
ANOVA_example$Drug<-factor(ANOVA_example$Drug, levels = c("Placebo", "Drug")) 

##Making a grouping column that describes the data
ANOVA_example$Group<-as.factor(paste(ANOVA_example$Sex,ANOVA_example$Drug, sep= " ")) 

##Reordering the factors to make sense
ANOVA_example$Group<-factor(ANOVA_example$Group, levels = levels(ANOVA_example$Group)[c(2,1,4,3)]) 


Making the graph

So here are all the finicky details for making the graph. I like showing the points; I think it’s important to have the raw data there. Also, it’s important to use the scale_fill_viridis_d function, as this makes the plot colour-blind friendly.

bar_plot <- ggplot(ANOVA_example, aes(Sex, Dependent, fill = Drug))+  
#Make bars using the mean   
geom_bar(position = 'dodge', stat = 'summary', fun = 'mean', alpha=0.6, colour=1) +   

#Make error bars using SEM   
geom_errorbar(stat = 'summary',fun.data= "mean_se", position=position_dodge(.9), width = 0.075, colour=1, size=1) +   

#Add data points   
geom_point(aes(group = Group), shape = 21, position = position_dodge(0.9), alpha= 0.25, stroke = NA, size=3, fill =1)+      

#Remove background colour and gridlines   
theme_classic()+      

#Ensure axes join at 0,0 and set the y axis limits   
scale_y_continuous(expand = c(0, 0), limits = c(0,1.2*max(ANOVA_example$Dependent)))+        

#Make axis lines thicker   
theme(axis.line = element_line(size=1)) +      

#Label Axis   
labs(x = "Sex", y="Dependent variable (unit)")+       


#Adjust axis font sizes and font colour   
theme(axis.title = element_text(size = 15), axis.text.x = element_text(size=12, color="black"),          axis.text.y = element_text(size=12, color="black")) +       

#Make ticks black   
theme(axis.ticks=element_line(color="black", lineend = "square")) +      

#Fill viridis   
scale_fill_viridis_d(begin=0.25, end=0.75, alpha=0.5)  

#View plot 
bar_plot   

Saving the graph

Now I like to save as a PDF. This is a vector image which can be opened in Illustrator. High-res images can be taken of the PDF by zooming in and taking a snapshot (it sounds crazy but it works). Also, the pdf function works with everything: base R graphics, Plotly, ggplot2 -> everything. So it’s easy.

#Save as PDF 
pdf("example.pdf", height=4, width=4.5)
bar_plot
dev.off()
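If you'd rather skip the zoom-and-snapshot step, the same device approach gives you a high-resolution PNG directly (a small sketch, my addition; png() is base R, and the units/res arguments set the size in inches and the dpi):

#Save as a 600 dpi PNG instead
png("example.png", height=4, width=4.5, units="in", res=600)
bar_plot
dev.off()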

Reference list

The last thing I like to do is make a nice reference list. This can be finicky with GitHub packages. But it’s easy with CRAN packages.

#Making a reference list 
pkg <- c("MASS","sjPlot","ggplot2","emmeans","ggfortify")

#Create a loop to build a dataframe of the references
citation_list<-factor()

for(i in pkg){ citation_list<- c(citation_list,paste(capture.output(print(citation(i), style ="text", bibtex=F)), collapse="\n")) }

#Display the dataframe nicely with sjPlot's tab_df
tab_df( data.frame(citation_list), col.header="References", sep=" ")

The full code all in one

Just if you want an easy to copy and paste full code set. Here you go.


if(!require(MASS)) install.packages("MASS")
if(!require(sjPlot)) install.packages("sjPlot")
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(emmeans)) install.packages("emmeans")
if(!require(viridis)) install.packages("viridis")
if (!require("remotes")) install.packages("remotes")
remotes::install_github('sinhrks/ggfortify') 


library(ggplot2); library(sjPlot); library(ggfortify); library(MASS); library(emmeans); 

ANOVA_example<-read.table(url("https://jackauty.com/wp-content/uploads/2022/11/ANOVA_example2.txt"), header =T)

View(ANOVA_example)

stripchart(Dependent~Sex*Drug, data=ANOVA_example, pch=16, vertical =T)

#Running base ANOVA
aov_base<-lm(Dependent~Sex*Drug, data=ANOVA_example)

#Diagnostic plots 
autoplot(aov_base,which = c(1,2,4))

#BoxCox transformation 
boxcox<-boxcox(aov_base,lambda = seq(-5, 5, 1/1000), plotit = TRUE ) 

Selected.lambda <-boxcox$x[boxcox$y==max(boxcox$y)] 
Selected.lambda 


#Run the transformed model 
aov_base_trans<-lm(Dependent^Selected.lambda ~Sex*Drug, data=ANOVA_example) 


#Diagnostic plots autoplot
autoplot(aov_base_trans, which = c(1,2,4))

#Summary of the linear model gives you the coefficients 
tab_model(aov_base_trans) 

#ANOVA of the linear model
tab_df(anova(aov_base_trans),show.rownames =T) 

#Posthoc/Planned contrasts 
post <- lsmeans(aov_base_trans, pairwise ~ Sex*Drug, adjust="none") 
posthoc<-summary(post) 

tab_df(posthoc$contrasts[,c(1,4)],show.rownames =T)

#Build a dataframe of all the comparisons (contrasts) and the unadjusted p values
p.value<-data.frame(cbind(posthoc$contrasts["contrast"],posthoc$contrasts["p.value"]))

#Select the comparison you are interested in
comparisons<-c(1,2,5,6)
ptable<-p.value[comparisons,]

#Create a column of adjusted p values using the Holm method
ptable$padj<-p.adjust(ptable$p.value, method="holm") 

#View the table in a pretty way
tab_df(ptable,digits = 4)  

##Making a grouping column that describes the data
ANOVA_example$Group<-as.factor(paste(ANOVA_example$Sex,ANOVA_example$Drug, sep= " ")) 

##Reording the factors to make sense
ANOVA_example$Group<-factor(ANOVA_example$Group, levels = levels(ANOVA_example$Group)[c(2,1,4,3)])
ANOVA_example$Drug<-as.factor(ANOVA_example$Drug)
ANOVA_example$Drug<-factor(ANOVA_example$Drug, levels = levels(ANOVA_example$Drug)[c(2,1)])


bar_plot <- ggplot(ANOVA_example, aes(Sex, Dependent, fill = Drug))+  
  #Make bars using the mean   
  geom_bar(position = 'dodge', stat = 'summary', fun = 'mean', alpha=0.6, colour=1) +   
  
  #Make error bars using SEM   
  geom_errorbar(stat = 'summary',fun.data= "mean_se", position=position_dodge(.9), width = 0.075, colour=1, size=1) +   
  
  #Add data points   
  geom_point(aes(group = Group), shape = 21, position = position_dodge(0.9), alpha= 0.25, stroke = NA, size=3, fill =1)+      
  
  #Remove background colour and gridlines   
  theme_classic()+      
  
  #Ensure axes join at 0,0 and set the y axis limits   
  scale_y_continuous(expand = c(0, 0), limits = c(0,1.2*max(ANOVA_example$Dependent)))+        
  
  #Make axis lines thicker   
  theme(axis.line = element_line(size=1)) +      
  
  #Label Axis   
  labs(x = "Sex", y="Dependent variable (unit)")+       
  
  #Adjust axis font sizes and font colour   
  theme(axis.title = element_text(size = 15), axis.text.x = element_text(size=12, color="black"),          axis.text.y = element_text(size=12, color="black")) +       
  
  #Make ticks black   
  theme(axis.ticks=element_line(color="black", lineend = "square")) +      
  
  #Fill viridis   
  scale_fill_viridis_d(begin=0.25, end=0.75, alpha=0.5)  

#View plot 
bar_plot   

#Save as PDF 
pdf("example22.pdf", height=4, width=4.5) 
bar_plot
dev.off() 


#Making a reference list 
pkg <- c("MASS","sjPlot","ggplot2","emmeans","ggfortify") 

#Create a loop to build a dataframe of the references
citation_list<-factor() 

for(i in pkg){   citation_list<- c(citation_list,paste(capture.output(print(citation(i), style ="text", bibtex=F)), collapse="\n")) }

#Display the dataframe nicely with sjPlot's tab_df
tab_df(   data.frame(citation_list), col.header="References", sep=" ") 



How to make a simple bar graph in R

OK, I’ll be honest, this post is mostly for me. I always forget how to do all of this and end up spending hours on google. So now I can do what all coders do, copy – paste – adjust.

So here we go -> how to do a simple bar chart with SEMs, that looks publication quality.

I feel like the blog is a little like this meme…. sorry.

Load the data and look at it to understand it and make sure it makes sense. This data is very simple.

#Obviously you'll need ggplot2
library(ggplot2)

#Load and look at the data
example<- read.table(url("https://jackauty.com/wp-content/uploads/2021/02/barchart3.txt"), header=T)
example
Barchart data

##Reordering the factors to make sense
example$Treatment<-as.factor(example$Treatment)
example$Treatment<-factor(example$Treatment, levels = levels(example$Treatment)[c(1,4,2,3)])

#Create the plot and do all the annoying adjustments (draw the rest of the owl lol)

bar_plot <- ggplot(example, aes(Treatment, Dependent_Variable, fill = Treatment))+  
  #Make bars using the mean   
  geom_bar(position = 'dodge', stat = 'summary', fun = 'mean', alpha=0.6, colour=1) +   
  
  #Make error bars using SEM   
  geom_errorbar(stat = 'summary',fun.data= "mean_se", position=position_dodge(.9), width = 0.075, colour=1, size=1) +   
  
  #Add data points   
  geom_point( shape = 21, position = position_dodge(0.9), alpha= 0.25, stroke = NA, size=3, fill =1)+      
  
  #Remove background colour and gridlines   
  theme_classic()+      
  
  #Ensure axes join at 0,0 and set the y axis limits   
  scale_y_continuous(expand = c(0, 0), limits = c(0,1.2*max(example$Dependent_Variable)))+ 
  
 
  # Relabel x axis
  scale_x_discrete(labels=c("Control","Drug", "Disease", "Disease + Drug"))+
  
  #Make axis lines thicker   
  theme(axis.line = element_line(linewidth=1)) +      
  
  #Label Axis   
  labs(x = "", y="Dependent variable (unit)")+       
  
  
  #Adjust axis font sizes and font colour   
  theme(axis.title = element_text(size = 15), axis.text.x = element_text(size=12, color="black"), 
  axis.text.y = element_text(size=12, color="black")) +       
  
  #Make ticks black   
  theme(axis.ticks=element_line(color="black", lineend = "square")) +      
  
  #Fill viridis   
  scale_fill_viridis_d(begin=0.25, end=0.75, alpha=0.5)  + 
  
  #Remove legend
  theme(legend.position = "none")


#View plot 
bar_plot   
 

Output it as a PDF

#PDF Plot
ggsave("SuperCoolBarchart.pdf", device="pdf")

AND HERE IT IS…… MY GOSH IT’S…. boring but publishable.

Bonus stripchart fun!

This is great for a very quick plot that looks OK

#Create plot
par(mar = c(9, 5, 2, 2),bty = 'n')
stripchart(example$Dependent_Variable~example$Treatment, vertical = T, pch=16, 
 method="jitter", las=2, col="#4299D0",ylim=c(0, 1.2*max(example$Dependent_Variable)),
 xaxs = "i",yaxs = "i", ylab=expression(paste("IL-1", beta, " (pg/ml)")))
abline(0,0)

#Also in here is how to put beta (β) into your graph

Bonus boxplot fun!

This is great for a very quick plot that looks OK

#Export as pdf
pdf("boxplot.pdf")

#Increase margins
par(mar = c(9, 5, 2, 2),bty = 'n')

#Set up text labels
labs<-as.character(unique(example$Treatment))

#Make plot
boxplot(example$Dependent_Variable~example$Treatment, xlab="",
 las=2, col="#4299D066", ylim=c(0, 1.2*max(example$Dependent_Variable)), axes=F,
 ylab=expression(paste("IL-1", beta, " (pg/ml)")))

#Add axis markers, labels and lines
axis(side = 1,at=c(1:length(labs)), labels = c(rep("", length(labs))))
text(cex=1, x=c(1:length(labs)), y=-0.75, labs, xpd=TRUE, srt=45, adj=1)
axis(side = 2)
box(bty="l", lwd=2)

#Turn off PDF device
dev.off()

#Also in here is how to put beta (β) into your graph
#Also in here is how to export as a PDF and how to rotate your labels 45 degrees

Mediation analysis with nuisance variables.

Mediation analysis is a simple idea that is easy to do, but an absolute minefield to interpret correctly. The principle is easy – perhaps one explanatory variable correlates with the dependent variable by mediating (affecting) another explanatory variable. So say you find that bullet wounds correlate with death and blood loss correlates with death; you could do a mediation analysis to see if bullet wounds correlate with death by affecting blood loss – and you would find that they do (sorry about the gruesome example). One absolute NO NO for mediation analyses is if the direction of causation could be the other way, i.e. the dependent variable causes the explanatory variables. All mediation analysis breaks down at this point, so don’t do it (cough – like this paper did – https://pubmed.ncbi.nlm.nih.gov/29567761/ – cough). For example, if you modelled bullet wounds as the dependent variable and explained it with death and blood loss as a mediating factor, you would find that death causes bullet wounds directly and indirectly through blood loss. This obviously doesn’t make sense (this is different to simple correlation analyses, which don’t care about the direction of causality).

In this example we are seeing whether the age of a patient correlates with inflammation through the variable bacterial count. The idea being that being old not only directly causes inflammation, but perhaps also makes you more susceptible to bacterial growth, leading to larger numbers of bacteria in your blood, which in turn leads to more inflammation. Now to make this even more complicated (just because that is how data is), what if you wanted to get rid of some nuisance variables first, before looking at your explanatory variables of interest? Wow, OK, here we go.

 

First load the data and packages. Make sure the data makes sense (you should always do this), e.g. are the numerical variables numerical?

#Load package
library(MASS)
library("mediation")
#Load data
dat<- read.table(url("https://jackauty.com/wp-content/uploads/2020/05/Stepwise-regression-data.txt"),header=T)

#Check data, make sure it makes sense
str(dat)

'data.frame': 100 obs. of 8 variables:
 $ Inflammation : num 4.4 1218.07 5.32 3072.58 108.86 ...
 $ Age : num 7.16 8.88 3.23 7.91 7.98 2.93 9.12 9.41 7.71 2.91 ...
 $ Sex : Factor w/ 2 levels "F","M": 2 1 1 1 2 1 1 2 1 1 ...
 $ Blood.pressure : num 151 145 104 156 143 159 114 140 149 114 ...
 $ Infection : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 1 1 2 1 ...
 $ Bacterial.count : num 75 5718 27 4439 224 ...

Next, check if the dependent variable is normally distributed. It doesn’t have to be, but things will go smoother if it is. In this example the dependent variable distribution was improved by a square-root transformation. So I created a new transformed variable.

#Check dependent variable, ideally it's normally distributed. Transform if needed
hist(dat$Inflammation, breaks=6)

Skewed data

hist(sqrt(dat$Inflammation), breaks=6)
Better data

#Square root looks better, so let's create a variable to use in our model
dat$Inflammation.transformed<-sqrt(dat$Inflammation)

Next let’s build a model with the nuisance variables (Sex and Blood Pressure) and extract the standardized residuals into a new column of our dataframe. Standardized residuals are the leftover variation after accounting for the nuisance variables. This residual variation will then hopefully be mostly explained by our variables of interest. Check the assumptions and transform if needed.

#Set up adjusted model
adjusted.model<-lm(Inflammation.transformed ~ Sex*Blood.pressure, data = dat)
summary(adjusted.model)

#Check Assumptions
par(mfrow=c(2,2))
plot(adjusted.model)

Adjusted model

#Extract the standardized residuals and turn them into a column in your data
dat$residuals<-stdres(adjusted.model)

#Check your data
dat[1:20,]

Next let’s build our base models with our two explanatory variables. First with our primary explanatory variable (Age), then with our primary explanatory variable and our mediator variable (Bacterial count). Then check your assumptions. Here it looks like we need to transform our explanatory variable of bacterial count. Often things like bacterial count need a log transformation and it worked!

#Build base models

fit.direct.model<-lm(residuals~Age, data=dat)
fit.mediator.variable.model<-lm(residuals~Age+Bacterial.count, data=dat)
summary(fit.direct.model)
summary(fit.mediator.variable.model)

#Assumption check
par(mfrow=c(2,2)) 
plot(fit.direct.model)
plot(fit.mediator.variable.model)

Fitted base model
#Top left you can clearly see the relationship isn't linear.

#Transform predicting variable
dat$log.bacterial.count<-log(dat$Bacterial.count)

fit.direct.model<-lm(residuals~Age, data=dat)
fit.mediator.variable.model<-lm(residuals~Age + log.bacterial.count, data=dat)

summary(fit.direct.model)
summary(fit.mediator.variable.model)

#Assumption check
par(mfrow=c(2,2)) 

plot(fit.direct.model)
plot(fit.mediator.variable.model)

Fitted base model transformed
#NNNIIIICCCEEEEE

Now we need to centre and scale the variables. This is important as coefficients are only comparable when the explanatory variables are on the same scale. In this example you might get a small coefficient for bacterial counts, because bacterial counts are in the millions, and you might get a large coefficient for Age because that maxes out at 100 (ish). But if you standardize both variables (turning them into Z scores), then they become comparable as they’re on the same scale.

#Centering variables
dat$log.bact.centred<-(dat$log.bacterial.count-mean(dat$log.bacterial.count))/sd(dat$log.bacterial.count)
dat$Age.centred<-(dat$Age-mean(dat$Age))/sd(dat$Age)
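For what it's worth, base R's scale() does the same centring and scaling in one call; this is just an equivalent alternative to the two lines above.

#Equivalent using scale() (centres and divides by the SD)
dat$log.bact.centred <- as.numeric(scale(dat$log.bacterial.count))
dat$Age.centred <- as.numeric(scale(dat$Age))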

Next we get to run our mediation analysis and boomtown, we get a significant effect of Age on inflammation and a mediation effect of Age on inflammation through the effect of age on bacterial count. We ran it again with a bootstrap method, which is more robust and less susceptible to a lack of normality in the residuals, and we got the same result. Our results show that more than 50% of the effect of Age on inflammation can be explained by bacterial count. This suggests that being elderly increases the risk of a bacterial infection, and this explains a good proportion of why the elderly have more inflammation (note this data is made up).

#Running models with centred variables
fit.direct.model.centred<-lm(residuals~Age.centred, data=dat)
fit.mediator.variable.model.centred<-lm(residuals~Age.centred + log.bact.centred, data=dat)



#Run mediation analysis
mediation<-mediate(fit.direct.model.centred, fit.mediator.variable.model.centred, treat="Age.centred", mediator = "log.bact.centred")

summary(mediation)

Model summary

#Bootstrap method for extra robustness
boot.mediation<-mediate(fit.direct.model.centred, fit.mediator.variable.model.centred, treat="Age.centred", mediator = "log.bact.centred", boot=T, sims=5000)

summary(boot.mediation)
boot model summary

Stepwise Multiple Regression

Often you have a truckload of potential explanatory variables that all might interact with each other, giving a multitude of potential ways the explanatory variables could relate to the dependent variable. You could painstakingly create every possible model, or you could do a step-wise regression. Step-wise regression automatically adds and subtracts variables from your model and checks whether that improves the model, using AIC as the selection criterion (AIC is a relative score of goodness).

Below is how I would do it. In this example we are trying to explain inflammation with a number of variables such as Age, Sex, Bacterial Count etc.

First load the data and packages. Make sure the data makes sense (you should always do this), e.g. are the numerical variables numerical?

#Load package
library(MASS)

#Load data
dat<- read.table(url("https://jackauty.com/wp-content/uploads/2020/05/Stepwise-regression-data.txt"),header=T)

#Check data, make sure it makes sense
str(dat)

'data.frame': 100 obs. of 8 variables:
 $ Inflammation : num 4.4 1218.07 5.32 3072.58 108.86 ...
 $ Age : num 7.16 8.88 3.23 7.91 7.98 2.93 9.12 9.41 7.71 2.91 ...
 $ Sex : Factor w/ 2 levels "F","M": 2 1 1 1 2 1 1 2 1 1 ...
 $ Blood.pressure : num 151 145 104 156 143 159 114 140 149 114 ...
 $ Infection : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 1 1 2 1 ...
 $ Bacterial.count : num 75 5718 27 4439 224 ...
 $ Inflammation.transformed : num 2.1 34.9 2.31 55.43 10.43 ...
 $ inflammation.boxcox.transformed: num 2.43 71.02 2.73 123.74 16.68 ...

Next, check if the dependent variable is normally distributed. It doesn’t have to be, but things will go smoother if it is. In this example the dependent variable distribution was improved by a square-root transformation. So I created a new transformed variable.

#Check dependent variable, ideally it's normally distributed. Transform if needed
hist(dat$Inflammation, breaks=6)

Skewed data

hist(sqrt(dat$Inflammation), breaks=6)
Better data

#Square root looks better, so let's create a variable to use in our model
dat$Inflammation.transformed<-sqrt(dat$Inflammation)

Next let’s build the full model. The full model is a model with all the explanatory variables interacting with each other! Then we run a step-wise regression in both directions (stepping forward and backward).

#Set up full model
full.model<-lm(Inflammation.transformed ~ Age*Sex*Blood.pressure*Infection*Bacterial.count, data = dat)
summary(full.model)

#Full model is crazy (note red notes generated in paint :))

#Perform stepwise regression in both directions
step.model<-stepAIC(full.model, direction="both", trace = F)

summary(step.model)

Slimmed model

Now given the marginal significance of the three way interaction term, and the difficulty in interpreting three-way interaction terms, I’d be tempted to make a model without that term and check if there is a substantial difference in the model (using AIC or adjusted Rsquared). If the difference is minor, I’d go with the more parsimonious model.
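Here is a small sketch of that check (my addition). The name of the three-way term below is a placeholder (swap in whichever three-way interaction your step.model actually kept), and the comparison is just AIC plus adjusted R-squared side by side.

#Compare the stepwise model with one that drops the three-way interaction term
reduced.model <- update(step.model, . ~ . - Age:Sex:Bacterial.count)   #placeholder term name
AIC(step.model, reduced.model)
summary(step.model)$adj.r.squared
summary(reduced.model)$adj.r.squared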

One last step! Check that your residuals are homoscedastic, normally distributed, and that there are no major outliers.

#Make it so you can see four graphs in one plot area
par(mfrow=c(2,2)) 

#Plot your assumption graphs
plot(step.model)

Assumptions




Top and bottom left – nice and flat. Homoscedastic, with no relationship between the fitted value and the variability. Nice

Bottom right – no outliers (you’d see dots outside of the dotted line if there were any)

Top right – ideally all the dots are on the line (normally distributed). So let’s try a better transformation of our dependent variable using a BoxCox transformation!

Box-Cox transformation!

The Box-Cox transformation is where the Box-Cox algorithm works out the best transformation to achieve normality.

First run the stepwise model as above but with the un-transformed variable. Then run the final model through the Box-Cox algorithm, setting the potential lambda (power) to between -5 and 5. Lambda is the power that you are going to raise your dependent variable to.

The Box-Cox algorithm will essentially run the model and score the normality over and over again, each time transforming the dependent variable to every power between -5 and 5 in steps of 0.001. The optimal power is where the normality reaches its maximum. So you then create the variable “Selected.Power” where the normality score was at its max. You then raise your dependent variable to that power and run your models again!

#BoxCox transformation
full.model.untransformed<-lm(Inflammation ~ Age*Sex*Blood.pressure*Infection*Bacterial.count, data = dat)
step.model.untransformed<-stepAIC(full.model.untransformed, direction="both", trace = F)

boxcox<-boxcox(step.model.untransformed,lambda = seq(-5, 5, 1/1000),plotit = TRUE )
Selected.Power<-boxcox$x[boxcox$y==max(boxcox$y)]
Selected.Power
> 0.536

dat$inflammation.boxcox.transformed<-dat$Inflammation^Selected.Power

The Box-Cox algorithm chose 0.536, which is really close to a square-root transformation. But this minor difference does affect the distribution of the residuals.

 

#Running again with BoxCox transformed variable
#Set up full model
full.model<-lm(inflammation.boxcox.transformed ~ Age*Sex*Blood.pressure*Infection*Bacterial.count, data = dat)

#Perform stepwise regression in both directions
step.model<-stepAIC(full.model, direction="both", trace = F)

summary(step.model)


#Checking assumptions 
par(mfrow=c(2,2)) 
plot(step.model)
BoxCox graph

Top right – Yes that is pretty good. Normality achieved!

Top and bottom left – Ok, not great. Things get a bit more variable as the predicted values go up. Guess you can’t win them all.

Bottom right – no outliers beyond the dotted line (1). Though one is close.

Generalized mixed modelling

Sometimes you know your data can’t follow a normal distribution even after transformation – such as count data or yes/no data. If this is the case, then generalized mixed modelling is the way to go. Basically it’s very similar to general linear modelling, which assumes normality of the population distribution, but instead it assumes a different underlying distribution (and it models using maximum likelihood, not minimizing the residuals). For this analysis we have count data (the number of neutrophils in the brain) and a random effect of siblings. Siblings are treated as a random effect because, with experiments involving large numbers that come from one parent pair (like fish or fly data), we tend to think that the siblings aren’t independent and perhaps the parents are affecting the outcome. Therefore, we model this lack of independence by treating siblings as a random effect. In fish data we call a bunch of siblings a “clutch”.

To do the analysis, first select the best model. For count data Poisson is the obvious choice, but there are several potential link functions (similar to transformations). So we must select the optimal link function for the Poisson model. However, Poisson has the assumption that the mean is equal to the variance. This is true if each count is completely independent, like counting cars on a quiet country road. However, in biology this is often not the case, in this example having one neutrophil increases your chances of having two neutrophils because an inflammatory response is happening. This is more like a busy road where one car driving by increases the chances of a second car driving by (is it rush hour?), the counts aren’t independent. In this case a negative binomial is a good family to model the data. It has two parameterization methods (the method by which it predicts the lack of independence of the counts). So now you have to model the Poisson models with the three link functions and the negative binomial model with the two parameterization methods and then see which is best. For this we use the Akaike Information Criteria (AIC). With AIC the lower the better, so we will select the model with the lowest AIC.

#Packages
require("lme4")
require("ggplot2")
require("glmmADMB")
require("MASS")   #for fitdistr() used below
require("car")    #for qqp() used below



#Loading data
example.count<- read.table(url("https://jackauty.com/wp-content/uploads/2018/10/count.txt"), header=T)

str(example.count)
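Before fitting anything, a quick sanity check on the Poisson mean-equals-variance assumption is to compare the mean and variance of the counts within each group (base R, my addition); if the variances are much bigger than the means, Poisson is already looking shaky.

#Compare mean and variance of the counts per group
aggregate(Neutrophils ~ Factor1, data = example.count, FUN = function(x) c(mean = mean(x), var = var(x)))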

#Generalized mixed modelling

try(glmm.a<-glmmadmb(Neutrophils~Factor1+ (1|Clutch), data=example.count,family="nbinom"))

try(glmm.b<-glmmadmb(Neutrophils~Factor1+ (1|Clutch), data=example.count,family="nbinom1"))

try(glmm.c<-glmer(Neutrophils~Factor1+ (1|Clutch), data=example.count, family=poisson(link="log")))

try(glmm.d<-glmer(Neutrophils~Factor1+ (1|Clutch), data=example.count, family=poisson(link="identity")))

try(glmm.e<-glmer(Neutrophils~Factor1+ (1|Clutch), data=example.count,family=poisson(link="sqrt")))

AIC(glmm.a,glmm.b,glmm.c,glmm.d,glmm.e)

        df         AIC
glmm.a   4      270.5040
glmm.b   4      273.9180
glmm.c   3      446.7101
glmm.d   3      475.2077
glmm.e   3      461.6097

From this we can see that the negative binomial model is much better and the “nbinom” parameterization is preferred. Now we can continue with our analysis very much like a normal multi-level linear model.

#Build model with and without our independent variable (Factor1)
glmm1<-glmmadmb(Neutrophils~1 + (1|Clutch), data=example.count,family="nbinom")
glmm2<-update(glmm1, .~. +Factor1)
anova(glmm1,glmm2)
Analysis of Deviance Table

Model 1: Neutrophils ~ 1
Model 2: Neutrophils ~ +Factor1
 NoPar LogLik Df Deviance Pr(>Chi) 
1 3 -134.27 
2 4 -131.25 1 6.028 0.01408 *
---

Our factor significantly improves the model! From this we might say in our publication – “There was a significant effect of factor on neutrophil count (χ2(1)=6.028, p=0.014).”. But first! We have to check the assumptions, this is a little bit different to multi-level linear modelling.

  • Is the data approximately distributed appropriately (negative binomial)?
  • Are there outlying data points?
  • Are the residuals centered around zero for each factor?
  • Is there zero inflation (more zeros than expected, this is common in biological data)?

Is the data approximately distributed appropriately (negative binomial)?

From this we can see that the negative binomial model is much better than the Poisson model and most data points lie between the 95% CI of the expected distribution.

#Is the data approximately distributed appropriately (negative binomial)? 

poisson <- fitdistr(example.count$Neutrophils, "Poisson")
qqp(example.count$Neutrophils, "pois", lambda=poisson$estimate, main="Poisson model")
nbinom<-fitdistr(example.count$Neutrophils, "negative binomial")
qqp(example.count$Neutrophils, "nbinom", size=nbinom$estimate[[1]], mu=nbinom$estimate[[2]], main="Negative binomial model")

Are there outlying data points? AND Are the residuals centered around zero for each factor?

Plotting the residuals against Factor we can see if the model fitted sensible coefficients and if there are outlying data points. Here we are looking to see if the residuals cluster around zero and that there are no extreme residuals.

Here we can see that the residuals around both factor levels are centered around zero. The random effect of “clutch” seems to model well, with no clustering. There is one residual with an absolute value greater than two. This is slightly concerning with a data set this small. It could be worth checking your lab notes for anything peculiar. I don’t tend to remove data points without a practical reason, for example I might read in my lab notes that this zebra fish looked a lot like a mouse, then I would remove it.

#Are there outlying data points? AND Are the residuals centered around zero for each factor?
augDat <- data.frame(example.count,resid=residuals(glmm2,type="pearson"),
 fitted=fitted(glmm2))
ggplot(augDat,aes(x=Factor1,y=resid,col=Clutch))+geom_point()+geom_boxplot(aes(group=Factor1),alpha = 0.1)+coord_flip()

Is there zero inflation (more zeros than expected, this is common in biological data)?

Here we run the same model but with zeroInflation=T, then we check the AIC and find that the zero-inflation model is better! This sucks because now we have to run the whole thing again with zeroInflation=T. On the re-run, the p-value “improved” and outliers were gone, showing that they were due to zero inflation (and not my zebra fish being a mouse)! So I guess it all works out in the end.

#Is there zero inflation
try(glmm.zero<-glmmadmb(Neutrophils~Factor1+ (1|Clutch), data=example.count,family="nbinom", zeroInflation = T))
AIC(glmm.zero,glmm2)
 
          df AIC
glmm.zero  5 268.298
glmm2      4 270.504

#Quick rerun
glmm1.zero<-glmmadmb(round(Neutrophils)~1 + (1|Clutch), data=example.count,family="nbinom", zeroInflation = T)
glmm2.zero<-update(glmm1.zero, .~. +Factor1)
anova(glmm1.zero,glmm2.zero)

Analysis of Deviance Table
Model 1: round(Neutrophils) ~ 1
Model 2: round(Neutrophils) ~ +Factor1
 NoPar LogLik Df Deviance Pr(>Chi) 
   4  -134.14 
   5  -129.15  1  9.976   0.001586 **

#Quick assumption check
#Are there outlying data points? AND Are the residuals centered around zero for each factor?
augDat <- data.frame(example.count,resid=residuals(glmm2.zero,type="pearson"),
 fitted=fitted(glmm2.zero))
ggplot(augDat,aes(x=Factor1,y=resid,col=Clutch))+geom_point()+geom_boxplot(aes(group=Factor1),alpha = 0.1)+coord_flip()

 

Publication methods:

For discrete data and data with non-normal distributions, generalized linear mixed modelling (GLMM) was used (lme4, Douglas et al. 2015; glmmADMB, Fournier et al. 2012 & Skaug et al. 2016). Again “Clutch” was treated as a random effect modelled with random intercepts for all models. Appropriate families were selected based on the data distribution. Numerous families and link functions were evaluated where necessary and the optimal parameters were selected based on the Akaike information criterion (AIC). For mobile or non-mobile (yes/no) data a logistic regression was used; for count data (number of cells) a negative binomial family was selected. The significance of including an independent variable or interaction term was evaluated using the log-likelihood ratio. Holm-Sidak post-hocs were then performed for pair-wise comparisons using the least square means (LSmeans, Russell 2016). Pearson residuals were evaluated graphically using predicted vs level plots. All analyses were performed using R (version 3.5.1, R Core Team, 2018).

Douglas Bates, Martin Maechler, Ben Bolker, Steve Walker (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48.<doi:10.18637/jss.v067.i01>.

Russell V. Lenth (2016). Least-Squares Means: The R Package lsmeans. Journal of Statistical Software, 69(1), 1-33.<doi:10.18637/jss.v069.i01>

Fournier DA, Skaug HJ, Ancheta J, Ianelli J, Magnusson A, Maunder M, Nielsen A, Sibert J (2012). “AD Model Builder: using automatic differentiation for statistical inference of highly parameterized complex nonlinear models.” _Optim. Methods Softw._, *27*, 233-249.

Skaug H, Fournier D, Bolker B, Magnusson A, Nielsen A (2016). _Generalized Linear Mixed Models using ‘AD Model Builder’_. R package version 0.8.3.3.

R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

 

Finding an outlier using Cook’s distance

A Cook’s distance greater than 1 is a sign that a data point (or random factor) is having a disproportionate influence on your model and should be looked into. Note: I’m not normally a fan of removing data without a valid reason; for me, you need both a statistical and an experimental reason for removal.

#Loading data
example<- read.table(url("https://jackrrivers.com/wp-content/uploads/2018/04/ExampleCook.txt"), header=T)
example$ID<-as.factor(example$ID)

#Running a model
require(lme4)
lme1<-lmer(Dependent~1 + Factor1*Factor2+(1|ID), data=example, na.action=na.omit)
#Looking for large Cook distances
require(influence.ME)
infl <- influence(lme1, obs = TRUE)
cooks.distance(infl)
plot(infl, which = "cook")

Multi-level linear model (repeated measure ANOVA)

When the data is unbalanced or there are missing values, repeated measures ANOVAs fail to report unbiased results. Furthermore, all the data of one individual will be ignored if data from one time point is missing. Multi-level linear models get around this by comparing models of the data, not the data itself, using log-likelihood/Chi squared distribution analyses. Multi-level linear models also avoid sphericity issues. In general, multi-level linear models are better in every way than a repeated measures ANOVA.

#Loading data

example<- read.table(url("https://jackauty.com/wp-content/uploads/2023/11/example4.txt"), header=T)
example$ID<-as.factor(example$ID)

#Installing packages (but only if you need to install them)
if(!require(MASS)) install.packages("MASS")
if(!require(sjPlot)) install.packages("sjPlot")
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(emmeans)) install.packages("emmeans")
if(!require(viridis)) install.packages("viridis")


library(ggplot2); library(sjPlot); library(MASS); library(emmeans);
#Running multi-level linear model 

if(!require(lme4)) install.packages("lme4")
require(lme4)

lme1<-lmer(Dependent~1 + (1|ID), data=example, na.action=na.omit)
lme2<-update(lme1, .~. +Factor1)
lme3<-update(lme2, .~. +Factor2)
lme4<-update(lme3, .~. +Factor1*Factor2)

anova(lme1,lme2,lme3,lme4)

Assumption checking

Now with multi-level linear models (AKA repeated measures ANOVA) there are two kinds of residuals. There is the variability of the raw data around the predicted value (this is the conditional residual), and the variability of the random effect (e.g. animal or human). The random effects should be normalish but don’t have to be homoscedastic. The conditional residuals should be both normal and homoscedastic.

#Assumption check of conditional residuals
plot(lme4)
The growing scatter of residuals as the predicted value increases violates the assumption of homoscedasticity. This is super common. Bigger means typically also have bigger variability in biological data.
#Now we check for normality of the conditional  residuals
qqnorm(resid(lme4))
This should be a single straight 45 degree line. These curves are no good.
# checking the normality of the random effects (here random intercept):
qqnorm(ranef(lme4)$ID$'(Intercept)', 
       main="Q-Q plot for the random intercept")
This should be a straight line too!

Transforming the data

So you could just have a guess at the transformation needed. The wedge-shaped residuals vs fitted plot above tells us we need a shrinking transformation like a LOG10 or square root. However, it is better if we can take the researcher out of the process. What if the LOG10 and square root have similar diagnostic plots, but the LOG10 gives you significant results and the square root transformation gives you p=0.051? You might be tempted to choose the LOG10 transformation because it gives you significance.

What we can do is a BoxCox transformation. This is an automated process that transforms your data by a range of powers and evaluates the normality of the residuals. It then chooses the power (Selected.lambda) that produces the most normally distributed residuals. Typically, when you solve normality, you also solve homoscedasticity.

Now you can do this on a multi-level linear model, but often fixing a simple model will also fix the complex model. So we’ll run a simple linear model, calculate what we should transform the data by, then apply that transformation to our multi-level linear model.

#Making the simple linear model
simple.linear<-lm(Dependent~Factor1*Factor2, data=example)

#BoxCox transformation 
boxcox<-boxcox(simple.linear,lambda = seq(-5, 5, 1/1000), plotit = TRUE ) 

Selected.lambda <-boxcox$x[boxcox$y==max(boxcox$y)] 
Selected.lambda 

#Run the transformed model 
lme1<-lmer((Dependent+1)^Selected.lambda~1 + (1|ID), data=example, na.action=na.omit)
lme2<-update(lme1, .~. +Factor1)
lme3<-update(lme2, .~. +Factor2)
lme4<-update(lme3, .~. +Factor1*Factor2)

anova(lme1,lme2,lme3,lme4)
To break this down: Because lme2 is significantly different to lme1 (p=0.005), we would say there is a significant main effect of Factor 1; similarly lme2 vs lme3 (p=0.001) we would say there is a significant main effect of Factor 2; and finally lme3 vs lme4 (p=0.043) we would say there is a significant interaction effect of Factor1 and Factor2.
#Assumption check
plot(lme4)
qqnorm(resid(lme4))

# checking the normality of the random effects (here random intercept):
qqnorm(ranef(lme4)$ID$'(Intercept)', 
       main="Q-Q plot for the random intercept")
And just like that! By fixing the linear model, we’ve fixed the multi-level model!

Posthocs!

OK this is a Post-hoc analysis with a priori selected comparisons. Perhaps we don’t want to do all the comparisons, because some are useless. For example, why compare Females on Drug with Males on Placebo? That doesn’t make sense. So in this, we select our comparisons and adjust for only those comparisons.

#Posthoc/Planned contrasts 
post <- emmeans(lme4, pairwise ~ Factor1*Factor2, adjust="none") 

posthoc<-summary(post) 

#Look at the table of potential comparisons
tab_df(posthoc$contrasts[,c(1,4)],show.rownames =T)  

#Select the comparisons you are interested in by adding the row numbers below
comparisons<-c(1,2,5,6)

#Build a dataframe of all the comparisons (contrasts) and the unadjusted p values
p.value<-data.frame(cbind(posthoc$contrasts["contrast"],posthoc$contrasts["p.value"]))

ptable<-p.value[comparisons,]

#Create a column of adjusted p values using the Holm method
ptable$padj<-p.adjust(ptable$p.value, method="holm") 

#View the table in a pretty way
tab_df(ptable,digits = 4)  
This table is just explaining the comparisons row by row. So the first row is just comparing Factor1 A & Factor2 C vs Factor1 B & Factor2 C. It is not significant after adjustment (p=0.46).

How I would publish the above method.

Linear mixed modelling was used to evaluate the effect of explanatory variables on the dependent variable (package lme4, Bates et al. 2015). All factors and interactions were modelled as fixed effects. A within-subject design with random intercepts was used for all models. The significance of including an independent variable or interaction term was evaluated using the log-likelihood ratio. Holm-Sidak post-hocs were then performed for pair-wise comparisons using the least square means (package “emmeans”, Lenth 2023). Homoskedasticity and normality of the Pearson residuals were evaluated graphically using predicted vs residual and Q-Q plots, respectively, and Box-Cox transformations were applied when necessary (package “MASS”, Venables 2002). All analyses were performed using R (version 4.3.3).

#Making a reference list 
pkg <- c("MASS","sjPlot","lme4","emmeans") 

#Create a loop to build a dataframe of the references
citation_list<-factor() 

for(i in pkg){   citation_list<- c(citation_list,paste(capture.output(print(citation(i), style ="text", bibtex=F)), collapse="\n")) }

#Display the dataframe nicely with sjPlot's tab_df
tab_df(   data.frame(citation_list), col.header="References", sep=" ") 
Venables WN, Ripley BD (2002). _Modern Applied Statistics with S_, Fourth edition. Springer, New York. ISBN 0-387-95457-0.
Lüdecke D (2023). _sjPlot: Data Visualization for Statistics in Social Science_. R package version 2.8.14.
Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_. Springer-Verlag New York. ISBN 978-3-319-24277-4.
Lenth R (2023). _emmeans: Estimated Marginal Means, aka Least-Squares Means_. R package version 1.8.6.
Bates D, Mächler M, Bolker B, Walker S (2015). “Fitting Linear Mixed-Effects Models Using lme4.” _Journal of Statistical Software_, *67*(1), 1-48. doi:10.18637/jss.v067.i01.

All the code at once! (easier to copy)


#Installing packages (but only if you need to install them)

if(!require(MASS)) install.packages("MASS")
if(!require(sjPlot)) install.packages("sjPlot")
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(emmeans)) install.packages("emmeans")
if(!require(viridis)) install.packages("viridis")
if(!require(lme4)) install.packages("lme4")


library(ggplot2); library(sjPlot); library(MASS); library(emmeans); library(lme4)

#Loading data

example<- read.table(url("https://jackauty.com/wp-content/uploads/2023/11/example4.txt"), header=T)
example$ID<-as.factor(example$ID)


#Running multi-level linear model 
lme1<-lmer(Dependent~1 + (1|ID), data=example, na.action=na.omit)
lme2<-update(lme1, .~. +Factor1)
lme3<-update(lme2, .~. +Factor2)
lme4<-update(lme3, .~. +Factor1*Factor2)

anova(lme1,lme2,lme3,lme4)
#Assumption check
plot(lme4)
qqnorm(resid(lme4))

# checking the normality of the random effects (here random intercept):
qqnorm(ranef(lme4)$ID$'(Intercept)', 
       main="Q-Q plot for the random intercept")

simple.linear<-lm(Dependent~Factor1*Factor2, data=example)

#BoxCox transformation 
boxcox<-boxcox(simple.linear,lambda = seq(-5, 5, 1/1000), plotit = TRUE ) 

Selected.lambda <-boxcox$x[boxcox$y==max(boxcox$y)] 
Selected.lambda 

#Run the transformed model 
lme1<-lmer((Dependent+1)^Selected.lambda~1 + (1|ID), data=example, na.action=na.omit)
lme2<-update(lme1, .~. +Factor1)
lme3<-update(lme2, .~. +Factor2)
lme4<-update(lme3, .~. +Factor1*Factor2)

tab_df(anova(lme1,lme2,lme3,lme4),show.rownames =T, digits=3)

#Assumption check
plot(lme4)
qqnorm(resid(lme4))

# checking the normality of the random effects (here random intercept):
qqnorm(ranef(lme4)$ID$'(Intercept)', 
       main="Q-Q plot for the random intercept")



#Posthoc/Planned contrasts 
post <- emmeans(lme4, pairwise ~ Factor1*Factor2, adjust="none") 

posthoc<-summary(post) 

tab_df(posthoc$contrasts[,c(1,4)],show.rownames =T)  


#Build a dataframe of all the comparisons (contrasts) and the unadjusted p values
p.value<-data.frame(cbind(posthoc$contrasts["contrast"],posthoc$contrasts["p.value"]))

#Select the comparison you are interested in
comparisons<-c(1,2,5,6)
ptable<-p.value[comparisons,]

#Create a column of adjusted p values using the Holm method
ptable$padj<-p.adjust(ptable$p.value, method="holm") 

#View the table in a pretty way
tab_df(ptable,digits = 4)  

#Making a reference list 
pkg <- c("MASS","sjPlot","lme4","emmeans") 

#Create a loop to build a dataframe of the references
citation_list<-factor() 

for(i in pkg){   citation_list<- c(citation_list,paste(capture.output(print(citation(i), style ="text", bibtex=F)), collapse="\n")) }

#Display the dataframe nicely with sjPlot's tab_df
tab_df(   data.frame(citation_list), col.header="References", sep=" ") 

PCA and 3D PCA

A principal component analysis (PCA) is a way to take a large amount of data and plot it on two or three axes. It does this without knowing which groups the data belong to, so if you perform a PCA, plot it, and the data clusters nicely into the experimental groups, you know there are distinct data signatures in your experimental groups. AKA your experiment worked. Here is a quick guide to PCA for RNAseq data.

#Load data

example<- read.table(url("https://jackrrivers.com/wp-content/uploads/2018/03/exampleRNAseq.txt"), header=T)
rna.mean<-(as.matrix(example[,2:13]))
rna.t<-(t(rna.mean))
#PCA
##Set Colours
col<-c(rep("#00cccc",6),rep("#ff6633",6))

##Intial PCA
PCA<-prcomp(rna.t, scale=T)
summary_PCA<-round(data.frame(summary(PCA)$importance),3)
summary_PCA

## Plot PCA
par(fig=c(0, 1, 0, 1), oma=c(5, 2, 2, 2), mar=c(6, 4, 2, 2))
plot(PCA$x[,1:2], col=col,pch=16, cex=1.5,xlab="",ylab="",xlim=c(-250,150),ylim=c(100,-100))
title(main="PCA Genotype",
xlab=paste0("PCA1 ",(100*summary_PCA$PC1[2]),"%"), 
ylab=paste0("PCA2 ",(100*summary_PCA$PC2[2]),"%"))
legend("topleft", bty = "n", legend = c("WT", "KO"), 
 pch = 16, col = c("#00cccc","#ff6633"), cex = 1, horiz=F)


##Log Transformed PCA
PCA<-prcomp(log(rna.t+1), scale=F)

summary_PCA<-round(data.frame(summary(PCA)$importance),3)
summary_PCA

## Plot PCA
par(fig=c(0, 1, 0, 1), oma=c(5, 2, 2, 2), mar=c(6, 4, 2, 2))
plot(PCA$x[,1:2], col=col,pch=16, cex=1.5,xlab="",ylab="",xlim=c(-250,150),ylim=c(100,-100))
title(main="PCA Genotype",
xlab=paste0("PCA1 ",(100*summary_PCA$PC1[2]),"%"), 
ylab=paste0("PCA2 ",(100*summary_PCA$PC2[2]),"%"))
legend("topleft", bty = "n", legend = c("WT", "KO"), 
 pch = 16, col = c("#00cccc","#ff6633"), cex = 1, horiz=F)


#3D PCA
##Packages
require(pca3d)
require(magick)
require(rgl)

##Image Attributes
shape<-c(1,1,1,1,1,1,4,4,4,4,4,4)
col<-c(rep(7,6),rep(4,6))

##Run PCA
PCA<-prcomp(log(rna.t+1), scale=F)

##Make 3D
pca3d(PCA, group=col,radius = 3,shape=shape)

PCA 3D
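If you want to keep a static image of the interactive 3D view, rgl can capture the currently open window (a hedged sketch, my addition; it assumes the pca3d() window is still open, and depending on your rgl version snapshot3d() may ask for the webshot2 package):

##Save a snapshot of the open rgl window
snapshot3d("PCA_3D.png")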