PLSDA-batch is a batch-effect correction method based on Projection to Latent Structures Discriminant Analysis. It estimates latent components associated with treatment and batch effects and removes the batch-related variation prior to any downstream analysis. It also enables ordination and data visualisation. PLSDA-batch is particularly suitable for microbiome data because it is non-parametric and multivariate. When combined with the centered log-ratio transformation to address uneven library sizes and compositionality, PLSDA-batch directly tackles key characteristics of microbiome data that existing correction methods often overlook.
In addition to the main method, the R package includes two extensions:
1/ weighted PLSDA-batch designed for unbalanced batch-by-treatment designs commonly encountered in small-sample microbiome studies;
2/ sparse PLSDA-batch which performs discriminative variable selection to reduce overfitting in classification settings.
These two variants broaden the applicability of PLSDA-batch across diverse data scenarios (Wang and Lê Cao 2023).
This vignette covers microbiome data pre-processing, batch-effect detection and visualisation, the use of the PLSDA-batch family of methods, and post-correction evaluation and variable selection. For guidance on method choice according to experimental objectives and study designs, see “Batch Effects Management in Case Studies”.
Note: A new argument mode has been added to
PLSDA_batch(). We recommend using “regression” mode, but if
you need to reproduce results obtained using PLSDAbatch ≤ 1.6.0, please
explicitly set mode = "canonical".
First, we load the packages required for the analysis and check their versions.
# CRAN
library(pheatmap)
library(vegan)
library(gridExtra)
# Bioconductor
library(mixOmics)
library(Biobase)
library(SummarizedExperiment)
library(PLSDAbatch)
# print package versions
package.version("pheatmap")## [1] "1.0.13"
## [1] "2.7-2"
## [1] "2.3"
## [1] "6.35.0"
## [1] "2.71.0"
## [1] "1.99.1"
We considered a case study to illustrate the application of PLSDA-batch. The study is described as follows:
\(\color{blue}{\bf{\text{Anaerobic digestion.}}}\) This study investigated microbial indicators that could enhance the efficacy of anaerobic digestion (AD) bioprocesses and prevent system failure (Chapleur et al. 2016). The dataset includes 75 samples and 567 microbial variables. Samples were treated with two different ranges of phenol concentration (effect of interest) and processed on five different dates (batch effect). This study exhibits a clear and strong batch effect with an approx. balanced batch x treatment design.
We load the \(\color{blue}{\text{AD
data}}\) stored internally using the data()
function, and then extract the batch and treatment information.
## [1] 75 567
ad.metadata <- rowData(AD_data$FullData)
ad.batch <- factor(ad.metadata$sequencing_run_date,
levels = unique(ad.metadata$sequencing_run_date)
)
ad.trt <- as.factor(ad.metadata$initial_phenol_concentration.regroup)
names(ad.batch) <- names(ad.trt) <- rownames(ad.metadata)The raw \(\color{blue}{\text{AD
data}}\) include 567 OTUs and 75 samples. We then use the
PreFL() function from our \(\color{orange}{\text{PLSDAbatch}}\) R
package to filter the data.
## [1] 75 231
## [1] 0.6328042
## [1] 0.3806638
After filtering, 231 OTUs remained, and the proportion of zeroes decreased from 63% to 38%.
Note: The PreFL() function is intended for raw count
data rather than relative abundance data. We also recommend performing
pre-filtering on raw counts to better mitigate compositionality
issues.
Prior to CLR transformation, we recommend adding 1 as an offset for
raw count data (e.g., \(\color{blue}{\text{AD
data}}\)), and half of the minimum value as an offset for
relative abundance data. We use the logratio.transfo()
function from the \(\color{orange}{\text{mixOmics}}\) package
to apply the CLR transformation.
We apply the pca() function from the \(\color{orange}{\text{mixOmics}}\) package
to the \(\color{blue}{\text{AD data}}\)
and use the Scatter_Density() function from \(\color{orange}{\text{PLSDAbatch}}\) to
display the PCA sample plot with density overlays.
# AD data
ad.pca.before <- mixOmics::pca(ad.clr, ncomp = 3, scale = TRUE)
Scatter_Density(
components = ad.pca.before$variates$X, comp = c(1, 2),
expl.var = ad.pca.before$prop_expl_var$X,
batch = ad.batch, trt = ad.trt,
title = "AD data", trt.legend.title = "Phenol conc."
)PCA sample plot with density overlays for the AD data.
In the figure above, we observed (1) clear separation between samples treated with different phenol concentrations and (2) noticeable differences between samples sequenced on “14/04/2016” and “21/09/2017” compared with the other dates. Therefore, the date-related batch effect needs to be removed.
We first identify the top OTUs driving the major variance in the PCA
using the selectVar() function from the \(\color{orange}{\text{mixOmics}}\) package.
Each selected OTU can then be visualised using boxplots and density
plots with the box_plot() and density_plot()
functions in \(\color{orange}{\text{PLSDAbatch}}\).
ad.OTU.name <- selectVar(ad.pca.before, comp = 1)$name[1]
ad.OTU_batch <- data.frame(value = ad.clr[, ad.OTU.name], batch = ad.batch)
box_plot(
df = ad.OTU_batch, title = paste(ad.OTU.name, "(AD data)"),
x.angle = 30
)Boxplots of sample values for OTU28 before batch-effect correction in the AD data.
Density plots of sample values for OTU28 before batch-effect correction in the AD data.
The boxplot and density plot indicate a strong date-related batch effect, driven by the differences between “14/04/2016”, “21/09/2017”, and the other dates for “OTU28”.
We also apply a linear regression model to “OTU28” using the
linear_regres() function from \(\color{orange}{\text{PLSDAbatch}}\),
including batch and treatment effects as covariates. We set “14/04/2016”
and “21/09/2017” as the reference batches, respectively, using the
relevel() function from \(\color{orange}{\text{stats}}\).
# reference batch: 14/04/2016
ad.batch <- relevel(x = ad.batch, ref = "14/04/2016")
ad.OTU.lm <- linear_regres(
data = ad.clr[, ad.OTU.name],
trt = ad.trt, batch.fix = ad.batch,
type = "linear model"
)
summary(ad.OTU.lm$model$data)##
## Call:
## stats::lm(formula = y ~ trt + batch.fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9384 -0.3279 0.1635 0.3849 0.9887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8501 0.2196 3.871 0.000243 ***
## trt1-2 -1.6871 0.1754 -9.617 2.27e-14 ***
## batch.fix09/04/2015 1.5963 0.2950 5.410 8.55e-07 ***
## batch.fix01/07/2016 2.0839 0.2345 8.886 4.82e-13 ***
## batch.fix14/11/2016 1.7405 0.2480 7.018 1.24e-09 ***
## batch.fix21/09/2017 1.2646 0.2690 4.701 1.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7033 on 69 degrees of freedom
## Multiple R-squared: 0.7546, Adjusted R-squared: 0.7368
## F-statistic: 42.44 on 5 and 69 DF, p-value: < 2.2e-16
# reference batch: 21/09/2017
ad.batch <- relevel(x = ad.batch, ref = "21/09/2017")
ad.OTU.lm <- linear_regres(
data = ad.clr[, ad.OTU.name],
trt = ad.trt, batch.fix = ad.batch,
type = "linear model"
)
summary(ad.OTU.lm$model$data)##
## Call:
## stats::lm(formula = y ~ trt + batch.fix)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9384 -0.3279 0.1635 0.3849 0.9887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1147 0.2502 8.453 2.97e-12 ***
## trt1-2 -1.6871 0.1754 -9.617 2.27e-14 ***
## batch.fix14/04/2016 -1.2646 0.2690 -4.701 1.28e-05 ***
## batch.fix09/04/2015 0.3317 0.3139 1.056 0.29446
## batch.fix01/07/2016 0.8193 0.2573 3.185 0.00218 **
## batch.fix14/11/2016 0.4759 0.2705 1.760 0.08292 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7033 on 69 degrees of freedom
## Multiple R-squared: 0.7546, Adjusted R-squared: 0.7368
## F-statistic: 42.44 on 5 and 69 DF, p-value: < 2.2e-16
From the linear regression results, we observed P < 0.001 for the regression coefficients associated with all other batches when the reference batch was “14/04/2016”, confirming the differences between samples from “14/04/2016” and those from the other dates, as seen in the previous plots. When the reference batch was “21/09/2017”, we also observed significant differences between “21/09/2017” and “14/04/2016”, as well as between “21/09/2017” and “01/07/2016”. Therefore, a batch effect associated with “21/09/2017” is also present.
We produce a heatmap using the \(\color{orange}{\text{pheatmap}}\) package. The data first need to be scaled across both OTUs and samples.
# scale the clr data on both OTUs and samples
ad.clr.s <- scale(ad.clr, center = TRUE, scale = TRUE)
ad.clr.ss <- scale(t(ad.clr.s), center = TRUE, scale = TRUE)
ad.anno_col <- data.frame(Batch = ad.batch, Treatment = ad.trt)
ad.anno_colors <- list(
Batch = color.mixo(seq_len(5)),
Treatment = pb_color(seq_len(2))
)
names(ad.anno_colors$Batch) <- levels(ad.batch)
names(ad.anno_colors$Treatment) <- levels(ad.trt)
pheatmap(ad.clr.ss,
cluster_rows = FALSE,
fontsize_row = 4,
fontsize_col = 6,
fontsize = 8,
clustering_distance_rows = "euclidean",
clustering_method = "ward.D",
treeheight_row = 30,
annotation_col = ad.anno_col,
annotation_colors = ad.anno_colors,
border_color = "NA",
main = "AD data - Scaled"
)Hierarchical clustering of samples in the AD data.
In the heatmap, samples in the \(\color{blue}{\text{AD data}}\) from the batch dated “14/04/2016” were clustered together and were distinct from the other samples, indicating the presence of a batch effect.
We apply pRDA using the varpart() function from the
\(\color{orange}{\text{vegan}}\) R
package.
# AD data
ad.factors.df <- data.frame(trt = ad.trt, batch = ad.batch)
class(ad.clr) <- "matrix"
ad.rda.before <- varpart(ad.clr, ~trt, ~batch,
data = ad.factors.df, scale = TRUE
)
ad.rda.before$part$indfract## Df R.squared Adj.R.squared Testable
## [a] = X1|X2 1 NA 0.08943682 TRUE
## [b] = X2|X1 4 NA 0.26604420 TRUE
## [c] 0 NA 0.01296248 FALSE
## [d] = Residuals NA NA 0.63155651 FALSE
In the results, X1 and X2 represent the
first and second covariates fitted in the model. [a] and
[b] represent the independent proportions of variance
explained by X1 and X2 respectively, and
[c] represents the shared variance between X1
and X2. In the \(\color{blue}{\text{AD data}}\), batch
variance (X2) was larger than treatment variance
(X1) with some interaction variance (shown in line
[c], Adj.R.squared = 0.013). A larger shared variance
indicates a more unbalanced batch x treatment design is. In this study,
we considered the design to be approx. balanced.
The PLSDA_batch() function is implemented in the \(\color{orange}{\text{PLSDAbatch}}\)
package. To use this function, we need to specify the optimal number of
components related to the treatment (ncomp.trt) and batch
effects (ncomp.bat).
For versions > 1.6.0, we introduced the mode
argument. The default is “regression”, which is also the recommended
mode. In our applications, both Y.trt and
Y.batch are categorical, and our goal is to understand how
X explains and predicts Y: specifically, to
remove the information in X that explains
Y.batch while preserving the information in X
that explains Y.trt.
In the \(\color{blue}{\text{AD
data}}\), ad.trt has 2 classes and
ad.batch has 5 batches, so we choose 1 component for the
treatment effect and 4 components for the batch effect.
In PLS-based methods, a categorical Y variable is
converted into a 0/1 dummy matrix, A Y variable with \(k\) classes becomes a matrix with \(k\) columns, but only \(k-1\) of these are informative because the
final column can always be derived from the others. Mathmetically, the
rank of Y is \(k-1\).
## [1] 1
## [1] 4
The canonical mode is not recommended for batch-effect removal, as it
measures how X and Y co-vary rather then how
X explains Y. For PLSDAbatch <= 1.6.0, the
canonical mode was used by default. Therefore, if you need to reproduce
results obtained with versions ≤ 1.6.0, please explicitly set
mode = "canonical". Do not worry, the results from the two
modes are generally similar, but biologically the “regression” mode is
more appropriate.
The procedure for choosing the optimal number of components is the
same in both modes. In canonical mode, however, we can directly
visualise that a \(k\)-class
Y variable requires only \(k −
1\) informative components, reflecting the effective rank of its
dummy-coded representation.
We first convert ad.trt into a dummy matrix using the
unmap() function from \(\color{orange}{\text{mixOmics}}\), and then
use pls() in “canonical” mode from the same package to
calculate the explained variance of Y.
# estimate the number of treatment components
ad.trt.mat <- unmap(ad.trt)
ad.trt.tune.cnc <- pls(
X = ad.clr, Y = ad.trt.mat,
ncomp = 2, mode = "canonical"
)
ad.trt.tune.cnc$prop_expl_var$Y # 1## comp1 comp2
## 1.0000000 0.9979148
In the results, 1 component was sufficient to explain 100% of the
variance in the outcome dummy matrix ad.trt.mat, as
ad.trt contains only 2 classes.
We then use the PLSDA_batch() function, with both
treatment and batch grouping information, to visualise the optimal
number of batch components to remove.
# estimate the number of batch components
ad.batch.tune.cnc <- PLSDA_batch(
X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, ncomp.bat = 6,
mode = "canonical"
)
ad.batch.tune.cnc$explained_variance.bat$Y # 4## comp1 comp2 comp3 comp4 comp5 comp6
## 0.2474654 0.2615741 0.2301382 0.2608223 0.2308958 0.2607765
## [1] 1
For the batch effect, 4 components were able to explain 100% of the
variance in the batch dummy matrix, as ad.batch contains 5
groups.
For convenience, the optimal number of components can simply be
chosen as nlevels(y)-1.
## [1] 1
## [1] 4
We apply sPLSDA-batch using the same PLSDA_batch()
function, but we specify the number of variables to select on each
component (usually only treatment-related components,
keepX.trt). To determine the optimal number of variables to
select, we use the tune.splsda() function from the \(\color{orange}{\text{mixOmics}}\) package
(Rohart et al.
2017), testing all candidate values provided in
test.keepX for each component.
# estimate the number of variables to select per treatment component
ad.test.keepX <- c(
seq(1, 10, 1), seq(20, 100, 10),
seq(150, 231, 50), 231
)
ad.trt.tune.v <- tune.splsda(
X = ad.clr, Y = ad.trt,
ncomp = 1, test.keepX = ad.test.keepX,
folds = 4, nrepeat = 50, seed = 777
)
ad.trt.tune.v$choice.keepX # 100Here the optimal number of variables to select for the treatment component was 100.
ad.sPLSDA_batch.res.cnc <- PLSDA_batch(
X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, keepX.trt = 100,
ncomp.bat = 4, mode = "canonical"
)
ad.sPLSDA_batch.cnc <- ad.sPLSDA_batch.res.cnc$X.nobatchNote: For unbalanced batch x treatment designs (except for
nested designs), we can set balance = FALSE in the
PLSDA_batch() function to apply weighted PLSDA-batch.
We apply various visualisation and quantitative methods to assess the effectiveness of batch-effect correction.
PCA
In the \(\color{blue}{\text{AD data}}\), we compared the PCA sample plots before and after batch effect correction.
ad.pca.before <- pca(ad.clr, ncomp = 3, scale = TRUE)
ad.pca.PLSDA_batch.cnc <- pca(ad.PLSDA_batch.cnc, ncomp = 3, scale = TRUE)
ad.pca.sPLSDA_batch.cnc <- pca(ad.sPLSDA_batch.cnc, ncomp = 3, scale = TRUE)
ad.pca.PLSDA_batch.reg <- pca(ad.PLSDA_batch.cnc, ncomp = 3, scale = TRUE)
ad.pca.sPLSDA_batch.reg <- pca(ad.sPLSDA_batch.cnc, ncomp = 3, scale = TRUE)# order batches
ad.batch <- factor(ad.metadata$sequencing_run_date,
levels = unique(ad.metadata$sequencing_run_date)
)
ad.pca.before.plot <- Scatter_Density(
components = ad.pca.before$variates$X, comp = c(1, 2),
expl.var = ad.pca.before$prop_expl_var$X,
batch = ad.batch,
trt = ad.trt,
title = "Before correction"
)ad.pca.PLSDA_batch.cnc.plot <- Scatter_Density(
components = ad.pca.PLSDA_batch.cnc$variates$X, comp = c(1, 2),
expl.var = ad.pca.PLSDA_batch.cnc$prop_expl_var$X,
batch = ad.batch,
trt = ad.trt,
title = "PLSDA-batch (canonical)"
)ad.pca.sPLSDA_batch.cnc.plot <- Scatter_Density(
components = ad.pca.sPLSDA_batch.cnc$variates$X, comp = c(1, 2),
expl.var = ad.pca.sPLSDA_batch.cnc$prop_expl_var$X,
batch = ad.batch,
trt = ad.trt,
title = "sPLSDA-batch (canonical)"
)ad.pca.PLSDA_batch.reg.plot <- Scatter_Density(
components = ad.pca.PLSDA_batch.reg$variates$X, comp = c(1, 2),
expl.var = ad.pca.PLSDA_batch.reg$prop_expl_var$X,
batch = ad.batch,
trt = ad.trt,
title = "PLSDA-batch (regression)"
)ad.pca.sPLSDA_batch.reg.plot <- Scatter_Density(
components = ad.pca.sPLSDA_batch.reg$variates$X, comp = c(1, 2),
expl.var = ad.pca.sPLSDA_batch.reg$prop_expl_var$X,
batch = ad.batch,
trt = ad.trt,
title = "sPLSDA-batch (regression)"
)PCA sample plots with density overlays before and after batch-effect correction in the AD data.
As shown in the PCA sample plots, the differences between samples sequenced on “14/04/2016”, “21/09/2017” and the other dates were removed after batch-effect correction. The data corrected with PLSDA-batch retained slightly more treatment variation, mainly on the first PC, compared with sPLSDA-batch, as indicated by the x-axis label (26%). We can also compare boxplots and density plots for key variables identified by PCA as major drivers of variance, as well as heatmaps showing clear patterns before and after correction (results not shown). No noticeable differences between the two modes can be observed from the PCA plots.
pRDA
We calculate the global explained variance across all microbial
variables using pRDA. To achieve this, we loop through each variable in
both the original (uncorrected) data and the batch-corrected data. The
final results are then visualised using the partVar_plot()
function from the \(\color{orange}{\text{PLSDAbatch}}\)
package.
# AD data
ad.corrected.list <- list(
`Before correction` = ad.clr,
`PLSDA-batch (canonical)` = ad.PLSDA_batch.cnc,
`sPLSDA-batch (canonical)` = ad.sPLSDA_batch.cnc,
`PLSDA-batch (regression)` = ad.PLSDA_batch.reg,
`sPLSDA-batch (regression)` = ad.sPLSDA_batch.reg
)
ad.prop.df <- data.frame(
Treatment = NA, Batch = NA,
Intersection = NA,
Residuals = NA
)
for (i in seq_len(length(ad.corrected.list))) {
rda.res <- varpart(ad.corrected.list[[i]], ~trt, ~batch,
data = ad.factors.df, scale = TRUE
)
ad.prop.df[i, ] <- rda.res$part$indfract$Adj.R.squared
}
rownames(ad.prop.df) <- names(ad.corrected.list)
ad.prop.df <- ad.prop.df[, c(1, 3, 2, 4)]
partVar_plot(prop.df = ad.prop.df)Global explained variance before and after batch effect correction for the AD data.
As shown in the figure above, the intersection between batch and treatment variance was small (1.3%) for the \(\color{blue}{\text{AD data}}\), indicating that the batch-by-treatment design is not highly unbalanced. Therefore, the unweighted PLSDA-batch and sPLSDA-batch methods were appropriate, and the weighted versions were not required. The sPLSDA-batch–corrected data showed better performance, with lower intersection variance compared with PLSDA-batch. The results from the two modes were similar.
\(\mathbf{R^2}\)
The \(R^2\) values for each variable
are calculated using lm() from the \(\color{orange}{\text{stats}}\) package. To
compare \(R^2\) values across
variables, we scale the corrected data prior to calculating \(R^2\). The results are then visualised
using ggplot() from the \(\color{orange}{\text{ggplot2}}\) R
package.
# AD data
# scale
ad.corr_scale.list <- lapply(
ad.corrected.list,
function(x) {
apply(x, 2, scale)
}
)
ad.r_values.list <- list()
for (i in seq_len(length(ad.corr_scale.list))) {
ad.r_values <- data.frame(trt = NA, batch = NA)
for (c in seq_len(ncol(ad.corr_scale.list[[i]]))) {
ad.fit.res.trt <- lm(ad.corr_scale.list[[i]][, c] ~ ad.trt)
ad.r_values[c, 1] <- summary(ad.fit.res.trt)$r.squared
ad.fit.res.batch <- lm(ad.corr_scale.list[[i]][, c] ~ ad.batch)
ad.r_values[c, 2] <- summary(ad.fit.res.batch)$r.squared
}
ad.r_values.list[[i]] <- ad.r_values
}
names(ad.r_values.list) <- names(ad.corr_scale.list)
ad.boxp.list <- list()
for (i in seq_len(length(ad.r_values.list))) {
ad.boxp.list[[i]] <-
data.frame(
r2 = c(
ad.r_values.list[[i]][, "trt"],
ad.r_values.list[[i]][, "batch"]
),
Effects = as.factor(rep(c("Treatment", "Batch"),
each = 231
))
)
}
names(ad.boxp.list) <- names(ad.r_values.list)
ad.r2.boxp <- rbind(
ad.boxp.list$`Before correction`,
ad.boxp.list$`PLSDA-batch (canonical)`,
ad.boxp.list$`sPLSDA-batch (canonical)`,
ad.boxp.list$`PLSDA-batch (regression)`,
ad.boxp.list$`sPLSDA-batch (regression)`
)
ad.r2.boxp$methods <- rep(
c(
"Before correction", "PLSDA-batch (canonical)",
"sPLSDA-batch (canonical)", "PLSDA-batch (regression)",
"sPLSDA-batch (regression)"
),
each = 462
)
ad.r2.boxp$methods <- factor(ad.r2.boxp$methods,
levels = unique(ad.r2.boxp$methods)
)
ggplot(ad.r2.boxp, aes(x = Effects, y = r2, fill = Effects)) +
geom_boxplot(alpha = 0.80) +
theme_bw() +
theme(
text = element_text(size = 18),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_text(angle = 60, hjust = 1, size = 18),
axis.text.y = element_text(size = 18),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
legend.position = "right"
) +
facet_grid(~methods) +
scale_fill_manual(values = pb_color(c(12, 14)))AD study: \(R^2\) values for each microbial variable before and after batch effect correction.
The batch-related variance was reduced after both PLSDA-batch and sPLSDA-batch correction, with PLSDA-batch retaining slightly more treatment-related variance.
##################################
ad.barp.list <- list()
for (i in seq_len(length(ad.r_values.list))) {
ad.barp.list[[i]] <- data.frame(
r2 = c(
sum(ad.r_values.list[[i]][, "trt"]),
sum(ad.r_values.list[[i]][, "batch"])
),
Effects = c("Treatment", "Batch")
)
}
names(ad.barp.list) <- names(ad.r_values.list)
ad.r2.barp <- rbind(
ad.barp.list$`Before correction`,
ad.barp.list$`PLSDA-batch (canonical)`,
ad.barp.list$`sPLSDA-batch (canonical)`,
ad.barp.list$`PLSDA-batch (regression)`,
ad.barp.list$`sPLSDA-batch (regression)`
)
ad.r2.barp$methods <- rep(
c(
"Before correction", "PLSDA-batch (canonical)",
"sPLSDA-batch (canonical)", "PLSDA-batch (regression)",
"sPLSDA-batch (regression)"
),
each = 2
)
ad.r2.barp$methods <- factor(ad.r2.barp$methods,
levels = unique(ad.r2.barp$methods)
)
ggplot(ad.r2.barp, aes(x = Effects, y = r2, fill = Effects)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(
text = element_text(size = 18),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_text(angle = 60, hjust = 1, size = 18),
axis.text.y = element_text(size = 18),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
legend.position = "right"
) +
facet_grid(~methods) +
scale_fill_manual(values = pb_color(c(12, 14)))AD study: Sum of \(R^2\) values for each microbial variable before and after batch effect correction.
The overall sum of \(R^2\) values indicated that sPLSDA-batch removed slightly more batch variance than PLSDA-batch (Regression mode: PLSDA-batch (12.49), sPLSDA-batch (9.27); Canonical mode: PLSDA-batch (12.40), sPLSDA-batch (9.25) but preserved less treatment variance (Regression mode: PLSDA-batch (40.21), sPLSDA-batch (36.80); Canonical mode: PLSDA-batch (40.00), sPLSDA-batch (36.22)) compared with PLSDA-batch.
Alignment scores
To use the alignment_score() function from \(\color{orange}{\text{PLSDAbatch}}\), we
need to specify the proportion of variance to explain
(var), the number of nearest neighbours (k),
and the number of principal components to estimate (ncomp).
The results are then visualised using ggplot() from the
\(\color{orange}{\text{ggplot2}}\)
package.
# AD data
ad.scores <- c()
names(ad.batch) <- rownames(ad.clr)
for (i in seq_len(length(ad.corrected.list))) {
res <- alignment_score(
data = ad.corrected.list[[i]],
batch = ad.batch,
var = 0.95,
k = 8,
ncomp = 50
)
ad.scores <- c(ad.scores, res)
}
ad.scores.df <- data.frame(
scores = ad.scores,
methods = names(ad.corrected.list)
)
ad.scores.df$methods <- factor(ad.scores.df$methods,
levels = rev(names(ad.corrected.list))
)
ggplot() +
geom_col(aes(
x = ad.scores.df$methods,
y = ad.scores.df$scores
)) +
geom_text(
aes(
x = ad.scores.df$methods,
y = ad.scores.df$scores / 2,
label = round(ad.scores.df$scores, 3)
),
size = 3, col = "white"
) +
coord_flip() +
theme_bw() +
ylab("Alignment Scores") +
xlab("") +
ylim(0, 0.85)Comparison of alignment scores before and after batch effect correction using different methods for the AD data.
The alignment scores complement the PCA results, especially when the extent of batch-effect removal is difficult to judge from PCA sample plots. In Figure 5, we can clearly see that all methods reduce the major batch effect, but the PCA plots do not reveal the more subtle differences in how effectively each method performs.
Because a higher alignment score indicates better mixing of samples, the bar plot above shows that sPLSDA-batch achieved superior performance compared with PLSDA-batch.
After batch effect correction, we can identify discriminative variables associated with different treatments.
Here, we use the splsda() function from \(\color{orange}{\text{mixOmics}}\) to select
the top 50 microbial variables that, in combination, discriminate the
treatment groups in the \(\color{blue}{\text{AD data}}\).
For details on how to apply sPLS-DA, see the mixOmics documentation: mixOmics.
splsda.plsda_batch.reg <- splsda(
X = ad.PLSDA_batch.reg, Y = ad.trt,
ncomp = 3, keepX = rep(50, 3)
)
select.plsda_batch.reg <- selectVar(splsda.plsda_batch.reg, comp = 1)
head(select.plsda_batch.reg$value)## value.var
## OTU46 0.3034711
## OTU24 0.2816639
## OTU28 0.2815924
## OTU35 -0.2666526
## OTU61 0.2614321
## OTU68 0.2544252
splsda.splsda_batch.reg <- splsda(
X = ad.sPLSDA_batch.reg, Y = ad.trt,
ncomp = 3, keepX = rep(50, 3)
)
select.splsda_batch.reg <- selectVar(splsda.splsda_batch.reg, comp = 1)
head(select.splsda_batch.reg$value)## value.var
## OTU46 0.2949833
## OTU24 0.2744206
## OTU28 0.2716075
## OTU35 -0.2621345
## OTU61 0.2588398
## OTU68 0.2492678
## [1] 44
The discriminative variables were selected and ranked according to their contributions to separating the sample groups treated with different ranges of phenol concentration (0–0.5 vs. 1–2 g/L).
The overlap between the variables selected from the data corrected with PLSDA-batch and sPLSDA-batch was high (44 out of 50), although some differences between the two selections remained.
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] PLSDAbatch_1.99.1 SummarizedExperiment_1.41.0
## [3] GenomicRanges_1.63.1 Seqinfo_1.1.0
## [5] IRanges_2.45.0 S4Vectors_0.49.0
## [7] MatrixGenerics_1.23.0 matrixStats_1.5.0
## [9] Biobase_2.71.0 BiocGenerics_0.57.0
## [11] generics_0.1.4 mixOmics_6.35.0
## [13] ggplot2_4.0.1 lattice_0.22-7
## [15] MASS_7.3-65 gridExtra_2.3
## [17] vegan_2.7-2 permute_0.9-8
## [19] pheatmap_1.0.13 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] Rdpack_2.6.4 rlang_1.1.6
## [3] magrittr_2.0.4 compiler_4.5.2
## [5] mgcv_1.9-4 vctrs_0.6.5
## [7] reshape2_1.4.5 stringr_1.6.0
## [9] crayon_1.5.3 pkgconfig_2.0.3
## [11] fastmap_1.2.0 backports_1.5.0
## [13] XVector_0.51.0 labeling_0.4.3
## [15] rmarkdown_2.30 nloptr_2.2.1
## [17] purrr_1.2.0 xfun_0.55
## [19] cachem_1.1.0 jsonlite_2.0.0
## [21] DelayedArray_0.37.0 BiocParallel_1.45.0
## [23] broom_1.0.11 parallel_4.5.2
## [25] cluster_2.1.8.1 R6_2.6.1
## [27] bslib_0.9.0 stringi_1.8.7
## [29] RColorBrewer_1.1-3 car_3.1-3
## [31] boot_1.3-32 jquerylib_0.1.4
## [33] numDeriv_2016.8-1.1 Rcpp_1.1.0.8.1
## [35] knitr_1.51 Matrix_1.7-4
## [37] splines_4.5.2 igraph_2.2.1
## [39] tidyselect_1.2.1 abind_1.4-8
## [41] yaml_2.3.12 TreeSummarizedExperiment_2.19.0
## [43] codetools_0.2-20 tibble_3.3.0
## [45] lmerTest_3.1-3 plyr_1.8.9
## [47] treeio_1.35.0 withr_3.0.2
## [49] rARPACK_0.11-0 S7_0.2.1
## [51] evaluate_1.0.5 Biostrings_2.79.3
## [53] pillar_1.11.1 BiocManager_1.30.27
## [55] ggpubr_0.6.2 carData_3.0-5
## [57] reformulas_0.4.3 ellipse_0.5.0
## [59] insight_1.4.4 tidytree_0.4.6
## [61] scales_1.4.0 minqa_1.2.8
## [63] glue_1.8.0 lazyeval_0.2.2
## [65] maketools_1.3.2 tools_4.5.2
## [67] sys_3.4.3 lme4_1.1-38
## [69] RSpectra_0.16-2 ggsignif_0.6.4
## [71] fs_1.6.6 buildtools_1.0.0
## [73] grid_4.5.2 ape_5.8-1
## [75] tidyr_1.3.2 rbibutils_2.4
## [77] SingleCellExperiment_1.33.0 nlme_3.1-168
## [79] performance_0.15.3 Formula_1.2-5
## [81] cli_3.6.5 rappdirs_0.3.3
## [83] S4Arrays_1.11.1 dplyr_1.1.4
## [85] corpcor_1.6.10 gtable_0.3.6
## [87] yulab.utils_0.2.3 rstatix_0.7.3
## [89] sass_0.4.10 digest_0.6.39
## [91] SparseArray_1.11.10 ggrepel_0.9.6
## [93] farver_2.1.2 htmltools_0.5.9
## [95] lifecycle_1.0.4