Label-Free Quantitative mass spectrometry based workflows for differential expression (DE) analysis of proteins is often challenging due to peptide-specific effects and context-sensitive missingness of peptide intensities.
msqrob2 provides peptide-based workflows that can
assess for DE directly from peptide intensities and outperform
summarisation methods which first aggregate MS1 peptide intensities to
protein intensities before DE analysis. However, they are
computationally expensive, often hard to understand for the
non-specialised end-user, and they do not provide protein summaries,
which are important for visualisation or downstream processing.
msqrob2 therefore also proposes a novel
summarisation strategy, which estimates MSqRob’s model parameters in a
two-stage procedure circumventing the drawbacks of peptide-based
workflows.
the summarisation based workflow in msqrob2
maintains MSqRob’s superior performance, while providing useful protein
expression summaries for plotting and downstream analysis. Summarising
peptide to protein intensities considerably reduces the computational
complexity, the memory footprint and the model complexity. Moreover, it
renders the analysis framework to become modular, providing users the
flexibility to develop workflows tailored towards specific
applications.
In this vignette we will demonstrate how to perform
msqrob’s summarisation based workflow starting from a
Maxquant search on a subset of the cptac spike-in study.
Examples on our peptide-based workflows and on the analysis of more complex designs can be found on our companion website msqrob2 book.
Technical details on our methods can be found in (L. J. Goeminne, Gevaert, and Clement 2016), (L. J. E. Goeminne et al. 2020), (Sticker et al. 2020) and (Vandenbulcke et al. 2025).
This case-study is a subset of the data of the 6th study of the Clinical Proteomic Technology Assessment for Cancer (CPTAC). In this experiment, the authors spiked the Sigma Universal Protein Standard mixture 1 (UPS1) containing 48 different human proteins in a protein background of 60 ng/\(\mu\)L Saccharomyces cerevisiae strain BY4741. Two different spike-in concentrations were used: 6A (0.25 fmol UPS1 proteins/\(\mu\)L) and 6B (0.74 fmol UPS1 proteins/\(\mu\)L) [5]. We limited ourselves to the data of LTQ-Orbitrap W at site 56. The data were searched with MaxQuant version 1.5.2.8, and detailed search settings were described in Goeminne et al. (2016) [1]. Three replicates are available for each concentration.
We load the msqrob2 package, along with additional
packages for data manipulation and visualisation.
We first import the data from the peptide.txt file. This is the file
containing your peptide-level intensities searched by MaxQuant search,
this peptide.txt file can be found by default in the
“path_to_raw_files/combined/txt/” folder from the MaxQuant output, with
“path_to_raw_files” the folder where the raw files were saved. In this
vignette, we use a MaxQuant peptide file which is a subset of the cptac
study. This data is available in msqrob2data GitHub
repository. To import the data we use the QFeatures
package.
We generate the object peptideFile with the path to the peptide.txt file. This can be a path to the data on the web or a path to data on your local computer.
peptideFile <- "https://github.com/statOmics/msqrob2data/raw/refs/heads/main/dda/cptacAvsB_lab3/peptides.txt"We will import the data using the fread function of the
data.table package. We set the check.names argument equal
to TRUE so that all column names are retrieved using the
$ operator.
We also use argument integer64 = “double” because
readQFeatures from the QFeatures package does
not process 64-bit integers correctly.
## Sequence N.term.cleavage.window C.term.cleavage.window
## <char> <char> <char>
## 1: AAAAGAGGAGDSGDAVTK EHQHDEQKAAAAGAGG DSGDAVTKIGSEDVKL
## 2: AAAALAGGK QQLSKAAKAAAALAGG AAALAGGKKSKKKWSK
## 3: AAAALAGGKK QQLSKAAKAAAALAGG AALAGGKKSKKKWSKK
## 4: AAADALSDLEIK MPKETPSKAAADALSD ALSDLEIKDSKSNLNK
## 5: AAADALSDLEIKDSK MPKETPSKAAADALSD DLEIKDSKSNLNKELE
## 6: AAAEEFQR RERKEREKAAAEEFQR AAAEEFQRQQELLRQQ
## Amino.acid.before First.amino.acid Second.amino.acid Second.last.amino.acid
## <char> <char> <char> <char>
## 1: K A A T
## 2: K A A G
## 3: K A A K
## 4: K A A I
## 5: K A A S
## 6: K A A Q
## Last.amino.acid Amino.acid.after A.Count R.Count N.Count D.Count C.Count
## <char> <char> <int> <int> <int> <int> <int>
## 1: K I 7 0 0 2 0
## 2: K K 5 0 0 0 0
## 3: K S 5 0 0 0 0
## 4: K D 4 0 0 2 0
## 5: K S 4 0 0 3 0
## 6: R Q 3 1 0 0 0
## Q.Count E.Count G.Count H.Count I.Count L.Count K.Count M.Count F.Count
## <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1: 0 0 5 0 0 0 1 0 0
## 2: 0 0 2 0 0 1 1 0 0
## 3: 0 0 2 0 0 1 2 0 0
## 4: 0 1 0 0 1 2 1 0 0
## 5: 0 1 0 0 1 2 2 0 0
## 6: 1 2 0 0 0 0 0 0 1
## P.Count S.Count T.Count W.Count Y.Count V.Count U.Count Length
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1: 0 1 1 0 0 1 0 18
## 2: 0 0 0 0 0 0 0 9
## 3: 0 0 0 0 0 0 0 10
## 4: 0 1 0 0 0 0 0 12
## 5: 0 2 0 0 0 0 0 15
## 6: 0 0 0 0 0 0 0 8
## Missed.cleavages Mass Proteins
## <int> <num> <char>
## 1: 0 1445.6746 sp|P38915|SPT8_YEAST
## 2: 0 728.4181 sp|Q3E792|RS25A_YEAST;sp|P0C0T4|RS25B_YEAST
## 3: 1 856.5131 sp|Q3E792|RS25A_YEAST;sp|P0C0T4|RS25B_YEAST
## 4: 0 1215.6347 sp|P09938|RIR2_YEAST
## 5: 1 1545.7886 sp|P09938|RIR2_YEAST
## 6: 0 920.4352 sp|P53075|SHE10_YEAST
## Leading.razor.protein Start.position End.position Unique..Groups.
## <char> <int> <int> <char>
## 1: sp|P38915|SPT8_YEAST 97 114 yes
## 2: sp|Q3E792|RS25A_YEAST 13 21 yes
## 3: sp|Q3E792|RS25A_YEAST 13 22 yes
## 4: sp|P09938|RIR2_YEAST 9 20 yes
## 5: sp|P09938|RIR2_YEAST 9 23 yes
## 6: sp|P53075|SHE10_YEAST 538 545 yes
## Unique..Proteins. Charges PEP Score Identification.type.6A_7
## <char> <char> <num> <num> <char>
## 1: yes 2 1.1800e-05 82.942 By MS/MS
## 2: no 2 7.4600e-06 134.810 By MS/MS
## 3: no 2 3.3100e-09 143.730 By MS/MS
## 4: yes 2 9.1600e-23 182.230 By MS/MS
## 5: yes 3 1.5319e-04 73.927 By MS/MS
## 6: yes 2 1.7588e-02 79.474 By matching
## Identification.type.6A_8 Identification.type.6A_9 Identification.type.6B_7
## <char> <char> <char>
## 1: By MS/MS By MS/MS By matching
## 2: By MS/MS By MS/MS By MS/MS
## 3: By MS/MS By MS/MS By MS/MS
## 4: By matching By MS/MS By MS/MS
## 5: By MS/MS By MS/MS By MS/MS
## 6: By matching By matching By matching
## Identification.type.6B_8 Identification.type.6B_9 Experiment.6A_7
## <char> <char> <int>
## 1: By MS/MS By MS/MS 1
## 2: By MS/MS By MS/MS 2
## 3: By MS/MS By MS/MS 1
## 4: By matching By matching 1
## 5: By MS/MS By MS/MS 1
## 6: By matching By matching NA
## Experiment.6A_8 Experiment.6A_9 Experiment.6B_7 Experiment.6B_8
## <int> <int> <int> <int>
## 1: 1 1 NA 1
## 2: 1 1 2 1
## 3: 1 1 1 1
## 4: 1 1 1 1
## 5: 1 1 1 1
## 6: NA 1 NA NA
## Experiment.6B_9 Intensity Intensity.6A_7 Intensity.6A_8 Intensity.6A_9
## <int> <num> <num> <num> <num>
## 1: 1 1190800 0 0 66760
## 2: 2 280990000 2441300 1220000 1337600
## 3: 1 33360000 1029200 668040 638990
## 4: 1 54622000 515460 670780 712140
## 5: 1 18910000 331130 420900 365560
## 6: 1 1158600 0 0 51558
## Intensity.6B_7 Intensity.6B_8 Intensity.6B_9 Reverse Potential.contaminant
## <num> <num> <num> <char> <char>
## 1: 0 37436 0
## 2: 2850900 935580 1606200
## 3: 777030 641270 562840
## 4: 426580 620510 737780
## 5: 329250 380820 414490
## 6: 0 0 76970
## id Protein.group.IDs Mod..peptide.IDs
## <int> <char> <char>
## 1: 0 859 0
## 2: 1 230 1
## 3: 2 230 2
## 4: 3 229 3
## 5: 4 229 4
## 6: 5 1143 5
## Evidence.IDs
## <char>
## 1: 0;1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23
## 2: 24;25;26;27;28;29;30;31;32;33;34;35;36;37;38;39;40;41;42;43;44;45;46;47;48;49;50;51;52;53;54;55;56;57;58;59;60;61;62;63;64;65;66;67;68;69;70;71;72;73
## 3: 74;75;76;77;78;79;80;81;82;83;84;85;86;87;88;89;90;91;92;93;94;95;96;97;98
## 4: 99;100;101;102;103;104;105;106;107;108;109;110;111;112;113;114;115;116;117;118;119;120;121;122;123;124;125;126;127;128;129;130;131;132;133;134;135;136;137;138;139;140;141;142;143
## 5: 144;145;146;147;148;149;150;151;152;153;154;155;156;157;158;159;160;161;162;163;164;165;166;167;168;169;170;171;172;173;174;175;176;177;178;179;180;181;182;183;184
## 6: 185;186;187;188;189;190;191;192;193;194
## MS.MS.IDs
## <char>
## 1: 0;1;2;3;4;5;6;7;8;9
## 2: 10;11;12;13;14;15;16;17;18;19;20;21;22;23;24;25;26;27;28;29
## 3: 30;31;32;33;34;35;36;37;38;39;40;41;42;43;44;45;46;47;48;49;50
## 4: 51;52;53;54;55;56;57;58;59;60;61;62;63;64;65;66;67;68;69;70;71;72;73;74;75;76;77;78;79;80;81;82;83;84
## 5: 85;86;87;88;89;90;91;92;93;94;95;96;97;98;99;100;101;102;103;104;105;106;107;108;109;110;111;112;113;114;115;116
## 6: 117;118
## Best.MS.MS Oxidation..M..site.IDs MS.MS.Count
## <int> <char> <int>
## 1: 0 10
## 2: 21 18
## 3: 31 21
## 4: 72 29
## 5: 94 32
## 6: 117 2
Each row in the peptide data table contains information about one
peptide (the table above shows the first 6 rows). The columns contains
various descriptors about the peptide, such as its sequence, its charge,
the amino acid composition, etc. Some of these columns (those starting
with Intensity) contain the quantification values for each
sample. The table format where the quantitative values for each sample
are contained in a separate column is depicted as the “wide format”, as
opposed to the “long format” (e.g., the [PSM table]).
Unlike for real experiments, we known the ground truth for this dataset: UPS proteins were spiked in at different concentrations and are thus differentially abundant (DA, spiked in). Yeast proteins consist of the background proteome that is the same in all samples and are thus non-DA.
Each row in the annotation table contains information about one sample. The columns contain various descriptors about the sample, such as the name of the sample or the MS run, the treatment (here the spike-in condition), or any other biological or technical information that may impact the data quality or the quantification. Without an annotation table, no analysis can be performed. The sample annotations are generated by the researcher. In this example, the annotations are extracted from the sample names, although reporting a detailed design of experiments in a table is better practice.
sampleId, to pinpoint the
different samples in the dataset. Since all quantification columns start
with the generic string Intensity 6 we replace the string
with an empty string "" so we keep the last 3 characters to
refer to the sample.(annot <- data.frame(quantCols = quantCols) |> #1.
mutate(sampleId = gsub("Intensity[.]6","", quantCols), #2.
condition = substr(sampleId,1,1), #3.
rep = substr(sampleId, 3, 3)) #4.
)## quantCols sampleId condition rep
## 1 Intensity.6A_7 A_7 A 7
## 2 Intensity.6A_8 A_8 A 8
## 3 Intensity.6A_9 A_9 A 9
## 4 Intensity.6B_7 B_7 B 7
## 5 Intensity.6B_8 B_8 B 8
## 6 Intensity.6B_9 B_9 B 9
The readQFeatures() enables a seamless conversion of
tabular data into a QFeatures object. We provide the
peptide table and the sample annotation table. The function will use the
quantCols column in the sample annotation table to
understand which columns in peptides contain the
quantitative values, and automatically link the corresponding sample
annotation with the quantitative values. We also tell the function to
use the Sequence column as peptide identifier, which will
be used as rownames. See ?readQFeatures() for more
details.
(qf <- readQFeatures(assayData = peptides,
colData = annot,
name = "peptides",
fnames = "Sequence")
)## Checking arguments.
## Loading data as a 'SummarizedExperiment' object.
## Formatting sample annotations (colData).
## Formatting data as a 'QFeatures' object.
## Setting assay rownames.
## An instance of class QFeatures (type: bulk) with 1 set:
##
## [1] peptides: SummarizedExperiment with 11466 rows and 6 columns
Since we have a QFeatures object, we can directly make
use of QFeatures’ data preprocessing functionality. A major
advantage of QFeatures is it provides the power to build highly modular
workflows, where each step is carried out by a dedicated function with a
large choice of available methods and parameters. This means that users
can adapt the workflow to their specific use case and their specific
needs.
The first preprocessing step is to correctly encode the missing
values. It is important that missing values are encoded using
NA. For instance, non-observed values should not be encoded
with a zero because true zeros (the proteomic feature is absent from the
sample) cannot be distinguished from technical zeros (the feature was
missed by the instrument or could not be identified). We therefore
replace any zero in the quantitative data with an NA.
Note that msqrob2 can handle missing data without having
to rely on hard-to-verify imputation assumptions, which is our general
recommendation. However, msqrob2 does not prevent users
from using imputation, which can be performed with impute()
from the QFeatures package.
We perform log2-transformation with logTransform() from
the QFeatures package.
Filtering removes low-quality and unreliable peptides that would otherwise introduce noise and artefacts in the data.
We remove peptides that could not be mapped to a protein. We also
remove peptides that cannot be uniquely mapped to a single proteins
because its sequence is shared across multiple proteins. This often
occurs for homologs or for proteins that share similar functional
domains. Shared peptides are often mapped to protein groups, which are
the set of proteins that share the given peptides. Protein groups are
encoded by appending the constituting protein identifiers, separated by
a ";".
qf <- filterFeatures(
qf, ~ Proteins != "" & ## Remove failed protein inference
!grepl(";", Proteins)) ## Remove protein groups## 'Proteins' found in 2 out of 2 assay(s).
We now remove the peptides that map to decoy sequences or to
contaminants proteins. In our peptide.txt file decoys and
contaminants are indicated with a “+” character in the column
Reverse and Potential.contaminant,
respectively. Note, that depending on the version of MaxQuant the name
of the contaminant variable differs (e.g. Contaminant or
Potential contaminant). Note, that while reading in the
peptides table spaces in the column names are replaced by a
dot.
qf <- filterFeatures(qf, ~ Reverse != "+" & ## Remove decoys
Potential.contaminant != "+") ## Remove contaminants## 'Reverse' found in 2 out of 2 assay(s).'Potential.contaminant' found in 2 out of 2 assay(s).
The data are characterised by a high proportion of missing values as
reported by nNA() from the QFeatures package.
The function computes the number and the proportion of missing values
across peptides, samples or for the whole data set and return a list of
three tables called nNArows, nNAcols, and
nNA, respectively.
## [1] 0.4245165 0.4963918
The samples contain between 42.5% and 49.6% missing values. The
missingness within each peptide is more variable, with most peptides
found accross all samples (nNA is 0) and other that could
not be quantified in any sample (nNA is 6), as depicted by
the histogram below.
data.frame(nNaRes$nNArows) |>
ggplot() +
aes(x = nNA) +
geom_histogram() +
ggtitle("Number of missing values for each peptide") +
theme_bw()## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
We keep peptides that were observed at last 3 times out of the \(n = 6\) samples, so that we can estimate the peptide characteristics. We tolerate the following proportion of NAs: \(\text{pNA} = \frac{(n - 3)}{n} = 0.5\), so we keep peptides that are observed in at least 50% of the samples, which corresponds to one treatment condition. This is an arbitrary value that may need to be adjusted depending on the experiment and the data set.
nObs <- 3
n <- ncol(qf[["peptides_log"]])
(qf <- filterNA(qf, i = "peptides_log", pNA = (n - nObs) / n))## An instance of class QFeatures (type: bulk) with 2 sets:
##
## [1] peptides: SummarizedExperiment with 10393 rows and 6 columns
## [2] peptides_log: SummarizedExperiment with 5910 rows and 6 columns
We keep 5910 peptides upon filtering.
The most common objective of MS-based proteomics experiments is to understand the biological changes in protein abundance between experimental conditions. However, changes in measurements between groups can be caused due to technical factors. For instance, there are systematic fluctuations from run-to-run that shift the measured intensity distribution. We can this explore as follows:
QFeatures’ 3-way subsetting.longForm() to convert the QFeatures
object into a long table, including all colData for
filtering and colouring.qf[, , "peptides_log"] |> #1.
longForm(colvars = colnames(colData(qf))) |> #2.
data.frame() |>
filter(!is.na(value)) |>
ggplot() + #3.
aes(x = value,
colour = condition,
group = colname) +
geom_density() +
theme_minimal()## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 6 sampleMap rows not in names(experiments)
Even in this clean synthetic data set (same yeast background with some spiked UPS proteins), the marginal precursor intensity distributions across samples are not well aligned. Ignoring this effect will increase the noise and reduce the statistical power of the experiment, and may also, in case of unbalanced designs, introduce confounding effects that will bias the results.
There are many ways to perform the normalisation, e.g.
median centering is a popular choice. If we subtract the sample median
at the log scale from each precursor log2 intensity, this basically
boils down to calculating log-ratio’s between the precursor intensity
and its sample median. Here, we choose to work with offsets, that are
chosen in such a way so that the distributions are centered around the
location of a reference sample. E.g. the median of the sample
medians.
Note that users can also use all normalisation functions that are
provided in QFeatures, i.e. by replacing the sweep function by
QFeatures::normalize.
qf <- sweep( #Subtract log2 norm factor column-by-column (MARGIN = 2)
qf,
MARGIN = 2,
STATS = nfLogMedian(qf,"peptides_log"),
i = "peptides_log",
name = "peptides_norm"
)## This function aims to calculate norm factors on a log scale,
## the input data are assumed to be on the log-scale!
We explore the effect of the global normalisation in the subsequent plot.
Formally, the function applies the following operation on each sample \(i\) across all precursors \(p\):
\[ y_{ip}^{\text{norm}} = y_{ip} - \log_2(nf_i) \]
with \(y_{ip}\) the log2-transformed intensities and \(\log_2(nf_i)\) the log2-transformed norm factor. Upon normalisation, we can see that the distribution of the \(y_{ip}^{\text{norm}}\) nicely overlap (using the same code as above)
qf[, , "peptides_norm"] |>
longForm(colvars = colnames(colData(qf))) |>
data.frame() |>
filter(!is.na(value)) |>
ggplot() +
aes(x = value,
colour = condition,
group = colname) +
geom_density() +
labs(subtitle = "Normalised log2 precursor intensities") +
theme_minimal()## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 12 sampleMap rows not in names(experiments)
Here, we summarise the peptide-level data into protein intensities
using aggregateFeatures(), which streamlines summarisation.
It requires the name of a rowData column to group the
peptides into proteins (or protein groups), here Proteins.
We can provide the summarisation approach through the fun
argument. We use the standard summarisation in aggregateFeatures, which
is a robust summarisation method.
qf <- aggregateFeatures(qf,
i = "peptides_norm",
fcol = "Proteins",
na.rm = TRUE,
name = "proteins"
)## | | | 0%
## | |======================================================================| 100%
## The following messages occurred during aggregation:
##
## Your quantitative and row data contain missing values. Please read the
## relevant section(s) in the aggregateFeatures manual page regarding the
## effects of missing values on data aggregation.
##
## Occurred during the aggregation of set(s): peptides
Note, that robust summarisation can take a long time for large
datasets. In that case we suggest to use
fun = MsCoreUtils::medianPolish, na.rm = TRUE. Note that
the optional argument na.rm = TRUE is needed when using
medianPolish because the data contains missing values as we
did not adopt imputation and medianPolish by default cannot
handle missing values.
Data exploration aims to highlight the main sources of variation in the data prior to data modelling and can pinpoint to outlying or off-behaving samples.
qf[, , "peptides_norm"] |>
longForm(colvars = colnames(colData(qf))) |>
data.frame() |>
filter(!is.na(value)) |>
ggplot() +
aes(x = value,
colour = condition,
group = colname) +
geom_density() +
theme_minimal() +
labs(subtitle = "Normalised log2 precursor intensities")## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 18 sampleMap rows not in names(experiments)
qf[, , "peptides_norm"] |>
longForm(colvars = colnames(colData(qf))) |>
data.frame() |>
filter(!is.na(value)) |>
ggplot() +
aes(x = sampleId,
y = value,
colour = condition,
group = colname) +
xlab("sample") +
geom_boxplot() +
theme_minimal() +
labs(subtitle = "Normalised log2 precursor intensities")## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 18 sampleMap rows not in names(experiments)
We do the same for the protein-level values.
qf[, , "proteins"] |>
longForm(colvars = colnames(colData(qf))) |>
data.frame() |>
filter(!is.na(value)) |>
ggplot() +
aes(x = value,
colour = condition,
group = colname) +
geom_density() +
theme_minimal() +
labs(subtitle = "Normalised log2 protein intensities")## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 18 sampleMap rows not in names(experiments)
qf[, , "proteins"] |>
longForm(colvars = colnames(colData(qf))) |>
data.frame() |>
filter(!is.na(value)) |>
ggplot() +
aes(x = sampleId,
y = value,
colour = condition,
group = colname) +
xlab("sample") +
geom_boxplot() +
theme_minimal() +
labs(subtitle = "Normalised log2 protein intensities")## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 18 sampleMap rows not in names(experiments)
qf[,,"peptides_norm"] |>
longForm(colvars = colnames(colData(qf)),
rowvars= c("Sequence",
"Proteins")) |>
data.frame() |>
filter(!is.na(value)) |>
group_by(condition, sampleId) |>
summarise(Precursors = length(unique(Sequence)),
`Protein Groups` = length(unique(Proteins))) |>
pivot_longer(-(1:2),
names_to = "Feature",
values_to = "IDs") |>
ggplot(aes(x = sampleId, y = IDs, fill = condition)) +
geom_col() +
#scale_fill_observable() +
facet_wrap(~Feature,
scales = "free_y") +
labs(title = "Peptide and protein group identificiations per sample",
x = "Sample",
y = "Identifications") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90))## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 18 sampleMap rows not in names(experiments)
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by condition and sampleId.
## ℹ Output is grouped by condition.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(condition, sampleId))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
A common approach for data exploration is to perform dimension reduction, such as Multi Dimensional Scaling (MDS). We will first extract the set to explore along the sample annotations (used for plot colouring).
getWithColData(qf, "proteins") |>
as("SingleCellExperiment") |>
runMDS(exprs_values = 1) |>
plotMDS(colour_by = "condition", shape_by = "rep", point_size = 3)## Warning: 'experiments' dropped; see 'drops()'
This plot reveals interesting information. First, we see that the
samples are nicely separated according to their spike-in condition. The
also seem to line up according to rep.
corMat <- qf[["proteins"]] |>
assay() |>
cor(method = "spearman", use = "pairwise.complete.obs")
colnames(corMat) <- qf$sampleId
rownames(corMat) <- qf$sampleId
corMat |>
ggcorrplot::ggcorrplot() +
scale_fill_viridis_c() +
labs(title = "Correlation matrix",
fill = "Correlation") +
theme(axis.text.x = element_text(angle = 90))## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggcorrplot package.
## Please report the issue at <https://github.com/kassambara/ggcorrplot/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
For this experiment, we can only model a single source of variation: spike-in concentration. All remaining sources of variability, e.g. pipetting errors, MS run to run variability, etc will be lumped in the residual variability (error term).
Spike-in condition effects: we model the source of variation induced by the experimental treatment of interest as a fixed effect, which we consider non-random, i.e. the treatment effect is assumed to be the same in repeated experiments, but it is unknown and has to be estimated.
When modelling a typical label-free experiment at the protein level, the model boils down to the following linear model:
\[ y_i = \mathbf{x}_i^T\boldsymbol{\beta}
+\epsilon_i\] with the \(\log_2\)-normalised and summarised protein
intensities in sample; \(\mathbf{x}_i\)
a vector with the covariate pattern for the sample encoding the
intercept, spike-in condition, or other experimental factors;
\(\boldsymbol{\beta}\) the vector of
parameters that model the association between the covariates and the
outcome; and \(\epsilon_i\) the
residuals reflecting variation that is not captured by the fixed
effects. Note that allows for a flexible parameterisation of the
treatment beyond a single covariate, i.e. including a 1 for the
intercept, continuous and categorical variables as well as their
interactions. We assume the residuals to be independent and identically
distributed (i.i.d) according to a normal distribution with zero mean
and constant variance, i.e. \(\epsilon_i \sim
N(0, \sigma^2_\epsilon)\), that can differ from protein to
protein.
In R, this model is encoded using the following simple formula:
We estimate the model with msqrob(). The function takes
the QFeatures object, extracts the quantitative values from the
"proteins" set generated during summarisation, and fits a
simple linear model with condition as covariate, which are
automatically retrieved from colData(qf). Note, that
msqrob() can also take a SummarizedExperiment
of precursor, peptide or protein intensities as its input.
We enable M-estimation (robust = TRUE) for improved
robustness against outliers.
The fitting results are available in the msqrobModels
column of the rowData. More specifically, the modelling output is stored
in the rowData as a statModel object, one
model per row (protein). We will see in a later section how to perform
statistical inference on the estimated parameters.
## $`O00762ups|UBE2C_HUMAN_UPS`
## Object of class "StatModel"
## The type is "rlm"
## There number of elements is 5
##
## $`P00167ups|CYB5_HUMAN_UPS`
## Object of class "StatModel"
## The type is "fitError"
## There number of elements is 5
##
## $`P00441ups|SODC_HUMAN_UPS`
## Object of class "StatModel"
## The type is "rlm"
## There number of elements is 5
We can now convert the research question “do the spike-in conditions affect the protein intensities?” into a statistical hypothesis.
In other words, we need to translate this question in a null and alternative hypothesis on a single model parameter or on a linear combination of model parameters, which is also referred to with a contrast.
To aid defining contrasts, we will visualise the experimental design
using the ExploreModelMatrix package.
vd <- ExploreModelMatrix::VisualizeDesign(
sampleData = colData(qf),
designFormula = model,
textSizeFitted = 4
)
vd$plotlist## [[1]]
This plot shows that the average protein \(\log_2\) intensity for conditionA is
modelled by (Intercept) and that for the
conditionB group by (Intercept) + conditionB
.
Hence, to assess differential abundance of proteins between the two spike-in conditions, we will have to assess the following null hypothesis: \(\log_2 FC_{B-A} = (Intercept + conditionB) - Intercept = conditionB = 0\).
Based on this null hypothesis we now generate a contrast matrix.
## conditionB
## (Intercept) 0
## conditionB 1
We can now test our null hypothesis using
hypothesisTest() which takes the QFeatures
object with the fitted model and the contrast we just built.
msqrob2 automatically applies the hypothesis testing to all
features in the assay.
The results tables for the contrast are stored in the row data of the
proteins assay in qf under the colomnames of
the contrast matrix L.
Users can directly access the results tables via de colData. However,
msqrob2 also provides the msqrobCollect
function. This function will automatically retrieve the results tables
for all contrasts in contrast matrix L and combine them by
default in one table. This table contains two additional columns:
contrast with the name of the contrast and
feature with the name of the modelled feature,
here protein.
## logFC se df t
## conditionB.O00762ups|UBE2C_HUMAN_UPS 1.5026502 0.1418246 6.066186 10.595131
## conditionB.P00167ups|CYB5_HUMAN_UPS NA NA NA NA
## conditionB.P00441ups|SODC_HUMAN_UPS -0.4291633 0.1782764 6.117213 -2.407292
## conditionB.P00709ups|LALBA_HUMAN_UPS 1.3257554 0.4567491 6.117213 2.902590
## conditionB.P00915ups|CAH1_HUMAN_UPS NA NA NA NA
## conditionB.P00918ups|CAH2_HUMAN_UPS 1.4592637 0.1961286 7.117213 7.440342
## pval adjPval contrast
## conditionB.O00762ups|UBE2C_HUMAN_UPS 3.873132e-05 0.004522272 conditionB
## conditionB.P00167ups|CYB5_HUMAN_UPS NA NA conditionB
## conditionB.P00441ups|SODC_HUMAN_UPS 5.198259e-02 0.893152586 conditionB
## conditionB.P00709ups|LALBA_HUMAN_UPS 2.664631e-02 0.668425430 conditionB
## conditionB.P00915ups|CAH1_HUMAN_UPS NA NA conditionB
## conditionB.P00918ups|CAH2_HUMAN_UPS 1.329479e-04 0.010449706 conditionB
## feature
## conditionB.O00762ups|UBE2C_HUMAN_UPS O00762ups|UBE2C_HUMAN_UPS
## conditionB.P00167ups|CYB5_HUMAN_UPS P00167ups|CYB5_HUMAN_UPS
## conditionB.P00441ups|SODC_HUMAN_UPS P00441ups|SODC_HUMAN_UPS
## conditionB.P00709ups|LALBA_HUMAN_UPS P00709ups|LALBA_HUMAN_UPS
## conditionB.P00915ups|CAH1_HUMAN_UPS P00915ups|CAH1_HUMAN_UPS
## conditionB.P00918ups|CAH2_HUMAN_UPS P00918ups|CAH2_HUMAN_UPS
We will return a table of proteins for which the contrast is significant at the 5% FDR level.
alpha <- 0.05 #1.
sigList <- inferences |>
filter(adjPval < alpha) #3.
kable(sigList) |>
kable_styling(full_width = FALSE) |>
scroll_box(height = "250px") #4.| logFC | se | df | t | pval | adjPval | contrast | feature | |
|---|---|---|---|---|---|---|---|---|
| conditionB.O00762ups|UBE2C_HUMAN_UPS | 1.502650 | 0.1418246 | 6.066186 | 10.595131 | 0.0000387 | 0.0045223 | conditionB | O00762ups|UBE2C_HUMAN_UPS |
| conditionB.P00918ups|CAH2_HUMAN_UPS | 1.459264 | 0.1961286 | 7.117213 | 7.440342 | 0.0001329 | 0.0104497 | conditionB | P00918ups|CAH2_HUMAN_UPS |
| conditionB.P01031ups|CO5_HUMAN_UPS | 1.744301 | 0.1856229 | 5.528016 | 9.397013 | 0.0001323 | 0.0104497 | conditionB | P01031ups|CO5_HUMAN_UPS |
| conditionB.P01127ups|PDGFB_HUMAN_UPS | 1.475470 | 0.1622265 | 6.836031 | 9.095124 | 0.0000460 | 0.0045223 | conditionB | P01127ups|PDGFB_HUMAN_UPS |
| conditionB.P01344ups|IGF2_HUMAN_UPS | 1.481863 | 0.1289776 | 6.117213 | 11.489306 | 0.0000228 | 0.0033600 | conditionB | P01344ups|IGF2_HUMAN_UPS |
| conditionB.P01375ups|TNFA_HUMAN_UPS | 1.858881 | 0.2319031 | 6.117213 | 8.015766 | 0.0001824 | 0.0134425 | conditionB | P01375ups|TNFA_HUMAN_UPS |
| conditionB.P02144ups|MYG_HUMAN_UPS | 1.367831 | 0.2173194 | 7.117213 | 6.294107 | 0.0003799 | 0.0248859 | conditionB | P02144ups|MYG_HUMAN_UPS |
| conditionB.P02787ups|TRFE_HUMAN_UPS | 1.462212 | 0.1518081 | 6.604679 | 9.631979 | 0.0000397 | 0.0045223 | conditionB | P02787ups|TRFE_HUMAN_UPS |
| conditionB.P06732ups|KCRM_HUMAN_UPS | 1.615841 | 0.1315682 | 7.088294 | 12.281393 | 0.0000049 | 0.0014503 | conditionB | P06732ups|KCRM_HUMAN_UPS |
| conditionB.P07339ups|CATD_HUMAN_UPS | 1.901800 | 0.1838925 | 6.117213 | 10.341911 | 0.0000422 | 0.0045223 | conditionB | P07339ups|CATD_HUMAN_UPS |
| conditionB.P08311ups|CATG_HUMAN_UPS | 1.543161 | 0.1117913 | 7.117213 | 13.803946 | 0.0000021 | 0.0008392 | conditionB | P08311ups|CATG_HUMAN_UPS |
| conditionB.P09211ups|GSTP1_HUMAN_UPS | 1.171263 | 0.1876560 | 6.006373 | 6.241542 | 0.0007802 | 0.0459900 | conditionB | P09211ups|GSTP1_HUMAN_UPS |
| conditionB.P10636-8ups|TAU_HUMAN_UPS | 1.264263 | 0.1138747 | 7.117213 | 11.102235 | 0.0000095 | 0.0016011 | conditionB | P10636-8ups|TAU_HUMAN_UPS |
| conditionB.P12081ups|SYHC_HUMAN_UPS | 1.411765 | 0.1810004 | 6.117213 | 7.799792 | 0.0002128 | 0.0147581 | conditionB | P12081ups|SYHC_HUMAN_UPS |
| conditionB.P51965ups|UB2E1_HUMAN_UPS | 1.646747 | 0.1484305 | 7.117213 | 11.094395 | 0.0000095 | 0.0016011 | conditionB | P51965ups|UB2E1_HUMAN_UPS |
| conditionB.P61769ups|B2MG_HUMAN_UPS | 2.007500 | 0.3295815 | 7.029565 | 6.091059 | 0.0004874 | 0.0302429 | conditionB | P61769ups|B2MG_HUMAN_UPS |
| conditionB.P63165ups|SUMO1_HUMAN_UPS | 1.044336 | 0.1317043 | 7.117213 | 7.929393 | 0.0000883 | 0.0080056 | conditionB | P63165ups|SUMO1_HUMAN_UPS |
| conditionB.P63279ups|UBC9_HUMAN_UPS | 1.804380 | 0.1170894 | 7.117213 | 15.410281 | 0.0000010 | 0.0008392 | conditionB | P63279ups|UBC9_HUMAN_UPS |
| conditionB.P99999ups|CYC_HUMAN_UPS | 2.043480 | 0.1462042 | 7.117213 | 13.976892 | 0.0000020 | 0.0008392 | conditionB | P99999ups|CYC_HUMAN_UPS |
| conditionB.Q06830ups|PRDX1_HUMAN_UPS | 1.564310 | 0.1362011 | 7.114248 | 11.485293 | 0.0000075 | 0.0016011 | conditionB | Q06830ups|PRDX1_HUMAN_UPS |
A volcano plot is a common visualisation that provides an overview of
the hypothesis testing results, plotting the \(-\log_{10}\) p-value1 as a function of the
estimated log fold change. Volcano plots can be used to highlight the
most interesting proteins that have large fold changes and/or are highly
significant. We can use the table above directly to build a volcano plot
using ggplot2 functionality. We also highlight which
proteins are UPS standards, known to be differentially abundant by
experimental design.
Experienced users can make the plot themselves. However, in msqrob2
we also provide the plotVolcano function that generates volcanoplots
based on msqrob2 inference tables generated with the
hypothesisTest function.
We construct heatmaps for the significant proteins for each contrast.
sig <- sigList |> pull(feature) #1.
quants <- assay(qf,"proteins")[sig,] |>
t() |>
scale() |>
t() #2.
colnames(quants) <- qf$sampleId #3.
annotations <- columnAnnotation(condition = qf$condition) #3.
set.seed(1234) ## annotation colours are randomly generated by default
Heatmap(
quants,
name = "log2 intensity",
top_annotation = annotations
) #4.Note, that 20 DA proteins are returned and that they are all UPS proteins!
A typical workflow for the analysis of a proteomics experiment will end here.
Note, that the evaluations in this section are typically not possible on real experimental data. Indeed, we are using a spike-in dataset so we know the ground truth: all human UPS proteins are differentially abundant (DA) between the conditions and the yeast proteins are non-DA.
We use a nominal FDR level of 0.05
As this is a spike-in study with known ground truth, we can also plot the log2 fold change distributions against the expected values, in this case 0 for the yeast proteins and the difference of the log concentration for the spiked-in UPS standards, i.e. \(logFC = \log_2 (0.74 fmol/0.25 fmol) = \log_2 2.96 = 1.57\).
We collect the inference table again and now add the species information.
inferences <-
msqrobCollect(qf[["proteins"]], L) |>
mutate(species = rowData(qf[["proteins"]])$species)We now evaluate if the estimated fold changes correspond to the real fold changes.
logFC <- inferences |>
filter(!is.na(logFC)) |>
ggplot(aes(x = species, y = logFC, color = species)) + #1.
geom_boxplot() + #2.
theme_bw() +
scale_color_manual(
values = c("grey20", "firebrick"), #3.
name = "",
labels = c("yeast", "ups")
) +
geom_hline(yintercept = 0, color="grey20") + # 4.
geom_hline(yintercept = realLogFC, color="firebrick") #5.
logFCNote, that the msqrob2 estimates are nicely distributed around the true fold changes.
All human UPS proteins are differentially abundant (DA) between the conditions and the yeast proteins are non-DA.
inferences |>
filter(adjPval < alpha) |>
summarise("TP" = sum(species == "ups"),
"FP" = sum(species != "ups"),
"FDP" = mean(species != "ups"))## TP FP FDP
## 1 20 0 0
Note, that at the 5% FDR level 20 true positives (TP, UPS proteins) are returned as DA, 0 false positives (FP, yeast proteins) are returned DA, resulting in a false discovery proportion (FDP) of 0, confirming that msqrob2 is providing strong FDR control for this dataset.
We generate the TPR-FDP curves to assess the performance to prioritise differentially abundant proteins. Again, these curves are built using the ground truth information about which proteins are differentially abundant (spiked in) and which proteins are constant across samples. We create two functions to compute the TPR and the FDP.
computeFDP <- function(pval, tp) {
ord <- order(pval)
fdp <- cumsum(!tp[ord]) / 1:length(tp)
fdp[order(ord)]
}
computeTPR <- function(pval, tp, nTP = NULL) {
if (is.null(nTP)) nTP <- sum(tp)
ord <- order(pval)
tpr <- cumsum(tp[ord]) / nTP
tpr[order(ord)]
}We apply these functions and compute the corresponding metric using the statistical inference results and the ground truth information.
performance <- inferences |> group_by(contrast) |>
na.exclude() |>
mutate(tpr = computeTPR(pval, species == "ups"),
fdp = computeFDP(pval, species == "ups")) |>
arrange(fdp)We also highlight the observed FDP at a \(alpha = 5\%\) FDR threshold.
ggplot(performance) +
aes(
y = fdp,
x = tpr,
) +
geom_line() +
geom_point(data = workPoints, size = 3) +
geom_hline(yintercept = 0.05, linetype = 2) +
coord_flip(ylim = c(0, 0.2)) +
theme_minimal()Note, that more examples can be found in our msqrob2 book.
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] scater_1.39.4 scuttle_1.21.6
## [3] SingleCellExperiment_1.33.2 ComplexHeatmap_2.27.1
## [5] kableExtra_1.4.0 ExploreModelMatrix_1.23.1
## [7] stringr_1.6.0 msqrob2_1.19.3
## [9] ggplot2_4.0.3 tidyr_1.3.2
## [11] dplyr_1.2.1 QFeatures_1.21.3
## [13] MultiAssayExperiment_1.37.4 SummarizedExperiment_1.41.1
## [15] Biobase_2.71.0 GenomicRanges_1.63.2
## [17] Seqinfo_1.1.0 IRanges_2.45.0
## [19] S4Vectors_0.49.2 BiocGenerics_0.57.1
## [21] generics_0.1.4 MatrixGenerics_1.23.0
## [23] matrixStats_1.5.0 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3 rstudioapi_0.18.0
## [4] jsonlite_2.0.0 shape_1.4.6.1 magrittr_2.0.5
## [7] ggbeeswarm_0.7.3 farver_2.1.2 nloptr_2.2.1
## [10] rmarkdown_2.31 GlobalOptions_0.1.4 vctrs_0.7.3
## [13] Cairo_1.7-0 minqa_1.2.8 BiocBaseUtils_1.13.0
## [16] htmltools_0.5.9 S4Arrays_1.11.1 BiocNeighbors_2.5.4
## [19] SparseArray_1.11.13 sass_0.4.10 bslib_0.10.0
## [22] htmlwidgets_1.6.4 plyr_1.8.9 cachem_1.1.0
## [25] buildtools_1.0.0 igraph_2.3.0 mime_0.13
## [28] lifecycle_1.0.5 iterators_1.0.14 pkgconfig_2.0.3
## [31] rsvd_1.0.5 Matrix_1.7-5 R6_2.6.1
## [34] fastmap_1.2.0 rbibutils_2.4.1 shiny_1.13.0
## [37] clue_0.3-68 digest_0.6.39 colorspace_2.1-2
## [40] irlba_2.3.7 textshaping_1.0.5 beachmat_2.27.5
## [43] labeling_0.4.3 abind_1.4-8 compiler_4.5.3
## [46] withr_3.0.2 doParallel_1.0.17 S7_0.2.2
## [49] ggcorrplot_0.1.4.1 BiocParallel_1.45.0 viridis_0.6.5
## [52] MASS_7.3-65 DelayedArray_0.37.1 rjson_0.2.23
## [55] tools_4.5.3 vipor_0.4.7 otel_0.2.0
## [58] beeswarm_0.4.0 httpuv_1.6.17 glue_1.8.1
## [61] nlme_3.1-169 promises_1.5.0 cluster_2.1.8.2
## [64] reshape2_1.4.5 gtable_0.3.6 data.table_1.18.2.1
## [67] ScaledMatrix_1.19.0 BiocSingular_1.27.1 xml2_1.5.2
## [70] XVector_0.51.0 ggrepel_0.9.8 foreach_1.5.2
## [73] pillar_1.11.1 limma_3.67.3 later_1.4.8
## [76] rintrojs_0.3.4 circlize_0.4.18 splines_4.5.3
## [79] lattice_0.22-9 tidyselect_1.2.1 maketools_1.3.2
## [82] knitr_1.51 gridExtra_2.3 reformulas_0.4.4
## [85] ProtGenerics_1.43.0 svglite_2.2.2 xfun_0.57
## [88] shinydashboard_0.7.3 statmod_1.5.1 DT_0.34.0
## [91] stringi_1.8.7 lazyeval_0.2.3 yaml_2.3.12
## [94] boot_1.3-32 evaluate_1.0.5 codetools_0.2-20
## [97] MsCoreUtils_1.23.9 tibble_3.3.1 BiocManager_1.30.27
## [100] cli_3.6.6 xtable_1.8-8 systemfonts_1.3.2
## [103] Rdpack_2.6.6 jquerylib_0.1.4 Rcpp_1.1.1-1
## [106] png_0.1-9 parallel_4.5.3 assertthat_0.2.1
## [109] AnnotationFilter_1.35.0 lme4_2.0-1 viridisLite_0.4.3
## [112] scales_1.4.0 purrr_1.2.2 crayon_1.5.3
## [115] GetoptLong_1.1.1 rlang_1.2.0 cowplot_1.2.0
## [118] shinyjs_2.1.1
Note, that upon this transformation a value of 1 represents a p-value of 0.1, 2 a p-value of 0.01, 3 a p-value of 0.001, etc.↩︎