limpa assumes that label-free proteomics profiling has been performed on a series of biological samples, and that precursor ion intensities have been obtained for each sample (Lou et al 2023). limpa reads the precursor intensity data from popular software tools such as DIA-NN (Demichev et al 2020), Spectronaut (https://staging.biognosys.com), FragPipe (Yu et al 2023), or MaxQuant (Tyanova et al 2016). In each case, the precursor quantifications are read into a wide-format (precursor by sample) matrix of log-intensities. The only other information required to start a limpa analysis is a vector specifying the protein or protein-group membership of each precursor.
Precursor intensities typically include missing values for some samples, which cause considerable problems for downstream analyses. The missing values cannot simply be ignored, because they occur more often at low intensities, and missing value imputation introduces biases and problems of its own. limpa uses statistical models developed by Li & Smyth (2023) to evaluate how much information can be recovered from the missing values, summarizing the overall missing value “mechanism” as a detection probability curve (DPC). limpa uses the DPC, together with a Bayesian model, to summarize the precursor quantifications into protein-level quantifications and to undertake limma-based differential expression analyses (Li 2024, Ritchie et al 2015).
Any questions about limpa can be sent to the Bioconductor support forum.
Installation instructions, code browsing and other information are available from the Bioconductor home page for limpa.
This documentation can be accessed from https://smythlab.github.io/limpa/.
Li M, Cobbold SA, Smyth GK (2025). Quantification and differential analysis of mass spectrometry proteomics data with probabilistic recovery of information from missing values. bioRxiv 2025/651125. doi:10.1101/2025.04.28.651125
If y.peptide is a matrix of precursor-level
log2-intensities (including NAs), protein.id is a vector of
protein IDs, and design is a design matrix, then the
following code will quantify complete log2-expression for the proteins
without missing values and will conduct a differential expression
analysis defined by the design matrix.
library(limpa)
dpcest <- dpc(y.peptide)
y.protein <- dpcQuant(y.peptide, protein.id, dpc=dpcest)
fit <- dpcDE(y.protein, design)
fit <- eBayes(fit)
topTable(fit)
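The quick-start code above assumes that a design matrix already exists. As a minimal sketch, assuming a hypothetical two-group experiment with five replicates per group (the group labels "A" and "B" are invented for illustration), such a matrix could be built with base R as follows:

```r
# Hypothetical sample annotation: two groups with five replicates each
Group <- factor(rep(c("A", "B"), each = 5))

# Intercept plus a group effect; the second coefficient is the
# log2-fold-change of group B relative to group A
design <- model.matrix(~ Group)
colnames(design)
```

With this parametrization, the second column of the design matrix is the coefficient one would usually examine in the differential expression results.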
Real proteomics datasets are too large to analyze in this vignette, but we can demonstrate a complete reproducible analysis using a small simulated dataset. First, generate the dataset:
Loading required package: limma
The dataset is stored as a limma EList object, with components
E (log2-expression), genes (feature
annotation) and targets (sample annotation). The simulation
function can generate any number of precursors or proteins but, by
default, the dataset has 100 precursors belonging to 25 proteins and the
samples are in two groups with \(n=5\)
replicates in each group. About 40% of the expression values are
missing.
[1] 100 10
Protein DEStatus
Peptide001 Protein01 NotDE
Peptide002 Protein01 NotDE
Peptide003 Protein01 NotDE
Peptide004 Protein01 NotDE
Peptide005 Protein02 NotDE
Peptide006 Protein02 NotDE
1 2
5 5
[1] 0.389
Next we estimate the intercept and slope of the detection probability curve, which relates the probability of detection to the underlying precursor log-intensity level on the logit scale.
1 peptides are completely missing in all samples.
beta0 beta1
-3.8185 0.7455
Then we use the DPC to quantify the protein log2-expression values, with the DPC supplying the information carried by the missing values.
Estimating hyperparameters ...
Quantifying proteins ...
Proteins: 25 Peptides: 100
There are no longer any missing values, and the samples now cluster
into groups. The plotMDSUsingSEs() function is similar to
plotMDS() in the limma package, but takes account of the
standard errors generated by dpcQuant().
> Group <- factor(y.peptide$targets$Group)
> Group.color <- Group
> levels(Group.color) <- c("blue","red")
> plotMDSUsingSEs(y.protein, pch=16, col=as.character(Group.color))
Finally, we conduct a differential expression analysis using the
limma package. The dpcDE function calls limpa’s
voomaLmFitWithImputation function, which is an extension of
limma’s vooma() approach.
voomaLmFitWithImputation computes precision weights, in an
analogous way to voom for RNA-seq, but instead of using
count sizes to predict the variances it uses the quantification
precisions from dpcQuant. The plot shows how
dpcDE predicts the protein-wise variances from the
quantification uncertainties and from the average log-intensity
levels.
DEStatus NPeptides PropObs logFC AveExpr t P.Value adj.P.Val
Protein23 Up 4 0.950 0.9791 9.243 4.696 3.037e-05 0.0007592
Protein22 Up 4 1.000 0.7335 8.912 4.062 2.173e-04 0.0025529
Protein24 Down 4 0.975 -0.8420 9.329 -3.948 3.063e-04 0.0025529
Protein08 Up 4 0.450 0.6917 4.720 2.258 2.943e-02 0.1471725
Protein13 NotDE 4 0.650 -0.6116 5.960 -2.349 2.378e-02 0.1471725
Protein07 NotDE 4 0.425 0.6814 4.266 2.150 3.757e-02 0.1565210
Protein11 NotDE 4 0.650 0.5462 5.157 1.825 7.541e-02 0.2693390
Protein10 NotDE 4 0.375 0.4912 4.801 1.534 1.328e-01 0.4148864
Protein09 NotDE 4 0.275 0.3823 4.653 1.189 2.413e-01 0.6032065
Protein21 NotDE 4 0.850 0.3261 8.483 1.384 1.741e-01 0.4835411
B
Protein23 2.25712
Protein22 0.31342
Protein24 0.08621
Protein08 -3.76750
Protein13 -3.77157
Protein07 -3.99409
Protein11 -4.61641
Protein10 -5.11761
Protein09 -5.54192
Protein21 -5.58621
This small dataset has five truly DE proteins. Four of the five are top-ranked in the DE results. The other DE protein is ranked 10th in the DE results and does not achieve statistical significance because it had only 17% detected observations and, hence, a high quantification uncertainty.
To view the log-expression values for the top DE protein, together with standard errors:
For some proteomics applications, such as PTM or isoform analyses, it
is desirable to undertake quantification and differential expression
analysis for every row of data instead of summarizing multiple rows into
proteins. To impute and quantify on a row-wise basis, use the
dpcQuantByRow function instead of dpcQuant,
for example
y.peptide.complete <- dpcQuantByRow(y.peptide, dpc=dpcest)
The input is similar to that for dpcQuant, except that there is no
need to specify protein IDs, and the downstream DE analysis is the
same.
The limpa pipeline starts with a matrix of precursor intensities
(rows for precursors and columns for samples) and a character vector of
protein IDs. (Rows can alternatively correspond to peptides, proteins or
PTMs, see below). The input data can be conveniently supplied as a limma
EList object, but a plain numeric matrix containing the log-intensities
is also acceptable. Non-detected precursors should be entered as
NA.
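As a minimal sketch of the expected input, the following base R code builds a small log2-intensity matrix and a matching protein ID vector by hand. The intensities, precursor names and protein names are all invented for illustration, and the assumption that zero marks a non-detected precursor is specific to this toy example.

```r
# Hypothetical raw intensities: 4 precursors (rows) by 3 samples (columns).
# In this toy example a zero marks a precursor that was not detected.
raw <- matrix(c(1200,    0, 950,
                 800,  760,   0,
                   0,    0, 310,
                2048, 1024, 512),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("Precursor", 1:4), paste0("Sample", 1:3)))

# Non-detected precursors must be NA, not zero, before taking logs
raw[raw == 0] <- NA
y.peptide <- log2(raw)

# One protein ID per row of the matrix
protein.id <- c("ProteinA", "ProteinA", "ProteinB", "ProteinB")
stopifnot(length(protein.id) == nrow(y.peptide))

# Proportion of missing values
mean(is.na(y.peptide))
```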
limpa includes the functions readDIANN(),
readSpectronaut(), readFragPipe() and
readMaxQuant(), which read precursor level data output by
the popular mass spectrometry quantification tools DIA-NN, Spectronaut,
FragPipe, and MaxQuant respectively. In each case, limpa directly reads
files output by those tools without any need for prior processing by the
analyst.
The readDIANN() function reads the main DIA-NN “Report”
file, and it supports either the tab-delimited text format written by
DIA-NN version 1 or the Apache Parquet format used by DIA-NN version 2.
For example,
> y.peptide <- readDIANN("Report.tsv")
will read a Report.tsv file written by DIA-NN v1 from
the current working directory, and
> y.peptide <- readDIANN("Report.parquet")
will read a Report.parquet file written by DIA-NN v2.
The format can be specified explicitly but, by default, is detected
automatically from the Report file name.
The readDIANN() function also includes the option to
filter peptide-precursors based on Q-values. While limpa is robust to
different filtering choices, we recommend the following settings for
data searched with match-between-runs (MBR):
> y.peptide <- readDIANN("Report.tsv",
+ q.columns = c("Q.Value","Lib.Q.Value","Lib.PG.Q.Value"),
+ q.cutoffs = 0.01)
and the following settings if searched without MBR:
> y.peptide <- readDIANN("Report.tsv",
+ q.columns = c("Q.Value","Global.Q.Value","Global.PG.Q.Value"),
+ q.cutoffs = 0.01)
These settings follow suggestions from Thierry Nordmann (Max Planck Institute of Biochemistry).
After reading in the data, it is common to filter out non-proteotypic peptides by
and to filter out compound protein groups (protein groups mapped to two or more protein IDs) by
These filtering steps are not required by limpa but help with downstream interpretation of the results.
Protein groups can also be filtered by number of peptides, typically to remove proteins with only one detected precursor:
This step is entirely optional. We generally recommend that users
keep all proteins in order to retain maximum information. The
dpcQuant() function will still quantify complete data even
for proteins with just one peptide.
Note that peptides do not need to be filtered based on the proportion of detected or missing values. limpa can process peptides correctly even if the number of detections is small.
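To make the optional filter on the number of precursors per protein concrete, here is a minimal base R sketch. It is an illustration of the idea only and is not limpa's own filtering code; the object names and the toy data are hypothetical.

```r
set.seed(1)

# Hypothetical protein membership for six precursors
protein.id <- c("P1", "P1", "P2", "P3", "P3", "P3")

# Placeholder log2-intensities: 6 precursors (rows) by 4 samples (columns)
y.peptide <- matrix(rnorm(6 * 4, mean = 8), nrow = 6)

# Count the precursors belonging to each protein
n.precursors <- table(protein.id)

# Keep only rows whose protein has at least two precursors
keep <- as.vector(n.precursors[protein.id]) >= 2
y.filtered <- y.peptide[keep, , drop = FALSE]
protein.filtered <- protein.id[keep]

table(protein.filtered)
```

Here the singleton protein P2 is dropped. As noted above, this step is optional, and keeping all proteins retains the most information.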
Missing values for some peptides in some samples have complicated the
analysis of MS proteomics data. Peptides with very low expression values
are frequently not detected, but peptides at high expression levels may
also be undetected for a variety of reasons that are not completely
understood or easily predictable, for example ambiguity of their elution
profile with that of other peptides. If y is the true
expression level of a particular peptide in a particular sample (on the
log2 scale), then limpa assumes that the probability of detection is
given by
\[P(D | y) = F(\beta_0 + \beta_1 y)\]
where \(D\) indicates detection, \(\beta_0\) and \(\beta_1\) are the
intercept and slope of the DPC, and \(F\) is the logistic function,
given by plogis in R. This probability relationship is called the
detection probability curve (DPC) in limpa. The slope \(\beta_1\)
measures how dependent the missing value process is on the underlying
expression level. A slope of zero would mean completely random missing values,
while very large slopes correspond to left censoring. The DPC allows
limpa to recover information in a probabilistic manner from the missing
values. The larger the slope, the more information there is to recover.
We typically find \(\beta_1\) values
between about 0.7 and 1 to be representative of real MS data.
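The detection probability formula can be evaluated directly with plogis. The sketch below plugs in the intercept and slope estimated from the simulated dataset earlier in this document and uses an arbitrary grid of log2-intensities chosen only for illustration.

```r
# DPC intercept and slope estimated from the simulated dataset above
beta0 <- -3.8185
beta1 <- 0.7455

# Arbitrary grid of true log2-intensities (illustrative only)
y <- c(2, 4, 6, 8, 10)

# P(D | y) = F(beta0 + beta1 * y), with F the logistic function
p.detect <- plogis(beta0 + beta1 * y)
round(p.detect, 3)

# Log2-intensity at which detection and non-detection are equally likely
y50 <- -beta0 / beta1
y50
```

The detection probability increases with y whenever the slope is positive, and it equals one half at y = -beta0/beta1.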
The DPC is difficult to estimate because y is only
observed for detected peptides, and the detected values are a biased
representation of the complete values that in principle might have been
observed had the missing value mechanism not operated. limpa uses a
mathematical exponential tilting argument to represent the DPC in terms
of observed values only, which provides a means to estimate the DPC from
real data. The DPC slope \(\beta_1\) is
nevertheless often under-estimated if the variability of each peptide is
large.
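The bias in the detected values can be seen in a small simulation. This sketch is not the exponential tilting estimator itself; it only illustrates why the observed values over-represent high intensities. The normal distribution for the complete values, its mean and standard deviation, and the sample size are assumptions made for the example.

```r
set.seed(1)

# DPC parameters taken from the simulated dataset above
beta0 <- -3.8185
beta1 <- 0.7455

# Hypothetical complete log2-intensities (distribution assumed for illustration)
y <- rnorm(10000, mean = 6, sd = 2)

# Each value is detected with probability given by the DPC
detected <- runif(length(y)) < plogis(beta0 + beta1 * y)

mean(detected)       # overall proportion detected
mean(y)              # mean of the complete values
mean(y[detected])    # mean of the detected values, which is biased upwards
```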
limpa uses the DPC, together with a Bayesian model, to estimate the expression level of each protein in each sample. It fits an additive model \[\mu_{ij} = \gamma_i + \delta_j\] to the peptide log-intensities for each protein, where \(\mu_{ij}\) is the expected log-expression of peptide \(j\) in sample \(i\), \(\gamma_i\) is the log-expression level of the protein in sample \(i\) and \(\delta_j\) is the baseline effect for peptide \(j\). A sum-to-zero constraint is applied to the peptide effects \(\delta_j\) so that the protein expression \(\gamma_i\) represents the average log-expression of the peptides in sample \(i\). The log-likelihood consists of squared residuals for each observed peptide value and the log probability of being missing for each non-detected peptide value. A multivariate normal prior is also applied to the protein log-expression values, where the prior is estimated from the global data for all proteins. DPC-Quant maximizes the log-posterior with respect to the \(\gamma_i\)s and \(\delta_j\)s, and the final \(\gamma_i\)s become the protein quantifications. DPC-Quant also returns the posterior standard error with which each log-expression value is estimated.
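The structure of this objective can be sketched in base R for a single protein. This is a simplified illustration under stated assumptions, not limpa's implementation: it omits the multivariate normal prior, fixes the residual standard deviation at an arbitrary value, and approximates the probability of a missing value by evaluating the DPC at the fitted mean \(\mu_{ij}\). The function name negLogLik and the toy data are hypothetical.

```r
# Toy log2-intensities for one protein: 3 peptides (rows) by 4 samples (columns)
y <- rbind(c(10.1, 10.4,   NA, 11.0),
           c( 8.9,   NA,  9.6,  9.8),
           c(  NA,   NA,  7.9,  8.1))

beta0 <- -3.8185   # DPC intercept from the simulated dataset above
beta1 <- 0.7455    # DPC slope from the simulated dataset above
sigma <- 0.5       # assumed residual standard deviation (hypothetical)

# Negative log-likelihood for the additive model mu_ij = gamma_i + delta_j
negLogLik <- function(par, y, beta0, beta1, sigma) {
  n <- ncol(y)
  p <- nrow(y)
  gamma <- par[seq_len(n)]                    # protein level in each sample
  delta.free <- par[n + seq_len(p - 1)]
  delta <- c(delta.free, -sum(delta.free))    # sum-to-zero peptide effects
  mu <- outer(delta, gamma, "+")              # p x n matrix of expected values
  obs <- !is.na(y)
  # Squared residuals for the observed values
  ll.obs <- -sum((y[obs] - mu[obs])^2) / (2 * sigma^2)
  # Log probability of non-detection, approximated at the fitted mean
  ll.mis <- sum(plogis(beta0 + beta1 * mu[!obs],
                       lower.tail = FALSE, log.p = TRUE))
  -(ll.obs + ll.mis)
}

# Start from the naive column means of the observed values
start <- c(colMeans(y, na.rm = TRUE), rep(0, nrow(y) - 1))
fit <- optim(start, negLogLik, y = y, beta0 = beta0, beta1 = beta1,
             sigma = sigma, method = "BFGS")
gamma.hat <- fit$par[seq_len(ncol(y))]
round(gamma.hat, 2)
```

In the second sample only the highest-intensity peptide is observed, so the naive mean of the observed values overstates the protein level. The additive model corrects for the peptide effect, and the missing-value terms pull the estimate down further.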
Finally, the protein log2-expression values and associated
uncertainties are passed to the voomaLmFitWithImputation
function, which computes precision weights for each observation and fits
protein-wise linear models. voomaLmFitWithImputation
combines features from the voomaLmFit and
voomLmFit functions in the limma and edgeR packages, with
some extensions specific to proteomics data with missing values. It uses
both protein expression and the quantification standard errors to
predict the protein-wise variances and, hence, to construct precision
weights for downstream linear modelling. This allows the uncertainty
associated with missing value imputation to be propagated through to
the differential expression analysis.
voomaLmFitWithImputation also gives special consideration
to instances where all the expression values for a particular protein
are imputed for one or more treatment conditions, ensuring robust
differential expression analyses even for proteins with only a small
proportion of detected peptides.
limpa’s dpcDE function is a wrapper function, passing
the appropriate standard errors from dpcQuant to
voomaLmFitWithImputation. The limpa package is fully
compatible with limma analysis pipelines, allowing any complex
experimental design and other downstream tasks such as gene ontology
or pathway analysis. limpa works with any design matrix, with any
combination of explanatory factors and covariates. The
dpcDE() function accepts any argument that
voomaLmFitWithImputation or voomaLmFit do. For
example, dpcDE(y.protein, design, sample.weights=TRUE) can
be used to downweight outlier samples. Or
dpcDE(y.protein, design, block=subject) could be used to
model the correlation between repeated observations on the same
subject.
We recommend inputting precursor-level data to limpa, but limpa can
also operate on intensities, such as MaxLFQ, that have already been
summarized at the protein level, by using the dpcQuantByRow
function instead of dpcQuant.
To analyse PTMs, one would work with a tool such as Spectronaut to
obtain intensities for each distinct modification. limpa can accept a
matrix of PTM-level intensities, and can undertake differential
abundance analyses of the PTMs. Again, one would use
dpcQuantByRow in this case instead of
dpcQuant.
A third application for which dpcQuantByRow is used
instead of dpcQuant is for detecting differential isoform
usage. To detect differential precursor usage for each protein between
experimental conditions, one fits precursor-level linear models using
dpcDE, then limma::diffSplice tests for
differential usage. The full pipeline is shown in a case study (see link
below).
limpa has modest memory requirements and can be run on a laptop, even
with 100s or 1000s of samples. The most time-consuming step is
dpcQuant(), which is fast for small to moderate datasets,
but can be relatively slow, taking several hours, for very large
datasets with thousands of samples. For very large datasets with 1000s
of samples, we suggest an alternative pipeline using
imputeByExpTilt().
The limpa project was supported by Chan Zuckerberg Initiative EOSS grant 2021-237445, by Melbourne Research and CSL Translational Data Science Scholarships to ML, and by NHMRC Investigator Grant 2025645 to GS.
Demichev V, Messner CB, Vernardis SI, Lilley KS, Ralser M (2020). DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nature Methods 17(1), 41-44.
Li M, Smyth GK (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. Bioinformatics 39(5), btad200. doi:10.1093/bioinformatics/btad200
Li M, Cobbold SA, Smyth GK (2025). Quantification and differential analysis of mass spectrometry proteomics data with probabilistic recovery of information from missing values. bioRxiv 2025/651125. doi:10.1101/2025.04.28.651125
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
Lou R, Cao Y, Li S, Lang X, Li Y, Zhang Y, Shui W (2023). Benchmarking commonly used software suites and analysis workflows for DIA proteomics and phosphoproteomics. Nature Communications 14(1), 94.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47. doi:10.1093/nar/gkv007
Tyanova S, Temu T, Cox J (2016). The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nature Protocols 11, 2301-2319.
Yu F, Teo GC, Kong AT, Fröhlich K, Li GX, Demichev V, Nesvizhskii AI (2023). Analysis of DIA proteomics data using MSFragger-DIA and FragPipe computational platform. Nature Communications 14(1), 4154.