Denoise with PCA {scran} | R Documentation |
Denoise log-expression data by removing principal components corresponding to technical noise.
## S4 method for signature 'ANY' denoisePCA(x, technical, design=NULL, subset.row=NULL, value=c("pca", "n", "lowrank"), min.rank=5, max.rank=100, approximate=FALSE, rand.seed=1000, irlba.args=list()) ## S4 method for signature 'SingleCellExperiment' denoisePCA(x, ..., subset.row=NULL, value=c("pca", "n", "lowrank"), assay.type="logcounts", get.spikes=FALSE, sce.out=TRUE)
x |
A numeric matrix of log-expression values for |
technical |
A function that computes the technical component of the variance for a gene with a given mean (log-)expression, see |
design |
A numeric matrix containing the experimental design. This is a deprecated option, see the details below. |
subset.row |
A logical, integer or character vector indicating the rows of |
value |
A string specifying the type of value to return; the PCs, the number of retained components, or a low-rank approximation. |
min.rank, max.rank |
Integer scalars specifying the minimum and maximum number of PCs to retain. |
approximate |
A logical scalar indicating whether approximate SVD should be performed via |
rand.seed |
A numeric scalar specifying the seed for approximate PCA when |
irlba.args |
A named list of additional arguments to pass to |
... |
Further arguments to pass to |
assay.type |
A string specifying which assay values to use. |
get.spikes |
A logical scalar specifying whether spike-in transcripts should be used.
This will be intersected with |
sce.out |
A logical scalar specifying whether a modified SingleCellExperiment object should be returned. |
This function performs a principal components analysis to reduce random technical noise in the data. Random noise is uncorrelated across genes and should be captured by later PCs, as the variance in the data explained by any single gene is low. In contrast, biological substructure should be correlated and captured by earlier PCs, as this explains more variance for sets of genes. The idea is to discard later PCs to remove technical noise and improve the resolution of substructure.
The choice of the number of PCs to discard is based on the estimates of technical variance in technical
.
This can be a vector of technical components for each gene, as produced by decomposeVar
.
Alternatively, the trend function obtained from trendVar
is used to compute the technical component for each gene, based on its mean abundance.
An estimate of the overall technical variance is estimated by summing the values across genes.
Genes with negative biological components are ignored during downstream analyses to ensure that the total variance is greater than the overall technical estimate.
The function works by assuming that the first X PCs contain all of the biological signal, while the remainder contains technical noise.
For a given value of X, an estimate of the total technical variance is calculated from the sum of variance explained by all of the later PCs.
A value of X is found such that the predicted technical variance equals the estimated technical variance.
Note that X will be coerced to lie between min.rank
and max.rank
.
For denoisePCA,ANY-method
, a numeric matrix is returned containing the selected PCs (columns) for all cells (rows) if value="pca"
.
If value="n"
, it will return an integer scalar specifying the number of retained components.
If value="lowrank"
, it will return a low-rank approximation of x
with the same dimensions.
For denoisePCA,SingleCellExperiment-method
, the return value is the same as denoisePCA,ANY-method
if sce.out=FALSE
or value="n"
.
Otherwise, a SingleCellExperiment object is returned that is a modified version of x
.
If value="pca"
, the modified object will contain the PCs as the "PCA"
entry in the reducedDims
slot.
If value="lowrank"
, it will return a low-rank approximation in assays
slot, named "lowrank"
.
In all cases, the fractions of variance explained by the first max.rank
PCs will be stored as the "percentVar"
attribute in the return value.
This is directly compatible with functions such as plotPCA
.
When value="lowrank"
, a low-rank approximation of the original matrix is computed using only the first X components.
This is useful for denoising prior to downstream applications that expect gene-wise expression profiles.
Note that approximation values are returned for all genes.
This includes “unselected” genes, i.e., with negative biological components or that were not selected with subset.row
.
The low-rank approximation is obtained for these genes by projecting their expression profiles into the low-dimensional space defined by the SVD on the selected genes.
The exception is when get.spikes=FALSE
, whereby zeroes are returned for all spike-in rows.
Previous versions of this function allowed users to specify a design
matrix to regress out uninteresting factors of variation.
This behaviour is now deprecated in favour of users running appropriate batch correction functions beforehand.
If a batch-corrected expression matrix is supplied in x
, users should also supply a DataFrame as technical
,
as calculation of the residual variance or mean from a corrected matrix is not accurate without knowledge of the blocking factors.
Any correction should preserve the residual variance of each gene for the variances to be correctly interpreted. Specifically, calculation of the variance within each batch should yield the same result in both the corrected and original matrices. This limits the possible methods for batch correction:
removeBatchEffect
performs a linear regression for each gene, removing unwanted factors while retaining relevant biological effects (if known).
This simply involves fitting a linear model and removing undesired coefficients, so the residual variance is unaffected.
mnnCorrect
will identify cells of the same biological identity across batches, and use them to compute correction vectors.
This usually requires setting cos.norm.out=FALSE
and sigma
to some large value to preserve the variance.
Aaron Lun
# Mocking up some data. ngenes <- 1000 is.spike <- 1:100 means <- 2^runif(ngenes, 6, 10) dispersions <- 10/means + 0.2 nsamples <- 50 counts <- matrix(rnbinom(ngenes*nsamples, mu=means, size=1/dispersions), ncol=nsamples) rownames(counts) <- paste0("Gene", seq_len(ngenes)) # Fitting a trend. lcounts <- log2(counts + 1) fit <- trendVar(lcounts, subset.row=is.spike) # Denoising (not including the spike-ins in the PCA; # spike-ins are automatically removed with the SingleCellExperiment method). pcs <- denoisePCA(lcounts, technical=fit$trend, subset.row=-is.spike) dim(pcs)