simpleSumFactors {scran} | R Documentation |
Perform a simple calculation of cell-specific size factors based on summation and normalization against nearest neighbors.
## S4 method for signature 'ANY' simpleSumFactors(x, k=20, ref.k=k, trend.args=list(), approximate=FALSE, irlba.args=list(), subset.row=NULL, min.mean=1, block=NULL, ref.block=NULL, BNPARAM=NULL, BPPARAM=SerialParam()) ## S4 method for signature 'SingleCellExperiment' simpleSumFactors(x, ..., subset.row=NULL, assay.type="counts", get.spikes=FALSE, sf.out=FALSE)
x |
A numeric matrix-like object of counts, where rows are genes and columns are cells. Alternatively, a SingleCellExperiment object containing such a matrix. |
k |
Integer scalar specifying the number of nearest neighbors. |
ref.k |
Integer scalar specifying the number of nearest neighbors to use for constructing the reference pseudo-cell. |
trend.args |
A list containing named arguments to pass to |
approximate |
Logical scalar indicating whether an approximate PCA should be performed. |
irlba.args |
List containing arguments to pass to |
subset.row |
See |
min.mean |
A numeric scalar specifying the minimum (library size-adjusted) average count of genes to be used for normalization. |
block |
An optional factor specifying blocking levels for all cells. |
ref.block |
An integer or string specifying the level of |
BNPARAM |
A BiocNeighborParam object specifying the algorithm to use for nearest neighbor detection. |
BPPARAM |
A BiocParallelParam object specifying the parallelization to use. |
... |
Additional arguments to pass to |
assay.type |
A string specifying which assay values to use, e.g., |
get.spikes |
See |
sf.out |
A logical scalar indicating whether only size factors should be returned. |
This function implements a greatly simplified counterpart to the quickCluster
and computeSumFactors
workflow for computing cell-specific size factors.
Briefly, it:
Identifies the k
nearest neighbors for each cell, based on the first few PCs from the log-transformed library size-normalized expression matrix.
The number of PCs is determined running trendVar
and denoisePCA
.
Constructs a pseudo-cell from the neighbor set by taking the weighted average of the neighbors' expression profiles.
Weights are computed using a tricube kernel on the distance, with bandwidth equal to three-fold the median distance.
This protects against large distances for cell types with fewer than k
cells in the data set.
Computes the ratio of the library size for each cell to the sum of averaged values for the associated pseudo-cell. This represents an initial size factor estimate, equivalent to library size normalization. The assumption is that neighboring cells are from the same cell type and do not exhibit composition biases associated with DE.
Performs median-based normalization on the neighbor-derived pseudo-cell for each cell to a reference pseudo-cell. This yields a rescaling factor that is multiplied by the initial ratio to obtain the cell-specific size factor. The use of the median protects against DE between pseudo-cells, while the use of a summed profile avoids problems with zeros.
The number of neighbors k
represents the minimum number of cells of any cell type or state.
Reducing k
increases the robustness of the algorithm to population heterogeneity.
However, it reduces the precision of the median-based estimator and increases the risk of obtaining zero or undefined size factors after rescaling.
k
should be set to the largest tolerable value for optimal performance, though the default is usually sufficient.
The reference pseudo-cell is constructed from the neighbors of the cell with the lowest distance to the ref.k
th neighbor.
This is a rough proxy for the neighbourhood with the highest density of cells, which should also have the most stable neighbor-derived pseudo-cell.
By default, ref.k
is equal to k
but can be increased to provide a more stable reference pseudo-cell.
(This is more specific than increasing k
, which would require stronger assumptions for all cells.)
If block
is specified, the above process is performed separately for all cells within each level of block
.
To calibrate the size factors across blocks, median-based normalization is applied to the block-specific reference pseudo-cells to obtain a calibration factor.
The global reference pseudo-cell is specified with ref.block
or, if NULL
, is defined as that which is closest to the “middle” of the data set.
This refers to the median position on the first PC after performing a PCA on the log-transformed expression profiles of the block-specific reference pseudo-cells.
The size factor for each cell in each block is then multiplied by the calibration factor for that block to obtain the final size factor for each cell.
If min.mean
is positive, genes with library size-normalized averages below min.mean
will be ignored during median-based normalization.
Specifically, the threshold is applied to the grand average of the pooled profile for each cell and that of the reference cell.
This favours moderate- and high-abundance genes for which median-based normalization is more accurate.
Note that this filtering will only affect the calculation of scaling and calibration factors, and will not be used for nearest-neighbor detection.
For simpleSumFactors,ANY-method
, a numeric vector of size factors for all cells in x
is returned.
For simpleSumFactors,SingleCellExperiment-method
, an object of class x
is returned containing the vector of size factors in sizeFactors(x)
, if sf.out=FALSE
.
Otherwise, the vector of size factors is returned directly.
“cleaning zero size factor estimates”: the rescaling factor is zero for some pseudo-cells.
Increase k
to reduce the number of stochastic zeroes in each pseudo-cell.
“infinite size factors”: the rescaling factor is Inf
for some pseudo-cells.
Increase ref.k
to reduce the number of stochastic zeroes in the reference pseudo-cell.
“no genes remaining after filtering”: filtering has removed all genes for some pseudo-cells.
Lower min.mean
to retain more genes.
“calibration factors are not strictly positive”: some of the inter-block calibration factors are not positive.
Increase ref.k
to reduce the number of stochastic zeroes in each block-specific reference psuedo-cell.
Aaron Lun
Lun ATL (2018). Further scaling normalization. https://github.com/MarioniLab/FurtherNorm2018
Lun ATL, Bach K and Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17:75
quickCluster
and computeSumFactors
, on which this function was based.
# Mocking up some data. set.seed(100) popsize <- 200 ngenes <- 10000 all.facs <- 2^rnorm(popsize, sd=0.5) counts <- matrix(rnbinom(ngenes*popsize, mu=all.facs*10, size=1), ncol=popsize, byrow=TRUE) # Computing the size factors. out.facs <- simpleSumFactors(counts) head(out.facs) plot(colSums(counts), out.facs, log="xy")