simpleSumFactors {scran}R Documentation

Quickly compute size factors via summation

Description

Perform a simple calculation of cell-specific size factors based on summation and normalization against nearest neighbors.

Usage

## S4 method for signature 'ANY'
simpleSumFactors(x, k=20, ref.k=k, trend.args=list(), approximate=FALSE, 
    irlba.args=list(), subset.row=NULL, min.mean=1, block=NULL, ref.block=NULL,
    BNPARAM=NULL, BPPARAM=SerialParam())

## S4 method for signature 'SingleCellExperiment'
simpleSumFactors(x, ..., subset.row=NULL, assay.type="counts", 
    get.spikes=FALSE, sf.out=FALSE) 

Arguments

x

A numeric matrix-like object of counts, where rows are genes and columns are cells. Alternatively, a SingleCellExperiment object containing such a matrix.

k

Integer scalar specifying the number of nearest neighbors.

ref.k

Integer scalar specifying the number of nearest neighbors to use for constructing the reference pseudo-cell.

trend.args

A list containing named arguments to pass to trendVar.

approximate

Logical scalar indicating whether an approximate PCA should be performed.

irlba.args

List containing arguments to pass to irlba if approximate=TRUE.

subset.row

See ?"scran-gene-selection".

min.mean

A numeric scalar specifying the minimum (library size-adjusted) average count of genes to be used for normalization.

block

An optional factor specifying blocking levels for all cells.

ref.block

An integer or string specifying the level of block to use as the reference.

BNPARAM

A BiocNeighborParam object specifying the algorithm to use for nearest neighbor detection.

BPPARAM

A BiocParallelParam object specifying the parallelization to use.

...

Additional arguments to pass to simpleSumFactors,ANY-method.

assay.type

A string specifying which assay values to use, e.g., "counts" or "logcounts".

get.spikes

See ?"scran-gene-selection".

sf.out

A logical scalar indicating whether only size factors should be returned.

Details

This function implements a greatly simplified counterpart to the quickCluster and computeSumFactors workflow for computing cell-specific size factors. Briefly, it:

  1. Identifies the k nearest neighbors for each cell, based on the first few PCs from the log-transformed library size-normalized expression matrix. The number of PCs is determined running trendVar and denoisePCA.

  2. Constructs a pseudo-cell from the neighbor set by taking the weighted average of the neighbors' expression profiles. Weights are computed using a tricube kernel on the distance, with bandwidth equal to three-fold the median distance. This protects against large distances for cell types with fewer than k cells in the data set.

  3. Computes the ratio of the library size for each cell to the sum of averaged values for the associated pseudo-cell. This represents an initial size factor estimate, equivalent to library size normalization. The assumption is that neighboring cells are from the same cell type and do not exhibit composition biases associated with DE.

  4. Performs median-based normalization on the neighbor-derived pseudo-cell for each cell to a reference pseudo-cell. This yields a rescaling factor that is multiplied by the initial ratio to obtain the cell-specific size factor. The use of the median protects against DE between pseudo-cells, while the use of a summed profile avoids problems with zeros.

The number of neighbors k represents the minimum number of cells of any cell type or state. Reducing k increases the robustness of the algorithm to population heterogeneity. However, it reduces the precision of the median-based estimator and increases the risk of obtaining zero or undefined size factors after rescaling. k should be set to the largest tolerable value for optimal performance, though the default is usually sufficient.

The reference pseudo-cell is constructed from the neighbors of the cell with the lowest distance to the ref.kth neighbor. This is a rough proxy for the neighbourhood with the highest density of cells, which should also have the most stable neighbor-derived pseudo-cell. By default, ref.k is equal to k but can be increased to provide a more stable reference pseudo-cell. (This is more specific than increasing k, which would require stronger assumptions for all cells.)

If block is specified, the above process is performed separately for all cells within each level of block. To calibrate the size factors across blocks, median-based normalization is applied to the block-specific reference pseudo-cells to obtain a calibration factor. The global reference pseudo-cell is specified with ref.block or, if NULL, is defined as that which is closest to the “middle” of the data set. This refers to the median position on the first PC after performing a PCA on the log-transformed expression profiles of the block-specific reference pseudo-cells. The size factor for each cell in each block is then multiplied by the calibration factor for that block to obtain the final size factor for each cell.

If min.mean is positive, genes with library size-normalized averages below min.mean will be ignored during median-based normalization. Specifically, the threshold is applied to the grand average of the pooled profile for each cell and that of the reference cell. This favours moderate- and high-abundance genes for which median-based normalization is more accurate. Note that this filtering will only affect the calculation of scaling and calibration factors, and will not be used for nearest-neighbor detection.

Value

For simpleSumFactors,ANY-method, a numeric vector of size factors for all cells in x is returned.

For simpleSumFactors,SingleCellExperiment-method, an object of class x is returned containing the vector of size factors in sizeFactors(x), if sf.out=FALSE. Otherwise, the vector of size factors is returned directly.

Trouble-shooting errors and warnings

“cleaning zero size factor estimates”: the rescaling factor is zero for some pseudo-cells. Increase k to reduce the number of stochastic zeroes in each pseudo-cell.

“infinite size factors”: the rescaling factor is Inf for some pseudo-cells. Increase ref.k to reduce the number of stochastic zeroes in the reference pseudo-cell.

“no genes remaining after filtering”: filtering has removed all genes for some pseudo-cells. Lower min.mean to retain more genes.

“calibration factors are not strictly positive”: some of the inter-block calibration factors are not positive. Increase ref.k to reduce the number of stochastic zeroes in each block-specific reference psuedo-cell.

Author(s)

Aaron Lun

References

Lun ATL (2018). Further scaling normalization. https://github.com/MarioniLab/FurtherNorm2018

Lun ATL, Bach K and Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17:75

See Also

quickCluster and computeSumFactors, on which this function was based.

Examples

# Mocking up some data.
set.seed(100)
popsize <- 200
ngenes <- 10000
all.facs <- 2^rnorm(popsize, sd=0.5)
counts <- matrix(rnbinom(ngenes*popsize, mu=all.facs*10, size=1), ncol=popsize, byrow=TRUE)

# Computing the size factors.
out.facs <- simpleSumFactors(counts)
head(out.facs)
plot(colSums(counts), out.facs, log="xy")

[Package scran version 1.10.2 Index]