multiBatchNorm {batchelor}    R Documentation

Per-batch scaling normalization

Description

Perform scaling normalization within each batch to provide comparable results to the lowest-coverage batch.

Usage

multiBatchNorm(..., assay.type = "counts", norm.args = list(),
  min.mean = 1, subset.row = NULL, separate.spikes = FALSE)

Arguments

...

Two or more SingleCellExperiment objects containing counts and size factors. Each object is assumed to represent one batch.

assay.type

A string specifying which assay contains the counts.

norm.args

A named list of further arguments to pass to normalize.

min.mean

A numeric scalar specifying the minimum (library size-adjusted) average count of genes to be used for normalization.

subset.row

A vector specifying which features to use for computing the rescaling factors.

separate.spikes

Logical scalar indicating whether spike-in size factors should be rescaled separately from endogenous genes.

Details

In integrative analyses of multiple batches, different batches often differ substantially in coverage. This function removes systematic differences in coverage across batches to simplify downstream comparisons. It does so by rescaling the size factors using median-based normalization on the ratio of the average counts between batches. This is roughly equivalent to the between-cluster normalization described by Lun et al. (2016).
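The median-based rescaling described above can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's implementation: the helper rescale_factor is hypothetical, the data are simulated, and the per-gene averages are assumed to be already adjusted for library size.

```r
# Hypothetical helper (not part of batchelor): median ratio of
# per-gene average counts between a batch and a reference batch.
rescale_factor <- function(ave_batch, ave_ref) {
  keep <- ave_batch > 0 & ave_ref > 0   # avoid undefined ratios
  median(ave_batch[keep] / ave_ref[keep])
}

set.seed(100)
# Simulated per-gene averages: batch 2 has roughly 5-fold higher
# coverage than batch 1, with some gene-specific noise.
ave1 <- rgamma(500, shape = 2, rate = 0.2)
ave2 <- 5 * ave1 * exp(rnorm(500, sd = 0.1))

f <- rescale_factor(ave2, ave1)
f  # close to 5

# Multiplying the size factors of the deeper batch by f scales its
# normalized counts down towards the coverage of the shallower batch.
sf2 <- runif(40, 0.5, 1.5)
sf2_rescaled <- sf2 * f
```

The median is used here because it is robust to a minority of genes that genuinely differ in expression between the two batches.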

This function will adjust the size factors so that counts in high-coverage batches are scaled downwards to match the coverage of the shallowest batch. The normalize function will then add the same pseudo-count to all batches before log-transformation. By scaling downwards, we favour stronger squeezing of log-fold changes by the pseudo-count, mitigating any technical differences in variance between batches. Of course, genuine biological differences will also be shrunk, but this is less of an issue for upregulated genes with large counts.
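The squeezing effect of the pseudo-count can be seen with a small calculation. The function lfc below is a hypothetical helper written only for this illustration; it assumes a pseudo-count of 1 and a true twofold difference between two conditions.

```r
# Hypothetical helper: log2-fold change for a twofold difference
# in counts after adding a pseudo-count.
lfc <- function(count, fold = 2, pseudo = 1) {
  log2(fold * count + pseudo) - log2(count + pseudo)
}

lfc(100)  # large counts: close to the true log2-fold change of 1
lfc(1)    # small counts: shrunk towards zero by the pseudo-count
```

Scaling a deep batch down to the shallowest batch therefore moves its log-fold changes closer to those seen at low coverage, which is the behaviour described above.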

This function is preferred over running normalize directly when computing log-normalized values for use in mnnCorrect or fastMNN. In most cases, size factors will be computed within each batch; their direct application in normalize will not account for scaling differences between batches. In contrast, multiBatchNorm will rescale the size factors so that they are comparable across batches.

Only genes with library size-adjusted average counts greater than min.mean will be used for computing the rescaling factors. This improves precision and avoids problems with discreteness. Users can also set subset.row to restrict the set of genes used for computing the rescaling factors. However, this only affects the rescaling of the size factors - normalized values for all genes will still be returned.
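The gene selection can be sketched as follows. This is an illustration only, using simulated counts and base R; the object names (counts, ave, use, use_sub) are assumptions for the example, and the exact calculation inside multiBatchNorm may differ.

```r
set.seed(42)
# Simulated counts: 200 genes by 30 cells. The first 100 genes are
# lowly expressed and the last 100 are highly expressed.
mu <- rep(c(0.1, 20), each = 100)
counts <- matrix(rnbinom(200 * 30, mu = mu, size = 1), nrow = 200)

# Library size-adjusted average count per gene: divide each cell by
# its relative library size, then average across cells.
lib <- colSums(counts)
ave <- rowMeans(t(t(counts) / (lib / mean(lib))))

# Keep only genes with an average greater than min.mean.
min.mean <- 1
use <- ave > min.mean

# Optionally restrict further to a user-chosen set of rows,
# analogous to subset.row (the indices here are arbitrary).
subset.row <- 51:150
use_sub <- intersect(which(use), subset.row)
```

Only the rows in use_sub would contribute to the rescaling factor in this sketch; the normalized values themselves would still be computed for every gene.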

Value

A list of SingleCellExperiment objects with normalized log-expression values in the "logcounts" assay (depending on values in norm.args).

Handling spike-ins

Spike-in transcripts should be either absent in all batches or, if present, they should be the same across all batches. Rows annotated as spike-in transcripts are not used to compute the rescaling factors for endogenous genes.

By default, the spike-in size factors are rescaled using the same scaling factor as the endogenous genes in the same batch. This preserves the abundances of the spike-in transcripts relative to the endogenous genes, which is important if the returned SingleCellExperiments are to be used to model technical noise.

If separate.spikes=TRUE, spike-in size factors are rescaled separately from those of the endogenous genes. This will eliminate differences in spike-in quantities across batches at the cost of losing the ability to compare between endogenous and spike-in transcripts within each batch.
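The trade-off between the two modes can be illustrated with simple arithmetic. All values below are made up for the example, and the variable names are hypothetical rather than anything used by the package.

```r
# Made-up size factors for three cells in one batch.
sf_endog <- c(0.8, 1.0, 1.2)
sf_spike <- c(0.5, 1.0, 1.5)

f_endog <- 4  # hypothetical rescaling factor from endogenous genes
f_spike <- 2  # hypothetical rescaling factor from spike-ins

# Shared factor (the default): the spike-in to endogenous ratio of
# size factors within the batch is unchanged.
shared <- (sf_spike * f_endog) / (sf_endog * f_endog)
isTRUE(all.equal(shared, sf_spike / sf_endog))

# Separate factors (separate.spikes=TRUE): the ratio within the
# batch changes by f_spike / f_endog.
separate <- (sf_spike * f_spike) / (sf_endog * f_endog)
separate / (sf_spike / sf_endog)
```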

Author(s)

Aaron Lun

References

Lun ATL, Bach K and Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17:75.

See Also

mnnCorrect and fastMNN for methods that can benefit from rescaling.

normalize for the calculation of log-transformed normalized expression values.

Examples

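library(batchelor)

# Batch 1: 500 genes by 100 cells, low coverage.
d1 <- matrix(rnbinom(50000, mu=10, size=1), ncol=100)
sce1 <- SingleCellExperiment(list(counts=d1))
sizeFactors(sce1) <- runif(ncol(d1))

# Batch 2: 500 genes by 40 cells, higher coverage.
d2 <- matrix(rnbinom(20000, mu=50, size=1), ncol=40)
sce2 <- SingleCellExperiment(list(counts=d2))
sizeFactors(sce2) <- runif(ncol(d2))

# Rescale size factors across batches and compute log-expression values.
out <- multiBatchNorm(sce1, sce2)
summary(sizeFactors(out[[1]]))
summary(sizeFactors(out[[2]]))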


[Package batchelor version 1.0.1 Index]