fastMNN {scran}R Documentation

Fast mutual nearest neighbors correction

Description

Correct for batch effects in single-cell expression data using the mutual nearest neighbors (MNN) method.

Usage

fastMNN(..., k=20, cos.norm=TRUE, ndist=3, d=50, approximate=FALSE, 
    irlba.args=list(), subset.row=NULL, auto.order=FALSE, pc.input=FALSE,
    compute.variances=FALSE, assay.type="logcounts", get.spikes=FALSE, 
    BNPARAM=NULL, BPPARAM=SerialParam()) 

Arguments

...

Two or more log-expression matrices where genes correspond to rows and cells correspond to columns. One matrix should contain cells from the same batch; multiple matrices represent separate batches of cells. Each matrix should contain the same number of rows, corresponding to the same genes (in the same order).

Alternatively, two or more SingleCellExperiment objects can be supplied, where each object contains a log-expression matrix in the assay.type assay.

Alternatively, two or more matrices of low-dimensional representations can be supplied if pc.input=TRUE. Here, rows are cells and columns are dimensions (the latter should be common across all batches).

k

An integer scalar specifying the number of nearest neighbors to consider when identifying MNNs.

cos.norm

A logical scalar indicating whether cosine normalization should be performed on the input data prior to calculating distances between cells.

ndist

A numeric scalar specifying the threshold beyond which neighbours are to be ignored when computing correction vectors. Each threshold is defined in terms of the number of median distances.

d, approximate, irlba.args

Further arguments to pass to multiBatchPCA. Setting approximate=TRUE is recommended for large data sets with many cells.

subset.row

See ?"scran-gene-selection".

auto.order

Logical scalar indicating whether re-ordering of batches should be performed to maximize the number of MNN pairs at each step.

Alternatively, an integer vector containing a permutation of 1:N where N is the number of batches.

compute.variances

Logical scalar indicating whether the percentage of variance lost due to non-orthogonality should be computed.

pc.input

Logical scalar indicating whether the values in ... are already low-dimensional, e.g., the output of multiBatchPCA.

assay.type

A string or integer scalar specifying the assay containing the expression values, if SingleCellExperiment objects are present in ....

get.spikes

See ?"scran-gene-selection". Only relevant if ... contains SingleCellExperiment objects.

BNPARAM

A BiocNeighborParam object specifying the nearest neighbor algorithm. Defaults to an exact algorithm if NULL, see ?findKNN for more details.

BPPARAM

A BiocParallelParam object specifying whether the PCA and nearest-neighbor searches should be parallelized.

Details

This function provides a variant of the mnnCorrect function, modified for speed and more robust performance. In particular:

The default setting of cos.norm=TRUE provides some protection against differences in scaling for arbitrary expression matrices. However, if possible, we recommend using the output of multiBatchNorm as input to fastMNN. This will equalize coverage on the count level before the log-transformation, which is a more accurate rescaling than cosine normalization on the log-values.

If compute.variances=TRUE, the function will compute the percentage of variance that is parallel to the average correction vectors at each merge step. This represents the variance that is not orthogonal to the batch effect and subsequently lost upon correction. Large proportions suggest that there is biological structure that is parallel to the batch effect, corresponding to violations of the assumption that the batch effect is orthogonal to the biological subspace.

Value

A named list containing:

corrected:

A matrix with number of columns equal to d, and number of rows equal to the total number of cells in .... Cells are ordered in the same manner as supplied in ....

batch:

A Rle containing the batch of origin for each row (i.e., cell) in corrected.

pairs:

A list of DataFrames specifying which pairs of cells in corrected were identified as MNNs at each step.

If pc.input=FALSE, an additional rotation field will be present, containing the matrix of rotation vectors from multiBatchPCA.

If compute.variances=TRUE, an additional lost.var field will be present containing a numeric vector of percentages of variances lost for each batch.

Controlling the merge order

By default, batches are merged in the user-supplied order. However, if auto.order=TRUE, batches are ordered to maximize the number of MNN pairs at each step. The aim is to improve the stability of the correction by first merging more similar batches with more MNN pairs. This can be somewhat time-consuming as MNN pairs need to be iteratively recomputed for all possible batch pairings. It is often more convenient for the user to specify an appropriate ordering based on prior knowledge about the batches.

If auto.order is an integer vector, it is treated as an ordering permutation with which to merge batches. For example, if auto.order=c(4,1,3,2), batches 4 and 1 in ... are merged first, followed by batch 3 and then batch 2. This is often more convenient than changing the order manually in ..., which would alter the order of batches in the output corrected matrix. Indeed, no matter what the setting of auto.order is, the order of cells in the output corrected matrix is always the same.

Further control of the merge order can be achieved by performing the multi-sample PCA outside of this function with multiBatchPCA. Then, batches can be progressively merged by repeated calls to fastMNN with pc.input=TRUE. This is useful in situations where the order of batches to merge is not straightforward, e.g., involving hierarchies of batch similarities. We only recommend this mode for advanced users, and note that:

See the Examples below for how the pc.input argument should be used.

Choice of genes

Users should set subset.row to subset the inputs to highly variable genes or marker genes. This provides more meaningful identification of MNN pairs by reducing the noise from irrelevant genes. Note that users should not be too restrictive with subsetting, as high dimensionality is required to satisfy the orthogonality assumption in MNN detection.

For SingleCellExperiment inputs, spike-in transcripts should be the same across all objects. They are automatically removed unless get.spikes=TRUE, see ?"scran-gene-selection" for more details.

The reported coordinates for cells refer to a low-dimensional space, but it may be desirable to obtain corrected gene expression values, e.g., for visualization. This can be done by computing the cross-product of the coordinates with the rotation matrix - see the Examples below. Note that this will represent corrected values in the space defined by the inputs (e.g., log-transformed) and after cosine normalization if cos.norm=TRUE.

Author(s)

Aaron Lun

References

Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36(5):421

Lun ATL (2018). Further MNN algorithm development. https://github.com/MarioniLab/FurtherMNN2018

See Also

mnnCorrect, irlba, multiBatchNorm

Examples

B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 
B2 <- matrix(rnorm(10000), ncol=50) # Batch 2
out <- fastMNN(B1, B2) # corrected values
names(out)

# An equivalent approach with PC input.
cB1 <- cosineNorm(B1)
cB2 <- cosineNorm(B2)
pcs <- multiBatchPCA(cB1, cB2)
out.2 <- fastMNN(pcs[[1]], pcs[[2]], pc.input=TRUE)
all.equal(out[1:3], out.2) # should be TRUE (no rotation)

# Obtaining corrected expression values for genes 1 and 10.
cor.exp <- tcrossprod(out$rotation[c(1,10),], out$corrected)
dim(cor.exp)

[Package scran version 1.10.2 Index]