mnnCorrect {scran}R Documentation

Mutual nearest neighbors correction

Description

Correct for batch effects in single-cell expression data using the mutual nearest neighbors method.

Usage

mnnCorrect(..., k=20, sigma=0.1, cos.norm.in=TRUE, cos.norm.out=TRUE,
    svd.dim=0L, var.adj=TRUE, compute.angle=FALSE, subset.row=NULL, 
    order=NULL, pc.approx=FALSE, irlba.args=list(),
    BPPARAM=SerialParam())

Arguments

...

Two or more expression matrices where genes correspond to rows and cells correspond to columns. Each matrix should contain cells from the same batch; multiple matrices represent separate batches of cells. Each matrix should contain the same number of rows, corresponding to the same genes (in the same order).

k

An integer scalar specifying the number of nearest neighbors to consider when identifying mutual nearest neighbors.

sigma

A numeric scalar specifying the bandwidth of the Gaussian smoothing kernel used to compute the correction vector for each cell.

cos.norm.in

A logical scalar indicating whether cosine normalization should be performed on the input data prior to calculating distances between cells.

cos.norm.out

A logical scalar indicating whether cosine normalization should be performed prior to computing corrected expression values.

svd.dim

An integer scalar specifying the number of dimensions to use for summarizing biological substructure within each batch.

var.adj

A logical scalar indicating whether variance adjustment should be performed on the correction vectors.

compute.angle

A logical scalar specifying whether to calculate the angle between each cell's correction vector and the biological subspace of the reference batch.

subset.row

A vector specifying the genes with which distances between cells are calculated, e.g., for identifying mutual nearest neighbours. All genes are used by default.

order

An integer vector specifying the order in which batches are to be corrected.

pc.approx

A logical scalar indicating whether irlba should be used to identify the biological subspace.

irlba.args

A list of arguments to pass to irlba when pc.approx=TRUE.

BPPARAM

A BiocParallelParam object specifying whether the nearest-neighbor searches should be parallelized.

Details

This function is designed for batch correction of single-cell RNA-seq data where the batches are partially confounded with biological conditions of interest. It does so by identifying pairs of mutual nearest neighbors (MNN) in the high-dimensional expression space. Each MNN pair represents cells in different batches that are of the same cell type/state, assuming that batch effects are mostly orthogonal to the biological manifold. Correction vectors are calculated from the pairs of MNNs and corrected expression values are returned for use in clustering and dimensionality reduction.

The concept of a MNN pair can be explained by considering cells in each of two batches. For each cell in one batch, the set of k nearest cells in the other batch is identified, based on the Euclidean distance in expression space. Two cells in different batches are considered to be MNNs if each cell is in the other's set. The size of k can be interpreted as the minimum size of a subpopulation in each batch. The algorithm is generally robust to the choice of k, though values that are too small will not yield enough MNN pairs, while values that are too large will ignore substructure within each batch.

For each MNN pair, a pairwise correction vector is computed based on the difference in the expression profiles. The correction vector for each cell is computed by applying a Gaussian smoothing kernel with bandwidth sigma is the pairwise vectors. This stabilizes the vectors across many MNN pairs and extends the correction to those cells that do not have MNNs. The choice of sigma determines the extent of smoothing - a value of 0.1 is used by default, corresponding to 10% of the radius of the space after cosine normalization.

Value

A named list containing two components:

corrected:

A list of length equal to the number of batches, containing matrices of corrected expression values for each cell in each batch. The order of batches is the same as supplied in ..., and the order of cells in each matrix is also unchanged.

pairs:

A named list of length equal to the number of batches, containing DataFrames specifying the MNN pairs used for correction. Each row of the DataFrame defines a pair based on the cell in the current batch and another cell in an earlier batch. The identity of the other cell and batch are stored as run-length encodings to save space.

angles:

A named list of length equal to the number of batches, containing numeric vectors of angles. Each angle is computed between each cell's correction vector with the first two basis vectors of the first batch of cells (plus any previously corrected batches). This is only returned if compute.angle=TRUE.

Choosing the gene set

Distances between cells are calculated with all genes if subset.row=NULL. However, users can set subset.row to perform the distance calculation on a subset of genes, e.g., highly variable genes or marker genes. This may provide more meaningful identification of MNN pairs by reducing the noise from irrelevant genes.

Regardless of whether subset.row is specified, corrected values are returned for all genes. This is possible as subset.row is only used to identify the MNN pairs and other cell-based distance calculations. Correction vectors between MNN pairs can then be computed in the original space involving all genes in the supplied matrices.

Expected type of input data

The input expression values should generally be log-transformed, e.g., log-counts, see normalize for details. They should also be normalized within each data set to remove cell-specific biases in capture efficiency and sequencing depth. By default, a further cosine normalization step is performed on the supplied expression data to eliminate gross scaling differences between data sets.

Users should note that the order in which batches are corrected will affect the final results. The first batch in order is used as the reference batch against which the second batch is corrected. Corrected values of the second batch are added to the reference batch, against which the third batch is corrected, and so on. This strategy maximizes the chance of detecting sufficient MNN pairs for stable calculation of correction vectors in subsequent batches.

Further options

The function depends on a shared biological manifold, i.e., one or more cell types/states being present in multiple batches. If this is not true, MNNs may be incorrectly identified, resulting in over-correction and removal of interesting biology. Some protection can be provided by removing components of the correction vectors that are parallel to the biological subspaces in each batch. The biological subspace in each batch is identified with a SVD on the expression matrix, using either svd or irlba. The number of dimensions of this subspace can be controlled with svd.dim. (By default, this option is turned off by setting svd.dim=0.)

If var.adj=TRUE, the function will adjust the correction vector to equalize the variances of the two data sets along the batch effect vector. In particular, it avoids “kissing” effects whereby MNN pairs are identified between the surfaces of point clouds from different batches. Naive correction would then bring only the surfaces into contact, rather than fully merging the clouds together. The adjustment ensures that the cells from the two batches are properly intermingled after correction. This is done by identifying each cell's position on the correction vector, identifying corresponding quantiles between batches, and scaling the correction vector to ensure that the quantiles are matched after correction.

Author(s)

Laleh Haghverdi, with modifications by Aaron Lun

See Also

get.knnx irlba

Examples

B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 
B2 <- matrix(rnorm(10000), ncol=50) # Batch 2
out <- mnnCorrect(B1, B2) # corrected values

[Package scran version 1.8.4 Index]