doubletCells {scran}R Documentation

Detect doublet cells

Description

Identify potential doublet cells based on simulations of putative doublet expression profiles.

Usage

## S4 method for signature 'ANY'
doubletCells(x, size.factors.norm=NULL, size.factors.content=NULL,
    k=50, subset.row=NULL, niters=max(10000, ncol(x)), block=10000, 
    d=50, approximate=FALSE, irlba.args=list(), force.match=FALSE, 
    force.k=20, force.ndist=3, BNPARAM=NULL, BPPARAM=SerialParam())

## S4 method for signature 'SingleCellExperiment'
doubletCells(x, size.factors.norm=NA, ..., subset.row=NULL, 
    assay.type="counts", get.spikes=FALSE)

Arguments

x

A numeric matrix-like object of count values, where each column corresponds to a cell and each row corresponds to an endogenous gene. Alternatively, a SingleCellExperiment object containing such a matrix.

size.factors.norm

A numeric vector of size factors for normalization of x prior to PCA and distance calculations. If NULL, defaults to the column sums of x.

For the SingleCellExperiment method, this may be NA, in which case sizeFactors(x) is used instead.

size.factors.content

A numeric vector of size factors for RNA content normalization of x prior to simulating doublets.

k

An integer scalar specifying the number of nearest neighbours to use to determine the bandwidth for density calculations.

subset.row

See ?"scran-gene-selection".

niters

An integer scalar specifying how many simulated doublets should be generated.

block

An integer scalar controlling the rate of doublet generation, to keep memory usage low.

d

An integer scalar specifying the number of components to retain after the PCA.

approximate

A logical scalar indicating whether irlba should be used to perform the initial PCA.

irlba.args

A list of arguments to pass to irlba when approximate=TRUE.

force.match

A logical scalar indicating whether remapping of simulated doublets to original cells should be performed.

force.k

An integer scalar specifying the number of neighbours to use for remapping if force.match=TRUE.

force.ndist

A numeric scalar specifying the bandwidth for remapping if force.match=TRUE.

BNPARAM

A BiocNeighborParam object specifying the nearest neighbor algorithm. Defaults to an exact algorithm if NULL, see ?findKNN for more details.

BPPARAM

A BiocParallelParam object specifying whether the neighbour searches should be parallelized.

...

Additional arguments to pass to the ANY method.

assay.type

A string specifying which assay values to use, e.g., "counts" or "logcounts".

get.spikes

See ?"scran-gene-selection".

Details

This function simulates doublets by adding the count vectors for two randomly chosen cells in x. For each cell, we compute the density of simulated doublets and compare it to the density of original cells. Genuine doublets should have a high density of simulated doublets relative to the density of its neighbourhood. Thus, the doublet score for each cell is defined as the ratio of densities of simulated doublets to the (squared) density of the original cells.

Densities are calculated in low-dimensional space after a PCA on the log-normalized expression matrix of x. Simulated doublets are projected into the low-dimensional space using the rotation vectors computed from the original cells. A tricube kernel is used to compute the density around each cell. The bandwidth of the kernel is set to the median distance to the k nearest neighbour across all cells.

The two size factor arguments have different roles:

It is possible to set both of these arguments, as they will not interfere with each other.

If force.match=TRUE, simulated doublets will be remapped to the nearest neighbours in the original data. This is done by taking the (tricube-weighted) average of the PC scores for the force.k nearest neighbors. The tricube bandwidth for remapping is chosen by taking the median distance and multiplying it by force.ndist, to protect against later neighbours that might be outliers. The aim is to adjust for unknown differences in RNA content that would cause the simulated doublets to be systematically displaced from their true locations. However, it may also result in spuriously high scores for single cells that happen to be close to a cluster of simulated doublets.

Value

A numeric vector of doublet scores for each cell in x.

Author(s)

Aaron Lun

Examples

# Mocking up an example.
ngenes <- 100
mu1 <- 2^rexp(ngenes)
mu2 <- 2^rnorm(ngenes)

counts.1 <- matrix(rpois(ngenes*100, mu1), nrow=ngenes)
counts.2 <- matrix(rpois(ngenes*100, mu2), nrow=ngenes)
counts.m <- matrix(rpois(ngenes*20, mu1+mu2), nrow=ngenes)

counts <- cbind(counts.1, counts.2, counts.m)
clusters <- rep(1:3, c(ncol(counts.1), ncol(counts.2), ncol(counts.m)))

# Find potential doublets...
scores <- doubletCells(counts)
boxplot(split(scores, clusters))

[Package scran version 1.10.2 Index]