scDblFinder {scDblFinder}    R Documentation

scDblFinder

Description

Identification of heterotypic (or neotypic) doublets in single-cell RNAseq using cluster-based generation of artificial doublets.

Usage

scDblFinder(
  sce,
  clusters = NULL,
  samples = NULL,
  trajectoryMode = FALSE,
  artificialDoublets = NULL,
  knownDoublets = NULL,
  dbr = NULL,
  clustCor = NULL,
  dbr.sd = NULL,
  nfeatures = 1000,
  dims = 20,
  k = NULL,
  removeUnidentifiable = TRUE,
  includePCs = 1:5,
  propRandom = 0,
  propMarkers = 0,
  aggregateFeatures = FALSE,
  returnType = c("sce", "table", "full"),
  score = c("xgb", "weighted", "ratio"),
  processing = "default",
  metric = "logloss",
  nrounds = 0.25,
  max_depth = 5,
  iter = 1,
  multiSampleMode = c("split", "singleModel", "singleModelSplitThres"),
  threshold = TRUE,
  verbose = is.null(samples),
  BPPARAM = SerialParam(),
  ...
)

Arguments

sce

A SummarizedExperiment-class, SingleCellExperiment-class, or array of counts.

clusters

The optional cluster assignments (if omitted, clustering will be run). These are used to generate artificial doublets more efficiently. clusters should either be a vector of labels for each cell, or the name of a colData column of sce. Alternatively, if it is a single integer, it determines how many clusters to create (using k-means clustering). This option should be used when distinct subpopulations are not expected in the data (e.g. trajectories).

samples

A vector of the same length as cells (or the name of a column of colData(sce)), indicating to which sample each cell belongs. Here, a sample is understood as being processed independently. If omitted, doublets will be searched for with all cells together. If given, doublets will be searched for independently for each sample, which is preferable if they represent different captures. If your samples were multiplexed using cell hashes, what you want to give here are the different batches/wells (i.e. independent captures, since doublets cannot arise across them) rather than biological samples.

trajectoryMode

Logical; whether to generate fewer doublets from cells that are closer to each other, for datasets with gradients rather than separated subpopulations. This disrupts the proportionality and is no longer the recommended way of handling such datasets. See vignette("scDblFinder") for more details.

artificialDoublets

The approximate number of artificial doublets to create. If NULL, will be the maximum of the number of cells or 5*nbClusters^2.

knownDoublets

An optional logical vector of known doublets (e.g. through cell barcodes), or the name of a colData column of 'sce' containing that information. Including known doublets tends to increase the sensitivity of doublet identification, but decrease the specificity (since some of the known doublets are homotypic).

dbr

The expected doublet rate. By default this is assumed to be 1% per thousand cells captured (so 4% among 4000 cells), which is appropriate for 10x datasets. Corrections for homotypic doublets will be performed on the given rate.

clustCor

Include Spearman correlations to cell type averages in the predictors. If 'clustCor' is a matrix of cell type marker expressions (with features as rows and cell types as columns), the subset of these which are present in the selected features will be correlated to each cell to produce additional predictors (i.e. one per cell type). Alternatively, if 'clustCor' is a positive integer, this number of inter-cluster markers will be selected and used for correlation (set 'clustCor=Inf' to use all available genes).

dbr.sd

The uncertainty range in the doublet rate, interpreted as a +/- around 'dbr'. During thresholding, deviation from the expected doublet rate will be calculated from these boundaries, and will be considered null within these boundaries. If NULL, will be 40% of 'dbr'. Set to 'dbr.sd=0' to disable.

nfeatures

The number of top features to use (default 1000).

dims

The number of dimensions used.

k

Number of nearest neighbors (for KNN graph). If more than one value is given, the doublet density will be calculated at each k (and other values at the highest k), and all the information will be used by the classifier. If omitted, a reasonable set of values is used.

removeUnidentifiable

Logical; whether to remove artificial doublets of a combination that is generally found to be unidentifiable.

includePCs

The index of principal components to include in the predictors (e.g. 'includePCs=1:2').

propRandom

The proportion of the artificial doublets which should be made of random cells (as opposed to inter-cluster combinations).

propMarkers

The proportion of features to select based on marker identification.

aggregateFeatures

Whether to perform feature aggregation (recommended for ATAC). Can also be a positive integer, in which case this will indicate the number of components to use for feature aggregation (if TRUE, 'dims' will be used).

returnType

Either "sce" (default), "table" (to return the table of cell attributes including artificial doublets), or "full" (returns an SCE object containing both the real and artificial cells).

score

Score to use for final classification.

processing

Processing of the counts (real and artificial) before KNN. Either "default" (normal scater-based normalization and PCA), "rawPCA" (PCA without normalization), "rawFeatures" (no normalization/dimensional reduction) or a custom function with (at least) arguments 'e' (the matrix of counts) and 'dims' (the desired number of dimensions), returning a named matrix with cells as rows and components as columns.

metric

Error metric to optimize during training (e.g. 'merror', 'logloss', 'auc', 'aucpr').

nrounds

Maximum rounds of boosting. If NULL, will be determined through cross-validation.

max_depth

Maximum depth of each tree.

iter

A positive integer indicating the number of scoring iterations (ignored if 'score' isn't based on classifiers). At each iteration, real cells that would be called as doublets are excluded from the training, and new scores are calculated. Recommended values are 1 or 2.

multiSampleMode

Either "split" (recommended if there is a lot of heterogeneity across samples), "singleModel" (recommended _only_ if the samples are very similar), or "singleModelSplitThres" (use a single classifier, but sample-specific thresholds).

threshold

Logical; whether to threshold scores into binary doublet calls.

verbose

Logical; whether to print messages and the thresholding plot.

BPPARAM

Used for multithreading when splitting by samples (i.e. when 'samples!=NULL'); otherwise passed to eventual PCA and K/SNN calculations.

...

Further arguments passed to getArtificialDoublets.
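The defaults described for 'dbr', 'dbr.sd' and 'artificialDoublets' can be worked out by hand. The following is a minimal sketch of that arithmetic only: 'default_dbr' is a hypothetical helper (not a function of the package), the cell and cluster numbers are illustrative, and the package additionally corrects the rate for homotypic doublets, which is not reproduced here.

```r
# Hypothetical helper (not part of scDblFinder): default expected doublet
# rate, i.e. 1% per thousand cells captured.
default_dbr <- function(n_cells) 0.01 * n_cells / 1000

n_cells <- 4000                     # illustrative number of captured cells
dbr <- default_dbr(n_cells)         # 0.04, i.e. 4% among 4000 cells

# Default uncertainty: 40% of dbr, interpreted as +/- around dbr
dbr_sd <- 0.4 * dbr
dbr_range <- c(lower = dbr - dbr_sd, upper = dbr + dbr_sd)

# Default approximate number of artificial doublets:
# the maximum of the number of cells and 5 * nbClusters^2
n_clusters <- 12                    # illustrative number of clusters
n_artificial <- max(n_cells, 5 * n_clusters^2)   # max(4000, 720)
```

With these illustrative numbers, deviations from the expected rate would be considered null between roughly 2.4% and 5.6%.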

Details

This function generates artificial doublets from clusters of real cells, evaluates their prevalence in the neighborhood of each cell, and uses this along with additional features to classify doublets. The approach is complementary to doublets identified via cell hashes and SNPs in multiplexed samples: the latter can identify doublets formed by cells of the same type from two samples, which are nearly indistinguishable from real cells transcriptionally, but cannot identify doublets made by cells of the same sample. See vignette("scDblFinder") for more details on the method.

When multiple samples/captures are present, they should be specified using the samples argument. In this case, we recommend the use of BPPARAM to perform several of the steps in parallel. Artificial doublets and kNN networks will be computed separately for each sample; the subsequent behavior then depends on the 'multiSampleMode' argument. If 'split', the whole process is split by sample (this is recommended when there is heterogeneity between samples, for instance in the number of cells); if 'singleModel', the classifier and thresholding will be trained globally (this is not recommended unless the samples are extremely comparable); if 'singleModelSplitThres', the classifier will be trained globally, but the thresholding will be performed separately for each sample.
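A multi-sample call might look like the sketch below. The 'capture' column is a hypothetical label assigned at random to the mock data purely for illustration; in real data it would hold the actual capture/batch of each cell. SerialParam() is used so that the example runs anywhere; a parallel BiocParallel backend could be substituted where available.

```r
library(SingleCellExperiment)
library(scDblFinder)
library(BiocParallel)

set.seed(123)
sce <- mockDoubletSCE()

# Hypothetical capture labels, assigned at random for illustration only
sce$capture <- sample(c("A", "B"), ncol(sce), replace = TRUE)

# Process each capture separately ("split" is the first, default option)
sce <- scDblFinder(sce, samples = "capture", multiSampleMode = "split",
                   dbr = 0.1, BPPARAM = SerialParam())

# Doublet calls per capture
table(capture = sce$capture, call = sce$scDblFinder.class)
```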

When inter-sample doublets are available, they can be provided to 'scDblFinder' through the knownDoublets argument to improve the identification of further doublets. However, because such 'true' doublets can include a lot of homotypic doublets, in practice this often leads to a slight decrease in the accuracy of detecting neotypic doublets.

Finally, for some types of data, such as single-cell ATAC-seq, selecting a number of top features is ineffective due to the high sparsity of the signal. In such contexts, rather than _selecting_ features we recommend the alternative approach of _aggregating_ similar features (with 'aggregateFeatures=TRUE'), which strongly improves accuracy.

Value

The sce object with several additional colData columns, in particular 'scDblFinder.score' (the final score used) and 'scDblFinder.class' (whether the cell is called as 'doublet' or 'singlet'). See vignette("scDblFinder") for more details; for alternative return values, see the 'returnType' argument.
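As a brief sketch of how the returned columns might be used, the code below inspects the score and class columns and keeps only the cells called as singlets. The object name 'sce_singlets' is illustrative, and the mock data with 'dbr=0.1' simply follow the package's own example.

```r
library(SingleCellExperiment)
library(scDblFinder)

sce <- mockDoubletSCE()
sce <- scDblFinder(sce, dbr = 0.1)

# Inspect the two main output columns
head(colData(sce)[, c("scDblFinder.score", "scDblFinder.class")])

# Keep only the cells called as singlets for downstream analysis
sce_singlets <- sce[, sce$scDblFinder.class == "singlet"]
ncol(sce_singlets)
```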

Examples

library(SingleCellExperiment)
sce <- mockDoubletSCE()
sce <- scDblFinder(sce, dbr=0.1)
table(truth=sce$type, call=sce$scDblFinder.class)
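The 'processing' argument also accepts a custom function. The sketch below shows one possible shape for such a function, following the contract described for that argument (arguments 'e' and 'dims', returning a named matrix with cells as rows and components as columns). 'logPCA' is a hypothetical name, and the normalization used here (library-size scaling, log-transformation, then stats::prcomp) is an assumption for illustration, not the package's default processing. The function is checked on simulated counts so that the block runs without any data.

```r
# Hypothetical custom processing function (not part of scDblFinder):
# library-size normalization, log-transformation, then PCA.
logPCA <- function(e, dims = 20, ...) {
  e <- as.matrix(e)                        # densify in case a sparse matrix is given
  libsize <- colSums(e)                    # total counts per cell
  norm <- t(t(e) / libsize) * median(libsize)
  pca <- prcomp(t(log1p(norm)), rank. = dims)
  pcs <- pca$x                             # cells as rows, components as columns
  rownames(pcs) <- colnames(e)             # make sure rows carry the cell names
  pcs
}

# Stand-alone check on simulated counts (100 features x 50 cells)
set.seed(1)
counts <- matrix(rpois(100 * 50, lambda = 2), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("cell", 1:50)))
pcs <- logPCA(counts, dims = 5)
dim(pcs)   # 50 cells x 5 components

# It would then be supplied as: scDblFinder(sce, processing = logPCA)
```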


[Package scDblFinder version 1.6.0 Index]