swappedDrops {DropletUtils}    R Documentation

Clean barcode-swapped droplet data

Description

Remove the effects of barcode swapping on droplet-based single-cell RNA-seq data, specifically 10X Genomics datasets.

Usage

swappedDrops(samples, barcode.length=NULL, ...)

removeSwappedDrops(cells, umis, genes, nreads, ref.genes, min.frac=0.8,
    get.swapped=FALSE, get.diagnostics=FALSE, hdf5.out=TRUE)

Arguments

samples

A character vector containing paths to the molecule information HDF5 files, produced by CellRanger for 10X Genomics data. Each file corresponds to one sample in a multiplexed pool.

barcode.length

An integer scalar specifying the length of the cell barcode; see read10xMolInfo.

...

Further arguments to be passed to removeSwappedDrops.

cells

A list of character vectors containing cell barcodes. Each vector corresponds to one sample in a multiplexed pool, and each entry of the vector corresponds to one molecule.

umis

A list of integer vectors containing encoded UMI sequences, organized as described for cells. See ?encodeSequences to convert sequences to integers.

genes

A list of integer vectors specifying the gene indices, organized as described for cells. Each index should refer to an element of ref.genes.

nreads

A list of integer vectors containing the number of reads per molecule, organized as described for cells.

ref.genes

A character vector containing the names or symbols of all genes.

min.frac

A numeric scalar specifying the minimum fraction of reads required for a swapped molecule to be assigned to a sample.

get.swapped

A logical scalar indicating whether the UMI counts corresponding to swapped molecules should also be returned.

get.diagnostics

A logical scalar indicating whether to return the number of reads for each swapped molecule in each sample.

hdf5.out

A logical scalar indicating whether the diagnostics matrix should be returned as an HDF5Matrix.

Details

Barcode swapping on the Illumina sequencer occurs when multiplexed samples undergo PCR re-amplification on the flow cell by excess primer with different barcodes. This results in sequencing of the wrong barcode and molecules being assigned to incorrect samples after debarcoding. With droplet data, there is the opportunity to remove such effects based on the combination of gene, UMI and cell barcode for each observed transcript molecule. It is very unlikely that the same combination will arise from different molecules in multiple samples. Thus, observation of the same combination across multiple samples is indicative of barcode swapping.

We can remove swapped molecules based on the number of reads assigned to each gene-UMI-barcode combination. From the total number of reads assigned to that combination, the fraction of reads in each sample is calculated. The sample with the largest fraction that is greater than min.frac is defined as the putative sample of origin to which the molecule is assigned. This assumes that the swapping rate is low, so the sample of origin for a molecule should contain the majority of the reads. In all other samples, reads for the combination are assumed to derive from swapping and do not contribute to the count matrix. Setting min.frac=1 will effectively remove all molecules that appear in multiple samples. We do not recommend setting min.frac lower than 0.5.
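The assignment rule described above can be illustrated in base R. This is a minimal sketch on made-up data, not the package's implementation: the table mols, its column names and the retained flag are hypothetical, and the inclusive comparison (>=) is an assumption chosen so that a molecule seen in only one sample (fraction 1) is still kept when min.frac=1.

```r
# Hypothetical toy data: one row per molecule observed in a sample.
mols <- data.frame(
    sample = c(1L, 2L, 1L, 2L, 3L, 1L),
    cell   = c("AAAC", "AAAC", "CCGT", "CCGT", "CCGT", "GGTA"),
    gene   = c(1L, 1L, 2L, 2L, 2L, 3L),
    umi    = c(10L, 10L, 20L, 20L, 20L, 30L),
    nreads = c(90L, 10L, 5L, 4L, 3L, 7L),
    stringsAsFactors = FALSE
)
min.frac <- 0.8

# Label each gene-UMI-barcode combination.
combo <- paste(mols$cell, mols$gene, mols$umi, sep = "_")

# Total reads per combination across all samples, repeated for each row.
total <- ave(mols$nreads, combo, FUN = sum)

# Fraction of the combination's reads that fall in each sample.
frac <- mols$nreads / total

# Keep a molecule in a sample only if that sample holds at least
# min.frac of the reads (inclusive boundary is an assumption here).
mols$retained <- frac >= min.frac

mols[, c("sample", "cell", "gene", "umi", "nreads", "retained")]
```

In this toy table the first combination is kept in sample 1 (90 of 100 reads), the second is discarded from every sample because no sample reaches min.frac, and the third is kept because it appears in one sample only. With min.frac above 0.5, at most one sample can pass the threshold for any combination.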

If get.diagnostics=TRUE, a diagnostics matrix is returned containing the number of reads per gene-UMI-barcode combination in each sample. Each row corresponds to a combination and each column corresponds to a sample. This can be useful for examining the level of swapping across samples on a molecule-by-molecule basis, though for the sake of memory, the actual identity of the molecules is not returned. By default, the matrix is returned as an HDF5Matrix, which reduces memory usage and avoids potential issues with integer overflow. If hdf5.out=FALSE, a sparse matrix is returned instead, which is faster but uses more memory.
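As a rough illustration of how such a matrix might be summarised, the sketch below uses an ordinary base R matrix as a stand-in for the returned object. The values and the names diag.mat, multi and top.frac are hypothetical; a real HDF5Matrix or sparse matrix may need the corresponding row-wise methods from its own package.

```r
# Hypothetical stand-in for a diagnostics matrix:
# rows are gene-UMI-barcode combinations, columns are samples,
# entries are read counts.
diag.mat <- matrix(
    c(90L, 10L, 0L,
       5L,  4L, 3L,
       7L,  0L, 0L),
    ncol = 3, byrow = TRUE
)

# Combinations observed in more than one sample are candidates for swapping.
multi <- rowSums(diag.mat > 0) > 1

# Fraction of each combination's reads held by its best-supported sample.
top.frac <- apply(diag.mat, 1, max) / rowSums(diag.mat)

# Proportion of combinations seen in multiple samples.
mean(multi)
```

A histogram of top.frac for the multi-sample rows is one way to judge whether a given min.frac cleanly separates confident assignments from ambiguous ones.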

swappedDrops is a wrapper around removeSwappedDrops that extracts the relevant data from the 10X Genomics molecule information file. For other types of droplet-based data, it may be more convenient to call removeSwappedDrops directly.
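For data that do not come from a 10X molecule information file, the inputs to removeSwappedDrops have to be assembled by hand. The sketch below shows one way to arrange them in base R, assuming a hypothetical per-molecule table (mols) with made-up values, UMIs that are already integer-encoded, and 1-based gene indices into ref.genes. It only builds the inputs and does not call the package.

```r
# Hypothetical molecule table covering two samples (all values are made up).
mols <- data.frame(
    sample = c("A", "A", "B", "B", "B"),
    cell   = c("AAAC", "CCGT", "AAAC", "GGTA", "GGTA"),
    umi    = c(10L, 20L, 10L, 30L, 31L),   # assumed to be encoded already
    gene   = c("Gene1", "Gene2", "Gene1", "Gene3", "Gene1"),
    nreads = c(90L, 12L, 10L, 7L, 4L),
    stringsAsFactors = FALSE
)

# Reference gene names; gene indices below point into this vector.
ref.genes <- sort(unique(mols$gene))

# One list element per sample, one vector entry per molecule.
by.sample <- split(mols, mols$sample)
cells  <- lapply(by.sample, function(x) x$cell)
umis   <- lapply(by.sample, function(x) x$umi)
genes  <- lapply(by.sample, function(x) match(x$gene, ref.genes))
nreads <- lapply(by.sample, function(x) x$nreads)

# The four lists are parallel: same samples, same molecule order.
lengths(cells)
```

These objects correspond to the cells, umis, genes, nreads and ref.genes arguments listed under Usage.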

Value

A list is returned containing a cleaned entry, which is itself a list of sparse matrices. Each matrix corresponds to a sample and contains the UMI count for each gene (row) and cell barcode (column) after removing swapped molecules. All cell barcodes that were originally observed are reported as columns, though note that it is possible for some barcodes to contain no counts.

If get.swapped=TRUE, a swapped entry is returned in the top-level list. This is a list containing sample-specific sparse matrices of UMI counts corresponding to the swapped molecules. Adding the cleaned and swapped matrices for each sample should yield the total UMI count prior to removal of swapped molecules.

If get.diagnostics=TRUE, the top-level list will also contain an additional diagnostics matrix.

Author(s)

Jonathan Griffiths, with modifications by Aaron Lun

References

Griffiths JA, Lun ATL, Richard AC, Bach K, Marioni JC (2018). Detection and removal of barcode swapping in single-cell RNA-seq data. Nat. Commun. 9, 1:2667.

See Also

read10xMolInfo, encodeSequences

Examples

# Mocking up some 10x HDF5-formatted data, with swapping.
curfiles <- DropletUtils:::sim10xMolInfo(tempfile(), nsamples=3)

# Obtaining count matrices with swapping removed.
out <- swappedDrops(curfiles)
lapply(out$cleaned, dim)

out <- swappedDrops(curfiles, get.swapped=TRUE, get.diagnostics=TRUE)
names(out)

[Package DropletUtils version 1.2.2 Index]