cellCounts {Rsubread}        R Documentation

Map and quantify single cell RNA-seq data generated by 10X Genomics

Description

Process raw 10X scRNA-seq data and generate UMI counts for each gene in each cell.

Usage

cellCounts(

    # input data
    index,
    sample,
    input.mode = "BCL",
    cell.barcode = NULL,
  
    # specify the aligner used for read mapping
    aligner = "align",
  
    # parameters used by featureCounts for assigning and counting UMIs
    annot.inbuilt = "mm10",
    annot.ext = NULL,
    isGTFAnnotationFile = FALSE,
    GTF.featureType = "exon",
    GTF.attrType = "gene_id",
    useMetaFeatures = TRUE,
    
    # user provided UMI cutoff for cell calling
    umi.cutoff = NULL,

    # number of threads
    nthreads = 10,

    # dealing with multi-mapping reads in the alignment step
    nBestLocations = 1,
    unique.mapping = FALSE,

    # other parameters passed to the align and featureCounts functions
    ...)

Arguments

index

A character string providing the base name of index files created for a reference genome by the buildindex function.

sample

A data frame or a character string providing sample-related information, including location of the data, sample names and index set names. See the Details section below for more details.

input.mode

A character string specifying the input mode. The supported input modes are BCL, FASTQ and FASTQ-dir. BCL is the BCL format of raw reads generated by sequencers such as Illumina sequencers. FASTQ is the FASTQ format of sequencing reads. FASTQ-dir is a directory where FASTQ-format reads are saved. FASTQ-dir is useful for providing cellCounts with FASTQ data generated by the bcl2fastq program or the bamtofastq program (developed by 10X). BCL by default.

cell.barcode

A character string giving the name of a text file (can be gzipped) that contains the set of cell barcodes used in sample preparation. If NULL, cellCounts determines a cell barcode set for the input data by matching the cell barcode sequences of the first 100,000 reads in the data against the three cell barcode sets used by 10X Genomics. NULL by default.

aligner

Specify the name of the aligner used for read mapping. Currently only the align function (the Subread aligner) in this package is supported. align by default.

annot.inbuilt

Specify an inbuilt annotation for UMI counting. See featureCounts for more details. mm10 by default.

annot.ext

Specify an external annotation for UMI counting. See featureCounts for more details. NULL by default.

isGTFAnnotationFile

See featureCounts for more details. FALSE by default.

GTF.featureType

See featureCounts for more details. exon by default.

GTF.attrType

See featureCounts for more details. gene_id by default.

useMetaFeatures

Specify if UMI counting should be carried out at the meta-feature level (e.g. gene level). See featureCounts for more details. TRUE by default.

umi.cutoff

Specify a UMI count cutoff for cell calling. All the cells with a total UMI count greater than this cutoff will be called. If NULL, a bootstrapping procedure will be performed to determine this cutoff. NULL by default.

nthreads

A numeric value giving the number of threads used for read mapping and counting. 10 by default.

nBestLocations

A numeric value giving the maximum number of reported alignments for each multi-mapping read. 1 by default.

unique.mapping

A logical value specifying if the multi-mapping reads should not be reported as mapped (i.e. reporting uniquely mapped reads only). FALSE by default.

...

Other parameters passed to the align and featureCounts functions.

Details

This function takes as input scRNA-seq reads generated by the 10X platform, maps them to the reference genome and then produces UMI (Unique Molecular Identifier) counts for each gene in each cell. The align read mapping function and the featureCounts quantification function, both included in this package, are utilised by this function. Sample demultiplexing, cell barcode demultiplexing and read deduplication are carried out before generating the UMI counts. cellCounts can process multiple datasets at the same time.

The sample information should be provided to cellCounts via the sample parameter. If the input format is BCL (i.e. input.mode="BCL"), the provided sample information should include the location where the read data are stored, the flowcell lanes used for sequencing, the sample names and the names of the index sets used for indexing samples. This information should be saved to a data.frame object and then provided to the sample parameter. An example of this data frame is shown below:

InputDirectory		Lane		SampleName	IndexSetName
/path/to/dataset1	1		Sample1		SI-GA-E1
/path/to/dataset1	1		Sample2		SI-GA-E2
/path/to/dataset1	2		Sample1		SI-GA-E1
/path/to/dataset1	2		Sample2		SI-GA-E2
/path/to/dataset2	1		Sample3		SI-GA-E3
/path/to/dataset2	1		Sample4		SI-GA-E4
/path/to/dataset2	2		Sample3		SI-GA-E3
/path/to/dataset2	2		Sample4		SI-GA-E4
...

It is compulsory to have the four column headers shown in the example above when generating this data frame for a 10X dataset. If more than one dataset is provided for analysis, the InputDirectory column should include more than one distinct directory. Note that this data frame is different from the Sample Sheet generated by the Illumina sequencer. The cellCounts function uses the index set names included in this data frame to generate an Illumina Sample Sheet and then uses it to demultiplex all the samples.
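
As a minimal sketch, the example data frame above could be built in R as follows. The directory paths and index set names are the placeholders from the table, the index base name "mm10_index" is hypothetical, and storing Lane as a number is an assumption. The cellCounts call is guarded so that it does not run without real data and an index built by buildindex.

```r
# Sample information for BCL input: one row per directory, lane and sample.
# All values are the placeholders used in the example table above.
sample.sheet <- data.frame(
  InputDirectory = rep(c("/path/to/dataset1", "/path/to/dataset2"), each = 4),
  Lane           = rep(c(1, 1, 2, 2), times = 2),  # numeric lanes are an assumption
  SampleName     = c("Sample1", "Sample2", "Sample1", "Sample2",
                     "Sample3", "Sample4", "Sample3", "Sample4"),
  IndexSetName   = c("SI-GA-E1", "SI-GA-E2", "SI-GA-E1", "SI-GA-E2",
                     "SI-GA-E3", "SI-GA-E4", "SI-GA-E3", "SI-GA-E4"),
  stringsAsFactors = FALSE
)

# Check that the four compulsory column headers are present.
required <- c("InputDirectory", "Lane", "SampleName", "IndexSetName")
stopifnot(all(required %in% colnames(sample.sheet)))

# Not run: needs real BCL data and an index created by buildindex.
# "mm10_index" is a hypothetical index base name.
if (FALSE) {
  res <- Rsubread::cellCounts(index = "mm10_index",
                              sample = sample.sheet,
                              input.mode = "BCL")
}
```

Two distinct values in the InputDirectory column correspond to two datasets being analysed in one call.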

If the input format is FASTQ, a data.frame object containing the following three columns, BarcodeUMIFile, ReadFile and SampleName, should be provided to the sample parameter. Each row in the data frame represents a sample. The ReadFile column includes names of FASTQ files that contain read data for the samples. Each FASTQ file corresponds to a sample. The read data included in these FASTQ files only contain genomic sequences of the reads. The cell barcode and UMI sequences of these reads can be found in the corresponding FASTQ files included in the BarcodeUMIFile column.
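
A sketch of the sample data frame for FASTQ input might look like the following. All file names, sample names and the index base name "mm10_index" are hypothetical and only illustrate the three-column layout described above.

```r
# Sample information for FASTQ input: one row per sample.
# File names are hypothetical placeholders.
fastq.samples <- data.frame(
  BarcodeUMIFile = c("sample1_R1.fastq.gz", "sample2_R1.fastq.gz"),  # cell barcode + UMI
  ReadFile       = c("sample1_R2.fastq.gz", "sample2_R2.fastq.gz"),  # genomic sequences
  SampleName     = c("Sample1", "Sample2"),
  stringsAsFactors = FALSE
)

# Each FASTQ file corresponds to one sample, so no read file should repeat.
stopifnot(!anyDuplicated(fastq.samples$ReadFile))

# Not run: needs real FASTQ files and an index created by buildindex.
if (FALSE) {
  res <- Rsubread::cellCounts(index = "mm10_index",
                              sample = fastq.samples,
                              input.mode = "FASTQ")
}
```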

Finally, if the input format is FASTQ-dir, a character string, which includes the path to the directory where the FASTQ-format read data are stored, should be provided to the sample parameter. The data in this directory are expected to be generated by the bcl2fastq program or the bamtofastq program (a program developed by 10X).
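
For FASTQ-dir input the sample parameter is just a directory path. The sketch below collects the arguments in a list first; the path, the index base name and the thread count are hypothetical values chosen for illustration.

```r
# Directory holding FASTQ files written by bcl2fastq or bamtofastq (hypothetical path).
fastq.dir <- "/path/to/fastq_output"

# Arguments for cellCounts; index name and thread count are hypothetical.
cc.args <- list(index      = "mm10_index",
                sample     = fastq.dir,
                input.mode = "FASTQ-dir",
                nthreads   = 4)

# Not run: needs real data and an index created by buildindex.
if (FALSE) {
  res <- do.call(Rsubread::cellCounts, cc.args)
}
```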

Value

The cellCounts function returns a List object to R. It also outputs three gzipped FASTQ files and one BAM file for each sample. The three gzipped FASTQ files include cell barcode and UMI sequences (R1), sample index sequences (I1) and the actual genomic sequences of the reads (R2), respectively. The BAM file includes location-sorted read mapping results.

The returned List object contains the following components:

counts

a List object including UMI counts for each sample. Each component in this object is a matrix that contains UMI counts for a sample. Rows in the matrix are genes and columns are cells.

annotation

a data.frame object containing a gene annotation. This is the annotation that was used for the assignment of UMIs to genes during quantification. Rows in the annotation are genes. Columns of the annotation include GeneID, Chr, Start, End and Length.

sample.info

a data.frame object containing sample information and quantification statistics. It includes the following columns: SampleName, InputDirectory (if the input format is BCL), TotalCells, HighConfidenceCells (if umi.cutoff is NULL), RescuedCells (if umi.cutoff is NULL), TotalUMI, MinUMI, MedianUMI, MaxUMI, MeanUMI, TotalReads, MappedReads and AssignedReads. Each row in the data frame is a sample.

cell.confidence

a List object indicating if a cell is a high-confidence cell or a rescued cell (low confidence). Each component in the object is a logical vector indicating which cells in a sample are high-confidence cells. cell.confidence is included in the output only if umi.cutoff is NULL.
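
The components described above can be explored with ordinary list and matrix operations. The sketch below uses a small mock object with made-up numbers rather than real cellCounts output, and it assumes that each logical vector in cell.confidence is in the same order as the columns of the matching count matrix.

```r
# Mock object mimicking the documented structure (made-up values, illustration only).
mock.counts <- matrix(c(3, 0, 1,
                        0, 2, 5,
                        1, 1, 0,
                        4, 0, 2),
                      nrow = 3,
                      dimnames = list(paste0("Gene", 1:3), paste0("Cell", 1:4)))
res <- list(
  counts          = list(mock.counts),
  cell.confidence = list(c(TRUE, TRUE, FALSE, TRUE))
)

# UMI count matrix of the first sample: genes in rows, cells in columns.
umi <- res$counts[[1]]

# Total UMI count per cell.
umi.per.cell <- colSums(umi)

# Keep only the high-confidence cells (assumes the vector is aligned with the columns).
high.conf <- umi[, res$cell.confidence[[1]], drop = FALSE]
```

With real output, cell.confidence is present only when umi.cutoff is NULL, so code like the last line should check for that component first.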

Author(s)

Yang Liao and Wei Shi

See Also

buildindex, align, featureCounts


[Package Rsubread version 2.6.4 Index]