scater 1.6.3
This document provides advice for users of early versions of scater who will
need to transition from the use of the SCESet class to the SingleCellExperiment class.
As of July 2017, scater has switched from the SCESet class previously
defined within the package to the more widely applicable SingleCellExperiment
class. From Bioconductor 3.6 (October 2017), the release version of scater
will use SingleCellExperiment.
SingleCellExperiment is a more modern and robust class that provides a common
data structure used by many single-cell Bioconductor packages. Advantages
include support for sparse data matrices and the capability for on-disk storage
of data to minimise memory usage for large single-cell datasets.
It should be straightforward to convert existing scripts based on SCESet
objects to SingleCellExperiment objects, with key changes outlined immediately
below.
- The functions toSingleCellExperiment and updateSCESet (for backwards
  compatibility) can be used to convert an old SCESet object to a
  SingleCellExperiment object;
- A new SingleCellExperiment object can be created with the function
  SingleCellExperiment (actually less fiddly than creating a new SCESet);
- scater functions have been refactored to take SingleCellExperiment
  objects, so once data is in a SingleCellExperiment object, the user experience
  is almost identical to that with the SCESet class.

Potential “gotchas”:
- Cell names are accessed/assigned with the colnames function (instead
  of sampleNames or cellNames for an SCESet object);
- Feature names are accessed/assigned with the rownames function (instead of
  featureNames);
- Cell metadata, stored as phenoData in an SCESet, corresponds to colData
  in a SingleCellExperiment object and is accessed/assigned with the colData
  function (this replaces the pData function);
- Individual cell-level variables can be accessed with the $ operator
  (e.g. sce$total_counts);
- Feature metadata, stored as featureData in an SCESet, corresponds to
  rowData in a SingleCellExperiment object and is accessed/assigned with the
  rowData function (this replaces the fData function);
- plotScater, which produces a cumulative expression overview plot, replaces
  the generic plot function for SCESet objects.

In Bioconductor terminology we assay numerous “features” for a number of
“samples”. Features, in the context of scater, correspond most commonly to
genes or transcripts, but could be any general genomic or transcriptomic regions
(e.g. exon) of interest for which we take measurements. Samples correspond to
cells.
With the switch to using the SingleCellExperiment class, the terminology has
become more general again. Now we have “rows” representing features and “cols”
representing samples (cells). Thus, applying the rownames function returns the
names of the features defined for a SingleCellExperiment object, which in
typical scater usage would correspond to gene IDs. In much of
what follows, it may be more intuitive to mentally replace “feature” with “gene”
or “transcript” (depending on the context of the study) wherever “feature”
appears.
In the scater context, “samples” refer to individual cells that we have
assayed. This differs from common usage of “sample” in other contexts, where
we might usually use “sample” to refer to an individual subject, a biological
replicate or similar. A “sample” in this sense in scater may be referred to as
a “block” in the more classical statistical sense. Within a “block”
(e.g. an individual) we may have assayed numerous cells. Thus, the function
colnames, when applied to a SingleCellExperiment object, returns the cell IDs.
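The row/column terminology above can be illustrated with a small sketch. The toy matrix, the object name toy_sce and the Individual column are hypothetical and exist only for illustration; the sketch assumes the SingleCellExperiment package is installed.

```r
library(SingleCellExperiment)

# Toy data (hypothetical): 3 features (rows) by 4 cells (columns)
toy_counts <- matrix(1:12, nrow = 3,
                     dimnames = list(paste0("Gene_", 1:3),
                                     paste0("Cell_", 1:4)))

# "Individual" plays the role of the "block" each cell was taken from
toy_sce <- SingleCellExperiment(
    assays = list(counts = toy_counts),
    colData = DataFrame(Individual = c("A", "A", "B", "B"),
                        row.names = colnames(toy_counts))
)

rownames(toy_sce)          # feature (gene) names
colnames(toy_sce)          # sample (cell) IDs
table(toy_sce$Individual)  # number of cells assayed within each block
```

Here each column is a cell ("sample" in the scater sense), while the Individual variable records the subject-level grouping that other contexts would call a sample.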
SingleCellExperiment class and methods

In scater we organise single-cell expression data in objects of the
SingleCellExperiment class. The class inherits from the Bioconductor
SummarizedExperiment class, which provides a common interface across many
Bioconductor packages. For more details about other features inherited from
Bioconductor’s SummarizedExperiment class, type ?SummarizedExperiment at the
R prompt.
The class only requires some “assay data” (i.e. expression values of some sort) as input. Most commonly, these will be “counts” (e.g. molecule or read counts) and/or log2-scale transformed counts.
Cell metadata can be supplied as a DataFrame object, where rows are cells, and
columns are cell attributes (such as cell type, culture condition, day captured,
etc.). Feature metadata can be supplied as a DataFrame object, where rows are
features (e.g. genes), and columns are feature attributes, such as Ensembl ID,
biotype, GC content, etc.
We can create a minimal SingleCellExperiment object as follows:
library(scater)
data("sc_example_counts")
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts))
example_sce
## class: SingleCellExperiment 
## dim: 2000 40 
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(0):
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):
The requirements for the SingleCellExperiment class (as with other S4 classes
in R and Bioconductor) are strict. The idea is that strictness in generating a valid
class object ensures that downstream methods applied to the class will work
reliably.
Thus, if we supply colData and/or rowData when building an object, the
expression value matrix must have the same number of columns as the colData
DataFrame has rows, and it must have the same number of rows as the rowData
DataFrame has rows. Row names of the colData object need to match the column
names of the expression matrix and row names of the rowData object need to
match row names of the expression matrix.
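As a rough illustration of this strictness, the sketch below tries to build an object from cell metadata that has too few rows. The toy matrix and the Treatment column are hypothetical, and the exact wording of the error message depends on the installed package versions, so it is not reproduced here.

```r
library(SingleCellExperiment)

# Toy data (hypothetical): 3 genes by 4 cells
toy_counts <- matrix(1:12, nrow = 3,
                     dimnames = list(paste0("Gene_", 1:3),
                                     paste0("Cell_", 1:4)))

# Cell metadata with only 3 rows for a 4-column matrix: construction fails
bad_cells <- DataFrame(Treatment = c("a", "b", "a"),
                       row.names = paste0("Cell_", 1:3))
result <- tryCatch(
    SingleCellExperiment(assays = list(counts = toy_counts),
                         colData = bad_cells),
    error = function(e) e
)
inherits(result, "error")

# Cell metadata with one row per column, named to match, is accepted
good_cells <- DataFrame(Treatment = c("a", "b", "a", "b"),
                        row.names = colnames(toy_counts))
ok_sce <- SingleCellExperiment(assays = list(counts = toy_counts),
                               colData = good_cells)
identical(rownames(colData(ok_sce)), colnames(ok_sce))
```

Catching the error with tryCatch keeps the sketch runnable from top to bottom; in an interactive session the failed call would simply stop with a message.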
We can create a new SingleCellExperiment object with count data, cell metadata
and gene metadata as follows.
data("sc_example_cell_info")
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts), 
                                    colData = sc_example_cell_info, 
                                    rowData = gene_df)
example_sce
## class: SingleCellExperiment 
## dim: 2000 40 
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(1): Gene
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(4): Cell Mutation_Status Cell_Cycle Treatment
## reducedDimNames(0):
## spikeNames(0):
Typically, we will want both raw counts and log2-scale counts in
our SingleCellExperiment object. It is straightforward to add
log2-counts-per-million to an object containing counts.
We can use the normalise (or, if you prefer, normalize) function:
example_sce <- normalise(example_sce)
## Warning in .local(object, ...): using library sizes as size factors
(This gives a warning to let us know that, because size factors for normalisation have not yet been defined, library sizes (total counts) are used instead. This function can also be used for more sophisticated size-factor normalisation once size factors have been calculated.)
Alternatively, we can use calculateCPM directly (with equivalent results):
logcounts(example_sce) <- log2(calculateCPM(example_sce, 
                                            use.size.factors = FALSE) + 1)
The log-scale count data is stored in the logcounts assay slot of a
SingleCellExperiment object. The exprs getter/setter function also accesses
this logcounts slot, to enable equivalent usage as in previous versions of
scater.
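For readers who want to see the arithmetic behind a log2-counts-per-million value, here is a minimal base-R sketch on a hypothetical toy matrix. It assumes the usual definition of counts-per-million (counts divided by the cell's library size, times one million) and is not taken from the scater source, so small differences from calculateCPM are possible.

```r
# Toy count matrix (hypothetical values): 3 genes by 2 cells
toy_counts <- matrix(c(0, 500, 500,
                       250, 250, 500),
                     nrow = 3,
                     dimnames = list(paste0("Gene_", 1:3),
                                     c("Cell_1", "Cell_2")))

lib_sizes <- colSums(toy_counts)                       # total counts per cell
toy_cpm <- sweep(toy_counts, 2, lib_sizes, "/") * 1e6  # counts per million
toy_logcpm <- log2(toy_cpm + 1)                        # offset of 1 keeps zero counts at zero

toy_logcpm
```

The offset of 1 before taking logs means that a gene with no counts in a cell has a log-scale value of exactly zero.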
SingleCellExperiment object

We have accessor functions to access elements of the SingleCellExperiment
object. Furthermore, subsetting SingleCellExperiment objects is
straightforward and reliable, using the usual R [] notation, with rows
representing features and columns representing cells.
counts(object): returns the matrix of read counts. If no counts are defined
for the object, then the counts matrix slot is simply NULL.
counts(example_sce)[1:3, 1:6]
##           Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001        0      123        2        0        0        0
## Gene_0002      575       65        3     1561     2311      160
## Gene_0003        0        0        0        0     1213        0
exprs(object): returns the matrix of (log-scale) expression values. It is a
synonym for logcounts and accesses the logcounts slot of the object. Typically
these should be log2(counts-per-million) values or
log2(reads-per-kilobase-per-million-mapped), appropriately normalised of course.
The package will generally assume that these are the values to use for
expression.
exprs(example_sce)[1:3, 1:6]
##           Cell_001 Cell_002 Cell_003 Cell_004  Cell_005 Cell_006
## Gene_0001 0.000000 8.192430 1.828628  0.00000  0.000000 0.000000
## Gene_0002 9.033633 7.276677 2.271422 11.07878 10.103749 8.492693
## Gene_0003 0.000000 0.000000 0.000000  0.00000  9.174997 0.000000
Any expression matrix stored in the object can be accessed with the assay
function. We simply supply the function with the SingleCellExperiment object
and the name of the desired expression matrix:
assay(example_sce, "counts")[1:3, 1:6]
Similarly, we can assign a new (say, transformed) expression matrix to a
SingleCellExperiment object using assay as follows:
assay(example_sce, "counts") <- counts(example_sce)
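The assignment above writes the counts back under their own name, so as a slightly fuller sketch the example below stores a transformed matrix under a new name. The toy data and the assay name sqrt_counts are hypothetical; assayNames is the SummarizedExperiment accessor that lists the stored matrices.

```r
library(SingleCellExperiment)

# Toy data (hypothetical): 3 genes by 2 cells
toy_counts <- matrix(c(0, 3, 8, 1, 0, 4), nrow = 3,
                     dimnames = list(paste0("Gene_", 1:3),
                                     c("Cell_1", "Cell_2")))
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Store a transformed matrix under a new (hypothetical) assay name
assay(toy_sce, "sqrt_counts") <- sqrt(assay(toy_sce, "counts"))

assayNames(toy_sce)                   # names of all stored expression matrices
assay(toy_sce, "sqrt_counts")[1:2, ]  # retrieve the new matrix by name
```

The new matrix must have the same dimensions as the existing assays, which is automatically the case for an element-wise transformation like this one.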
For convenience (and backwards compatibility), getters and setters are provided
for exprs, tpm, cpm and fpkm (and versions of these with the prefix “norm_”).
Handily, it is also easy to replace other data in slots of the
SingleCellExperiment object using generic accessor and replacement functions.
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
## replace rowData (previously featureData)
rowData(example_sce) <- gene_df
## replace colData (previously phenotype data)
colData(example_sce) <- DataFrame(sc_example_cell_info)
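Because row and column metadata live alongside the expression matrices, subsetting with [] keeps them in step. The sketch below shows this on a hypothetical toy object; the Treatment values and object names are invented for illustration.

```r
library(SingleCellExperiment)

# Toy data (hypothetical): 3 genes by 4 cells
toy_counts <- matrix(1:12, nrow = 3,
                     dimnames = list(paste0("Gene_", 1:3),
                                     paste0("Cell_", 1:4)))
toy_sce <- SingleCellExperiment(
    assays = list(counts = toy_counts),
    colData = DataFrame(Treatment = c("ctrl", "ctrl", "drug", "drug"),
                        row.names = colnames(toy_counts)),
    rowData = DataFrame(Gene = rownames(toy_counts),
                        row.names = rownames(toy_counts))
)

# Keep the first two genes (rows) and only the treated cells (columns)
sub_sce <- toy_sce[1:2, toy_sce$Treatment == "drug"]

dim(sub_sce)           # 2 features by 2 cells
colnames(sub_sce)      # cells that were kept
rowData(sub_sce)$Gene  # feature metadata subset along with the rows
```

A single [] call therefore subsets the assays, colData and rowData together, which avoids the bookkeeping needed when these are held in separate objects.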
After gaining familiarity with creating and manipulating SingleCellExperiment
objects, see the other scater vignettes for guidance on using scater for
quality control, data visualisation and more.