scater 1.6.3
This document provides advice for users of early versions of scater
who will
need to transition from the use of the SCESet
class to the SingleCellExperiment
class.
As of July 2017, scater
has switched from the SCESet
class previously
defined within the package to the more widely applicable SingleCellExperiment
class. From Bioconductor 3.6 (October 2017), the release version of scater
will use SingleCellExperiment
.
SingleCellExperiment
is a more modern and robust class that provides a common
data structure used by many single-cell Bioconductor packages. Advantages
include support for sparse data matrices and the capability for on-disk storage
of data to minimise memory usage for large single-cell datasets.
It should be straight-forward to convert existing scripts based on SCESet
objects to SingleCellExperiment
objects, with key changes outlined immediately
below.
toSingleCellExperiment
and updateSCESet
(for backwards
compatibility) can be used to convert an old SCESet
object to a
SingleCellExperiment
object;SingleCellExperiment
object with the function
SingleCellExperiment
(actually less fiddly than creating a new SCESet
);scater
functions have been refactored to take SingleCellExperiment
objects, so once data is in a SingleCellExperiment
object, the user experience
is almost identical to that with the SCESet
class.Potential “gotchas”:
colnames
function (instead
of sampleNames
or cellNames
for an SCESet
object);rownames
function (instead of featureNames
);phenoData
in an SCESet
, corresponds to colData
in a SingleCellExperiment
object and is accessed/assigned with the colData
function (this replaces the pData
function);$
operator
(e.g. sce$total_counts
);featureData
in an SCESet
, corresponds to
rowData
in a SingleCellExperiment
object and is accessed/assigned with the
rowData
function (this replaces the fData
function);plotScater
, which produces a cumulative expression, overview plot, replaces
the generic plot
function for SCESet
objects.In Bioconductor terminology we assay numerous “features” for a number of
“samples”. Features, in the context of scater
, correspond most commonly to
genes or transcripts, but could be any general genomic or transcriptomic regions
(e.g. exon) of interest for which we take measurements. Samples correspond to
cells.
With the switch to using the SingleCellExperiment
class, the terminology has
become more general again. Now we have “rows” representing features and “cols”
representing samples (cells). Thus, applying the rownames
function returns the
names of the features defined for a SingleCellExperiment
object, which in
typical scater
usage would correspond to gene IDs. In much of
what follows, it may be more intuitive to mentally replace “feature” with “gene”
or “transcript” (depending on the context of the study) wherever “feature”
appears.
In the scater
context, “samples” refer to individual cells that we have
assayed. This differs from common usage of “sample” in other contexts, where
we might usually use “sample” to refer to an individual subject, a biological
replicate or similar. A “sample” in this sense in scater
may be referred to as
a “block” in the more classical statistical sense. Within a “block” (
e.g. individual) we may have assayed numerous cells. Thus, the function colnames
,
when applied to a SingleCellExperiment
object returns the cell IDs.
SingleCellExperiment
class and methodsIn scater
we organise single-cell expression data in objects of the
SingleCellExperiment
class. The class inherits the Bioconductor
SummarizedExperiment
class, which provides a common interface across many
Bioconductor packages. For more details about other features inherited from
Bioconductor’s SummarizedExperiment
class, type ?SummarizedExperiment
at the
R prompt.
The class only requires some “assay data” (i.e. expression values of some sort) as input. Most commonly, these will be “counts” (e.g. molecule or read counts) and/or log2-scale transformed counts.
Cell metadata can be supplied as a DataFrame
object, where rows are cells, and
columns are cell attributes (such as cell type, culture condition, day captured,
etc.). Feature metadata can be supplied as a DataFrame
object, where rows are features (e.g. genes), and columns are feature attributes, such as Ensembl ID,
biotype, gc content, etc.
We can create a minimal SingleCellExperiment
object as follows:
data("sc_example_counts")
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts))
example_sce
## class: SingleCellExperiment
## dim: 2000 40
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(0):
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):
The requirements for the SingleCellExperiment
class (as with other S4 classes
in R and Bioconductor) are strict. The idea is that strictness with generating a valid
class object ensures that downstream methods applied to the class will work
reliably.
Thus, if we supply colData
and/or rowData
when building an obejct, the
expression value matrix must have the same number of columns as the colData
DataFrame
has rows, and it must have the same number of rows as the rowData
DataFrame
has rows. Row names of the colData
object need to match the column
names of the expression matrix and row names of the rowData
object need to
match row names of the expression matrix.
We can create a new SingleCellExperiment
object with count data, cell metadata
and gene metadata as follows.
data("sc_example_cell_info")
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts),
colData = sc_example_cell_info,
rowData = gene_df)
example_sce
## class: SingleCellExperiment
## dim: 2000 40
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(1): Gene
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(4): Cell Mutation_Status Cell_Cycle Treatment
## reducedDimNames(0):
## spikeNames(0):
Frequently (typically), we will want both raw counts and log2-scale counts in
our SingleCellExperiment
object. It is straight-forward to add
log2-counts-per-million to an object containing counts.
We can use the normalise
(or, if you prefer, normalize
) function:
example_sce <- normalise(example_sce)
## Warning in .local(object, ...): using library sizes as size factors
(This gives a warning to let us know that as size factors for normalisation have not yet been defined, library sizes (total counts) are used instead. This function can also be used for more sophisticated size-factor normalisation once size factors have been calculated.)
Or, we use calculateCPM
directly (with equivalent results):
logcounts(example_sce) <- log2(calculateCPM(example_sce,
use.size.factors = FALSE) + 1)
The log-scale count data is stored in the logcounts
assay slot of a
SingleCellExperiment
object. The exprs
getter/setter function also accesses
this logcounts
slot, to enable equivalent usage as in previous versions of
scater
.
SingleCellExperiment
objectWe have accessor functions to access elements of the SingleCellExperiment
object. Furthermore, subsetting SingleCellExperiment
objects is
straightforward and reliable, using the usual R []
notation, with rows
representing features and columns representing cells.
counts(object)
: returns the matrix of read counts. As you can see above, if
no counts are defined for the object, then the counts matrix slot is simpy
NULL
.counts(example_sce)[1:3, 1:6]
## Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001 0 123 2 0 0 0
## Gene_0002 575 65 3 1561 2311 160
## Gene_0003 0 0 0 0 1213 0
exprs(object)
: returns the matrix of (log-counts) expression values, in fact
accessing the logcounts
slot of the object (synonym for logcounts
). Typically these
should be log2(counts-per-million) values or
log2(reads-per-kilobase-per-million-mapped), appropriately normalised of course.
The package will generally assume that these are the values to use for
expression.exprs(example_sce)[1:3, 1:6]
## Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001 0.000000 8.192430 1.828628 0.00000 0.000000 0.000000
## Gene_0002 9.033633 7.276677 2.271422 11.07878 10.103749 8.492693
## Gene_0003 0.000000 0.000000 0.000000 0.00000 9.174997 0.000000
assay
function. We simply supply the function with the SingleCellExperiment
object
and the name of the desired expression matrix:assay(example_sce, "counts")[1:3, 1:6]
Similarly we can assign a new (say, transformed) expression matrix to an
SingleCellExperiment
object using assay
as follows:
assay(example_sce, "counts") <- counts(example_sce)
For convenience (and backwards compatibility) getters and setters are provided
as follows: exprs
, tpm
, cpm
, fpkm
and versions of these with the prefix “norm_”):
Handily, it is also easy to replace other data in slots of the SCESet
object
using generic accessor and replacement functions.
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
## replace rowData (previously featureData)
rowData(example_sce) <- gene_df
## replace colData (previously phenotype data)
colData(example_sce) <- DataFrame(sc_example_cell_info)
After gaining familiarity with creating and manipulating SingleCellExperiment
objects, see the other scater
vignettes for guidance on using scater
for
quality control, data visualisation and more.