scmap package vignette
Single-cell RNA-seq (scRNA-seq) is widely used to investigate the composition of complex tissues, since the technology allows researchers to define cell types using unsupervised clustering of the transcriptome. However, due to differences in experimental methods and computational analyses, it is often challenging to directly compare the cells identified in two different experiments. scmap is a method for projecting cells from an scRNA-seq experiment onto the cell types identified in a different experiment. A copy of the scmap manuscript is available on bioRxiv.
SingleCellExperiment class
scmap is built on top of Bioconductor’s SingleCellExperiment class. scmap operates on objects of class SingleCellExperiment and writes all of its results back to the object.
scmap Input
If you already have a SingleCellExperiment object, then proceed to the next chapter.
If you have a matrix or a data frame containing expression data, then you first need to create a SingleCellExperiment object containing your data. For illustrative purposes we will use an example expression matrix provided with scmap. The dataset (yan) represents FPKM gene expression of 90 cells derived from human embryos. The authors (Yan et al.) defined the developmental stages of all cells in the original publication (ann data frame). We will use these stages in the projection later.
library(SingleCellExperiment)
library(scmap)
head(ann)
##                 cell_type1
## Oocyte..1.RPKM.     zygote
## Oocyte..2.RPKM.     zygote
## Oocyte..3.RPKM.     zygote
## Zygote..1.RPKM.     zygote
## Zygote..2.RPKM.     zygote
## Zygote..3.RPKM.     zygote
yan[1:3, 1:3]
##          Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## C9orf152             0.0             0.0             0.0
## RPS11             1219.9          1021.1           931.6
## ELMO2                7.0            12.2             9.3
Note that the cell type information has to be stored in the cell_type1 column of the colData slot of the SingleCellExperiment object.
Now let’s create a SingleCellExperiment object of the yan dataset:
sce <- SingleCellExperiment(assays = list(normcounts = as.matrix(yan)), colData = ann)
# this is needed to calculate dropout rate for feature selection
# important: normcounts have the same zeros as raw counts (fpkm)
counts(sce) <- normcounts(sce)
logcounts(sce) <- log2(normcounts(sce) + 1)
# use gene names as feature symbols
rowData(sce)$feature_symbol <- rownames(sce)
isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))
# remove features with duplicated names
sce <- sce[!duplicated(rownames(sce)), ]
sce
## class: SingleCellExperiment
## dim: 20214 90
## metadata(0):
## assays(3): normcounts counts logcounts
## rownames(20214): C9orf152 RPS11 ... CTSC AQP7
## rowData names(1): feature_symbol
## colnames(90): Oocyte..1.RPKM. Oocyte..2.RPKM. ...
## Late.blastocyst..3..Cell.7.RPKM. Late.blastocyst..3..Cell.8.RPKM.
## colData names(1): cell_type1
## reducedDimNames(0):
## spikeNames(1): ERCC
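The comments in the code above note that the counts slot is needed to calculate the dropout rate and that normcounts share their zeros with the raw counts. As a rough illustration on a hypothetical toy matrix (toy_fpkm is made up for this example and is not part of scmap), the per-gene dropout rate and the log2 transform could be computed in base R as follows. Because log2(x + 1) maps zeros to zeros, the dropout rate is the same on both scales.

```r
# Hypothetical toy FPKM matrix: 3 genes (rows) x 4 cells (columns)
toy_fpkm <- matrix(
    c( 0,  0, 0, 5,
      10, 12, 0, 8,
       3,  4, 6, 7),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:4))
)

# Same transform as the one used for logcounts above
toy_log <- log2(toy_fpkm + 1)

# Dropout rate: fraction of cells in which a gene has zero expression
dropout_raw <- rowMeans(toy_fpkm == 0)
dropout_log <- rowMeans(toy_log == 0)

print(dropout_raw)
```

Here geneA is absent from three of the four cells, geneB from one, and geneC from none, and the same rates are obtained from the log-transformed values.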
scmap
Once we have a SingleCellExperiment object we can run scmap. First, we need to select the most informative features from our input dataset:
sce <- getFeatures(sce, suppress_plot = FALSE)
Genes highlighted in red will be used in the further analysis (projection).
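This vignette does not spell out how getFeatures ranks genes. As a rough sketch only, and assuming a dropout-based criterion in which genes with more zeros than expected for their expression level are treated as informative, the idea could look like this in base R. The helper select_by_dropout, the linear fit and the simulated data are hypothetical and are not part of the scmap API.

```r
# Hypothetical helper, not part of scmap: rank genes by how far their
# log2 dropout rate lies above a linear fit on mean log-expression.
select_by_dropout <- function(counts, n_features = 3) {
    dropout <- rowMeans(counts == 0) * 100   # percent of zero cells per gene
    # 0% dropout gives log2(0) = -Inf; 100% dropout means never expressed
    keep <- which(dropout > 0 & dropout < 100)
    log_dropout <- log2(dropout[keep])
    mean_expr <- rowMeans(log2(counts[keep, , drop = FALSE] + 1))
    fit <- lm(log_dropout ~ mean_expr)
    # largest positive residuals: more zeros than expected for that expression
    ranked <- keep[order(residuals(fit), decreasing = TRUE)]
    rownames(counts)[head(ranked, n_features)]
}

# Simulated toy counts: 20 genes x 30 cells with gene-specific means
set.seed(1)
lambdas <- seq(0.5, 4, length.out = 20)
toy_counts <- matrix(rpois(20 * 30, lambda = lambdas), nrow = 20,
                     dimnames = list(paste0("gene", 1:20), paste0("cell", 1:30)))

selected <- select_by_dropout(toy_counts, n_features = 3)
print(selected)
```

The number of features is kept small here only because the toy matrix is small; a real dataset would normally use many more.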
We will project the yan dataset to itself:
sce <- projectData(projection = sce, reference = sce)
In your own analysis you can choose any two scRNA-seq datasets and project them onto each other. Note that the getFeatures function has to be run on the reference dataset before running the projectData function.
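To make the idea of projecting cells onto a reference more concrete, here is a minimal nearest-centroid sketch in base R. It assumes that each reference cell type is summarised by a centroid, that cells are compared to centroids with cosine similarity, and that cells below a similarity cut-off are left "unassigned". The helper names, the single similarity measure and the 0.7 threshold are illustrative assumptions; projectData itself may combine several measures and use different defaults.

```r
# Cosine similarity between two numeric vectors
cosine_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Hypothetical helper, not part of scmap: assign each cell (column of `cells`)
# to the most similar centroid (column of `centroids`).
assign_to_centroids <- function(cells, centroids, threshold = 0.7) {
    # rows: centroids, columns: cells
    sims <- apply(cells, 2, function(cell) {
        apply(centroids, 2, cosine_sim, y = cell)
    })
    best <- apply(sims, 2, which.max)
    siml <- apply(sims, 2, max)
    labs <- colnames(centroids)[best]
    labs[siml < threshold] <- "unassigned"
    data.frame(labs = labs, siml = siml, row.names = colnames(cells),
               stringsAsFactors = FALSE)
}

# Toy reference with two cell types and three genes
centroids <- cbind(typeA = c(5, 0, 1), typeB = c(0, 4, 1))
rownames(centroids) <- c("g1", "g2", "g3")

# Toy cells to project; cell3 resembles neither centroid
cells <- cbind(cell1 = c(4, 0, 1), cell2 = c(0, 5, 2), cell3 = c(1, 1, 5))
rownames(cells) <- rownames(centroids)

result <- assign_to_centroids(cells, centroids)
print(result)
```

In this toy example cell1 and cell2 are assigned to typeA and typeB respectively, while cell3 falls below the threshold for both centroids and stays unassigned.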
Let’s look at the results. The labels produced by scmap are located in the scmap_labs column of the colData slot of the projection dataset. We will compare them to the original labels provided by the authors of the publication:
colData(sce)
## DataFrame with 90 rows and 3 columns
##                                  cell_type1  scmap_labs scmap_siml
##                                    <factor> <character>  <numeric>
## Oocyte..1.RPKM.                      zygote      zygote  0.9947609
## Oocyte..2.RPKM.                      zygote      zygote  0.9951257
## Oocyte..3.RPKM.                      zygote      zygote  0.9955916
## Zygote..1.RPKM.                      zygote       2cell  0.9934012
## Zygote..2.RPKM.                      zygote       2cell  0.9953694
## ...                                     ...         ...        ...
## Late.blastocyst..3..Cell.4.RPKM.      blast       blast  0.8321482
## Late.blastocyst..3..Cell.5.RPKM.      blast       blast  0.8400685
## Late.blastocyst..3..Cell.6.RPKM.      blast       blast  0.9235622
## Late.blastocyst..3..Cell.7.RPKM.      blast       blast  0.9377231
## Late.blastocyst..3..Cell.8.RPKM.      blast       blast  0.9174087
The projection is almost perfect: among the rows shown, only two zygote cells are assigned to the 2cell stage. With scmap one can also plot a Sankey diagram (note that the cell_type1 columns have to be provided in the colData slots of both the reference and the projection datasets):
plot(getSankey(colData(sce)$cell_type1, colData(sce)$scmap_labs))
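Alongside the Sankey diagram, a plain contingency table gives a quick numeric summary of how the projected labels relate to the original ones. The sketch below uses short hypothetical label vectors; in the analysis above the same calls could be applied to colData(sce)$cell_type1 and colData(sce)$scmap_labs after converting both to character vectors.

```r
# Hypothetical label vectors standing in for the original and projected labels
original  <- c("zygote", "zygote", "zygote", "2cell", "2cell", "blast", "blast")
projected <- c("zygote", "2cell",  "zygote", "2cell", "2cell", "blast", "unassigned")

# Rows: original labels, columns: projected labels
confusion <- table(original = original, projected = projected)
print(confusion)

# Fraction of cells whose projected label matches the original one
agreement <- mean(original == projected)
print(agreement)
```

Off-diagonal entries of the table show which cell types are confused with each other, which is the same information the Sankey diagram conveys graphically.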
The cell type centroids can be precomputed using the createReference method:
reference <- createReference(sce[rowData(sce)$scmap_features, ])
One can also visualise the cell type centroids, e.g.:
heatmap(as.matrix(reference))
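As an illustration of what a cell type centroid is, the following base R sketch summarises each gene by its median expression within every cell type. The use of the median and the helper name compute_centroids are assumptions made for this example; createReference may compute its centroids differently.

```r
# Hypothetical helper, not part of scmap: one centroid (column) per cell type,
# computed as the per-gene median over the cells of that type.
compute_centroids <- function(expr, cell_types) {
    types <- unique(cell_types)
    centroids <- sapply(types, function(type) {
        apply(expr[, cell_types == type, drop = FALSE], 1, median)
    })
    colnames(centroids) <- types
    centroids
}

# Toy expression matrix: 2 genes x 5 cells
toy_expr <- matrix(
    c(1, 3, 5, 0, 0,
      0, 0, 2, 6, 8),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("g1", "g2"), paste0("cell", 1:5))
)
toy_types <- c("zygote", "zygote", "zygote", "blast", "blast")

toy_centroids <- compute_centroids(toy_expr, toy_types)
print(toy_centroids)
```

The result has one row per gene and one column per cell type, which is the shape that a heatmap of centroids like the one above expects.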
Exactly the same projection as above can be performed by providing the precomputed reference instead of the SingleCellExperiment object:
sce <- projectData(projection = sce, reference = reference)
colData(sce)
## DataFrame with 90 rows and 3 columns
##                                  cell_type1  scmap_labs scmap_siml
##                                    <factor> <character>  <numeric>
## Oocyte..1.RPKM.                      zygote      zygote  0.9947609
## Oocyte..2.RPKM.                      zygote      zygote  0.9951257
## Oocyte..3.RPKM.                      zygote      zygote  0.9955916
## Zygote..1.RPKM.                      zygote       2cell  0.9934012
## Zygote..2.RPKM.                      zygote       2cell  0.9953694
## ...                                     ...         ...        ...
## Late.blastocyst..3..Cell.4.RPKM.      blast       blast  0.8321482
## Late.blastocyst..3..Cell.5.RPKM.      blast       blast  0.8400685
## Late.blastocyst..3..Cell.6.RPKM.      blast       blast  0.9235622
## Late.blastocyst..3..Cell.7.RPKM.      blast       blast  0.9377231
## Late.blastocyst..3..Cell.8.RPKM.      blast       blast  0.9174087
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bindrcpp_0.2 scmap_1.0.0
## [3] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.0
## [5] DelayedArray_0.4.0 matrixStats_0.52.2
## [7] Biobase_2.38.0 GenomicRanges_1.30.0
## [9] GenomeInfoDb_1.14.0 IRanges_2.12.0
## [11] S4Vectors_0.16.0 BiocGenerics_0.24.0
## [13] googleVis_0.6.2 knitr_1.17
## [15] BiocStyle_2.6.0
##
## loaded via a namespace (and not attached):
## [1] reshape2_1.4.2 lattice_0.20-35 colorspace_1.3-2
## [4] htmltools_0.3.6 yaml_2.1.14 rlang_0.1.2
## [7] e1071_1.6-8 glue_1.2.0 GenomeInfoDbData_0.99.1
## [10] bindr_0.1 plyr_1.8.4 stringr_1.2.0
## [13] zlibbioc_1.24.0 munsell_0.4.3 gtable_0.2.0
## [16] codetools_0.2-15 evaluate_0.10.1 labeling_0.3
## [19] class_7.3-14 Rcpp_0.12.13 backports_1.1.1
## [22] scales_0.5.0 jsonlite_1.5 XVector_0.18.0
## [25] ggplot2_2.2.1 digest_0.6.12 stringi_1.1.5
## [28] bookdown_0.5 dplyr_0.7.4 grid_3.4.2
## [31] rprojroot_1.2 tools_3.4.2 bitops_1.0-6
## [34] magrittr_1.5 RCurl_1.95-4.8 lazyeval_0.2.1
## [37] proxy_0.4-19 tibble_1.3.4 randomForest_4.6-12
## [40] pkgconfig_2.0.1 Matrix_1.2-11 assertthat_0.2.0
## [43] rmarkdown_1.6 R6_2.2.2 compiler_3.4.2