1 Introduction

Single-cell RNA-seq (scRNA-seq) is widely used to investigate the composition of complex tissues since the technology allows researchers to define cell-types using unsupervised clustering of the transcriptome. However, due to differences in experimental methods and computational analyses, it is often challenging to directly compare the cells identified in two different experiments.

scmap is a method for projecting cells from a scRNA-seq experiment onto the cell types identified in a different experiment. A copy of the scmap manuscript is available on bioRxiv.

2 SingleCellExperiment class

scmap is built on top of Bioconductor’s SingleCellExperiment class. It operates on objects of class SingleCellExperiment and writes all of its results back to the object.

3 scmap Input

If you already have a SingleCellExperiment object, then proceed to the next chapter.

If you have a matrix or a data frame containing expression data, then you first need to create a SingleCellExperiment object containing your data. For illustrative purposes we will use an example expression matrix provided with scmap. The dataset (yan) represents FPKM gene expression of 90 cells derived from human embryos. The authors (Yan et al.) defined the developmental stage of every cell in the original publication (ann data frame). We will use these stages in the projection later.

library(SingleCellExperiment)
library(scmap)

head(ann)
##                 cell_type1
## Oocyte..1.RPKM.     zygote
## Oocyte..2.RPKM.     zygote
## Oocyte..3.RPKM.     zygote
## Zygote..1.RPKM.     zygote
## Zygote..2.RPKM.     zygote
## Zygote..3.RPKM.     zygote
yan[1:3, 1:3]
##          Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## C9orf152             0.0             0.0             0.0
## RPS11             1219.9          1021.1           931.6
## ELMO2                7.0            12.2             9.3

Note that the cell type information has to be stored in the cell_type1 column of the colData slot of the SingleCellExperiment object.
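Before building the object it can help to check that the annotation matches the expression matrix. The following is a minimal base R sketch on toy data; check_annotation is a hypothetical helper (not part of scmap), and the toy values are made up for illustration.

```r
# Toy stand-ins for an expression matrix and its annotation (hypothetical values)
expr <- matrix(c(0, 5, 2, 0, 7, 1), nrow = 2,
               dimnames = list(c("geneA", "geneB"),
                               c("cell1", "cell2", "cell3")))
ann_toy <- data.frame(cell_type1 = c("zygote", "zygote", "2cell"),
                      row.names = c("cell1", "cell2", "cell3"))

# check_annotation is a hypothetical helper, not a scmap function
check_annotation <- function(expr, ann) {
  # the labels must live in a column called cell_type1
  stopifnot("cell_type1" %in% colnames(ann))
  # one annotation row per cell, in the same order as the matrix columns
  stopifnot(identical(rownames(ann), colnames(expr)))
  invisible(TRUE)
}

ok <- check_annotation(expr, ann_toy)
```

For the example data the equivalent call would be check_annotation(yan, ann).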

Now let’s create a SingleCellExperiment object of the yan dataset:

sce <- SingleCellExperiment(assays = list(normcounts = as.matrix(yan)), colData = ann)
# this is needed to calculate dropout rate for feature selection
# important: normcounts have the same zeros as raw counts (fpkm)
counts(sce) <- normcounts(sce)
logcounts(sce) <- log2(normcounts(sce) + 1)
# use gene names as feature symbols
rowData(sce)$feature_symbol <- rownames(sce)
isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))
# remove features with duplicated names
sce <- sce[!duplicated(rownames(sce)), ]
sce
## class: SingleCellExperiment 
## dim: 20214 90 
## metadata(0):
## assays(3): normcounts counts logcounts
## rownames(20214): C9orf152 RPS11 ... CTSC AQP7
## rowData names(1): feature_symbol
## colnames(90): Oocyte..1.RPKM. Oocyte..2.RPKM. ...
##   Late.blastocyst..3..Cell.7.RPKM. Late.blastocyst..3..Cell.8.RPKM.
## colData names(1): cell_type1
## reducedDimNames(0):
## spikeNames(1): ERCC
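Two of the steps above can be illustrated in base R on a toy matrix: the log2(x + 1) transform keeps zero entries at zero (which matters when zeros are used to compute dropout rates), and duplicated gene names are dropped by keeping the first occurrence. The matrix and its values are hypothetical.

```r
# Toy expression matrix with a duplicated gene name (hypothetical values)
toy <- matrix(c(0, 3, 0,
                1, 7, 0,
                2, 2, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("geneA", "geneB", "geneA"),
                              paste0("cell", 1:3)))

# log2(x + 1) maps 0 to 0, so the zero pattern is unchanged
toy_log <- log2(toy + 1)
zeros_preserved <- identical(toy == 0, toy_log == 0)

# keep only the first row for each gene name
toy_unique <- toy[!duplicated(rownames(toy)), , drop = FALSE]
```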

4 Run scmap

4.1 Feature Selection

Once we have a SingleCellExperiment object we can run scmap. Firstly, we need to select the most informative features from our input dataset:

sce <- getFeatures(sce, suppress_plot = FALSE)

Genes highlighted in red will be used in the further analysis (projection).
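To give an intuition for dropout-based feature selection, here is a rough base R sketch. It assumes that features are chosen as the genes whose dropout rate is higher than expected for their expression level, using a linear fit on the log scale; this is an assumption made for illustration, not the code of getFeatures. The function select_by_dropout, the simulated counts, and n_features = 20 are all hypothetical.

```r
set.seed(1)
n_genes <- 200
n_cells <- 50

# Simulated counts (hypothetical data, not the yan dataset)
mu <- rexp(n_genes, rate = 0.2)
toy_counts <- matrix(rpois(n_genes * n_cells, lambda = rep(mu, n_cells)),
                     nrow = n_genes,
                     dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# select_by_dropout is a hypothetical helper, not a scmap function
select_by_dropout <- function(counts, n_features = 20) {
  # percentage of cells in which each gene has zero counts
  dropout <- rowMeans(counts == 0) * 100
  # genes never or always detected carry no information for the fit
  keep <- dropout > 0 & dropout < 100
  log_dropout <- log2(dropout[keep])
  log_expr <- log2(rowMeans(counts[keep, , drop = FALSE]) + 1)
  # expected dropout given expression; large positive residuals are
  # genes with more zeros than their expression level suggests
  fit <- lm(log_dropout ~ log_expr)
  res <- setNames(residuals(fit), names(log_dropout))
  head(names(sort(res, decreasing = TRUE)), n_features)
}

selected <- select_by_dropout(toy_counts, n_features = 20)
```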

4.2 Projecting

We will project the yan dataset to itself:

sce <- projectData(projection = sce, reference = sce)

In your own analysis you can choose any two scRNA-seq datasets and project one onto the other. Note that the getFeatures function has to be run on the reference dataset before running the projectData function.
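The idea behind the projection can be sketched in base R. The sketch assumes that each cell is compared with every cell type centroid using three similarity measures (cosine, Pearson and Spearman), and that a label is assigned only when at least two measures agree and the best agreeing similarity reaches a threshold. The helper assign_cell, the 0.7 threshold and the toy centroids are assumptions for illustration, not the internals of projectData.

```r
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# assign_cell is a hypothetical helper, not the scmap implementation
assign_cell <- function(cell, centroids, threshold = 0.7) {
  # one row per similarity measure, one column per cell type
  sims <- rbind(
    cosine   = apply(centroids, 2, function(ctr) cosine(cell, ctr)),
    pearson  = apply(centroids, 2, function(ctr) cor(cell, ctr, method = "pearson")),
    spearman = apply(centroids, 2, function(ctr) cor(cell, ctr, method = "spearman"))
  )
  # each measure votes for its most similar cell type
  votes <- colnames(centroids)[apply(sims, 1, which.max)]
  top <- names(which.max(table(votes)))
  agree <- sum(votes == top) >= 2
  # highest similarity among the measures that voted for the winner
  best_sim <- max(sims[votes == top, top])
  if (agree && best_sim >= threshold) top else "unassigned"
}

# Toy centroids over four genes (hypothetical values)
centroids <- cbind(zygote = c(10, 0, 5, 1), blast = c(0, 8, 1, 6))
rownames(centroids) <- paste0("gene", 1:4)

label_a <- assign_cell(c(9, 1, 4, 0), centroids)  # close to the zygote centroid
label_c <- assign_cell(c(1, 2, 9, 3), centroids)  # not similar enough to either
```

Here label_a is "zygote" because all three measures agree with high similarity, while label_c is "unassigned" because the best agreeing similarity stays below the threshold.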

5 Results

Let’s look at the results. The labels produced by scmap are located in the scmap_labs column of the colData slot of the projection dataset. We will compare them to the original labels provided by the authors of the publication:

colData(sce)
## DataFrame with 90 rows and 3 columns
##                                  cell_type1  scmap_labs scmap_siml
##                                    <factor> <character>  <numeric>
## Oocyte..1.RPKM.                      zygote      zygote  0.9947609
## Oocyte..2.RPKM.                      zygote      zygote  0.9951257
## Oocyte..3.RPKM.                      zygote      zygote  0.9955916
## Zygote..1.RPKM.                      zygote       2cell  0.9934012
## Zygote..2.RPKM.                      zygote       2cell  0.9953694
## ...                                     ...         ...        ...
## Late.blastocyst..3..Cell.4.RPKM.      blast       blast  0.8321482
## Late.blastocyst..3..Cell.5.RPKM.      blast       blast  0.8400685
## Late.blastocyst..3..Cell.6.RPKM.      blast       blast  0.9235622
## Late.blastocyst..3..Cell.7.RPKM.      blast       blast  0.9377231
## Late.blastocyst..3..Cell.8.RPKM.      blast       blast  0.9174087

The projected labels agree with the original labels for most cells, although in the output above two zygote cells are assigned to the 2cell stage. With scmap one can also plot a Sankey diagram (however, a cell_type1 column has to be provided in the colData slot of both the reference and the projection datasets):

plot(getSankey(colData(sce)$cell_type1, colData(sce)$scmap_labs))
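A numeric summary can complement the diagram. The base R sketch below cross-tabulates original and projected labels and computes the fraction of matching cells; the two label vectors are hypothetical toy data.

```r
# Hypothetical original and projected labels for seven cells
original  <- c("zygote", "zygote", "zygote", "2cell", "2cell", "blast", "blast")
projected <- c("zygote", "2cell", "zygote", "2cell", "2cell", "blast", "unassigned")

# rows: original labels, columns: projected labels
confusion <- table(original = original, projected = projected)

# fraction of cells whose projected label equals the original one
agreement <- mean(original == projected)
```

For the object above the same summary would be table(colData(sce)$cell_type1, colData(sce)$scmap_labs).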

6 Creating a precomputed Reference

The cell type centroids can be precomputed by using the createReference method:

reference <- createReference(sce[rowData(sce)$scmap_features, ])

One can also visualise the cell type centroids, e.g.:

heatmap(as.matrix(reference))
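As a rough illustration of what a centroid is, the base R sketch below takes the per-gene median expression across the cells of each cell type. That the median is the summary statistic is an assumption made here for illustration; compute_centroids is a hypothetical helper, not the code of createReference, and the toy matrix is made up.

```r
# Toy expression matrix: 2 genes x 5 cells (hypothetical values)
toy_expr <- matrix(c(1, 2, 3, 10, 12,
                     5, 5, 6,  0,  1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB"), paste0("cell", 1:5)))
toy_labels <- c("zygote", "zygote", "zygote", "blast", "blast")

# compute_centroids is a hypothetical helper, not a scmap function
compute_centroids <- function(expr, labels) {
  # group column indices by label, then take the per-gene median in each group
  sapply(split(seq_len(ncol(expr)), labels), function(idx) {
    apply(expr[, idx, drop = FALSE], 1, median)
  })
}

# genes in rows, cell types in columns
toy_centroids <- compute_centroids(toy_expr, toy_labels)
```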

Exactly the same projection as above can be performed by providing the precomputed reference instead of the SingleCellExperiment object:

sce <- projectData(projection = sce, reference = reference)
colData(sce)
## DataFrame with 90 rows and 3 columns
##                                  cell_type1  scmap_labs scmap_siml
##                                    <factor> <character>  <numeric>
## Oocyte..1.RPKM.                      zygote      zygote  0.9947609
## Oocyte..2.RPKM.                      zygote      zygote  0.9951257
## Oocyte..3.RPKM.                      zygote      zygote  0.9955916
## Zygote..1.RPKM.                      zygote       2cell  0.9934012
## Zygote..2.RPKM.                      zygote       2cell  0.9953694
## ...                                     ...         ...        ...
## Late.blastocyst..3..Cell.4.RPKM.      blast       blast  0.8321482
## Late.blastocyst..3..Cell.5.RPKM.      blast       blast  0.8400685
## Late.blastocyst..3..Cell.6.RPKM.      blast       blast  0.9235622
## Late.blastocyst..3..Cell.7.RPKM.      blast       blast  0.9377231
## Late.blastocyst..3..Cell.8.RPKM.      blast       blast  0.9174087

7 sessionInfo()

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2               scmap_1.0.0               
##  [3] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.0
##  [5] DelayedArray_0.4.0         matrixStats_0.52.2        
##  [7] Biobase_2.38.0             GenomicRanges_1.30.0      
##  [9] GenomeInfoDb_1.14.0        IRanges_2.12.0            
## [11] S4Vectors_0.16.0           BiocGenerics_0.24.0       
## [13] googleVis_0.6.2            knitr_1.17                
## [15] BiocStyle_2.6.0           
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.2          lattice_0.20-35         colorspace_1.3-2       
##  [4] htmltools_0.3.6         yaml_2.1.14             rlang_0.1.2            
##  [7] e1071_1.6-8             glue_1.2.0              GenomeInfoDbData_0.99.1
## [10] bindr_0.1               plyr_1.8.4              stringr_1.2.0          
## [13] zlibbioc_1.24.0         munsell_0.4.3           gtable_0.2.0           
## [16] codetools_0.2-15        evaluate_0.10.1         labeling_0.3           
## [19] class_7.3-14            Rcpp_0.12.13            backports_1.1.1        
## [22] scales_0.5.0            jsonlite_1.5            XVector_0.18.0         
## [25] ggplot2_2.2.1           digest_0.6.12           stringi_1.1.5          
## [28] bookdown_0.5            dplyr_0.7.4             grid_3.4.2             
## [31] rprojroot_1.2           tools_3.4.2             bitops_1.0-6           
## [34] magrittr_1.5            RCurl_1.95-4.8          lazyeval_0.2.1         
## [37] proxy_0.4-19            tibble_1.3.4            randomForest_4.6-12    
## [40] pkgconfig_2.0.1         Matrix_1.2-11           assertthat_0.2.0       
## [43] rmarkdown_1.6           R6_2.2.2                compiler_3.4.2