1 Introduction

Woah, hi there. Didn’t except a visitor to this vignette. Most of the functionality in scran is now deprecated or defunct, having since been superceded by the scrapper package. We just keep this package around so that users trying to call the old functions are redirected to their replacements. And also, scran got me a well-paid gig that funds my retirement, so I feel an obligation to repay the favor to an loyal old draft horse.

# Show your appreciation for this package by installing it
# to bump up its download statistics.
install.packages("BiocManager")
BiocManager::install("scran")

Anyway, this vignette will go through some of the remaining non-deprecated functions in this package. In general, these functions are no longer part of routine single-cell analyses but we keep them around for historical purposes. We’ll demonstrate on a small single-cell pancreas dataset:

library(scran)
library(scRNAseq)
sce <- GrunPancreasData()
sce
## class: SingleCellExperiment 
## dim: 20064 1728 
## metadata(0):
## assays(1): counts
## rownames(20064): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17
##   ZZZ3__chr1
## rowData names(2): symbol chr
## colnames(1728): D2ex_1 D2ex_2 ... D17TGFB_95 D17TGFB_96
## colData names(2): donor sample
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(1): ERCC
# Quickly analyzing it to generate bits and pieces that we can
# reuse in the rest of this vignette.
library(scrapper)
results <- analyze.se(sce, num.threads=2)
sce.ready <- results$x

2 Quick clustering

When using the computeSumFactors() function from scuttle, it can be beneficial to initially compute normalization factors within clusters. This reduces our sensitivity to violations of the assumption of a non-DE majority of genes. The quickCluster() function generates a quick-and-dirty clustering that can be passed to clusters=:

library(scran)
clusters <- quickCluster(sce.ready)
sce.normed <- computeSumFactors(sce.ready, clusters=clusters)
summary(sizeFactors(sce.normed))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.002541  0.196259  0.753183  1.000000  1.384071 17.053692

If necessary, further subclustering can be performed conveniently using the quickSubCluster() wrapper function. This splits the input SingleCellExperiment into several smaller objects containing cells from each cluster and performs another round of clustering within that cluster, using a freshly identified set of HVGs to improve resolution for internal structure.

subout <- quickSubCluster(sce.ready, groups=clusters)
table(metadata(subout)$subcluster) # formatted as '<parent>.<subcluster>'
## 
## 1.1 1.2 1.3 1.4 1.5 1.6 2.1 2.2 2.3 2.4 2.5 3.1 3.2 3.3 3.4 4.1 4.2 4.3 4.4 4.5 
##  25  44  18  37  25  16  87  56  89  40  31  68 108  90  35  84  34  50  50  20 
## 5.1 5.2 5.3 5.4 5.5 5.6 5.7 6.1 6.2 6.3 6.4 6.5 6.6 7.1 7.2 7.3 7.4 7.5 
##  42  43  33  34  15  42  36  21  63  67  25  27  16  22  17  20  31  21

3 Automated PC choice

A common question is how many PCs to retain; more PCs will capture more biological signal at the cost of retaining more noise and requiring more computational work. One approach to choosing the number of PCs is to use the technical component estimates to determine the proportion of variance that should be retained. This is implemented in denoisePCANumber():

var.explained <- metadata(sce.ready)$PCA$variance.explained
stats <- rowData(sce.ready)
total.tech.var.in.hvgs <- sum(stats[stats$hvg,"fitted"])
total.var.in.hvgs <- sum(stats[stats$hvg,"variances"])
chosen <- denoisePCANumber(var.explained, total.tech.var.in.hvgs, total.var.in.hvgs)
chosen
## [1] 10

We could then use only the specified number of top PCs in our downstream analyses:

reducedDim(sce.ready, "PCA.denoised") <- reducedDim(sce.ready, "PCA")[,1:chosen]

4 Detecting correlated genes

Another useful procedure is to identify significant pairwise correlations between pairs of HVGs. The idea is to distinguish between HVGs caused by random stochasticity, and those that are driving systematic heterogeneity, e.g., between subpopulations. Correlations are computed by the correlatePairs method using a slightly modified version of Spearman’s rho, tested against the null hypothesis of zero correlation using the same method in cor.test().

# Using the first 200 HVs, which are the most interesting anyway.
of.interest <- order(rowData(sce.ready)$residuals, decreasing=TRUE)[1:200]
cor.pairs <- correlatePairs(sce.ready, subset.row=of.interest)
cor.pairs
## DataFrame with 19900 rows and 5 columns
##                 gene1            gene2          rho      p.value          FDR
##           <character>      <character>    <numeric>    <numeric>    <numeric>
## 1     KCNQ1OT1__chr11   UGDH-AS1__chr4     0.828302  0.00000e+00  0.00000e+00
## 2           CPE__chr4      SCG5__chr15     0.776521 2.95214e-319 2.93738e-315
## 3         CHGA__chr14      SCG5__chr15     0.764677 7.93539e-304 5.26381e-300
## 4         PRSS1__chr7      REG1A__chr2     0.749495 2.47271e-285 1.23017e-281
## 5           GCG__chr2       TTR__chr18     0.748607 2.71937e-284 1.08231e-280
## ...               ...              ...          ...          ...          ...
## 19896 C10orf10__chr10     PNLIP__chr10 -2.16940e-05     0.999312     0.999513
## 19897      CPB1__chr3 UCKL1-AS1__chr20  1.73259e-05     0.999451     0.999601
## 19898      CLPS__chr6     KRT19__chr17  1.28942e-05     0.999591     0.999692
## 19899     HSPB1__chr7       ORC4__chr2  6.49553e-06     0.999794     0.999844
## 19900       C3__chr19       FTL__chr19  3.55270e-06     0.999887     0.999887

As with variance estimation, if uninteresting substructure is present, this should be blocked on using the block= argument. This avoids strong correlations due to the blocking factor.

cor.pairs2 <- correlatePairs(sce.ready, subset.row=of.interest, block=sce.ready$donor)

The pairs can be used for choosing marker genes in experimental validation, and to construct gene-gene association networks. In other situations, the pairs may not be of direct interest - rather, we just want to know whether a gene is correlated with any other gene. This is often the case if we are to select a set of correlated HVGs for use in downstream steps like clustering or dimensionality reduction. To do so, we use correlateGenes() to compute a single set of statistics for each gene, rather than for each pair.

cor.genes <- correlateGenes(cor.pairs)
cor.genes
## DataFrame with 200 rows and 4 columns
##                gene       rho      p.value          FDR
##         <character> <numeric>    <numeric>    <numeric>
## 1   KCNQ1OT1__chr11  0.828302  0.00000e+00  0.00000e+00
## 2         CPE__chr4  0.776521 5.87476e-317 2.93738e-315
## 3       CHGA__chr14  0.764677 1.57914e-301 6.31657e-300
## 4       PRSS1__chr7  0.749495 4.92069e-283 1.40591e-281
## 5         GCG__chr2  0.748607 5.41155e-282 1.20257e-280
## ...             ...       ...          ...          ...
## 196    RAB12__chr18  0.276252  8.32805e-27  8.41217e-27
## 197    BCYRN1__chrX  0.248600  2.06287e-21  2.06287e-21
## 198    RPSAP9__chr9  0.284065  1.89998e-28  1.93876e-28
## 199  ZFP36L1__chr14  0.516747 2.02541e-106 3.58479e-106
## 200   ZNF793__chr19  0.334242  2.69517e-40  2.91369e-40

Significant correlations are defined at a false discovery rate (FDR) threshold of, e.g., 5%. Note that the p-values are calculated by permutation and will have a lower bound. If there were insufficient permutation iterations, a warning will be issued suggesting that more iterations be performed.

Session information

sessionInfo()
## R version 4.6.0 alpha (2026-04-05 r89794)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scrapper_1.5.16             scRNAseq_2.25.0            
##  [3] BiocParallel_1.45.0         scran_1.39.2               
##  [5] scuttle_1.21.3              SingleCellExperiment_1.33.2
##  [7] SummarizedExperiment_1.41.1 Biobase_2.71.0             
##  [9] GenomicRanges_1.63.2        Seqinfo_1.1.0              
## [11] IRanges_2.45.0              S4Vectors_0.49.1           
## [13] BiocGenerics_0.57.0         generics_0.1.4             
## [15] MatrixGenerics_1.23.0       matrixStats_1.5.0          
## [17] knitr_1.51                  BiocStyle_2.39.0           
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.3.0                bitops_1.0-9             httr2_1.2.2             
##   [4] rlang_1.2.0              magrittr_2.0.5           otel_0.2.0              
##   [7] gypsum_1.7.0             compiler_4.6.0           RSQLite_2.4.6           
##  [10] GenomicFeatures_1.63.2   png_0.1-9                vctrs_0.7.2             
##  [13] ProtGenerics_1.43.0      pkgconfig_2.0.3          crayon_1.5.3            
##  [16] fastmap_1.2.0            dbplyr_2.5.2             XVector_0.51.0          
##  [19] Rsamtools_2.27.1         rmarkdown_2.31           UCSC.utils_1.7.1        
##  [22] bit_4.6.0                xfun_0.57                bluster_1.21.1          
##  [25] cachem_1.1.0             beachmat_2.27.3          cigarillo_1.1.0         
##  [28] GenomeInfoDb_1.47.2      jsonlite_2.0.0           blob_1.3.0              
##  [31] rhdf5filters_1.23.3      DelayedArray_0.37.1      Rhdf5lib_1.33.6         
##  [34] irlba_2.3.7              parallel_4.6.0           cluster_2.1.8.2         
##  [37] R6_2.6.1                 bslib_0.10.0             limma_3.67.0            
##  [40] rtracklayer_1.71.3       jquerylib_0.1.4          Rcpp_1.1.1              
##  [43] bookdown_0.46            Matrix_1.7-5             igraph_2.2.3            
##  [46] tidyselect_1.2.1         abind_1.4-8              yaml_2.3.12             
##  [49] codetools_0.2-20         curl_7.0.0               alabaster.sce_1.11.0    
##  [52] lattice_0.22-9           tibble_3.3.1             KEGGREST_1.51.1         
##  [55] evaluate_1.0.5           BiocFileCache_3.1.0      alabaster.schemas_1.11.0
##  [58] ExperimentHub_3.1.0      Biostrings_2.79.5        pillar_1.11.1           
##  [61] BiocManager_1.30.27      filelock_1.0.3           RCurl_1.98-1.18         
##  [64] ensembldb_2.35.0         BiocVersion_3.23.1       alabaster.base_1.11.4   
##  [67] alabaster.ranges_1.11.0  glue_1.8.0               metapod_1.19.2          
##  [70] lazyeval_0.2.3           alabaster.matrix_1.11.0  tools_4.6.0             
##  [73] AnnotationHub_4.1.0      BiocIO_1.21.0            BiocNeighbors_2.5.4     
##  [76] ScaledMatrix_1.19.0      GenomicAlignments_1.47.0 locfit_1.5-9.12         
##  [79] XML_3.99-0.23            rhdf5_2.55.16            grid_4.6.0              
##  [82] AnnotationDbi_1.73.0     edgeR_4.9.4              HDF5Array_1.39.1        
##  [85] BiocSingular_1.27.1      restfulr_0.0.16          cli_3.6.5               
##  [88] rsvd_1.0.5               rappdirs_0.3.4           S4Arrays_1.11.1         
##  [91] dplyr_1.2.1              AnnotationFilter_1.35.0  alabaster.se_1.11.0     
##  [94] sass_0.4.10              digest_0.6.39            SparseArray_1.11.13     
##  [97] dqrng_0.4.1              rjson_0.2.23             memoise_2.0.1           
## [100] htmltools_0.5.9          lifecycle_1.0.5          h5mread_1.3.3           
## [103] httr_1.4.8               statmod_1.5.1            bit64_4.6.0-1