The SCArray package supports the manipulation of large-scale single-cell
RNA-seq data stored in Genomic Data Structure (GDS) files. It
combines dense and sparse matrices stored in GDS files with the Bioconductor
infrastructure (SingleCellExperiment and DelayedArray)
to provide out-of-memory data storage and manipulation in the R
programming language. As shown in Figure 1, SCArray provides a
SingleCellExperiment object for downstream data analyses.
GDS is an alternative to HDF5. Unlike HDF5, GDS supports the direct
storage of a sparse matrix without converting it to multiple
vectors.
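To illustrate what "converting to multiple vectors" means, here is a minimal sketch using the compressed sparse column format of the Matrix package, in which a sparse matrix is held as three separate vectors (row indices, column pointers and non-zero values). It only shows the in-memory R representation; the exact on-disk layout used by HDF5-based tools or by GDS is not shown here and is not implied by this example. The matrix values are illustrative.

```r
library(Matrix)

# A 4 x 8 sparse matrix with 8 non-zero entries (illustrative values)
m <- sparseMatrix(
    i = c(2, 4, 2, 4, 2, 3, 1, 3),   # 1-based row positions
    j = c(1, 1, 2, 3, 6, 6, 8, 8),   # 1-based column positions
    x = c(3, 4, 1, 3, 4, 3, 6, 4),   # non-zero values
    dims = c(4, 8))

# The compressed sparse column representation uses three vectors:
m@i   # zero-based row index of each non-zero value
m@p   # column pointers: where each column starts in m@i and m@x
m@x   # the non-zero values themselves
```

A format that stores the sparse matrix directly can expose it as a single matrix node, rather than as several index and value vectors that the reader has to reassemble.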
Figure 1: Workflow of SCArray
Requires R (>= v3.5.0), gdsfmt (>= v1.24.0)
Bioconductor repository
To install this package, start R and enter:
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SCArray")
The SCArray package can convert a single-cell experiment object
(SingleCellExperiment) to a GDS file using the function
scConvGDS(). For example:
suppressPackageStartupMessages(library(SCArray))
suppressPackageStartupMessages(library(SingleCellExperiment))
# load a SingleCellExperiment object
fn <- system.file("extdata", "LaMannoBrainSub.rds", package="SCArray")
sce <- readRDS(fn)
# convert to a GDS file
scConvGDS(sce, "test.gds")
## Output: test.gds
## Compression: LZMA_RA
## Dimension: 18219 x 25
## Assay List:
## counts |+ counts { SparseReal32 18219x25 LZMA_ra(13.1%), 75.6K }
## rowData:
## colData:
## CELL_ID
## Cell_type
## Done.
# list data structure in the GDS file
(f <- scOpen("test.gds")); scClose(f)
## Object of class "SCArrayFileClass"
## File: /private/tmp/RtmpP1bDRL/Rbuild105a47a6b93de/SCArray/vignettes/test.gds (137.9K)
## + [ ] *
## |--+ feature.id { Str8 18219 LZMA_ra(48.6%), 60.5K }
## |--+ sample.id { Str8 25 LZMA_ra(40.0%), 157B }
## |--+ counts { SparseReal32 18219x25 LZMA_ra(13.1%), 75.6K }
## |--+ feature.data [ ]
## |--+ sample.data [ ]
## | |--+ CELL_ID { Str8 25 LZMA_ra(40.0%), 157B }
## | \--+ Cell_type { Str8 25 LZMA_ra(64.3%), 133B }
## \--+ meta.data [ ]
The input of scConvGDS() can also be a dense or sparse matrix
of count data:
library(Matrix)
cnt <- matrix(0, nrow=4, ncol=8)
set.seed(100); cnt[sample.int(length(cnt), 8)] <- rpois(8, 4)
(cnt <- as(cnt, "dgCMatrix"))
## 4 x 8 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . . . . 6
## [2,] 3 1 . . . 4 . .
## [3,] . . . . . 3 . 4
## [4,] 4 . 3 . . . . .
# convert to a GDS file
scConvGDS(cnt, "test.gds")
## Output: test.gds
## Compression: LZMA_RA
## Dimension: 4 x 8
## Assay List:
## counts |+ counts { SparseReal32 4x8 LZMA_ra(159.4%), 109B }
## Done.
When a single-cell GDS file is available, users can call
scExperiment() to load a SingleCellExperiment object from
the GDS file. The assay data in the SingleCellExperiment object are
DelayedMatrix objects, so the full matrix is not loaded into memory.
# a GDS file in the SCArray package
(fn <- system.file("extdata", "LaMannoBrainData.gds", package="SCArray"))
## [1] "/private/tmp/RtmpP1bDRL/Rinst105a42cb9129c/SCArray/extdata/LaMannoBrainData.gds"
# load a SingleCellExperiment object from the file
sce <- scExperiment(fn)
sce
## class: SingleCellExperiment
## dim: 12000 243
## metadata(0):
## assays(1): counts
## rownames(12000): Rp1 Sox17 ... Efhd2 Fhad1
## rowData names(0):
## colnames(243): 1772072122_A04 1772072122_A05 ... 1772099011_H05
## 1772099012_E04
## colData names(2): CELL_ID Cell_type
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
# it is a DelayedMatrix (the whole matrix is not loaded)
assays(sce)$counts
## <12000 x 243> sparse matrix of class SC_GDSMatrix and type "double":
## 1772072122_A04 1772072122_A05 ... 1772099011_H05 1772099012_E04
## Rp1 0 0 . 0 0
## Sox17 0 0 . 0 0
## Mrpl15 1 2 . 2 2
## Lypla1 0 0 . 0 1
## Tcea1 1 0 . 6 1
## ... . . . . .
## Agmat 0 0 . 0 0
## Dnajc16 0 0 . 0 0
## Casp9 0 0 . 0 0
## Efhd2 0 0 . 1 1
## Fhad1 0 0 . 1 0
# column data
colData(sce)
## DataFrame with 243 rows and 2 columns
## CELL_ID Cell_type
## <character> <character>
## 1772072122_A04 1772072122_A04 DA-VTA4
## 1772072122_A05 1772072122_A05 DA-VTA2
## 1772072122_A06 1772072122_A06 DA-VTA3
## 1772072122_A07 1772072122_A07 DA-SNC
## 1772072122_A08 1772072122_A08 DA-VTA4
## ... ... ...
## 1772099011_D01 1772099011_D01 DA-VTA1
## 1772099011_F04 1772099011_F04 DA-VTA2
## 1772099011_G07 1772099011_G07 DA-VTA4
## 1772099011_H05 1772099011_H05 DA-SNC
## 1772099012_E04 1772099012_E04 DA-VTA4
# row data
rowData(sce)
## DataFrame with 12000 rows and 0 columns
SCArray provides a SingleCellExperiment object for
downstream data analyses. First, we create a log count matrix
logcnt from the count matrix. Note that logcnt
is also a DelayedMatrix, so the whole matrix is not actually
generated.
cnt <- assays(sce)$counts
logcnt <- log2(cnt + 1)
assays(sce)$logcounts <- logcnt
logcnt
## <12000 x 243> sparse matrix of class DelayedMatrix and type "double":
## 1772072122_A04 1772072122_A05 ... 1772099011_H05 1772099012_E04
## Rp1 0.000000 0.000000 . 0.000000 0.000000
## Sox17 0.000000 0.000000 . 0.000000 0.000000
## Mrpl15 1.000000 1.584963 . 1.584963 1.584963
## Lypla1 0.000000 0.000000 . 0.000000 1.000000
## Tcea1 1.000000 0.000000 . 2.807355 1.000000
## ... . . . . .
## Agmat 0 0 . 0 0
## Dnajc16 0 0 . 0 0
## Casp9 0 0 . 0 0
## Efhd2 0 0 . 1 1
## Fhad1 0 0 . 1 0
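The delayed behaviour described above can be sketched on a small in-memory matrix, without any GDS file. This is only an illustration with made-up values: wrapping an ordinary matrix with DelayedArray() and applying log2(x + 1) yields another DelayedMatrix that records the operation, and the values are computed only when the object is realized, for example with as.matrix().

```r
suppressPackageStartupMessages(library(DelayedArray))

# A tiny count matrix standing in for assays(sce)$counts (illustrative values)
m <- matrix(c(0, 1, 3, 7), nrow = 2)
dm <- DelayedArray(m)

# The transformation is recorded as a delayed operation, not evaluated yet
ldm <- log2(dm + 1)
is(ldm, "DelayedMatrix")

# Realize the delayed operations into an ordinary in-memory matrix
as.matrix(ldm)
```

Realizing is cheap for a toy matrix like this one; for a large GDS-backed matrix it would load everything into memory, which is what the block-wise functions in DelayedMatrixStats are meant to avoid.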
The DelayedMatrixStats package provides functions operating on rows and columns of DelayedMatrix objects. For example, we can calculate the mean for each column or row of the log count matrix.
suppressPackageStartupMessages(library(DelayedMatrixStats))
col_mean <- DelayedMatrixStats::colMeans2(logcnt)
str(col_mean)
## num [1:243] 0.261 0.138 0.238 0.259 0.143 ...
row_mean <- DelayedMatrixStats::rowMeans2(logcnt)
str(row_mean)
## num [1:12000] 0 0.00652 0.81827 0.47055 1.33912 ...
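As a quick sanity check of what colMeans2() and rowMeans2() compute, the following sketch compares them with the base R colMeans() and rowMeans() on a small in-memory matrix wrapped as a DelayedMatrix. The matrix is a made-up stand-in for logcnt, not data from the package.

```r
suppressPackageStartupMessages(library(DelayedArray))
suppressPackageStartupMessages(library(DelayedMatrixStats))

# A small dense matrix wrapped as a DelayedMatrix (illustrative values)
m <- matrix(as.numeric(1:12), nrow = 3)
dm <- DelayedArray(m)

# Column and row means computed on the DelayedMatrix
cm <- DelayedMatrixStats::colMeans2(dm)
rmn <- DelayedMatrixStats::rowMeans2(dm)

# They agree with the base R results on the ordinary matrix
all.equal(cm, colMeans(m))
all.equal(rmn, rowMeans(m))
```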
The scater package can perform the uniform manifold approximation and projection (UMAP) for the cell data, based on the data in a SingleCellExperiment object.
suppressPackageStartupMessages(library(scater))
# run umap analysis
sce <- runUMAP(sce)
plotReducedDim() plots the cell-level reduced dimension
results (UMAP) stored in the SingleCellExperiment object:
plotReducedDim(sce, dimred="UMAP")
# print version information about R, the OS and attached or loaded packages
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.0
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scater_1.26.0 ggplot2_3.3.6
## [3] scuttle_1.8.0 DelayedMatrixStats_1.19.0
## [5] SingleCellExperiment_1.20.0 SummarizedExperiment_1.28.0
## [7] Biobase_2.58.0 GenomicRanges_1.50.1
## [9] GenomeInfoDb_1.34.2 SCArray_1.6.0
## [11] DelayedArray_0.24.0 IRanges_2.32.0
## [13] S4Vectors_0.36.0 MatrixGenerics_1.10.0
## [15] matrixStats_0.62.0 BiocGenerics_0.44.0
## [17] Matrix_1.4-1 gdsfmt_1.34.0
##
## loaded via a namespace (and not attached):
## [1] viridis_0.6.2 sass_0.4.1 BiocSingular_1.14.0
## [4] viridisLite_0.4.0 jsonlite_1.8.0 bslib_0.3.1
## [7] assertthat_0.2.1 highr_0.9 GenomeInfoDbData_1.2.8
## [10] vipor_0.4.5 yaml_2.3.5 ggrepel_0.9.1
## [13] pillar_1.7.0 lattice_0.20-45 glue_1.6.2
## [16] beachmat_2.14.0 digest_0.6.29 XVector_0.38.0
## [19] colorspace_2.0-3 cowplot_1.1.1 htmltools_0.5.2
## [22] pkgconfig_2.0.3 zlibbioc_1.44.0 purrr_0.3.4
## [25] scales_1.2.0 RSpectra_0.16-1 ScaledMatrix_1.6.0
## [28] BiocParallel_1.32.1 tibble_3.1.7 farver_2.1.1
## [31] generics_0.1.3 ellipsis_0.3.2 withr_2.5.0
## [34] cli_3.3.0 magrittr_2.0.3 crayon_1.5.1
## [37] evaluate_0.15 fansi_1.0.3 FNN_1.1.3.1
## [40] beeswarm_0.4.0 tools_4.2.1 lifecycle_1.0.1
## [43] stringr_1.4.0 munsell_0.5.0 irlba_2.3.5
## [46] compiler_4.2.1 jquerylib_0.1.4 rsvd_1.0.5
## [49] rlang_1.0.4 grid_4.2.1 RCurl_1.98-1.7
## [52] BiocNeighbors_1.16.0 labeling_0.4.2 bitops_1.0-7
## [55] rmarkdown_2.14 gtable_0.3.0 codetools_0.2-18
## [58] DBI_1.1.3 R6_2.5.1 gridExtra_2.3
## [61] knitr_1.39 dplyr_1.0.9 uwot_0.1.11
## [64] fastmap_1.1.0 utf8_1.2.2 stringi_1.7.8
## [67] ggbeeswarm_0.6.0 parallel_4.2.1 Rcpp_1.0.9
## [70] vctrs_0.4.1 tidyselect_1.1.2 xfun_0.31
## [73] sparseMatrixStats_1.10.0