beachmat 1.4.0
beachmat has a few useful utilities outside of the C++ API. This document describes how to use them.
Given the dimensions of a matrix, users can choose HDF5 chunk dimensions that give fast performance for both row- and column-level access.
library(beachmat)
nrows <- 10000
ncols <- 200
getBestChunkDims(c(nrows, ncols))
## [1] 708 15
In the future, it should be possible to feed these chunk dimensions back into the C++ API.
Currently, if chunk dimensions are not specified in the C++ code, the API will retrieve them from R via the getHDF5DumpChunkDim() function from HDF5Array.
The aim is to also provide a setHDF5DumpChunkDim() function so that any chunk dimension specified in R will be respected.
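To see why the chunk shape matters for both row- and column-level access, the following base R sketch counts how many chunks must be read to extract a single row or a single column under different layouts. The helper chunksTouched() is hypothetical and not part of beachmat. It only counts chunks, and ignores chunk caching and the number of bytes per chunk, so it is a rough illustration rather than a performance model.

```r
# Hypothetical helper (not part of beachmat): number of chunks that must be
# read to extract one row or one column from a matrix with dimensions 'dims'
# that is stored on disk using chunk dimensions 'chunk'.
chunksTouched <- function(dims, chunk) {
    c(row = ceiling(dims[2] / chunk[2]),     # a row crosses every column of chunks
      column = ceiling(dims[1] / chunk[1]))  # a column crosses every row of chunks
}

dims <- c(10000, 200)

# Three layouts: the dimensions reported by getBestChunkDims() above, and
# two extreme layouts where each chunk holds exactly one row or one column.
layouts <- list(
    balanced = c(708, 15),
    by_row = c(1, 200),
    by_column = c(10000, 1)
)

counts <- t(sapply(layouts, chunksTouched, dims = dims))
counts
```

Under these assumptions, the 708 by 15 layout needs a similar number of chunk reads for a row (14) as for a column (15), whereas each extreme layout needs a single read along one margin and hundreds or thousands along the other.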
The most common access patterns for matrices (at least, for high-throughput biological data) are by row or by column.
The rechunkByMargins() function takes an HDF5-backed matrix and rewrites it to use purely row- or column-based chunks.
library(HDF5Array)
A <- as(matrix(runif(5000), nrow=100, ncol=50), "HDF5Array")
byrow <- rechunkByMargins(A, byrow=TRUE)
byrow
## <100 x 50> HDF5Matrix object of type "double":
## [,1] [,2] [,3] ... [,49] [,50]
## [1,] 0.1181245 0.8875337 0.3254774 . 0.70656007 0.39623515
## [2,] 0.3498117 0.2388030 0.5047783 . 0.62947795 0.36708736
## [3,] 0.6723204 0.9962194 0.6723537 . 0.78523727 0.91860941
## [4,] 0.8487139 0.6463400 0.8024627 . 0.55188436 0.53735966
## [5,] 0.7277063 0.2893102 0.8674756 . 0.77353529 0.01753654
## ... . . . . . .
## [96,] 0.68009318 0.41287443 0.57544743 . 0.9289865 0.2763149
## [97,] 0.13603556 0.17802172 0.20724004 . 0.8014221 0.7448020
## [98,] 0.49308121 0.92935112 0.42804175 . 0.4058182 0.1812442
## [99,] 0.53140746 0.25712945 0.32461960 . 0.7158630 0.5151891
## [100,] 0.54541600 0.04084118 0.86567668 . 0.8634237 0.7218797
bycol <- rechunkByMargins(A, byrow=FALSE)
bycol
## <100 x 50> HDF5Matrix object of type "double":
## [,1] [,2] [,3] ... [,49] [,50]
## [1,] 0.1181245 0.8875337 0.3254774 . 0.70656007 0.39623515
## [2,] 0.3498117 0.2388030 0.5047783 . 0.62947795 0.36708736
## [3,] 0.6723204 0.9962194 0.6723537 . 0.78523727 0.91860941
## [4,] 0.8487139 0.6463400 0.8024627 . 0.55188436 0.53735966
## [5,] 0.7277063 0.2893102 0.8674756 . 0.77353529 0.01753654
## ... . . . . . .
## [96,] 0.68009318 0.41287443 0.57544743 . 0.9289865 0.2763149
## [97,] 0.13603556 0.17802172 0.20724004 . 0.8014221 0.7448020
## [98,] 0.49308121 0.92935112 0.42804175 . 0.4058182 0.1812442
## [99,] 0.53140746 0.25712945 0.32461960 . 0.7158630 0.5151891
## [100,] 0.54541600 0.04084118 0.86567668 . 0.8634237 0.7218797
Rechunking can provide a substantial speed-up to downstream functions, especially those requiring access to random columns or rows.
Indeed, the time saved in those functions often offsets the time spent in constructing a new HDF5Matrix.
sessionInfo()
## R version 3.5.1 Patched (2018-07-24 r75008)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2012 R2 x64 (build 9600)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] HDF5Array_1.10.0 rhdf5_2.26.0 DelayedArray_0.8.0
## [4] BiocParallel_1.16.0 IRanges_2.16.0 S4Vectors_0.20.0
## [7] BiocGenerics_0.28.0 matrixStats_0.54.0 beachmat_1.4.0
## [10] knitr_1.20 BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.19 magrittr_1.5 stringr_1.3.1
## [4] tools_3.5.1 xfun_0.4 htmltools_0.3.6
## [7] yaml_2.2.0 rprojroot_1.3-2 digest_0.6.18
## [10] bookdown_0.7 Rhdf5lib_1.4.0 BiocManager_1.30.3
## [13] evaluate_0.12 rmarkdown_1.10 stringi_1.2.4
## [16] compiler_3.5.1 backports_1.1.2