SQLDataFrame 1.22.0
Firstly, I would like to extend my special thanks to Aaron Lun for his
foundational work on the ParquetDataFrame package, and for his
highly technical and detailed suggestions aimed at enhancing the
current implementation of the SQLDataFrame
package. Here
I’m introducing the new version of SQLDataFrame for handling
different SQL-backed files. Essentially, the implementation of
SQLDataFrame is modeled upon ParquetDataFrame regarding its data
structure, construction, documentation, and examples. This approach
ensures the retension of best practices and maintains consistentcy in
the use within Bioconductor ecosystem, thus simplifying the learning
curve for users.
The SQLDataFrame, as its name suggests, is a DataFrame where the
columns are derived from data in a SQL table. This is fully
file-backed so no data is actually loaded into memory until requested,
allowing users to represent large datasets in limited memory. As the
SQLDataFrame inherits from S4Vectors’ DataFrame, it
can be used anywhere in Bioconductor’s ecosystem that accepts a
DataFrame, e.g., as the column data of a SummarizedExperiment, or
inside a BumpyDataFrameMatrix from the BumpyMatrix
package.
SQLDataFrame currently supports the following backends with their
respective extension classes (and constructor functions):
SQLite: SQLiteDataFrameDuckDB: DuckDBDataFrameIt can be easily extended to any other SQL-backed file types by simply
defining the extension classs in SQL_extensions.R with minor updates
in acquireConn.R to create a database instance. Pull requests for
adding new SQL backends are welcome!
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SQLDataFrame")library(SQLDataFrame)Given a path, database type and table name to a SQL file, we can
construct a SQLDataFrame. If the backend is supported, we can
construct an extension class directly.
## Mocking up a file.
tf <- tempfile()
con <- DBI::dbConnect(RSQLite::SQLite(), tf)
DBI::dbWriteTable(con, "mtcars", mtcars)
DBI::dbDisconnect(con)
## Creating the SQLiteDataFrame.
library(SQLDataFrame)
df <- SQLDataFrame(tf, dbtype = "sqlite", table = "mtcars")
df0 <- SQLiteDataFrame(tf, table = "mtcars")
identical(df, df0)## [1] TRUESimilarly, we can create a DuckDbDataFrame:
tf1 <- tempfile()
on.exit(unlist(tf1))
con <- DBI::dbConnect(duckdb::duckdb(), tf1)
DBI::dbWriteTable(con, "mtcars", mtcars)
DBI::dbDisconnect(con)
df1 <- SQLDataFrame(tf1, dbtype = "duckdb", table = "mtcars")
df2 <- DuckDBDataFrame(tf1, table = "mtcars")
identical(df1, df2)## [1] TRUEThese support all the usual methods for a DataFrame, except that the
data is kept on file and referenced as needed:
nrow(df)## [1] 32colnames(df)##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"class(as.data.frame(df))## [1] "data.frame"We extract individual columns as SQLColumnVector objects. These
are 1-dimensional file-backed DelayedArrays that pull a single
column’s data from the SQL table on demand.
df$mpg## <32> SQLColumnVector object of type "double":
##  [1]  [2]  [3]    . [31] [32] 
## 21.0 21.0 22.8    . 15.0 21.4# These can participate in usual vector operations:
df$mpg * 10## <32> DelayedArray object of type "double":
##  [1]  [2]  [3]    . [31] [32] 
##  210  210  228    .  150  214log1p(df$mpg)## <32> DelayedArray object of type "double":
##      [1]      [2]      [3]        .     [31]     [32] 
## 3.091042 3.091042 3.169686        . 2.772589 3.109061# Realize this into an ordinary vector.
as.vector(df$mpg)##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4DFrameThe main goal of a SQLDataFrame is to serve as a consistent
representation of the data inside a SQL table. However, this cannot be
easily reconciled with many DataFrame operations that add or change
data - at least, not without mutating the SQL file, which is outside
the scope of the SQLDataFrame class. To handle such operations,
the SQLDataFrame will collapse to a DFrame of
SQLColumnVectors:
copy <- df
copy$some_random_thing <- runif(nrow(df))
class(copy)## [1] "DFrame"
## attr(,"package")
## [1] "S4Vectors"colnames(copy)##  [1] "mpg"               "cyl"               "disp"             
##  [4] "hp"                "drat"              "wt"               
##  [7] "qsec"              "vs"                "am"               
## [10] "gear"              "carb"              "some_random_thing"This preserves the memory efficiency of file-backed data while
supporting all DataFrame operations. For example, we can easily
subset and mutate the various columns, which manifest as delayed
operations inside each column.
copy$wt <- copy$wt * 1000
top.hits <- head(copy)
top.hits## DataFrame with 6 rows and 12 columns
##              mpg            cyl           disp             hp           drat
##   <DelayedArray> <DelayedArray> <DelayedArray> <DelayedArray> <DelayedArray>
## 1             21              6            160            110            3.9
## 2             21              6            160            110            3.9
## 3           22.8              4            108             93           3.85
## 4           21.4              6            258            110           3.08
## 5           18.7              8            360            175           3.15
## 6           18.1              6            225            105           2.76
##               wt           qsec             vs             am           gear
##   <DelayedArray> <DelayedArray> <DelayedArray> <DelayedArray> <DelayedArray>
## 1           2620          16.46              0              1              4
## 2           2875          17.02              0              1              4
## 3           2320          18.61              1              1              4
## 4           3215          19.44              1              0              3
## 5           3440          17.02              0              0              3
## 6           3460          20.22              1              0              3
##             carb some_random_thing
##   <DelayedArray>         <numeric>
## 1              4         0.0302001
## 2              4         0.5836876
## 3              1         0.2798251
## 4              1         0.5166436
## 5              2         0.3784230
## 6              1         0.2492389The fallback to DFrame ensures that a SQLDataFrame is
interoperable with other Bioconductor data structures that need to
perform arbitrary DataFrame operations. Of course, when a collapse
occurs, we lose all guarantees that the in-memory representation is
compatible with the underlying SQL table. This may preclude further
optimizations in cases where we consider directly operating on the
file.
At any point, users can retrieve a handle of connection to the
underlying SQL file via the acquireConn() function. This can be used
with methods in the DBI, RSQLite, and duckdb
packages. The cached DBIConnection for any given path can be
deleted by calling releaseConn.
handle <- acquireConn(path(df), dbtype = dbtype(df))
handle## <SQLiteConnection>
##   Path: /tmp/RtmpY60A22/file29150c35ecc21
##   Extensions: TRUEreleaseConn(path(df))Note that the acquired handle will not capture any delayed
subsetting/mutation operations that have been applied in the R
session. In theory, it is possible to convert a subset of r Biocpkg("DelayedArray") operations into their DBI
equivalents, which would improve performance by avoiding the R
interpreter when executing a query on the file.
In practice, any performance boost tends to be rather fragile as only a subset of operations are supported, meaning that it is easy to silently fall back to R-based evaluation when an unsupported operation is executed. Users wanting to optimize query performance should just operate on the handle directly.
sessionInfo()## R version 4.5.0 RC (2025-04-04 r88126)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SQLDataFrame_1.22.0   DelayedArray_0.34.0   SparseArray_1.8.0    
##  [4] S4Arrays_1.8.0        abind_1.4-8           IRanges_2.42.0       
##  [7] S4Vectors_0.46.0      MatrixGenerics_1.20.0 matrixStats_1.5.0    
## [10] BiocGenerics_0.54.0   generics_0.1.3        Matrix_1.7-3         
## [13] BiocStyle_2.36.0     
## 
## loaded via a namespace (and not attached):
##  [1] bit_4.6.0           jsonlite_2.0.0      compiler_4.5.0     
##  [4] BiocManager_1.30.25 crayon_1.5.3        blob_1.2.4         
##  [7] jquerylib_0.1.4     yaml_2.3.10         fastmap_1.2.0      
## [10] lattice_0.22-7      R6_2.6.1            XVector_0.48.0     
## [13] knitr_1.50          bookdown_0.43       DBI_1.2.3          
## [16] pillar_1.10.2       bslib_0.9.0         rlang_1.1.6        
## [19] cachem_1.1.0        xfun_0.52           sass_0.4.10        
## [22] bit64_4.6.0-1       memoise_2.0.1       RSQLite_2.3.9      
## [25] cli_3.6.4           digest_0.6.37       grid_4.5.0         
## [28] lifecycle_1.0.4     vctrs_0.6.5         glue_1.8.0         
## [31] evaluate_1.0.3      duckdb_1.2.1        rmarkdown_2.29     
## [34] pkgconfig_2.0.3     tools_4.5.0         htmltools_0.5.8.1