SQLDataFrame 1.4.2
last edit: 10/7/2019
SQL database are very commonly used in the storage of very large
genomic data resources. Many useful tools, such as DBI, dbplyr
have provided convenient interfaces for R users to check and
manipulate the data. These tools represent the SQL tables in tidy
formats and support lazy and quick aggregation operations (e.g,
*_join, union, etc.) for tables from same resources. Cross
database aggregation is also supported when opted (using copy=TRUE)
but become very expensive due to the internal copying process of a
whole table into the other connection. Use of advanced functions often
involves specialized SQL knowledge which brings challenges for common
R users. The interoperability of existing bioinformatics tools are
suboptimal, e.g., the SummarizedExperiment container for
representation of sequencing or genotyping experiments that many
modern bioinformatics pipelines are based.
The SQLDataFrame package was developed using familiar DataFrame-like
paradigm and lazily represents the very large dataset from different
SQL databases, such as SQLite and MySQL. The DataFrame-like interface
provides familiarity for common R users in easy data manipulations
such as square bracket subsetting, rbinding, etc. For modern R
users, it also recognizes the tidy data analysis and dplyr
grammar by supporting %>%, select, filter, mutate, etc. More
importantly, database type-specific strategies were implemented in
SQLDataFrame to efficiently handle the cross-database operations
without incurring any internally expensive processes (especially for
database with write permission). Some previously difficult data
operations are made quick and easy in R, such as cross-database ID
matching and conversion, variant annotation extraction, etc. The
scalability and interoperability of SQLDataFrame are expected to
significantly promote the handling of very large genomic data
resources and facilitating the overall bioinformatics analysis.
Currently SQLDataFrame supports the DBI backend of SQLite, MySQL and
Google BigQuery, which are most commonly used SQL-based databases. In
the future or upon feature request, we would implement this package so
that users could choose to use different database backend for
SQLDataFrame representation.
Here is a list of commonly used backends (bolded are already supported!):
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SQLDataFrame")The development version is also available to download from Github.
BiocManager::install("Liubuntu/SQLDataFrame")library(SQLDataFrame)
library(DBI)SQLDataFrame classSQLDataFrame constructorThere are two ways to construct a SQLDataFrame object:
Provide an argument of conn, dbtable and dbkey. conn is a
valid DBIConnection from SQLite or MySQL; dbtable specifies the
database table name that is going to be represented as SQLDataFrame
object. If only one table is available in the specified database name,
this argument could be left blank. The dbkey argument is used to
specify the column name in the table which could uniquely identify all
the data observations (rows).
Provide dbtable, dbkey as specified above, and credentials to
build valid DBIConnections. for SQLite, the credential argument
includes dbname. For MySQL, the credential arguments are host,
user, password. Additional to the credentials, users must provide
the type argument to specify the SQL database type. Supported types
are “SQLite” and “MySQL”. If not specified, “SQLite” is used by
default. Supported database tables could be on-disk or remote on the
web or cloud.
dbfile <- system.file("extdata/test.db", package = "SQLDataFrame")
conn <- DBI::dbConnect(DBI::dbDriver("SQLite"), dbname = dbfile)
obj <- SQLDataFrame(conn = conn, dbtable = "state",
                    dbkey = "state")construction from database credentials:
obj1 <- SQLDataFrame(dbname = dbfile, type = "SQLite",
                     dbtable = "state", dbkey = "state")
all.equal(obj, obj1)
#> [1] TRUENote that after reading the database table into SQLDataFrame, the
key columns will be kept as fixed columns showing on the left hand
side, with | separating key column(s) with the other columns. The
ncol, colnames, and corresponding column subsetting will only
apply to the non-key-columns.
obj
#> SQLDataFrame with 50 rows and 4 columns
#>          state |           division        region population        size
#>    <character> |        <character>   <character>  <numeric> <character>
#>        Alabama | East South Central         South       3615      medium
#>         Alaska |            Pacific          West        365       small
#>        Arizona |           Mountain          West       2280      medium
#>       Arkansas | West South Central         South       2110      medium
#>     California |            Pacific          West      21198       large
#>            ... .                ...           ...        ...         ...
#>       Virginia |     South Atlantic         South       4981      medium
#>     Washington |            Pacific          West       3559      medium
#>  West Virginia |     South Atlantic         South       1799      medium
#>      Wisconsin | East North Central North Central       4589      medium
#>        Wyoming |           Mountain          West        376       small
dim(obj)
#> [1] 50  4
colnames(obj)
#> [1] "division"   "region"     "population" "size"To make the SQLDataFrame object as light and compact as possible,
there are only 5 slots contained in the object: tblData, dbkey,
dbnrows, dbconcatKey, indexes. Metadata information could be
returned through these 5 slots using slot accessors or other utility
functions.
slotNames(obj)
#> [1] "dbkey"       "dbnrows"     "tblData"     "indexes"     "dbconcatKey"
dbtable(obj)
#> [1] "state"
dbkey(obj)
#> [1] "state"
connSQLDataFrame(obj)
#> <SQLiteConnection>
#>   Path: /tmp/RtmpAefSws/Rinst21403f1cd904/SQLDataFrame/extdata/test.db
#>   Extensions: TRUEBesides, many useful common methods are defined on SQLDataFrame
object to make it a more DataFrame-like data structure. e.g., we can
use dimnames() to return the row/colnames of the data. It returns an
unnamed list, with the first element being rownames which is always
NULL, and 2nd element being colnames (could also use colnames()
method). dim() method is defined to return the dimension of the
database table, which enables the nrow()/ncol() to extract a
specific dimension. length() method is also defined which works same
as ncol().
Note that the rownames(SQLDataFrame) would always be NULL as
rownames are not supported in SQLDataFrame. However, ROWNAMES(obj)
was implemented for the [ subsetting with characters.
dim(obj)
#> [1] 50  4
dimnames(obj)
#> [[1]]
#> NULL
#> 
#> [[2]]
#> [1] "division"   "region"     "population" "size"
length(obj)
#> [1] 4
ROWNAMES(obj)
#>  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
#>  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
#>  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
#> [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
#> [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
#> [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
#> [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
#> [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
#> [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
#> [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
#> [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
#> [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
#> [49] "Wisconsin"      "Wyoming"NOTE that the dbtable() accessor only works for a SQLDataFrame
object that the lazy tbl carried in tblData slot corresponds to a
single database. If the SQLDataFrame was generated from rbind,
union or *_join, call saveSQLDataFrame() to save the lazy tbl to
disk so that dbtable() will be activated.
dbtable(obj)
#> [1] "state"
aa <- rbind(obj[1:5, ], obj[6:10, ])
aa
#> SQLDataFrame with 10 rows and 4 columns
#>        state |           division      region population        size
#>  <character> |        <character> <character>  <numeric> <character>
#>      Alabama | East South Central       South       3615      medium
#>       Alaska |            Pacific        West        365       small
#>      Arizona |           Mountain        West       2280      medium
#>     Arkansas | West South Central       South       2110      medium
#>   California |            Pacific        West      21198       large
#>     Colorado |           Mountain        West       2541      medium
#>  Connecticut |        New England   Northeast       3100      medium
#>     Delaware |     South Atlantic       South        579       small
#>      Florida |     South Atlantic       South       8277       large
#>      Georgia |     South Atlantic       South       4931      medium
dbtable(aa)  ## message
#> Warning in dbtable(aa): ## not available for SQLDataFrame with lazy queries of 'union', 'join', or 'rbind'. 
#> ## call 'saveSQLDataFrame()' to save the data as database table and call 'dbtable()' again!
bb <- saveSQLDataFrame(aa, dbname = tempfile(fileext=".db"),
                       dbtable = "aa")
#> ## A new database table is saved! 
#> ## Source: table<aa> [10 X 4] 
#> ## Database: sqlite 3.33.0 [/tmp/Rtmpk1wczg/file27011c4f8c33.db] 
#> ## Use the following command to reload into R: 
#> ## sdf <- SQLDataFrame(
#> ##   dbname = '/tmp/Rtmpk1wczg/file27011c4f8c33.db',
#> ##   type = 'SQLite',
#> ##   dbtable = 'aa',
#> ##   dbkey = 'state')
connSQLDataFrame(bb)
#> <SQLiteConnection>
#>   Path: /tmp/Rtmpk1wczg/file27011c4f8c33.db
#>   Extensions: TRUE
dbtable(bb)
#> [1] "aa"We could also construct a SQLDataFrame object directly from a file
name. The makeSQLDataFrame function takes input of character value
of file name for common text files (.csv, .txt, etc.), write into
database tables, and open as SQLDataFrame object. Users could
provide values for the dbname and dbtable argument. If NULL,
default value for dbname would be a temporary database file, and
dbtable would be the basename(filename) without extension.
NOTE that the input file must have one or multiple columns that
could uniquely identify each observation (row) to be used the
dbkey() for SQLDataFrame. Also the file must be rectangular, i.e.,
rownames are not accepted. But users could save rownames as a separate
column.
mtc <- tibble::rownames_to_column(mtcars)[,1:6]
filename <- file.path(tempdir(), "mtc.csv")
write.csv(mtc, file= filename, row.names = FALSE)
aa <- makeSQLDataFrame(filename, dbkey = "rowname", sep = ",",
                       overwrite = TRUE)
#> ## A new database table is saved! 
#> ## Source: table<mtc> [32 X 5] 
#> ## Database: sqlite 3.33.0 [/tmp/Rtmpk1wczg/file27012638d467.db] 
#> ## Use the following command to reload into R: 
#> ## sdf <- SQLDataFrame(
#> ##   dbname = '/tmp/Rtmpk1wczg/file27012638d467.db',
#> ##   type = 'SQLite',
#> ##   dbtable = 'mtc',
#> ##   dbkey = 'rowname')
#> 
aa
#> SQLDataFrame with 32 rows and 5 columns
#>              rowname |       mpg       cyl      disp        hp      drat
#>          <character> | <numeric> <integer> <numeric> <integer> <numeric>
#>          "Mazda RX4" |      21.0         6       160       110      3.90
#>      "Mazda RX4 Wag" |      21.0         6       160       110      3.90
#>         "Datsun 710" |      22.8         4       108        93      3.85
#>     "Hornet 4 Drive" |      21.4         6       258       110      3.08
#>  "Hornet Sportabout" |      18.7         8       360       175      3.15
#>                  ... .       ...       ...       ...       ...       ...
#>       "Lotus Europa" |      30.4         4      95.1       113      3.77
#>     "Ford Pantera L" |      15.8         8     351.0       264      4.22
#>       "Ferrari Dino" |      19.7         6     145.0       175      3.62
#>      "Maserati Bora" |      15.0         8     301.0       335      3.54
#>         "Volvo 142E" |      21.4         4     121.0       109      4.11
connSQLDataFrame(aa)
#> <SQLiteConnection>
#>   Path: /tmp/Rtmpk1wczg/file27012638d467.db
#>   Extensions: TRUE
dbtable(aa)
#> [1] "mtc"With all the methods ([ subsetting, rbind, *_join, etc.,)
provided in the next section, the SQLDataFrame always work like a
lazy representation until users explicitly call the saveSQLDataFrame
function for realization. saveSQLDataFrame write the lazy tbl
carried in tblData slot into an on-disk database table, and re-open
the SQLDataFrame object from the new path.
It’s also recommended that users call saveSQLDataFrame frequently to
avoid too many lazy layers which slows down the data processing.
connSQLDataFrame(obj)
#> <SQLiteConnection>
#>   Path: /tmp/RtmpAefSws/Rinst21403f1cd904/SQLDataFrame/extdata/test.db
#>   Extensions: TRUE
dbtable(obj)
#> [1] "state"
obj1 <- saveSQLDataFrame(obj, dbname = tempfile(fileext = ".db"),
                        dbtable = "obj_copy")
#> ## A new database table is saved! 
#> ## Source: table<obj_copy> [50 X 4] 
#> ## Database: sqlite 3.33.0 [/tmp/Rtmpk1wczg/file270111ba3025.db] 
#> ## Use the following command to reload into R: 
#> ## sdf <- SQLDataFrame(
#> ##   dbname = '/tmp/Rtmpk1wczg/file270111ba3025.db',
#> ##   type = 'SQLite',
#> ##   dbtable = 'obj_copy',
#> ##   dbkey = 'state')
connSQLDataFrame(obj1)
#> <SQLiteConnection>
#>   Path: /tmp/Rtmpk1wczg/file270111ba3025.db
#>   Extensions: TRUE
dbtable(obj1)
#> [1] "obj_copy"[[ subsetting[[,SQLDataFrame Behaves similarly to [[,DataFrame and returns a
realized vector of values from a single column. $,SQLDataFrame is
also defined to conveniently extract column values.
head(obj[[1]])
#> [1] "East South Central" "Pacific"            "Mountain"          
#> [4] "West South Central" "Pacific"            "Mountain"
head(obj[["region"]])
#> [1] "South" "West"  "West"  "South" "West"  "West"
head(obj$size)
#> [1] "medium" "small"  "medium" "medium" "large"  "medium"We can also get the key column values using character extraction.
head(obj[["state"]])
#> [1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
#> [6] "Colorado"[ subsettingSQLDataFrame instances can be subsetted in a similar way of
DataFrame following the usual R conventions, with numeric,
character or logical vectors; logical vectors are recycled to the
appropriate length.
NOTE, use drop=FALSE explicitly for single column subsetting if
you want to return a SQLDataFrame object, otherwise, the default
drop=TRUE would always return a realized value for that column.
obj[1:3, 1:2]
#> SQLDataFrame with 3 rows and 2 columns
#>        state |           division      region
#>  <character> |        <character> <character>
#>      Alabama | East South Central       South
#>       Alaska |            Pacific        West
#>      Arizona |           Mountain        West
obj[c(TRUE, FALSE), c(TRUE, FALSE), drop=FALSE]
#> SQLDataFrame with 25 rows and 2 columns
#>         state |           division population
#>   <character> |        <character>  <numeric>
#>       Alabama | East South Central       3615
#>       Arizona |           Mountain       2280
#>    California |            Pacific      21198
#>   Connecticut |        New England       3100
#>       Florida |     South Atlantic       8277
#>           ... .                ...        ...
#>  South Dakota | West North Central        746
#>         Texas | West South Central      12237
#>       Vermont |        New England        472
#>    Washington |            Pacific       3559
#>     Wisconsin | East North Central       4589
obj[1:3, "population", drop=FALSE]
#> SQLDataFrame with 3 rows and 1 column
#>        state | population
#>  <character> |  <numeric>
#>      Alabama |       3615
#>       Alaska |        365
#>      Arizona |       2280
obj[, "population"]  ## realized column value
#>  [1]  3615   365  2280  2110 21198  2541  3100   579  8277  4931   868   813
#> [13] 11197  5313  2861  2280  9111  3806  1058  4122  5814  9111  3921  2341
#> [25]  4767   746  1544   590   812  7333  1799 18076  5441   637 10735  2715
#> [37]  2284 11860   931  2816   746  4173 12237  1203   472  4981  3559  1799
#> [49]  4589   376Subsetting with character vector works for the SQLDataFrame
objects. With composite keys, users need to concatenate the key values
by : for row subsetting (See the vignette for internal
implementation for more details).
rnms <- ROWNAMES(obj)
obj[c("Alabama", "Colorado"), ]
#> SQLDataFrame with 2 rows and 4 columns
#>        state |           division      region population        size
#>  <character> |        <character> <character>  <numeric> <character>
#>      Alabama | East South Central       South       3615      medium
#>     Colorado |           Mountain        West       2541      mediumobj1 <- SQLDataFrame(conn = conn, dbtable = "state",
                     dbkey = c("region", "population"))
rnms <- ROWNAMES(obj1)
obj1[c("South:3615.0", "West:365.0"), ]
#> SQLDataFrame with 2 rows and 3 columns
#>       region population |           division       state        size
#>  <character>  <numeric> |        <character> <character> <character>
#>        South       3615 | East South Central     Alabama      medium
#>         West        365 |            Pacific      Alaska       smallList style subsetting is also allowed to extract certain columns from
the SQLDataFrame object which returns SQLDataFrame by default.
obj[1]
#> SQLDataFrame with 50 rows and 1 column
#>          state |           division
#>    <character> |        <character>
#>        Alabama | East South Central
#>         Alaska |            Pacific
#>        Arizona |           Mountain
#>       Arkansas | West South Central
#>     California |            Pacific
#>            ... .                ...
#>       Virginia |     South Atlantic
#>     Washington |            Pacific
#>  West Virginia |     South Atlantic
#>      Wisconsin | East North Central
#>        Wyoming |           Mountain
obj["region"]
#> SQLDataFrame with 50 rows and 1 column
#>          state |        region
#>    <character> |   <character>
#>        Alabama |         South
#>         Alaska |          West
#>        Arizona |          West
#>       Arkansas |         South
#>     California |          West
#>            ... .           ...
#>       Virginia |         South
#>     Washington |          West
#>  West Virginia |         South
#>      Wisconsin | North Central
#>        Wyoming |          WestWe have also enabled the S3 methods of filter and mutate from
dplyr package, so that users could have the convenience in filtering
data observations and adding new columns.
obj1 %>% filter(division == "South Atlantic" & size == "medium")
#> SQLDataFrame with 5 rows and 3 columns
#>       region population |       division          state        size
#>  <character>  <numeric> |    <character>    <character> <character>
#>        South       4931 | South Atlantic        Georgia      medium
#>        South       4122 | South Atlantic       Maryland      medium
#>        South       2816 | South Atlantic South Carolina      medium
#>        South       4981 | South Atlantic       Virginia      medium
#>        South       1799 | South Atlantic  West Virginia      mediumobj1 %>% mutate(p1 = population/10, s1 = size)
#> SQLDataFrame with 50 rows and 5 columns
#>         region population |           division         state        size
#>    <character>  <numeric> |        <character>   <character> <character>
#>          South       3615 | East South Central       Alabama      medium
#>           West        365 |            Pacific        Alaska       small
#>           West       2280 |           Mountain       Arizona      medium
#>          South       2110 | West South Central      Arkansas      medium
#>           West      21198 |            Pacific    California       large
#>            ...        ... .                ...           ...         ...
#>          South       4981 |     South Atlantic      Virginia      medium
#>           West       3559 |            Pacific    Washington      medium
#>          South       1799 |     South Atlantic West Virginia      medium
#>  North Central       4589 | East North Central     Wisconsin      medium
#>           West        376 |           Mountain       Wyoming       small
#>         p1          s1
#>  <numeric> <character>
#>      361.5      medium
#>       36.5       small
#>      228.0      medium
#>      211.0      medium
#>     2119.8       large
#>        ...         ...
#>      498.1      medium
#>      355.9      medium
#>      179.9      medium
#>      458.9      medium
#>       37.6       smallTo be consistent with DataFrame, union and rbind methods were
implemented for SQLDataFrame, where union returns the
SQLDataFrame sorted by the dbkey(obj), and rbind keeps the
original orders of input objects.
dbfile1 <- system.file("extdata/test.db", package = "SQLDataFrame")
con1 <- DBI::dbConnect(dbDriver("SQLite"), dbname = dbfile1)
dbfile2 <- system.file("extdata/test1.db", package = "SQLDataFrame")
con2 <- DBI::dbConnect(dbDriver("SQLite"), dbname = dbfile2)
ss1 <- SQLDataFrame(conn = con1, dbtable = "state",
                    dbkey = c("state"))
ss2 <- SQLDataFrame(conn = con2, dbtable = "state1",
                    dbkey = c("state"))
ss11 <- ss1[sample(5), ]
ss21 <- ss2[sample(10, 5), ]obj1 <- union(ss11, ss21) 
obj1  ## reordered by the "dbkey()"obj2 <- rbind(ss11, ss21) 
obj2  ## keeping the original order by updating the row index
#> SQLDataFrame with 10 rows and 4 columns
#>        state |           division      region population        size
#>  <character> |        <character> <character>  <numeric> <character>
#>       Alaska |            Pacific        West        365       small
#>   California |            Pacific        West      21198       large
#>      Alabama | East South Central       South       3615      medium
#>     Arkansas | West South Central       South       2110      medium
#>      Arizona |           Mountain        West       2280      medium
#>     Arkansas | West South Central       South       2110      medium
#>   California |            Pacific        West      21198       large
#>     Colorado |           Mountain        West       2541      medium
#>      Georgia |     South Atlantic       South       4931      medium
#>     Delaware |     South Atlantic       South        579       smallThe *_join family methods was implemented for SQLDataFrame
objects, including the left_join, inner_join, semi_join and
anti_join, which provides the capability of merging database files
from different sources.
ss12 <- ss1[1:10, 1:2]
ss22 <- ss2[6:15, 3:4]
left_join(ss12, ss22)
#> Joining, by = "state"
#> SQLDataFrame with 10 rows and 4 columns
#>        state |           division      region population        size
#>  <character> |        <character> <character>  <numeric> <character>
#>      Alabama | East South Central       South       <NA>        <NA>
#>       Alaska |            Pacific        West       <NA>        <NA>
#>      Arizona |           Mountain        West       <NA>        <NA>
#>     Arkansas | West South Central       South       <NA>        <NA>
#>   California |            Pacific        West       <NA>        <NA>
#>     Colorado |           Mountain        West       2541      medium
#>  Connecticut |        New England   Northeast       3100      medium
#>     Delaware |     South Atlantic       South        579       small
#>      Florida |     South Atlantic       South       8277       large
#>      Georgia |     South Atlantic       South       4931      medium
inner_join(ss12, ss22)
#> Joining, by = "state"
#> SQLDataFrame with 5 rows and 4 columns
#>        state |       division      region population        size
#>  <character> |    <character> <character>  <numeric> <character>
#>     Colorado |       Mountain        West       2541      medium
#>  Connecticut |    New England   Northeast       3100      medium
#>     Delaware | South Atlantic       South        579       small
#>      Florida | South Atlantic       South       8277       large
#>      Georgia | South Atlantic       South       4931      medium
semi_join(ss12, ss22)
#> Joining, by = "state"
#> SQLDataFrame with 5 rows and 2 columns
#>        state |       division      region
#>  <character> |    <character> <character>
#>     Colorado |       Mountain        West
#>  Connecticut |    New England   Northeast
#>     Delaware | South Atlantic       South
#>      Florida | South Atlantic       South
#>      Georgia | South Atlantic       South
anti_join(ss12, ss22)
#> Joining, by = "state"
#> SQLDataFrame with 5 rows and 2 columns
#>        state |           division      region
#>  <character> |        <character> <character>
#>      Alabama | East South Central       South
#>       Alaska |            Pacific        West
#>      Arizona |           Mountain        West
#>     Arkansas | West South Central       South
#>   California |            Pacific        WestSQLDataFrame now supports the MySQL database tables through RMySQL,
for local MySQL servers, or remote ones on the web or cloud. The
SQLDataFrame construction, *_join functions, union, rbind, and
saving are all supported. Aggregation operations are supported for
same or cross MySQL databases. Details please see the function
documentations.
Here I’ll show a simple use case for MySQL tables from ensembl.
library(RMySQL)
ensbConn <- dbConnect(dbDriver("MySQL"),
                        host="genome-mysql.soe.ucsc.edu",
                        user = "genome",
                        dbname = "xenTro9")
enssdf <- SQLDataFrame(conn = ensbConn,
                       dbtable = "xenoRefGene",
                       dbkey = c("name", "txStart"))
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 0 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 4 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 5 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 6 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 7 imported as
#> numeric
#> Warning in .local(conn, statement, ...): Unsigned INTEGER in col 8 imported as
#> numeric
enssdf1 <- enssdf[1:20, 1:2]
enssdf2 <- enssdf[11:30,3:4]
res <- left_join(enssdf1, enssdf2)
#> Joining, by = c("name", "txStart")SQLDataFrame has just added support for Google BigQuery
tables. Construction and queries using [ and filter are supported!
“Authentication and authorization” will be needed when using bigrquery. Check here for more details.
Also note that, the support of BigQuery tables has implemented
specialized strategy for efficient data representation. The dbkey()
is assigned by default as SurrogateKey, and dbkey argument will be
ignored during construction.
library(bigrquery)
bigrquery::bq_auth()  ## use this to authorize bigrquery in the
                      ## browser.
bqConn <- DBI::dbConnect(dbDriver("bigquery"),
                      project = "bigquery-public-data",
                      dataset = "human_variant_annotation",
                      billing = "") ## if not previous provided
                                    ## authorization, must specify a
                                    ## project name that was already
                                    ## linked with Google Cloud with
                                    ## billing info.
sdf <- SQLDataFrame(conn = bqConn, dbtable = "ncbi_clinvar_hg38_20180701")
sdf[1:5, 1:5]
sdf %>% select(GENEINFO)
sdf %>% filter(GENEINFO == "PYGL:5836")
sdf %>% filter(reference_name == "21")sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] RMySQL_0.10.20      DBI_1.1.0           SQLDataFrame_1.4.2 
#> [4] S4Vectors_0.28.0    BiocGenerics_0.36.0 dbplyr_2.0.0       
#> [7] dplyr_1.0.2         BiocStyle_2.18.1   
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.5          pillar_1.4.7        compiler_4.0.3     
#>  [4] BiocManager_1.30.10 tools_4.0.3         digest_0.6.27      
#>  [7] bit_4.0.4           evaluate_0.14       RSQLite_2.2.1      
#> [10] memoise_1.1.0       lifecycle_0.2.0     tibble_3.0.4       
#> [13] pkgconfig_2.0.3     rlang_0.4.9         cli_2.2.0          
#> [16] yaml_2.2.1          xfun_0.19           withr_2.3.0        
#> [19] stringr_1.4.0       knitr_1.30          generics_0.1.0     
#> [22] vctrs_0.3.5         bit64_4.0.5         tidyselect_1.1.0   
#> [25] glue_1.4.2          R6_2.5.0            fansi_0.4.1        
#> [28] rmarkdown_2.5       bookdown_0.21       purrr_0.3.4        
#> [31] blob_1.2.1          magrittr_2.0.1      ellipsis_0.3.1     
#> [34] htmltools_0.5.0     assertthat_0.2.1    utf8_1.1.4         
#> [37] stringi_1.5.3       lazyeval_0.2.2      crayon_1.3.4