write_hdf5 {spatialHeatmap} | R Documentation |
This is a convenience function for constructing the database backend of the Shiny app (shiny_shm). The data to store in the database should be of class "data.frame" or "SummarizedExperiment" and should be formatted according to the conventions in the "data" argument of spatial_hm. After formatting, all data sets should be arranged in a list, and each data slot should have a unique name such as "expr_arab", "expr_chicken", etc.
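As a rough illustration of the list layout, the following base-R sketch builds a toy data frame and places it in a named list. The values and the slot name "expr_toy" are made up; the "sample__condition" column naming scheme is the one described for data frames in the examples of this page.

```r
# Toy expression data as a data.frame (values are made up for illustration).
# Columns follow the "sample__condition" scheme: sample and condition joined by "__".
expr.toy <- data.frame(
  heart__day10 = c(5.1, 0.2),
  heart__day12 = c(6.3, 0.1),
  brain__day10 = c(1.0, 7.4),
  row.names = c("gene1", "gene2")
)

# Arrange the data in a named list; every slot name must be unique.
dat.toy <- list(expr_toy = expr.toy)
stopifnot(!anyDuplicated(names(dat.toy)))

# Split the column names back into their sample and condition parts.
parts <- strsplit(colnames(expr.toy), "__", fixed = TRUE)
samples <- vapply(parts, `[`, character(1), 1)
conditions <- vapply(parts, `[`, character(1), 2)
data.frame(sample = samples, condition = conditions)
```

Splitting the column names this way is only a quick check that the naming scheme was applied consistently; it is not part of the package interface.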
In addition, a pairing data frame describing the matching relationship between the data and aSVG files must also be included in the list under the exclusive slot name "df_pair". This data frame should contain at least three columns: name, data, aSVG. The name column contains a concise description of each data-aSVG pair, and entries in this column will be listed under "Step 1: data sets" in the Shiny app. The data column contains the slot names of all data in the list ("expr_arab", "expr_chicken", etc.), and the aSVG column contains the aSVG file names corresponding to each data set, such as "gallus_gallus.svg". If one data set is related to multiple aSVG files (e.g. multiple development stages), these aSVGs should be concatenated by comma, space, or semicolon, e.g. "arabidopsis.thaliana_organ_shm1.svg;arabidopsis.thaliana_organ_shm2.svg". Additional columns providing metadata of the data and aSVGs are optional and left to the user.
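The structure of the pairing table can be sketched in base R as follows. The check for required columns and the splitting of concatenated aSVG names are illustrative assumptions, not necessarily how the package parses the table internally.

```r
# Pairing table with the three required columns (entries taken from this page).
df.pair <- data.frame(
  name = c("chicken", "growth"),
  data = c("expr_chicken", "random_data_multiple_aSVGs"),
  aSVG = c("gallus_gallus.svg",
           "arabidopsis.thaliana_organ_shm1.svg;arabidopsis.thaliana_organ_shm2.svg"),
  stringsAsFactors = FALSE
)

# Check that the required columns are present.
required <- c("name", "data", "aSVG")
missing.cols <- setdiff(required, colnames(df.pair))
if (length(missing.cols) > 0) {
  stop("Missing columns: ", paste(missing.cols, collapse = ", "))
}

# Split concatenated aSVG names on comma, semicolon, or space.
svg.lis <- strsplit(df.pair$aSVG, "[,; ]+")
names(svg.lis) <- df.pair$data
svg.lis
```

A check like this, run before building the database, can catch a misspelled column name or an unexpected separator early.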
After calling this function, all the data in the list, including "df_pair", are saved into independent HDF5 databases, and all the HDF5 databases are finally compressed into the file "data_shm.tar". Accordingly, all the corresponding aSVG files listed in "df_pair" should be compressed into another "tar" file such as "aSVGs.tar". If the directory path containing the aSVG files is assigned to svg.dir, all the SVG files in the directory are compressed into "aSVGs.tar" automatically. The two tar files compose the database of the Shiny app and should be placed in the "example" folder of the app or uploaded through the user interface.
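If the aSVG tar file is built by hand rather than through svg.dir, something along the following lines could be used. This is a minimal sketch with placeholder files: the file names "demo1.svg", "demo2.svg", and "aSVGs_demo.tar" are hypothetical, and the flat archive layout is an assumption rather than a documented requirement of the app.

```r
# Temporary directory holding two placeholder SVG files (hypothetical names).
svg.dir <- file.path(tempdir(), "svg_demo")
dir.create(svg.dir, showWarnings = FALSE)
for (f in c("demo1.svg", "demo2.svg")) {
  writeLines("<svg xmlns='http://www.w3.org/2000/svg'></svg>", file.path(svg.dir, f))
}

# Archive the SVG files using relative paths so the tar file has a flat layout.
tar.file <- file.path(tempdir(), "aSVGs_demo.tar")
old.wd <- setwd(svg.dir)
utils::tar(tar.file, files = list.files(pattern = "\\.svg$"), tar = "internal")
setwd(old.wd)

# List the archive contents to confirm which SVG files were included.
svg.in.tar <- utils::untar(tar.file, list = TRUE, tar = "internal")
svg.in.tar
```

Listing the archive contents is a convenient way to verify that every aSVG named in "df_pair" is actually present before the two tar files are handed to the app.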
write_hdf5( dat.lis, dir = "./data_shm", replace = FALSE, chunkdim = NULL, level = NULL, verbose = FALSE, svg.dir = NULL )
dat.lis |
A list of data of class "data.frame" or "SummarizedExperiment", where each data set should have a unique slot name such as "expr_arab", "expr_chicken", etc. In addition to the data, a pairing data frame describing the pairing between the data and aSVG files must be included under the exclusive slot name "df_pair". This data frame has three required columns: the "name" column contains concise names of the data-aSVG pairs, the "data" column contains all slot names of the data ("expr_arab", "expr_chicken", etc.), and the "aSVG" column contains the aSVG file names corresponding to each data set. If one data set is related to multiple aSVG files (e.g. multiple development stages), these aSVGs should be concatenated by comma, space, or semicolon, e.g. |
dir |
The directory path to save the "data_shm.tar" file. Default is |
replace |
If data with the same slot names in |
chunkdim |
The dimensions of the chunks and the compression level to use for writing the assay data to disk. Passed to the internal calls to |
level |
The dimensions of the chunks and the compression level to use for writing the assay data to disk. Passed to the internal calls to |
verbose |
Set to In the case of |
svg.dir |
The directory path of aSVG files listed in "df_pair". If provided, all SVG files in the directory are compressed in "aSVGs.tar" and saved in |
A file named "data_shm.tar" is saved in dir. If svg.dir is assigned a valid value, all relevant SVG files are compressed in "aSVGs.tar" in dir.
Jianhai Zhang jzhan067@ucr.edu; zhang.jianhai@hotmail.com
Dr. Thomas Girke thomas.girke@ucr.edu
SummarizedExperiment: SummarizedExperiment container. R package version 1.10.1
R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
Hervé Pagès (2020). HDF5Array: HDF5 backend for DelayedArray objects. R package version 1.16.1.
Mustroph, Angelika, M Eugenia Zanetti, Charles J H Jang, Hans E Holtan, Peter P Repetti, David W Galbraith, Thomas Girke, and Julia Bailey-Serres. 2009. “Profiling Translatomes of Discrete Cell Populations Resolves Altered Cellular Priorities During Hypoxia in Arabidopsis.” Proc Natl Acad Sci U S A 106 (44): 18843–8
Davis, Sean, and Paul Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 14: 1846–7
Gautier, Laurent, Leslie Cope, Benjamin M. Bolstad, and Rafael A. Irizarry. 2004. “Affy—analysis of Affymetrix GeneChip Data at the Probe Level.” Bioinformatics 20 (3). Oxford, UK: Oxford University Press: 307–15. doi:10.1093/bioinformatics/btg405
Keays, Maria. 2019. ExpressionAtlas: Download Datasets from EMBL-EBI Expression Atlas
Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21. http://www.nature.com/nmeth/journal/v12/n2/full/nmeth.3252.html
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. "Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2." Genome Biology 15 (12): 550. doi:10.1186/s13059-014-0550-8
McCarthy, Davis J., Yunshun Chen, and Gordon K. Smyth. 2012. "Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation." Nucleic Acids Research 40 (10): 4288–97
Cardoso-Moreira, Margarida, Jean Halbert, Delphine Valloton, Britta Velten, Chunyan Chen, Yi Shao, Angélica Liechti, et al. 2019. “Gene Expression Across Mammalian Organ Development.” Nature 571 (7766): 505–9
## The examples below demonstrate 1) how to dump an Expression Atlas data set into the Shiny database;
## 2) how to dump a GEO data set into the Shiny database; 3) how to include aSVGs of multiple
## development stages; 4) how to read the database; 5) how to create a customized Shiny app with
## the database.

library(spatialHeatmap)
library(SummarizedExperiment) # Provides colData(), rowData(), and assay().

# 1. Dump data from Expression Atlas into "data_shm.tar" using the ExpressionAtlas package (Keays 2019).

# The chicken data derived from an RNA-seq analysis of the development of 7 chicken organs at 9
# time points (Cardoso-Moreira et al. 2019) is chosen as an example.
# The following searches the Expression Atlas for expression data from "heart" and "gallus".
library(ExpressionAtlas)
cache.pa <- '~/.cache/shm' # The path of cache.
all.chk <- read_cache(cache.pa, 'all.chk') # Retrieve data from cache.
if (is.null(all.chk)) { # Save downloaded data to cache if it is not cached.
  all.chk <- searchAtlasExperiments(properties="heart", species="gallus")
  save_cache(dir=cache.pa, overwrite=TRUE, all.chk)
}
all.chk[3, ]

rse.chk <- read_cache(cache.pa, 'rse.chk') # Read data from cache.
if (is.null(rse.chk)) { # Save downloaded data to cache if it is not cached.
  rse.chk <- getAtlasData('E-MTAB-6769')[[1]][[1]]
  save_cache(dir=cache.pa, overwrite=TRUE, rse.chk)
}

# The downloaded data is stored in "SummarizedExperiment" by default (SE, M. Morgan et al. 2018).
# The experiment design is described in the "colData" slot. The following returns the first three rows.
colData(rse.chk)[1:3, ]

# In the "colData" slot, it is required to define the "sample" and "condition" columns respectively.
# Both "sample" and "condition" are general terms. The former refers to entities where the numeric
# data are measured such as cell organelles, tissues, organs, etc., while the latter denotes
# experimental treatments such as drug dosages, gender, strains, time series, pH values, etc. In the
# downloaded data, the two columns are not explicitly defined, so "organism_part" and "age" are
# selected and renamed as "sample" and "condition" respectively.
colnames(colData(rse.chk))[c(6, 8)] <- c('condition', 'sample'); colnames(colData(rse.chk))

# The raw RNA-Seq counts are preprocessed with the following steps: (1) normalization,
# (2) aggregation of replicates, and (3) filtering of reliable expression data. The details of
# these steps are explained in the package vignette.
# browseVignettes('spatialHeatmap')

se.nor.chk <- norm_data(data=rse.chk, norm.fun='ESF', log2.trans=TRUE) # Normalization
se.aggr.chk <- aggr_rep(data=se.nor.chk, sam.factor='sample', con.factor='condition',
  aggr='mean') # Replicate aggregation using mean
# Genes are filtered out if they do not meet these criteria: expression values are at least 5 in
# at least 1% of all samples, and the coefficient of variation is between 0.6 and 100.
se.fil.chk <- filter_data(data=se.aggr.chk, sam.factor='sample', con.factor='condition',
  pOA=c(0.01, 5), CV=c(0.6, 100), dir=NULL)

# The aSVG file corresponding to the data is pre-packaged and copied to a temporary directory.
dir.svg <- paste0(tempdir(check=TRUE), '/svg_shm') # Temporary directory.
if (!dir.exists(dir.svg)) dir.create(dir.svg)
# Path of the aSVG file.
svg.chk <- system.file("extdata/shinyApp/example", 'gallus_gallus.svg',
  package="spatialHeatmap")
file.copy(svg.chk, dir.svg, overwrite=TRUE) # Copy the aSVG file.

# 2. Dump data from GEO into "data_shm.tar" using the GEOquery package (S. Davis and Meltzer 2007).

# The Arabidopsis thaliana (Arabidopsis) data from a microarray assay of hypoxia treatment on
# Arabidopsis root and shoot cell types (Mustroph et al. 2009) is selected as an example.
# The data set is downloaded with the accession number "GSE14502". It is stored in an ExpressionSet
# container (W. Huber et al. 2015) by default, and then converted to a SummarizedExperiment object.
library(GEOquery)
gset <- read_cache(cache.pa, 'gset') # Retrieve data from cache.
if (is.null(gset)) { # Save downloaded data to cache if it is not cached.
  gset <- getGEO("GSE14502", GSEMatrix=TRUE, getGPL=TRUE)[[1]]
  save_cache(dir=cache.pa, overwrite=TRUE, gset)
}
se.sh <- as(gset, "SummarizedExperiment") # Convert to SummarizedExperiment.

# The gene symbol identifiers are extracted from the rowData component to be used as row names.
rownames(se.sh) <- make.names(rowData(se.sh)[, 'Gene.Symbol'])
# A slice of the experimental design in the colData slot is shown. Both the samples and conditions
# are contained in the "title" column. The samples are indicated by promoters: pGL2 (root
# atrichoblast epidermis), pCO2 (root cortex meristematic zone), pSCR (root endodermis),
# pWOL (root vasculature), etc., and the conditions are control and hypoxia.
colData(se.sh)[60:63, 1:4]

# Since the samples and conditions need to be listed in two independent columns, like the
# chicken data above, a targets file is recommended to separate samples and conditions. The main
# reason to choose this Arabidopsis data is to illustrate the usage of a targets file when necessary.
# A pre-packaged targets file is accessed and partially shown below.
sh.tar <- system.file('extdata/shinyApp/example/target_arab.txt', package='spatialHeatmap')
target.sh <- read_fr(sh.tar); target.sh[60:63, ]
# Load the custom targets file into the colData slot.
colData(se.sh) <- DataFrame(target.sh)

# This data set was already normalized with the RMA algorithm (Gautier et al. 2004). Thus, the
# pre-processing steps are restricted to aggregation of replicates and filtering of reliably
# expressed genes.
# Replicate aggregation using mean
se.aggr.sh <- aggr_rep(data=se.sh, sam.factor='sample', con.factor='condition', aggr='mean')
# Filtering of genes with low intensities and variance
se.fil.arab <- filter_data(data=se.aggr.sh, sam.factor='sample', con.factor='condition',
  pOA=c(0.03, 6), CV=c(0.30, 100), dir=NULL)

# Similarly, the aSVG file corresponding to this data is pre-packaged and copied to the same
# temporary directory.
svg.arab <- system.file("extdata/shinyApp/example", 'arabidopsis.thaliana_organ_shm.svg',
  package="spatialHeatmap")
file.copy(svg.arab, dir.svg, overwrite=TRUE)

# 3. The random data and aSVG files of two development stages of Arabidopsis organs.

# The gene expression data is randomly generated and pre-packaged.
pa.growth <- system.file("extdata/shinyApp/example", 'random_data_multiple_aSVGs.txt',
  package="spatialHeatmap")
dat.growth <- read_fr(pa.growth); dat.growth[1:3, ]
# Paths of the two corresponding aSVG files.
svg.arab1 <- system.file("extdata/shinyApp/example", 'arabidopsis.thaliana_organ_shm1.svg',
  package="spatialHeatmap")
svg.arab2 <- system.file("extdata/shinyApp/example", 'arabidopsis.thaliana_organ_shm2.svg',
  package="spatialHeatmap")
# Copy the two aSVG files to the same temporary directory.
file.copy(c(svg.arab1, svg.arab2), dir.svg, overwrite=TRUE)

# Make the pairing table, which describes the matching between the data and aSVG files.
df.pair <- data.frame(name=c('chicken', 'arab', 'growth'),
  data=c('expr_chicken', 'expr_arab', 'random_data_multiple_aSVGs'),
  aSVG=c('gallus_gallus.svg', 'arabidopsis.thaliana_organ_shm.svg',
  'arabidopsis.thaliana_organ_shm1.svg;arabidopsis.thaliana_organ_shm2.svg'))
# Note that multiple aSVGs should be concatenated by comma, semicolon, or single space.
df.pair

# Organize the data and pairing table in a list, and create the database.
dat.lis <- list(df_pair=df.pair, expr_chicken=se.fil.chk, expr_arab=se.fil.arab,
  random_data_multiple_aSVGs=dat.growth)
# Create the database in a temporary directory "db_shm".
dir.db <- paste0(tempdir(check=TRUE), '/db_shm') # Temporary directory.
if (!dir.exists(dir.db)) dir.create(dir.db)
write_hdf5(dat.lis=dat.lis, dir=dir.db, svg.dir=dir.svg, replace=TRUE)

# 4. Read data and/or the pairing table from "data_shm.tar".
dat.lis1 <- read_hdf5(paste0(dir.db, '/data_shm.tar'), names(dat.lis))

# 5. Create a customized Shiny app with the database.
if (!dir.exists('~/test_shiny')) dir.create('~/test_shiny')
lis.tar <- list(data=paste0(dir.db, '/data_shm.tar'), svg=paste0(dir.db, '/aSVGs.tar'))
custom_shiny(lis.tar, app.dir='~/test_shiny')
# Run the app.
shiny::runApp('~/test_shiny/shinyApp')

# Besides "SummarizedExperiment", the database also accepts data in the form of "data.frame". In
# that case, the columns should follow the naming scheme "sample__condition", i.e. a sample and a
# condition are concatenated by a double underscore. The details are given in the "data" argument
# of the function "spatial_hm".
# The following takes the Arabidopsis data as an example.
df.arab <- assay(se.fil.arab); df.arab[1:3, 1:3]
# The new data list.
dat.lis2 <- list(df_pair=df.pair, expr_chicken=se.fil.chk, expr_arab=df.arab,
  random_data_multiple_aSVGs=dat.growth)

# If a data set does not have a corresponding aSVG or vice versa, the slot of the missing data or
# aSVG in the pairing table should be filled with "none". In that case, on the Shiny user
# interface, users will be prompted to select an aSVG for the unpaired data or to select a data
# set for the unpaired aSVG.
# For example, if the aSVG "arabidopsis.thaliana_organ_shm.svg" has no matching data, the
# pairing table should be made like below.
df.pair1 <- data.frame(name=c('chicken', 'arab', 'growth'),
  data=c('expr_chicken', 'none', 'random_data_multiple_aSVGs'),
  aSVG=c('gallus_gallus.svg', 'arabidopsis.thaliana_organ_shm.svg',
  'arabidopsis.thaliana_organ_shm1.svg;arabidopsis.thaliana_organ_shm2.svg'))
df.pair1

# The new data list, which uses the pairing table "df.pair1" defined above.
dat.lis3 <- list(df_pair=df.pair1, expr_chicken=se.fil.chk, none='none',
  random_data_multiple_aSVGs=dat.growth)
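The filtering criteria passed to filter_data in the examples above (pOA and CV) can be approximated in base R as shown below. This is only a sketch of the stated criteria, "at least A in at least a proportion p of samples" and "coefficient of variation within a range", and not the package's implementation; the helper name filter_sketch and the toy values are hypothetical, and the use of ">=" for the threshold is an assumption.

```r
# Toy matrix: 3 genes x 4 samples (values are made up for illustration).
expr <- rbind(
  geneA = c(0, 0, 12, 30),    # variable and reaches the threshold
  geneB = c(1, 1, 1, 1),      # never reaches 5 and shows no variation
  geneC = c(10, 10, 10, 10)   # above the threshold but constant (CV = 0)
)

# Hypothetical helper mirroring pOA = c(p, A) and CV = c(min, max).
filter_sketch <- function(mat, pOA = c(0.01, 5), CV = c(0.6, 100)) {
  prop.over <- rowMeans(mat >= pOA[2])     # proportion of samples with value >= A
  cv <- apply(mat, 1, sd) / rowMeans(mat)  # coefficient of variation per gene
  keep <- prop.over >= pOA[1] & cv >= CV[1] & cv <= CV[2]
  keep <- !is.na(keep) & keep              # treat undefined CV as not passing
  mat[keep, , drop = FALSE]
}

expr.fil <- filter_sketch(expr)
rownames(expr.fil)
```

In this toy case only geneA is retained, since geneB fails both criteria and geneC fails the CV criterion. Trying a few thresholds on a small matrix like this can help when choosing pOA and CV values for a real data set.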