norm_data {spatialHeatmap} | R Documentation |
Normalize Sequencing Count Matrix
Description
This function normalizes sequencing count data. It accepts a count matrix and optional sample metadata in the form of a SummarizedExperiment or data.frame. In either class, the columns and rows of the count matrix should be samples/conditions and genes respectively.
Usage
norm_data(
data,
norm.fun = "CNF",
parameter.list = NULL,
log2.trans = TRUE,
data.trans
)
Arguments
data |
An object of class data.frame or SummarizedExperiment. In either case, the columns and rows should be samples/conditions and assayed items (e.g. genes, proteins, metabolites) respectively.

If data.frame, the column names should follow the naming scheme "sample__condition". The "sample" is a general term and stands for cells, tissues, organs, etc., where the values are measured. The "condition" is also a general term and refers to experimental treatments applied to the "sample", such as drug dosage, temperature, time points, etc. If certain samples are not expected to be colored in "spatial heatmaps" (see spatial_hm), they are not required to follow this naming scheme. In the downstream interactive network (see network), if users want to see node annotation by mousing over a node, a column of row item annotation can optionally be appended as the last column.

In the case of SummarizedExperiment, the assays slot stores the data matrix. Similarly, the rowData slot can optionally store a data frame of row item annotation, which is only relevant to the interactive network. The colData slot usually contains a data frame with one column of sample replicates and one column of condition replicates. Replicate names of the same sample or condition must be identical. E.g. if sampleA has 3 replicates, "sampleA", "sampleA", "sampleA" is expected, while "sampleA1", "sampleA2", "sampleA3" are regarded as 3 different samples. If the original column names in the assays slot already follow the "sample__condition" scheme, then the colData slot is not required at all.

In the function spatial_hm, this argument can also be a numeric vector. In this vector, every value should be named, and values expected to color the "spatial heatmaps" should follow the naming scheme "sample__condition".

In certain cases, there is no condition associated with the data. Then in the naming scheme of the data frame or vector, the "__condition" part can be omitted. In SummarizedExperiment, the "condition" column can be omitted from the colData slot. Note that, regardless of data class, the double underscore is a special string reserved for specific purposes in "spatialHeatmap", and thus should be avoided when naming features, samples, and conditions.

In the case of spatial-temporal data, there are three factors: samples, conditions, and time points. The naming scheme is slightly different and includes three options: 1) combine samples and conditions to make the composite factor "sampleCondition", then concatenate the new factor and times with a double underscore in between, i.e. "sampleCondition__time"; 2) combine samples and times to make the composite factor "sampleTime", then concatenate the new factor and conditions with a double underscore in between, i.e. "sampleTime__condition"; or 3) combine all three factors to make the composite factor "sampleTimeCondition" without a double underscore. See the vignette for more details by running browseVignettes('spatialHeatmap') in R.
|
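The "sample__condition" naming scheme described above can be illustrated with a short base R sketch. The sample and condition labels below are hypothetical and chosen only for illustration; the code shows how such column names might be built and split, and is not taken from the package itself.

```r
# Hypothetical sample and condition labels, for illustration only.
samples <- c("brain", "brain", "heart", "heart")
conditions <- c("day10", "day12", "day10", "day12")

# Build column names following the "sample__condition" scheme.
col.names <- paste(samples, conditions, sep = "__")
col.names

# Recover the two factors by splitting on the reserved double underscore.
parts <- strsplit(col.names, "__", fixed = TRUE)
sample.part <- vapply(parts, `[`, character(1), 1)
condition.part <- vapply(parts, `[`, character(1), 2)

# Replicates of the same sample/condition pair share one identical name.
rep.names <- c("brain__day10", "brain__day10", "brain__day10")
length(unique(rep.names))
```

Because the double underscore is the only separator, a label that itself contains "__" would be split in the wrong place, which is why the string is reserved.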
norm.fun |
One of the normalizing functions: "CNF", "ESF", "VST", "rlog", "none". Specifically, "CNF" stands for calcNormFactors from edgeR (McCarthy et al. 2012), and "ESF", "VST", and "rlog" are equivalent to estimateSizeFactors, varianceStabilizingTransformation, and rlog from DESeq2 respectively (Love, Huber, and Anders 2014). If "none", no normalization is applied. The default is "CNF", and the output data is processed by cpm (Counts Per Million). The parameters of each normalization function are provided through parameter.list.
|
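As a rough illustration of what "Counts Per Million" means for the default "CNF" output, the base R sketch below computes CPM on a hypothetical count matrix. It assumes all normalization factors equal 1, whereas calcNormFactors would estimate sample-specific factors; it is a simplified sketch, not the code that norm_data or edgeR runs.

```r
# Hypothetical raw counts: 3 genes (rows) x 2 samples (columns).
counts <- matrix(c(10, 30, 60,
                   200, 300, 500),
                 nrow = 3,
                 dimnames = list(c("gene1", "gene2", "gene3"),
                                 c("brain__day10", "heart__day10")))

# Library size per sample is the column sum.
lib.size <- colSums(counts)

# Assumption: normalization factors are all 1. calcNormFactors would
# estimate sample-specific factors that rescale the library sizes.
norm.factors <- c(1, 1)
eff.lib.size <- lib.size * norm.factors

# Counts per million: divide each column by its effective library size.
cpm.mat <- sweep(counts, 2, eff.lib.size, "/") * 1e6
colSums(cpm.mat)
```

With factors of 1, every column of cpm.mat sums to one million, so samples sequenced at different depths become directly comparable.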
parameter.list |
A list of parameters for the normalizing function assigned in norm.fun. The default is NULL, in which case list(method='TMM'), list(type='ratio'), list(fitType='parametric', blind=TRUE), and list(fitType='parametric', blind=TRUE) are set internally for "CNF", "ESF", "VST", and "rlog" respectively. Note that the name of each element in the list is required. E.g. list(method='TMM') is expected, while list('TMM') would cause errors.
Complete parameters of "CNF": https://www.rdocumentation.org/packages/edgeR/versions/3.14.0/topics/calcNormFactors
Complete parameters of "ESF": https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/estimateSizeFactors
Complete parameters of "VST": https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/varianceStabilizingTransformation
Complete parameters of "rlog": https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/rlog
|
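The difference between a named and an unnamed list element can be checked in base R. In the sketch below, toy_norm is a hypothetical stand-in for a normalizing function, and forwarding the list with do.call is an assumption made for illustration, not a description of how norm_data handles parameter.list internally.

```r
# Named elements can be matched to function arguments by name.
good <- list(method = "TMM")
names(good)

# An unnamed element carries no argument name.
bad <- list("TMM")
names(bad)

# Hypothetical stand-in for a normalizing function, used only to show
# how a named list can be passed on as arguments with do.call().
toy_norm <- function(x, method = "none") {
  paste("method:", method)
}
res <- do.call(toy_norm, c(list(x = 1:3), good))
res
```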
log2.trans |
Logical, TRUE or FALSE. If TRUE (default) and the selected normalization method does not use log2 scale by default ("ESF"), the output data is log2-transformed after normalization. If FALSE and the selected normalization method uses log2 scale by default ("VST", "rlog"), the output data is 2-exponent transformed after normalization.
|
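The two transformations mentioned for log2.trans are inverses of each other, which the base R sketch below shows on a hypothetical matrix of positive values. Whether norm_data adds a pseudocount before taking log2 is not stated here, so none is used; this is an illustration of the scales involved, not the package's internal code.

```r
# Hypothetical normalized values on the linear scale (all positive).
lin <- matrix(c(1, 2, 4, 8), nrow = 2,
              dimnames = list(c("gene1", "gene2"), c("s1__c1", "s2__c1")))

# log2 transformation, as would apply to linear-scale output such as "ESF".
log.scale <- log2(lin)

# 2-based exponent transformation, which moves log2-scale values
# (such as "VST" or "rlog" output) back to the linear scale.
back <- 2^log.scale

all.equal(back, lin)
```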
data.trans |
This argument is deprecated and replaced by log2.trans. One of "log2", "exp2", and "none", which transform the count matrix by "log2", "2-based exponent", and "no transformation" respectively. The default is "none".
|
Value
If the input data is a SummarizedExperiment, the returned value is also a SummarizedExperiment containing the normalized data matrix and metadata (optional). If the input data is a data.frame, the returned value is a data.frame of normalized data and metadata (optional).
Author(s)
Jianhai Zhang jzhan067@ucr.edu; zhang.jianhai@hotmail.com
Dr. Thomas Girke thomas.girke@ucr.edu
References
SummarizedExperiment: SummarizedExperiment container. R package version 1.10.1
R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
McCarthy, Davis J., Yunshun Chen, and Gordon K. Smyth. 2012. "Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation." Nucleic Acids Research 40 (10): 4288–97
Keays, Maria. 2019. ExpressionAtlas: Download Datasets from EMBL-EBI Expression Atlas
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. "Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2." Genome Biology 15 (12): 550. doi:10.1186/s13059-014-0550-8
Cardoso-Moreira, Margarida, Jean Halbert, Delphine Valloton, Britta Velten, Chunyan Chen, Yi Shao, Angélica Liechti, et al. 2019. “Gene Expression Across Mammalian Organ Development.” Nature 571 (7766): 505–9
See Also
calcNormFactors in edgeR, and estimateSizeFactors, varianceStabilizingTransformation, rlog in DESeq2.
Examples
## In the following examples, the 2 toy data sets come from an RNA-seq analysis of the
## development of 7 chicken organs across 9 time points (Cardoso-Moreira et al. 2019). For
## convenience, they are included in this package. The complete raw count data are downloaded
## using the R package ExpressionAtlas (Keays 2019) with the accession number "E-MTAB-6769".
## Toy data1 is used as a "data frame" input to exemplify data of simple samples/conditions,
## while toy data2 is used as a "SummarizedExperiment" to illustrate data involving complex
## samples/conditions.
## Set up toy data.
# Access toy data1.
cnt.chk.simple <- system.file('extdata/shinyApp/example/count_chicken_simple.txt',
package='spatialHeatmap')
df.chk <- read.table(cnt.chk.simple, header=TRUE, row.names=1, sep='\t', check.names=FALSE)
# Columns follow the naming scheme "sample__condition", where "sample" and "condition" stand
# for organs and time points respectively.
df.chk[1:3, ]
# A column of gene annotation can be appended to the data frame, but is not required.
ann <- paste0('ann', seq_len(nrow(df.chk))); ann[1:3]
df.chk <- cbind(df.chk, ann=ann)
df.chk[1:3, ]
# Access toy data2.
cnt.chk <- system.file('extdata/shinyApp/example/count_chicken.txt', package='spatialHeatmap')
count.chk <- read.table(cnt.chk, header=TRUE, row.names=1, sep='\t')
count.chk[1:3, 1:5]
# Store toy data2 in "SummarizedExperiment".
library(SummarizedExperiment)
se.chk <- SummarizedExperiment(assay=count.chk)
# Normalize raw count data. The normalizing function "calcNormFactors" (McCarthy et al. 2012)
# with default settings is used.
df.nor.chk <- norm_data(data=df.chk, norm.fun='CNF', log2.trans=TRUE)
se.nor.chk <- norm_data(data=se.chk, norm.fun='CNF', log2.trans=TRUE)
[Package spatialHeatmap version 1.2.0 Index]