correlatePairs {scran} | R Documentation |
Identify pairs of genes that are significantly correlated based on a modified Spearman's rho.
correlateNull(ncells, iters=1e6, block=NULL, design=NULL, residuals=FALSE) ## S4 method for signature 'ANY' correlatePairs(x, null.dist=NULL, tol=1e-8, iters=1e6, block=NULL, design=NULL, residuals=FALSE, lower.bound=NULL, use.names=TRUE, subset.row=NULL, pairings=NULL, per.gene=FALSE, cache.size=100L, BPPARAM=SerialParam()) ## S4 method for signature 'SingleCellExperiment' correlatePairs(x, ..., use.names=TRUE, subset.row=NULL, per.gene=FALSE, lower.bound=NULL, assay.type="logcounts", get.spikes=FALSE)
ncells |
An integer scalar indicating the number of cells in the data set. |
iters |
An integer scalar specifying the number of values in the null distribution. |
block |
A factor specifying the blocking level for each cell. |
design |
A numeric design matrix containing uninteresting factors to be ignored. |
residuals |
A logical scalar, deprecated. |
x |
A numeric matrix-like object of log-normalized expression values, where rows are genes and columns are cells. Alternatively, a SingleCellExperiment object containing such a matrix. |
null.dist |
A numeric vector of rho values under the null hypothesis. |
BPPARAM |
A BiocParallelParam object to use in |
tol |
A numeric scalar indicating the maximum difference under which two expression values are tied. |
use.names |
A logical scalar specifying whether the row names of |
subset.row |
A logical, integer or character vector indicating the rows of |
pairings |
A |
per.gene |
A logical scalar specifying whether statistics should be summarized per gene. |
lower.bound |
A numeric scalar specifying the theoretical lower bound of values in |
cache.size |
An integer scalar specifying the number of cells for which ranked expression values are stored in memory. Smaller values can be used in machines with less memory, at the cost of processing speed. |
... |
Additional arguments to pass to |
assay.type |
A string specifying which assay values to use. |
get.spikes |
A logical specifying whether spike-in transcripts should be used. |
The aim of the correlatePairs
function is to identify significant correlations between all pairs of genes in x
.
This allows prioritization of genes that are driving systematic substructure in the data set.
By definition, such genes should be correlated as they are behaving in the same manner across cells.
In contrast, genes driven by random noise should not exhibit any correlations with other genes.
An approximation of Spearman's rho is used to quantify correlations robustly based on ranks. To identify correlated gene pairs, the significance of non-zero correlations is assessed using a permutation test. The null hypothesis is that the (ranking of) normalized expression across cells should be independent between genes. This allows us to construct a null distribution by randomizing (ranked) expression within each gene.
The correlateNull
function constructs an empirical null distribution for rho computed with ncells
cells.
When design=NULL
, this is done by shuffling the ranks, calculating the rho and repeating until iters
values are obtained.
The p-value for each gene pair is defined as the tail probability of this distribution at the observed correlation (with some adjustment to avoid zero p-values).
Correction for multiple testing is done using the BH method.
For correlatePairs
, a pre-computed empirical distribution can be supplied as null.dist
if available.
Otherwise, it will be automatically constructed via correlateNull
with ncells
set to the number of columns in x
.
For correlatePairs,SingleCellExperiment-method
, correlations should be computed for normalized expression values in the specified assay.type
.
The lower bound of the p-values is determined by the value of iters
.
If the limited
field is TRUE
in the returned dataframe, it may be possible to obtain lower p-values by increasing iters
.
This should be examined for non-significant pairs, in case some correlations are overlooked due to computational limitations.
The function will automatically raise a warning if any genes are limited in their significance at a FDR of 5%.
If per.gene=TRUE
, results are summarized on a per-gene basis.
For each gene, all of its pairs are identified, and the corresponding p-values are combined using Simes' method.
This tests whether the gene is involved in significant correlations to any other gene.
Setting per.gene=TRUE
is useful for identifying correlated genes without regard to what they are correlated with (e.g., during feature selection).
For correlateNull
, a numeric vector of length iters
is returned containing the sorted correlations under the null hypothesis of no correlations.
Arguments to design
and residuals
are stored in the attributes.
For correlatePairs
with per.gene=FALSE
, a DataFrame is returned with one row per gene pair and the following fields:
gene1, gene2
:Character or integer fields specifying the genes in the pair.
If use.names=FALSE
, integers are returned representing row indices of x
, otherwise gene names are returned.
rho
:A numeric field containing the approximate Spearman's rho.
p.value, FDR
:Numeric fields containing the permutation p-value and its BH-corrected equivalent.
limited
:A logical scalar indicating whether the p-value is at its lower bound, defined by iters
.
Rows are sorted by increasing p.value
and, if tied, decreasing absolute size of rho
.
The exception is if subset.row
is a matrix, in which case each row in the dataframe correspond to a row of subset.row
.
For correlatePairs
with per.gene=TRUE
, a dataframe is returned with one row per gene.
For each row, the rho
field contains the correlation with the largest magnitude across all gene pairs involving the corresponding gene.
The p.value
field contains the Simes p-value, and the FDR
field contains the corresponding adjusted p-value.
If the experiment has known (and uninteresting) factors of variation, these can be included in design
or block
.
correlatePairs
will then attempt to ensure that these factors do not drive strong correlations between genes.
Examples might be to block on batch effects or cell cycle phase, which may have substantial but uninteresting effects on expression.
The approach used to remove these factors depends on whether design
or block
is used.
If there is only one factor, e.g., for plate or animal of origin, block
should be used.
Each level of the factor is defined as a separate group of cells.
For each pair of genes, correlations are computed within each group, and a weighted mean based on the group size) of the correlations is taken across all groups.
The same strategy is used to generate the null distribution where ranks are computed and shuffled within each group.
For experiments containing multiple factors or covariates, a design matrix should be passed into design
.
The correlatePairs
function will fit a linear model to the (log-normalized) expression values.
The correlation between a pair of genes is then computed from the residuals of the fitted model.
Similarly, to obtain a null distribution of rho values, normally-distributed random errors are simulated in a fitted model based on design
;
the corresponding residuals are generated from these errors; and the correlation between sets of residuals is computed at each iteration.
We recommend using block
wherever possible (and it will take priority if both block
and design
are specified).
While design
can also be used for one-way layouts, this is not ideal as it involves more work/assumptions:
It assumes normality, during both linear modelling and generation of the null distribution.
This assumption is largely unavoidable for complex designs, where some quantitative constraints are required to remove nuisance effects.
x
should generally be log-transformed here, whereas this is not required for (but does not hurt) the first group-based approach.
Residuals computed from expression values equal to lower.bound
are set to a constant value below all other residuals.
This preserves ties between zeroes and avoids spurious correlations between genes due to arbitrary tie-breaking.
The value of lower.bound
should be equal to log-prior count used during the log-transformation.
It is automatically taken from metadata(x)$log.exprs.offset
if x
is a SingleCellExperiment object.
The pairings
argument specifies the pairs of genes to compute correlations for:
By default, correlations will be computed between all pairs of genes with pairings=NULL
.
Genes that occur earlier in x
are labelled as gene1
in the output DataFrame.
Redundant permutations are not reported.
If pairings
is a list of two vectors, correlations will be computed between one gene in the first vector and another gene in the second vector.
This improves efficiency if the only correlations of interest are those between two pre-defined sets of genes.
Genes in the first vector are always reported as gene1
.
If pairings
is an integer/character matrix of two columns, each row is assumed to specify a gene pair.
Correlations will then be computed for only those gene pairs, and the returned dataframe will not be sorted by p-value.
Genes in the first column of the matrix are always reported as gene1
.
If subset.row
is not NULL
, only the genes in the selected subset are used to compute correlations.
This will iteract properly with pairings
, such that genes in pairings
and not in subset.row
will be ignored.
With correlatePairs,SingleCellExperiment-method
, rows corresponding to spike-in transcripts are also removed by default with get.spikes=FALSE
.
This avoids picking up strong technical correlations between pairs of spike-in transcripts.
We recommend setting subset.row
to contain only the subset of genes of interest.
This reduces computational time by only testing correlations of interest.
For example, we could select only HVGs to focus on genes contributing to cell-to-cell heterogeneity (and thus more likely to be involved in driving substructure).
There is no need to account for HVG pre-selection in multiple testing, because rank correlations are unaffected by the variance.
Lowly-expressed genes can also cause problems when design
is non-NULL
and residuals=TRUE
.
Tied counts, and zeroes in particular, may not result in tied residuals after fitting of the linear model.
Model fitting may break ties in a consistent manner across genes, yielding large correlations between genes with many zero counts.
Focusing on HVGs should mitigate the detection of these uninteresting correlations, as genes dominated by zeroes will usually have low variance.
As previously mentioned, an approximate version of Spearman's rho is used. Specifically, untied ranks are randomly assigned to any tied values. This means that a common empirical distribution can be used for all gene pairs, rather than having to do new permutations for every pair to account for the different pattern of ties. Generally, this modification has little effect on the results for expressed genes (and in any case, differences in library size break ties for normalized expression values). Some correlations may end up being spuriously large, but this should be handled by the error control machinery after multiplicity correction.
Aaron Lun
Phipson B and Smyth GK (2010). Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat. Appl. Genet. Mol. Biol. 9:Article 39.
Simes RJ (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751-754.
set.seed(0) ncells <- 100 null.dist <- correlateNull(ncells, iters=100000) exprs <- matrix(rpois(ncells*100, lambda=10), ncol=ncells) out <- correlatePairs(exprs, null.dist=null.dist) hist(out$p.value)