overlapExprs {scran}    R Documentation

Overlap expression profiles

Description

Compute the gene-specific overlap in expression profiles between two groups of cells.

Usage

## S4 method for signature 'ANY'
overlapExprs(x, groups, block=NULL, design=NULL, 
    rank.type=c("any", "all"), direction=c("any", "up", "down"),
    tol=1e-8, BPPARAM=SerialParam(), subset.row=NULL, 
    lower.bound=NULL, residuals=FALSE)

## S4 method for signature 'SingleCellExperiment'
overlapExprs(x, ..., subset.row=NULL, lower.bound=NULL, 
    assay.type="logcounts", get.spikes=FALSE) 

Arguments

x

A numeric matrix of expression values, where each column corresponds to a cell and each row corresponds to an endogenous gene. Alternatively, a SingleCellExperiment object containing such a matrix.

groups

A vector of group assignments for all cells.

block

A factor specifying the blocking level for each cell.

design

A numeric matrix containing blocking terms, i.e., uninteresting factors driving expression across cells.

rank.type

A string specifying which comparisons should be used to rank genes in the output.

direction

A string specifying which direction of change in expression should be used to rank genes in the output.

tol

A numeric scalar specifying the tolerance with which ties are considered.

BPPARAM

A BiocParallelParam object to use in bplapply for parallel processing.

subset.row

A logical, integer or character vector indicating the rows of x to use.

lower.bound

A numeric scalar specifying the theoretical lower bound of values in x, only used when residuals=TRUE.

residuals

A logical scalar indicating whether overlaps should be computed between residuals of a linear model.

...

Additional arguments to pass to the matrix method.

assay.type

A string specifying which assay values to use, e.g., "counts" or "logcounts".

get.spikes

A logical scalar specifying whether overlaps should be computed for spike-in transcripts.

Details

For two groups of cells A and B, consider the distribution of expression values for gene X across those cells. The overlap proportion is defined as the probability that a randomly selected cell in A has a greater expression value of X than a randomly selected cell in B. Overlap proportions near 0 or 1 indicate that the expression distributions are well-separated. In particular, large proportions indicate that most cells of the first group (A) express the gene more highly than most cells of the second group (B).

This function computes, for each gene, the overlap proportions between all pairs of groups in groups. It then ranks the genes based on how well they differentiate between groups. overlapExprs is designed to complement findMarkers, which reports the log-fold changes between groups. The overlap proportions are useful for prioritizing candidate markers without needing to plot their expression values.

Expression values that are tied between groups are considered to be 50% likely to be greater in either group. Thus, if all values were tied, the overlap proportion would be equal to 0.5. The tolerance with which ties are considered can be set by changing tol.
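The definition above, including the treatment of ties, can be illustrated with a minimal base R sketch. The helper overlap_prop is hypothetical and not part of scran. It assumes that two values count as tied when they differ by no more than tol, which may not match the internal implementation exactly.

```r
# Hypothetical helper (not a scran function): overlap proportion for one gene,
# i.e., the probability that a random cell in A has a greater value than a
# random cell in B, with tied pairs contributing 0.5.
overlap_prop <- function(a, b, tol = 1e-8) {
    d <- outer(a, b, "-")         # all pairwise differences a[i] - b[j]
    greater <- sum(d > tol)       # pairs where the cell in A is clearly higher
    tied <- sum(abs(d) <= tol)    # pairs treated as ties (assumed rule)
    (greater + 0.5 * tied) / length(d)
}

a <- c(5, 6, 7)   # expression of gene X in group A
b <- c(1, 2, 6)   # expression of gene X in group B
overlap_prop(a, b)                 # 7.5 of 9 pairs, i.e., about 0.833
overlap_prop(c(1, 1), c(1, 1))     # all values tied, so 0.5
```

Swapping the two groups gives the complementary proportion, i.e., overlap_prop(b, a) equals 1 - overlap_prop(a, b).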

Users can specify which subset of genes to perform these calculations on, by supplying a non-NULL value of subset.row. By default, spike-in transcripts are ignored in overlapExprs,SingleCellExperiment-method with get.spikes=FALSE. If get.spikes=FALSE and subset.row!=NULL, the function will only use the non-spike-in transcripts in subset.row.

Value

A named list of DataFrames. Each DataFrame corresponds to a group in groups and contains one row per gene in x (or the subset specified by subset.row). Within the DataFrame for each group (e.g., group A), there are the following fields:

Top:

Integer, the minimum rank across all pairwise comparisons if rank.type="any".

Worst:

Numeric, the value of the overlap proportion corresponding to the smallest separation statistic across all comparisons if rank.type="all".

overlap.B:

Numeric, one field for every other group B in groups, containing the overlap proportion between groups A and B for each gene.

Genes are ranked by the Top or Worst column, depending on rank.type.

Ranking genes in the output

Each overlap proportion is first converted into a separation statistic. The definition of the separation statistic depends on the specified direction.

If rank.type="any", the genes in each group-specific DataFrame are ranked using a similar logic to that in findMarkers. This involves calculation of a Top value for each gene, representing the minimum ranking of the separation statistics across pairwise comparisons. To illustrate, consider the DataFrame for group A, and take the set of all genes with Top values less than or equal to some integer X. This set is the union of the top X genes with the largest separation statistics from each pairwise comparison between group A and every other group. Ranking genes based on the Top value prioritizes genes that exhibit low overlaps between group A and any other group.
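The construction of the Top value can be sketched in base R. The separation statistics below are made-up numbers for five hypothetical genes in the DataFrame for group A. The sketch takes them as given and only illustrates the ranking logic described above, not the function's internal code.

```r
# Made-up separation statistics for group A versus groups B and C
# (larger values mean better separation).
sep <- cbind(B = c(0.50, 0.10, 0.45, 0.30, 0.05),
             C = c(0.20, 0.48, 0.40, 0.10, 0.02))
rownames(sep) <- paste0("gene", 1:5)

# Rank genes within each pairwise comparison (rank 1 = largest separation).
ranks <- apply(-sep, 2, rank, ties.method = "first")

# Top is the minimum rank of each gene across all comparisons.
top <- apply(ranks, 1, min)
top

# Genes with Top <= 2 form the union of the top 2 genes from each comparison.
names(top)[top <= 2]
```

Here the top 2 genes are gene1 and gene3 against B, and gene2 and gene3 against C, so the set with Top values of 2 or less is gene1, gene2 and gene3.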

If rank.type="all", the genes in each group-specific DataFrame are ranked by the Worst value instead. This is the overlap proportion corresponding to the smallest separation statistic across all pairwise comparisons between group A and the other groups. (In other words, this is the proportion for the pairwise comparison that exhibits the worst discrimination between distributions.) By using this metric, genes can only achieve a high ranking if the separation statistics between group A and all other groups are large. This tends to be quite conservative but can be helpful for quickly identifying uniquely differentially expressed markers.
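The Worst value can be sketched in the same way. The overlap proportions below are made-up numbers. The separation statistic is assumed, purely for illustration, to be the absolute difference of the overlap proportion from 0.5; this is an assumption of the sketch and not necessarily the definition used by overlapExprs for any given direction.

```r
# Made-up overlap proportions for group A versus groups B and C.
overlap <- cbind(B = c(0.95, 0.60, 0.10),
                 C = c(0.90, 0.99, 0.45))
rownames(overlap) <- paste0("gene", 1:3)

# ASSUMPTION for illustration only: separation as distance from 0.5.
sep <- abs(overlap - 0.5)

# For each gene, find the comparison with the smallest separation ...
worst.col <- apply(sep, 1, which.min)

# ... and report the overlap proportion from that comparison as Worst.
worst <- overlap[cbind(seq_len(nrow(overlap)), worst.col)]
names(worst) <- rownames(overlap)
worst

# Genes are then ordered by decreasing value of their smallest separation.
names(worst)[order(apply(sep, 1, min), decreasing = TRUE)]
```

In this sketch, gene2 separates A from C almost perfectly but ranks below gene1 because its separation from B is poor.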

Accounting for uninteresting variation

If the experiment has known (and uninteresting) factors of variation, these can be included in design or block. The approach used to remove these factors depends on which argument is used. If there is only one factor, using block is recommended, whereby each level of the factor defines a separate block of cells. Overlaps between groups are computed within each block, and a weighted mean (based on the number of cells in each block) of the overlaps is taken across all blocks.

This approach avoids the need for linear modelling and the associated assumptions regarding normality and correct model specification. In particular, it avoids problems with breaking of ties when counts or expression values are converted to residuals. However, it also makes less use of information, e.g., we ignore any blocks containing cells from only one group. NA proportions may also be observed for a pair of groups if there is no block that contains cells from that pair.
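The blocking strategy can be sketched for a single gene and a single pair of groups. Both helpers below are hypothetical and not part of scran. The weight of each block is assumed to be the number of cells from the two groups in that block; the exact weighting used internally may differ.

```r
# Hypothetical helper: overlap proportion with ties counted as 0.5.
overlap_prop <- function(a, b, tol = 1e-8) {
    d <- outer(a, b, "-")
    (sum(d > tol) + 0.5 * sum(abs(d) <= tol)) / length(d)
}

# Hypothetical helper: weighted mean of per-block overlaps between g1 and g2.
blocked_overlap <- function(x, groups, block, g1, g2) {
    props <- weights <- numeric(0)
    for (b in unique(block)) {
        a.vals <- x[block == b & groups == g1]
        b.vals <- x[block == b & groups == g2]
        # Blocks containing cells from only one group are ignored.
        if (length(a.vals) == 0L || length(b.vals) == 0L) next
        props <- c(props, overlap_prop(a.vals, b.vals))
        # ASSUMED weighting: number of cells from both groups in the block.
        weights <- c(weights, length(a.vals) + length(b.vals))
    }
    # No block contains both groups, so the proportion is NA.
    if (length(props) == 0L) return(NA_real_)
    weighted.mean(props, weights)
}

x      <- c(5, 6, 1, 2, 3, 3, 3, 9)
groups <- c("A", "A", "B", "B", "A", "B", "B", "A")
block  <- c(1, 1, 1, 1, 2, 2, 2, 3)
blocked_overlap(x, groups, block, "A", "B")
```

Block 1 gives an overlap of 1 with weight 4, block 2 gives 0.5 with weight 3, and block 3 is skipped because it only contains cells from A, so the result is 5.5/7.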

For experiments containing multiple factors or covariates, a linear model is fitted to the expression values with an appropriate matrix in design. Overlap proportions are then computed using the residuals of the fitted model. This approach is not ideal, as it requires log-transformed values in x and an appropriate setting of lower.bound; see ?correlatePairs for a related discussion. Where possible for one-way layouts, we suggest using block instead.
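The residual-based approach can be sketched for one gene with simulated data. This is a simplified illustration under stated assumptions, not the internal implementation: design is assumed to contain only the uninteresting terms, the residuals are compared directly between groups, and the handling of lower.bound is omitted. The helper overlap_prop and all simulated variables are hypothetical.

```r
# Hypothetical helper: overlap proportion with ties counted as 0.5.
overlap_prop <- function(a, b, tol = 1e-8) {
    d <- outer(a, b, "-")
    (sum(d > tol) + 0.5 * sum(abs(d) <= tol)) / length(d)
}

set.seed(100)
n <- 40
groups <- rep(c("A", "B"), each = n / 2)
batch <- factor(rep(c("x", "y"), times = n / 2))   # uninteresting factor
covar <- runif(n)                                  # uninteresting covariate

# Simulated log-expression: group A is higher, plus batch and covariate effects.
expr <- rnorm(n, mean = ifelse(groups == "A", 2, 0)) + 3 * (batch == "y") + covar

# Design matrix with blocking terms only (assumed), then take the residuals.
design <- model.matrix(~ batch + covar)
res <- lm.fit(design, expr)$residuals

# Overlap proportion between groups, computed on the residuals.
prop <- overlap_prop(res[groups == "A"], res[groups == "B"])
prop
```

Because the residuals are continuous, values that were tied in the original data (e.g., zeroes) generally stop being tied after fitting, which is one reason to prefer block for one-way layouts.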

Author(s)

Aaron Lun

See Also

findMarkers

Examples

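library(scran)

# Using the mocked-up data 'y2' from this example.
example(computeSpikeFactors) 
y2 <- normalize(y2)

# Randomly assign each cell to one of three groups.
groups <- sample(3, ncol(y2), replace=TRUE)

# Compute overlaps for the first 10 genes only.
out <- overlapExprs(y2, groups, subset.row=1:10)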

[Package scran version 1.8.4 Index]