findMarkers {scran}R Documentation

Find marker genes

Description

Find candidate marker genes for clusters of cells, by testing for differential expression between clusters.

Usage

## S4 method for signature 'ANY'
findMarkers(x, clusters, block=NULL, design=NULL,
    pval.type=c("any", "all"), direction=c("any", "up", "down"), 
    lfc=0, log.p=FALSE, full.stats=FALSE, subset.row=NULL)

## S4 method for signature 'SingleCellExperiment'
findMarkers(x, ..., subset.row=NULL, assay.type="logcounts", 
    get.spikes=FALSE) 

Arguments

x

A numeric matrix-like object of normalized log-expression values, where each column corresponds to a cell and each row corresponds to an endogenous gene. Alternatively, a SingleCellExperiment object containing such a matrix.

clusters

A vector of cluster identities for all cells.

block

A factor specifying the blocking level for each cell.

design

A numeric matrix containing blocking terms, i.e., uninteresting factors driving expression across cells.

pval.type

A string specifying the type of combined p-value to be computed, i.e., Simes' or IUT.

direction

A string specifying the direction of log-fold changes to be considered for each cluster.

lfc

A positive numeric scalar specifying the log-fold change threshold to be tested against.

log.p

A logical scalar indicating if log-transformed p-values/FDRs should be returned.

full.stats

A logical scalar indicating whether all statistics (i.e., raw and BH-adjusted p-values) should be returned for each pairwise comparison.

subset.row

A logical, integer or character scalar indicating the rows of x to use.

...

Additional arguments to pass to the matrix method.

assay.type

A string specifying which assay values to use, e.g., "counts" or "logcounts".

get.spikes

A logical scalar specifying whether decomposition should be performed for spike-ins.

Details

This function performs t-tests to identify differentially expressed genes (DEGs) between pairs of clusters. For each cluster, the log-fold changes and other statistics from all relevant pairwise comparisons are combined into a single table. A list of such tables is returned for all clusters to define a set of potential marker genes.

Users can specify the genes to check for differential expression (DE) by setting the subset.row argument. In addition, spike-in transcripts are ignored in the SingleCellExperiment method when get.spikes=FALSE. If this is set, it will intersect with any non-NULL value of subset.row, i.e., only non-spike-in transcripts in the specified set will be used.

Value

A named list of DataFrames. Each DataFrame corresponds to a cluster in clusters, where rows correspond to genes and are ranked by importance. Within the DataFrame for each cluster (e.g., cluster X), there are the following fields:

Top:

Integer, the minimum rank across all pairwise comparisons if rank.type="any".

IUT.p:

Numeric, the IUT p-value across all comparisons if rank.type="all". This is log-transformed and reported as log.IUT.p if log.p=TRUE.

FDR:

Numeric, the BH-adjusted p-value for each gene. This is log-transformed and reported as log.FDR if log.p=TRUE.

logFC.Y:

Numeric for every other cluster Y in clusters, containing the log-fold change of X over Y when full.stats=FALSE.

stats.Y:

DataFrame for every other cluster Y in clusters, returned when full.stats=TRUE. This contains the numeric fields logFC, the log-fold change of X over Y; p.value or log.p.value, the (log-transformed) p-value for the pairwise comparison between X and Y; and FDR or log.FDR, the (log-transformed) BH-adjusted p-value. Setting log.p=TRUE will yield the log-transformed output.

Genes are ranked by the Top or IUT.p column, depending on rank.type.

Explanation of the hypothesis tests

By default, this function will perform a Welch t-test to identify DEGs between each pair of clusters. This is simple, fast and performs quite well for single-cell count data (Soneson and Robinson, 2018). However, if one of the clusters contains fewer than two cells, no p-value will be reported for this comparison.

If block is specified, the same t-tests are performed between clusters within each level of block. For each pair of clusters, the p-values for each gene across all levels of block are combined using Stouffer's Z-score method. The p-value for each level is assigned a weight inversely proportional to the expected variance of the log-fold change estimate for that level. Blocking levels are ignored if no p-value was reported, e.g., if there were insufficient cells for a cluster in a particular level.

If design is specified, a linear model is instead fitted to the expression profile for each gene. This linear model will include the clusters as well as any blocking factors in design. A t-test is then performed to identify DEGs between pairs of clusters, using the values of the relevant coefficients and the gene-wise residual variance.

Note that block will override any design if both are specified. This reflects our preference for the former, which accommodates differences in the variance of expression in each cluster via Welch's t-test. As a result, it is more robust to misspecification of the clusters, as misspecified clusters (and inflated variances) do not affect the inferences for other clusters. Use of block also avoids assuming additivity of effects between the blocking factors and the cluster identities.

Nonetheless, use of design is unavoidable when blocking on real-valued covariates. It is also useful for ensuring that log-fold changes/p-values are computed for comparisons between all pairs of clusters (assuming that design is not confounded with the cluster identities). This may not be the case with block if a pair of clusters never co-occur in a single blocking level.

Direction and magnitude of the log-fold change

If direction="any", two-sided tests will be performed for each pairwise comparisons between clusters. Otherwise, one-sided tests in the specified direction will be used to compute p-values for each gene. This can be used to focus on genes that are upregulated in each cluster of interest, which is often easier to interpret.

To interpret the setting of direction, consider the DataFrame for cluster X, in which we are comparing to another cluster Y. If direction="up", genes will only be significant in this DataFrame if they are upregulated in cluster X compared to Y. If direction="down", genes will only be significant if they are downregulated in cluster X compared to Y.

The magnitude of the log-fold changes can also be tested by setting lfc. By default, lfc=0 meaning that we will reject the null upon detecting any differential expression. If this is set to some other positive value, the null hypothesis will change depending on direction:

This is similar to the approach used in treat and allows users to focus on genes with strong log-fold changes.

Consolidating p-values into a ranking

By default, each table is sorted by the Top value when pval.type="any". This is the minimum rank across all pairwise comparisons for each gene, and specifies the size of the candidate marker set. Taking all rows with Top values no greater than some integer X will yield a set containing the top X genes (ranked by significance) from each pairwise comparison. For example, if X is 5, the set will consist of the union of the top 5 genes from each pairwise comparison. The marker set for each cluster allows it to be distinguished from each other cluster based on the expression of at least one gene.

This approach does not explicitly favour genes that are uniquely expressed in a cluster. Such a strategy is often too stringent, especially in cases involving overclustering or cell types defined by combinatorial gene expression. However, if pval.type="all", the null hypothesis is that the gene is not DE in all contrasts, and the IUT p-value is computed for each gene. This yields a IUT.p field instead of a Top field in the output table. Ranking based on the IUT p-value will focus on genes that are uniquely DE in that cluster.

Correcting for multiple testing

When pval.type="any", a combined p-value is calculated by consolidating p-values across contrasts for each gene using Simes' method. This represents the evidence against the null hypothesis is that the gene is not DE in any of the contrasts. The BH method is then applied on the combined p-values across all genes to obtain the FDR field. The same procedure is done with pval.type="all", but using the IUT p-values across genes instead.

If log.p=TRUE, log-transformed p-values and FDRs will be reported. This may be useful in over-powered studies with many cells, where directly reporting the raw p-values would result in many zeroes due to the limits of machine precision.

Note that the reported FDRs are intended only as a rough measure of significance. Properly correcting for multiple testing is not generally possible when clusters is determined from the same x used for DE testing.

Weighting across blocking levels

When block is specified, the weight for the p-value in a particular level is defined as (1/Nx + 1/Ny)^{-1}, where Nx and Ny are the number of cells in clusters X and Y, respectively, for that level. This is inversely proportional to the expected variance of the log-fold change, provided that all clusters and blocking levels have the same variance.

In theory, a better weighting scheme would be to use the estimated standard error of the log-fold change to compute the weight. This would be more responsive to differences in variance between blocking levels, focusing on levels with low variance and high power. However, this is not safe in practice as genes with many zeroes can have very low standard errors, dominating the results inappropriately.

Like the p-values, the reported log-fold change for each gene is a weighted average of log-fold changes from all levels of the blocking factor. The weight for each log-fold change is inversely proportional to the expected variance of the log-fold change in that level. Unlike p-values, though, this calculation will use blocking levels where both clusters contain only one cell.

Author(s)

Aaron Lun

References

Simes RJ (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751-754.

Berger RL and Hsu JC (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statist. Sci. 11, 283-319.

Whitlock MC (2005). Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. J. Evol. Biol. 18, 5:1368-73.

Soneson C and Robinson MD (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods

See Also

normalize

Examples

# Using the mocked-up data 'y2' from this example.
example(computeSpikeFactors) 
y2 <- normalize(y2)
kout <- kmeans(t(logcounts(y2)), centers=2) # Any clustering method is okay.
out <- findMarkers(y2, clusters=kout$cluster)

[Package scran version 1.8.4 Index]