Parallel analysis {scran} | R Documentation |
Perform a parallel analysis to choose the number of principal components.
## S4 method for signature 'ANY'
parallelPCA(x, subset.row=NULL, value=c("pca", "n", "lowrank"),
    min.rank=5, max.rank=100, niters=50, threshold=0.1, approximate=FALSE,
    irlba.args=list(), BPPARAM=SerialParam())

## S4 method for signature 'SingleCellExperiment'
parallelPCA(x, ..., subset.row=NULL, value=c("pca", "n", "lowrank"),
    assay.type="logcounts", get.spikes=FALSE, sce.out=TRUE)
x | A numeric matrix of log-expression values for
subset.row | See
value | A string specifying the type of value to return: the PCs, the number of retained components, or a low-rank approximation.
min.rank, max.rank | Integer scalars specifying the minimum and maximum number of PCs to retain.
niters | Integer scalar specifying the number of iterations to use for the parallel analysis.
threshold | Numeric scalar representing the “p-value” threshold above which PCs are to be ignored.
approximate | A logical scalar indicating whether approximate SVD should be performed via
irlba.args | A named list of additional arguments to pass to
BPPARAM | A BiocParallelParam object.
... | Further arguments to pass to
assay.type | A string specifying which assay values to use.
get.spikes | See
sce.out | A logical scalar specifying whether a modified SingleCellExperiment object should be returned.
This function performs Horn's parallel analysis to decide how many PCs to retain in a principal components analysis. Parallel analysis involves permuting the expression vector for each gene and repeating the PCA to obtain the fractions of variance explained under a random null model. The number of PCs to retain is determined by the point at which the observed and permuted “fraction explained” lines intersect on a scree plot. The rationale is that PCs explaining less variance than would be expected under a random model can be discarded.
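The permutation step described above can be sketched in base R. This is only an illustration under stated assumptions, not the code used by parallelPCA: it relies on prcomp rather than the package's own PCA routines, and the helper names frac_var and permute_genes, as well as the simulated data, are introduced here purely for the example.

```r
# Illustrative sketch of the permutation null (not scran's own code).
set.seed(100)
ngenes <- 200
ncells <- 30

# Simulated log-expression matrix (genes in rows, cells in columns) with
# one real factor of variation affecting the first 50 genes.
x <- matrix(rnorm(ngenes * ncells), nrow=ngenes)
x[1:50, 1:15] <- x[1:50, 1:15] + 3

# Fraction of variance explained by each PC, treating cells as observations.
frac_var <- function(mat) {
    sdev <- prcomp(t(mat))$sdev
    sdev^2 / sum(sdev^2)
}

# Shuffle each gene's expression vector independently across cells,
# which destroys the correlations between genes.
permute_genes <- function(mat) {
    t(apply(mat, 1, sample))
}

observed <- frac_var(x)

niters <- 20
# One row per PC, one column per permutation iteration.
permuted <- replicate(niters, frac_var(permute_genes(x)))

# Compare the observed fraction for the first PC to its null values.
observed[1] > max(permuted[1, ])
```

Plotting observed alongside the rows of permuted gives the scree plot comparison that the text refers to.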
In practice, we discard all PCs from the first PC that has a fraction explained similar to that under the null.
A PC is considered similar if the permuted fractions exceed the observed fraction in more than a proportion threshold of the iterations.
(For want of a better word, we have described this as a “p-value” threshold, though it is not interpretable as a measure of significance.)
This is a more conservative criterion than discarding PCs with fractions below the average null fraction, as the latter tends to overstate the rank in noisy datasets.
Note that the number of PCs will be coerced to lie between min.rank and max.rank.
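The selection rule in this paragraph can be written out as a short function. This is a hedged sketch rather than the package's implementation: choose_rank is a hypothetical helper, the layout of permuted (one row per PC, one column per iteration) is an assumption, and the toy numbers are made up to exercise the rule.

```r
# Illustrative decision rule (helper name and input layout are assumptions).
# observed: fraction of variance explained by each PC in the real data.
# permuted: matrix of null fractions, one row per PC, one column per iteration.
choose_rank <- function(observed, permuted, threshold=0.1,
                        min.rank=5, max.rank=100) {
    # Proportion of iterations in which the null fraction exceeds the observed one.
    prop <- rowMeans(permuted > observed)

    # First PC that looks like the null; it and all later PCs are discarded.
    similar <- which(prop > threshold)
    npcs <- if (length(similar)) similar[1] - 1L else length(observed)

    # Coerce the result to lie between min.rank and max.rank.
    as.integer(min(max(npcs, min.rank), max.rank))
}

# Toy input: 4 PCs, 10 iterations.
observed <- c(0.50, 0.30, 0.12, 0.08)
permuted <- matrix(c(
    rep(0.30, 10),               # PC1: null never exceeds 0.50
    rep(0.27, 10),               # PC2: null never exceeds 0.30
    rep(c(0.10, 0.20), each=5),  # PC3: null exceeds 0.12 in half the iterations
    rep(0.20, 10)                # PC4: null always exceeds 0.08
), nrow=4, byrow=TRUE)

choose_rank(observed, permuted, min.rank=0)  # PC3 is the first "similar" PC
choose_rank(observed, permuted, min.rank=3)  # result is raised to min.rank
```

With min.rank=0 the first two PCs are kept, because PC3 is the first whose null fractions exceed the observed value in more than 10% of iterations; with min.rank=3 the same answer is pushed up to the lower bound.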
This function can be sped up by specifying approximate=TRUE, which will use approximate strategies for performing the PCA.
Another option is to set BPPARAM to perform the iterations in parallel.
For parallelPCA,ANY-method, a numeric matrix is returned containing the selected PCs (columns) for all cells (rows) if value="pca".
If value="n", it will return an integer scalar specifying the number of retained components.
If value="lowrank", it will return a low-rank approximation of x with the same dimensions.
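To make the difference between the "pca" and "lowrank" outputs concrete, the following base R sketch builds both from a truncated SVD. It is an illustration only: the number of retained PCs k is assumed to be given, and adding the gene means back to the approximation is a choice made for this example that may not match what parallelPCA does.

```r
# Illustrative rank-k approximation of a genes-by-cells matrix via the SVD.
set.seed(1)
x <- matrix(rnorm(100 * 20), nrow=100)   # 100 genes, 20 cells
k <- 3                                   # number of retained PCs (assumed given)

# Centre each gene, as a PCA on the cells would.
gene.means <- rowMeans(x)
centred <- x - gene.means

# Truncated SVD of the cells-by-genes matrix.
s <- svd(t(centred), nu=k, nv=k)

# PC scores: one row per cell, one column per retained PC.
pcs <- s$u %*% diag(s$d[1:k], nrow=k)

# Project back to gene space and restore the means (an assumption made here).
lowrank <- t(pcs %*% t(s$v)) + gene.means

dim(pcs)      # cells by retained PCs
dim(lowrank)  # same dimensions as x
```

The PC matrix has one row per cell, whereas the low-rank approximation keeps the genes-by-cells shape of the input and can be used wherever the original expression matrix would be.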
For parallelPCA,SingleCellExperiment-method, the return value is the same as parallelPCA,ANY-method if sce.out=FALSE or value="n".
Otherwise, a SingleCellExperiment object is returned that is a modified version of x.
If value="pca", the modified object will contain the PCs as the "PCA" entry in the reducedDims slot.
If value="lowrank", the modified object will contain a low-rank approximation in the assays slot, named "lowrank".
In all cases, the fractions of variance explained by the first max.rank PCs will be stored as the "percentVar" attribute in the return value.
Fractions of variance explained by these PCs after each permutation iteration are also recorded as a matrix in "permuted.percentVar".
Aaron Lun
Buja A and Eyuboglu N (1992). Remarks on Parallel Analysis. Multivariate Behav. Res., 27:509-40.
library(scran)

# Mocking up some data.
ngenes <- 1000
means <- 2^runif(ngenes, 6, 10)
dispersions <- 10/means + 0.2
nsamples <- 50
counts <- matrix(rnbinom(ngenes*nsamples, mu=means, size=1/dispersions), ncol=nsamples)

# Choosing the number of PCs
lcounts <- log2(counts + 1)
parallelPCA(lcounts, min.rank=0, value="n")