calculateQCMetrics {scater} | R Documentation |
Compute quality control (QC) metrics for each feature and cell in a SingleCellExperiment object, accounting for specified control sets.
calculateQCMetrics(object, exprs_values = "counts", feature_controls = NULL, cell_controls = NULL, percent_top = c(50, 100, 200, 500), detection_limit = 0, use_spikes = TRUE, compact = FALSE, BPPARAM = SerialParam())
object |
A SingleCellExperiment object containing expression values, usually counts. |
exprs_values |
A string indicating which |
feature_controls |
A named list containing one or more vectors (a character vector of feature names, a logical vector, or a numeric vector of indices), used to identify feature controls such as ERCC spike-in sets or mitochondrial genes. |
cell_controls |
A named list containing one or more vectors (a character vector of cell (sample) names, a logical vector, or a numeric vector of indices), used to identify cell controls, e.g., blank wells or bulk controls. |
percent_top |
An integer vector.
Each element is treated as a number of top genes to compute the percentage of library size occupied by the most highly expressed genes in each cell.
See |
detection_limit |
A numeric scalar to be passed to |
use_spikes |
A logical scalar indicating whether existing spike-in sets in |
compact |
A logical scalar indicating whether the metrics should be returned in a compact format as a nested DataFrame. |
BPPARAM |
A BiocParallelParam object specifying whether the QC calculations should be parallelized. |
This function calculates useful quality control metrics to help with pre-processing of data and identification of potentially problematic features and cells.
Underscores in assayNames(object) and in the names of feature_controls or cell_controls can theoretically cause ambiguities in the names of the output metrics.
While problems are highly unlikely, users are advised to avoid underscores when naming their controls/assays.
If the expression values are double-precision, the per-row means may not be exactly identical for different choices of BPPARAM.
This is due to differences in rounding error when summation is performed across different numbers of cores.
If it is important to obtain numerically identical results (e.g., when using the per-row means for sensitive procedures like t-SNE) across various parallelization schemes,
we suggest manually calculating those statistics using rowMeans
.
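The rounding effect described above can be reproduced in base R without any parallel backend. The sketch below is illustrative only: the two "splits" are hypothetical stand-ins for how different numbers of workers might group the same values, not a description of how BPPARAM actually partitions the data.

```r
# Three double-precision values for one feature across three cells.
x <- c(0.1, 0.2, 0.3)

# Hypothetical split A: one worker sums cells 1-2, another holds cell 3.
sum_a <- (x[1] + x[2]) + x[3]

# Hypothetical split B: one worker holds cell 1, another sums cells 2-3.
sum_b <- x[1] + (x[2] + x[3])

# The two groupings accumulate rounding error differently, so the results
# agree within numerical tolerance but are not bit-for-bit identical.
same_bits <- identical(sum_a, sum_b)
same_within_tolerance <- isTRUE(all.equal(sum_a, sum_b))

# Computing the means directly, outside any parallel scheme, gives one
# fixed answer however the QC calculation itself was parallelized.
mat <- matrix(x, nrow = 1, dimnames = list("gene1", paste0("cell", 1:3)))
row_means <- rowMeans(mat)
```

On a real object, the analogous call would be rowMeans(assay(object, "counts")), assuming the counts are stored under that assay name.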
A SingleCellExperiment object containing QC metrics in the row and column metadata.
Denote the value of exprs_values
as X
.
Cell-level metrics are:
total_X
:Sum of expression values for each cell (i.e., the library size, when counts are the expression values).
log10_total_X
:Log10-transformed total_X
after adding a pseudo-count of 1.
total_features_by_X
:The number of features that have expression values above the detection limit.
log10_total_features_by_X
:Log10-transformed total_features_by_X
after adding a pseudo-count of 1.
pct_X_in_top_Y_features
:The percentage of the total that is contained within the top Y
most highly expressed features in each cell.
This is only reported when there are more than Y
features.
The top numbers are specified via percent_top
.
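As a rough illustration of the definitions above, the cell-level metrics can be computed by hand in base R on a toy matrix with X = "counts". This is a minimal sketch of the arithmetic, not the function's internal implementation; the matrix, the variable names and the choice of Y = 2 are all assumptions made for the example.

```r
# Toy count matrix: 4 features (rows) by 3 cells (columns).
counts <- matrix(
  c(10, 0, 5,
     0, 0, 5,
    30, 2, 0,
    60, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)
detection_limit <- 0
top_n <- 2  # stands in for one element of percent_top

# total_X and its log10 transform (pseudo-count of 1).
total_counts <- colSums(counts)
log10_total_counts <- log10(total_counts + 1)

# Number of features above the detection limit, and its log10 transform.
total_features_by_counts <- colSums(counts > detection_limit)
log10_total_features_by_counts <- log10(total_features_by_counts + 1)

# Percentage of each cell's total held by its top_n most highly
# expressed features (only meaningful with more than top_n features).
pct_counts_in_top_2_features <- apply(counts, 2, function(col) {
  100 * sum(sort(col, decreasing = TRUE)[seq_len(top_n)]) / sum(col)
})
```

For this matrix the library sizes are 100, 2 and 10, and the top two features account for 90%, 100% and 100% of each cell's total.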
If any controls are specified in feature_controls
, the above metrics will be recomputed using only the features in each control set.
The name of the set is appended to the name of the recomputed metric, e.g., total_X_F for a control set named F.
A pct_X_F
metric is also calculated for each set, representing the percentage of expression values assigned to features in F
.
In addition to the user-specified control sets, two other sets are automatically generated when feature_controls
is non-empty.
The first is the "feature_control"
set, containing a union of all feature control sets;
and the second is an "endogenous"
set, containing all genes not in any control set.
Metrics are also computed for these sets in the same manner described above, suffixed with _feature_control
and _endogenous
instead of _F
.
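The per-set recomputation and the two automatically generated sets can be sketched in base R as follows. The control sets ERCC and Mt, the toy matrix and the metrics list are hypothetical names chosen for illustration; only the suffixing scheme and the union/complement logic come from the description above.

```r
# Toy count matrix: 4 features (rows) by 3 cells (columns).
counts <- matrix(
  c(10, 0, 5,
     0, 0, 5,
    30, 2, 0,
    60, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Hypothetical user-specified feature control sets, given by name.
feature_controls <- list(ERCC = "gene1", Mt = "gene4")

total_counts <- colSums(counts)

# Add the two automatic sets: the union of all controls, and everything else.
in_any <- unique(unlist(feature_controls))
all_sets <- c(
  feature_controls,
  list(feature_control = in_any,
       endogenous = setdiff(rownames(counts), in_any))
)

# Recompute the total per set and express it as a percentage of the
# cell's overall total; the set name is appended as a suffix.
metrics <- list()
for (set in names(all_sets)) {
  set_total <- colSums(counts[all_sets[[set]], , drop = FALSE])
  metrics[[paste0("total_counts_", set)]] <- set_total
  metrics[[paste0("pct_counts_", set)]] <- 100 * set_total / total_counts
}
```

Here metrics$pct_counts_feature_control and metrics$pct_counts_endogenous sum to 100 for every cell, since the two sets partition the features.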
Finally, there is the is_cell_control
field, which indicates whether each cell has been defined as a cell control by cell_controls
.
If multiple sets of cell controls are defined (e.g., blanks or bulk libraries), a metric is_cell_control_C
is produced for each cell control set C
.
The union of all sets is stored in is_cell_control
.
All of these cell-level QC metrics are added as columns to the colData
slot of the SingleCellExperiment object.
This allows them to be inspected by the user and makes them readily available for other functions to use.
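The is_cell_control fields can be pictured as logical flags derived from each entry of cell_controls, which may be a character, logical or numeric vector. The base R sketch below shows one way to normalize the three forms; the helper as_flag and the control set names are hypothetical and not part of scater.

```r
cell_names <- paste0("cell", 1:5)

# Hypothetical cell control sets, one in each accepted form.
cell_controls <- list(
  blank = "cell2",                              # character: cell names
  bulk  = c(FALSE, FALSE, FALSE, TRUE, FALSE),  # logical: one value per cell
  empty = 5                                     # numeric: column indices
)

# Convert any of the three forms into a logical vector over all cells.
as_flag <- function(ctrl, all_names) {
  if (is.character(ctrl)) {
    all_names %in% ctrl
  } else if (is.logical(ctrl)) {
    ctrl
  } else {
    seq_along(all_names) %in% ctrl
  }
}

# One flag per control set C, named is_cell_control_C ...
flags <- lapply(cell_controls, as_flag, all_names = cell_names)
names(flags) <- paste0("is_cell_control_", names(cell_controls))

# ... and the union of all sets.
is_cell_control <- Reduce(`|`, flags)
```

With these inputs, cells 2, 4 and 5 are flagged in the union.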
Denote the value of exprs_values
as X
.
Feature-level metrics are:
mean_X
:Mean expression value for each gene across all cells.
log10_mean_X
:Log10-mean expression value for each gene across all cells.
n_cells_by_X
:Number of cells with expression values above the detection limit for each gene.
pct_dropout_by_X
:Percentage of cells with expression values below the detection limit for each gene.
total_X
:Sum of expression values for each gene across all cells.
log10_total_X
:Log10-sum of expression values for each gene across all cells.
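The feature-level metrics mirror the cell-level ones, computed across rows instead of columns. The base R sketch below uses a toy matrix with X = "counts". Two points are assumptions rather than statements from the text above: the log10 values add a pseudo-count of 1, as for the cell-level metrics, and the dropout percentage is taken as the complement of n_cells_by_X.

```r
# Toy count matrix: 4 features (rows) by 3 cells (columns).
counts <- matrix(
  c(10, 0, 5,
     0, 0, 5,
    30, 2, 0,
    60, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)
detection_limit <- 0

# Mean expression per feature, and its log10 transform.
mean_counts <- rowMeans(counts)
log10_mean_counts <- log10(mean_counts + 1)  # pseudo-count of 1 is assumed

# Number of cells above the detection limit, and the complementary
# percentage of cells in which the feature is not detected.
n_cells_by_counts <- rowSums(counts > detection_limit)
pct_dropout_by_counts <- 100 * (1 - n_cells_by_counts / ncol(counts))

# Total expression per feature, and its log10 transform.
total_counts <- rowSums(counts)
log10_total_counts <- log10(total_counts + 1)  # pseudo-count of 1 is assumed
```

For example, gene4 is detected in one of three cells, giving n_cells_by_counts of 1 and a dropout percentage of about 66.7.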
If any controls are specified in cell_controls
, the above metrics will be recomputed using only the cells in each control set.
The name of the set is appended to the name of the recomputed metric, e.g., total_X_C for a control set named C.
A pct_X_C
metric is also calculated for each set, representing the percentage of expression values assigned to cells in C
.
In addition to the user-specified control sets, two other sets are automatically generated when cell_controls
is non-empty.
The first is the "cell_control"
set, containing a union of all cell control sets;
and the second is a "non_control"
set, containing all cells not in any control set.
Metrics are computed for these sets in the same manner described above, suffixed with _cell_control
and _non_control
instead of _C
.
Finally, there is the is_feature_control
field, which indicates whether each feature has been defined as a control by feature_controls
.
If multiple sets of feature controls are defined (e.g., ERCCs, mitochondrial genes), a metric is_feature_control_F
is produced for each feature control set F
.
The union of all sets is stored in is_feature_control
.
These feature-level QC metrics are added as columns to the rowData
slot of the SingleCellExperiment object.
They can be inspected by the user and are readily available for other functions to use.
If compact=TRUE
, the QC metrics are stored in the "scater_qc"
field of the colData
and rowData
as a nested DataFrame.
This avoids cluttering the metadata with QC metrics, especially if many results are to be stored in a single SingleCellExperiment object.
Assume we have a feature control set F
and a cell control set C
.
The nesting structure in scater_qc
in the colData
is:
scater_qc
|-- is_cell_control
|-- is_cell_control_C
|-- all
|   |-- total_counts
|   |-- total_features_by_counts
|   \-- ...
+-- endogenous
|   |-- total_counts
|   |-- total_features_by_counts
|   |-- pct_counts
|   \-- ...
+-- feature_control
|   |-- total_counts
|   |-- total_features_by_counts
|   |-- pct_counts
|   \-- ...
\-- feature_control_F
    |-- total_counts
    |-- total_features_by_counts
    |-- pct_counts
    \-- ...
The nesting in scater_qc
in the rowData
is:
scater_qc
|-- is_feature_control
|-- is_feature_control_F
|-- all
|   |-- total_counts
|   |-- total_features_by_counts
|   \-- ...
+-- non_control
|   |-- total_counts
|   |-- total_features_by_counts
|   |-- pct_counts
|   \-- ...
+-- cell_control
|   |-- total_counts
|   |-- total_features_by_counts
|   |-- pct_counts
|   \-- ...
\-- cell_control_C
    |-- total_counts
    |-- total_features_by_counts
    |-- pct_counts
    \-- ...
No suffixing of the metric names by the control names is performed here. This is not necessary when each control set has its own nested DataFrame.
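To see why suffixes become unnecessary, it may help to mimic the nesting with base R data frames. The sketch below is only an analogy: it uses data.frame rather than the Bioconductor DataFrame class, and the values and the ERCC set name are made up for illustration.

```r
# Top level: one row per cell, with the control flag as an ordinary column.
qc <- data.frame(
  is_cell_control = c(FALSE, FALSE, TRUE),
  row.names = paste0("cell", 1:3)
)

# Each set gets its own nested table, so the same metric name
# (total_counts) can appear in several places without clashing.
qc$all <- data.frame(total_counts = c(100, 2, 10))
qc$feature_control_ERCC <- data.frame(
  total_counts = c(10, 0, 5),
  pct_counts   = c(10, 0, 50)
)

# Metrics are reached by walking down the nesting.
all_totals  <- qc$all$total_counts
ercc_totals <- qc$feature_control_ERCC$total_counts
```

On a real object produced with compact=TRUE, the corresponding path would be along the lines of colData(object)$scater_qc$all$total_counts, following the structure shown above.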
Several metric names have been changed in scater 1.7.5:
total_features was changed to total_features_by_X, where X is the exprs_values. This avoids ambiguities if calculateQCMetrics is called multiple times with different exprs_values.
n_cells_X was changed to n_cells_by_X, to provide a more sensible name for the metric.
pct_dropout_X was changed to pct_dropout_by_X.
pct_X_top_Y_features was changed to pct_X_in_top_Y_features.
The old metric names have been removed in version 1.9.10.
Davis McCarthy, with (many!) modifications by Aaron Lun
library(scater)

## load the example data and build a SingleCellExperiment
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- calculateQCMetrics(example_sce)

## with a set of feature controls defined
example_sce <- calculateQCMetrics(example_sce,
    feature_controls = list(set1 = 1:40))

## with a named set of feature controls defined
example_sce <- calculateQCMetrics(example_sce,
    feature_controls = list(ERCC = 1:40))