isOutlier {scater}    R Documentation

Identify outlier values

Description

Convenience function to determine which values in a numeric vector are outliers based on the median absolute deviation (MAD).

Usage

isOutlier(
  metric,
  nmads = 3,
  type = c("both", "lower", "higher"),
  log = FALSE,
  subset = NULL,
  batch = NULL,
  share_medians = FALSE,
  share_mads = FALSE,
  share_missing = TRUE,
  min_diff = NA
)

Arguments

metric

Numeric vector of values.

nmads

A numeric scalar, specifying the minimum number of MADs away from the median required for a value to be called an outlier.

type

String indicating whether outliers should be looked for at both tails ("both"), only at the lower tail ("lower") or the upper tail ("higher").

log

Logical scalar, should the values of the metric be transformed to the log2 scale before computing MADs?

subset

Logical or integer vector, which subset of values should be used to calculate the median/MAD? If NULL, all values are used.

batch

Factor of length equal to metric, specifying the batch to which each observation belongs. A median/MAD is calculated for each batch, and outliers are then identified within each batch.

share_medians

Logical scalar indicating whether the median calculation should be shared across batches. Only used if batch is specified.

share_mads

Logical scalar indicating whether the MAD calculation should be shared across batches. Only used if batch is specified.

share_missing

Logical scalar indicating whether values should be shared across batches if they cannot be computed for a batch, e.g., due to subsetting.

min_diff

A numeric scalar indicating the minimum difference from the median to consider as an outlier. Ignored if NA.

Details

Lower and upper thresholds are stored in the "threshold" attribute of the returned vector. By default, this is a numeric vector of length 2 for the threshold on each side. If type="lower", the higher limit is Inf, while if type="higher", the lower limit is -Inf.
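The threshold logic described above can be sketched in base R. This is an illustrative sketch, not scater's own code: mad_thresholds is a hypothetical helper, the element names "lower" and "higher" are chosen here for readability, and the use of stats::mad with its default scaling constant is an assumption.

```r
# Hypothetical helper (not part of scater): MAD-based limits for one vector.
mad_thresholds <- function(x, nmads = 3, type = c("both", "lower", "higher")) {
  type <- match.arg(type)
  med <- median(x)
  # Assumption: stats::mad with its default scaling constant.
  spread <- nmads * mad(x, center = med)
  # A tail that is not tested gets an infinite limit.
  lower <- if (type == "higher") -Inf else med - spread
  upper <- if (type == "lower") Inf else med + spread
  c(lower = lower, higher = upper)
}

x <- c(10, 11, 12, 13, 14, 100)
limits <- mad_thresholds(x)
is_out <- x < limits[["lower"]] | x > limits[["higher"]]
```

For a real call, the corresponding limits would be read with attr(result, "threshold"), where result is the vector returned by isOutlier.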

If min_diff is not NA, the minimum distance from the median required to define an outlier is set as the larger of nmads MADs and min_diff. This aims to avoid calling many outliers when the MAD is very small, e.g., due to discreteness of the metric. If log=TRUE, this difference is defined on the log2 scale.
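To illustrate why min_diff helps with discrete metrics, the following base R sketch uses a vector whose MAD is zero. The helper flag_outliers is hypothetical and only mirrors the rule stated above (the larger of nmads MADs and min_diff); it is not taken from scater.

```r
# Hypothetical helper (not part of scater): apply an optional minimum distance.
flag_outliers <- function(x, nmads = 3, min_diff = NA) {
  med <- median(x)
  dist <- nmads * mad(x, center = med)
  # Use the larger of the MAD-based distance and min_diff, if supplied.
  if (!is.na(min_diff)) dist <- max(dist, min_diff)
  x < med - dist | x > med + dist
}

# Most values equal the median, so the MAD is zero.
x <- c(5, 5, 5, 5, 5, 5, 5, 6, 4, 9)
n_without <- sum(flag_outliers(x))              # every value that is not 5
n_with <- sum(flag_outliers(x, min_diff = 2))   # only values more than 2 away
```

Without min_diff, any deviation from the median is flagged; with min_diff = 2, only the value 9 remains.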

If subset is specified, the median and MAD are computed from that subset of cells only, and the resulting thresholds are applied to all cells. In a quality control context, this can be handy for excluding groups of cells that are known to be low quality (e.g., failed plates) so that they do not distort the outlier definitions for the rest of the dataset.
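The effect of subset can be sketched in base R as follows. The helper flag_with_subset and the toy values are hypothetical; the sketch only assumes that the median and MAD are estimated from the chosen subset and then applied to every value.

```r
# Hypothetical helper (not part of scater): estimate on a subset, apply to all.
flag_with_subset <- function(x, subset = NULL, nmads = 3) {
  ref <- if (is.null(subset)) x else x[subset]
  med <- median(ref)
  dist <- nmads * mad(ref, center = med)
  x < med - dist | x > med + dist
}

good <- c(100, 102, 98, 101, 99, 103)   # cells from a healthy plate
failed <- c(10, 12, 11, 9, 13, 10)      # cells from a failed plate
x <- c(good, failed)
keep <- rep(c(TRUE, FALSE), each = 6)   # estimate from the healthy plate only

flag_all <- flag_with_subset(x)                  # failed plate inflates the MAD
flag_sub <- flag_with_subset(x, subset = keep)   # failed plate is now flagged
```

When all values are used, the failed plate makes up half of the data and widens the thresholds so that nothing is flagged; restricting the estimation to the healthy plate flags every failed cell.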

Missing values trigger a warning and are automatically ignored during estimation of the median and MAD. The corresponding entries of the output vector are also set to NA values.
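A minimal base R sketch of this behaviour, assuming that missing values are simply dropped when estimating the median and MAD (the warning itself is not reproduced here):

```r
x <- c(10, 11, NA, 12, 13, 14, 100)

# Drop missing values when estimating the location and spread.
med <- median(x, na.rm = TRUE)
dist <- 3 * mad(x, center = med, na.rm = TRUE)

# Comparisons involving NA yield NA, so the missing entry stays NA.
out <- x < med - dist | x > med + dist
```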

Value

A logical vector of the same length as the metric argument, specifying the observations that are considered as outliers.

Handling batches

If batch is specified, outliers are defined within each batch separately using batch-specific median and MAD values. This gives the same results as if the input metrics were subsetted by batch and isOutlier was run on each subset, and is often useful when batches are known a priori to have technical differences (e.g., in sequencing depth).
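The equivalence with running the detection on each batch separately can be sketched in base R. The helper flag_outliers and the toy values are hypothetical; split and unsplit stand in for whatever scater does internally.

```r
# Hypothetical helper (not part of scater): flag values beyond nmads MADs.
flag_outliers <- function(x, nmads = 3) {
  med <- median(x)
  dist <- nmads * mad(x, center = med)
  x < med - dist | x > med + dist
}

# Two batches that differ roughly tenfold in scale.
x <- c(10, 11, 12, 13, 50, 100, 110, 120, 130, 500)
batch <- factor(rep(c("A", "B"), each = 5))

# Run the detection within each batch, then restore the original order.
by_batch <- unsplit(lapply(split(x, batch), flag_outliers), batch)

# Ignoring the batch structure pools both scales into one median and MAD.
pooled <- flag_outliers(x)
```

Here the value 50 is an outlier within batch A but is missed when the batches are pooled, because the pooled MAD is dominated by the difference between batches.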

If share_medians=TRUE, a shared median is computed across all cells. If share_mads=TRUE, a shared MAD is computed using all cells (from either a batch-specific or shared median, depending on share_medians). These settings are useful to enforce a common location or spread across batches, e.g., we might set share_mads=TRUE for log-library sizes if coverage varies across batches but the variance across cells is expected to be consistent across batches.
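One way to picture a shared MAD with batch-specific medians is the base R sketch below. It is an assumption about the calculation, not scater's code: deviations are taken from each cell's own batch median and pooled across all cells, and the scaling constant 1.4826 is the default used by stats::mad.

```r
x <- c(10, 11, 12, 13, 14, 20, 22, 24, 26, 28)
batch <- factor(rep(c("A", "B"), each = 5))

# Batch-specific medians (the share_medians=FALSE case).
meds <- vapply(split(x, batch), median, numeric(1))

# Shared MAD: pool each cell's deviation from its own batch median.
centers <- meds[as.character(batch)]
shared_mad <- 1.4826 * median(abs(x - centers))

# Each batch keeps its own location but uses the same spread.
lower <- meds - 3 * shared_mad
upper <- meds + 3 * shared_mad
```

Both batches end up with thresholds of identical width, centred on their own medians.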

If a batch does not have sufficient cells to compute the median or MAD (e.g., after applying subset), the default setting of share_missing=TRUE will set these values to the shared median and MAD. This allows us to define thresholds for low-quality batches based on information in the rest of the dataset. (Note that the use of shared values only affects this batch and not others unless share_medians and share_mads are also set.) Otherwise, if share_missing=FALSE, all cells in that batch will have NA in the output.
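The fallback can be sketched in base R as follows. The helper flag_by_batch, its use argument, and the toy values are hypothetical, and the sketch assumes that a batch with no cells left after subsetting yields NA for its median and MAD.

```r
# Hypothetical helper (not part of scater): per-batch detection with a fallback.
flag_by_batch <- function(x, batch, use, share_missing = TRUE, nmads = 3) {
  per_batch <- split(x[use], batch[use])   # empty levels are kept
  meds <- vapply(per_batch, median, numeric(1))
  mads <- vapply(per_batch, mad, numeric(1))
  if (share_missing) {
    # Fill in batches without estimates using values from all retained cells.
    meds[is.na(meds)] <- median(x[use])
    mads[is.na(mads)] <- mad(x[use])
  }
  centers <- meds[as.character(batch)]
  dist <- nmads * mads[as.character(batch)]
  unname(x < centers - dist | x > centers + dist)
}

x <- c(10, 11, 12, 13, 14, 1, 2, 3)
batch <- factor(rep(c("A", "B"), c(5, 3)))
use <- batch == "A"   # batch B is excluded from estimation entirely

shared <- flag_by_batch(x, batch, use, share_missing = TRUE)
unshared <- flag_by_batch(x, batch, use, share_missing = FALSE)
```

With the fallback, the cells in batch B are judged against the thresholds from the retained cells and are all flagged; without it, their entries are NA.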

If batch is specified, the "threshold" attribute in the returned vector is a matrix with one named column per level of batch and two rows (one per threshold).
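The shape of such a matrix can be sketched in base R. The row names "lower" and "higher" and the toy values are choices made for this sketch; only the layout (two rows, one named column per batch level) follows the description above.

```r
x <- c(10, 11, 12, 13, 50, 100, 110, 120, 130, 500)
batch <- factor(rep(c("A", "B"), each = 5))

# One column of limits per batch level, one row per threshold.
limits <- vapply(split(x, batch), function(v) {
  med <- median(v)
  dist <- 3 * mad(v, center = med)
  c(lower = med - dist, higher = med + dist)
}, c(lower = 0, higher = 0))
```

For a real call with batch specified, the equivalent matrix would be read with attr(result, "threshold"), and the limits for a single batch taken by column name.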

Author(s)

Aaron Lun

See Also

quickPerCellQC, a convenience wrapper to perform outlier-based quality control.

perCellQCMetrics, to compute potential QC metrics.

Examples

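library(scater)

# Simulate a SingleCellExperiment and compute per-cell QC metrics.
example_sce <- mockSCE()
stats <- perCellQCMetrics(example_sce)

# Flag outliers in the total counts at both tails, or at one tail only.
str(isOutlier(stats$sum))
str(isOutlier(stats$sum, type="lower"))
str(isOutlier(stats$sum, type="higher"))

# Compute the median and MAD on the log2 scale.
str(isOutlier(stats$sum, log=TRUE))

# Define outliers within randomly assigned batches.
b <- sample(LETTERS[1:3], ncol(example_sce), replace=TRUE)
str(isOutlier(stats$sum, log=TRUE, batch=b))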


[Package scater version 1.16.2 Index]