motif_pvalue {universalmotif}R Documentation

Motif P-value and scoring utility

Description

For calculating p-values/logodds scores for any number of motifs.

Usage

motif_pvalue(motifs, score, pvalue, bkg.probs, use.freq = 1, k = 6,
  progress = ifelse(length(motifs) > 1, TRUE, FALSE), BP = FALSE)

Arguments

motifs

See convert_motifs() for acceptable motif formats.

score

numeric Get a p-value for a motif from a logodds score.

pvalue

numeric Get a logodds score for a motif from a p-value.

bkg.probs

numeric, list If supplying individual background probabilities for each motif, a list. If missing, retrieves the background from the motif bkg slot. Note that this only influences calculating p-values from an input score; calculating a score from an input p-value currently assumes a uniform background.

use.freq

numeric(1) By default uses the regular motif matrix; otherwise uses the corresponding multifreq matrix. Max is 3.

k

numeric(1) For speed, scores/p-values can be approximated after subsetting the motif every k columns. If k is a value equal or higher to the size of input motif(s), then the calculations are (nearly) exact. The default, 6, is recommended to those looking for a good tradeoff between speed and accuracy for jobs requiring repeated calculations.

progress

logical(1) Show progress. Not recommended if BP = TRUE.

BP

logical(1) Allows the use of BiocParallel within motif_pvalue(). See BiocParallel::register() to change the default backend. Setting BP = TRUE is only recommended for exceptionally large jobs. Furthermore, the behaviour of progress = TRUE is changed if BP = TRUE; the default BiocParallel progress bar will be shown (which unfortunately is much less informative).

Details

Calculating p-values for motifs can be very computationally intensive. This is due to how p-values must be calculated: for a given score, all possible sequences which score equal or higher must be found, and the probability for each of these sequences (based on background probabilities) summed. For a DNA motif of length 10, the number of possible unique sequences is 4^10 = 1,048,576. Finding all possible sequences higher than a given score can be done very efficiently and quickly with a branch-and-bound algorithm, but as the motif length increases this calculation becomes impractical. To get around this, the p-value calculation can be approximated.

In order to calculate p-values for longer motifs, this function uses the approximation proposed by Hartmann et al. (2013), where the motif is subset, p-values calculated for the subsets, and finally combined for a total p-value. The smaller the size of the subsets, the faster the calculation; but also, the bigger the approximation. This can be controlled by setting k. In fact, for smaller motifs (< 13 positions) calculating exact p-values can be done individually in reasonable time by setting k = 12.

To calculate a score based on a given p-value, the means and variances of each motif subsets are combined to estimate the distribution of all possible scores using stats::qnorm():

qnorm(pvalue, mean = sum(subset.means), sd = sqrt(sum(subset.vars)))

For calculating exact scores, stats::ecdf() and stats::quantile() are used:

quantile(ecdf(scores), probs = pvalue)

It is important to keep in mind that both approximate and exact score calculations assume uniform backgrounds, so do not use this function for motifs with extremely imbalanced backgrounds. To get all possible scores for each subset, expand.grid() is used instead of the branch-and-bound algorithm used for calculating p-values. Keep this in mind for determining the best k value for motifs with alphabets longer than those of DNA/RNA motifs.

Value

numeric A vector of scores/p-values.

Author(s)

Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca

References

Hartmann H, Guthöhrlein E, Siebert M, and S. Luehr J. Söding (2013). “P-value-based regulatory motif discovery using positional weight matrices.” Genome Research, 23, 181–194.

See Also

motif_score()

Examples

data(examplemotif)

## p-value/score calculations are performed using the PWM version of the
## motif; these calculations do not work if any -Inf values are present
examplemotif["pseudocount"] <- 1
# or
examplemotif <- normalize(examplemotif)

## get a minimum score based on a p-value
motif_pvalue(examplemotif, pvalue = 0.001)

## get the probability of a particular sequence hit
motif_pvalue(examplemotif, score = 0)

## the calculations can be performed for multiple motifs
motif_pvalue(list(examplemotif, examplemotif), pvalue = c(0.001, 0.0001))

## Compare score thresholds and P-value:
scores <- motif_score(examplemotif, c(0.6, 0.7, 0.8, 0.9))
motif_pvalue(examplemotif, scores)

## Calculate the probability of getting a certain match or better:
TATATAT <- score_match(examplemotif, "TATATAT")
TATATAG <- score_match(examplemotif, "TATATAG")
motif_pvalue(examplemotif, TATATAT)
motif_pvalue(examplemotif, TATATAG)

## Get all possible matches by P-value:
get_matches(examplemotif, motif_pvalue(examplemotif, pvalue = 0.0001))


[Package universalmotif version 1.2.1 Index]