scan_sequences {universalmotif}R Documentation

Scan sequences for matches to input motifs.

Description

For sequences of any alphabet, scan them using the PWM matrices of a set of input motifs.

Usage

scan_sequences(motifs, sequences, threshold = 0.001,
  threshold.type = "pvalue", RC = FALSE, use.freq = 1, verbose = 1,
  progress = TRUE, BP = FALSE)

Arguments

motifs

See convert_motifs() for acceptable motif formats.

sequences

XStringSet Sequences to scan. Alphabet should match motif.

threshold

numeric(1) Between 0 and 1. See details.

threshold.type

character(1) One of c('logodds', 'pvalue'). See details.

RC

logical(1) If TRUE, check reverse complement of input sequences.

use.freq

numeric(1) The default, 1, uses the motif matrix (from the motif['motif'] slot) to search for sequences. If a higher number is used, then the matching k-let matrix from the motif['multifreq'] slot is used. See add_multifreq().

verbose

numeric(1) Describe progress, from none (0) to very verbose (3).

progress

logical(1) Show progress. Not recommended if BP = TRUE. Set to FALSE if verbose = 0.

BP

logical(1) Allows for the use of BiocParallel within scan_sequences(). See BiocParallel::register() to change the default backend. Setting BP = TRUE is only recommended for exceptionally large jobs. Keep in mind however that this function will not attempt to limit its memory usage. Furthermore, the behaviour of porgress = TRUE is changed if BP = TRUE; the default BiocParallel progress bar will be shown (which unfortunately is much less informative).

Details

Similar to Biostrings::matchPWM(), the scanning method uses logodds scoring. (To see the scoring matrix for any motif, simply run convert_type(motif, "PWM"); for a multifreq scoring matrix: apply(motif["multifreq"]$2, 2, ppm_to_pwm)). In order to score a sequence, at each position within a sequence of length equal to the length of the motif, the scores for each base are summed. If the score sum is above the desired threshold, it is kept.

If threshold.type = 'logodds', then to calculate the minimum allowed score the total possible score for a motif is multiplied by the value set by threshold. To determine the maximum and minimum possible scores a motif (of type PWM), run sum(apply(motif['motif'], 2, max)) and sum(apply(motif['motif'], 2, min)). If threshold.type = 'pvalue', then threshold logodds scores are generated using motif_pvalue().

Non-standard letters (such as "N", "+", "-", ".", etc in DNAString objects) will be safely ignored, resulting only in a warning and a very minor performance cost. This can used to scan masked sequences. See Biostrings::mask() for masking sequences (generating MaskedXString objects), and Biostrings::injectHardMask() to recover masked XStringSet objects for use with scan_sequences().

Value

data.frame with each row representing one hit; if the input sequences are DNAStringSet or RNAStringSet, then an additional column with the strand is included.

Author(s)

Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca

References

Pagès H, Aboyoun P, Gentleman R, DebRoy S (2018). Biostrings: Efficient manipulation of biological strings. R package version 2.48.0.

See Also

add_multifreq(), Biostrings::matchPWM(), enrich_motifs(), motif_pvalue()

Examples

## any alphabet can be used
## Not run: 
set.seed(1)
alphabet <- paste(c(letters), collapse = "")
motif <- create_motif("hello", alphabet = alphabet)
sequences <- create_sequences(alphabet, seqnum = 1000, seqlen = 100000)
scan_sequences(motif, sequences)

## End(Not run)

## Sequence masking:
library(Biostrings)
data(ArabidopsisMotif)
data(ArabidopsisPromoters)
seq <- ArabidopsisPromoters[[1]]  # Only works for XString, not XStringSet
seq <- mask(seq, pattern = "AAAA")  # MaskedDNAString class
seq <- injectHardMask(seq, letter = "+")  # Recover XString
seq <- DNAStringSet(seq)  # scan_sequences() needs XStringSet
scan_sequences(ArabidopsisMotif, seq, verbose = 0, progress = FALSE)
# A warning regarding the presence of non-standard letters will be given,
# but can be safely ignored in this case.


[Package universalmotif version 1.2.1 Index]