scan_sequences {universalmotif}R Documentation

Scan sequences for matches to input motifs.

Description

For sequences of any alphabet, scan them using the PWM matrices of a set of input motifs.

Usage

scan_sequences(motifs, sequences, threshold = 0.001,
  threshold.type = "pvalue", RC = FALSE, use.freq = 1, verbose = 0,
  nthreads = 1, motif_pvalue.k = 8)

Arguments

motifs

See convert_motifs() for acceptable motif formats.

sequences

XStringSet Sequences to scan. Alphabet should match motif.

threshold

numeric(1) See details.

threshold.type

character(1) One of c('logodds', 'logodds.abs', 'pvalue'). See details.

RC

logical(1) If TRUE, check reverse complement of input sequences.

use.freq

numeric(1) The default, 1, uses the motif matrix (from the motif['motif'] slot) to search for sequences. If a higher number is used, then the matching k-let matrix from the motif['multifreq'] slot is used. See add_multifreq().

verbose

numeric(1) Describe progress, from none (0) to verbose (3).

nthreads

numeric(1) Run scan_sequences() in parallel with nthreads threads. nthreads = 0 uses all available threads. Note that no speed up will occur for jobs with only a single motif and sequence.

motif_pvalue.k

numeric(1) Control motif_pvalue() approximation. See motif_pvalue().

Details

Similar to Biostrings::matchPWM(), the scanning method uses logodds scoring. (To see the scoring matrix for any motif, simply run convert_type(motif, "PWM"). For a multifreq scoring matrix: apply(motif["multifreq"]$2, 2, ppm_to_pwm)). In order to score a sequence, at each position within a sequence of length equal to the length of the motif, the scores for each base are summed. If the score sum is above the desired threshold, it is kept.

If threshold.type = 'logodds', then to calculate the minimum allowed score the max possible score for a motif is multiplied by the value set by threshold. To determine the maximum possible scores a motif (of type PWM), run motif_score(motif, 1). If threshold.type = 'pvalue', then threshold logodds scores are generated using motif_pvalue(). Finally, if threshold.type = 'logodds.abs', then the exact values provided will be used as thresholds.

Non-standard letters (such as "N", "+", "-", ".", etc in DNAString objects) will be safely ignored, resulting only in a warning and a very minor performance cost. This can used to scan masked sequences. See Biostrings::mask() for masking sequences (generating MaskedXString objects), and Biostrings::injectHardMask() to recover masked XStringSet objects for use with scan_sequences().

Value

DataFrame with each row representing one hit. If the input sequences are DNAStringSet or RNAStringSet, then an additional column with the strand is included. Function args are stored in the metadata slot.

Author(s)

Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca

See Also

add_multifreq(), Biostrings::matchPWM(), enrich_motifs(), motif_pvalue()

Examples

## any alphabet can be used
## Not run: 
set.seed(1)
alphabet <- paste(c(letters), collapse = "")
motif <- create_motif("hello", alphabet = alphabet)
sequences <- create_sequences(alphabet, seqnum = 1000, seqlen = 100000)
scan_sequences(motif, sequences)

## End(Not run)

## Sequence masking:
if (R.Version()$arch != "i386") {
library(Biostrings)
data(ArabidopsisMotif)
data(ArabidopsisPromoters)
seq <- ArabidopsisPromoters[[1]]  # Only works for XString, not XStringSet
seq <- mask(seq, pattern = "AAAA")  # MaskedDNAString class
seq <- injectHardMask(seq, letter = "+")  # Recover XString
seq <- DNAStringSet(seq)  # scan_sequences() needs XStringSet
scan_sequences(ArabidopsisMotif, seq)
# A warning regarding the presence of non-standard letters will be given,
# but can be safely ignored in this case.

# If you'd like to do it to a whole DNAStringSet object:
seq <- ArabidopsisPromoters
for (i in 1:length(ArabidopsisPromoters)) {
  seq[[i]] <- injectHardMask(mask(seq[[i]], pattern = "AAAA"), letter = "+")
}
scan_sequences(ArabidopsisMotif, seq)
}


[Package universalmotif version 1.6.4 Index]