scan_sequences {universalmotif} | R Documentation |
For sequences of any alphabet, scan them using the PWM matrices of a set of input motifs.
scan_sequences(motifs, sequences, threshold = 0.001, threshold.type = "pvalue", RC = FALSE, use.freq = 1, verbose = 1, progress = TRUE, BP = FALSE)
motifs |
See |
sequences |
|
threshold |
|
threshold.type |
|
RC |
|
use.freq |
|
verbose |
|
progress |
|
BP |
|
Similar to Biostrings::matchPWM()
, the scanning method uses
logodds scoring. (To see the scoring matrix for any motif, simply
run convert_type(motif, "PWM")
; for a multifreq
scoring
matrix: apply(motif["multifreq"]$
2, 2, ppm_to_pwm)
). In order
to score a sequence, at each position within a sequence of length equal
to the length of the motif, the scores for each base are summed. If the
score sum is above the desired threshold, it is kept.
If threshold.type = 'logodds'
, then to calculate the minimum
allowed score the total possible score for a motif is multiplied
by the value set by threshold
. To determine the maximum and minimum
possible scores a motif (of type PWM), run
sum(apply(motif['motif'], 2, max))
and
sum(apply(motif['motif'], 2, min))
. If threshold.type = 'pvalue'
,
then threshold logodds scores are generated using motif_pvalue()
.
Non-standard letters (such as "N", "+", "-", ".", etc in DNAString
objects) will be safely ignored, resulting only in a warning and a very
minor performance cost. This can used to scan
masked sequences. See Biostrings::mask()
for masking sequences
(generating MaskedXString
objects), and Biostrings::injectHardMask()
to recover masked XStringSet
objects for use with scan_sequences()
.
data.frame
with each row representing one hit; if the input
sequences are DNAStringSet
or
RNAStringSet
, then an
additional column with the strand is included.
Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca
Pagès H, Aboyoun P, Gentleman R, DebRoy S (2018). Biostrings: Efficient manipulation of biological strings. R package version 2.48.0.
add_multifreq()
, Biostrings::matchPWM()
,
enrich_motifs()
, motif_pvalue()
## any alphabet can be used ## Not run: set.seed(1) alphabet <- paste(c(letters), collapse = "") motif <- create_motif("hello", alphabet = alphabet) sequences <- create_sequences(alphabet, seqnum = 1000, seqlen = 100000) scan_sequences(motif, sequences) ## End(Not run) ## Sequence masking: library(Biostrings) data(ArabidopsisMotif) data(ArabidopsisPromoters) seq <- ArabidopsisPromoters[[1]] # Only works for XString, not XStringSet seq <- mask(seq, pattern = "AAAA") # MaskedDNAString class seq <- injectHardMask(seq, letter = "+") # Recover XString seq <- DNAStringSet(seq) # scan_sequences() needs XStringSet scan_sequences(ArabidopsisMotif, seq, verbose = 0, progress = FALSE) # A warning regarding the presence of non-standard letters will be given, # but can be safely ignored in this case.