scan_sequences {universalmotif} | R Documentation |
For sequences of any alphabet, scan them using the PWM matrices of a set of input motifs.
scan_sequences(motifs, sequences, threshold = 0.001, threshold.type = "pvalue", RC = FALSE, use.freq = 1, verbose = 0, nthreads = 1, motif_pvalue.k = 8)
motifs |
See |
sequences |
|
threshold |
|
threshold.type |
|
RC |
|
use.freq |
|
verbose |
|
nthreads |
|
motif_pvalue.k |
|
Similar to Biostrings::matchPWM()
, the scanning method uses
logodds scoring. (To see the scoring matrix for any motif, simply
run convert_type(motif, "PWM")
. For a multifreq
scoring
matrix: apply(motif["multifreq"]$
2, 2, ppm_to_pwm)
). In order
to score a sequence, at each position within a sequence of length equal
to the length of the motif, the scores for each base are summed. If the
score sum is above the desired threshold, it is kept.
If threshold.type = 'logodds'
, then to calculate the minimum
allowed score the max possible score for a motif is multiplied
by the value set by threshold
. To determine the maximum
possible scores a motif (of type PWM), run
motif_score(motif, 1)
. If threshold.type = 'pvalue'
,
then threshold logodds scores are generated using motif_pvalue()
.
Finally, if threshold.type = 'logodds.abs'
, then the exact values
provided will be used as thresholds.
Non-standard letters (such as "N", "+", "-", ".", etc in DNAString
objects) will be safely ignored, resulting only in a warning and a very
minor performance cost. This can used to scan
masked sequences. See Biostrings::mask()
for masking sequences
(generating MaskedXString
objects), and Biostrings::injectHardMask()
to recover masked XStringSet
objects for use with scan_sequences()
.
DataFrame
with each row representing one hit. If the input
sequences are DNAStringSet
or
RNAStringSet
, then an
additional column with the strand is included. Function args are stored
in the metadata
slot.
Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca
add_multifreq()
, Biostrings::matchPWM()
,
enrich_motifs()
, motif_pvalue()
## any alphabet can be used ## Not run: set.seed(1) alphabet <- paste(c(letters), collapse = "") motif <- create_motif("hello", alphabet = alphabet) sequences <- create_sequences(alphabet, seqnum = 1000, seqlen = 100000) scan_sequences(motif, sequences) ## End(Not run) ## Sequence masking: if (R.Version()$arch != "i386") { library(Biostrings) data(ArabidopsisMotif) data(ArabidopsisPromoters) seq <- ArabidopsisPromoters[[1]] # Only works for XString, not XStringSet seq <- mask(seq, pattern = "AAAA") # MaskedDNAString class seq <- injectHardMask(seq, letter = "+") # Recover XString seq <- DNAStringSet(seq) # scan_sequences() needs XStringSet scan_sequences(ArabidopsisMotif, seq) # A warning regarding the presence of non-standard letters will be given, # but can be safely ignored in this case. # If you'd like to do it to a whole DNAStringSet object: seq <- ArabidopsisPromoters for (i in 1:length(ArabidopsisPromoters)) { seq[[i]] <- injectHardMask(mask(seq[[i]], pattern = "AAAA"), letter = "+") } scan_sequences(ArabidopsisMotif, seq) }