compare_motifs {universalmotif} | R Documentation |
Compare motifs using one of the several available metrics. See the "Motif comparisons and P-values" vignette for detailed information.
compare_motifs(motifs, compare.to, db.scores, use.freq = 1,
  use.type = "PPM", method = "ALLR", tryRC = TRUE, min.overlap = 6,
  min.mean.ic = 0.25, min.position.ic = 0, relative_entropy = FALSE,
  normalise.scores = FALSE, max.p = 0.01, max.e = 10, nthreads = 1,
  score.strat = "a.mean", output.report, output.report.max.print = 10)
motifs |
See |
compare.to |
|
db.scores |
|
use.freq |
|
use.type |
|
method |
|
tryRC |
|
min.overlap |
|
min.mean.ic |
|
min.position.ic |
|
relative_entropy |
|
normalise.scores |
|
max.p |
|
max.e |
|
nthreads |
|
score.strat |
|
output.report |
|
output.report.max.print |
|
The following metrics are available:
Euclidean distance (EUCL) (Choi et al. 2004)
Weighted Euclidean distance (WEUCL)
Kullback-Leibler divergence (KL) (Kullback and Leibler 1951; Roepcke et al. 2005)
Hellinger distance (HELL) (Hellinger 1909)
Squared Euclidean distance (SEUCL)
Manhattan distance (MAN)
Pearson correlation coefficient (PCC)
Weighted Pearson correlation coefficient (WPCC)
Sandelin-Wasserman similarity (SW), or sum of squared distances (Sandelin and Wasserman 2004)
Average log-likelihood ratio (ALLR) (Wang and Stormo 2003)
Lower limit ALLR (ALLR_LL) (Mahony et al. 2007)
Bhattacharyya coefficient (BHAT) (Bhattacharyya 1943)
Comparisons are calculated between two motifs at a time. All possible alignments
are scored, and the best score is reported. Within an alignment, scores are
calculated individually between columns. How those column scores are combined to
generate the final alignment score depends on score.strat.
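The alignment procedure described above can be sketched in base R. This is a toy illustration only, not the universalmotif implementation: the helper score_alignments(), its min_overlap argument, and the two example matrices are hypothetical, the column metric is plain Euclidean distance, and the column scores are combined with an arithmetic mean (assumed here to correspond to score.strat = "a.mean"). Reverse complements and information content filters are ignored.

```r
# Toy sketch (not the universalmotif implementation): score every ungapped
# alignment of two PPMs column by column, then combine the column scores
# with an arithmetic mean. Rows are A, C, G, T; columns are motif positions.
ppm1 <- matrix(c(0.7, 0.1, 0.1, 0.1,
                 0.1, 0.7, 0.1, 0.1,
                 0.1, 0.1, 0.7, 0.1), nrow = 4)
ppm2 <- matrix(c(0.25, 0.25, 0.25, 0.25,
                 0.7, 0.1, 0.1, 0.1,
                 0.1, 0.7, 0.1, 0.1,
                 0.1, 0.1, 0.7, 0.1), nrow = 4)

# Euclidean distance between two columns (lower means more similar).
col_eucl <- function(x, y) sqrt(sum((x - y)^2))

# Hypothetical helper: slide m1 along m2, keeping only offsets with at least
# min_overlap overlapping columns, and return the mean column score per offset.
score_alignments <- function(m1, m2, min_overlap = 2) {
  offsets <- seq(-(ncol(m1) - min_overlap), ncol(m2) - min_overlap)
  scores <- vapply(offsets, function(off) {
    i1 <- seq_len(ncol(m1))          # columns of m1
    i2 <- i1 + off                   # matching columns of m2 at this offset
    keep <- i2 >= 1 & i2 <= ncol(m2) # drop columns that overhang m2
    mean(mapply(function(a, b) col_eucl(m1[, a], m2[, b]),
                i1[keep], i2[keep]))
  }, numeric(1))
  data.frame(offset = offsets, score = scores)
}

res <- score_alignments(ppm1, ppm2)
# EUCL is a distance, so the best alignment has the minimum score.
best <- res[which.min(res$score), ]
print(best)
```

For a similarity metric such as PCC the same loop would apply, with which.max() in place of which.min() when picking the best alignment.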
See the "Motif comparisons and P-values" vignette for a description of the
various metrics. Note that PCC, WPCC, SW, ALLR, ALLR_LL and BHAT are similarities;
higher values mean more similar motifs. For the remaining metrics, values closer
to zero represent more similar motifs.
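The difference in direction between similarity and distance metrics can be seen on a single pair of columns. The following base R sketch uses made-up probability columns and plain textbook formulas for PCC and EUCL; it is an illustration under those assumptions and does not call the package.

```r
# Two hypothetical PPM columns (A, C, G, T probabilities).
x <- c(0.7, 0.1, 0.1, 0.1)
y <- c(0.1, 0.1, 0.1, 0.7)

# Pearson correlation is a similarity: identical columns score highest.
pcc_same <- cor(x, x)
pcc_diff <- cor(x, y)

# Euclidean distance is a distance: identical columns score zero.
eucl_same <- sqrt(sum((x - x)^2))
eucl_diff <- sqrt(sum((x - y)^2))

print(c(pcc_same = pcc_same, pcc_diff = pcc_diff))
print(c(eucl_same = eucl_same, eucl_diff = eucl_diff))
```

When ranking results by hand, sort in decreasing order for the similarity metrics and in increasing order for the distance metrics.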
Small pseudocounts are automatically added when one of the following methods
is used: KL, ALLR, ALLR_LL, IS. This is to avoid zeros in the calculations.
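To see why zeros are a problem, consider a plain Kullback-Leibler divergence between two columns. The sketch below is base R only; the kl() and add_pseudo() helpers and the pseudocount value of 0.01 are assumptions made for illustration, and the package's own KL formula and pseudocount size may differ.

```r
# Textbook KL divergence between two probability vectors.
kl <- function(p, q) sum(p * log(p / q))

# Hypothetical helper: add a small pseudocount, then renormalise to sum to 1.
add_pseudo <- function(x, pseudo = 0.01) {
  x <- x + pseudo
  x / sum(x)
}

p <- c(0.5, 0.5, 0, 0)          # column with zero-probability letters
q <- c(0.25, 0.25, 0.25, 0.25)  # uniform column

raw <- kl(q, p)                           # division by zero inside the log
fixed <- kl(add_pseudo(q), add_pseudo(p)) # zeros removed before scoring

print(is.finite(raw))
print(is.finite(fixed))
```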
A note regarding P-values: P-values are precomputed using the make_DBscores()
function. If db.scores is not given, a set of internal precomputed P-values
from the JASPAR2018 CORE motifs is used. These precalculated scores depend on
the lengths of the motifs being compared. This takes into account that comparing
small motifs with larger motifs leads to higher scores, since the probability of
finding a higher scoring alignment is higher.
The default P-values have been precalculated for regular DNA motifs. They
are of little use for motifs with a different number of alphabet letters
(or even the multifreq slot).
matrix if compare.to is missing; DataFrame otherwise. For the latter, function
args are stored in the metadata slot.
Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca
Bhattacharyya A (1943). “On a measure of divergence between two statistical populations defined by their probability distributions.” Bulletin of the Calcutta Mathematical Society, 35, 99–109.
Choi I, Kwon J, Kim S (2004). “Local feature frequency profile: a method to measure structural similarity in proteins.” PNAS, 101, 3797–3802.
Hellinger E (1909). “Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen.” Journal für die reine und angewandte Mathematik, 136, 210–271.
Itakura F, Saito S (1968). “Analysis synthesis telephony based on the maximum likelihood method.” In 6th International Congress on Acoustics, C-17–C-20.
Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon J, van der Lee R, Bessy A, Cheneby J, Kulkarni SR, Tan G, Baranasic D, Arenillas D, Sandelin A, Vandepoele K, Lenhard B, Ballester B, Wasserman W, Parcy F, Mathelier A (2018). “JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework.” Nucleic Acids Research, 46, D260–D266.
Kullback S, Leibler RA (1951). “On information and sufficiency.” The Annals of Mathematical Statistics, 22, 79–86.
Mahony S, Auron P, Benos P (2007). “DNA Familial Binding Profiles Made Easy: Comparison of Various Motif Alignment and Clustering Strategies.” PLoS Computational Biology, 3(3), e61.
Pietrokovski S (1996). “Searching databases of conserved sequence regions by aligning protein multiple-alignments.” Nucleic Acids Research, 24, 3836–3845.
Roepcke S, Grossmann S, Rahmann S, Vingron M (2005). “T-Reg Comparator: an analysis tool for the comparison of position weight matrices.” Nucleic Acids Research, 33, W438–W441.
Sandelin A, Wasserman W (2004). “Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics.” Journal of Molecular Biology, 338(2), 207–215.
Wang T, Stormo G (2003). “Combining phylogenetic data with co-regulated genes to identify motifs.” Bioinformatics, 19(18), 2369–2380.
convert_motifs(), motif_tree(), view_motifs(), make_DBscores()
motif1 <- create_motif(name = "1")
motif2 <- create_motif(name = "2")
motif1vs2 <- compare_motifs(c(motif1, motif2), method = "PCC")
## To get a dist object:
as.dist(1 - motif1vs2)

motif3 <- create_motif(name = "3")
motif4 <- create_motif(name = "4")
motifs <- c(motif1, motif2, motif3, motif4)
## Compare motif "2" to all the other motifs:
if (R.Version()$arch != "i386") {
  compare_motifs(motifs, compare.to = 2, max.p = 1, max.e = Inf)
}