shuffle_sequences {universalmotif}    R Documentation

Shuffle input sequences.

Description

Given a set of input sequences, shuffle the letters within those sequences with any k-let size.

Usage

shuffle_sequences(sequences, k = 1, method = "euler",
  leftovers = "asis", progress = FALSE, BP = FALSE)

Arguments

sequences

XStringSet Set of sequences to shuffle. Works with any set of characters.

k

numeric(1) K-let size.

method

character(1) One of c('euler', 'markov', 'linear', 'random'). Only relevant if k > 1. See details. The 'random' method is deprecated and will be removed in the next minor version.

leftovers

character(1) For method = 'random'. One of c('asis', 'first', 'split', 'discard').

progress

logical(1) Show progress. Not recommended if BP = TRUE.

BP

logical(1) Allows the use of BiocParallel within shuffle_sequences(). See BiocParallel::register() to change the default backend. Setting BP = TRUE is only recommended for large jobs (such as shuffling billions of letters). Furthermore, the behaviour of progress = TRUE is changed if BP = TRUE; the default BiocParallel progress bar will be shown (which unfortunately is much less informative).

Details

If method = 'markov', then a Markov model is used to generate sequences which will maintain (on average) the k-let frequencies. Please note that this method is not a 'true' shuffling, and for short sequences (e.g. <100bp) it can produce slightly more dissimilar sequences than true shuffling would. See Fitch (1983) for a discussion on the topic.
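As an illustration of the idea only, the following base R sketch generates a sequence from a first-order Markov model (k = 2) estimated from a single input string. The helper name markov_shuffle_k2() is hypothetical, and the fallback used when a letter has no observed successor is an assumption; this is not the package's implementation.

```r
# Hypothetical helper: first-order Markov "shuffle" of one string (k = 2).
markov_shuffle_k2 <- function(s) {
  x <- strsplit(s, "")[[1]]
  n <- length(x)
  if (n < 2) return(s)
  alphabet <- sort(unique(x))
  # Transition counts: rows = current letter, columns = next letter
  trans <- unclass(table(factor(x[-n], levels = alphabet),
                         factor(x[-1], levels = alphabet)))
  # Overall letter counts, used as a fallback (an assumption of this sketch)
  letter_counts <- tabulate(match(x, alphabet), nbins = length(alphabet))
  out <- character(n)
  out[1] <- sample(x, 1)
  for (i in 2:n) {
    w <- as.numeric(trans[out[i - 1], ])
    # A letter seen only at the end of the input has no successors
    if (sum(w) == 0) w <- letter_counts
    out[i] <- alphabet[sample.int(length(alphabet), 1, prob = w)]
  }
  paste(out, collapse = "")
}

markov_shuffle_k2("ACAGATAGACCC")
```

Because every letter is drawn independently given its predecessor, the 2-let counts of the output match the input only on average, which is why short inputs can drift further from the original composition.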

If method = 'euler', then the sequence shuffling method proposed by Altschul and Erickson (1985) is used. As opposed to the 'markov' method, this one preserves exact k-let frequencies. This is done by creating a k-let edge graph, then following a random Eulerian walk through the graph. However, not every walk uses up all available letters, so the cycle-popping algorithm proposed by Propp and Wilson (1998) is used to find a random Eulerian path. A side effect of using this method is that the starting and ending sequence letters will remain unshuffled.
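The graph walk can be sketched in base R for the k = 2 case. The helper euler_shuffle_k2() below is hypothetical and is not the package's code: in place of cycle-popping it simply redraws the candidate "last edges" until every letter can reach the final letter through them, which is easier to read but may be slower.

```r
# Hypothetical helper: Euler-style shuffle of one string for k = 2.
euler_shuffle_k2 <- function(s) {
  x <- strsplit(s, "")[[1]]
  n <- length(x)
  if (n < 3) return(s)
  last <- x[n]
  # Outgoing edges: for each letter, the letters that follow it in the input
  edges <- split(x[-1], x[-n])
  others <- setdiff(names(edges), last)
  repeat {
    # Draw one candidate last edge for every letter except the final one
    last_edge <- vapply(edges[others],
                        function(e) e[sample.int(length(e), 1)],
                        character(1))
    # Accept only if following last edges from each letter reaches the end
    reaches_end <- vapply(others, function(v) {
      cur <- v
      for (step in seq_along(others)) {
        cur <- last_edge[[cur]]
        if (cur == last) return(TRUE)
      }
      FALSE
    }, logical(1))
    if (all(reaches_end)) break
  }
  # Shuffle each edge list, keeping the chosen last edge in final position
  for (v in names(edges)) {
    e <- edges[[v]]
    if (v %in% others) {
      keep <- match(last_edge[[v]], e)
      edges[[v]] <- c(sample(e[-keep]), e[keep])
    } else {
      edges[[v]] <- sample(e)
    }
  }
  # Walk the graph from the first letter, consuming edges in order
  used <- setNames(integer(length(edges)), names(edges))
  out <- character(n)
  out[1] <- x[1]
  for (i in 2:n) {
    cur <- out[i - 1]
    used[[cur]] <- used[[cur]] + 1L
    out[i] <- edges[[cur]][used[[cur]]]
  }
  paste(out, collapse = "")
}

euler_shuffle_k2("ACAGATAGACCC")
```

Every edge of the input is used exactly once, so the 2-let counts are unchanged, and the walk necessarily starts and ends on the input's first and last letters.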

If method = 'linear', then the input sequences are split linearly every k letters; for example, for k = 3 'ACAGATAGACCC' becomes 'ACA GAT AGA CCC'; after which these 3-lets are shuffled randomly.
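A minimal base R sketch of this splitting and shuffling is given below. The helper names split_klets() and linear_shuffle() are hypothetical, and the treatment of a shorter final chunk (shuffled along with the others) is an assumption rather than documented package behaviour.

```r
# Hypothetical helper: cut a string into consecutive chunks of k letters.
split_klets <- function(s, k) {
  n <- nchar(s)
  if (n <= k) return(s)
  starts <- seq(1, n, by = k)
  substring(s, starts, pmin(starts + k - 1, n))
}

# Hypothetical helper: shuffle the order of those chunks.
linear_shuffle <- function(s, k) {
  chunks <- split_klets(s, k)
  paste(chunks[sample.int(length(chunks))], collapse = "")
}

split_klets("ACAGATAGACCC", 3)  # "ACA" "GAT" "AGA" "CCC"
linear_shuffle("ACAGATAGACCC", 3)
```

Only the non-overlapping k-lets survive intact here; k-lets that span a chunk boundary in the input are generally not preserved.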

Note, however, that the method parameter is only relevant for k > 1. For k = 1, a simple sample call is performed.
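For a single character string, the k = 1 case amounts to something like the following base R sketch. The helper shuffle_letters() is hypothetical; the package operates on XStringSet objects and its internal code may differ.

```r
# Hypothetical helper: permute the individual letters of one string.
shuffle_letters <- function(s) {
  x <- strsplit(s, "")[[1]]
  paste(x[sample.int(length(x))], collapse = "")
}

shuffle_letters("ACAGATAGACCC")
```

Indexing with sample.int() rather than calling sample() on the letters directly is just a defensive habit; the letter composition and length are unchanged either way.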

Value

XStringSet The input sequences will be returned with identical names and lengths.

Author(s)

Benjamin Jean-Marie Tremblay, b2tremblay@uwaterloo.ca

References

Altschul SF, Erickson BW (1985). “Significance of Nucleotide Sequence Alignments: A Method for Random Sequence Permutation That Preserves Dinucleotide and Codon Usage.” Molecular Biology and Evolution, 2, 526-538.

Fitch WM (1983). “Random sequences.” Journal of Molecular Biology, 163, 171-176.

Propp J, Wilson D (1998). “How to get a perfectly random sample from a generic Markov chain and generate a random spanning tree of a directed graph.” Journal of Algorithms, 27, 170-217.

See Also

create_sequences(), scan_sequences(), enrich_motifs(), shuffle_motifs()

Examples

sequences <- create_sequences()
sequences.shuffled <- shuffle_sequences(sequences, k = 2)


[Package universalmotif version 1.2.1 Index]