seqComplexity {dada2}    R Documentation

Determine if input sequence(s) are low complexity.

Description

This function calculates the oligonucleotide complexity of input sequences. Complexity is quantified as the Shannon richness of oligonucleotides, which can be thought of as the effective number of oligonucleotides if they were all at equal frequencies. If a window size is provided, the minimum Shannon richness observed over a sliding window along the sequence is returned.
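The "effective number" in the description can be illustrated with a minimal base-R sketch. It assumes that Shannon richness means the exponential of the Shannon entropy of the observed oligonucleotide frequencies; the helper name shannon_richness is hypothetical and is not part of dada2, whose internal implementation may differ.

```r
# Hypothetical helper (not a dada2 function): effective number of
# oligonucleotides ("words") of length wordSize in a single sequence.
# Assumption: richness = exp(Shannon entropy of word frequencies).
shannon_richness <- function(sq, wordSize = 2) {
  n <- nchar(sq) - wordSize + 1           # number of overlapping words
  if (n < 1) return(NA_real_)             # sequence shorter than one word
  starts <- seq_len(n)
  words <- substring(sq, starts, starts + wordSize - 1)
  p <- as.numeric(table(words)) / n       # observed word frequencies
  exp(-sum(p * log(p)))                   # exponential of Shannon entropy
}

shannon_richness("AAAAAAAA")   # 1: only the dinucleotide AA occurs
shannon_richness("ACGTACGTA")  # 4 (up to rounding): AC, CG, GT, TA equally frequent
```

Under this reading, the value for wordSize = 2 ranges from 1 (a single repeated dinucleotide) to 16 (all dinucleotides equally frequent).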

Usage

seqComplexity(seqs, wordSize = 2, window = NULL, by = 5)

Arguments

seqs

(Required). A character vector of A/C/G/T sequences, or any object coercible by getSequences.

wordSize

(Optional). Default 2. The size of the oligonucleotides (or "words" or "kmers") to use.

window

(Optional). Default NULL. The width in nucleotides of the moving window. If NULL, the whole sequence is used.

by

(Optional). Default 5. The step size in nucleotides between each moving window tested.
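How window and by interact can be sketched in base R. This is an illustration under stated assumptions, not dada2's code: windows are assumed to start at positions 1, 1 + by, 1 + 2*by, and so on, only full-width windows are scored, and a sequence no longer than the window is scored as a whole. Both helper names are hypothetical.

```r
# Hypothetical helpers (not dada2 functions).
# Assumption: richness = exp(Shannon entropy of word frequencies).
shannon_richness <- function(sq, wordSize = 2) {
  n <- nchar(sq) - wordSize + 1
  if (n < 1) return(NA_real_)
  starts <- seq_len(n)
  words <- substring(sq, starts, starts + wordSize - 1)
  p <- as.numeric(table(words)) / n
  exp(-sum(p * log(p)))
}

# Minimum richness over full-width windows spaced `by` nucleotides apart.
min_window_richness <- function(sq, window = NULL, by = 5, wordSize = 2) {
  len <- nchar(sq)
  # No window, or sequence too short for one: score the whole sequence.
  if (is.null(window) || len <= window) return(shannon_richness(sq, wordSize))
  starts <- seq(1, len - window + 1, by = by)
  scores <- vapply(
    starts,
    function(s) shannon_richness(substring(sq, s, s + window - 1), wordSize),
    numeric(1)
  )
  min(scores)
}

# 20 nt of ACGT repeats followed by a 20 nt poly-A tail.
mixed <- paste0(strrep("ACGT", 5), strrep("A", 20))
min_window_richness(mixed)               # whole sequence: greater than 1
min_window_richness(mixed, window = 10)  # 1: some window lies entirely in the poly-A tail
```

A smaller by tests more windows and is less likely to step over a short low-complexity stretch, at the cost of more computation.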

Details

This function can be used to identify potentially artefactual or undesirable low-complexity sequences, or sequences with low-complexity regions, as are sometimes observed in Illumina sequencing runs. When such artefactual sequences are present, a simple plot of the Shannon oligonucleotide richness values returned by this function will typically show a clear bimodal signal.
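The plot mentioned above could be produced along the following lines. This sketch assumes dada2 is installed and reuses the three sequences from the Examples section; the cutoff value is a hypothetical placeholder that should be chosen by inspecting the histogram of a real data set, not a recommended threshold.

```r
library(dada2)

# Sequences taken from the Examples section of this page.
sq.norm <- "TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGA"
sq.lowc <- "TCCTTCTTCTCCTCTCTTTCTCCTTCTTTCTTTTTTTTCCCTTTCTCTTCTTCTTTTTCTTCCTTCCTTTTTTC"
sq.part <- "TTTTTCTTCTCCCCCTTCCCCTTTCCTTTTCTCCTTTTTTCCTTTAGTGCAGTTGAGGCAGGCGGAATTCGTGG"
sqs <- c(sq.norm, sq.lowc, sq.part)

# One complexity value per sequence.
cx <- seqComplexity(sqs)

# With many sequences, a low-complexity mode shows up as a separate peak.
hist(cx, breaks = 20,
     xlab = "Shannon oligonucleotide richness",
     main = "Sequence complexity")

# Hypothetical cutoff; pick it from the gap between the two modes.
cutoff <- 10
keep <- sqs[cx >= cutoff]
```

With only three sequences the histogram is not informative; on a full set of sequences the gap between the two modes, when present, suggests where to place the cutoff.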

Value

numeric. A vector of minimum oligonucleotide complexities, one for each sequence.

See Also

oligonucleotideFrequency

Examples

sq.norm <- "TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGA"
sq.lowc <- "TCCTTCTTCTCCTCTCTTTCTCCTTCTTTCTTTTTTTTCCCTTTCTCTTCTTCTTTTTCTTCCTTCCTTTTTTC"
sq.part <- "TTTTTCTTCTCCCCCTTCCCCTTTCCTTTTCTCCTTTTTTCCTTTAGTGCAGTTGAGGCAGGCGGAATTCGTGG"
sqs <- c(sq.norm, sq.lowc, sq.part)
seqComplexity(sqs)
seqComplexity(sqs, window=25)


[Package dada2 version 1.10.1 Index]