generateCytosineReport {epialleleR} | R Documentation |
This function counts methylated and unmethylated DNA bases taking into the account average methylation level of the entire sequence read.
generateCytosineReport( bam, report.file = NULL, threshold.reads = TRUE, threshold.context = c("CG", "CHG", "CHH", "CxG", "CX"), min.context.sites = 2, min.context.beta = 0.5, max.outofcontext.beta = 0.1, report.context = threshold.context, min.mapq = 0, skip.duplicates = FALSE, gzip = FALSE, verbose = TRUE )
bam |
BAM file location string OR preprocessed output of
|
report.file |
file location string to write the cytosine report. If NULL
(the default) then report is returned as a
|
threshold.reads |
boolean defining if sequence reads (read pairs) should be thresholded before counting methylated cytosines (default: TRUE). Disabling thresholding makes the report virtually indistinguishable from the ones generated by other software, such as Bismark or Illumina DRAGEN Bio IT Platform. |
threshold.context |
string defining cytosine methylation context used for thresholding the reads:
This option has no effect when read thresholding is disabled. |
min.context.sites |
non-negative integer for minimum number of cytosines within the 'threshold.context' (default: 2). Reads containing fewer within-the-context cytosines are considered completely unmethylated (all C are counted as T). This option has no effect when read thresholding is disabled. |
min.context.beta |
real number in the range [0;1] (default: 0.5). Reads with average beta value for within-the-context cytosines below this threshold are considered completely unmethylated (all C are counted as T). This option has no effect when read thresholding is disabled. |
max.outofcontext.beta |
real number in the range [0;1] (default: 0.1). Reads with average beta value for out-of-context cytosines above this threshold are considered completely unmethylated (all C are counted as T). This option has no effect when read thresholding is disabled. |
report.context |
string defining cytosine methylation context to report (default: value of 'threshold.context'). |
min.mapq |
non-negative integer threshold for minimum read mapping quality (default: 0). Option has no effect if preprocessed BAM data was supplied as an input. |
skip.duplicates |
boolean defining if duplicate aligned reads should be skipped (default: FALSE). Option has no effect if preprocessed BAM data was supplied as an input OR duplicate reads were not marked by alignment software. |
gzip |
boolean to compress the report (default: FALSE). |
verbose |
boolean to report progress and timings (default: TRUE). |
The function reports cytosine methylation information using BAM file as an input. In contrast to the other currently available software, reads (or read pairs as a single entity) can be thresholded by their average methylation level before counting methylated bases, effectively resulting in hypermethylated variant epiallele frequency (VEF) being reported instead of beta value. The function's logic is explained below.
Let's suppose we have a BAM file with four reads, all mapped to the "+" strand of chromosome 1, positions 1-16. Assuming the default values for the thresholding parameters (threshold.reads = TRUE, threshold.context = "CG", min.context.sites = 2, min.context.beta = 0.5, max.outofcontext.beta = 0.1), the input and results will look as following:
methylation string | threshold | explained | methylation reported |
...Z..x+.h..x..h. | below | min.context.sites < 2 (only one zZ base) | all cytosines unmethylated |
...Z..z.h..x..h. | above | pass all criteria | only C4 is methylated |
...Z..z.h..X..h. | below | max.outofcontext.beta > 0.1 (1XH / 3xXhH = 0.33) | all cytosines unmethylated |
...Z..z.h..z-.h. | below | min.context.beta < 0.5 (1Z / 3zZ = 0.33) | all cytosines unmethylated |
Only the second read will satisfy all of the thresholding criteria, leading to the following CX report (given that all reads map to chr1:+:1-16):
rname | strand | pos | context | meth | unmeth | triad |
chr1 | + | 4 | CG | 1 | 3 | NNN |
chr1 | + | 7 | CG | 0 | 3 | NNN |
chr1 | + | 9 | CHH | 0 | 4 | NNN |
chr1 | + | 12 | CHG | 0 | 3 | NNN |
chr1 | + | 15 | CHH | 0 | 4 | NNN |
With the thresholding disabled (threshold.reads = FALSE) all methylated bases will retain their status, so the CX report will be similar to the reports produced by other methylation callers (such as Bismark or Illumina DRAGEN Bio IT Platform):
rname | strand | pos | context | meth | unmeth | triad |
chr1 | + | 4 | CG | 4 | 0 | NNN |
chr1 | + | 7 | CG | 0 | 3 | NNN |
chr1 | + | 9 | CHH | 0 | 4 | NNN |
chr1 | + | 12 | CHG | 1 | 2 | NNN |
chr1 | + | 15 | CHH | 0 | 4 | NNN |
Other notes:
Methylation string bases in unknown context ("uU") are simply ignored, which, to the best of our knowledge, is consistent with the behaviour of other tools.
In order to mitigate sequencing errors (leading to rare variations in the methylation context, as in reads 1 and 4 above), the context present in more than 50% of the reads is assumed to be correct, while all bases at the same position but having other methylation context are simply ignored. This allows reports to be prepared without using the reference genome sequence.
The downside of not using the reference genome sequence is the inability to determine the actual sequence of triplet for every base in the cytosine report. For now the sequence is reported as "NNN" and this will stay until such information will be considered as worth adding.
data.table
object containing cytosine
report in Bismark format or NULL if report.file was specified. The report
columns are:
rname – reference sequence name (as in BAM)
strand – strand
pos – cytosine position
context – methylation context
meth – number of methylated cytosines
unmeth – number of unmethylated cytosines
triad – sequence spanning region [<pos>:<pos+2>]. Always equals to "NNN" due to the reference genome-independent reporting
preprocessBam
for preloading BAM data,
generateBedReport
for genomic region-based statistics,
generateVcfReport
for evaluating epiallele-SNV associations,
generateBedEcdf
for analysing the distribution of per-read
beta values, and 'epialleleR' vignettes for the description of usage and
sample data.
capture.bam <- system.file("extdata", "capture.bam", package="epialleleR") # CpG report with thresholding cg.report <- generateCytosineReport(capture.bam) # CX report without thresholding cx.report <- generateCytosineReport(capture.bam, threshold.reads=FALSE, report.context="CX")