generateVcfReport {epialleleR}R Documentation

generateVcfReport

Description

This function calculates base frequencies at particular genomic positions and tests their association with the methylation status of the sequencing reads.

Usage

generateVcfReport(
  bam,
  vcf,
  vcf.style = NULL,
  bed = NULL,
  report.file = NULL,
  zero.based.bed = FALSE,
  threshold.reads = TRUE,
  threshold.context = c("CG", "CHG", "CHH", "CxG", "CX"),
  min.context.sites = 2,
  min.context.beta = 0.5,
  max.outofcontext.beta = 0.1,
  min.mapq = 0,
  skip.duplicates = FALSE,
  gzip = FALSE,
  verbose = TRUE
)

Arguments

bam

BAM file location string OR preprocessed output of preprocessBam function.

vcf

Variant Call Format (VCF) file location string OR a VCF object returned by readVcf function. If VCF object is supplied, its seqlevels must match the seqlevels of the BAM file/object used.

vcf.style

string for the seqlevels style of the VCF file, if different from BED file/object. Only has effect when 'vcf' parameter points to the VCF file location and 'bed' is not NULL. Possible values:

  • NULL (the default) – seqlevels in BED file/object and VCF file are the same

  • "NCBI", "UCSC", ... – valid parameters of seqlevelsStyle function

bed

Browser Extensible Data (BED) file location string OR object of class GRanges holding genomic coordinates for regions of interest. It is used to include only the specific genomic ranges when the VCF file is loaded. This option has no effect when VCF object is supplied as a 'vcf' parameter. The seqlevels of BED file/object must match the seqlevels of the BAM file/object used.

report.file

file location string to write the VCF report. If NULL (the default) then report is returned as a data.table object.

zero.based.bed

boolean defining if BED coordinates are zero based (default: FALSE).

threshold.reads

boolean defining if sequence reads should be thresholded before counting bases in reference and variant epialleles (default: TRUE). Disabling thresholding is possible but makes no sense in this context as all the reads will be assigned to the variant epiallele, which will result in Fisher's Exact test p-value of 1 (in columns 'FEp+' and 'FEP-').

threshold.context

string defining cytosine methylation context used for thresholding the reads:

  • "CG" (the default) – within-the-context: CpG cytosines (called as zZ), out-of-context: all the other cytosines (hHxX)

  • "CHG" – within-the-context: CHG cytosines (xX), out-of-context: hHzZ

  • "CHH" – within-the-context: CHH cytosines (hH), out-of-context: xXzZ

  • "CxG" – within-the-context: CG and CHG cytosines (zZxX), out-of-context: CHH cytosines (hH)

  • "CX" – all cytosines are considered within-the-context, this effectively results in no thresholding

This option has no effect when read thresholding is disabled.

min.context.sites

non-negative integer for minimum number of cytosines within the 'threshold.context' (default: 2). Reads containing fewer within-the-context cytosines are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled.

min.context.beta

real number in the range [0;1] (default: 0.5). Reads with average beta value for within-the-context cytosines below this threshold are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled.

max.outofcontext.beta

real number in the range [0;1] (default: 0.1). Reads with average beta value for out-of-context cytosines above this threshold are considered completely unmethylated (thus belonging to the reference epiallele). This option has no effect when read thresholding is disabled.

min.mapq

non-negative integer threshold for minimum read mapping quality (default: 0). Option has no effect if preprocessed BAM data was supplied as an input.

skip.duplicates

boolean defining if duplicate aligned reads should be skipped (default: FALSE). Option has no effect if preprocessed BAM data was supplied as an input OR duplicate reads were not marked by alignment software.

gzip

boolean to compress the report (default: FALSE).

verbose

boolean to report progress and timings (default: TRUE).

Details

Using BAM reads and sequence variation information as an input, 'generateVcfReport' function thresholds the reads (or read pairs as a single entity) according to supplied parameters and calculates the occurrence of Reference and Alternative bases within reads, taking into the account DNA strand the read mapped to and average methylation level (epiallele status) of the read.

The information on sequence variation can be supplied as a Variant Call Format (VCF) file location or an object of class VCF, returned by the readVcf function call. As whole-genome VCF files can be extremely large, it is strongly advised to use only relevant subset of their data, prefiltering the VCF object manually before calling 'generateVcfReport' or specifying 'bed' parameter when 'vcf' points to the location of such large VCF file. Please note that all the BAM, BED and VCF files must use the same seqlevels (i.e. chromosome names).

After counting, function checks if certain bases occur more often within reads belonging to certain epialleles using Fisher's Exact test (fisher.test) and reports separate p-values for reads mapped to "+" (forward) and "-" (reverse) DNA strands.

Please note that the final report currently includes only the VCF entries with single-base REF and ALT alleles. Also, there is no filtering by the base quality at the moment, thus the output of 'generateVcfReport' is equivalent to the one of 'samtools mplieup -Q 0 ...', and may result in false SNVs caused by misalignments. These will be fixed in the future.

Value

data.table object containing VCF report or NULL if report.file was specified. The report columns are:

See Also

preprocessBam for preloading BAM data, generateCytosineReport for methylation statistics at the level of individual cytosines, generateBedReport for genomic region-based statistics, generateBedEcdf for analysing the distribution of per-read beta values, and 'epialleleR' vignettes for the description of usage and sample data.

GRanges class for working with genomic ranges, readVcf function for loading VCF data, seqlevelsStyle function for getting or setting the seqlevels style.

Examples

  capture.bam <- system.file("extdata", "capture.bam", package="epialleleR")
  capture.bed <- system.file("extdata", "capture.bed", package="epialleleR")
  capture.vcf <- system.file("extdata", "capture.vcf.gz",
                             package="epialleleR")
  
  # VCF report
  vcf.report <- generateVcfReport(bam=capture.bam, bed=capture.bed,
                                  vcf=capture.vcf)

[Package epialleleR version 1.0.0 Index]