We first set global chunk options and load the packages that are necessary to run the vignette.
library(DChIPRep)
## Warning: replacing previous import 'ggplot2::Position' by 'BiocGenerics::Position' when loading
## 'soGGi'
library(ggplot2)
library(DESeq2)
The package implements the analysis strategy of (Chabbert et al. 2015). Along with an experimental protocol to multiplex ChIP–Seq experiments, (Chabbert et al. 2015) developped a methodology to assess differences between chromatin modification profiles in replicated ChIP–Seq studies that builds on the packages DESeq2 and fdrtool. The package also includes a preprocessing script in Python that allows the user to import sam files into data structures than can be used with the package.
Together with the release of the R package DChIPRep, we provide a python framework that should allow users to generate count tables. These may then be directly used to evaluate metagene enrichment profiles. This framework is entirely contained in a script that only requires the installation of Python 2.7 and of the packages HTSeq and Numpy, which are free to download. This section describes briefly the various possibilities of analysis offered by this tool together with the various input options and parameters. Additional information may be found directly in the source code or via the usual help framework offered by Python.
The script comes with the package and can be located on your system by the following commands:
pythonScriptsDir <- system.file( "exec" , package = "DChIPRep" )
pythonScriptsDir
## [1] "/tmp/Rtmp9NNWOU/Rinst419935403df0/DChIPRep/exec"
list.files(pythonScriptsDir)
## [1] "DChIPRep.py"
This script will process alignments of paired-end reads, filter out divergent pairs and low quality alignments and ultimately identify potential PCR duplicates. For the pairs of read retained after filtering, the center of the genomic loci determined by each pair is estimated using mapping coordinates. This center constitutes an approximation of the center of each observed nucleosome. The provided annotation file is then used to estimate the nucleosome counts around the features of interest. These counts are ultimately used to provide a count table in which each row corresponds to one feature and each column to a genomic position around the feature start. This table may then be imported in R using the importData function of the DChIPRep package.
Here we show how to quickly generate a count table from an alignment file. The DChIPRep.py script only requires 4 arguments (all other options have a default setting, cf below):
An alignment file in the SAM format (some HTSeq functions are not currently supported for BAM files) that may be zipped. It is specified by the ‘-i’ option on the command line.
An annotation file in the gff format (specifications for the gff format may be found here) specified in the ‘-a’ option
…
It is crucial to ensure that the chromosome names used in the alignment file and this information file are identical. This information is used by the program to pre–allocate memory to store the counts. A valid path to this text file should be specified under the ‘-g’ option.
A valid path for an output file specified under the option ‘-o’. A new file will be created automatically - if it already exists, its content will be erased and replaced by the script output.
Here is an example of the command line to be run
python DChIPRep.py -i alignment.sam -a my_gff.gff -g chromosome_sizes.txt -o my_count_table.txt
In order to provide greater flexibility in data pre-processing, we have added a panel of additional options that may be specified when running the script. The script is executable, With the –help option, one can get an overview of the available options. They are given below.
usage: DChIPRep.py [-h] -i SAM/BAM -a GFF -g Chromosome Sizes File -o Count Table [-v] [-q TH_QC] [-l TH_LOW] [-L TH_HIGH] [-d TH_PCR] [-f FEATURETYPE]
[-w DOWNSTREAM] [-u UPSTREAM]
optional arguments:
-h, --help show this help message and exit
-i SAM/BAM, --input SAM/BAM
The alignment file to be used for to generate the count table. The file may be in the sam (zipped or not)format. The extension of
the file should contain either the '.sam' indication. Bam files are not supported at the moment due to soome instability in the BAM
reader regarding certain aligner formats.
-a GFF, --annotation GFF
The annotation file that will be used to generate the counts. The file should be in the gff format (see
https://www.sanger.ac.uk/resources/software/gff/spec.html for details).
-g Chromosome Sizes File, --genome_details Chromosome Sizes File
A tabulated file containing the names and sizes of each chromosome. !!! The the chromosome names should be identical to the ones
used to generate the alignment file !!! The file should look like this example (no header): chromI 1304563 chromII 6536634 ...
-o Count Table, --output_file Count Table
The output file where the count table should be stored. If the specified file does not already exist it will be created
automatically. Otherwise, it will be overwritten
-v, --verbose When specified, the option will result in the printing of informations on the alignments and the filtering steps (number of
ambiguous alignments...etc). Default: OFF
-q TH_QC, --quality_threshold TH_QC
The quality threshold below which alignments will be discarded. The alignment quality index typically ranges between 1 and 41.
Default: 30.
-l TH_LOW, --lowest_size TH_LOW
The lowest possible size accepted for DNA fragments. Any pair of reads with an insert size below that value will be discarded.
Default: 130.
-L TH_HIGH, --longest_size TH_HIGH
The longest possible size accepted for DNA fragments. Any pair of reads with an insert size above this value will be discarded.
Default: 180.
-d TH_PCR, --duplicate_filter TH_PCR
The number of estimated PCR duplicates for accepted for a given genomic location. Default: 1.
-f FEATURETYPE, --feature_type FEATURETYPE
The feature types to be used when generating the count table. The feature type will be matched 3rd column of the GFF file. Default:
'Transcript
-w DOWNSTREAM, --downstream_window DOWNSTREAM
The window size used to obtain the counts downstream of the TSS. Default: 1000bp
-u UPSTREAM, --upstream_window UPSTREAM
The window size used to obtain the counts upstream of the TSS. Default: 1500bp
As an example, let us assume that we want to process an alignment file with the following criteria:
Remove reads with an alignment quality below 20
Focus on the pairs with a spanning size ranging between 120 and 160 bp
Get the counts information around the TSS of the coding sequences (CDS) in a symmetrical window of 500bp
Only tolerate two copies on the same molecules upstream
The command line would then be:
python DChIPRep.py -i alignment.sam -a my_gff.gff
-g chromosome_sizes.txt -o my_count_table.txt -v -q 20 -l 120
-L 160 -d 2 -f CDS -u 500 -w 500
The default number of upstream position is 1500bp, the default number of downstream positions is 1000bp. This results in count tables with 2502 columns in total, corresponding to the basepairs up– and downstream as well as one column for
for the 0 position and one column for the feature IDs.
In what follows, we assume that we count nucleosomes around transcription start sites (TSS), that is the start of transcripts, which is also the default option in the preprocessing script.
After the data has been preproccesed, we first need an annotation table for our samples that looks like this:
data(exampleSampleTable)
exampleSampleTable
## ChIP Input upstream downstream condition
## 1 BY_conf_K4me3_1__ORF_count.txt BY_conf_WCE_1__ORF_count.txt 1000 1500 K4me3_WT
## 2 BY_conf_K4me3_2__ORF_count.txt BY_conf_WCE_2__ORF_count.txt 1000 1500 K4me3_WT
## 3 BY_conf_K4me3_3__ORF_count.txt BY_conf_WCE_3__ORF_count.txt 1000 1500 K4me3_WT
## 4 SET2_conf_K4me3_1__ORF_count.txt SET2_conf_WCE_1__ORF_count.txt 1000 1500 K4me3_SET2
## 5 SET2_conf_K4me3_2__ORF_count.txt SET2_conf_WCE_2__ORF_count.txt 1000 1500 K4me3_SET2
## 6 SET2_conf_K4me3_3__ORF_count.txt SET2_conf_WCE_3__ORF_count.txt 1000 1500 K4me3_SET2
## sampleID
## 1 K4me3_WT_1
## 2 K4me3_WT_2
## 3 K4me3_WT_3
## 4 K4me3_SET2_1
## 5 K4me3_SET2_2
## 6 K4me3_SET2_3
It gives the names for the input count files in the columns ChIP and Input respectively and the number of base pairs used for analysis upstream and downstream of the TSS in the columns upstream and downstream. Note that the number of upstream and downstream positions needs to be the same for all samples.
The input files must be tab separated files with genomic features in the rows and the positions around the TSS in the columns. In addition to the position columns, a column containing a feature ID must be present.
As mentioned above, the tables have upstream + downstream + 2 columns in total, the two extra columns correspond to the center of the profile (e.g. a TSS) and a column containing the feature IDs.
The sampleID column contains unique sample IDs.
We can then import the data as follows using the function importData which needs the sample annotation table, a data.frame
and the directory where the raw data files are stored. We use the down–sampled raw data that come with the package to illustrate the function use.
By default the feature ID column is assumed to have the name name, however this can specified in the call to importData via the ID argument.
directory <- file.path(system.file("extdata", package="DChIPRep"))
importedData <- importData(exampleSampleTable, directory)
The imported data is a DChIPRepResults object that contains the data as a DESeqDataSet. It can be accessed via the function DESeq2Data. The DESeqDataSet also contains normalization factors that are equal to the counts from the chromatin inputs, after being multiplied by the coverage ratio between the input and the ChIP data sets.
importedData
## Summary of the DESeqDataSet:
## ============================
## class: DESeqDataSet
## dim: 2501 6
## metadata(1): version
## assays(2): counts normalizationFactors
## rownames(2501): Pos_-1000 Pos_-999 ... Pos_1499 Pos_1500
## rowData names(0):
## colnames(6): K4me3_WT_1 K4me3_WT_2 ... K4me3_SET2_2 K4me3_SET2_3
## colData names(6): ChIP Input ... condition sampleID
##
##
##
## Results:
## ============================
## No results available yet, call runTesting first
DESeq2Data(importedData)
## class: DESeqDataSet
## dim: 2501 6
## metadata(1): version
## assays(2): counts normalizationFactors
## rownames(2501): Pos_-1000 Pos_-999 ... Pos_1499 Pos_1500
## rowData names(0):
## colnames(6): K4me3_WT_1 K4me3_WT_2 ... K4me3_SET2_2 K4me3_SET2_3
## colData names(6): ChIP Input ... condition sampleID
head(normalizationFactors(DESeq2Data(importedData)))
## K4me3_WT_1 K4me3_WT_2 K4me3_WT_3 K4me3_SET2_1 K4me3_SET2_2 K4me3_SET2_3
## Pos_-1000 4.069 0.8868 2.8974 0.3647 2.58 0.2305
## Pos_-999 5.426 3.1038 1.0865 0.2084 1.29 0.1536
## Pos_-998 1.356 0.8868 0.7244 0.1042 1.29 0.3073
## Pos_-997 8.139 0.4434 1.8109 0.2084 1.72 0.2305
## Pos_-996 2.713 0.8868 3.2596 0.1042 1.29 0.6146
## Pos_-995 2.713 2.2170 2.8974 0.2084 1.72 0.2305
If your data is already available within R, you can import it directly via importDataFromMatrices from an input and a ChIP Matrix. The rows of the matrices should correspond to the positions and the columns to the samples. You furthermore need a sample table as described above, however the columns Input and ChIP are not needed.
If you have data that is still on the level of the individual features, you can use the helper function summarizeCountsPerPosition to summarize the counts for individual features per position.
The package comes with example data sets that correspond to the example sample table shown above.
We first show 10 random rows from the two data matrices and then use the function importDataFromMatrices to import them to a DChIPRepResults object.
As you can see the rows should be named according to the positions up– and downstream of the TSS that they contain and the columns should be named after the samples.
Note however, that a correct ordering is not checked, it is assumed that the rows are ordered according to the position they represent (from upstream to downstream)
data(exampleInputData)
data(exampleChipData)
exampleSampleTable
## ChIP Input upstream downstream condition
## 1 BY_conf_K4me3_1__ORF_count.txt BY_conf_WCE_1__ORF_count.txt 1000 1500 K4me3_WT
## 2 BY_conf_K4me3_2__ORF_count.txt BY_conf_WCE_2__ORF_count.txt 1000 1500 K4me3_WT
## 3 BY_conf_K4me3_3__ORF_count.txt BY_conf_WCE_3__ORF_count.txt 1000 1500 K4me3_WT
## 4 SET2_conf_K4me3_1__ORF_count.txt SET2_conf_WCE_1__ORF_count.txt 1000 1500 K4me3_SET2
## 5 SET2_conf_K4me3_2__ORF_count.txt SET2_conf_WCE_2__ORF_count.txt 1000 1500 K4me3_SET2
## 6 SET2_conf_K4me3_3__ORF_count.txt SET2_conf_WCE_3__ORF_count.txt 1000 1500 K4me3_SET2
## sampleID
## 1 K4me3_WT_1
## 2 K4me3_WT_2
## 3 K4me3_WT_3
## 4 K4me3_SET2_1
## 5 K4me3_SET2_2
## 6 K4me3_SET2_3
exampleInputData[1:10, ]
## K4me3_WT_1 K4me3_WT_2 K4me3_WT_3 K4me3_SET2_1 K4me3_SET2_2 K4me3_SET2_3
## X.1000 1417 2036 2957 1729 1741 1842
## X.999 1442 2075 2889 1756 1703 1796
## X.998 1408 2031 2784 1761 1803 1795
## X.997 1433 1962 2853 1740 1702 1783
## X.996 1465 2109 2905 1759 1701 1845
## X.995 1466 2077 2778 1703 1690 1852
## X.994 1504 2074 2839 1706 1729 1867
## X.993 1433 2028 2787 1725 1670 1845
## X.992 1416 1930 2739 1668 1729 1719
## X.991 1372 2005 2743 1693 1666 1785
exampleChipData[1:10, ]
## K4me3_WT_1 K4me3_WT_2 K4me3_WT_3 K4me3_SET2_1 K4me3_SET2_2 K4me3_SET2_3
## X.1000 4374 1928 2235 198 1595 334
## X.999 4139 1951 2200 182 1591 306
## X.998 4348 1907 2209 211 1581 281
## X.997 4326 1939 2265 225 1555 343
## X.996 4256 1883 2214 211 1609 340
## X.995 4227 1888 2198 198 1565 332
## X.994 4219 1884 2203 195 1584 313
## X.993 4222 1911 2230 198 1599 321
## X.992 4192 1866 2182 191 1549 330
## X.991 4126 1892 2157 199 1546 314
imDataFromMatrices <- importDataFromMatrices(inputData = exampleInputData,
chipData = exampleChipData,
sampleTable = exampleSampleTable)
The imported data is again a DChIPRepResults object that contains the data as a DESeqDataSet.
importData_soGGi
It is also possible to use the function importData_soGGi
, which is based on the function regionPlot
from the package soGGi to import data from .bam files directly.
It will return a matrix with one column per .bam file and the respective counts per postion in the rows.
In the example below, we use a subsampled .bam file (0.1 % of the reads) from the Galonska et. al. WCE (whole cell extract) H3Kme3 data and associated TSS near identified peaks. For additional details on the data, see the help pages on input_galonska
and TSS_galonska
. The code below is not evaluated, since it takes some time to compute.
data(sample_table_galonska)
data(TSS_galonska)
bam_dir <- file.path(system.file("extdata", package="DChIPRep"))
wce_bam <- "subsampled_0001_pc_SRR2144628_WCE_bowtie2_mapped-only_XS-filt_no-dups.bam"
mat_wce <- importData_soGGi(bam_paths = file.path(bam_dir, wce_bam),
TSS = TSS_galonska,
fragment_lengths = sample_table_galonska$input_fragment_length[1],
sample_ids = sample_table_galonska$input[1],
paired = FALSE,
removeDup=FALSE
)
head(mat_wce)
After the data import, we are ready to perform the tests for differential enrichment. The tests are performed position–wise and wrap DESeq2 and fdrtool. Briefly the DChIPRep testing workflow is as follows:
The function estimateDispersions from DESeq2 is called and the dispersions are estimated.
Then the position–wise tests to compare the experimental conditions are performed. A minimum fold change is used for the null hypothesis, the default value used is 0.05 on a log2 scale.
A possible strategy to infer this threshold from the data is to look a the average fold–change between technical replicates.
imDataFromMatrices <- runTesting(imDataFromMatrices, plotFDR = TRUE)
## Step 1... determine cutoff point
## Step 2... estimate parameters of null distribution and eta0
## Step 3... compute p-values and estimate empirical PDF/CDF
## Step 4... compute q-values and local fdr
## Step 5... prepare for plotting
The results can now be accessed via the function resultsDChIPRep.
res <- resultsDChIPRep(imDataFromMatrices)
head(res)
## baseMean log2FoldChange lfcSE stat pvalue padj lfdr
## Pos_-1000 1013.4 -0.02983 0.05319 0.00000 1.0000 1.0000 1
## Pos_-999 978.5 -0.03272 0.05592 0.00000 1.0000 1.0000 1
## Pos_-998 996.3 0.05379 0.05360 0.07079 0.9436 1.0000 1
## Pos_-997 1056.4 -0.04934 0.05194 0.00000 1.0000 1.0000 1
## Pos_-996 1011.3 -0.11783 0.05191 -1.30672 0.1913 0.8986 1
## Pos_-995 1005.2 -0.06681 0.05287 -0.31790 0.7506 1.0000 1
table( res$lfdr < 0.2)
##
## FALSE TRUE
## 2452 49
At an lfdr of 0.2 we identify 49 significant positions.
We can first of all plot the TSS profiles by coloring the individual points by significance.
Points corresponding to significant positions are colored black in both of the conditions. The replicate–samples are sumarized by using a positionwise robust Huber estimator for the mean (Hampel, Hennig, and Ronchetti 2011).
The function returns the plot as a ggplot2 object that can be modified afterwards.
sigPlot <- plotSignificance(imDataFromMatrices)
sigPlot
This plot is similar to Figure S17B of (Chabbert et al. 2015). We see an enrichment for significant position near the end of the downstream window considered.
We can produce the typical, smoothed plots of the TSS profiles as well. Here we use again the smoothed Huber estimator for the mean to compute a summary per experimental group.
profilePlot <- plotProfiles(imDataFromMatrices)
profilePlot
This plot is similar to Figure 5B of (Chabbert et al. 2015).
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_2.2.1 DChIPRep_1.8.0 DESeq2_1.18.0
## [4] SummarizedExperiment_1.8.0 DelayedArray_0.4.0 matrixStats_0.52.2
## [7] Biobase_2.38.0 GenomicRanges_1.30.0 GenomeInfoDb_1.14.0
## [10] IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0
## [13] rmarkdown_1.6 knitr_1.17 BiocStyle_2.6.0
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.3-2 hwriter_1.3.2 seqinr_3.4-5
## [4] rprojroot_1.2 htmlTable_1.9 futile.logger_1.4.3
## [7] XVector_0.18.0 base64enc_0.1-3 ChIPpeakAnno_3.12.0
## [10] bit64_0.9-7 interactiveDisplayBase_1.16.0 AnnotationDbi_1.40.0
## [13] codetools_0.2-15 splines_3.4.2 geneplotter_1.56.0
## [16] ade4_1.7-8 Formula_1.2-2 Rsamtools_1.30.0
## [19] annotate_1.56.0 cluster_2.0.6 GO.db_3.4.2
## [22] graph_1.56.0 shiny_1.0.5 compiler_3.4.2
## [25] httr_1.3.1 backports_1.1.1 assertthat_0.2.0
## [28] Matrix_1.2-11 lazyeval_0.2.1 limma_3.34.0
## [31] acepack_1.4.1 htmltools_0.3.6 prettyunits_1.0.2
## [34] tools_3.4.2 bindrcpp_0.2 smoothmest_0.1-2
## [37] glue_1.2.0 gtable_0.2.0 GenomeInfoDbData_0.99.1
## [40] dplyr_0.7.4 reshape2_1.4.2 ShortRead_1.36.0
## [43] Rcpp_0.12.13 Biostrings_2.46.0 multtest_2.34.0
## [46] nlme_3.1-131 preprocessCore_1.40.0 rtracklayer_1.38.0
## [49] stringr_1.2.0 mime_0.5 ensembldb_2.2.0
## [52] XML_3.98-1.9 idr_1.2 AnnotationHub_2.10.0
## [55] zlibbioc_1.24.0 MASS_7.3-47 scales_0.5.0
## [58] BSgenome_1.46.0 BiocInstaller_1.28.0 ProtGenerics_1.10.0
## [61] RBGL_1.54.0 AnnotationFilter_1.2.0 lambda.r_1.2
## [64] RColorBrewer_1.1-2 yaml_2.1.14 curl_3.0
## [67] memoise_1.1.0 gridExtra_2.3 biomaRt_2.34.0
## [70] rpart_4.1-11 latticeExtra_0.6-28 stringi_1.1.5
## [73] RSQLite_2.0 genefilter_1.60.0 RMySQL_0.10.13
## [76] checkmate_1.8.5 GenomicFeatures_1.30.0 BiocParallel_1.12.0
## [79] chipseq_1.28.0 rlang_0.1.2 pkgconfig_2.0.1
## [82] bitops_1.0-6 evaluate_0.10.1 lattice_0.20-35
## [85] bindr_0.1 purrr_0.2.4 labeling_0.3
## [88] GenomicAlignments_1.14.0 htmlwidgets_0.9 tidyselect_0.2.2
## [91] bit_1.1-12 plyr_1.8.4 magrittr_1.5
## [94] bookdown_0.5 R6_2.2.2 Hmisc_4.0-3
## [97] DBI_0.7 mgcv_1.8-22 foreign_0.8-69
## [100] survival_2.41-3 RCurl_1.95-4.8 nnet_7.3-12
## [103] tibble_1.3.4 futile.options_1.0.0 fdrtool_1.2.15
## [106] soGGi_1.10.0 progress_1.1.2 locfit_1.5-9.1
## [109] grid_3.4.2 data.table_1.10.4-3 blob_1.1.0
## [112] digest_0.6.12 xtable_1.8-2 VennDiagram_1.6.17
## [115] tidyr_0.7.2 httpuv_1.3.5 regioneR_1.10.0
## [118] munsell_0.4.3
Chabbert, C. D., S. H. Adjalley, B. Klaus, E. S. Fritsch, I. Gupta, V. Pelechano, and L. M. Steinmetz. 2015. “A High-Throughput ChIP-Seq for Large-Scale Chromatin Studies.” Molecular Systems Biology 11 (1). EMBO: 777–77. doi:10.15252/msb.20145776.
Hampel, Frank, Christian Hennig, and Elvezio Ronchetti. 2011. “A Smoothing Principle for the Huber and Other Location M-Estimators.” Computational Statistics & Data Analysis 55 (1). Elsevier BV: 324–37. doi:10.1016/j.csda.2010.05.001.