derfinder
If you wish, you can view this vignette online here.
In this vignette we present derfinder
(Collado-Torres, Frazee, Love, Irizarry, et al., 2015)
which enables annotation-agnostic differential expression analysis at base-pair resolution. A example with publicly available data is showcased in this vignette. With it, the two main type of analyses that can be performed with derfinder
are illustrated.
As an example, we will analyze a small subset of the samples from the BrainSpan Atlas of the Human Brain (BrainSpan, 2011)
publicly available data.
We first load the required packages.
## Load libraries library('derfinder') library('derfinderData') library('GenomicRanges')
For this example, we created a small table with the relevant phenotype data for 12 samples: 6 from fetal samples and 6 from adult samples. We chose at random a brain region, in this case the amygdaloid complex. For this example we will only look at data from chromosome 21. Other variables include the age in years and the gender. The data is shown below.
library('knitr') ## Get pheno table pheno <- subset(brainspanPheno, structure_acronym == 'AMY') ## Display the main information p <- pheno[, -which(colnames(pheno) %in% c('structure_acronym', 'structure_name', 'file'))] rownames(p) <- NULL kable(p, format = 'html', row.names = TRUE)
gender | lab | Age | group | |
---|---|---|---|---|
1 | F | HSB97.AMY | -0.5476190 | fetal |
2 | M | HSB92.AMY | -0.4523810 | fetal |
3 | M | HSB178.AMY | -0.5714286 | fetal |
4 | M | HSB159.AMY | -0.3809524 | fetal |
5 | F | HSB153.AMY | -0.6666667 | fetal |
6 | F | HSB113.AMY | -0.6666667 | fetal |
7 | F | HSB130.AMY | 21.0000000 | adult |
8 | M | HSB136.AMY | 23.0000000 | adult |
9 | F | HSB126.AMY | 30.0000000 | adult |
10 | M | HSB145.AMY | 36.0000000 | adult |
11 | M | HSB123.AMY | 37.0000000 | adult |
12 | F | HSB135.AMY | 40.0000000 | adult |
derfinder
offers three functions related to loading data. The first one, rawFiles()
, is a helper function for identifying the full paths to the input files. Next, loadCoverage()
loads the base-level coverage data from either BAM or BigWig files for a specific chromosome. Finally, fullCoverage()
will load the coverage for a set of chromosomes using loadCoverage()
.
We can load the data from derfinderData
(Collado-Torres, Jaffe, and Leek, 2014)
by first identifying the paths to the BigWig files with rawFiles()
and then loading the data with fullCoverage()
.
## Determine the files to use and fix the names files <- rawFiles(system.file('extdata', 'AMY', package = 'derfinderData'), samplepatt = 'bw', fileterm = NULL) names(files) <- gsub('.bw', '', names(files)) ## Load the data from disk system.time(fullCov <- fullCoverage(files = files, chrs = 'chr21'))
Alternatively, since the BigWig files are publicly available from BrainSpan (see here), we can extract the relevant coverage data using fullCoverage()
. Note that as of rtracklayer
1.25.16 BigWig files are not supported on Windows: you can find the fullCov
object inside derfinderData
to follow the examples.
## Determine the files to use and fix the names files <- pheno$file names(files) <- gsub('.AMY', '', pheno$lab) ## Load the data from the web system.time(fullCov <- fullCoverage(files = files, chrs = 'chr21'))
Note how loading the coverage for 12 samples from the web was quite fast. Although in this case we only retained the information for chromosome 21.
The result of fullCov
is a list with one element per chromosome. If no filtering was performed, each chromosome has a DataFrame
with the number of rows equaling the number of bases in the chromosome with one column per sample.
## Lets explore it fullCov
## $chr21 ## DataFrame with 48129895 rows and 12 columns ## HSB97 HSB92 HSB178 HSB159 HSB153 HSB113 HSB130 HSB136 HSB126 ## <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> ## 1 0 0 0 0 0 0 0 0 0 ## 2 0 0 0 0 0 0 0 0 0 ## 3 0 0 0 0 0 0 0 0 0 ## 4 0 0 0 0 0 0 0 0 0 ## 5 0 0 0 0 0 0 0 0 0 ## ... ... ... ... ... ... ... ... ... ... ## 48129891 0 0 0 0 0 0 0 0 0 ## 48129892 0 0 0 0 0 0 0 0 0 ## 48129893 0 0 0 0 0 0 0 0 0 ## 48129894 0 0 0 0 0 0 0 0 0 ## 48129895 0 0 0 0 0 0 0 0 0 ## HSB145 HSB123 HSB135 ## <Rle> <Rle> <Rle> ## 1 0 0 0 ## 2 0 0 0 ## 3 0 0 0 ## 4 0 0 0 ## 5 0 0 0 ## ... ... ... ... ## 48129891 0 0 0 ## 48129892 0 0 0 ## 48129893 0 0 0 ## 48129894 0 0 0 ## 48129895 0 0 0
If filtering was performed, each chromosome also has a logical Rle
indicating which bases of the chromosome passed the filtering. This information is useful later on to map back the results to the genome coordinates.
Depending on the use case, you might want to filter the base level coverage at the time of reading it, or you might want to keep an unfiltered version. By default both loadCoverage()
and fullCoverage()
will not filter.
If you decide to filter, set the cutoff
argument to a positive value. This will run filterData()
. Note that you might want to standardize the library sizes prior to filtering, which can be done by supplying the totalMapped
and targetSize
arguments.
In this example, we prefer to keep both an unfiltered and filtered version. For the filtered version, we will retain the bases where at least one sample has coverage greater than 2.
## Filter coverage filteredCov <- lapply(fullCov, filterData, cutoff = 2)
The result is similar to fullCov
but with the genomic position index as shown below.
## Similar to fullCov but with $position filteredCov
## $chr21 ## $chr21$coverage ## DataFrame with 130356 rows and 12 columns ## HSB97 HSB92 HSB178 HSB159 HSB153 HSB113 HSB130 HSB136 HSB126 HSB145 ## <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> ## 1 1.58 0.07 0.05 0.15 0.59 2.01 0.04 0.13 0.00 0 ## 2 1.58 0.07 0.05 0.15 0.63 2.17 0.04 0.13 0.00 0 ## 3 1.61 0.10 0.05 0.15 0.67 2.21 0.04 0.13 0.00 0 ## 4 1.65 0.13 0.05 0.15 0.71 2.37 0.04 0.13 0.00 0 ## 5 1.68 0.13 0.10 0.25 0.75 2.37 0.04 0.13 0.06 0 ## ... ... ... ... ... ... ... ... ... ... ... ## 130352 1.51 2.23 2.07 2.46 2.21 1.25 0.79 1.37 2.05 1.03 ## 130353 1.51 2.23 2.07 2.46 2.21 1.21 0.79 1.37 2.05 1.03 ## 130354 1.47 2.13 2.11 2.21 2.01 1.21 0.79 1.28 1.93 0.73 ## 130355 1.47 2.13 2.11 2.11 2.01 1.17 0.79 1.28 1.93 0.60 ## 130356 1.47 2.06 1.78 1.96 1.93 1.17 0.79 1.02 1.87 0.56 ## HSB123 HSB135 ## <Rle> <Rle> ## 1 0 0.23 ## 2 0 0.23 ## 3 0 0.23 ## 4 0 0.23 ## 5 0 0.23 ## ... ... ... ## 130352 1.28 1.63 ## 130353 1.24 1.63 ## 130354 1.20 1.63 ## 130355 1.20 1.63 ## 130356 1.09 1.55 ## ## $chr21$position ## logical-Rle of length 48129895 with 3235 runs ## Lengths: 9825448 149 2 9 ... 740 598 45091 ## Values : FALSE TRUE FALSE TRUE ... FALSE TRUE FALSE
In terms of memory, the filtered version requires less resources. Although this will depend on how rich the data set is and how aggressive was the filtering step.
## Compare the size in Mb round(c(fullCov = object.size(fullCov), filteredCov = object.size(filteredCov)) / 1024^2, 1)
## fullCov filteredCov ## 22.7 8.5
Note that with your own data, filtering for bases where at least one sample has coverage greater than 2 might not make sense: maybe you need a higher or lower filter. The amount of bases remaining after filtering will impact how long the analysis will take to complete. We thus recommend exploring this before proceeding.
One form of base-level differential expression analysis implemented in derfinder
is to calculate F-statistics for every base and use them to define candidate differentially expressed regions. This type of analysis is further explained in this section.
Once we have the base-level coverage data for all 12 samples, we can construct the models. In this case, we want to find differences between fetal and adult samples while adjusting for gender and a proxy of the library size.
We can use sampleDepth()
and it's helper function collapseFullCoverage()
to get a proxy of the library size. Note that you would normally use the unfiltered data from all the chromosomes in this step and not just one.
## Get some idea of the library sizes sampleDepths <- sampleDepth(collapseFullCoverage(fullCov), 1)
sampleDepths
## HSB97.100% HSB92.100% HSB178.100% HSB159.100% HSB153.100% HSB113.100% ## 19.47733 19.55904 19.44252 19.24186 19.41285 19.82106 ## HSB130.100% HSB136.100% HSB126.100% HSB145.100% HSB123.100% HSB135.100% ## 19.52017 19.97758 19.53045 19.49827 19.40505 20.33392
sampleDepth()
is similar to calcNormFactors()
from metagenomeSeq with some code underneath tailored for the type of data we are using. collapseFullCoverage()
is only needed to deal with the size of the data.
We can then define the nested models we want to use using makeModels()
. This is a helper function that assumes that you will always adjust for the library size. You then need to define the variable to test, in this case we are comparing fetal vs adult samples. Optionally, you can adjust for other sample covariates, such as the gender in this case.
## Define models models <- makeModels(sampleDepths, testvars = pheno$group, adjustvars = pheno[, c('gender')]) ## Explore the models lapply(models, head)
## $mod ## (Intercept) testvarsadult sampleDepths adjustVar1M ## 1 1 0 19.47733 0 ## 2 1 0 19.55904 1 ## 3 1 0 19.44252 1 ## 4 1 0 19.24186 1 ## 5 1 0 19.41285 0 ## 6 1 0 19.82106 0 ## ## $mod0 ## (Intercept) sampleDepths adjustVar1M ## 1 1 19.47733 0 ## 2 1 19.55904 1 ## 3 1 19.44252 1 ## 4 1 19.24186 1 ## 5 1 19.41285 0 ## 6 1 19.82106 0
Note how the null model (mod0
) is nested in the alternative model (mod
). Use the same models for all your chromosomes unless you have a specific reason to use chromosome-specific models. Note that derfinder
is very flexible and works with any type of nested model.
Next, we can find candidate differentially expressed regions (DERs) using as input the segments of the genome where at least one sample has coverage greater than 2. That is, the filtered coverage version we created previously.
The main function in derfinder
for this type of analysis is analyzeChr()
. It works at a chromosome level and runs behinds the scenes several other derfinder
functions. To use it, you have to provide the models, the grouping information, how to calculate the F-statistic cutoff and most importantly, the number of permutations.
By default analyzeChr()
will use a theoretical cutoff. In this example, we use the cutoff that would correspond to a p-value of 0.05. To assign p-values to the candidate DERs, derfinder
permutes the rows of the model matrices, re-calculates the F-statistics and identifies null regions. Then it compares the area of the observed regions versus the areas from the null regions to assign an empirical p-value.
In this example we will use twenty permutations, although in a real case scenario you might consider a larger number of permutations.
In real scenario, you might consider saving the results from all the chromosomes in a given directory. Here we will use analysisResults. For each chromosome you analyze, a new directory with the chromosome-specific data will be created. So in this case, we will have analysisResults/chr21.
## Create a analysis directory dir.create('analysisResults') originalWd <- getwd() setwd(file.path(originalWd, 'analysisResults')) ## Perform differential expression analysis system.time(results <- analyzeChr(chr = 'chr21', filteredCov$chr21, models, groupInfo = pheno$group, writeOutput = TRUE, cutoffFstat = 5e-02, nPermute = 20, seeds = 20140923 + seq_len(20), returnOutput = TRUE))
## Warning: Calling species() on a TxDb object is *deprecated*. ## Please use organism() instead.
## ...................
## user system elapsed ## 320.35 1.15 321.84
To speed up analyzeChr()
, you might need to use several cores via the mc.cores
argument. If memory is limiting, you might want to use a smaller chunksize
(default is 5 million). Note that if you use too many cores, you might hit the input/output ceiling of your data network and/or hard drives speed.
Before running with a large number of permutations we recommend exploring how long each permutation cycle takes using a single permutation.
Note that analyzing each chromosome with a large number of permutations and a rich data set can take several hours, so we recommend running one job running analyzeChr()
per chromosome, and then merging the results via mergeResults()
. This process is further described in the advanced derfinder
vignette.
When using returnOutput = TRUE
, analyzeChr()
will return a list with the results to explore interactively. However, by default it writes the results to disk (one .Rdata file per result).
The following code explores the results.
## Explore names(results)
## [1] "timeinfo" "optionsStats" "coveragePrep" "fstats" ## [5] "regions" "annotation"
optionStats
stores the main options used in the analyzeChr()
call including the models used, the type of cutoff, number of permutations, seeds for the permutations. All this information can be useful to reproduce the analysis.
## Explore optionsStats names(results$optionsStats)
## [1] "models" "cutoffPre" "scalefac" ## [4] "chunksize" "cutoffFstat" "cutoffType" ## [7] "nPermute" "seeds" "groupInfo" ## [10] "lowMemDir" "analyzeCall" "cutoffFstatUsed" ## [13] "returnOutput"
## Call used results$optionsStats$analyzeCall
## analyzeChr(chr = "chr21", coverageInfo = filteredCov$chr21, models = models, ## cutoffFstat = 0.05, nPermute = 20, seeds = 20140923 + seq_len(20), ## groupInfo = pheno$group, writeOutput = TRUE, returnOutput = TRUE)
coveragePrep
has the result from the preprocessCoverage()
step. This includes the genomic position index, the mean coverage (after scaling and the log2 transformation) for all the samples, and the group mean coverages. By default, the chunks are written to disk in optionsStats$lowMemDir
(chr21/chunksDir
in this example) to help reduce the required memory resources. Otherwise it is stored in coveragePrep$coverageProcessed
.
## Explore coveragePrep names(results$coveragePrep)
## [1] "coverageProcessed" "mclapplyIndex" "position" ## [4] "meanCoverage" "groupMeans"
## Group means results$coveragePrep$groupMeans
## $fetal ## numeric-Rle of length 130356 with 117984 runs ## Lengths: 1 1 ... 1 ## Values : 0.741666669026017 0.775000020240744 ... 1.72833331425985 ## ## $adult ## numeric-Rle of length 130356 with 111765 runs ## Lengths: 4 3 ... 1 ## Values : 0.066666666418314 0.0766666661947966 ... 1.14666666587194
The F-statistics are then stored in fstats
. These are calculated using calculateStats()
.
## Explore optionsStats results$fstats
## numeric-Rle of length 130356 with 126807 runs ## Lengths: 1 1 ... 1 ## Values : 6.08556061709451 6.07331772618165 ... 4.07115923947177
## Note that the length matches the number of bases used identical(length(results$fstats), sum(results$coveragePrep$position))
## [1] TRUE
The candidate DERs and summary results from the permutations is then stored in regions
. This is the output from calculatePvalues()
which uses several underneath other functions including calculateStats()
and findRegions()
.
## Explore regions names(results$regions)
## [1] "regions" "nullStats" "nullWidths" "nullPermutation"
For the null regions, the summary information is composed of the mean F-statistic for the null regions (regions$nullStats
), the width of the null regions (regions$nullWidths
), and the permutation number under which they were identified (regions$nullPermutation
).
## Permutation summary information results$regions[2:4]
## $nullStats ## numeric-Rle of length 13694 with 13694 runs ## Lengths: 1 1 ... 1 ## Values : 6.43272407684557 5.37974310711935 ... 5.86111054419367 ## ## $nullWidths ## integer-Rle of length 13694 with 12029 runs ## Lengths: 1 1 1 1 1 1 1 1 ... 2 1 1 1 1 1 1 ## Values : 14 1 3 44 1 2 151 7 ... 2 12 6 5 3 9 10 ## ## $nullPermutation ## integer-Rle of length 13694 with 20 runs ## Lengths: 366 1388 890 656 418 168 ... 624 1070 1058 102 938 540 ## Values : 1 2 3 4 5 6 ... 15 16 17 18 19 20
The most important part of the output is the GRanges
object with the candidate DERs shown below.
## Candidate DERs results$regions$regions
## GRanges object with 1912 ranges and 14 metadata columns: ## seqnames ranges strand | value area ## <Rle> <IRanges> <Rle> | <numeric> <numeric> ## up chr21 [48018861, 48019420] * | 84.87779 47531.56 ## up chr21 [40196060, 40196788] * | 46.50926 33905.25 ## up chr21 [37444744, 37445435] * | 45.89208 31757.32 ## up chr21 [34001753, 34003041] * | 21.83766 28148.75 ## up chr21 [45180666, 45182147] * | 15.92796 23605.24 ## ... ... ... ... ... ... ... ## up chr21 [38525436, 38525436] * | 5.319762 5.319762 ## up chr21 [44473957, 44473957] * | 5.319113 5.319113 ## up chr21 [15753814, 15753814] * | 5.318060 5.318060 ## up chr21 [38525534, 38525534] * | 5.318025 5.318025 ## up chr21 [19161933, 19161933] * | 5.317960 5.317960 ## indexStart indexEnd cluster clusterL meanCoverage meanfetal ## <integer> <integer> <Rle> <Rle> <numeric> <numeric> ## up 126383 126942 451 712 18.707251 1.0025714 ## up 76025 76753 258 2186 1.334406 0.3312163 ## up 62161 62852 214 692 3.820737 1.4989234 ## up 32434 33722 126 2728 1.489127 0.6547854 ## up 96597 98078 339 6589 3.119423 1.7551878 ## ... ... ... ... ... ... ... ## up 65147 65147 228 301 2.746667 3.9683334 ## up 88214 88214 310 677 1.569167 2.2333333 ## up 4949 4949 10 17 2.060833 3.3250000 ## up 65245 65245 228 301 2.008333 3.0150000 ## up 11541 11541 29 306 0.650000 0.4616667 ## meanadult log2FoldChangeadultvsfetal pvalues significant ## <numeric> <numeric> <numeric> <factor> ## up 36.411932 5.182634 7.301935e-05 TRUE ## up 2.337595 2.819179 7.301935e-05 TRUE ## up 6.142551 2.034911 7.301935e-05 TRUE ## up 2.323468 1.827186 7.301935e-05 TRUE ## up 4.483658 1.353051 7.301935e-05 TRUE ## ... ... ... ... ... ## up 1.5250000 -1.3797240 0.9951807 FALSE ## up 0.9050000 -1.3032089 0.9975173 FALSE ## up 0.7966667 -2.0613062 0.9995619 FALSE ## up 1.0016667 -1.5897555 0.9995619 FALSE ## up 0.8383333 0.8606724 0.9997079 FALSE ## qvalues significantQval ## <numeric> <factor> ## up 0.003906062 TRUE ## up 0.003906062 TRUE ## up 0.003906062 TRUE ## up 0.003906062 TRUE ## up 0.003906062 TRUE ## ... ... ... ## up 0.6696318 FALSE ## up 0.6708524 FALSE ## up 0.6712707 FALSE ## up 0.6712707 FALSE ## up 0.6712707 FALSE ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The metadata columns are:
Note that for this type of analysis you might want to try a few coverage cutoffs and/or F-statistic cutoffs. One quick way to evaluate the results is to compare the width of the regions.
## Width of potential DERs summary(width(results$regions$regions))
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.00 1.00 5.00 31.49 25.00 1482.00
sum(width(results$regions$regions) > 50)
## [1] 342
## Width of candidate DERs sig <- as.logical(results$regions$regions$significant) summary(width(results$regions$regions[ sig ]))
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 20.0 79.0 105.0 149.9 144.0 1482.0
sum(width(results$regions$regions[ sig ]) > 50)
## [1] 281
analyzeChr()
will find the nearest annotation feature using matchGenes()
from bumphunter (version >= 1.7.3). This information is useful considering that the candidate DERs were identified without relying on annotation. Yet at the end, we are interested to check if they are inside a known exon, upstream a gene, etc.
## Nearest annotation head(results$annotation)
## name ## 1 S100B ## 2 ETS2 ## 3 SETD4 ## 4 SYNJ1 ## 5 PDXK ## 6 OLIG1 ## annotation ## 1 NM_006272 NP_006263 ## 2 NM_001256295 NM_005239 NP_001243224 NP_005230 XM_005260935 XP_005260992 ## 3 NM_001007258 NM_001007259 NM_001007260 NM_001007261 NM_001007262 NM_001286752 NM_017438 NP_001007260 NP_001007262 NP_001273681 NP_059134 NR_040087 XM_011529636 XM_011529637 XM_011529638 XM_011529639 XM_011529640 XM_011529641 XM_011529642 XM_011529643 XM_011529644 XM_011529645 XM_011529646 XM_011529647 XP_011527938 XP_011527939 XP_011527940 XP_011527941 XP_011527942 XP_011527943 XP_011527944 XP_011527945 XP_011527946 XP_011527947 XP_011527948 XP_011527949 XR_937523 XR_937525 ## 4 NM_001160302 NM_001160306 NM_003895 NM_203446 NP_001153774 NP_001153778 NP_003886 NP_982271 ## 5 NM_003681 NM_021941 NP_003672 XM_005261195 XM_005261196 XM_005261198 XM_005261199 XM_011529758 XM_011529759 XM_011529760 XM_011529761 XM_011529762 XP_005261252 XP_005261253 XP_005261255 XP_005261256 XP_011528060 XP_011528061 XP_011528062 XP_011528063 XP_011528064 ## 6 NM_138983 NP_620450 ## description region distance subregion ## 1 overlaps exon upstream inside 5615 overlaps exon upstream ## 2 inside exon inside 18829 inside exon ## 3 inside intron inside 6252 inside intron ## 4 inside exon inside 97310 inside exon ## 5 inside exon inside 41688 inside exon ## 6 inside exon inside 837 inside exon ## insideDistance exonnumber nexons UTR strand geneL codingL ## 1 0 3 3 overlaps 3'UTR - 6504 NA ## 2 0 11 11 3'UTR + 19647 NA ## 3 6167 1 8 5' UTR - 35705 NA ## 4 0 32 32 3'UTR - 99282 NA ## 5 0 11 11 3'UTR + 43210 NA ## 6 0 1 1 overlaps 3'UTR + 2278 NA ## Entrez subjectHits ## 1 6285 51508 ## 2 2114 17339 ## 3 54093 41910 ## 4 8867 68101 ## 5 8566 67103 ## 6 116448 8622
For more details on the output please check the bumphunter package.
Check the section about non-human data (specifically, using annotation different from hg19) on the advanced vignette.
The final piece is the wallclock time spent during each of the steps in analyzeChr()
.
## Time spent results$timeinfo
## init setup ## "2015-06-15 22:27:17 PDT" "2015-06-15 22:27:17 PDT" ## prepData savePrep ## "2015-06-15 22:27:22 PDT" "2015-06-15 22:27:23 PDT" ## calculateStats saveStats ## "2015-06-15 22:27:24 PDT" "2015-06-15 22:27:24 PDT" ## saveStatsOpts calculatePvalues ## "2015-06-15 22:27:24 PDT" "2015-06-15 22:27:35 PDT" ## saveRegs annotate ## "2015-06-15 22:27:35 PDT" "2015-06-15 22:32:39 PDT" ## saveAnno ## "2015-06-15 22:32:39 PDT"
## Use this information to make a plot timed <- diff(results$timeinfo) timed.df <- data.frame(Seconds = as.numeric(timed), Step = factor(names(timed), levels = rev(names(timed)))) library('ggplot2') ggplot(timed.df, aes(y = Step, x = Seconds)) + geom_point()
Once you have analyzed each chromosome using analyzeChr()
, you can use mergeResults()
to merge the results. This function does not return an object in R but instead creates several Rdata files with the main results from the different chromosomes.
## Go back to the original directory setwd(originalWd) ## Merge results from several chromosomes. In this case we only have one. mergeResults(chrs='chr21', prefix="analysisResults", genomicState = genomicState$fullGenome, optionsStats = results$optionsStats)
## Files created by mergeResults() dir('analysisResults', pattern = '.Rdata')
## [1] "fullAnnotatedRegions.Rdata" "fullFstats.Rdata" ## [3] "fullNullSummary.Rdata" "fullRegions.Rdata" ## [5] "fullTime.Rdata" "optionsMerge.Rdata"
For reproducibility purposes, the options used the merge the results are stored in optionsMerge
.
## Options used to merge load(file.path('analysisResults', 'optionsMerge.Rdata')) ## Contents names(optionsMerge)
## [1] "chrs" "significantCut" "minoverlap" "mergeCall" ## [5] "cutoffFstatUsed" "optionsStats"
## Merge call optionsMerge$mergeCall
## mergeResults(chrs = "chr21", prefix = "analysisResults", genomicState = genomicState$fullGenome, ## optionsStats = results$optionsStats)
The main result from mergeResults()
is in fullRegions
. This is a GRanges
object with the candidate DERs from all the chromosomes. It also includes the nearest annotation metadata as well as FWER adjusted p-values (fwer) and whether the FWER adjusted p-value is less than 0.05 (significantFWER).
## Load all the regions load(file.path('analysisResults', 'fullRegions.Rdata')) ## Metadata columns names(mcols(fullRegions))
## [1] "value" "area" ## [3] "indexStart" "indexEnd" ## [5] "cluster" "clusterL" ## [7] "meanCoverage" "meanfetal" ## [9] "meanadult" "log2FoldChangeadultvsfetal" ## [11] "pvalues" "significant" ## [13] "qvalues" "significantQval" ## [15] "name" "annotation" ## [17] "description" "region" ## [19] "distance" "subregion" ## [21] "insideDistance" "exonnumber" ## [23] "nexons" "UTR" ## [25] "annoStrand" "geneL" ## [27] "codingL" "Entrez" ## [29] "subjectHits" "fwer" ## [31] "significantFWER"
Note that analyzeChr()
only has the information for a given chromosome at a time, so mergeResults()
re-calculates the p-values and q-values using the information from all the chromosomes.
In preparation for visually exploring the results, mergeResults()
will run annotateRegions()
which counts how many known exons, introns and intergenic segments each candidate DER overlaps (by default with a minimum overlap of 20bp). annotateRegions()
uses a summarized version of the genome annotation created with makeGenomicState()
. For this example, we can use the data included in derfinder
which is the summarized annotation of hg19 for chromosome 21.
## Load annotateRegions() output load(file.path('analysisResults', 'fullAnnotatedRegions.Rdata')) ## Information stored names(fullAnnotatedRegions)
## [1] "countTable" "annotationList"
## Take a peak lapply(fullAnnotatedRegions, head)
## $countTable ## exon intergenic intron ## 1 1 0 0 ## 2 1 0 0 ## 3 1 0 1 ## 4 1 0 0 ## 5 1 0 0 ## 6 1 0 0 ## ## $annotationList ## GRangesList object of length 6: ## $1 ## GRanges object with 1 range and 4 metadata columns: ## seqnames ranges strand | theRegion tx_id ## <Rle> <IRanges> <Rle> | <character> <IntegerList> ## 4986 chr21 [48018531, 48019416] - | exon 73464,73465 ## tx_name gene ## <CharacterList> <IntegerList> ## 4986 uc002zju.1,uc002zjv.1 227 ## ## $2 ## GRanges object with 1 range and 4 metadata columns: ## seqnames ranges strand | theRegion tx_id ## 1189 chr21 [40194598, 40196878] + | exon 72757,72758 ## tx_name gene ## 1189 uc002yxf.3,uc002yxg.4 77 ## ## $3 ## GRanges object with 2 ranges and 4 metadata columns: ## seqnames ranges strand | theRegion ## 834 chr21 [37444113, 37445464] + | exon ## 3843 chr21 [37442784, 37447178] - | intron ## tx_id tx_name gene ## 834 72703,72704 uc002yvb.1,uc010gmy.1 282 ## 3843 73230,73231,73233,... uc002yuu.3,uc002yux.2,uc002yuy.3,... 4,206 ## ## ... ## <3 more elements> ## ------- ## seqinfo: 1 sequence from hg19 genome
Optionally, we can use the addon package derfinderPlot to visually explore the results. For more details, please check its vignette.
To make the region level plots, we will need to extract the region level coverage data. We can do so using getRegionCoverage()
as shown below.
## Find overlaps between regions and summarized genomic annotation annoRegs <- annotateRegions(fullRegions, genomicState$fullGenome)
## Indeed, the result is the same because we only used chr21 identical(annoRegs, fullAnnotatedRegions)
## [1] FALSE
## Get the region coverage regionCov <- getRegionCoverage(fullCov, fullRegions)
## Explore the result head(regionCov[[1]])
## HSB97 HSB92 HSB178 HSB159 HSB153 HSB113 HSB130 HSB136 HSB126 HSB145 ## 1 0.10 0 0 0.05 0 0 0.39 1.10 0.54 1.24 ## 2 0.10 0 0 0.10 0 0 0.50 1.15 0.60 1.24 ## 3 0.10 0 0 0.10 0 0 0.54 1.10 0.54 1.24 ## 4 0.10 0 0 0.10 0 0 0.57 1.10 0.48 1.29 ## 5 0.14 0 0 0.10 0 0 0.61 1.10 0.48 1.29 ## 6 0.14 0 0 0.10 0 0 0.64 1.02 0.48 1.29 ## HSB123 HSB135 ## 1 1.60 2.24 ## 2 1.68 2.58 ## 3 1.64 2.69 ## 4 1.75 2.73 ## 5 1.82 2.73 ## 6 1.75 2.73
With this, we are all set to visually explore the results.
library('derfinderPlot') ## Overview of the candidate DERs in the genome plotOverview(regions = fullRegions, annotation = results$annotation, type = 'fwer') suppressPackageStartupMessages(library('TxDb.Hsapiens.UCSC.hg19.knownGene')) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ## Base-levle coverage plots for the first 10 regions plotRegionCoverage(regions = fullRegions, regionCoverage = regionCov, groupInfo = pheno$group, nearestAnnotation = results$annotation, annotatedRegions = annoRegs, whichRegions=1:10, txdb = txdb, scalefac = 1, ask = FALSE) ## Cluster plot for the first region plotCluster(idx = 1, regions = fullRegions, annotation = results$annotation, coverageInfo = fullCov$chr21, txdb = txdb, groupInfo = pheno$group, titleUse = 'fwer')
We have also developed an addon package called regionReport
available via GitHub. For more information check its vignette.
The function derfinderReport()
in regionReport
basically takes advantage of the results from mergeResults()
and plotting functions available in derfinderPlot
as well as other neat features from rCharts and knitrBoostrap
(Hester, 2013)
.
The resulting HTML report promotes reproducibility of the analysis and allows you to explore in more detail the results through some diagnostic plots.
An alternative type of analysis is driven by regionMatrix()
. The idea is to consider consecutive bases with mean coverage above a given cutoff as potentially differentially expressed regions. Then, get a matrix where each row is one of these regions and each column represents a sample. Each cell of the matrix has the mean coverage for the specific region - sample pair. Then, other packages specialized in differential expression at the counts level can be used, for example limma.
In this example, we will use regionMatrix()
where we filter by mean coverage greater than 30 out of counts standardized to libraries of 40 million reads. Note that read of the BrainSpan data are 76bp long.
## Use regionMatrix() system.time(regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76, totalMapped = 2^(sampleDepths), targetSize = 40e6))
## user system elapsed ## 8.82 0.38 9.19
## Explore results names(regionMat$chr21)
## [1] "regions" "coverageMatrix" "bpCoverage"
regionMatrix()
returns three pieces of output.
filterData()
and then defining the regions with findRegions()
. Note that the metadata variable value
represents the mean coverage for the given region while area
is the sum of the base-level coverage (before adjusting for read length) from all samples.plotRegionCoverage()
from derfinderPlot
.Similar to what we saw earlier, the regions are arranged in a GRanges
object. In this case, the metadata is simpler because no annotation information is included.
## regions output regionMat$chr21$regions
## GRanges object with 1822 ranges and 6 metadata columns: ## seqnames ranges strand | value ## <Rle> <IRanges> <Rle> | <numeric> ## 1 chr21 [9825461, 9825583] * | 64.1835795934005 ## 2 chr21 [9825645, 9825646] * | 30.6206716563443 ## 3 chr21 [9825872, 9826023] * | 41.4259978624777 ## 4 chr21 [9826149, 9826225] * | 34.5680066935166 ## 5 chr21 [9826990, 9827584] * | 12578.8649688978 ## ... ... ... ... ... ... ## 1818 chr21 [48084207, 48084835] * | 227.754828611473 ## 1819 chr21 [48085444, 48085455] * | 31.5989673322147 ## 1820 chr21 [48085458, 48085458] * | 30.078238709839 ## 1821 chr21 [48085469, 48085470] * | 30.3090276995855 ## 1822 chr21 [48085475, 48085477] * | 30.5711564090198 ## area indexStart indexEnd cluster clusterL ## <numeric> <integer> <integer> <Rle> <Rle> ## 1 7894.58028998826 1 123 1 765 ## 2 61.2413433126887 124 125 1 765 ## 3 6296.75167509661 126 277 1 765 ## 4 2661.73651540078 278 354 1 765 ## 5 7484424.65649418 355 949 2 595 ## ... ... ... ... ... ... ## 1818 143257.787196617 189005 189633 800 629 ## 1819 379.187607986577 189634 189645 801 34 ## 1820 30.078238709839 189646 189646 801 34 ## 1821 60.618055399171 189647 189648 801 34 ## 1822 91.7134692270595 189649 189651 801 34 ## ------- ## seqinfo: 1 sequence from an unspecified genome
## Number of regions length(regionMat$chr21$regions)
## [1] 1822
bpCoverage
is the base level coverage list which can then be used for plotting.
## Base-level coverage matrices for each of the regions ## Useful for plotting lapply(regionMat$chr21$bpCoverage[1:2], head, n = 2)
## $`1` ## HSB97 HSB92 HSB178 HSB159 HSB153 HSB113 HSB130 HSB136 ## 1 110.7007 6.732025 7.859689 22.58134 69.91491 117.8935 3.723920 6.974008 ## 2 110.7007 6.732025 7.859689 22.58134 69.91491 117.8935 5.851874 6.974008 ## HSB126 HSB145 HSB123 HSB135 ## 1 3.169279 0 4.033258 8.171553 ## 2 3.169279 0 4.033258 8.171553 ## ## $`2` ## HSB97 HSB92 HSB178 HSB159 HSB153 HSB113 HSB130 HSB136 ## 124 67.40687 13.9819 10.66672 38.71087 36.1036 130.4170 13.29971 17.04757 ## 125 69.59896 13.9819 10.66672 38.71087 36.1036 128.6896 13.29971 17.04757 ## HSB126 HSB145 HSB123 HSB135 ## 124 12.67711 9.182136 8.642696 9.079504 ## 125 12.67711 9.182136 8.642696 9.079504
## Check dimensions. First region is 123 long, second one is 2 bp long. ## The columns match the number of samples (12 in this case). lapply(regionMat$chr21$bpCoverage[1:2], dim)
## $`1` ## [1] 123 12 ## ## $`2` ## [1] 2 12
The end result of the coverage matrix is shown below. Note that the coverage has been adjusted for read length. Because reads might not fully align inside a given region, the numbers are generally not integers but can be rounded if needed.
## Dimensions of the coverage matrix dim(regionMat$chr21$coverageMatrix)
## [1] 1822 12
## Coverage for each region. This matrix can then be used with limma or other pkgs head(regionMat$chr21$coverageMatrix)
## HSB97 HSB92 HSB178 HSB159 HSB153 ## 1 290.091832 4.329483e+01 1.806103e+01 83.678304 2.160410e+02 ## 2 1.802708 3.679447e-01 2.807032e-01 1.018707 9.500947e-01 ## 3 609.755286 1.210811e+01 0.000000e+00 18.973422 4.147993e+01 ## 4 92.053499 7.392963e+00 0.000000e+00 5.560443 8.007187e+01 ## 5 72717.000773 6.054930e+04 7.904146e+04 19258.506223 8.379324e+04 ## 6 11113.935080 6.652904e+03 1.092042e+04 4400.059413 1.806119e+04 ## HSB113 HSB130 HSB136 HSB126 HSB145 ## 1 4.464305e+02 3.476125e+01 4.344990e+01 1.647191e+01 4.733207e+00 ## 2 3.409298e+00 3.499925e-01 4.486204e-01 3.336083e-01 2.416352e-01 ## 3 1.674249e+02 4.820096e+01 3.629747e+01 1.822335e+01 1.333258e+01 ## 4 1.524354e+02 9.890788e+00 3.898409e+01 7.255980e+00 1.063905e+01 ## 5 1.578244e+05 5.888896e+04 2.159025e+05 7.380968e+04 7.185880e+04 ## 6 1.450049e+04 1.429600e+04 4.520007e+04 2.457445e+04 1.394158e+04 ## HSB123 HSB135 ## 1 1.489728e+01 3.460167e+01 ## 2 2.274394e-01 2.389343e-01 ## 3 1.058351e+01 1.784441e+01 ## 4 7.535825e+00 8.454292e+00 ## 5 1.596985e+04 2.721376e+05 ## 6 5.774746e+03 3.642908e+04
Similar to the region level analysis, you might be interested in performing a feature level analysis. More specifically, this could be getting a count matrix at the exon level. coverageToExon()
allows you to get such a matrix by taking advantage of the summarized annotation produced by makeGenomicState()
.
In this example, we use the genomic state included in the package which has the information for chr21 TxDb.Hsapiens.UCSC.hg19.knownGene
annotation.
## Get the exon level matrix system.time(exonCov <- coverageToExon(fullCov, genomicState$fullGenome, L = 76))
## class: SerialParam ## bpworkers: 1 ## bptasks: 0 ## bpcatchErrors: TRUE ## bpstopOnError: FALSE ## class: SerialParam ## bpworkers: 1 ## bptasks: 0 ## bpcatchErrors: TRUE ## bpstopOnError: FALSE
## class: SerialParam ## bpworkers: 1 ## bptasks: 0 ## bpcatchErrors: TRUE ## bpstopOnError: FALSE
## user system elapsed ## 10.34 1.37 11.72
## Dimensions of the matrix dim(exonCov)
## [1] 2658 12
## Explore a little bit tail(exonCov)
## HSB97 HSB92 HSB178 HSB159 HSB153 HSB113 ## 4983 0.44605264 0.03986842 0.1289474 0.04934211 0.1052632 0.1173684 ## 4985 10.87723678 1.65184211 7.3119736 4.39013167 10.4906578 4.6542105 ## 4986 6.23315790 11.40447366 4.9807895 10.64368415 4.5310526 7.1510526 ## 4988 0.00000000 0.05644737 0.0000000 0.00000000 0.0000000 0.0000000 ## 4990 1.96131579 3.98013159 0.9542105 2.67486840 1.2284210 2.3064474 ## 4992 0.07473684 0.66907895 0.2601316 0.73039474 0.2402632 0.1652632 ## HSB130 HSB136 HSB126 HSB145 HSB123 ## 4983 0.0000000 0.03947368 0.0000000 5.263158e-03 0.1382895 ## 4985 0.7798684 0.72986842 1.6034210 9.964474e-01 1.4557895 ## 4986 152.1905258 263.63486853 252.4892106 2.345414e+02 291.0205261 ## 4988 0.2639474 0.37657895 0.7223684 6.156579e-01 0.7063158 ## 4990 35.5769737 76.96736838 69.6584212 5.978842e+01 64.9423686 ## 4992 2.4193421 14.41013157 9.7538158 5.325658e+00 8.1834211 ## HSB135 ## 4983 0.07894737 ## 4985 1.29592103 ## 4986 439.69500020 ## 4988 0.48657895 ## 4990 101.76394746 ## 4992 10.08618420
With this matrix, rounded if necessary, you can proceed to use packages such as limma, DESeq, edgeR, among others.
We can certainly make region level plots using plotRegionCoverage()
or cluster plots using plotCluster()
or overview plots using plotOveview()
, all from derfinderPlot
.
First we need to get the relevant annotation information.
## Annotate regions as exonic, intronic or intergenic system.time(annoGenome <- annotateRegions(regionMat$chr21$regions, genomicState$fullGenome))
## user system elapsed ## 0.38 0.01 0.39
## Note that the genomicState object included in derfinder only has information ## for chr21 (hg19). ## Identify closest genes to regions suppressPackageStartupMessages(library('bumphunter')) suppressPackageStartupMessages(library('TxDb.Hsapiens.UCSC.hg19.knownGene')) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene genes <- annotateTranscripts(txdb)
## Warning: Calling species() on a TxDb object is *deprecated*. ## Please use organism() instead.
system.time(annoNear <- matchGenes(regionMat$chr21$regions, genes))
## ..................
## user system elapsed ## 259.77 0.03 259.87
Now we can proceed to use derfinderPlot
to make the region level plots for the top 100 regions.
## Identify the top regions by highest total coverage top <- order(regionMat$chr21$regions$area, decreasing = TRUE)[1:100] ## Base-level plots for the top 100 regions with transcript information library('derfinderPlot') plotRegionCoverage(regionMat$chr21$regions, regionCoverage = regionMat$chr21$bpCoverage, groupInfo = pheno$group, nearestAnnotation = annoNear, annotatedRegions = annoGenome, whichRegions = top, scalefac = 1, txdb = txdb, ask = FALSE)
However, we can alternatively use epivizr to view the candidate DERs and the region matrix results in a genome browser.
## Load epivizr, it's available from Bioconductor library('epivizr') ## Load data to your browser mgr <- startEpiviz() ders_dev <- mgr$addDevice( fullRegions[as.logical(fullRegions$significantFWER) ], "Candidate DERs") ders_potential_dev <- mgr$addDevice( fullRegions[!as.logical(fullRegions$significantFWER) ], "Potential DERs") regs_dev <- mgr$addDevice(regionMat$chr21$regions, "Region Matrix") ## Go to a place you like in the genome mgr$navigate("chr21", start(regionMat$chr21$regions[top[1]]) - 100, end(regionMat$chr21$regions[top[1]]) + 100) ## Stop the navigation mgr$stopServer()
derfinder
also includes createBw()
with related functions createBwSample()
and coerceGR()
to export the output of fullCoverage()
to BigWig files. These functions can be useful in the case where you start with BAM files and later on want to save the coverage data into BigWig files, which are generally smaller.
## Subset only the first sample fullCovSmall <- lapply(fullCov, '[', 1) ## Export to BigWig bw <- createBw(fullCovSmall)
## Warning in FUN(X[[i]], ...): As of rtracklayer 1.25.16, BigWig is not ## supported on Windows. Thus exporting to BigWig will most likely fail!
## See the file. Note that the sample name is used to name the file. dir(pattern = '.bw')
## [1] "HSB97.bw"
## Internally createBw() coerces each sample to a GRanges object before ## exporting to a BigWig file. If more than one sample was exported, the ## GRangesList would have more elements. bw
## GRangesList object of length 1: ## $HSB97 ## GRanges object with 216538 ranges and 1 metadata column: ## seqnames ranges strand | score ## <Rle> <IRanges> <Rle> | <numeric> ## chr21 chr21 [9422446, 9422520] * | 0.0299999993294477 ## chr21 chr21 [9474029, 9474103] * | 0.0299999993294477 ## chr21 chr21 [9478919, 9478993] * | 0.0299999993294477 ## chr21 chr21 [9544822, 9544896] * | 0.0299999993294477 ## chr21 chr21 [9548499, 9548573] * | 0.0299999993294477 ## ... ... ... ... ... ... ## chr21 chr21 [48112669, 48112694] * | 0.140000000596046 ## chr21 chr21 [48112695, 48112730] * | 0.100000001490116 ## chr21 chr21 [48112731, 48112731] * | 0.0700000002980232 ## chr21 chr21 [48112732, 48112737] * | 0.0299999993294477 ## chr21 chr21 [48115056, 48115130] * | 0.0299999993294477 ## ## ------- ## seqinfo: 1 sequence from an unspecified genome
If you are interested in using the advanced arguments, use advancedArg()
as shown below:
## URLs to advanced arguemtns sapply(c('analyzeChr', 'loadCoverage'), advancedArg, browse = FALSE)
## analyzeChr ## "https://github.com/search?utf8=%E2%9C%93&q=advanced_argument+filename%3Aanalyze+repo%3Alcolladotor%2Fderfinder+path%3A%2FR&type=Code" ## loadCoverage ## "https://github.com/search?utf8=%E2%9C%93&q=advanced_argument+filename%3Aload+repo%3Alcolladotor%2Fderfinder+path%3A%2FR&type=Code"
## Set browse = TRUE if you want to open them in your browser
The most common advanced arguments are chrsStyle
(default is UCSC
) and verbose
(by default TRUE
). verbose
controls whether to print status updates for nearly all the functions. chrsStyle
is used to determine the chromosome naming style and is powered by GenomeInfoDb. Note that chrsStyle
is used in any of the functions that call extendedMapSeqlevels()
. If you are working with a different organism than Homo sapiens set the global species
option using options(species = 'your species')
with the syntax used in names(GenomeInfoDb::genomeStyles())
. For example:
## Set global species and chrsStyle options options(species = 'arabidopsis_thaliana') options(chrsStyle = 'NCBI')
The third commonly used advanced argument is mc.cores
. It controls the number of cores to use for the functions that can run with more than one core to speed up. In nearly all the cases, the maximum number of cores depends on the number of chromosomes. One notable exception is analyzeChr()
where the maximum number of cores depends on the chunksize
used and the dimensions of the data for the chromosome under study.
We have illustrated how to identify candidate differentially expressed regions without using annotation in the identification process by using analyzeChr()
. Furthermore, we covered how to perform the region matrix analysis with regionMatrix()
. We also highlighted other uses of the derfinder
package.
This implementation of derfinder
has its origins in Alyssa C. Frazee's derfinder (Frazee, Sabunciyan, Hansen, Irizarry, et al., 2014)
. The statistical methods and implementation by now are very different.
derfinder
Please use:
## Citation info citation('derfinder')
## ## Collado-Torres L, Frazee AC, Love MI, Irizarry RA, Jaffe AE and ## Leek JT (2015). "derfinder: Software for annotation-agnostic ## RNA-seq differential expression analysis." _bioRxiv_. <URL: ## http://doi.org/10.1101/015370>, <URL: ## http://www.biorxiv.org/content/early/2015/02/19/015370.abstract>. ## ## Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA and Leek JT ## (2014). "Differential expression analysis of RNA-seq data at ## single-base resolution." _Biostatistics_, *15 (3)*, pp. 413-426. ## <URL: http://doi.org/10.1093/biostatistics/kxt053>, <URL: ## http://biostatistics.oxfordjournals.org/content/15/3/413.long>.
We would like to thank Jessica Hekman for testing derfinder
with non-human data, thus discovering some small bugs or sections of the documentation that were not clear.
This package was made possible thanks to:
(R Core Team, 2015)
(Pages, Carlson, Falcon, and Li, 2015)
(Morgan, Lang, and Thompson, 2015)
and (Jaffe, E, Murakami, Peter, et al., 2012)
(Arora, Morgan, Carlson, and Pages, 2015)
(Lawrence, Huber, Pagès, Aboyoun, et al., 2013)
(Lawrence, Huber, Pagès, Aboyoun, et al., 2013)
(Obenchain, Love, and Morgan, 2015)
(Lawrence, Huber, Pagès, Aboyoun, et al., 2013)
(Jr, with contributions from
Charles Dupont, and many
others., 2015)
(Lawrence, Huber, Pagès, Aboyoun, et al., 2013)
(Storey, 2015)
(Morgan, Pagès, Obenchain, and Hayden, 2015)
(Lawrence, Gentleman, and Carey, 2009)
(Pages, Lawrence, and Aboyoun, 2015)
(Yin, Lawrence, and Cook, 2015)
(Collado-Torres, Jaffe, and Leek, 2014)
(Wickham and Chang, 2015)
(Wickham, 2009)
(Boettiger, 2015)
(Xie, 2014)
(Hester, 2013)
(Allaire, Cheng, Xie, McPherson, et al., 2015)
(Wickham, 2011)
(Carlson, 2015)
Code for creating the vignette
## Create the vignette library('knitrBootstrap') knitrBootstrapFlag <- packageVersion('knitrBootstrap') < '1.0.0' if(knitrBootstrapFlag) { ## CRAN version library('knitrBootstrap') system.time(knit_bootstrap('derfinder.Rmd', chooser=c('boot', 'code'), show_code = TRUE)) unlink('derfinder.md') } else { ## GitHub version library('rmarkdown') system.time(render('derfinder.Rmd', 'knitrBootstrap::bootstrap_document')) } ## Note: if you prefer the knitr version use: # library('rmarkdown') # system.time(render('derfinder.Rmd', 'html_document')) ## Clean up unlink('analysisResults', recursive = TRUE) file.remove('HSB113.bw') file.remove('derfinderRef.bib') ## Extract the R code library('knitr') knit('derfinder.Rmd', tangle = TRUE)
Date the vignette was generated.
## [1] "2015-06-15 22:38:01 PDT"
Wallclock time spent generating the vignette.
## Time difference of 11.115 mins
R
session information.
## setting value ## version R version 3.2.1 beta (2015-06-07 r68485) ## system x86_64, mingw32 ## ui RTerm ## language (EN) ## collate C ## tz America/Los_Angeles
## package * version date source ## AnnotationDbi * 1.30.1 2015-06-16 Bioconductor ## Biobase * 2.28.0 2015-06-16 Bioconductor ## BiocGenerics * 0.14.0 2015-06-16 Bioconductor ## BiocParallel 1.2.3 2015-06-16 Bioconductor ## Biostrings 2.36.1 2015-06-16 Bioconductor ## DBI * 0.3.1 2014-09-24 CRAN (R 3.2.0) ## Formula 1.2-1 2015-04-07 CRAN (R 3.2.0) ## GenomeInfoDb * 1.4.0 2015-06-16 Bioconductor ## GenomicAlignments 1.4.1 2015-06-16 Bioconductor ## GenomicFeatures * 1.20.1 2015-06-16 Bioconductor ## GenomicFiles 1.4.0 2015-06-16 Bioconductor ## GenomicRanges * 1.20.5 2015-06-16 Bioconductor ## Hmisc 3.16-0 2015-04-30 CRAN (R 3.2.0) ## IRanges * 2.2.4 2015-06-16 Bioconductor ## MASS 7.3-40 2015-03-21 CRAN (R 3.2.1) ## Matrix 1.2-1 2015-06-01 CRAN (R 3.2.1) ## RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.2.0) ## RCurl 1.95-4.6 2015-04-24 CRAN (R 3.2.0) ## RJSONIO 1.3-0 2014-07-28 CRAN (R 3.2.0) ## RSQLite * 1.0.0 2014-10-25 CRAN (R 3.2.0) ## Rcpp 0.11.6 2015-05-01 CRAN (R 3.2.0) ## RefManageR 0.8.63 2015-06-09 CRAN (R 3.2.0) ## Rsamtools 1.20.4 2015-06-16 Bioconductor ## S4Vectors * 0.6.0 2015-06-16 Bioconductor ## TxDb.Hsapiens.UCSC.hg19.knownGene * 3.1.2 2015-06-10 Bioconductor ## XML 3.98-1.2 2015-05-31 CRAN (R 3.2.0) ## XVector * 0.8.0 2015-06-16 Bioconductor ## acepack 1.3-3.3 2013-05-03 CRAN (R 3.2.0) ## bibtex 0.4.0 2014-12-31 CRAN (R 3.2.0) ## biomaRt 2.24.0 2015-06-16 Bioconductor ## bitops 1.0-6 2013-08-17 CRAN (R 3.2.0) ## bumphunter * 1.8.0 2015-06-16 Bioconductor ## cluster 2.0.1 2015-01-31 CRAN (R 3.2.1) ## codetools 0.2-11 2015-03-10 CRAN (R 3.2.1) ## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0) ## curl 0.8 2015-06-06 CRAN (R 3.2.0) ## derfinder * 1.2.1 2015-06-16 Bioconductor ## derfinderData * 0.102.0 2015-06-10 Bioconductor ## derfinderHelper 1.2.0 2015-06-16 Bioconductor ## devtools * 1.8.0 2015-05-09 CRAN (R 3.2.0) ## digest 0.6.8 2014-12-31 CRAN (R 3.2.0) ## doRNG 1.6 2014-03-07 CRAN (R 3.2.0) ## evaluate 0.7 2015-04-21 CRAN (R 3.2.0) ## foreach * 1.4.2 2014-04-11 CRAN (R 3.2.0) ## foreign 0.8-63 2015-02-20 CRAN (R 3.2.1) ## formatR 1.2 2015-04-21 CRAN (R 3.2.0) ## futile.logger 1.4.1 2015-04-20 CRAN (R 3.2.0) ## futile.options 1.0.0 2010-04-06 CRAN (R 3.2.0) ## ggplot2 * 1.0.1 2015-03-17 CRAN (R 3.2.0) ## git2r 0.10.1 2015-05-07 CRAN (R 3.2.0) ## gridExtra 0.9.1 2012-08-09 CRAN (R 3.2.0) ## gtable 0.1.2 2012-12-05 CRAN (R 3.2.0) ## highr 0.5 2015-04-21 CRAN (R 3.2.0) ## httr 0.6.1 2015-01-01 CRAN (R 3.2.0) ## iterators * 1.0.7 2014-04-11 CRAN (R 3.2.0) ## knitcitations * 1.0.6 2015-05-26 CRAN (R 3.2.0) ## knitr * 1.10.5 2015-05-06 CRAN (R 3.2.0) ## knitrBootstrap * 0.9.0 2013-10-17 CRAN (R 3.2.0) ## labeling 0.3 2014-08-23 CRAN (R 3.2.0) ## lambda.r 1.1.7 2015-03-20 CRAN (R 3.2.0) ## lattice 0.20-31 2015-03-30 CRAN (R 3.2.1) ## latticeExtra 0.6-26 2013-08-15 CRAN (R 3.2.0) ## locfit * 1.5-9.1 2013-04-20 CRAN (R 3.2.0) ## lubridate 1.3.3 2013-12-31 CRAN (R 3.2.0) ## magrittr 1.5 2014-11-22 CRAN (R 3.2.0) ## markdown 0.7.7 2015-04-22 CRAN (R 3.2.0) ## matrixStats 0.14.0 2015-02-14 CRAN (R 3.2.0) ## memoise 0.2.1 2014-04-22 CRAN (R 3.2.0) ## munsell 0.4.2 2013-07-11 CRAN (R 3.2.0) ## nnet 7.3-9 2015-02-11 CRAN (R 3.2.1) ## org.Hs.eg.db * 3.1.2 2015-06-10 Bioconductor ## pkgmaker 0.22 2014-05-14 CRAN (R 3.2.0) ## plyr 1.8.3 2015-06-12 CRAN (R 3.2.1) ## proto 0.3-10 2012-12-22 CRAN (R 3.2.0) ## qvalue 2.0.0 2015-06-16 Bioconductor ## registry 0.2 2012-01-24 CRAN (R 3.2.0) ## reshape2 1.4.1 2014-12-06 CRAN (R 3.2.0) ## rngtools 1.2.4 2014-03-06 CRAN (R 3.2.0) ## rpart 4.1-9 2015-02-24 CRAN (R 3.2.1) ## rtracklayer 1.28.4 2015-06-16 Bioconductor ## rversions 1.0.1 2015-06-06 CRAN (R 3.2.0) ## scales 0.2.5 2015-06-12 CRAN (R 3.2.0) ## stringi 0.4-1 2014-12-14 CRAN (R 3.2.0) ## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0) ## survival 2.38-2 2015-06-12 CRAN (R 3.2.1) ## xml2 0.1.1 2015-06-02 CRAN (R 3.2.0) ## xtable 1.7-4 2014-09-12 CRAN (R 3.2.0) ## zlibbioc 1.14.0 2015-06-16 Bioconductor
This vignette was generated using knitrBootstrap
(Hester, 2013)
with knitr
(Xie, 2014)
and rmarkdown
(Allaire, Cheng, Xie, McPherson, et al., 2015)
running behind the scenes.
Citations made with knitcitations
(Boettiger, 2015)
.
[1] J. Allaire, J. Cheng, Y. Xie, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 0.7. 2015. URL: http://CRAN.R-project.org/package=rmarkdown.
[2] S. Arora, M. Morgan, M. Carlson and H. Pages. GenomeInfoDb: Utilities for manipulating chromosome and other 'seqname' identifiers. R package version 1.4.0. 2015.
[3] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.6. 2015. URL: http://CRAN.R-project.org/package=knitcitations.
[4] BrainSpan. “Atlas of the Developing Human Brain [Internet]. Funded by ARRA Awards 1RC2MH089921-01, 1RC2MH090047-01, and 1RC2MH089929-01.” 2011. URL: http://developinghumanbrain.org.
[5] M. Carlson. TxDb.Hsapiens.UCSC.hg19.knownGene: Annotation package for TxDb object(s). R package version 3.1.2. 2015.
[6] L. Collado-Torres, A. C. Frazee, M. I. Love, R. A. Irizarry, et al. “derfinder: Software for annotation-agnostic RNA-seq differential expression analysis”. In: bioRxiv (2015). DOI: 10.1101/015370. URL: http://www.biorxiv.org/content/early/2015/02/19/015370.abstract.
[7] L. Collado-Torres, A. Jaffe and J. Leek. derfinderData: Processed BigWigs from BrainSpan for examples. R package version 0.102.0. 2014. URL: https://github.com/lcolladotor/derfinderData.
[8] A. C. Frazee, S. Sabunciyan, K. D. Hansen, R. A. Irizarry, et al. “Differential expression analysis of RNA-seq data at single-base resolution”. In: Biostatistics 15 (3) (2014), pp. 413-426. DOI: 10.1093/biostatistics/kxt053. URL: http://biostatistics.oxfordjournals.org/content/15/3/413.long.
[9] J. Hester. knitrBootstrap: Knitr Bootstrap framework. R package version 0.9.0. 2013. URL: http://CRAN.R-project.org/package=knitrBootstrap.
[10] Jaffe, A. E, Murakami, Peter, et al. “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies”. In: International Journal of Epidemiology (2012).
[11] F. E. H. Jr, with contributions from Charles Dupont and many others. Hmisc: Harrell Miscellaneous. R package version 3.16-0. 2015. URL: http://CRAN.R-project.org/package=Hmisc.
[12] M. Lawrence, R. Gentleman and V. Carey. “rtracklayer: an R package for interfacing with genome browsers”. In: Bioinformatics 25 (2009), pp. 1841-1842. DOI: 10.1093/bioinformatics/btp328. URL: http://bioinformatics.oxfordjournals.org/content/25/14/1841.abstract.
[13] M. Lawrence, W. Huber, H. Pagès, P. Aboyoun, et al. “Software for Computing and Annotating Genomic Ranges”. In: PLoS Computational Biology 9 (8 2013). DOI: 10.1371/journal.pcbi.1003118. URL: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118}.
[14] M. Morgan, M. Lang and R. Thompson. BiocParallel: Bioconductor facilities for parallel evaluation. R package version 1.2.3. 2015.
[15] M. Morgan, H. Pagès, V. Obenchain and N. Hayden. Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import. R package version 1.20.4. 2015. URL: http://bioconductor.org/packages/release/bioc/html/Rsamtools.html.
[16] V. Obenchain, M. Love and M. Morgan. GenomicFiles: Distributed computing by file or by range. R package version 1.4.0. 2015.
[17] H. Pages, M. Carlson, S. Falcon and N. Li. AnnotationDbi: Annotation Database Interface. R package version 1.30.1. 2015.
[18] H. Pages, M. Lawrence and P. Aboyoun. S4Vectors: S4 implementation of vectors and lists. R package version 0.6.0. 2015.
[19] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2015. URL: http://www.R-project.org/.
[20] J. Storey. qvalue: Q-value estimation for false discovery rate control. R package version 2.0.0. 2015. URL: http://qvalue.princeton.edu/, http://github.com/jdstorey/qvalue.
[21] H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. ISBN: 978-0-387-98140-6. URL: http://had.co.nz/ggplot2/book.
[22] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.
[23] H. Wickham and W. Chang. devtools: Tools to Make Developing R Packages Easier. R package version 1.8.0. 2015. URL: http://CRAN.R-project.org/package=devtools.
[24] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.
[25] T. Yin, M. Lawrence and D. Cook. biovizBase: Basic graphic utilities for visualization of genomic data. R package version 1.16.0. 2015.