The following vignette describes the nullranges implementation of the block bootstrap with respect to a genomic segmentation. See the main nullranges vignette for an overview of the idea of bootstrapping, or the diagram below. There is additionally a vignette on block bootstrapping without respect to segmentation: Unsegmented block bootstrap.
As proposed by Bickel et al. (2010), nullranges contains an implementation of a block bootstrap, such that features are sampled from the genome in blocks. The original block bootstrapping algorithm is implemented in a Python package called Genome Structure Correlation (GSC).
The bootRanges methods are described in Mu et al. (2022).
In a segmented block bootstrap, the blocks are sampled and placed within regions of a genome segmentation. That is, for a genome segmented into states 1,2,…,S, blocks from state s will be used to tile the ranges of state s in each bootstrap sample. The process is visualized in (A): a block of length \(L_b\) is \(\color{brown}{\text{randomly}}\) selected from state “red” and moved to \(\color{brown}{\text{tile}}\) a region of the same state, possibly on a different chromosome.
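The tiling idea can be illustrated with a small base R sketch. This is a simplified toy on a single chromosome and not the nullranges implementation: the names seg, feat, and boot_one_state are hypothetical, source blocks are kept entirely inside a segment of the sampled state, and the segment widths are chosen as exact multiples of the block length so that no edge handling is needed.

```r
# Toy segmentation of one 900 bp chromosome: start, end and state per segment
seg <- data.frame(start = c(1, 301, 601),
                  end   = c(300, 600, 900),
                  state = c(1, 2, 1))

# Toy features (start positions only), labeled by the state they fall in
set.seed(1)
feat <- data.frame(start = sort(sample(1:900, 60)))
feat$state <- seg$state[findInterval(feat$start, seg$start)]

# Hypothetical helper: tile every segment of state s with blocks of length
# L_b that are sampled at random from segments of the same state
boot_one_state <- function(s, L_b) {
  segs <- seg[seg$state == s, ]
  out <- list()
  for (i in seq_len(nrow(segs))) {
    # left edges of the tiles that cover this segment
    tiles <- seq(segs$start[i], segs$end[i], by = L_b)
    for (tile_start in tiles) {
      # pick a source segment of state s, then a block start inside it
      j <- sample(nrow(segs), 1)
      src_start <- sample(segs$start[j]:(segs$end[j] - L_b + 1), 1)
      in_block <- feat$start >= src_start & feat$start < src_start + L_b
      # shift the features of the sampled block onto the tile
      new_start <- feat$start[in_block] - src_start + tile_start
      out[[length(out) + 1]] <- data.frame(start = new_start,
                                           state = rep(s, length(new_start)))
    }
  }
  do.call(rbind, out)
}

# One bootstrap sample: process each state separately, then combine
boot <- do.call(rbind, lapply(unique(seg$state), boot_one_state, L_b = 100))
boot <- boot[order(boot$start), ]
```

In this toy, every bootstrapped feature lands in a segment of the same state as the block it was sampled from, which is the defining property of the segmented scheme.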
An example workflow of bootRanges used in combination with plyranges is diagrammed in (B), and can be summarized as:
bootRanges() with optional segmentation and exclude to create a bootRanges object \(y'\)

The segmented block bootstrap has two options, either:
In this vignette, we give an example of segmenting the hg38 genome by Ensembl gene density, creating bootstrapped peaks, and evaluating overlaps for observed and bootstrapped peaks. We then profile the time to generate a single block bootstrap sample. Finally, we use a toy dataset to visualize what a segmented block bootstrap sample looks like with respect to a genome segmentation.
A final consideration is whether the blocks should scale proportionally to the segment state length, with the default setting of proportionLength=TRUE. When blocks scale proportionally, blockLength provides the maximal length of a block, while the actual block length used for a segmentation state is proportional to the fraction of genomic basepairs covered by that state. This option is visualized on toy data at the end of this vignette.
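As a rough illustration of proportional scaling, the base R sketch below computes the fraction of basepairs covered by each state of a toy segmentation and derives a per-state block length from it. The rule blockLength times fraction is an assumption made for this sketch (the simplest reading of "proportional"); the exact scaling used inside bootRanges may differ.

```r
# Toy segmentation: width (bp) and state of each segment
seg_width <- c(100, 100, 200, 100, 200, 100, 100)
seg_state <- c(1, 2, 1, 3, 3, 2, 1)
blockLength <- 50  # maximal block length

# Basepairs covered by each state, and each state's fraction of the genome
bp_per_state <- tapply(seg_width, seg_state, sum)
frac <- bp_per_state / sum(bp_per_state)

# ASSUMPTION: block length for a state = blockLength * fraction of basepairs
L_b <- round(blockLength * frac)
L_b
```

Under this assumption every state receives a block length below blockLength, and states covering more of the genome receive longer blocks.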
\(\color{brown}{\text{To avoid placing bootstrap features into regions of the genome that don’t typically have features}}\), we import excluded regions including ENCODE-produced excludable regions (Amemiya, Kundaje, and Boyle 2019), telomeres from UCSC, and centromeres (Commo 2022). For easy use, pre-combined excludable regions are stored in ExperimentHub. These steps, which use the excluderanges package (Dozmorov et al. 2022), are included in nullrangesData in the inst/scripts/make-segmentation-hg38.R script.
suppressPackageStartupMessages(library(ExperimentHub))
eh = ExperimentHub()
# query(eh, "nullrangesdata")
exclude <- eh[["EH7306"]]
For convenience, nullranges provides pre-built segmentations, generated by following the section Segmentation by gene density below. Pre-built segmentations using either the CBS or HMM method, with \(L_s=2e6\) and accounting for excludable regions, can be selected from ExperimentHub.
seg_cbs <- eh[["EH7307"]]
seg_hmm <- eh[["EH7308"]]
seg <- seg_cbs
First we obtain the Ensembl genes (Howe et al. 2020) for segmenting by gene density. We obtain these using the ensembldb package (Rainer, Gatto, and Weichenberger 2019).
suppressPackageStartupMessages(library(ensembldb))
suppressPackageStartupMessages(library(EnsDb.Hsapiens.v86))
edb <- EnsDb.Hsapiens.v86
filt <- AnnotationFilterList(GeneIdFilter("ENSG", "startsWith"))
g <- genes(edb, filter = filt)
We perform some processing to align the sequences (chromosomes) of g with our excluded regions and our features of interest (DNase hypersensitive sites, or DHS, defined below).
library(GenomeInfoDb)
g <- keepStandardChromosomes(g, pruning.mode = "coarse")
seqlevels(g, pruning.mode="coarse") <- setdiff(seqlevels(g), "MT")
# normally we would assign a new style, but for recent host issues
## seqlevelsStyle(g) <- "UCSC"
seqlevels(g) <- paste0("chr", seqlevels(g))
genome(g) <- "hg38"
g <- sortSeqlevels(g)
g <- sort(g)
table(seqnames(g))
##
## chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13
## 5194 3971 3010 2505 2868 2863 2867 2353 2242 2204 3235 2940 1304
## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
## 2224 2152 2511 2995 1170 2926 1386 835 1318 2359 523
We first demonstrate the use of a CBS segmentation as implemented in DNAcopy (Olshen et al. 2004).
We load the nullranges and plyranges packages, and patchwork in order to produce grids of plots.
library(nullranges)
suppressPackageStartupMessages(library(plyranges))
library(patchwork)
We subset the excluded ranges to those which are 500 bp or larger. The motivation for this step is to avoid segmenting the genome into many small pieces due to an abundance of short excluded regions. Note that we re-save the excluded ranges to exclude. Here, and below, we need to specify plyranges::filter as it conflicts with filter exported by ensembldb.
set.seed(5)
exclude <- exclude %>%
  plyranges::filter(width(exclude) >= 500)
L_s <- 1e6
seg_cbs <- segmentDensity(g, n = 3, L_s = L_s,
                          exclude = exclude, type = "cbs")
## Analyzing: Sample.1
plots <- lapply(c("ranges","barplot","boxplot"), function(t) {
  plotSegment(seg_cbs, exclude, type = t)
})
plots[[1]]
plots[[2]] + plots[[3]]
Note that the default ranges plot shows the whole genome, and each break in a band represents a transition between states. However, transitions that occur within small ranges (e.g. 1 kb) cannot be seen at this scale. To examine the segmentation states within a specific range, use the region argument.
region <- GRanges("chr16", IRanges(3e7,4e7))
plotSegment(seg_cbs, exclude, type="ranges", region=region)
Here we use an alternative segmentation implemented in the RcppHMM CRAN package, using the initGHMM, learnEM, and viterbi functions.
seg_hmm <- segmentDensity(g, n = 3, L_s = L_s,
                          exclude = exclude, type = "hmm")
## Finished at Iteration: 111 with Error: 9.22293e-06
plots <- lapply(c("ranges","barplot","boxplot"), function(t) {
  plotSegment(seg_hmm, exclude, type = t)
})
plots[[1]]
plots[[2]] + plots[[3]]
We use a set of DNase hypersensitivity sites (DHS) from the ENCODE project (ENCODE 2012) in the A549 cell line (ENCSR614GWM). Here, for speed, we work with a pre-processed data object from ExperimentHub. The steps used to create it are included in nullrangesData in the inst/scripts/make-dhs-data.R script.
For speed of the vignette, we restrict to a smaller number of DHS, filtering by the signal value. We also remove metadata columns that we don’t need for the bootstrap analysis. Consider, when creating bootstrapped data, that you will be creating an object many times larger than your original features, so \(\color{brown}{\text{filtering and trimming}}\) extra metadata can help make the analysis more efficient.
suppressPackageStartupMessages(library(nullrangesData))
dhs <- DHSA549Hg38()
## see ?nullrangesData and browseVignettes('nullrangesData') for documentation
## loading from cache
dhs <- dhs %>% plyranges::filter(signalValue > 100) %>%
  mutate(id = seq_along(.)) %>%
  plyranges::select(id)
length(dhs)
## [1] 6214
table(seqnames(dhs))
##
## chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13
## 1436 252 108 30 148 51 184 146 155 443 436 526 20
## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
## 197 265 214 715 20 649 142 31 19 17 10
Now we apply a segmented block bootstrap with blocks of size 500kb, to the peaks. Here we show generation of 50 iterations of a block bootstrap followed by a typical overlap analysis using plyranges (Lee, Cook, and Lawrence 2019). (We might normally do 100 iterations or more, depending on the granularity of the bootstrap distribution that is needed.)
set.seed(5) # for reproducibility
R <- 50
blockLength <- 5e5
boots <- bootRanges(dhs, blockLength, R = R, seg = seg, exclude=exclude)
boots
## BootRanges object with 310726 ranges and 4 metadata columns:
## seqnames ranges strand | id block iter
## <Rle> <IRanges> <Rle> | <integer> <integer> <Rle>
## [1] chr1 242791-242940 * | 347 5 1
## [2] chr1 256031-256180 * | 348 5 1
## [3] chr1 391535-391684 * | 5301 8 1
## [4] chr1 421046-421195 * | 5302 8 1
## [5] chr1 438186-438335 * | 5303 8 1
## ... ... ... ... . ... ... ...
## [310722] chrY 27090908-27091057 * | 2133 12441 50
## [310723] chrY 27194968-27195117 * | 2134 12441 50
## [310724] chrY 27224188-27224337 * | 2135 12441 50
## [310725] chrY 27234153-27234302 * | 2136 12441 50
## [310726] chrY 27789879-27790028 * | 2116 12442 50
## blockLength
## <Rle>
## [1] 500000
## [2] 500000
## [3] 500000
## [4] 500000
## [5] 500000
## ... ...
## [310722] 500000
## [310723] 500000
## [310724] 500000
## [310725] 500000
## [310726] 500000
## -------
## seqinfo: 24 sequences from hg38 genome
What is returned here? The bootRanges function returns a bootRanges object, which is a simple sub-class of GRanges. The iteration (iter) and the block length (blockLength) are recorded as metadata columns, accessible via mcols. We return the bootstrapped ranges as GRanges rather than GRangesList, as the former is more compatible with downstream overlap joins with plyranges, where the iteration column can be used with group_by to provide per bootstrap summary statistics, as shown below.
Note that we use the exclude object from the previous step, which does not contain small ranges. If one wanted to also avoid generation of bootstrapped features that overlap small excluded ranges, then omit this filtering step (use the original, complete exclude feature set).
We can examine properties of the bootstrapped y over iterations, and compare them to the original y. To do so, we first add the original features as iter=0, then compute summaries:
suppressPackageStartupMessages(library(tidyr))
combined <- dhs %>%
  mutate(iter=0) %>%
  bind_ranges(boots) %>%
  plyranges::select(iter)
stats <- combined %>%
  group_by(iter) %>%
  summarize(n = n()) %>%
  as_tibble()
head(stats)
## # A tibble: 6 × 2
## iter n
## <fct> <int>
## 1 0 6214
## 2 1 6150
## 3 2 6276
## 4 3 6249
## 5 4 6293
## 6 5 6365
We can also look at distributions of various aspects, e.g. here the inter-feature distance of features, across a few of the bootstraps and the original feature set y.
suppressPackageStartupMessages(library(ggridges))
suppressPackageStartupMessages(library(purrr))
suppressPackageStartupMessages(library(ggplot2))
# distance between midpoints of consecutive features on the same chromosome
interdist <- function(dat) {
  x = dat[-1,]
  y = dat[-nrow(dat),]
  ifelse(x$seqnames == y$seqnames,
         x$start + floor((x$width - 1)/2) -
         y$start-floor((y$width - 1)/2), NA)}

combined %>% plyranges::filter(iter %in% 0:3) %>%
  plyranges::select(iter) %>%
  as.data.frame() %>%
  nest(data = -iter) %>% # named form; the unnamed nest(-iter) is deprecated
  mutate(interdist = map(data, ~interdist(.))) %>%
  dplyr::select(iter,interdist) %>%
  unnest(interdist) %>%
  mutate(type = ifelse(iter == 0, "original", "boot")) %>%
  ggplot(aes(log10(interdist), iter, fill=type)) +
  geom_density_ridges(alpha = 0.75) +
  geom_text(data=head(stats,4),
            aes(x=1.5, y=iter, label=paste0("n=",n), fill=NULL),
            vjust=1.5)
## Picking joint bandwidth of 0.198
Suppose we have a set of features x and we are interested in evaluating the \(\color{brown}{\text{enrichment of this set with the DHS}}\). We can calculate, for example, the observed number of overlaps of each feature in x with DHS across the whole genome, summarized here by the mean (or something more complicated, e.g. the maximum log fold change or signal value for DHS peaks within a maxgap window of x).
x <- GRanges("chr2", IRanges(1 + 50:99 * 1e6, width=1e6), x_id=1:50)
x <- x %>% mutate(n_overlaps = count_overlaps(., dhs))
mean( x$n_overlaps )
## [1] 1.28
We can repeat this with the bootstrapped features using a group_by command, a summarize, followed by a second group_by and summarize. It may help to step through these commands one by one to understand what the intermediate output is.
Note that we need to use tidyr::complete in order to fill in combinations of x and iter where the overlap was 0.
boot_stats <- x %>% join_overlap_inner(boots) %>%
  group_by(x_id, iter) %>%
  summarize(n_overlaps = n()) %>%
  as.data.frame() %>%
  complete(x_id, iter, fill=list(n_overlaps = 0)) %>%
  group_by(iter) %>%
  summarize(meanOverlaps = mean(n_overlaps))
The above code, first grouping by x_id and iter, then subsequently by iter, is general and allows for more complex analysis than just mean overlap (e.g. how many times an x range has 1 or more overlaps, what is the mean or max signal value for peaks overlapping ranges in x, etc.).
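Why the zero-filling step matters can be seen in a small base R sketch with made-up counts. The data frame counts and its values are hypothetical; expand.grid and merge stand in here for tidyr::complete.

```r
# Counts from an inner join: combinations with zero overlaps are simply absent
counts <- data.frame(x_id = c(1L, 1L, 2L),
                     iter = c(1L, 2L, 1L),
                     n_overlaps = c(3, 1, 2))

# All combinations of x_id and iter, including the missing pair (2, 2)
grid <- expand.grid(x_id = 1:2, iter = 1:2)
full <- merge(grid, counts, all.x = TRUE)
full$n_overlaps[is.na(full$n_overlaps)] <- 0

# Mean overlaps per iteration, without and with the zero rows
naive  <- tapply(counts$n_overlaps, counts$iter, mean)
filled <- tapply(full$n_overlaps, full$iter, mean)
```

Without the fill, the mean for an iteration is taken only over the x ranges that had at least one overlap, which overstates the mean overlap for that iteration.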
If one is interested in assessing \(\color{brown}{\text{feature-wise}}\) statistics instead of \(\color{brown}{\text{genome-wise}}\) statistics, e.g. the mean observed number of overlaps per feature or the mean base pair overlap in x, one can also group by both (block, iter). 10,000 total blocks may therefore be sufficient to derive a bootstrap distribution, avoiding the need to generate many bootstrap genomes of data.
Finally we can plot a histogram. In this case, as the x features were arbitrary, our observed value falls within the distribution of mean overlap with bootstrapped data.
suppressPackageStartupMessages(library(ggplot2))
ggplot(boot_stats, aes(meanOverlaps)) +
geom_histogram(binwidth=.2)
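One way to put a number on where the observed value sits within the bootstrap distribution is a z-score or an empirical p-value. The sketch below is only an illustration: boot_means is simulated and stands in for boot_stats$meanOverlaps, and the one-sided test direction is an assumption.

```r
obs <- 1.28  # observed mean overlap reported above

# HYPOTHETICAL bootstrap means standing in for boot_stats$meanOverlaps
set.seed(1)
boot_means <- rnorm(50, mean = 1.3, sd = 0.4)

# z-score of the observed statistic relative to the bootstrap distribution
z <- (obs - mean(boot_means)) / sd(boot_means)

# Empirical one-sided p-value; the +1 terms keep it from being exactly 0
p_emp <- (1 + sum(boot_means >= obs)) / (1 + length(boot_means))
```

With only 50 iterations the empirical p-value cannot fall below 1/51, which is one reason to run more iterations when finer granularity is needed.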
For more examples of combining bootRanges from nullranges with plyranges piped operations, see the relevant chapter in the tidy-ranges-tutorial book.
Here, we test the speed of the various options for bootstrapping (see below for visualization of the difference).
library(microbenchmark)
microbenchmark(
  list=alist(
    prop = bootRanges(dhs, blockLength, seg = seg, proportionLength = TRUE),
    no_prop = bootRanges(dhs, blockLength, seg = seg, proportionLength = FALSE)
), times=10)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## prop 83.00469 86.95662 94.27665 90.46532 94.20725 130.50440 10 b
## no_prop 78.45001 79.83539 80.91782 81.10273 82.00584 83.21263 10 a
Below we present a toy example for visualizing the segmented block bootstrap. First, we define a helper function for plotting GRanges using plotgardener (Kramer et al. 2021). A key aspect here is that we color the original and bootstrapped ranges by the genomic state (the state of the segmentation that the original ranges fall in).
suppressPackageStartupMessages(library(plotgardener))
my_palette <- function(n) {
  head(c("red","green3","red3","dodgerblue",
         "blue2","green4","darkred"), n)
}
plotGRanges <- function(gr) {
  pageCreate(width = 5, height = 5, xgrid = 0,
             ygrid = 0, showGuides = TRUE)
  for (i in seq_along(seqlevels(gr))) {
    chrom <- seqlevels(gr)[i]
    chromend <- seqlengths(gr)[[chrom]]
    suppressMessages({
      p <- pgParams(chromstart = 0, chromend = chromend,
                    x = 0.5, width = 4*chromend/500, height = 2,
                    at = seq(0, chromend, 50),
                    fill = colorby("state_col", palette=my_palette))
      prngs <- plotRanges(data = gr, params = p,
                          chrom = chrom,
                          y = 2 * i,
                          just = c("left", "bottom"))
      annoGenomeLabel(plot = prngs, params = p, y = 0.1 + 2 * i)
    })
  }
}
Create a toy genome segmentation:
library(GenomicRanges)
seq_nms <- rep(c("chr1","chr2"), c(4,3))
seg <- GRanges(
  seqnames = seq_nms,
  IRanges(start = c(1, 101, 201, 401, 1, 201, 301),
          width = c(100, 100, 200, 100, 200, 100, 100)),
  seqlengths=c(chr1=500,chr2=400),
  state = c(1,2,1,3,3,2,1),
  state_col = factor(1:7)
)
We can visualize with our helper function:
plotGRanges(seg)
Now create small ranges distributed uniformly across the toy genome:
set.seed(1)
n <- 200
gr <- GRanges(
  seqnames=sort(sample(c("chr1","chr2"), n, TRUE)),
  IRanges(start=round(runif(n, 1, 500-10+1)), width=10)
)
suppressWarnings({
  seqlengths(gr) <- seqlengths(seg)
})
gr <- gr[!(seqnames(gr) == "chr2" & end(gr) > 400)]
gr <- sort(gr)
idx <- findOverlaps(gr, seg, type="within", select="first")
gr <- gr[!is.na(idx)]
idx <- idx[!is.na(idx)]
gr$state <- seg$state[idx]
gr$state_col <- factor(seg$state_col[idx])
plotGRanges(gr)
We can visualize block bootstrapped ranges when the blocks do not scale to segment state length:
set.seed(1)
gr_prime <- bootRanges(gr, blockLength = 25, seg = seg,
                       proportionLength = FALSE)
plotGRanges(gr_prime)
This time the blocks scale to the segment state length. Note that in this case blockLength is the maximal block length possible, but the actual block lengths per segment will be smaller (proportional to the fraction of basepairs covered by that state in the genome segmentation).
set.seed(1)
gr_prime <- bootRanges(gr, blockLength = 50, seg = seg,
                       proportionLength = TRUE)
plotGRanges(gr_prime)
Note that some ranges from adjacent states are allowed to be placed within different states in the bootstrap sample. This is because, during the random sampling of blocks of original data, a block is allowed to extend beyond the segmentation region of the state being sampled, and features from adjacent states are not excluded from the sampled block.
sessionInfo()
## R version 4.2.1 Patched (2022-07-09 r82577)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_GB/en_US.UTF-8
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] microbenchmark_1.4.9 purrr_0.3.5
## [3] ggridges_0.5.4 tidyr_1.2.1
## [5] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.22.0
## [7] AnnotationFilter_1.22.0 GenomicFeatures_1.50.0
## [9] AnnotationDbi_1.60.0 patchwork_1.1.2
## [11] plyranges_1.18.0 nullrangesData_1.3.0
## [13] ExperimentHub_2.6.0 AnnotationHub_3.6.0
## [15] BiocFileCache_2.6.0 dbplyr_2.2.1
## [17] ggplot2_3.3.6 plotgardener_1.4.0
## [19] nullranges_1.4.0 InteractionSet_1.26.0
## [21] SummarizedExperiment_1.28.0 Biobase_2.58.0
## [23] MatrixGenerics_1.10.0 matrixStats_0.62.0
## [25] GenomicRanges_1.50.0 GenomeInfoDb_1.34.0
## [27] IRanges_2.32.0 S4Vectors_0.36.0
## [29] BiocGenerics_0.44.0
##
## loaded via a namespace (and not attached):
## [1] RcppHMM_1.2.2 lazyeval_0.2.2
## [3] splines_4.2.1 BiocParallel_1.32.0
## [5] TH.data_1.1-1 digest_0.6.30
## [7] yulab.utils_0.0.5 htmltools_0.5.3
## [9] fansi_1.0.3 magrittr_2.0.3
## [11] memoise_2.0.1 ks_1.13.5
## [13] Biostrings_2.66.0 sandwich_3.0-2
## [15] prettyunits_1.1.1 jpeg_0.1-9
## [17] colorspace_2.0-3 blob_1.2.3
## [19] rappdirs_0.3.3 xfun_0.34
## [21] dplyr_1.0.10 crayon_1.5.2
## [23] RCurl_1.98-1.9 jsonlite_1.8.3
## [25] survival_3.4-0 zoo_1.8-11
## [27] glue_1.6.2 gtable_0.3.1
## [29] zlibbioc_1.44.0 XVector_0.38.0
## [31] strawr_0.0.9 DelayedArray_0.24.0
## [33] scales_1.2.1 mvtnorm_1.1-3
## [35] DBI_1.1.3 Rcpp_1.0.9
## [37] xtable_1.8-4 progress_1.2.2
## [39] gridGraphics_0.5-1 bit_4.0.4
## [41] mclust_6.0.0 httr_1.4.4
## [43] RColorBrewer_1.1-3 speedglm_0.3-4
## [45] ellipsis_0.3.2 pkgconfig_2.0.3
## [47] XML_3.99-0.12 farver_2.1.1
## [49] sass_0.4.2 utf8_1.2.2
## [51] DNAcopy_1.72.0 ggplotify_0.1.0
## [53] tidyselect_1.2.0 labeling_0.4.2
## [55] rlang_1.0.6 later_1.3.0
## [57] munsell_0.5.0 BiocVersion_3.16.0
## [59] tools_4.2.1 cachem_1.0.6
## [61] cli_3.4.1 generics_0.1.3
## [63] RSQLite_2.2.18 evaluate_0.17
## [65] stringr_1.4.1 fastmap_1.1.0
## [67] yaml_2.3.6 knitr_1.40
## [69] bit64_4.0.5 KEGGREST_1.38.0
## [71] mime_0.12 pracma_2.4.2
## [73] xml2_1.3.3 biomaRt_2.54.0
## [75] compiler_4.2.1 filelock_1.0.2
## [77] curl_4.3.3 png_0.1-7
## [79] interactiveDisplayBase_1.36.0 tibble_3.1.8
## [81] bslib_0.4.0 stringi_1.7.8
## [83] highr_0.9 lattice_0.20-45
## [85] ProtGenerics_1.30.0 Matrix_1.5-1
## [87] vctrs_0.5.0 pillar_1.8.1
## [89] lifecycle_1.0.3 BiocManager_1.30.19
## [91] jquerylib_0.1.4 data.table_1.14.4
## [93] bitops_1.0-7 httpuv_1.6.6
## [95] rtracklayer_1.58.0 R6_2.5.1
## [97] BiocIO_1.8.0 promises_1.2.0.1
## [99] KernSmooth_2.23-20 codetools_0.2-18
## [101] MASS_7.3-58.1 assertthat_0.2.1
## [103] rjson_0.2.21 withr_2.5.0
## [105] GenomicAlignments_1.34.0 Rsamtools_2.14.0
## [107] multcomp_1.4-20 GenomeInfoDbData_1.2.9
## [109] parallel_4.2.1 hms_1.1.2
## [111] rmarkdown_2.17 shiny_1.7.3
## [113] restfulr_0.0.15