The chromVAR R package was originally developed by Alicia Schep and colleagues from the Greenleaf lab (Schep et al. 2017). Although originally designed for single-cell ATAC-seq data, it has been shown to be highly sensitive for bulk data as well (Gerbaldo et al. 2024). The aim of the method is to infer, based on the accessibility of motif matches, the relative activity of transcription factors (TFs) in each sample or cell, adjusting for technical biases (GC content and enrichment bias). It is recommended that you read the chromVAR documentation before using this package.
betterChromVAR is first and foremost a considerably faster, analytical re-implementation of the original method (it is also considerably faster than the C++ re-implementation in ArchR; Granja et al. 2021). Unlike the original chromVAR, it is entirely deterministic, and it achieves much higher efficiency by computing expectations and variances at the level of bias bins rather than in peak space.
betterChromVAR also includes a few additions, such as simpler weighted expectations, bias shrinkage, and an ATAC-seq normalization method based on the chromVAR logic. Importantly, however, not all functionalities of chromVAR have been re-implemented here: betterChromVAR chiefly focuses on the key task of efficiently computing motif deviations.
betterChromVAR is an R package available via the Bioconductor repository. It can be installed using the following commands in your R session:
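The standard Bioconductor installation commands are (assuming the package is available under the name betterChromVAR):

```r
# install BiocManager if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
# install betterChromVAR from Bioconductor
BiocManager::install("betterChromVAR")
```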
betterChromVAR takes two primary inputs: 1) the counts in peaks across cells or samples, and 2) an annotation of which TFs/motifs match which peak[1]. The peak counts should be provided as a RangedSummarizedExperiment object[2], while the annotation can either be in that format too or be provided as a (sparse) matrix.
There are multiple ways of generating your peak counts; the original
chromVAR
package includes such a function (getCounts()), and the
epiwraps
package has some with more options (functions
peakCountsFromBAM() and
peakCountsFromFrags()). Similarly, the motifmatchr
package can be used to generate the motif matching annotation. An
important consideration is that the peaks or regions used for the
purpose of this analysis should have similar widths. It is thus highly
recommended, before generating the count and annotation matrices, to
resize your regions (e.g. using
peaks <- resize(peaks, width=300, fix="center")). The
exact size should not be too large (otherwise the presence/absence of a motif becomes meaningless) and should ideally be close to the median size of your original peaks. In addition, it is advisable to restrict your peaks to those on standard chromosomes[3].
Here we’ll use dummy data as example:
suppressPackageStartupMessages({
library(SummarizedExperiment)
library(betterChromVAR)
})
attach(getDummyData())
counts
## class: RangedSummarizedExperiment
## dim: 500 10
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(1): bias
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(0):
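The motif annotation returned by getDummyData() (assumed here to be attached under the name motifMatches, the name used in the later examples) can be previewed with:

```r
# first rows of the sparse peak-by-motif match matrix
head(motifMatches)
```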
## 6 x 5 sparse Matrix of class "dgCMatrix"
## motif1 motif2 motif3 motif4 motif5
## [1,] . . . . .
## [2,] . . 1 . .
## [3,] . . 1 . .
## [4,] . . . . .
## [5,] . . . . .
## [6,] . . . . .
In this case, we can see that rowData(counts) already
has a bias column indicating the GC content of the
regions:
## DataFrame with 500 rows and 1 column
## bias
## <numeric>
## 1 0.425797
## 2 0.509541
## 3 0.467713
## 4 0.536100
## 5 0.479600
## ... ...
## 496 0.376064
## 497 0.510291
## 498 0.498850
## 499 0.462378
## 500 0.528177
Had this not been the case, we would first need to add this using:
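For example, using the addGCBias() function from the original chromVAR package, which stores the GC content of each region in a bias column of the rowData:

```r
library(chromVAR)
# compute the GC content of each peak and store it as rowData(counts)$bias
counts <- addGCBias(counts, genome = my_genome)
```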
where my_genome is a BSgenome
object or similar (e.g. an FaFile[4]).
Once we have this, we can launch the computation of the deviations:
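Assuming the motif annotation from above is named motifMatches, a minimal call is:

```r
# compute bias-adjusted per-motif deviations and z-scores
dev <- betterChromVAR(counts, motifMatches)
dev
```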
## class: SummarizedExperiment
## dim: 5 10
## metadata(0):
## assays(2): deviations z
## rownames(5): motif1 motif2 motif3 motif4 motif5
## rowData names(5): N total variability var.pval var.adjPval
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(0):
The resulting dev object is a SummarizedExperiment with
the same columns (and colData) as the original counts
object, but with the motifs as rows (instead of the original peaks). The
values can be interpreted as the relative activity, across samples or
cells, of the corresponding TFs[5]. When the function is used with default
arguments, the results are virtually the same as the original chromVAR
(averaging over the noise coming from the random background selection of
the original method).
The object contains two assays: the deviations assay contains the bias-adjusted deviations from the expectation (by default, the average), i.e. the difference from the expectation divided by the expectation, while the z assay contains z-scores, i.e. the difference from the expectation divided by the standard deviation of the background expectation.
If your samples/cells have similar library sizes, it is recommended
that you use the z assay for downstream analysis (such as
differential TF activity using limma). If the samples have
very different library sizes, the z scores will be
influenced by that, and it might be preferable to use the
deviations assay.
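For instance, a minimal differential-activity sketch with limma, assuming a hypothetical condition column in colData(dev) (the dummy data used here has no such column):

```r
library(limma)
# hypothetical two-group design; `condition` is not part of the dummy data
design <- model.matrix(~ condition, data = as.data.frame(colData(dev)))
# fit a linear model on the z-scores and moderate the variances
fit <- eBayes(lmFit(assay(dev, "z"), design))
# top differentially-active motifs
topTable(fit)
```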
The variability of each motif across the dataset is stored in the rowData of the object:
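It can be retrieved with the standard accessor:

```r
# per-motif variability statistics
rowData(dev)
```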
## DataFrame with 5 rows and 5 columns
## N total variability var.pval var.adjPval
## <numeric> <numeric> <numeric> <numeric> <numeric>
## motif1 84 38595 0.616974 0.944995 0.993556
## motif2 60 30463 0.531893 0.979615 0.993556
## motif3 66 37568 0.461789 0.992681 0.993556
## motif4 75 43217 0.454014 0.993556 0.993556
## motif5 83 45241 0.646705 0.926245 0.993556
If bootstrap confidence intervals of the variability are additionally
desired, one can simply use the computeVariability function
of chromVAR
on the present output (i.e. dev).
The betterChromVAR() function is actually a wrapper
around three steps, which can also be executed individually for more
customization, or to avoid repeating some computation multiple
times.
The first step is creating the bias bins and their pairwise sampling probabilities. This is achieved with:
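With default parameters, this step looks like the following (the function name is referenced later in this document; any further arguments are assumed to be left at their defaults):

```r
# bin peaks by GC content and overall accessibility,
# and compute the pairwise sampling probabilities between bins
bg <- getBackgroundBins(counts)
bg
```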
## Creating 50*50=2500 bias bins and computing their sampling distances
## bcvBackground object with 500 peaks,
## split into 50*50 ( 2500 ) bins.
The second step is computing the per-bin background expectations and variances for each sample or cell:
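This uses the computeBackgrounds() function (its signature is also shown in the subsetting example further below):

```r
# fill the bcvBackground object with per-bin background
# expectations and variances for each sample
bg <- computeBackgrounds(counts, bg)
bg
```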
## bcvBackground object with 500 peaks,
## split into 50*50 ( 2500 ) bins.
## Background data filled for 10 samples.
The bg object has now been filled with the additional
information. It can now be used for:
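That is, it can be passed to computeDeviationsAnalytic() together with the motif annotation:

```r
# compute deviations and z-scores from the precomputed background
dev <- computeDeviationsAnalytic(counts, bg, motifMatches)
dev
```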
## class: SummarizedExperiment
## dim: 5 10
## metadata(0):
## assays(2): deviations z
## rownames(5): motif1 motif2 motif3 motif4 motif5
## rowData names(5): N total variability var.pval var.adjPval
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(0):
This last call could be made with different annotations, without needing to recompute the previous steps.
In addition to the enrichment and GC bias taken into account by the
original chromVAR,
betterChromVAR
supports an optional third dimension: fragment length bias (see
?getBackgroundBins() for more information). This requires
the compilation of an additional bias component (the log10-transformed
mean or median fragment length per region). If desired, we recommend using the counting functions from the epiwraps package, which can provide this information.
If you are using betterChromVAR (or chromVAR, for that matter) on single-cell data with multiple cell types, the best way to run it depends on whether you are interested in differences within cell types or between cell types. If interested in differences between cell types,
run it on the entire dataset, specifying the cell types as
grouping argument to betterChromVAR(). In this
way, rare and abundant cell types will be given the same weight in
computing the expectation.
If you are interested in differences within cell types (e.g. between samples/conditions), you should instead run the method separately for each cell type. In this way, the background bins will be defined based on the enrichment in the given cell type, leading to better capture of the bias. The drawback, however, is that the values won’t be comparable across cell types. To test across conditions, we also recommend using it on pseudobulk data.
The only step of the process that needs to be done across the entire dataset is computing the expectation, i.e. the mean (or weighted mean) for each region across all cells, and creating the background bins, which depends on it. Everything else can easily be
executed in chunks (of cells), as is done for multi-threading in the
betterChromVAR() function. This can be done manually on the
full dataset, and then passed to downstream functions for computations
on subsets of the data:
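A sketch of this manual precomputation (assuming the expectation is simply the per-region mean count, as described above; ex is then passed to computeBackgrounds() as in the lines that follow):

```r
# expectation: mean count of each region across all cells/samples
ex <- rowMeans(assay(counts))
# background bins, computed once on the full dataset
bg <- getBackgroundBins(counts)
```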
## Creating 50*50=2500 bias bins and computing their sampling distances
Then we can apply the next steps to only a subset of the data:
bg2 <- computeBackgrounds(counts[,1:3], bg, expectation = ex)
dev2 <- computeDeviationsAnalytic(counts[,1:3], bg2, motifMatches)
# this should be identical to what we had run on the whole object:
identical(assay(dev)[,1:3], assay(dev2))
## [1] FALSE
If data is on disk, rather than in memory, and you are adjusting the
expectation based on cell-type abundance using the grouping
argument, getBackgroundBins will load the entire count
matrix into memory. To avoid this, compute the mean per group using
block-wise operations, for instance using the
aggregateAcrossCells() function of the scrapper
package, and then use the rowMeans() of that as
expectation.
The general chromVAR approach, and in particular the deterministic version implemented here, can also be used to normalize GC and enrichment bias out of bulk ATAC-seq data. This is implemented in the CVnorm() function. In addition, if given a grouping of the samples (e.g. experimental conditions), the function applies a variance-based smoothing of the bias correction, inspired by smoothed quantile normalization (see the qsmooth package or Hicks et al. (2018)). In a nutshell, if the bias in a certain background bin is explained by the experimental groups, it will be corrected less than if it varies across samples within the groups[6].
Example usage:
# we assign arbitrary groups to the samples:
counts$group <- rep(LETTERS[1:2], each=5)
# we run the smoothed CVnorm:
counts <- CVnorm(counts, grouping=counts$group)
# (the normal CVnorm could be run by omitting the grouping)
counts
## class: RangedSummarizedExperiment
## dim: 500 10
## metadata(0):
## assays(2): counts corrected
## rownames: NULL
## rowData names(1): bias
## colnames(10): sample1 sample2 ... sample9 sample10
## colData names(1): group
This adds a corrected assay to the object. Note that
although the assay is corrected for enrichment- and GC-bias, it is not
corrected for library size differences. Rather, it is on the original
count scale (although not integer anymore), so that it is amenable to
use in downstream count-based analysis methods such as edgeR.
Session information:
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] betterChromVAR_0.99.41 SummarizedExperiment_1.41.1
## [3] Biobase_2.71.0 GenomicRanges_1.63.2
## [5] Seqinfo_1.1.0 IRanges_2.45.0
## [7] S4Vectors_0.49.2 BiocGenerics_0.57.1
## [9] generics_0.1.4 MatrixGenerics_1.23.0
## [11] matrixStats_1.5.0 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-5 jsonlite_2.0.0 crayon_1.5.3
## [4] compiler_4.5.3 BiocManager_1.30.27 Biostrings_2.79.5
## [7] parallel_4.5.3 jquerylib_0.1.4 BiocParallel_1.45.0
## [10] yaml_2.3.12 fastmap_1.2.0 lattice_0.22-9
## [13] R6_2.6.1 XVector_0.51.0 S4Arrays_1.11.1
## [16] knitr_1.51 DelayedArray_0.37.1 maketools_1.3.2
## [19] bslib_0.10.0 rlang_1.2.0 cachem_1.1.0
## [22] xfun_0.57 sass_0.4.10 sys_3.4.3
## [25] SparseArray_1.11.13 cli_3.6.6 digest_0.6.39
## [28] grid_4.5.3 lifecycle_1.0.5 evaluate_1.0.5
## [31] codetools_0.2-20 buildtools_1.0.0 abind_1.4-8
## [34] rmarkdown_2.31 tools_4.5.3 htmltools_0.5.9
[1] The annotation can be binary or probabilistic (i.e. from 0 to 1). If using motif matches, however, it is recommended to input binary matches, as the magnitude of motif scores is generally poorly correlated with actual binding.
[2] See the SummarizedExperiment package if you are not familiar with these objects.
[3] See keepStandardChromosomes() from the GenomeInfoDb package.
[5] Note, however, that similar motifs will be given similar activity estimates, so it is often hard to know which of a set of highly similar motifs is in fact responsible for the observed signal.
[6] Note that this is different from the original qsmooth approach; see ?CVnorm for more detail.