filterWindows {csaw} | R Documentation |
Convenience function to compute filter statistics for windows, based on proportions or using enrichment over background.
filterWindows(data, background, type="global", assay.data="counts", assay.back="counts", prior.count=2, scale.info=NULL) scaleControlFilter(data, background)
data |
A RangedSummarizedExperiment object containing window-level counts for |
background |
A RangedSummarizedExperiment object.
For |
type |
a character string specifying the type of filtering to perform; can be any of |
assay.data |
A string or integer scalar specifying the assay containing window/bin counts in |
assay.back |
A string or integer scalar specifying the assay containing window/bin counts in |
prior.count |
a numeric scalar, specifying the prior count to use in |
scale.info |
A list containing the output of |
Proportion-based filtering supposes that a certain percentage of the genome is genuinely bound.
If type="proportion"
, the filter statistic is defined as the ratio of the rank to the total number of windows.
Rank is in ascending order, i.e., higher abundance windows have higher ratios.
Windows are retained that have rank ratios above a threshold, e.g., 0.99 if 1% of the genome is assumed to be bound.
All other values of type
will perform background-based filtering, where abundances of the windows are compared to those of putative background regions.
The filter statistic are generally defined as the difference between window and background abundances, i.e., the log-fold increase in the counts.
Windows can be filtered to retain those with large filter statistics, to select those that are more likely to contain genuine binding sites.
The differences between the methods center around how the background abundances are obtained for each window.
If type="global"
, the median average abundance across the genome is used as a global estimate of the background abundance.
This should be used when background
contains unfiltered counts for large (2 - 10 kbp) genomic bins, from which the background abundance can be computed.
The filter statistic for each window is defined as the difference between the window abundance and the global background.
If background
is not supplied, the background abundance is directly computed from entries in data
.
If type="local"
, the counts of each row in data
are subtracted from those of the corresponding row in background
.
The average abundance of the remaining counts is computed and used as the background abundance.
The filter statistic is defined by subtracting the background abundance from the corresponding window abundance for each row.
This is designed to be used when background
contains counts for expanded windows, to determine the local background estimate.
If type="control"
, the background abundance is defined as the average abundance of each row in background
.
The filter statistic is defined as the difference between the average abundance of each row in data
and that of the corresponding row in background
.
This is designed to be used when background
contains read counts for each window in the control sample(s).
Unlike type="local"
, there is no subtraction of the counts in background
prior to computing the average abundance.
For filterWindows
, a named list is returned containing:
abundances
, a numeric vector containing the average abundance of each row in data
.
filter
, a numeric vector containing the filter statistic for the given type
for each row.
back.abundances
, a numeric vector containing the average abundance of each entry in background
.
Only reported for type!="proportion"
.
For scaleControlFilter
, a named list is returned containing:
scale
, a numeric scalar containing the scaling factor for multiplying the control counts.
data.totals
, a numeric vector containing the library sizes for data
.
back.totals
, anumeric vector containing the library sizes for background
.
Proportion and global background filtering are dependent on the total number of windows/bins in the genome.
However, empty windows or bins are automatically discarded in windowCounts
(exacerbated if filter
is set above unity).
This will result in underestimation of the rank or overestimation of the global background.
To avoid this, the total number of windows or bins is inferred from the spacing.
For background-based methods, the abundances of large bins or regions in background
must be rescaled for comparison to those of smaller windows
- see getWidths
and scaledAverage
for more details.
In particular, the effective width of the window is often larger than width
, due to the counting of fragments rather than reads.
The fragment length is extracted from data$ext
and background$ext
, though users will need to set data$rlen
or background$rlen
for unextended reads (i.e., ext=NA
).
The prior.count
protects against inflated log-fold increases when the background counts are near zero.
A low prior is sufficient if background
has large counts, which is usually the case for wide regions.
Otherwise, prior.count
should be increased to a larger value like 5.
This may be necessary in type="control"
, where background
contains counts for small windows in the control sample.
When type=="control"
, ChIP samples will be compared to control samples to compute the filter statistic.
Composition biases are likely to be present, where increased binding at some loci reduces coverage of other loci in the ChIP samples.
This incorrectly results in smaller filter statistics for the latter loci, as the fold-change over the input is reduced.
To correct for this, a normalization factor between ChIP and control samples can be computed with scaleControlFilter
.
Users should supply two RangedSummarizedExperiment
objects, each containing the counts for large (~10 kbp) bins in the ChIP and control samples.
The difference in the average abundance between the two objects is computed for each bin.
The median of the differences across all bins is used as a normalization factor to correct the filter statistics for each window.
The idea is that most bins represent background regions, such that a systematic difference in abundance between ChIP and control should represent the composition bias.
scaleControlFilter
will also store the library sizes for each object in its output.
This is used to check that the data
and background
supplied to filterWindows
have the same library sizes.
Otherwise, the normalization factor computed with bin-level counts cannot be correctly applied to the window-level counts.
windowCounts
,
aveLogCPM
,
getWidths
,
scaledAverage
bamFiles <- system.file("exdata", c("rep1.bam", "rep2.bam"), package="csaw") data <- windowCounts(bamFiles, filter=1) # Proportion-based (keeping top 1%) stats <- filterWindows(data, type="proportion") head(stats$filter) keep <- stats$filter > 0.99 new.data <- data[keep,] # Global background-based (keeping fold-change above 3). background <- windowCounts(bamFiles, bin=TRUE, width=300) stats <- filterWindows(data, background, type="global") head(stats$filter) keep <- stats$filter > log2(3) # Local background-based. locality <- regionCounts(bamFiles, resize(rowRanges(data), fix="center", 300)) stats <- filterWindows(data, locality, type="local") head(stats$filter) keep <- stats$filter > log2(3) # Control-based, with binning for normalization (pretend rep2.bam is a control). binned <- windowCounts(bamFiles, width=10000, bin=TRUE) chip.bin <- binned[,1] con.bin <- binned[,2] scinfo <- scaleControlFilter(chip.bin, con.bin) chip.data <- data[,1] con.data <- data[,2] stats <- filterWindows(chip.data, con.data, type="control", prior.count=5, scale.info=scinfo) head(stats$filter) keep <- stats$filter > log2(3)