detailRanges {csaw} | R Documentation |
Add detailed exon-based annotation to specified genomic regions.
detailRanges(incoming, txdb, orgdb, dist=5000, promoter=c(3000, 1000), max.intron=1e6, key.field="ENTREZID", name.field="SYMBOL", ignore.strand=TRUE)
incoming |
a GRanges object containing the ranges to be annotated |
txdb |
a TxDb object for the genome of interest |
orgdb |
a genome wide annotation object for the genome of interest |
dist |
an integer scalar specifying the flanking distance to annotate |
promoter |
an integer vector of length 2, where first and second values define the promoter as some distance upstream and downstream from the TSS, respectively |
max.intron |
an integer scalar indicating the maximum distance between exons |
key.field |
a character scalar specifying the keytype for name extraction |
name.field |
a character scalar or vector specifying the column(s) to use as the gene name |
ignore.strand |
a logical scalar indicating whether strandedness in incoming should be ignored |
This function adds exon-based annotations to a given set of genomic regions, in the form of compact character strings specifying the features overlapping and flanking each region.
The aim is to determine the genic context of empirically identified regions.
This allows some basic biological interpretation of binding/marking in those regions.
All neighboring genes within a specified range are reported, rather than just the closest gene to the region.
If a region in incoming is stranded and ignore.strand=FALSE, annotated features will only be reported if they lie on the same strand as that region.
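The strand rule above can be sketched as a small predicate. This is an illustrative sketch only: keep_feature is a hypothetical helper, not part of csaw, and it assumes that unstranded regions are marked with "*".

```r
# Hypothetical helper (not part of csaw): decide whether a feature should be
# reported for a region, following the strand rule described above.
# Assumption: unstranded regions carry the strand "*".
keep_feature <- function(region_strand, feature_strand, ignore.strand = TRUE) {
    ignore.strand | region_strand == "*" | region_strand == feature_strand
}

keep_feature("+", "-")                         # strand ignored by default
keep_feature("+", "-", ignore.strand = FALSE)  # opposite strands
keep_feature("*", "-", ignore.strand = FALSE)  # unstranded region
```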
If incoming is missing, then the annotation will be provided directly to the user in the form of a GRanges object.
This may be more useful when further work on the annotation is required.
Exon numbers are provided in the metadata with promoters and gene bodies labelled as 0 and -1, respectively.
Overlaps to introns can be identified by finding those regions that overlap with gene bodies but not with any of the corresponding exons.
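The intron rule can be illustrated with plain coordinates. The sketch below uses base R and toy intervals rather than the GRanges object itself; overlaps and is_intronic are hypothetical helper names, and all coordinates are invented for illustration.

```r
# Toy sketch: a region is intronic if it overlaps the gene body
# but none of that gene's exons. All coordinates are made up.
overlaps <- function(start, end, qstart, qend) {
    start <= qend & end >= qstart
}

is_intronic <- function(qstart, qend, exons, gene_body) {
    in_body <- overlaps(gene_body[["start"]], gene_body[["end"]], qstart, qend)
    in_exon <- any(overlaps(exons$start, exons$end, qstart, qend))
    in_body && !in_exon
}

exons <- data.frame(start = c(100, 400, 900), end = c(200, 500, 1000))
gene_body <- c(start = 100, end = 1000)

is_intronic(250, 350, exons, gene_body)    # between exons 1 and 2
is_intronic(150, 350, exons, gene_body)    # touches exon 1
is_intronic(1200, 1300, exons, gene_body)  # outside the gene body
```

With the real output, the same logic would apply to ranges whose exon number is -1 (gene body) versus ranges with positive exon numbers.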
If incoming is not provided, a GRanges object will be returned containing ranges for the exons, promoters and gene bodies.
Gene keys (e.g., Entrez IDs) are stored as row names.
Gene symbols, exon numbers and internal groupings (for exons of genes with multiple genomic locations) are also stored as metadata.
If incoming is a GRanges object, a list will be returned with overlap, left and right elements.
Each element is a character vector of length equal to the number of ranges in incoming.
Each non-empty string records the gene symbol, the overlapped exons and the strand.
For left and right, the gap between the range and the annotated feature is also included.
For annotated features overlapping a region, the character string in the overlap output vector will be of the form GENE|EXONS|STRAND.
GENE is the gene symbol by default, but reverts to <XXX> if no symbol is defined for a gene with the Entrez ID XXX.
EXONS indicates the exon or range of exons that are overlapped.
STRAND is the strand on which the gene is encoded.
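As a minimal sketch of how such a string might be taken apart downstream, the following base R helper splits a single GENE|EXONS|STRAND entry into its fields. The name parse_overlap and the example strings are hypothetical, and the sketch assumes the string describes exactly one feature.

```r
# Hypothetical helper: split one "GENE|EXONS|STRAND" string into its fields.
# Assumption: the string describes a single feature.
parse_overlap <- function(x) {
    fields <- strsplit(x, "|", fixed = TRUE)[[1]]
    stopifnot(length(fields) == 3L)
    list(gene = fields[1], exons = fields[2], strand = fields[3])
}

parsed <- parse_overlap("GeneA|0-2|+")    # invented example string
unnamed <- parse_overlap("<12345>|I|-")   # gene without a symbol
```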
For annotated features flanking the region within a distance of dist, the character string in the left or right output vectors will have an additional [DIST] value.
This represents the gap between the edge of the region and the closest exon for that gene.
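The gap can be pulled out of a flanking entry with a regular expression. This is a sketch under an assumption: it treats the entry as GENE|EXONS|STRAND immediately followed by [DIST], which should be checked against real output. parse_flank and the example string are hypothetical.

```r
# Hypothetical helper: parse a single flanking entry.
# Assumption: the entry looks like "GENE|EXONS|STRAND[DIST]".
parse_flank <- function(x) {
    pattern <- "^(.*)\\|(.*)\\|([+*-])\\[([0-9]+)\\]$"
    m <- regmatches(x, regexec(pattern, x))[[1]]
    stopifnot(length(m) == 5L)  # full match plus four capture groups
    list(gene = m[2], exons = m[3], strand = m[4], dist = as.integer(m[5]))
}

flank <- parse_flank("GeneA|1|+[250]")  # invented example string
```

Converting DIST to an integer makes it straightforward to filter flanking genes by distance afterwards.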
Exons are numbered in order of increasing start or end position for genes on the forward or reverse strands, respectively.
Exon ranges in EXONS are reported as a comma-separated list where stretches of consecutive exons are summarized into a range.
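The summarization of consecutive exons can be sketched in base R as follows. collapse_exons is a hypothetical helper, not the function used internally, and the hyphen used to join the ends of a range is an assumption.

```r
# Hypothetical helper: summarize exon numbers into a comma-separated list,
# collapsing consecutive stretches into ranges.
# Assumption: ranges are written as "first-last".
collapse_exons <- function(exons) {
    stopifnot(length(exons) > 0L)
    exons <- sort(unique(exons))
    # Start a new run whenever the step from the previous exon is not 1.
    run_id <- cumsum(c(TRUE, diff(exons) != 1))
    runs <- split(exons, run_id)
    pieces <- vapply(runs, function(r) {
        if (length(r) == 1L) {
            as.character(r)
        } else {
            paste0(r[1], "-", r[length(r)])
        }
    }, character(1))
    paste(pieces, collapse = ",")
}

collapse_exons(c(1, 2, 3, 5, 7, 8))
```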
Promoters are defined around any annotated TSS in txdb, and are marked as exon 0.
Genes can have multiple TSS, but an overlap to multiple promoters will only be reported once.
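To make the strand dependence of the promoter definition concrete, here is a small sketch that derives a promoter interval from a TSS position. promoter_interval is a hypothetical helper; the exact boundary handling (for example, whether the TSS base itself counts towards the downstream distance) is an assumption and may differ from the package.

```r
# Hypothetical helper: promoter interval around a TSS.
# promoter[1] is the upstream distance, promoter[2] the downstream distance.
# Assumption: boundaries are simple offsets from the TSS coordinate.
promoter_interval <- function(tss, strand, promoter = c(3000, 1000)) {
    upstream <- promoter[1]
    downstream <- promoter[2]
    if (strand == "+") {
        c(start = tss - upstream, end = tss + downstream)
    } else {
        # On the reverse strand, upstream lies at higher coordinates.
        c(start = tss - downstream, end = tss + upstream)
    }
}

fwd <- promoter_interval(10000, "+")
rev <- promoter_interval(10000, "-")
```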
If the region overlaps an intron, it is labelled with I in EXONS.
Intronic overlaps are not reported if there is an exonic overlap.
Note that promoter and intronic annotations are only reported for the overlap vector to reduce redundancy in the output.
For example, it makes little sense to report that the region is both flanking and overlapping an intron.
Similarly, the value of DIST is more relevant when it is reported to the nearest exon rather than to an intron.
In cases where the distance is reported to the first exon, it can be used to refine the choice of promoter.
The max.intron value is necessary to deal with genes that have ambiguous locations on the genome.
If a gene has exons on different chromosomes, its location is uncertain and the gene is partitioned into two sets of exons for separate processing.
However, this is less obvious when the ambiguous locations belong to the same chromosome.
The max.intron value protects against the excessively large genes that would result from treating those locations as a single transcriptional unit.
Exons are partitioned into two (or more) internal groupings for further processing.
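One way to picture this partitioning is to sort the exons of a gene on one chromosome and start a new group whenever the gap to the previous exon exceeds max.intron. The sketch below is an illustration of that idea only, not the package's implementation; group_exons is a hypothetical helper, the gap definition is an assumption, and the coordinates are invented.

```r
# Hypothetical helper: assign exons on one chromosome to internal groupings.
# Assumption: a new group starts when the gap between the end of one exon
# and the start of the next (in sorted order) exceeds max.intron.
group_exons <- function(start, end, max.intron = 1e6) {
    o <- order(start)
    s <- start[o]
    e <- end[o]
    gap <- s[-1] - e[-length(e)]
    grp_sorted <- cumsum(c(TRUE, gap > max.intron))
    # Map the group labels back to the original exon order.
    grp <- integer(length(o))
    grp[o] <- grp_sorted
    grp
}

groups <- group_exons(start = c(100, 5e6, 500), end = c(200, 5e6 + 100, 600))
```

Here the first and third exons fall into one group, while the distant second exon forms its own group.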
The default settings for key.field and name.field will work for human and mouse genomes, but may not work for other organisms.
The key.field should refer to the key type in the OrgDb object, and also correspond to the GENEID of the TxDb object.
For example, in S. cerevisiae, key.field is set to "ORF" while name.field is set to "GENENAME".
If multiple entries are supplied in name.field, the value of GENE is defined as a semicolon-separated list of each of those entries.
Aaron Lun
library(org.Mm.eg.db)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)

current <- readRDS(system.file("exdata", "exrange.rds", package="csaw"))
output <- detailRanges(current, orgdb=org.Mm.eg.db,
    txdb=TxDb.Mmusculus.UCSC.mm10.knownGene)
head(output$overlap)
head(output$right)
head(output$left)

detailRanges(txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, orgdb=org.Mm.eg.db)

## Not run:
output <- detailRanges(current, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
    orgdb=org.Mm.eg.db, name.field=c("ENTREZID"))
head(output$overlap)

output <- detailRanges(current, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
    orgdb=org.Mm.eg.db, name.field=c("SYMBOL", "ENTREZID"))
head(output$overlap)
## End(Not run)