A common approach to analyzing gene expression profiles is to identify differentially expressed genes that are deemed interesting. The enrichment analysis demonstrated in the Disease enrichment analysis vignette was based on these differentially expressed genes. This approach finds genes where the difference is large, but it fails to detect situations where the difference is small yet evident in a coordinated way across a set of related genes. Gene Set Enrichment Analysis (GSEA) [1] directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per-gene statistics across genes within a gene set, making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. This matters because many relevant phenotypic differences are likely manifested by small but consistent changes in a set of genes.
Genes are ranked based on their phenotypes. Given an a priori defined set of genes S (e.g., genes sharing the same DO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (L) or primarily found at the top or bottom.
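The idea of testing whether the members of S cluster at the top or bottom of L can be illustrated with the weighted running-sum statistic described by Subramanian et al.: walk down the ranked list, increase the score when a gene belongs to S, decrease it otherwise, and take the largest deviation from zero as the enrichment score. The following is a minimal sketch in base R, not the code used inside DOSE or fgsea; the toy values and the names `ranked`, `gene_set`, `running_score` and `enrichment_score` are hypothetical.

```r
# Toy ranked list (hypothetical values): a named numeric vector in decreasing order.
ranked <- sort(c(g1 = 2.5, g2 = 1.8, g3 = 1.1, g4 = 0.4,
                 g5 = -0.3, g6 = -0.9, g7 = -1.6, g8 = -2.2),
               decreasing = TRUE)

# Hypothetical gene set S.
gene_set <- c("g1", "g2", "g4")

# Running-sum statistic: hits are weighted by |statistic|^exponent,
# misses are penalised uniformly. Assumes S has at least one hit and
# at least one gene lies outside S.
running_score <- function(ranked, gene_set, exponent = 1) {
  in_set     <- names(ranked) %in% gene_set
  hit_weight <- abs(ranked)^exponent * in_set
  p_hit      <- cumsum(hit_weight) / sum(hit_weight)
  p_miss     <- cumsum(!in_set) / sum(!in_set)
  p_hit - p_miss
}

# Enrichment score: the signed maximum deviation of the running sum from zero.
enrichment_score <- function(ranked, gene_set, exponent = 1) {
  rs <- running_score(ranked, gene_set, exponent)
  rs[which.max(abs(rs))]
}

rs <- running_score(ranked, gene_set)
es <- enrichment_score(ranked, gene_set)
print(round(rs, 3))
print(es)
```

In this toy case the set members sit near the top of the list, so the running sum peaks early and the score is positive; a set concentrated at the bottom would give a negative score.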
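There are three key elements of the GSEA method: calculation of an enrichment score (ES), estimation of the significance level of the ES, and adjustment for multiple hypothesis testing.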
We implemented the GSEA algorithm proposed by Subramanian et al. [1]. Alexey Sergushichev implemented an algorithm for fast GSEA analysis in the fgsea [2] package.
In DOSE [3], users can choose between the GSEA algorithm implemented in DOSE and the one in fgsea by specifying the parameter by="DOSE" or by="fgsea". By default, DOSE uses fgsea since it is much faster.
Leading edge analysis reports Tags to indicate the percentage of genes contributing to the enrichment score, List to indicate where in the list the enrichment score is attained, and Signal for enrichment signal strength.
It would also be very interesting to get the core enriched genes that contribute to the enrichment.
DOSE supports leading edge analysis and reports core enriched genes in GSEA analysis.
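To make the three leading edge statistics concrete, here is a minimal base R sketch that derives them from a running-sum enrichment score on a toy ranked list. It is an illustration under stated assumptions, not DOSE's internal code: the signal formula used here, tags × (1 − list) × N / (N − set size), follows the definition given in the GSEA documentation as best recalled, and the function name `leading_edge_stats` and all toy values are hypothetical.

```r
# Toy ranked list and gene set (hypothetical values).
ranked <- c(g1 = 2.5, g2 = 1.8, g3 = 1.1, g4 = 0.4,
            g5 = -0.3, g6 = -0.9, g7 = -1.6, g8 = -2.2)
gene_set <- c("g1", "g2", "g4")

leading_edge_stats <- function(ranked, gene_set, exponent = 1) {
  in_set  <- names(ranked) %in% gene_set
  n_total <- length(ranked)
  n_set   <- sum(in_set)

  # Weighted running sum; its signed maximum deviation from zero is the ES.
  hit_weight <- abs(ranked)^exponent * in_set
  rs <- cumsum(hit_weight) / sum(hit_weight) -
    cumsum(!in_set) / (n_total - n_set)
  peak <- unname(which.max(abs(rs)))
  es   <- unname(rs[peak])

  pos <- seq_len(n_total)
  if (es >= 0) {
    # Positive ES: core genes are the set members at or before the peak.
    edge      <- in_set & pos <= peak
    list_frac <- peak / n_total
  } else {
    # Negative ES: core genes are the set members after the peak.
    edge      <- in_set & pos > peak
    list_frac <- (n_total - peak) / n_total
  }

  tags <- sum(edge) / n_set
  # Assumed definition of signal strength (see lead-in).
  signal <- tags * (1 - list_frac) * n_total / (n_total - n_set)

  list(ES = es, core = names(ranked)[edge],
       tags = tags, list = list_frac, signal = signal)
}

stats <- leading_edge_stats(ranked, gene_set)
str(stats)
```

The `core` element plays the role of the core enriched genes: the subset of the gene set that accounts for the enrichment score.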
gseDO function
In the following example, to speed up the compilation of this document, only gene sets with size above 120 were tested and only 100 permutations were performed.
library(DOSE)
data(geneList)
y <- gseDO(geneList,
           nPerm         = 100,
           minGSSize     = 120,
           pvalueCutoff  = 0.2,
           pAdjustMethod = "BH",
           verbose       = FALSE)
head(y, 3)
## ID Description setSize enrichmentScore NES
## DOID:5679 DOID:5679 retinal disease 299 -0.3676313 -1.578615
## DOID:8398 DOID:8398 osteoarthritis 171 -0.3555851 -1.395431
## DOID:8466 DOID:8466 retinal degeneration 216 -0.3467372 -1.392964
## pvalue p.adjust qvalues rank
## DOID:5679 0.01265823 0.07977642 0.03904579 1768
## DOID:8398 0.01428571 0.07977642 0.03904579 1667
## DOID:8466 0.01428571 0.07977642 0.03904579 1465
## leading_edge
## DOID:5679 tags=24%, list=14%, signal=21%
## DOID:8398 tags=26%, list=13%, signal=23%
## DOID:8466 tags=21%, list=12%, signal=19%
## core_enrichment
## DOID:5679 2878/3791/23247/80184/6750/7450/596/9187/2034/482/948/1490/1280/5737/4314/4881/3426/187/629/6403/6785/2934/5176/7078/5950/727/10516/4311/2247/1295/358/10203/582/10218/57125/585/1675/6310/2202/4313/2944/4254/3075/2099/3480/4653/6387/1471/857/4016/1909/4053/6678/1296/4915/55812/1191/5654/10631/2697/2952/6935/2200/3479/2006/10451/9370/771/652/4693/5346/1524
## DOID:8398 3554/5590/2034/8840/1280/4314/8633/1235/573/219699/4322/1902/7048/10216/30008/1735/1277/5468/51314/9365/3952/11096/4313/2191/2099/6387/7079/388/2690/10418/5654/3551/2487/6863/4982/7177/7049/9370/1311/652/4148/2922/54829
## DOID:8466 948/5737/4314/4881/3426/629/6403/6785/2934/5176/7078/727/10516/1295/358/10203/582/10218/585/1675/2202/2944/3075/2099/4653/1471/857/4016/1909/4053/6678/1296/4915/1191/5654/2697/2952/6935/2200/2006/10451/771/652/1524
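The core_enrichment column stores the core enriched genes of each gene set as a single string of Entrez gene IDs separated by "/". A minimal base R sketch for turning such a string into a vector of IDs is shown below; the input is only the first four IDs of the DOID:8398 row above, truncated for brevity, and the variable names are hypothetical.

```r
# First four Entrez IDs of the DOID:8398 row shown above (truncated for brevity).
core_enrichment <- "3554/5590/2034/8840"

# Split on the literal "/" separator to obtain one ID per element.
core_genes <- strsplit(core_enrichment, "/", fixed = TRUE)[[1]]

print(core_genes)
print(length(core_genes))
```

With a real result, the same split could presumably be applied to the whole column, for example via y$core_enrichment, following the y$ID and y$Description access pattern used elsewhere in this document.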
gseNCG function
ncg <- gseNCG(geneList,
              nPerm         = 100,
              minGSSize     = 120,
              pvalueCutoff  = 0.2,
              pAdjustMethod = "BH",
              verbose       = FALSE)
ncg <- setReadable(ncg, 'org.Hs.eg.db')
head(ncg, 3)
## ID Description setSize enrichmentScore NES pvalue
## lung lung lung 173 -0.3880662 -1.600783 0.01136364
## breast breast breast 133 -0.4869070 -1.912514 0.01298701
## lymphoma lymphoma lymphoma 188 0.2999589 1.347263 0.07142857
## p.adjust qvalues rank leading_edge
## lung 0.03896104 0.02734108 2775 tags=31%, list=22%, signal=25%
## breast 0.03896104 0.02734108 2930 tags=33%, list=23%, signal=26%
## lymphoma 0.14285714 0.10025063 2087 tags=21%, list=17%, signal=18%
## core_enrichment
## lung SETD2/ATXN3L/LRP1B/BRD3/ARID1A/INHBA/RB1/ADCY1/LYRM9/NF1/CTNNB1/TP53/SATB2/STK11/CTIF/CTNNA3/KDR/COL11A1/FLT3/APC/ADGRL3/FGFR3/NCAM2/DIP2C/APLNR/SLIT2/EPHA3/RUNX1T1/ZMYND10/ZFHX4/GLI3/TNN/PLSCR4/DACH1/ERBB4
## breast KMT2A/ERBB3/SETD2/ARID1A/GPS2/NCOR1/RB1/MAP2K4/NF1/TP53/PIK3R1/STK11/CDKN1B/PTGFR/APC/CCND1/TRAF5/MAP3K1/ESR1/TBX3/FOXA1/GATA3
## lymphoma DUSP2/EZH2/PRDM1/MYC/ZWILCH/IKZF3/PLCG2/IDH2/HIST1H1C/MAGEC3/CD79B/ETV6/HIST1H1E/HIST1H1B/IRF8/CD28/SLC29A2/DUSP9/TNFAIP3/DNMT3A/SYK/TNF/BCR/HIST1H1D/DSC3/UBE2A/PABPC1
gseDGN function
dgn <- gseDGN(geneList,
              nPerm         = 100,
              minGSSize     = 120,
              pvalueCutoff  = 0.2,
              pAdjustMethod = "BH",
              verbose       = FALSE)
dgn <- setReadable(dgn, 'org.Hs.eg.db')
head(dgn, 3)
## ID Description setSize
## umls:C0338656 umls:C0338656 Impaired cognition 342
## umls:C0029456 umls:C0029456 Osteoporosis 375
## umls:C1272641 umls:C1272641 Systemic arterial pressure 318
## enrichmentScore NES pvalue p.adjust qvalues rank
## umls:C0338656 -0.3266625 -1.407659 0.01162791 0.1480263 0.115651 1997
## umls:C0029456 -0.3439046 -1.483566 0.01176471 0.1480263 0.115651 1766
## umls:C1272641 -0.3277208 -1.407614 0.01190476 0.1480263 0.115651 1758
## leading_edge
## umls:C0338656 tags=23%, list=16%, signal=20%
## umls:C0029456 tags=23%, list=14%, signal=20%
## umls:C1272641 tags=23%, list=14%, signal=20%
## core_enrichment
## umls:C0338656 NR3C1/CAPN3/SLC2A10/CREBBP/ZNF224/ITM2B/ELK3/CLN5/GAD1/BACE1/HGF/SERPINA3/MBL2/SST/EGR1/INSR/UTRN/ARL4D/PVALB/EEF1A2/DYM/CD36/RAB40AL/RBMS3/TREM2/PER3/OXTR/TSC1/CDR1/IGFALS/TPPP/SELP/NGF/BCHE/KCNS3/APBB2/TRPM4/RUNX1T1/MME/ABCB1/PPARG/MVP/NME8/SPG11/LPL/SLC26A4/FHL5/KL/LEP/FTO/NAIP/SORL1/ESR1/ABCC8/CST3/LAMA2/HHAT/LRP1/CLU/ALB/SPON1/NTS/HTRA1/GSTT1/GRIA2/MAGI2/IRS1/TAT/COL4A5/AASS/IGF1/ITPR1/BMP4/LRP2/MAPT/ERBB4/GRP
## umls:C0029456 HGF/PTH1R/CYP1A1/JAG1/ROR2/FLT3/CUL9/EEF1A2/THSD4/BCL2/ITGAV/WIF1/GREM2/COL15A1/HPGDS/VGLL3/SLIT3/NRIP1/TMEM135/MGP/PLCL1/OSBPL1A/PIBF1/SELP/SPRY1/MMP13/ID4/SPP2/COL1A2/AOX1/ARHGEF3/GSN/TSC22D3/ATP1B1/NR5A2/ANKH/COL1A1/LEPR/THSD7A/GC/FGF2/PPARG/NOX4/ZNF266/GHRH/BHLHE40/SLC19A2/THBD/FLNB/KL/LEP/HSD17B4/CTSK/FTO/MMP2/ESR1/IGF1R/PTN/IRAK3/HSPA1L/CST3/GHR/SPARC/KDM4B/LRP1/INPP4B/BMPR1B/PTHLH/DPT/FRZB/GSTT1/AR/TNFRSF11B/IRS1/WLS/GSTM3/TGFBR3/TPH1/IGF1/SFRP4/CORIN/BMP4/CHAD/FOXA1/PGR
## umls:C1272641 HGF/SERPINA3/SCNN1A/HECTD4/MBL2/CYP1A1/ENOX1/STK39/INSR/IGFBP7/TENM4/ITGAV/CRBN/FHIT/PRKG1/CD36/CTGF/GULP1/SLC24A3/NPY2R/MMP3/AHR/LONRF1/FBXL7/MEF2C/RPS6KA2/ID4/CDADC1/TGFBR2/VTCN1/CYP7A1/CITED2/SCNN1G/FAM155A/NEFL/JAM2/RGS17/PPARG/CDH13/SYNE1/GRK4/LPL/TRPS1/SEMA5A/KL/UCP1/HSD17B4/KCNMA1/PRLR/CDH11/SORL1/SGCD/NOV/ESR1/IGF1R/ZNF652/ACBD4/FGF18/FRY/ZNF385D/AGTR1/TNFRSF11B/IRS1/DCN/CARTPT/ADRA2A/TBX3/ADIPOQ/CORIN/ADH1B/NOVA1/LRP2
cnetplot(ncg, categorySize="pvalue", foldChange=geneList)
enrichMap(y, n=20)
gseaplot(y, geneSetID = y$ID[1], title=y$Description[1])
1. Subramanian, A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102, 15545–15550 (2005).
2. Sergushichev, A. An algorithm for fast preranked gene set enrichment analysis using cumulative statistic calculation. bioRxiv doi:10.1101/060012
3. Yu, G., Wang, L.-G., Yan, G.-R. & He, Q.-Y. DOSE: An R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics 31, 608–609 (2015).