BioMM {BioMM} | R Documentation |
End-to-end prediction with the BioMM framework, using either supervised or unsupervised learning at stage-1, followed by supervised learning at stage-2.
BioMM(trainData, testData, stratify = c("gene", "pathway", "chromosome"), pathlistDB, featureAnno, restrictUp, restrictDown, minPathSize, supervisedStage1 = TRUE, typePCA, resample1 = "BS", resample2 = "CV", dataMode = "allTrain", repeatA1, repeatA2, repeatB1, repeatB2, nfolds, FSmethod1, FSmethod2, cutP1, cutP2, fdr1, fdr2, FScore = MulticoreParam(), classifier1, classifier2, predMode1, predMode2, paramlist1, paramlist2, innerCore = MulticoreParam(), outFileA2 = NULL, outFileB2 = NULL)
trainData |
The input training dataset. The first column is the label (the output). For binary classes, 0 and 1 indicate class membership. |
testData |
The input test dataset. The first column is the label (the output). For binary classes, 0 and 1 indicate class membership. |
stratify |
The stratification method. Valid options are c('gene', 'pathway', 'chromosome'). |
pathlistDB |
A list of pathways with pathway IDs and their corresponding genes ('entrezID' is used). This is only used for pathway-based stratification (stratify = 'pathway'). |
featureAnno |
The annotation data stored in a data.frame for probe mapping. It must have at least two columns named 'ID' and 'entrezID'. If it's NULL, then the input probe is from the transcriptomic data. (Default: NULL) |
restrictUp |
The upper bound on the number of probes or genes in each biologically stratified block. |
restrictDown |
The lower bound on the number of probes or genes in each biologically stratified block. |
minPathSize |
The minimum pathway size after mapping your own data to the GO database. This is only used for pathway-based stratification (stratify = 'pathway'). |
supervisedStage1 |
A logical value. If TRUE, then supervised learning models are applied; if FALSE, unsupervised learning. |
typePCA |
The type of PCA. Available options are c('regular', 'sparse'). |
resample1 |
The resampling method at stage-1. Valid options are 'CV' (cross validation) and 'BS' (bootstrapping resampling). The default is 'BS'. |
resample2 |
The resampling method at stage-2. Valid options are 'CV' (cross validation) and 'BS' (bootstrapping resampling). The default is 'CV'. |
dataMode |
The mode of data used at stage-1. 'subTrain' or 'allTrain'. This is only applicable for bootstrapping resampling. (Default: allTrain). |
repeatA1 |
The number of repeats N used during the resampling procedure. Repeated cross validation or multiple bootstrapping is performed if N >= 2. One can choose 10 repeats for 'CV' and 100 repeats for 'BS'. |
repeatA2 |
The number of repeats N used during resampling prediction. The default is 1 for 'CV'. |
repeatB1 |
The number of repeats N used for generating stage-2 test data prediction scores. |
repeatB2 |
The number of repeats N used for test data prediction. The default is 1. |
nfolds |
The number of folds used for cross validation. |
FSmethod1 |
Feature selection method at stage-1. Available options are c(NULL, 'positive', 'wilcox.test', 'cor.test', 'chisq.test', 'posWilcox', 'top10pCor'). |
FSmethod2 |
Feature selection method at stage-2. Available options are c(NULL, 'positive', 'wilcox.test', 'cor.test', 'chisq.test', 'posWilcox', 'top10pCor'). |
cutP1 |
The cutoff used for p value thresholding at stage-1. Commonly used cutoffs are c(0.5, 0.1, 0.05, 0.01, etc). The default is 0.05. |
cutP2 |
The cutoff used for p value thresholding at stage-2. |
fdr1 |
Multiple testing correction method at stage-1.
Available options are c(NULL, 'fdr', 'BH', 'holm', etc).
See also |
fdr2 |
Multiple testing correction method at stage-2.
Available options are c(NULL, 'fdr', 'BH', 'holm', etc).
See also |
FScore |
The number of cores used for feature selection, given as a BiocParallel parameter object (default: MulticoreParam()). |
classifier1 |
Machine learning classifiers at stage-1. |
classifier2 |
Machine learning classifiers at stage-2. |
predMode1 |
The prediction mode at stage-1. Available options are c('probability', 'classification', 'regression'). |
predMode2 |
The prediction mode at stage-2. Available options are c('probability', 'classification', 'regression'). |
paramlist1 |
A list of model parameters at stage-1. |
paramlist2 |
A list of model parameters at stage-2. |
innerCore |
The number of cores used for computation, given as a BiocParallel parameter object (default: MulticoreParam()). |
outFileA2 |
The file name of prediction metrics based on resampling with the '.csv' file extension. If it's provided, then the result will be saved. The default is NULL. |
outFileB2 |
The file name of independent test prediction metrics with the '.csv' file extension. If it's provided, then the result will be saved. The default is NULL. |
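As a rough illustration of how the input layout and the stage-1 feature selection arguments (FSmethod1, cutP1, fdr1) fit together, the sketch below builds a toy dataset with the label in the first column (coded 0/1), runs a Wilcoxon test per feature, optionally adjusts the p values, and keeps the features below the cutoff. It uses base R only and is not BioMM's internal code: the toy data and the helper selectFeatures are hypothetical, and the use of stats::p.adjust for the correction is an assumption made for this sketch.

```r
set.seed(1)
n <- 40
# Toy data in the documented layout: label first (0/1), then features.
toyData <- data.frame(label = rep(c(0, 1), each = n / 2),
                      matrix(rnorm(n * 5), nrow = n))
colnames(toyData)[-1] <- paste0("cg", 1:5)
# Make the first feature informative so that it should pass the filter.
toyData$cg1 <- toyData$cg1 + 3 * toyData$label

# Hypothetical helper (not part of BioMM): Wilcoxon test per feature,
# optional multiple testing correction, then thresholding at cutP.
selectFeatures <- function(data, cutP = 0.05, fdr = NULL) {
  y <- data[[1]]
  pvals <- vapply(data[-1], function(x) {
    wilcox.test(x[y == 1], x[y == 0])$p.value
  }, numeric(1))
  if (!is.null(fdr)) {
    # Assumption: correction names such as 'BH' are handled by p.adjust.
    pvals <- p.adjust(pvals, method = fdr)
  }
  names(pvals)[pvals < cutP]
}

selected <- selectFeatures(toyData, cutP = 0.05, fdr = "BH")
print(selected)
```

Setting fdr = NULL in this sketch thresholds the raw p values instead, which mirrors the NULL option listed for fdr1 and fdr2.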
Stage-2 training data can be learned using either bootstrapping or cross validation resampling in the supervised learning setting. Stage-2 test data is learned via independent test set prediction.
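The two resampling schemes named above can be pictured with a minimal base R sketch that only generates sample indices. It is not how BioMM implements resampling internally: the helper bootSplit, the toy sizes, and the choice to treat out-of-bag samples as the held-out set are assumptions made for illustration.

```r
set.seed(42)
n <- 20        # number of training samples (toy value)
nfolds <- 5    # number of cross validation folds (toy value)

# Cross validation: assign each sample to exactly one fold, then hold
# out one fold at a time.
foldId <- sample(rep(seq_len(nfolds), length.out = n))
cvSplits <- lapply(seq_len(nfolds), function(k) {
  list(train = which(foldId != k), test = which(foldId == k))
})

# Bootstrapping: draw n samples with replacement; samples that were
# never drawn (out-of-bag) act as the held-out set in this sketch.
bootSplit <- function(n) {
  inBag <- sample(n, replace = TRUE)
  list(train = inBag, test = setdiff(seq_len(n), inBag))
}
bsSplits <- replicate(3, bootSplit(n), simplify = FALSE)
```

In the cross validation case every sample is held out exactly once, whereas a bootstrap draw can repeat samples in the training part, which is why more repeats are usually suggested for 'BS' than for 'CV'.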
The CV or BS prediction performance for the training data, and the test set prediction performance if testData is given.
Chen, J., & Schwarz, E. (2017). BioMM: Biologically-informed Multi-stage Machine learning for identification of epigenetic fingerprints. arXiv preprint arXiv:1712.00336.
Perlich, C., & Swirszcz, G. (2011). On cross-validation and stacking: Building seemingly predictive models on random data. ACM SIGKDD Explorations Newsletter, 12(2), 11-15.
BioMMreconData; BioMMstage1pca; BioMMstage2pred
## Load data
methylfile <- system.file('extdata', 'methylData.rds', package='BioMM')
methylData <- readRDS(methylfile)
## Annotation file for mapping CpGs to chromosomes
probeAnnoFile <- system.file('extdata', 'cpgAnno.rds', package='BioMM')
probeAnno <- readRDS(file=probeAnnoFile)
supervisedStage1 <- TRUE
classifier1 <- classifier2 <- 'randForest'
predMode1 <- predMode2 <- 'classification'
paramlist1 <- paramlist2 <- list(ntree=300, nthreads=30)
library(BiocParallel)
library(ranger)
## Parallel settings: param1 for feature selection, param2 for model fitting
param1 <- MulticoreParam(workers = 2)
param2 <- MulticoreParam(workers = 20)
## Not Run
## result <- BioMM(trainData=methylData, testData=NULL,
##                 stratify='chromosome', pathlistDB, featureAnno=probeAnno,
##                 restrictUp=200, restrictDown=10, minPathSize=10,
##                 supervisedStage1, typePCA='regular',
##                 resample1='BS', resample2='CV', dataMode='allTrain',
##                 repeatA1=20, repeatA2=1, repeatB1=20, repeatB2=1,
##                 nfolds=10, FSmethod1=NULL, FSmethod2=NULL,
##                 cutP1=0.1, cutP2=0.1, fdr1=NULL, fdr2=NULL, FScore=param1,
##                 classifier1, classifier2, predMode1, predMode2,
##                 paramlist1, paramlist2, innerCore=param2,
##                 outFileA2=NULL, outFileB2=NULL)
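The restrictUp and restrictDown arguments bound the number of probes or genes in each stratified block. The sketch below shows one way such a size filter could look for chromosome-based stratification, using base R only. It is an illustration, not BioMM's implementation: the annotation column name chr, the toy probe IDs, and the helper stratifyBlocks are hypothetical.

```r
# Toy annotation: probe IDs with a hypothetical 'chr' column.
probeAnnoToy <- data.frame(
  ID  = paste0("cg", 1:12),
  chr = c(rep("chr1", 6), rep("chr2", 4), rep("chr3", 2)),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of BioMM): group probe IDs by block and
# keep only blocks whose size lies within [restrictDown, restrictUp].
stratifyBlocks <- function(anno, restrictUp, restrictDown) {
  blocks <- split(anno$ID, anno$chr)
  sizes <- lengths(blocks)
  blocks[sizes >= restrictDown & sizes <= restrictUp]
}

blocks <- stratifyBlocks(probeAnnoToy, restrictUp = 5, restrictDown = 3)
print(names(blocks))
```

With these toy bounds only chr2 (4 probes) is retained; chr1 exceeds the upper bound and chr3 falls below the lower bound.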