BioMM {BioMM}    R Documentation

BioMM end-to-end prediction

Description

End-to-end prediction with the BioMM framework, using either supervised or unsupervised learning at stage-1, followed by supervised learning at stage-2.

Usage

BioMM(trainData, testData, stratify = c("gene", "pathway", "chromosome"),
  pathlistDB, featureAnno, restrictUp, restrictDown, minPathSize,
  supervisedStage1 = TRUE, typePCA, resample1 = "BS",
  resample2 = "CV", dataMode = "allTrain", repeatA1, repeatA2,
  repeatB1, repeatB2, nfolds, FSmethod1, FSmethod2, cutP1, cutP2, fdr1,
  fdr2, FScore = MulticoreParam(), classifier1, classifier2, predMode1,
  predMode2, paramlist1, paramlist2, innerCore = MulticoreParam(),
  outFileA2 = NULL, outFileB2 = NULL)

Arguments

trainData

The input training dataset. The first column is the label (the output variable). For binary classification, 0 and 1 indicate class membership.

testData

The input test dataset. The first column is the label (the output variable). For binary classification, 0 and 1 indicate class membership.

stratify

The stratification method. Valid options are c('gene', 'pathway', 'chromosome').

pathlistDB

A list of pathways, each identified by a pathway ID and containing its corresponding genes (as 'entrezID'). This is only used for pathway-based stratification (that is, when stratify is 'pathway').

featureAnno

The annotation data for probe mapping, stored in a data.frame. It must have at least two columns named 'ID' and 'entrezID'. If it is NULL, the input probes are assumed to come from transcriptomic data. (Default: NULL)

restrictUp

The upper bound on the number of probes or genes in each biologically stratified block.

restrictDown

The lower bound on the number of probes or genes in each biologically stratified block.

minPathSize

The minimum pathway size required after mapping your own data to the GO database. This is only used for pathway-based stratification (that is, when stratify is 'pathway').

supervisedStage1

A logical value. If TRUE, supervised learning models are applied at stage-1; if FALSE, unsupervised learning is applied.

typePCA

The type of PCA. Available options are 'regular' and 'sparse'.

resample1

The resampling method at stage-1. Valid options are 'CV' (cross validation) and 'BS' (bootstrapping). The default is 'BS'.

resample2

The resampling method at stage-2. Valid options are 'CV' (cross validation) and 'BS' (bootstrapping). The default is 'CV'.

dataMode

The mode of data used at stage-1: 'subTrain' or 'allTrain'. This is only applicable to bootstrapping resampling. (Default: 'allTrain')

repeatA1

The number of repeats N used during the resampling procedure. Repeated cross validation or multiple bootstrapping is performed if N >= 2. One can choose 10 repeats for 'CV' and 100 repeats for 'BS'.

repeatA2

The number of repeats N used during resampling prediction. The default is 1 for 'CV'.

repeatB1

The number of repeats N used for generating stage-2 test data prediction scores.

repeatB2

The number of repeats N used for test data prediction. The default is 1.

nfolds

The number of folds used for cross validation.

FSmethod1

The feature selection method at stage-1. Available options are NULL, 'positive', 'wilcox.test', 'cor.test', 'chisq.test', 'posWilcox', and 'top10pCor'.

FSmethod2

The feature selection method at stage-2. Available options are NULL, 'positive', 'wilcox.test', 'cor.test', 'chisq.test', 'posWilcox', and 'top10pCor'.

cutP1

The cutoff used for p value thresholding at stage-1. Commonly used cutoffs are c(0.5, 0.1, 0.05, 0.01, etc). The default is 0.05.

cutP2

The cutoff used for p value thresholding at stage-2.

fdr1

Multiple testing correction method at stage-1. Available options are c(NULL, 'fdr', 'BH', 'holm', etc). See also p.adjust. The default is NULL.

fdr2

Multiple testing correction method at stage-2. Available options are c(NULL, 'fdr', 'BH', 'holm', etc). See also p.adjust. The default is NULL.

FScore

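The parallel backend, given as a BiocParallel parameter object such as MulticoreParam(), which sets the number of cores used for feature selection.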

classifier1

Machine learning classifiers at stage-1.

classifier2

Machine learning classifiers at stage-2.

predMode1

The prediction mode at stage-1. Available options are c('probability', 'classification', 'regression').

predMode2

The prediction mode at stage-2. Available options are c('probability', 'classification', 'regression').

paramlist1

A list of model parameters at stage-1.

paramlist2

A list of model parameters at stage-2.

innerCore

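The parallel backend, given as a BiocParallel parameter object such as MulticoreParam(), which sets the number of cores used for computation.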

outFileA2

The file name of prediction metrics based on resampling with the '.csv' file extension. If it's provided, then the result will be saved. The default is NULL.

outFileB2

The file name of independent test prediction metrics with the '.csv' file extension. If it's provided, then the result will be saved. The default is NULL.
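
The input layout described for trainData, featureAnno and pathlistDB can be illustrated with a small base R sketch. All object names, probe IDs, Entrez IDs and pathway IDs below are hypothetical toy values chosen for illustration; real annotation data may need further columns depending on the stratification method.

```r
set.seed(1)

# Hypothetical probe IDs and a toy training set: label first, features after.
probes <- paste0("cg", sprintf("%04d", 1:6))
trainToy <- data.frame(label = rep(c(0, 1), each = 5),
                       matrix(rnorm(10 * 6), nrow = 10,
                              dimnames = list(NULL, probes)))

# Annotation with the two required columns, 'ID' and 'entrezID' (toy values).
annoToy <- data.frame(ID = probes,
                      entrezID = c("101", "101", "102", "103", "103", "104"),
                      stringsAsFactors = FALSE)

# Pathway list: names are pathway IDs, elements are Entrez gene IDs (toy values).
pathToy <- list("GO:0000001" = c("101", "102"),
                "GO:0000002" = c("103", "104"))

# Basic consistency checks one might run before calling BioMM().
stopifnot(all(trainToy$label %in% c(0, 1)),
          all(c("ID", "entrezID") %in% colnames(annoToy)),
          all(colnames(trainToy)[-1] %in% annoToy$ID))
```

Checking that every feature column has a matching 'ID' entry is a design choice of this sketch, not a documented requirement of the package.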

Details

Stage-2 training data can be learned using either bootstrapping or cross validation resampling in the supervised learning setting. Stage-2 test data is learned via independent test set prediction.
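
To make the difference between the two resampling schemes concrete, the following base R sketch generates cross validation folds and bootstrap samples for a toy sample size. It is a generic illustration of 'CV' and 'BS', not the internal BioMM implementation; the variable names and the use of out-of-bag samples are assumptions made for this example.

```r
set.seed(42)
n <- 20        # number of training samples (toy value)
nfolds <- 5    # plays the role of the 'nfolds' argument

# 'CV': assign every sample to exactly one fold; each fold is held out once.
foldId <- sample(rep(seq_len(nfolds), length.out = n))
cvHeldOut <- split(seq_len(n), foldId)

# 'BS': draw n samples with replacement; samples never drawn are out-of-bag.
bsIndex <- sample(seq_len(n), size = n, replace = TRUE)
outOfBag <- setdiff(seq_len(n), bsIndex)

# Repeating a scheme N times mirrors the role of the repeat* arguments.
bsRepeats <- replicate(3, sample(seq_len(n), size = n, replace = TRUE),
                       simplify = FALSE)
```

In cross validation every sample is predicted exactly once per repeat, whereas a bootstrap draw leaves a varying subset of samples out, which is one reason more repeats are usually suggested for 'BS' than for 'CV'.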

Value

The CV or BS prediction performance for the training data and test set prediction performance if testData is given.

References

Chen, J., & Schwarz, E. (2017). BioMM: Biologically-informed Multi-stage Machine learning for identification of epigenetic fingerprints. arXiv preprint arXiv:1712.00336.

Perlich, C., & Swirszcz, G. (2011). On cross-validation and stacking: Building seemingly predictive models on random data. ACM SIGKDD Explorations Newsletter, 12(2), 11-15.

See Also

BioMMreconData; BioMMstage1pca; BioMMstage2pred

Examples

 
## Load data    
methylfile <- system.file('extdata', 'methylData.rds', package='BioMM')  
methylData <- readRDS(methylfile)    
## Annotation files for Mapping CpGs into chromosome  
probeAnnoFile <- system.file('extdata', 'cpgAnno.rds', package='BioMM')  
probeAnno <- readRDS(file=probeAnnoFile)   
supervisedStage1 <- TRUE
classifier1 <- classifier2 <- 'randForest'
predMode1 <- predMode2 <- 'classification'
paramlist1 <- paramlist2 <- list(ntree=300, nthreads=30)
library(BiocParallel)
library(ranger)
param1 <- MulticoreParam(workers = 2)
param2 <- MulticoreParam(workers = 20)
## Not Run 
## result <- BioMM(trainData=methylData, testData=NULL,
##                 stratify='chromosome', pathlistDB, featureAnno=probeAnno, 
##                 restrictUp=200, restrictDown=10, minPathSize=10, 
##                 supervisedStage1, typePCA='regular', 
##                 resample1='BS', resample2='CV', dataMode='allTrain', 
##                 repeatA1=20, repeatA2=1, repeatB1=20, repeatB2=1, 
##                 nfolds=10, FSmethod1=NULL, FSmethod2=NULL, 
##                 cutP1=0.1, cutP2=0.1, fdr1=NULL, fdr2=NULL, FScore=param1, 
##                 classifier1, classifier2, predMode1, predMode2, 
##                 paramlist1, paramlist2, innerCore=param2,  
##                 outFileA2=NULL, outFileB2=NULL)
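
As a further illustration of how a p value cutoff (cutP1, cutP2) and a multiple testing correction (fdr1, fdr2) can interact during feature selection, here is a base R sketch on simulated data. It uses wilcox.test and p.adjust directly; how BioMM applies these steps internally, including whether the comparison is strict, is not shown here and is an assumption of this example.

```r
set.seed(7)
y <- rep(c(0, 1), each = 10)                      # toy binary labels
x <- matrix(rnorm(20 * 5), nrow = 20,
            dimnames = list(NULL, paste0("f", 1:5)))
x[y == 1, "f1"] <- x[y == 1, "f1"] + 5            # make one feature informative

# One Wilcoxon rank-sum p value per feature.
pval <- apply(x, 2, function(col) {
  wilcox.test(col[y == 1], col[y == 0])$p.value
})

# Optional multiple testing correction, then thresholding at the cutoff.
padj <- p.adjust(pval, method = "BH")
cutP <- 0.05
selected <- names(padj)[padj < cutP]
```

Setting the correction method to NULL in BioMM corresponds, in this sketch, to thresholding pval directly instead of padj.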

[Package BioMM version 1.0.0 Index]