getDataAfterFS {BioMM}    R Documentation

Return the data after feature selection

Description

Get the new data set after performing feature selection on the input training and test data.

Usage

getDataAfterFS(trainData, testData, FSmethod, cutP = 0.1, fdr = NULL,
  FScore = MulticoreParam())

Arguments

trainData

The input training dataset. The first column is the label.

testData

The input test dataset. The first column is the label.

FSmethod

Feature selection method. Available options are NULL, 'positive', 'wilcox.test', 'cor.test', 'chisq.test', 'posWilcox', and 'top10pCor'. 'positive' selects the positively outcome-associated features using the Pearson correlation method. 'posWilcox' selects the positively outcome-associated features using the Pearson correlation method together with the 'wilcox.test' method. 'top10pCor' selects the top 10 outcome-associated features, which is useful when no features are picked by a more stringent feature selection procedure.

cutP

The cutoff used for p value thresholding. It can be any value between 0 and 1; commonly used cutoffs include 0.5, 0.1, 0.05, and 0.01. The default is 0.1.

fdr

Multiple testing correction method. Available options include NULL, 'fdr', 'BH', and 'holm'; see p.adjust for the full list of methods. The default is NULL.

FScore

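The parallel back-end used for some feature selection methods, given as a BiocParallel parameter object; the number of cores is set through it, e.g. MulticoreParam(workers = 2). The default is MulticoreParam().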
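The interplay between 'cutP' and 'fdr' can be sketched with base R alone. The snippet below is a minimal illustration on simulated p values, not the package's internal code: the vector 'pvals', the feature names, and the strict '<' comparison are all assumptions made for the example, as is reading fdr = NULL as "no correction".

```r
# Illustrative only: simulated p values, not output from BioMM.
set.seed(1)
pvals <- c(runif(95), runif(5, min = 0, max = 0.001))  # 100 hypothetical features
names(pvals) <- paste0("feature", seq_along(pvals))

cutP <- 0.1

# No correction (assumed to correspond to fdr = NULL): threshold raw p values.
keptRaw <- names(pvals)[pvals < cutP]

# Benjamini-Hochberg correction: threshold the adjusted p values instead.
padj <- p.adjust(pvals, method = "BH")
keptBH <- names(padj)[padj < cutP]

# Number of features kept without and with correction.
length(keptRaw)
length(keptBH)
```

Because adjusted p values are never smaller than the raw ones, the corrected selection is always a subset of the uncorrected selection at the same 'cutP'.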

Details

Parallel computing is helpful when the input data are high dimensional. For 'cutP', a lenient threshold such as 0.1 may be preferable to a more stringent p value cutoff, because features with small effect sizes are then retained for downstream analysis. However, high dimensional data (e.g. p > 10,000, where p is the number of features) may contain many false positive features, so a more rigorous p value threshold should be applied. 'chisq.test' is suggested for GWAS data because both the input and the output are binary or discrete.
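As a rough illustration of per-feature chi-squared filtering on discrete data, the following sketch runs one test per feature and keeps the features whose p value falls below 'cutP'. It uses simulated genotype-like data and base R only; the data, the column names, and the exact filtering rule are assumptions for the example and may differ from what getDataAfterFS does internally.

```r
# Illustrative only: simulated GWAS-like data, not the package's internal code.
set.seed(42)
n <- 200
nFeatures <- 50
label <- rbinom(n, size = 1, prob = 0.5)                    # binary outcome
geno <- matrix(sample(0:2, n * nFeatures, replace = TRUE),  # discrete genotypes
               nrow = n,
               dimnames = list(NULL, paste0("snp", seq_len(nFeatures))))
dat <- data.frame(label = label, geno)   # the first column is the label

cutP <- 0.1

# One chi-squared test of independence per feature column.
pvals <- vapply(dat[, -1, drop = FALSE], function(feature) {
  suppressWarnings(chisq.test(table(feature, dat$label))$p.value)
}, numeric(1))

# Keep the label plus every feature that passes the threshold.
selected <- names(pvals)[pvals < cutP]
datSub <- dat[, c("label", selected), drop = FALSE]
dim(datSub)
```

Since the simulated genotypes are independent of the label, every feature kept here is a false positive, which is the situation the stricter threshold for high dimensional data is meant to limit.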

Value

If a feature selection method is applied, the training data and the test data (if provided) are returned with a reduced number of features. If no feature is selected during the feature selection procedure, the output is NULL.
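Because the return value can be NULL, calling code may want to check it before indexing. The helper below is a hypothetical sketch and not part of BioMM; it assumes the two-element list layout used in the Examples and falls back to the unfiltered data, which is only one possible policy (rerunning with 'top10pCor' is another).

```r
# Hypothetical helper (not part of BioMM): return the selected data, or the
# original data when feature selection produced NULL.
selectedOrOriginal <- function(datalist, trainData, testData) {
  if (is.null(datalist)) {
    return(list(trainData, testData))
  }
  datalist
}

# Toy stand-ins for real data; the first column is the label.
toyTrain <- data.frame(label = c(0, 1), f1 = c(0.2, 0.8))
toyTest <- data.frame(label = c(1, 0), f1 = c(0.7, 0.1))

# A NULL result falls back to the unfiltered data.
fallback <- selectedOrOriginal(NULL, toyTrain, toyTest)
dim(fallback[[1]])
```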

Author(s)

Junfang Chen

Examples

 
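## Load data
methylfile <- system.file('extdata', 'methylData.rds', package='BioMM')
methylData <- readRDS(methylfile)
## Randomly assign 20 samples to the training set; the rest form the test set
trainIndex <- sample(nrow(methylData), 20)
trainData <- methylData[trainIndex,]
testData <- methylData[-trainIndex,]
## Feature selection
library(BiocParallel)
param <- MulticoreParam(workers = 2)
## With FSmethod=NULL no selection method is applied; pass one of the
## options listed under 'FSmethod' to reduce the number of features
datalist <- getDataAfterFS(trainData, testData, FSmethod=NULL,
                           cutP=0.1, fdr=NULL, FScore=param)
trainDataSub <- datalist[[1]]
testDataSub <- datalist[[2]]
## Compare the dimensions before and after feature selection
print(dim(trainData))
print(dim(trainDataSub))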

[Package BioMM version 1.0.0 Index]