selectFeatures {PAA} | R Documentation |
Performs a multivariate feature selection using frequency-based feature selection (based on RF-RFE, RJ-RFE or SVM-RFE) or ensemble feature selection (based on SVM-RFE).
selectFeatures(elist = NULL, n1 = NULL, n2 = NULL, label1 = "A", label2 = "B",
  log = NULL, cutoff = 10, selection.method = "rf.rfe",
  preselection.method = "mMs", subruns = 100, k = 10, subsamples = 10,
  bootstraps = 10, candidate.number = 300, above = 1500, between = 400,
  panel.selection.criterion = "accuracy", importance.measure = "MDA",
  ntree = 500, mtry = NULL, plot = FALSE, output.path = NULL, verbose = FALSE,
  method = "frequency")
elist |
EListRaw or EList object containing the input data. |
n1 |
integer indicating the sample number in group 1 (mandatory). |
n2 |
integer indicating the sample number in group 2 (mandatory). |
label1 |
class label of group 1 (default: "A"). |
label2 |
class label of group 2 (default: "B"). |
log |
indicates whether the data is in log scale (mandatory; note: if TRUE log2 scale is expected). |
cutoff |
integer indicating how many features will be selected (default: 10). |
selection.method |
string indicating the feature selection method (default: "rf.rfe", i.e., RF-RFE). |
preselection.method |
string indicating the feature preselection method: "mMs" (default), "tTest", "mrmr" or "none". |
subruns |
integer indicating the number of resampling repeats to be performed (default: 100). Has no effect when method="ensemble". |
k |
integer indicating the number of k-fold cross validation subsets (default: 10, i.e., 10-fold CV). |
subsamples |
integer indicating the number of subsamples for ensemble feature selection (default: 10). Has no effect when method="frequency". |
bootstraps |
integer indicating the number of bootstrap samples for ensemble feature selection (default: 10). Has no effect when method="frequency". |
candidate.number |
integer indicating how many features shall be preselected (default: 300). |
above |
mMs above parameter (integer; default: 1500). |
between |
mMs between parameter (integer; default: 400). |
panel.selection.criterion |
string indicating the panel selection criterion (default: "accuracy"). |
importance.measure |
string indicating the random forest importance measure (default: "MDA"). |
ntree |
random forest parameter ntree (default: 500). |
mtry |
random forest parameter mtry (default: NULL). |
plot |
logical indicating whether performance plots shall be plotted (default: FALSE). |
output.path |
string indicating the results output folder (optional). |
verbose |
logical indicating whether additional information shall be printed to the console (default: FALSE). |
method |
the feature selection method: "frequency" (default) for frequency-based or "ensemble" for ensemble feature selection. |
This function takes an EListRaw or EList object, group-specific sample numbers,
group labels and parameters choosing and configuring a multivariate feature
selection method (frequency-based or ensemble feature selection) to select a
panel of differential features. When an output path is defined (via
output.path), results will be saved on the hard disk, and when verbose is TRUE,
additional information will be printed to the console.
Frequency-based feature selection (method="frequency"): The whole data set is
split into k cross validation training and test set pairs. For each training
set a multivariate feature selection procedure is performed. The resulting k
feature subsets are tested using the corresponding test sets (via
classification). As a result, selectFeatures() returns the average k-fold cross
validation classification accuracy as well as the selected feature panel (i.e.,
the union set of the k particular feature subsets). The supported multivariate
feature selection methods are random forest recursive feature elimination
(RF-RFE), random jungle recursive feature elimination (RJ-RFE) and support
vector machine recursive feature elimination (SVM-RFE). To reduce running
times, univariate feature preselection can optionally be performed (controlled
via preselection.method). The supported univariate preselection methods are mMs
("mMs"), Student's t-test ("tTest") and mRMR ("mrmr"). Alternatively, no
preselection can be chosen ("none"). This approach is similar to the method
proposed in Baek et al.
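The frequency-based procedure described above can be sketched in base R. This is a simplified illustration on simulated data, not the PAA implementation: a t-statistic ranking stands in for RF-RFE, RJ-RFE or SVM-RFE, a nearest-centroid rule stands in for the classifier, and the helper names select_top() and predict_centroid() are hypothetical.

```r
# Sketch only: k-fold feature selection with a union panel, on simulated data.
set.seed(1)
n1 <- 20; n2 <- 20; p <- 50; k <- 5; cutoff <- 5
x <- matrix(rnorm((n1 + n2) * p), nrow = n1 + n2,
            dimnames = list(NULL, paste0("f", seq_len(p))))
y <- factor(rep(c("A", "B"), times = c(n1, n2)))
x[y == "B", 1:3] <- x[y == "B", 1:3] + 3   # three truly differential features

# Assign every sample to one of k folds.
folds <- sample(rep(1:k, length.out = n1 + n2))

# Stand-in selector (assumption): top `cutoff` features by absolute t-statistic.
select_top <- function(xtr, ytr, cutoff) {
  g <- levels(ytr)
  tstat <- vapply(colnames(xtr), function(f) {
    unname(t.test(xtr[ytr == g[1], f], xtr[ytr == g[2], f])$statistic)
  }, numeric(1))
  names(sort(abs(tstat), decreasing = TRUE))[seq_len(cutoff)]
}

# Stand-in classifier (assumption): nearest class centroid.
predict_centroid <- function(xtr, ytr, xte) {
  cents <- sapply(levels(ytr),
                  function(g) colMeans(xtr[ytr == g, , drop = FALSE]))
  d <- apply(xte, 1, function(v) colSums((cents - v)^2))
  factor(levels(ytr)[apply(d, 2, which.min)], levels = levels(ytr))
}

panels <- vector("list", k)
acc <- numeric(k)
for (i in 1:k) {
  tr <- folds != i
  feats <- select_top(x[tr, ], y[tr], cutoff)      # select on training set only
  pred <- predict_centroid(x[tr, feats, drop = FALSE], y[tr],
                           x[!tr, feats, drop = FALSE])
  panels[[i]] <- feats
  acc[i] <- mean(pred == y[!tr])                   # test-set accuracy of fold i
}

# Average accuracy and the union of the k fold-specific feature subsets.
result <- list(accuracy = mean(acc), features = Reduce(union, panels))
```

Because the panel is the union of the fold-specific subsets, it can contain more than cutoff features when the folds disagree.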
Ensemble feature selection (method="ensemble"): From the whole data set the
previously defined number of subsamples is drawn, defining pairs of training
and test sets. Moreover, for each training set a previously defined number of
bootstrap samples is drawn. Then, for each bootstrap sample SVM-RFE is
performed and a feature ranking is obtained. To obtain a final ranking for a
particular training set, all associated bootstrap rankings are aggregated into
a single ranking. To score the cutoff best features, for each subsample a
classification of the test set is performed (using an SVM trained with the
cutoff best features from the training set) and the classification accuracy is
determined. Finally, the stability of the subsample-specific panels is assessed
(via the Kuncheva index, Kuncheva LI, 2007), all subsample-specific rankings
are aggregated, the top n features (defined by cutoff) are selected, the
average classification accuracy is computed, and all these results are returned
in a list. This approach has been proposed in Abeel et al.
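The aggregation step in the ensemble procedure can be illustrated with a small base R sketch. The text above does not state which aggregation rule is used, so mean-rank aggregation is assumed here purely for illustration; aggregate_rankings() and the toy rank matrix are hypothetical.

```r
# Sketch only: combine several feature rankings into one consensus ranking.
# Assumption: rankings are aggregated by their mean rank (1 = best).
aggregate_rankings <- function(rank_matrix) {
  # rank_matrix: one row per feature, one column per bootstrap sample.
  mean_rank <- rowMeans(rank_matrix)
  names(sort(mean_rank))
}

# Toy rankings of four features in three bootstrap samples (filled by column).
ranks <- matrix(c(1, 2, 3, 4,
                  2, 1, 4, 3,
                  1, 3, 2, 4),
                nrow = 4,
                dimnames = list(paste0("f", 1:4), paste0("boot", 1:3)))

consensus <- aggregate_rankings(ranks)   # "f1" "f2" "f3" "f4"
cutoff <- 2
panel <- head(consensus, cutoff)         # "f1" "f2"
```

The same function could be applied a second time to the subsample-specific rankings to obtain the final ranking from which the top cutoff features are taken.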
If method is "frequency", the results list contains the following elements:
accuracy |
average k-fold cross validation accuracy. |
sensitivity |
average k-fold cross validation sensitivity. |
specificity |
average k-fold cross validation specificity. |
features |
selected feature panel. |
all.results |
complete cross validation results. |
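As a rough illustration of how the averaged performance values can arise, the following base R sketch computes accuracy, sensitivity and specificity per fold and averages them. Treating group 1 ("A") as the positive class is an assumption made for this example, and fold_metrics() and the toy predictions are hypothetical.

```r
# Sketch only: per-fold metrics and their averages.
# Assumption: the class passed as `positive` (here group 1, "A") defines sensitivity.
fold_metrics <- function(truth, pred, positive) {
  tp <- sum(truth == positive & pred == positive)
  tn <- sum(truth != positive & pred != positive)
  fp <- sum(truth != positive & pred == positive)
  fn <- sum(truth == positive & pred != positive)
  c(accuracy    = (tp + tn) / length(truth),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Toy true and predicted labels for two cross validation folds.
cv_folds <- list(
  list(truth = c("A", "A", "B", "B"), pred = c("A", "B", "B", "B")),
  list(truth = c("A", "A", "B", "B"), pred = c("A", "A", "A", "B"))
)

per_fold <- sapply(cv_folds,
                   function(f) fold_metrics(f$truth, f$pred, positive = "A"))
avg <- rowMeans(per_fold)   # all three averages equal 0.75 for this toy input
```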
If method is "ensemble", the results list contains the following elements:
accuracy |
average accuracy regarding all subsamples. |
sensitivity |
average sensitivity regarding all subsamples. |
specificity |
average specificity regarding all subsamples. |
features |
selected feature panel. |
all.results |
all feature ranking results. |
stability |
stability of the feature panel (i.e., Kuncheva index for the subsample-specific panels). |
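For orientation, the Kuncheva index of two equally sized panels A and B drawn from n features is (r * n - k^2) / (k * (n - k)), where k is the panel size and r the number of shared features. The base R sketch below averages this index over all pairs of panels; it is an illustration of the index as published by Kuncheva (2007), not the PAA code, and kuncheva_index() and panel_stability() are hypothetical names.

```r
# Sketch only: Kuncheva consistency index for two panels of equal size k
# selected from n features in total.
kuncheva_index <- function(a, b, n) {
  k <- length(a)
  stopifnot(length(b) == k, k > 0, k < n)
  r <- length(intersect(a, b))           # number of shared features
  (r * n - k^2) / (k * (n - k))
}

# Overall stability: mean index over all pairs of panels.
panel_stability <- function(panels, n) {
  pairs <- combn(length(panels), 2)
  mean(apply(pairs, 2, function(ix) {
    kuncheva_index(panels[[ix[1]]], panels[[ix[2]]], n)
  }))
}

# Toy example: three panels of size 3 selected from 10 features.
panels <- list(c("f1", "f2", "f3"),
               c("f1", "f2", "f4"),
               c("f1", "f2", "f3"))
stab <- panel_stability(panels, n = 10)  # 43/63, about 0.68
```

Identical panels yield an index of 1, and values near 0 indicate an overlap no larger than expected by chance.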
Michael Turewicz, michael.turewicz@rub.de
Baek S, Tsai CA, Chen JJ: Development of biomarker classifiers from high-dimensional data. Brief Bioinform. 2009 Sep;10(5):537-46.
Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y: Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics. 2010 Feb 1;26(3):392-8.
Kuncheva LI: A stability index for feature selection. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications. February 12-14, 2007. Pages: 390-395.
library(PAA)
# Load the example data set shipped with PAA.
cwd <- system.file(package="PAA")
load(paste(cwd, "/extdata/Alzheimer.RData", sep=""))
# Keep only features from blocks below 10.
elist <- elist[elist$genes$Block < 10,]
# Column names of the samples in the two groups.
c1 <- paste(rep("AD",20), 1:20, sep="")
c2 <- paste(rep("NDC",20), 1:20, sep="")
# Univariate preselection via t-test, then removal of the discarded features.
pre.sel.results <- preselect(elist=elist, columns1=c1, columns2=c2,
  label1="AD", label2="NDC", log=FALSE, discard.threshold=0.1,
  fold.thresh=1.9, discard.features=TRUE, method="tTest")
elist <- elist[-pre.sel.results$discard,]
# Ensemble feature selection with small settings.
selectFeatures.results <- selectFeatures(elist, n1=20, n2=20, label1="AD",
  label2="NDC", log=FALSE, subsamples=2, bootstraps=1, candidate.number=20,
  method="ensemble")