featureEvaluate {BioSeqClass} | R Documentation |
Feature sets produced by different feature coding schemes are used as input to classification models, and the performance of each model is reported in the result.
featureEvaluate(seq, classLable, fileName, ele.type, featureMethod, cv=10, classifyMethod="libsvm", group=c("aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", "aaE"), k, g, hydro.methods=c("kpm", "SARAH1"), hydro.indexs=c("hydroE", "hydroF", "hydroC"), aaindex.name, n, d, w=0.05, start.pos, stop.pos, psiblast.path, database.path, hmmpfam.path, pfam.path, Evalue=10^-5, na.type="all", na.strand="all", diprodb.method="all", diprodb.type="all", svm.kernel="linear", svm.scale=FALSE, svm.path, svm.options="-t 0", knn.k=1, nnet.size=2, nnet.rang=0.7, nnet.decay=0, nnet.maxit=100)
seq |
a string vector for the protein, DNA, or RNA sequences. |
classLable |
a factor or vector for the class label of sequences in seq. |
fileName |
a string for the output file name. |
ele.type |
a string for the type of biological sequence. This must be one of the strings "rnaBase", "dnaBase", "aminoacid" or "aminoacid2". |
featureMethod |
a string vector for the name of feature coding. The alternative names are "Binary", "CTD", "FragmentComposition", "GapPairComposition", "CKSAAP", "Hydro", "ACH", "AAindex", "ACI", "ACF", "PseudoAAComp", "PSSM", "DOMAIN", "BDNAVIDEO", and "DIPRODB". |
classifyMethod |
a string for the classification method. This must be one of the strings "libsvm", "svmlight", "NaiveBayes", "randomForest", "knn", "tree", "nnet", "rpart", "ctree", "ctreelibsvm", "bagging". |
cv |
an integer for the number of cross-validation folds, or the string "leave_one_out" for the jackknife test. |
group |
a string vector for the groups of amino acids. The alternative groups are: "aaH", "aaV", "aaZ", "aaP", "aaF", "aaS" or "aaE". |
k |
an integer indicating the length of sequence fragment (k>=1). |
g |
an integer indicating the distance between two aminoacids/bases (g>=0). |
hydro.methods |
a string vector for the methods of coding protein hydrophobic effect. The alternatives are: "kpm" or "SARAH1". |
hydro.indexs |
a string vector for the methods of coding protein hydrophobic effect. The alternatives are: "hydroE", "hydroF" or "hydroC". |
aaindex.name |
a string for the name of a physicochemical or biochemical property in AAindex. |
n |
an integer used as a parameter of
d |
an integer used as a parameter of
w |
a numeric value for the weight factor of sequence order effect in |
start.pos |
an integer vector denoting the start position of the fragment window. If it is missing, it is 1 by default. |
stop.pos |
an integer vector denoting the stop position of the fragment window. If it is missing, it is the length of the sequence by default. |
psiblast.path |
a string for the path of PSI-BLAST program blastpgp. blastpgp will be employed to iteratively search database and generate position-specific scores for each position in the alignment. |
database.path |
a string for the path of formatted protein database. Database can be formatted by formatdb program. |
hmmpfam.path |
a string for the path of the hmmpfam program in HMMER. hmmpfam will be employed to predict domains using models in the Pfam database. |
pfam.path |
a string for the path of pfam domain database. |
Evalue |
a numeric value for the E-value cutoff of predicted Pfam domains. |
na.type |
a string for nucleic acid type. It must be "DNA", "DNA/RNA", "RNA", or "all". |
na.strand |
a string for strand information. It must be "double", "single", or "all". |
diprodb.method |
a string for mode of property determination. It can be "experimental", "calculated", or "all". |
diprodb.type |
a string for property type. It can be "physicochemical", "conformational", "letter based", or "all". |
svm.kernel |
a string for kernel function of SVM. |
svm.scale |
a logical vector indicating the variables to be scaled. |
svm.path |
a string for the path to the SVMlight binaries (required if the path is unknown to the OS). |
svm.options |
Optional parameters to SVMlight. For further details see "How to use" on http://svmlight.joachims.org/ (e.g.: "-t 2 -g 0.1"). |
nnet.size |
number of units in the hidden layer. Can be zero if there are skip-layer units. |
nnet.rang |
Initial random weights on [-rang, rang]. Value about 0.5 unless the inputs are large, in which case it should be chosen so that rang * max(|x|) is about 1. |
nnet.decay |
parameter for weight decay. |
nnet.maxit |
maximum number of iterations. |
knn.k |
number of neighbours considered in function |
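The `k` and `g` arguments above can be pictured with a small base R sketch. It assumes that `k` is the length of contiguous fragments and that `g` counts the positions skipped between the two members of a pair (so `g = 0` means adjacent residues). The helpers `count_fragments` and `count_gap_pairs` are hypothetical names used only for illustration; they are not BioSeqClass functions and may not match how the package builds its features.

```r
# Hypothetical helpers for illustration only; not part of BioSeqClass.

# Count every contiguous fragment of length k in a single sequence.
count_fragments <- function(s, k) {
  stopifnot(k >= 1, nchar(s) >= k)
  starts <- seq_len(nchar(s) - k + 1)
  table(substring(s, starts, starts + k - 1))
}

# Count residue pairs separated by g intervening positions (assumed meaning of g).
count_gap_pairs <- function(s, g) {
  stopifnot(g >= 0, nchar(s) >= g + 2)
  chars <- strsplit(s, "")[[1]]
  n <- length(chars)
  first  <- chars[seq_len(n - g - 1)]
  second <- chars[seq(g + 2, n)]
  table(paste0(first, second))
}

frag_counts <- count_fragments("ACGTAC", k = 2)
pair_counts <- count_gap_pairs("ACGTAC", g = 1)
print(frag_counts)
print(pair_counts)
```

Larger `k` or a richer alphabet grows the number of possible fragments quickly, which is one reason grouped alphabets such as those selected by `group` can be useful.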
featureEvaluate can test feature coding methods for short peptide, protein, DNA, or RNA sequences.
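Each coding method is tested under cross validation, controlled by the `cv` argument. The sketch below shows one plausible way to assign samples to folds for an integer `cv` and for `"leave_one_out"`. The helper `make_folds` is a hypothetical name, and the package's own partitioning may differ (for example, it might stratify by class).

```r
# Hypothetical helper: assign each of n samples to a cross-validation fold.
make_folds <- function(n, cv) {
  if (identical(cv, "leave_one_out")) {
    return(seq_len(n))  # every sample forms its own fold
  }
  stopifnot(is.numeric(cv), cv >= 2, cv <= n)
  # Repeat fold ids 1..cv to length n, then shuffle the assignment.
  sample(rep(seq_len(cv), length.out = n))
}

set.seed(1)
folds_5   <- make_folds(20, cv = 5)
folds_loo <- make_folds(20, cv = "leave_one_out")
print(table(folds_5))
```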
It returns a list ranked by classification accuracy.
Each element in the list has three components: "data", "model", and "performance".
"data" is a data.frame object that stores the feature matrix; its last column
is the class label. "model" is a vector describing the feature coding method,
which contains 6 elements: "Feature_Function", "Feature_Parameter",
"Feature_Number", "Model", "Model_Parameter", and "Cross_Validataion".
"performance" is a vector for the performance of the classification model,
which contains 10 elements: "tp", "tn", "fp", "fn", "prcc", "sn", "sp", "acc",
"mcc", "pc".
Hong Li
## read positive/negative sequences from files.
tmpfile1 = file.path(path.package("BioSeqClass"), "example", "acetylation_K.pos40.pep")
tmpfile2 = file.path(path.package("BioSeqClass"), "example", "acetylation_K.neg40.pep")
posSeq = as.matrix(read.csv(tmpfile1,header=FALSE,sep="\t",row.names=1))[,1]
negSeq = as.matrix(read.csv(tmpfile2,header=FALSE,sep="\t",row.names=1))[,1]
seq=c(posSeq,negSeq)
classLable=c(rep("+1",length(posSeq)),rep("-1",length(negSeq)) )

if(interactive()){
  ## test various feature coding methods.
  ## it may be time consuming.
  fileName = tempfile()
  testFeatureSet = featureEvaluate(seq, classLable, fileName, ele.type="aminoacid",
    featureMethod=c("Binary", "CTD", "FragmentComposition", "GapPairComposition", "Hydro"),
    cv=5, classifyMethod="libsvm",
    group=c("aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", "aaE"),
    k=3, g=7,
    hydro.methods=c("kpm", "SARAH1"),
    hydro.indexs=c("hydroE", "hydroF", "hydroC") )
  summary = read.csv(fileName,sep="\t",header=TRUE)
  fix(summary)

  ## Evaluate features from different feature coding functions
  feature.index = 1:5
  tmp <- testFeatureSet[[feature.index[1]]]$data
  colnames(tmp) <- paste(testFeatureSet[[feature.index[1]]]$model["Feature_Function"],
    testFeatureSet[[feature.index[1]]]$model["Feature_Parameter"],
    colnames(tmp),sep=" ; ")
  ## drop the last column (class label) before combining feature matrices
  data <- tmp[,-ncol(tmp)]
  for(i in 2:length(feature.index) ){
    tmp <- testFeatureSet[[feature.index[i]]]$data
    colnames(tmp) <- paste(testFeatureSet[[feature.index[i]]]$model["Feature_Function"],
      testFeatureSet[[feature.index[i]]]$model["Feature_Parameter"],
      colnames(tmp),sep=" ; ")
    data <- data.frame(data, tmp[,-ncol(tmp)] )
  }
  name <- colnames(data)
  ## append the class label as the last column
  data <- data.frame(data, tmp[,ncol(tmp)] )

  ## feature forward selection by 'fsFFS'
  ## it is very time consuming.
  combineFeatureResult = fsFFS(data,stop.n=50,classifyMethod="knn",cv=5)
  tmp = sapply(combineFeatureResult,function(x){c(length(x$features),x$performance["acc"])})
  plot(tmp[1,],tmp[2,],xlab="featureNumber",ylab="Accuracy",main="result of FFS_KNN",pch=19)
  lines(tmp[1,],tmp[2,])

  ## compare the prediction accuracy based on different feature coding methods
  ## and different classification models.
  ## it is very time consuming.
  testResult = lapply(c("libsvm", "randomForest", "knn", "tree"), function(x){
    tmp = featureEvaluate(seq, classLable, fileName = tempfile(), ele.type="aminoacid",
      featureMethod=c("Binary", "CTD", "FragmentComposition", "GapPairComposition", "Hydro"),
      cv=5, classifyMethod=x,
      group=c("aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", "aaE"),
      k=3, g=7,
      hydro.methods=c("kpm", "SARAH1"),
      hydro.indexs=c("hydroE", "hydroF", "hydroC") );
    sapply(tmp,function(y){c(y$model[["Feature_Function"]], y$model[["Feature_Parameter"]],
      y$model[["Model"]], y$performance[["acc"]])})
  })
  tmpFeature = as.factor(c(sapply(testResult,function(x){
    apply(x[1:2,],2,function(y){paste(y,collapse="; ")})})))
  tmpModel = as.factor(c(sapply(testResult,function(x){x[3,]})))
  tmp1 = data.frame(as.integer(tmpFeature), as.integer(tmpModel),
    as.numeric(c(sapply(testResult,function(x){x[4,]}))) )
  require(scatterplot3d)
  s3d=scatterplot3d(tmp1,color=c("red","blue","green","yellow")[tmp1[,2]],pch=19,
    xlab="Feature Coding", ylab="Classification Model",
    zlab="Accuracy under 5-fold cross validation",lab=c(10,6,7),
    y.ticklabs=c("",as.character(sort(unique(tmpModel))),"") )
}
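For readers without the package installed, the documented result structure can be explored with a mock object. The sketch below builds a small list whose elements carry "model" and "performance" vectors with some of the documented names, then flattens it into a data frame sorted by accuracy. All values, including the strings stored under "Feature_Function", are made up for illustration and are not output from featureEvaluate.

```r
# Mock result mimicking the documented structure; every value is invented.
mock_result <- list(
  list(model = c(Feature_Function = "Binary", Model = "libsvm"),
       performance = c(acc = 0.70, mcc = 0.40)),
  list(model = c(Feature_Function = "CTD", Model = "libsvm"),
       performance = c(acc = 0.85, mcc = 0.70)),
  list(model = c(Feature_Function = "Hydro", Model = "libsvm"),
       performance = c(acc = 0.75, mcc = 0.50))
)

# Flatten each element into one row, then sort by decreasing accuracy.
summary_table <- do.call(rbind, lapply(mock_result, function(x) {
  data.frame(feature = x$model[["Feature_Function"]],
             model   = x$model[["Model"]],
             acc     = x$performance[["acc"]],
             mcc     = x$performance[["mcc"]],
             stringsAsFactors = FALSE)
}))
summary_table <- summary_table[order(-summary_table$acc), ]
print(summary_table)
```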