designSampleSizeClassification {MSstats}    R Documentation

Estimate the optimal size of training data for a classification problem

Description

For a classification problem (such as diagnosis of disease), calculate by simulation the mean predictive accuracy under different sizes of training data for a future Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), or Data-Independent Acquisition (DIA or SWATH-MS) experiment.

Usage

designSampleSizeClassification(data, n_sample = 5, sample_incr = 20,
  protein_desc = 0.2, iter = 10)

Arguments

data

output from function dataProcess

n_sample

number of different sample sizes to simulate. Default is 5

sample_incr

number of samples per condition to increase at each step. Default is 20

protein_desc

the fraction of proteins to remove at each step. Proteins are ranked by their mean abundance across all samples. Default is 0.2. If protein_desc = 0.0, the number of proteins is not changed.

iter

number of times to repeat simulation experiments. Default is 10

Details

The function fits an intensity-based linear model to the input preliminary data (data) and uses the estimated variance components and mean abundances to simulate new training data with different sample sizes and numbers of proteins. A random forest model is fitted to each simulated training set and used to predict the class labels of the input preliminary data. This procedure is repeated iter times. The mean and variance of the predictive accuracy are reported for each size of training data.
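
The simulate, train, predict loop described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the MSstats implementation: the preliminary data are made up, a per-protein normal model stands in for the fitted linear model, a nearest-centroid classifier stands in for the random forest so that no extra package is needed, the number of proteins is held fixed, and the grid of sample sizes is assumed to be multiples of the increment. The helpers simulate_training and predict_nearest_centroid are hypothetical names.

```r
set.seed(1)

# Hypothetical preliminary data: 20 samples (rows) x 6 proteins (columns)
n_prot <- 6
group <- factor(rep(c("control", "disease"), each = 10))
shift <- c(1.5, 1.0, 0.5, 0, 0, 0)  # assumed group difference per protein
prelim <- matrix(rnorm(20 * n_prot, mean = 20), nrow = 20)
prelim[group == "disease", ] <-
  sweep(prelim[group == "disease", ], 2, shift, "+")

# Stand-in for the linear model: group means and pooled SD per protein
mu <- sapply(levels(group),
             function(g) colMeans(prelim[group == g, , drop = FALSE]))
resid <- prelim - t(mu)[as.integer(group), ]
sigma <- sqrt(colSums(resid^2) / (nrow(prelim) - nlevels(group)))

# Simulate a training set with n_per_group samples in each group
simulate_training <- function(n_per_group) {
  y <- factor(rep(colnames(mu), each = n_per_group), levels = colnames(mu))
  x <- t(sapply(as.integer(y),
                function(g) rnorm(n_prot, mean = mu[, g], sd = sigma)))
  list(x = x, y = y)
}

# Stand-in classifier (NOT a random forest): assign to the nearest centroid
predict_nearest_centroid <- function(train, newx) {
  centroids <- sapply(levels(train$y),
                      function(g) colMeans(train$x[train$y == g, , drop = FALSE]))
  d <- sapply(seq_len(ncol(centroids)),
              function(k) rowSums(sweep(newx, 2, centroids[, k])^2))
  factor(levels(train$y)[max.col(-d, ties.method = "first")],
         levels = levels(train$y))
}

# Assumed grid of training sizes per group, repeated iter times each
sample_sizes <- 5 * seq_len(3)
iter <- 10
acc <- sapply(sample_sizes, function(n) {
  replicate(iter, {
    train <- simulate_training(n)
    # Accuracy of predicting the preliminary data from simulated training data
    mean(predict_nearest_centroid(train, prelim) == group)
  })
})

# Summaries analogous to meanPA and varPA (vectors here, one entry per size)
meanPA <- colMeans(acc)
varPA <- apply(acc, 2, var)
names(meanPA) <- paste0("n", sample_sizes)
names(varPA) <- paste0("n", sample_sizes)
```

In this sketch acc has one row per iteration and one column per training size, so the column means and variances play the role of meanPA and varPA for a single, fixed number of proteins.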

Value

meanPA is the matrix of mean predictive accuracy under different sizes of training data.

varPA is the variance of predictive accuracy under different sizes of training data.

Author(s)

Ting Huang, Meena Choi, Olga Vitek.

Maintainer: Meena Choi (mnchoi67@gmail.com)

References

T. Huang et al. TBD 2018

Examples

# Consider the training set from a colorectal cancer study
# Subjects are from control group or colorectal cancer group
# 72 proteins were targeted with SRM
require(MSstatsBioData)
set.seed(1235)
data(SRM_crc_training)
QuantCRCSRM <- dataProcess(SRM_crc_training, normalization = FALSE)
# Estimate the mean predictive accuracy under different sizes of training data
# n_sample is the number of different sample sizes to simulate
# Datasets with 10 different sample sizes and 3 different protein numbers are simulated
result.crc.srm <- designSampleSizeClassification(data = QuantCRCSRM,
                                                 n_sample = 10,
                                                 sample_incr = 10,
                                                 protein_desc = 0.33,
                                                 iter = 50)
result.crc.srm$meanPA # mean predictive accuracy

[Package MSstats version 3.14.1 Index]